Features

About USEARCH
USEARCH is a unique high-throughput sequence analysis tool. It is distributed as a single binary program that implements a suite of algorithms comparable to BLASTN, BLASTP, BLASTX, BLASTCLUST, CD-HIT, CD-HIT-EST, CD-HIT-2D, CD-HIT-EST-2D, CD-HIT-OTU, CD-HIT-454, ChimeraSlayer, Perseus, RAPsearch and more. It supports a rich set of sequence matching options, including E-values, identity, coverage (the fraction of the query or target sequence covered by the alignment) and maximum gap length. It also supports a range of output file formats, including FASTA, BLAST-like, user-defined tabbed text and a native format designed for clustering applications. Supported alignment styles include local (gapped and ungapped), as in BLAST, and global, which is most often used in clustering applications. User-settable parameters allow tuning of substitution scores, gap penalties and Karlin-Altschul statistics.

USEARCH is under rapid development; new features are added every few weeks. To stay informed, sign up for the mailing list.
 
Algorithms

See benchmark test results for performance comparisons.

Algorithm

Description

Comments

Database search

Fast, heuristic local search (like BLAST).
Fast, heuristic global search.
Smith-Waterman local search (like SSEARCH).
Needleman-Wunsch global search (like NEEDLE).

Supports fast "top-hit" and "top-few hits" modes as well as search of the entire database. Top-hit modes achieve search speeds that can be orders of magnitude faster than conventional methods like BLAST.

Clustering

Greedy representative sequence clustering (like CD-HIT) and/or centroid sequence construction (by consensus).

Supports any sort order, e.g. length sort for reducing redundancy, or quality / abundance sort for error correction / denoising. CD-HIT supports length sort only.
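The effect of sort order on greedy clustering can be illustrated with a minimal Python sketch. This is not USEARCH's implementation: the function names (identity, greedy_cluster), the position-by-position identity measure and the 0.75 threshold are all assumptions chosen for illustration. Each sequence is processed in sorted order and joins the first representative it matches; otherwise it starts a new cluster.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: matching positions divided by the longer length.
    Assumption: stands in for a real alignment-based identity."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longer


def greedy_cluster(seqs, threshold, sort_key=None):
    """Greedy representative clustering over sequences in sorted order."""
    order = sorted(seqs, key=sort_key) if sort_key else list(seqs)
    clusters = {}  # representative -> members (insertion order preserved)
    for s in order:
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)  # joins the first matching representative
                break
        else:
            clusters[s] = [s]  # no match: s becomes a new representative
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGTAC"]

# Length sort (longest first), as used for redundancy reduction.
by_length = greedy_cluster(reads, 0.75, sort_key=lambda s: -len(s))
print(by_length)
```

Passing a different sort_key (for example, one that orders by read abundance) changes which sequences become representatives, which is why the choice of sort order matters for denoising-style applications.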

Search+clustering

Single-pass method that (i) assigns matching sequences to a database and (ii) clusters sequences that don't match.

Typically used in OTU applications.

OTU clustering

The otupipe script uses USEARCH to construct OTUs from next-gen reads of 16S, ITS, etc. It performs error correction, amplicon and abundance estimation, chimera filtering and OTU clustering.

Preliminary results suggest that otupipe gives results comparable to or better than state-of-the-art methods. Otupipe has very low computational overhead. Comprehensive validation is not yet available.

Dereplication

Discard exact duplicates (full-length or substring).

Supports tracking of abundances (total number of reads in a given cluster) through multiple stages of a pipeline.
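As a rough illustration of abundance tracking, the Python sketch below collapses exact full-length duplicates and sums their counts, so that counts survive a second dereplication pass. The record layout (sequence, size) and the function name dereplicate are assumptions made for this example; substring dereplication is not covered.

```python
def dereplicate(records):
    """Collapse exact full-length duplicates, summing their abundances.

    records: iterable of (sequence, size) pairs. A raw read has size 1;
    output from an earlier stage carries its accumulated size.
    """
    totals = {}
    for seq, size in records:
        totals[seq] = totals.get(seq, 0) + size
    # Most abundant first; ties broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


raw = [(s, 1) for s in ["ACGT", "ACGT", "GGCC", "ACGT", "GGCC", "TTAA"]]
stage1 = dereplicate(raw)
print(stage1)  # [('ACGT', 3), ('GGCC', 2), ('TTAA', 1)]

# A later stage merges another (already dereplicated) sample; sizes add up.
other_sample = [("GGCC", 4), ("TTAA", 1)]
stage2 = dereplicate(stage1 + other_sample)
print(stage2)  # [('GGCC', 6), ('ACGT', 3), ('TTAA', 2)]
```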

Error correction / denoising

Reconstruct biological sequences from noisy reads.

Preliminary results suggest this method gives results comparable to much more expensive methods such as AmpliconNoise; comprehensive validation is not yet available.

Chimera filtering

Identifies chimeras using a reference database or de novo (UCHIME).

UCHIME is the fastest and most sensitive chimera detection algorithm currently available. It is better than ChimeraSlayer in reference database mode, and comparable to Perseus in de novo mode.

 
Sequence matching
USEARCH offers a rich set of options that determine whether a query sequence matches a database sequence.

Option

Description

Top-hit/
Top-few hits/
Entire database

The "top-hit" and "top-few hits" modes search the database in a priority order. The sort order is based on the number of short words in common between the query sequence and a database sequence, which correlates with sequence similarity. The strongest hits are therefore expected near the start of the list. Terminating the search when a good hit is found, or after a few good hits are found, achieves search speeds on large databases that can be orders of magnitude faster than conventional methods like BLAST.
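The priority-order idea can be sketched in a few lines of Python. This is a simplified illustration, not USEARCH's actual algorithm: the word length k=4, the function names (kmer_set, search_order, top_hit) and the toy acceptance test are assumptions. Targets are ranked by the number of distinct words shared with the query, and the search stops at the first target that passes the acceptance test.

```python
def kmer_set(seq, k):
    """All distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_order(query, database, k):
    """Target names sorted by decreasing number of words shared with the query."""
    q_words = kmer_set(query, k)
    shared = {name: len(q_words & kmer_set(seq, k)) for name, seq in database.items()}
    return sorted(database, key=lambda name: (-shared[name], name))


def top_hit(query, database, is_hit, k=4):
    """Try targets in priority order; stop at the first accepted hit.
    Returns (target name or None, number of targets examined)."""
    tried = 0
    for name in search_order(query, database, k):
        tried += 1
        if is_hit(query, database[name]):
            return name, tried
    return None, tried


def is_hit(q, t):
    """Toy acceptance test (assumption): at least 90% position-wise identity."""
    return sum(a == b for a, b in zip(q, t)) / max(len(q), len(t)) >= 0.9


database = {
    "t1": "GGGGCCCCAAAA",
    "t2": "ACGTTTTTTTGA",
    "t3": "ACGTACGTACGT",
}
query = "ACGTACGTACGA"

print(search_order(query, database, 4))  # ['t3', 't2', 't1']
print(top_hit(query, database, is_hit))  # ('t3', 1)
```

In this toy case the best target is examined first, so only one of the three candidates needs the more expensive comparison.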

E-value

Similar to BLAST. Computed using Karlin-Altschul statistics.

Identity

Sequence identity. USEARCH supports several different definitions of identity.

Coverage

The fraction of the query sequence or database sequence that is covered by the alignment. The query and database coverage can be set independently.
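For concreteness, coverage and one possible definition of identity can be computed from a pairwise alignment as in the Python sketch below. The function name alignment_stats and the choice of identity (matches divided by alignment columns) are assumptions for illustration; as noted above, USEARCH supports several definitions of identity, and this is not claimed to be any one of them.

```python
def alignment_stats(q_row, t_row, q_len, t_len):
    """Identity and coverage for one pairwise alignment.

    q_row, t_row: aligned rows of equal length, with '-' marking gaps.
    q_len, t_len: full lengths of the unaligned query and target.
    """
    if len(q_row) != len(t_row) or not q_row:
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = sum(a == b and a != "-" for a, b in zip(q_row, t_row))
    q_aligned = sum(c != "-" for c in q_row)  # query letters inside the alignment
    t_aligned = sum(c != "-" for c in t_row)  # target letters inside the alignment
    return {
        "identity": matches / len(q_row),
        "query_coverage": q_aligned / q_len,
        "target_coverage": t_aligned / t_len,
    }


# A local alignment covering 7 of 10 query letters and all 8 target letters.
stats = alignment_stats("ACGT-ACG", "ACGTTACC", q_len=10, t_len=8)
print(stats)
```

Setting separate thresholds on query_coverage and target_coverage corresponds to the independent query and database coverage settings described above.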

Gap length

A maximum gap length can be set independently for terminal gaps and internal gaps, and for the query sequence and database sequence. "Left-justified" and "right-justified" alignments (no terminal gap at the left or right end) can also be specified; this is useful, for example, when searching and clustering reads that start or end at a known primer.


Output file formats

Format

Description

FASTA

The ubiquitous FASTA format is supported for both input and output. Databases can be provided in FASTA format (convenient for ad hoc searches) or can be indexed (requires an extra step, but saves time and memory if a large database is to be used repeatedly). For output, matching alignments can be generated in FASTA format.

BLAST tabbed

BLAST -m 8 and -outfmt 6 formats (tabbed text).
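Tabbed output of this kind is straightforward to consume downstream. The Python sketch below parses the twelve standard BLAST -outfmt 6 columns; the function name parse_blast6 and the sample record are made up for illustration, and the sketch assumes every numeric field holds a number (files with placeholder values in those fields would need extra handling).

```python
import csv
import io

# The twelve default columns of BLAST -m 8 / -outfmt 6 output.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_blast6(handle):
    """Yield one dict per hit from tab-separated BLAST-style output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, row))
        for key in INT_FIELDS:
            rec[key] = int(rec[key])
        for key in FLOAT_FIELDS:
            rec[key] = float(rec[key])
        yield rec


# Hypothetical sample record, not real program output.
sample = "read1\tref7\t98.5\t200\t3\t0\t1\t200\t15\t214\t1e-100\t360\n"
hits = list(parse_blast6(io.StringIO(sample)))
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["evalue"])
```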

BLAST verbose

Human-readable BLAST-like format. Examples.

User-defined

User-defined tabbed format. In v4.2, 48 fields are supported, e.g. opens (number of gap opens), ql (full length of query sequence), pv (number of positive matches), rtgaps (number of terminal gaps at right end of alignment), and so on.

UCLUST

Native format designed for clustering applications.

 

Alignment parameters

Parameter

Description

Local/
Global/
Ungapped

Search and clustering can be done using local or global alignments. In the case of local alignments, gapped or ungapped alignments can be used.

Substitution scores

For proteins, the user can specify a substitution matrix. For nucleotides, match and mismatch scores can be set.

Gap penalties

A total of 12 gap penalties can be specified: penalties can be independently given for opens and extensions, for internal gaps, for left and right terminal gaps, and for the query sequence and database sequence.
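The count of 12 follows from three independent choices: 2 (open or extension) x 3 (internal, left terminal or right terminal) x 2 (query or database sequence). The Python sketch below enumerates them; the key names and the numeric penalty values are hypothetical and do not correspond to actual USEARCH option names or defaults.

```python
from itertools import product

KINDS = ("open", "extend")
POSITIONS = ("internal", "left_terminal", "right_terminal")
SEQUENCES = ("query", "target")


def uniform_gap_penalties(open_penalty, extend_penalty):
    """Build the full table with one open and one extension value everywhere."""
    return {
        (seq, pos, kind): (open_penalty if kind == "open" else extend_penalty)
        for seq, pos, kind in product(SEQUENCES, POSITIONS, KINDS)
    }


penalties = uniform_gap_penalties(open_penalty=10.0, extend_penalty=1.0)
print(len(penalties))  # 12

# Example of overriding individual entries: make left terminal gaps in the
# query cheaper than internal gaps (values are illustrative only).
penalties[("query", "left_terminal", "open")] = 1.0
penalties[("query", "left_terminal", "extend")] = 0.5
```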

Heuristic/
Full d.p.

By default, heuristic alignments are made using a BLAST-like algorithm. If desired, full dynamic programming (Smith-Waterman or Needleman-Wunsch) can be used, and heuristic parameters including X-drop and band width can be tuned for speed and accuracy.
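To make "full dynamic programming" concrete, here is a minimal Needleman-Wunsch global alignment score in Python. It is a textbook sketch under simplifying assumptions, not USEARCH's implementation: it uses a single linear gap penalty rather than separate open and extension penalties, applies no banding or X-drop, and the default scores are arbitrary.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (linear gap penalty).

    Keeps only the previous row of the DP matrix, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned to gaps only
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: a[:i] aligned to gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


print(nw_score("ACGT", "ACGT"))  # 4: four matches
print(nw_score("ACGT", "AGT"))   # 1: three matches and one gap
```

The full matrix costs time proportional to the product of the two sequence lengths, which is the motivation for heuristics such as restricting the computation to a band around the diagonal.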

E-value statistics

Karlin-Altschul K, lambda and effective database size parameters can be set by the user.
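For reference, the standard Karlin-Altschul relation between a raw alignment score and its E-value is given below. This is the general textbook form; the exact way USEARCH applies it (for example, how the effective database size is derived) is not specified here, so the symbols m and n should be read as assumptions about query length and effective database size respectively.

```latex
% S  : raw alignment score
% S' : bit score
% m  : query length, n : (effective) database size
% K, \lambda : Karlin-Altschul parameters
E = K \, m \, n \, e^{-\lambda S}
\qquad
S' = \frac{\lambda S - \ln K}{\ln 2}
\qquad
E = m \, n \, 2^{-S'}
```

The third expression follows from substituting the bit score into the first, which shows that E-values scale linearly with database size and fall exponentially with score.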