About USEARCH
USEARCH is a unique
high-throughput sequence analysis tool. It is distributed as a single
binary program that implements a suite of
algorithms comparable to BLASTN, BLASTP, BLASTX, BLASTCLUST, CD-HIT,
CD-HIT-EST, CD-HIT-2D, CD-HIT-EST-2D, CD-HIT-OTU, CD-HIT-454,
ChimeraSlayer, Perseus, RAPsearch and more. It supports a rich set of sequence
matching options, including E-values, identity, coverage (fraction
of query or target sequence covered by the alignment) and maximum gap
length, and a range of output file
formats including FASTA, BLAST-like, user-defined tabbed text and a
native format designed for clustering applications. Supported alignment
styles include local (gapped and ungapped), like BLAST, and global,
which is most often used in clustering applications. User-settable
parameters allow tuning of substitution scores, gap penalties and
Karlin-Altschul statistics.
USEARCH is under rapid development; new features are added every few weeks. To
stay informed, sign up for the mailing list.
Algorithms
See benchmark test results for performance comparisons.
| Algorithm | Description | Comments |
|---|---|---|
| Database search | Fast, heuristic local search (like BLAST). Fast, heuristic global search. Smith-Waterman local search (like SSEARCH). Needleman-Wunsch global search (like NEEDLE). | Supports fast "top-hit" and "top-few hits" modes and search of the entire database. Top-hit modes achieve search speeds that can be orders of magnitude faster than conventional methods like BLAST. |
| Clustering | Greedy representative sequence clustering (like CD-HIT), and/or centroid sequence construction (by consensus). | Supports any sort order, e.g. length sort for reducing redundancy, or quality/abundance sort for error correction/denoising. CD-HIT supports length sort only. |
| Search+clustering | Single-pass method that (i) assigns matching sequences to a database and (ii) clusters sequences that don't match. | Typically used in OTU applications. |
| OTU clustering | The otupipe script uses USEARCH to construct OTUs from next-gen reads of 16S, ITS etc. Performs error correction, amplicon and abundance estimation, chimera filtering and OTU clustering. | Preliminary results suggest that otupipe gives results comparable to or better than state-of-the-art methods. Otupipe has very low computational overhead. Comprehensive validation is not yet available. |
| Dereplication | Discard exact duplicates (full-length or substring). | Supports tracking of abundances (total number of reads in a given cluster) through multiple stages of a pipeline. |
| Error correction/denoising | Reconstruct biological sequences from noisy reads. | Preliminary results suggest this method gives results comparable to much more expensive methods such as AmpliconNoise; comprehensive validation is not yet available. |
| Chimera filtering | Identifies chimeras using a reference database or de novo (UCHIME). | UCHIME is the fastest and most sensitive chimera detection algorithm currently available. It is better than ChimeraSlayer in reference database mode, and comparable to Perseus in de novo mode. |
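The greedy representative clustering described in the table can be sketched roughly as follows. This is a minimal illustration, not USEARCH's implementation: the function names, the `sort_key` parameter and the position-wise `identity` function are assumptions made for the example, and a real tool would compute identity from an alignment.

```python
from typing import Callable, Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Toy identity: matching positions divided by the longer length.

    Stand-in only (assumption); real tools derive identity from an alignment.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def greedy_cluster(
    seqs: Dict[str, str],
    threshold: float,
    sort_key: Callable[[Tuple[str, str]], float],
) -> Dict[str, List[str]]:
    """Greedy clustering: each sequence joins the first representative it
    matches at or above the threshold, otherwise it becomes a new one."""
    clusters: Dict[str, List[str]] = {}
    # Process sequences in descending order of the chosen key
    # (e.g. length to reduce redundancy, abundance for denoising).
    for label, seq in sorted(seqs.items(), key=sort_key, reverse=True):
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(label)
                break
        else:
            # No existing representative matched: start a new cluster.
            clusters[label] = [label]
    return clusters


if __name__ == "__main__":
    reads = {
        "a": "ACGTACGTAC",
        "b": "ACGTACGTAA",
        "c": "TTTTGGGGCC",
        "d": "ACGTACGT",
    }
    # Length sort: longer sequences become representatives first.
    print(greedy_cluster(reads, 0.75, lambda item: len(item[1])))
```

Passing a different `sort_key` (for example, one that looks up a read's abundance) changes which sequences become representatives without changing the clustering loop itself.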
Sequence matching
USEARCH offers a rich set of options that
determine whether a query sequence matches a database sequence.
| Option | Description |
|---|---|
| Top-hit / Top-few hits / Entire database | The "top-hit" and "top-few hits" modes search the database in a priority order. The sort order is based on the number of short words in common between the query sequence and a database sequence, which correlates with sequence similarity. The strongest hits are therefore expected near the start of the list. Terminating the search when a good hit is found, or after a few good hits are found, allows searches of large databases to run at speeds that can be orders of magnitude faster than conventional methods like BLAST. |
| E-value | Similar to BLAST. Computed using Karlin-Altschul statistics. |
| Identity | Sequence identity. USEARCH supports several different definitions of identity. |
| Coverage | The fraction of the query sequence or database sequence that is covered by the alignment. The query and database coverage can be set independently. |
| Gap length | A maximum gap length can be set independently for terminal gaps and internal gaps, and for the query sequence and database sequence. "Left-justified" and "right-justified" alignments (no terminal gap at the left or right end) can be specified; this is useful, e.g., for searching and clustering reads that start or end at a known primer. |
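The word-count priority order behind the "top-hit" and "top-few hits" modes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the word length `k`, the `max_hits` parameter and the toy position-wise identity are not USEARCH options, and the real program's word indexing and acceptance criteria are more elaborate.

```python
from typing import Dict, List, Set


def words(seq: str, k: int) -> Set[str]:
    """Return the set of distinct length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query: str, db: Dict[str, str], k: int) -> List[str]:
    """Order database labels by distinct words shared with the query, most first."""
    qwords = words(query, k)
    counts = {label: len(qwords & words(seq, k)) for label, seq in db.items()}
    return sorted(db, key=lambda label: counts[label], reverse=True)


def positional_identity(a: str, b: str) -> float:
    """Toy identity (assumption): matching positions over the longer length."""
    if not a or not b:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / max(len(a), len(b))


def search(query: str, db: Dict[str, str], k: int,
           min_identity: float, max_hits: int = 1) -> List[str]:
    """Test candidates in priority order and stop after max_hits accepted hits."""
    hits: List[str] = []
    for label in rank_by_shared_words(query, db, k):
        if positional_identity(query, db[label]) >= min_identity:
            hits.append(label)
            if len(hits) >= max_hits:
                break  # early termination is where the speed-up comes from
    return hits


if __name__ == "__main__":
    database = {
        "far": "GGGGGGGGGG",
        "near": "ACGTTGCTTT",
        "exact": "ACGTTGCAAG",
    }
    print(search("ACGTTGCAAG", database, k=3, min_identity=0.6, max_hits=1))
```

With `max_hits=1` the loop behaves like a top-hit search; a larger value gives a top-few-hits search, and a value at least as large as the database examines every sequence.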
Output file formats
| Format | Description |
|---|---|
| FASTA | The ubiquitous FASTA format is supported for both input and output. Databases can be provided in FASTA format (convenient for ad hoc searches) or can be indexed (requires an extra step, but saves time and memory if a large database is to be used repeatedly). For output, matching alignments can be generated in FASTA format. |
| BLAST tabbed | BLAST -m 8 and -outfmt 6 formats (tabbed text). |
| BLAST verbose | Human-readable BLAST-like format. Examples. |
| User-defined | User-defined tabbed format. In v4.2, 48 fields are supported, e.g. opens (number of gap opens), ql (full length of query sequence), pv (number of positive matches), rtgaps (number of terminal gaps at right end of alignment), and so on. |
| UCLUST | Native format designed for clustering applications. |
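As an illustration of working with the BLAST tabbed output, the sketch below parses one line of the standard 12-column BLAST tabular layout (the default columns of `-outfmt 6`). It assumes the file uses exactly those default columns in the default order; the function and constant names are hypothetical and chosen for this example only.

```python
from typing import Dict, Union

# Default column order of BLAST tabular output (-m 8 / -outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}

Value = Union[str, int, float]


def parse_hit(line: str) -> Dict[str, Value]:
    """Parse one tab-separated hit line into a dict keyed by column name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    hit: Dict[str, Value] = {}
    for name, raw in zip(FIELDS, values):
        if name in INT_FIELDS:
            hit[name] = int(raw)
        elif name in FLOAT_FIELDS:
            hit[name] = float(raw)
        else:
            hit[name] = raw  # sequence labels stay as strings
    return hit


if __name__ == "__main__":
    example = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-50\t350.0"
    print(parse_hit(example))
```

Output written in the user-defined tabbed format would need a different `FIELDS` list matching whichever fields were requested.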
Alignment parameters
| Parameter | Description |
|---|---|
| Local / Global / Ungapped | Search and clustering can be done using local or global alignments. In the case of local alignments, gapped or ungapped alignments can be used. |
| Substitution scores | For proteins, the user can specify a substitution matrix. For nucleotides, the user can specify match and mismatch scores. |
| Gap penalties | A total of 12 gap penalties can be specified: penalties can be given independently for opens and extensions (2), for internal gaps and left and right terminal gaps (3), and for the query sequence and database sequence (2), giving 2 × 3 × 2 = 12. |
| Heuristic / Full d.p. | By default, heuristic alignments are made using a BLAST-like algorithm. If desired, full dynamic programming (Smith-Waterman or Needleman-Wunsch) can be used, and heuristic parameters including X-drop and band width can be tuned for speed and accuracy. |
| E-value statistics | Karlin-Altschul K, lambda and effective database size parameters can be set by the user. |
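To show how the K, lambda and effective database size parameters enter the E-value, here is a minimal sketch of the standard Karlin-Altschul formula, E = K · m · n · exp(-lambda · S), where S is the raw alignment score, m the query length and n the effective database size. The function names are hypothetical, and the numeric values in the usage example are arbitrary placeholders, not USEARCH or BLAST defaults.

```python
import math


def evalue(score: float, k: float, lam: float,
           query_len: int, db_size: int) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_size * math.exp(-lam * score)


def bit_score(score: float, k: float, lam: float) -> float:
    """Normalized score in bits, so that E = m * n * 2 ** (-bits)."""
    return (lam * score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Placeholder parameters chosen only to exercise the formula.
    K, LAMBDA = 0.1, 1.0
    raw = 20.0
    print(evalue(raw, K, LAMBDA, query_len=100, db_size=1_000_000))
    print(bit_score(raw, K, LAMBDA))
```

Because the E-value scales linearly with the effective database size, setting that parameter explicitly makes E-values comparable between searches against databases of different sizes.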