Benchmarking USEARCH against BLAST
Benchmarking the USEARCH algorithm against other search methods is challenging because there are fundamental differences in design.
Note that the USEARCH binary also supports several other algorithms, such as UCLUST and UBLAST. Using one name for both the package and the algorithm causes unfortunate confusion in terminology; in hindsight, it was not a good decision to call the package "USEARCH".
Top hit(s) vs. all hits
The USEARCH algorithm (usearch_global and usearch_local commands) is designed
to find the top hit, or a few top hits, while most other search algorithms are
designed to find all hits that satisfy threshold criteria such as identity or
E-value. I would argue that USEARCH reflects what biologists usually want,
because in practice only the few best hits from traditional search algorithms
like BLAST are typically retained for downstream analysis.
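The difference in what gets reported can be sketched in a few lines of Python. This is only an illustration of the two reporting styles, not of how USEARCH or BLAST actually search; the hit records, the function names all_hits and top_hits, and the identity values are hypothetical.

```python
from collections import defaultdict

# Hypothetical hit records: (query, target, percent identity).
hits = [
    ("q1", "t1", 99.0),
    ("q1", "t2", 97.5),
    ("q1", "t3", 91.0),
    ("q2", "t4", 94.0),
    ("q2", "t5", 88.0),
]


def all_hits(hits, min_id):
    """'All hits' style: keep every hit at or above the identity threshold."""
    return [h for h in hits if h[2] >= min_id]


def top_hits(hits, n=1):
    """'Top hit(s)' style: keep only the n highest-identity hits per query."""
    by_query = defaultdict(list)
    for h in hits:
        by_query[h[0]].append(h)
    return {
        query: sorted(query_hits, key=lambda h: h[2], reverse=True)[:n]
        for query, query_hits in by_query.items()
    }


if __name__ == "__main__":
    print(all_hits(hits, 90.0))  # every hit with identity >= 90%
    print(top_hits(hits))        # the single best hit for each query
```

Post-filtering an "all hits" report down to the best hit per query, as top_hits does here, mirrors the downstream practice described above.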
Global vs. local hits
The most popular search command, usearch_global, uses global alignments, while
most search algorithms, such as BLAST, use local alignments. USEARCH also
supports local alignments (see the usearch_local command), but these are
rarely used in practice. Whether global or local alignments are more appropriate
depends on the context. For example, with single-gene databases, such as SSU
rRNA, or orthologs, global alignment is often better, and in these cases the
global hits generated by USEARCH often give better estimates of sequence
identity and more accurate assignments of taxonomy.
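To see why the two alignment types can report different identities for the same pair of sequences, here is a minimal sketch using textbook dynamic programming (Needleman-Wunsch for global, Smith-Waterman for local). The function name align_identity, the toy sequences, and the scoring parameters are assumptions chosen for illustration; they are not the scoring or the algorithms used by USEARCH or BLAST. Identity is taken here as matches divided by alignment columns, which is one of several definitions in use.

```python
def align_identity(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return (matches, columns) for one optimal global or local alignment.

    Uses a linear gap penalty; all scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if local:
                s = max(s, 0)  # a local alignment may start anywhere
                if s > best:
                    best, best_pos = s, (i, j)
            score[i][j] = s

    # Trace back from the end (global) or from the best cell (local).
    i, j = best_pos if local else (n, m)
    matches = columns = 0
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break  # start of the local alignment
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            if a[i - 1] == b[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return matches, columns


if __name__ == "__main__":
    core = "GATTACAGATTACA"
    # Toy sequences: identical core, divergent ends.
    seq_a = "CCC" + core + "GGG"
    seq_b = "AAA" + core + "TTT"
    for name, is_local in (("global", False), ("local", True)):
        matches, columns = align_identity(seq_a, seq_b, local=is_local)
        print(name, matches, columns, round(100.0 * matches / columns, 1))
```

With these toy sequences the local alignment covers only the shared core and reports it as a perfect match, while the global alignment also counts the divergent ends and so reports a lower identity over the full length.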
These essential differences mean that a rigorous comparison
of the USEARCH algorithm and BLAST is not really possible, and benchmarks should therefore not
be taken too seriously. In these pages, I have done my best to design tests that
give a realistic indication of relative performance on typical search
tasks.