USEARCH v11

USEARCH algorithm

The USEARCH algorithm searches a database for one or more sequences ("targets") that have high identity to the query. USEARCH is used by the usearch_global and usearch_local commands, and as a subroutine by cluster_fast and cluster_smallmem. This algorithm is fundamentally different from the UBLAST algorithm, which is designed for low-identity local searches.

Recommended identity ranges
USEARCH is effective at identities of ~50% and above for proteins and ~75% and above for nucleotides.

Word counts and U-sorting
USEARCH exploits the fact that similar sequences tend to have several short words in common. These words have a fixed length k, and are sometimes called kmers. Unlike other programs that use kmer counting, USEARCH does not attempt to estimate sequence identity from the number of matching kmers. This is because identity correlates only approximately with the word count, so the count does not give an accurate estimate, especially at lower identities. Instead, USEARCH uses the word count to prioritize the database search. Target sequences are compared to the query in order of decreasing unique word count (U). This is the "U" in USEARCH. If a hit above the identity threshold exists in the database, it is likely to be found close to the start of the U-sorted list of targets (the "U vector").
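
The U-sorting idea can be illustrated with a minimal Python sketch. It is not the USEARCH implementation: it assumes that U is the number of distinct words shared by query and target, uses a toy word length of k=3, and the function names and sequences are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def u_sort(query, targets, k=3):
    """Return (name, U) pairs ordered by decreasing U.

    U is taken here to be the number of distinct words that the
    query and the target have in common (an assumption of this sketch).
    """
    query_words = kmers(query, k)
    counts = {name: len(query_words & kmers(seq, k))
              for name, seq in targets.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database.
targets = {
    "t1": "ACGTTGCATG",
    "t2": "TTTTTTTTTT",
    "t3": "ACGTTGCAAG",
    "t4": "ACGAAGGCTA",
}
u_vector = u_sort("ACGTTGCAAG", targets)
for name, u in u_vector:
    print(name, u)
```

The target identical to the query comes first and the unrelated target comes last, so a search that walks this list from the top reaches the most promising candidates first.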

Accepts and rejects
An "accept" is a target sequence whose identity with the query is above the threshold, so it can be considered a hit. A "reject" is a target sequence that is compared to the query but falls below the threshold. If targets are compared to the query in U-sorted order, then:

(i) the first hit that is found is likely to be the best hit that exists in the database, or close to it, and

(ii) the more rejects that have occurred without a hit, the less likely it is that a hit exists.

This means that the search can be terminated early if (i) a hit is found, or (ii) several rejects have occurred, with only a small loss in sensitivity. The termination options -maxaccepts and -maxrejects determine when the search is stopped. This technique can give a dramatic improvement in speed, which is a decisive advantage for the very large sequence datasets that are increasingly common in biology.
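
A rough sketch of early termination follows. It is an illustration under stated assumptions, not the USEARCH code: the `identity` function is a toy stand-in (fraction of matching positions, with no alignment), the parameters `maxaccepts` and `maxrejects` only mimic the role of the options named above, and the default values are chosen for the example rather than taken from USEARCH.

```python
def shared_words(a, b, k=3):
    """Number of distinct length-k words that a and b have in common."""
    def words(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(words(a) & words(b))


def identity(a, b):
    """Toy identity: fraction of matching positions, no alignment.

    A placeholder for a real alignment-based identity.
    """
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def search(query, targets, min_id, maxaccepts=1, maxrejects=2, k=3):
    """Compare targets in U-sorted order and stop early.

    Returns the accepted hits and the number of targets compared.
    """
    order = sorted(targets,
                   key=lambda name: shared_words(query, targets[name], k),
                   reverse=True)
    accepts, rejects, compared = [], 0, 0
    for name in order:
        compared += 1
        ident = identity(query, targets[name])
        if ident >= min_id:
            accepts.append((name, ident))
            if len(accepts) >= maxaccepts:
                break  # enough hits found
        else:
            rejects += 1
            if rejects >= maxrejects:
                break  # a hit is now unlikely to exist
    return accepts, compared


# Hypothetical toy database.
targets = {
    "t1": "ACGTTGCATG",
    "t2": "TTTTTTTTTT",
    "t4": "ACGAAGGCTA",
    "t5": "GGGGGGGGGG",
}
hits, compared = search("ACGTTGCAAG", targets, min_id=0.9)
print(hits, compared)
```

In this toy run the first target in U-sorted order is accepted, so only one of the four targets is compared. Raising `min_id` so that no target qualifies stops the search after two rejects instead of scanning the whole database.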

Word index
USEARCH uses an index on words in the database for rapid calculation of the U vector for a given query. Index options allow fine tuning of speed, sensitivity and memory use. The index can be stored in a UDB file, which can be advantageous when a large database will be searched repeatedly. Alternatively, the index can be constructed on the fly, for example when a database is provided in FASTA format. Indexes are also used by cluster_fast and cluster_smallmem, which use USEARCH as a subroutine to match query sequences to a database of centroids of existing clusters. With clustering, the index is necessarily constructed on the fly since the database is updated as new clusters are identified.
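
One common way to build such an index is an inverted index that maps each word to the targets containing it. The sketch below assumes that structure for illustration only; the actual USEARCH index and the UDB file format are not described here, and the class name `WordIndex` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class WordIndex:
    """Toy inverted index: word -> set of target names containing it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def _words(self, seq):
        """Distinct words of length k in seq."""
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def add(self, name, seq):
        """Index a target. Can be called at any time, as when a new
        cluster centroid is added during clustering."""
        for word in self._words(seq):
            self.postings[word].add(name)

    def u_counts(self, query):
        """Count, for each target, the distinct query words it contains.

        Only targets sharing at least one word with the query appear.
        """
        counts = Counter()
        for word in self._words(query):
            counts.update(self.postings.get(word, ()))
        return counts


index = WordIndex(k=3)
index.add("t1", "ACGTTGCATG")
index.add("t2", "TTTTTTTTTT")
index.add("t4", "ACGAAGGCTA")

# Targets in order of decreasing shared-word count.
print(index.u_counts("ACGTTGCAAG").most_common())

# The index is updated on the fly, then queried again.
index.add("t3", "ACGTTGCAAG")
print(index.u_counts("ACGTTGCAAG").most_common(1))
```

With this layout the cost of computing the counts depends on the number of words in the query and the lengths of their posting sets, not on a scan of every database sequence.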