USEARCH performance

Accuracy measures for USEARCH benchmarks

Top-hit and top-k hit accuracy measures

USEARCH introduced a new paradigm for database search methods (Edgar, 2010). The key innovation is to use a fast filter to sort the database in a priority order that correlates with decreasing sequence similarity. Database sequences are compared to the query in priority order using a BLAST-like alignment algorithm for speed. By considering the database in priority order, the best possible matches in the database are likely to be among the first few sequences tested. Many applications require only the best hit or the best few hits, and in such cases the search can be terminated as soon as a given (user-settable) number of acceptable hits have been found. Criteria for accepting a hit including %id, E-value, fraction of the query and/or target sequence that is covered by the alignment, and more. Terminating the search after examining only a small fraction of the database can achieve much faster execution times than algorithms like BLAST that search for all possible hits and must therefore attempt to align all database sequences with every query.

Traditional sensitivity tests measure the fraction of all homologous matches in the database found by a given algorithm. In the case of USEARCH algorithms that use priority sorting, this type of measure is not appropriate as the algorithms are designed to find one or a few good hits, not all hits. I therefore use a "top-k hits" measure, as follows. For a given query, each hit above the E-value threshold is classified as a true positive (TP) or false positive (FP). The total number of TPs found is N_TP and the total number of FPs is N_FP.. If there are N queries, the sensitivity is S = N_TP/N and the error rate is E = N_FP/N. The sensitivity and error rates are then in the range S = 0 .. k and E = 0 .. k. In the special case k=1, this is known as the top-hit measure.

References
Edgar, R.C. (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19) 2460-61,doi: 10.1093/bioinformatics/btq461.