Accuracy measures for USEARCH benchmarks |
Top-hit and top-k hit accuracy measures USEARCH introduced a new paradigm for database search methods (Edgar, 2010). The key innovation is to use a fast filter to sort the database in a priority order that correlates with decreasing sequence similarity. Database sequences are compared to the query in priority order using a BLAST-like alignment algorithm for speed. By considering the database in priority order, the best possible matches in the database are likely to be among the first few sequences tested. Many applications require only the best hit or the best few hits, and in such cases the search can be terminated as soon as a given (user-settable) number of acceptable hits have been found. Criteria for accepting a hit including %id, E-value, fraction of the query and/or target sequence that is covered by the alignment, and more. Terminating the search after examining only a small fraction of the database can achieve much faster execution times than algorithms like BLAST that search for all possible hits and must therefore attempt to align all database sequences with every query.
Traditional
sensitivity tests measure the fraction of all homologous matches in the
database found by a given algorithm. In the case of USEARCH algorithms
that use priority sorting, this type of measure is not appropriate as
the algorithms are designed to find one or a few good hits, not all
hits. I therefore use a "top-k hits" measure, as
follows. For a given query, each hit above the E-value threshold is
classified as a true positive (TP) or false positive (FP). The total
number of TPs found is N_TP and the total number of FPs is N_FP..
If there are N queries, the sensitivity is S = N_TP/N
and the error rate is E = N_FP/N. The sensitivity and
error rates are then in the range S = 0 .. k and E
= 0 .. k. In the special case k=1, this is known as the
top-hit measure. References |