Nucleotide search benchmark (RFAM)

See here for RFAM results.

The RFAM benchmark is based on the RFAM database (Gardner et al., 2011) and was published in (Edgar, 2010). One thousand sequences were extracted from RFAM at random to use as a query set; the remaining sequences were used as the search database. The resulting database is a 41 MB FASTA file containing 191,445 sequences. A hit is considered a true positive if it belongs to the same RFAM family as the query, and a false positive otherwise. Some families may be distantly related to each other, so hits between them are counted as errors even though they may reflect real homology, and the reported error rates may therefore be exaggerated. This approach was chosen due to the lack of a large nucleotide database designed for homology detection. I believe it is reasonable for ranking algorithms, although the measured sensitivity and error rates may not be predictive of performance on other types of nucleotide sequence. Sensitivity is measured by considering the top hit or top few hits.
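The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the published benchmark. The function names, the (sequence id, family) record layout, and the toy data are all hypothetical. Defining sensitivity as the fraction of queries whose top hit is a true positive is one plausible reading of "considering the top hit", assumed here for concreteness.

```python
import random


def split_query_and_database(records, n_queries, seed=0):
    """Pick n_queries records at random as queries; the rest form the database.

    records: list of (sequence_id, family_id) tuples (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = set(rng.sample(range(len(records)), n_queries))
    queries = [rec for i, rec in enumerate(records) if i in chosen]
    database = [rec for i, rec in enumerate(records) if i not in chosen]
    return queries, database


def is_true_positive(query_family, hit_family):
    """A hit is a true positive if it is in the same family as the query."""
    return query_family == hit_family


def top_hit_sensitivity(top_hits, family_of):
    """Fraction of queries whose top hit is a true positive (assumed definition).

    top_hits: dict mapping query id -> id of its top hit, or None if no hit.
    family_of: dict mapping sequence id -> family id.
    """
    if not top_hits:
        return 0.0
    correct = sum(
        1
        for query_id, hit_id in top_hits.items()
        if hit_id is not None
        and is_true_positive(family_of[query_id], family_of[hit_id])
    )
    return correct / len(top_hits)


# Toy data standing in for the real database of sequences and families.
records = [
    ("seq1", "RF00001"),
    ("seq2", "RF00001"),
    ("seq3", "RF00002"),
    ("seq4", "RF00002"),
    ("seq5", "RF00005"),
]
family_of = dict(records)

queries, database = split_query_and_database(records, n_queries=2)

# Hypothetical search output: seq1 hits its own family, seq3 hits another
# family (false positive), and seq4 has no hit at all.
example_top_hits = {"seq1": "seq2", "seq3": "seq5", "seq4": None}
sensitivity = top_hit_sensitivity(example_top_hits, family_of)
```

For the benchmark described here, `n_queries` would be 1000 and `records` would hold every sequence in RFAM. Extending the measure to the top few hits would mean passing a list of hits per query and counting a query as correct if any of them is a true positive.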

Edgar, R.C. (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461, doi: 10.1093/bioinformatics/btq461.
Gardner, P.P., J. Daub, J. Tate, B.L. Moore, I.H. Osuch, S. Griffiths-Jones, R.D. Finn, E.P. Nawrocki, D.L. Kolbe, S.R. Eddy, A. Bateman (2011) Rfam: Wikipedia, clans and the "decimal" release, NAR, doi: 10.1093/nar/gkq1129.