See also
UTAX algorithm
utax command
Taxonomy benchmark home
Taxonomy overclassification results
Why not use a large reference database like Greengenes or SILVA?
A fundamental challenge for microbial taxonomy prediction algorithms is the sparse coverage of sequence databases with authoritative classifications. For example, the total number of extant prokaryotic species has been estimated to be of the order of ten million (Curtis et al. 2002) to a billion (Dykhuizen, 2011), but at the time of writing the RDP 16S rRNA training database (RDP14) has 10,678 unique sequences covering 2,799 different taxa, of which 1,133 (40%) have only one representative sequence. (Larger reference databases such as Greengenes have predicted taxonomies which cover a similar number of named taxa, so these classifications cannot be considered authoritative.)
Optimistically, a single training sequence might be sufficient for a classifier to recognize novel members of a given taxon, but a large majority of taxa are surely missing. Assuming an order of magnitude fewer genera than species, there are one million to one hundred million genera, while RDP14 contains only 2,126 genera. By these estimates, if a species is picked at random, the probability that its genus is present in RDP14 is then between 0.2% and 0.002%.
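The arithmetic behind these percentages can be reproduced with a short sketch. The species estimates, the ten-to-one species-to-genus ratio and the RDP14 genus count are taken from the text; the function and variable names are illustrative, and the calculation additionally assumes that every genus is equally likely to contain the randomly picked species.

```python
# Figures quoted in the text.
RDP14_GENERA = 2126
SPECIES_ESTIMATES = {
    "ten million species": 10_000_000,
    "one billion species": 1_000_000_000,
}
# The text's own assumption: an order of magnitude fewer genera than species.
SPECIES_PER_GENUS = 10


def genus_coverage(n_species, n_reference_genera=RDP14_GENERA,
                   species_per_genus=SPECIES_PER_GENUS):
    """Probability that the genus of a randomly picked species is in the reference.

    Simplifying assumption (not from the text): all genera are equally
    likely to be hit, so the probability is just the fraction of genera covered.
    """
    n_genera = n_species / species_per_genus
    return n_reference_genera / n_genera


for label, n_species in SPECIES_ESTIMATES.items():
    print(f"{label}: {genus_coverage(n_species):.4%} of genera covered")
```

Under these assumptions the coverage is roughly 0.2% for the lower species estimate and 0.002% for the higher one, matching the range quoted above.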
To the best of my knowledge, the only published method for quantifying taxonomy prediction performance is the RDP Classifier "leave-one-out" strategy, in which one sequence (the query) is removed from the trusted set and the query is classified using the remaining sequences as a reference [4]. Accuracy at each rank is defined to be the fraction of queries for which the taxon at that rank is correctly identified. With RDP14 as the reference, the probability that the genus of a random query is still present after removing the query is 91% (far greater than the most optimistic estimate of 0.2% for a random species), with a mean of 4.2 remaining training sequences for its genus; the probability that the family is still present is 99.5%, with a mean of 27 training sequences. Leave-one-out thus models a highly unrealistic scenario where the reference database has several training examples for all ranks of most query sequences.
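As a rough illustration of the leave-one-out procedure and of the "taxon still present" statistic, here is a minimal sketch on toy data. It is not the RDP Classifier: `toy_classify` is a hypothetical stand-in that picks the reference sequence with the most matching positions, the records are invented, and the mean number of remaining sequences is averaged over all queries, which is an assumption about how that figure is computed.

```python
from collections import Counter


def leave_one_out_presence(labels):
    """Return (fraction of queries whose taxon survives removal,
    mean number of remaining sequences of the query's taxon)."""
    counts = Counter(labels)
    n = len(labels)
    # A taxon is still present after removing the query only if it had >= 2 sequences.
    present = sum(1 for taxon in labels if counts[taxon] > 1)
    mean_remaining = sum(counts[taxon] - 1 for taxon in labels) / n
    return present / n, mean_remaining


def leave_one_out_accuracy(records, classify):
    """records: list of (sequence, taxon) pairs at one rank.
    classify(query, reference) -> predicted taxon (any classifier)."""
    correct = 0
    for i, (query, taxon) in enumerate(records):
        reference = records[:i] + records[i + 1:]  # trusted set minus the query
        if classify(query, reference) == taxon:
            correct += 1
    return correct / len(records)


def toy_classify(query, reference):
    """Hypothetical classifier: taxon of the reference sequence sharing the
    most positions with the query (equal-length toy sequences only)."""
    def matches(seq):
        return sum(a == b for a, b in zip(query, seq))
    best_seq, best_taxon = max(reference, key=lambda rec: matches(rec[0]))
    return best_taxon


# Invented toy data: two taxa with two sequences each, one singleton taxon.
records = [("ACGT", "A"), ("ACGA", "A"), ("TTTT", "B"), ("TTTA", "B"), ("GGCC", "C")]
labels = [taxon for _, taxon in records]

presence, mean_remaining = leave_one_out_presence(labels)
accuracy = leave_one_out_accuracy(records, toy_classify)
print(f"taxon still present: {presence:.0%}, mean remaining: {mean_remaining:.1f}")
print(f"leave-one-out accuracy: {accuracy:.0%}")
```

The singleton taxon can never be classified correctly once its only sequence is removed, so measured accuracy depends heavily on how many taxa in the trusted set have more than one representative.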