Defining "accuracy" of taxonomy classifiers
I measure sensitivity and error rate by dividing a gold standard database into
several query and reference subsets, splitting nodes at different taxonomic
levels as described in Validating Taxonomy Classifiers.
Reference data is sparse
In practice, microbial reference databases with known taxonomy cover only a
small fraction of extant species. Incomplete reference data is therefore a
severe problem for taxonomy classifiers. A realistic validation of classifier
performance needs a balanced mix of "possible" and "impossible" cases at each
level.
Leave-one-out validation is unrealistic
The RDP "leave-one-out" validation approach is not realistic because a large
majority of taxa are "possible": there will usually be several examples of the
query sequence's genus in the training data, which is often not the case with
real data.
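To make this concrete: under leave-one-out, a query's genus is missing from the training data only when that query is the sole representative of its genus. The sketch below counts how often that happens. The function name and the toy labels are illustrative assumptions, not taken from the RDP software or its training set.

```python
from collections import Counter

def loo_possible_fraction(genus_labels):
    """Fraction of sequences whose genus is still present in the
    training data after that one sequence is left out."""
    counts = Counter(genus_labels)
    # A genus survives leave-one-out if it has at least two sequences.
    possible = sum(1 for g in genus_labels if counts[g] >= 2)
    return possible / len(genus_labels)

# Toy labels (made up): two well-sampled genera and two singletons.
labels = ["Streptomyces"] * 5 + ["Bacillus"] * 3 + ["Vibrio", "Nitrospira"]
fraction = loo_possible_fraction(labels)
print(fraction)
```

In this toy set, 8 of the 10 queries have a "possible" genus, so only the two singletons test how the classifier behaves when the correct answer is absent.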
Possible and impossible taxa
For a given query-reference pair, some taxa
will be present in both the query and reference ("possible" taxa), some will be
present only in the query set ("impossible" taxa), and some taxa will be present
only in the reference set ("decoy" taxa -- if these appear in a prediction they
are always false positives).
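These three categories are plain set operations on the taxon names of a query-reference pair. A minimal sketch, assuming the names at one rank are available as simple collections of strings; `partition_taxa` and the example genera are hypothetical:

```python
def partition_taxa(query_taxa, reference_taxa):
    """Split taxon names of one query-reference pair into the
    three categories defined above."""
    query, reference = set(query_taxa), set(reference_taxa)
    return {
        "possible": query & reference,    # present in both sets
        "impossible": query - reference,  # present only in the query set
        "decoy": reference - query,       # present only in the reference set
    }

# Made-up genus names for illustration.
parts = partition_taxa(
    ["Escherichia", "Bacillus", "Vibrio"],
    ["Escherichia", "Bacillus", "Streptomyces"],
)
print(parts)
```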
Defining sensitivity and error rates
I define sensitivity to be the fraction of "possible" taxa that are correctly
predicted by the classifier at a given value of a
confidence score, averaged over all query-reference pairs. I define the error
rate to be the fraction of predictions that are incorrect at that score (false
positives and false negatives), again averaged over all pairs. For classifiers
that do not report a confidence score, all predictions are included. See
taxonomy classification errors.
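The definitions can be sketched for a single query-reference pair as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the record layout `(true_name, predicted_name, score)` is hypothetical, a prediction counts only if its score reaches the cutoff, and the error rate takes the literal reading "incorrect predictions divided by predictions made", so a missed "possible" taxon lowers sensitivity but does not enter the error denominator. The final average here is unweighted.

```python
def pair_metrics(records, reference_taxa, cutoff=0.0):
    """Sensitivity and error rate for one query-reference pair.

    records: iterable of (true_name, predicted_name_or_None, score).
    Classifiers without a score can report any constant >= cutoff.
    """
    reference = set(reference_taxa)
    n_possible = n_correct = n_predicted = n_wrong = 0
    for true_name, predicted, score in records:
        made = predicted is not None and score >= cutoff
        if true_name in reference:          # a "possible" taxon
            n_possible += 1
            if made and predicted == true_name:
                n_correct += 1
        if made:
            n_predicted += 1
            if predicted != true_name:      # wrong name or impossible taxon
                n_wrong += 1
    sensitivity = n_correct / n_possible if n_possible else 0.0
    error_rate = n_wrong / n_predicted if n_predicted else 0.0
    return sensitivity, error_rate

def mean_metrics(pairs, cutoff=0.0):
    """Unweighted mean over (records, reference_taxa) pairs."""
    results = [pair_metrics(r, ref, cutoff) for r, ref in pairs]
    n = len(results)
    return (sum(s for s, _ in results) / n,
            sum(e for _, e in results) / n)

# Made-up example: two possible queries, two impossible ones.
reference = {"Escherichia", "Bacillus", "Streptomyces"}
records = [
    ("Escherichia", "Escherichia", 0.9),   # possible, correct
    ("Bacillus", "Streptomyces", 0.8),     # possible, wrong (decoy name)
    ("Vibrio", "Escherichia", 0.6),        # impossible, false positive
    ("Vibrio", None, 0.0),                 # impossible, no prediction made
]
sens_lo, err_lo = pair_metrics(records, reference, cutoff=0.5)
sens_hi, err_hi = pair_metrics(records, reference, cutoff=0.7)
```

Raising the cutoff from 0.5 to 0.7 drops the low-scoring false positive, so the error rate falls while sensitivity is unchanged in this toy case.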
Down-weighting over-represented taxa
Averages are weighted so that each query-reference pair contributes an equal
number of possible and impossible taxa, and each taxon name has the same
weight. This corrects for highly over-represented taxa such as the genus
Streptomyces, which accounts for 513 (5%) of the RDP training sequences.
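One way such a weighting could be realized is sketched below. It is only one interpretation of the description above, and the function name and normalization are assumptions: each taxon name receives the same total weight within its group (possible or impossible), that weight is split evenly over the name's query sequences, and each of the two groups sums to 0.5.

```python
from collections import Counter

def query_weights(true_names, reference_taxa):
    """Per-query weights for one query-reference pair.

    Possible and impossible groups each sum to 0.5, and every taxon
    name has equal total weight within its group. If one group is
    empty, the weights sum to 0.5 rather than 1.
    """
    reference = set(reference_taxa)
    counts = Counter(true_names)
    possible = {name for name in counts if name in reference}
    impossible = set(counts) - possible
    weights = []
    for name in true_names:
        group = possible if name in reference else impossible
        # 0.5 per group, shared by its names, then by each name's sequences.
        weights.append(0.5 / (len(group) * counts[name]))
    return weights

# Made-up example: one heavily sampled genus and two singletons.
names = ["Streptomyces"] * 4 + ["Bacillus", "Vibrio"]
w = query_weights(names, reference_taxa={"Streptomyces", "Bacillus"})
print(w)
```

Here the four Streptomyces queries together carry the same weight as the single Bacillus query, so an over-sampled genus no longer dominates the average.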