See also
Microbial taxonomy
SILVA and Greengenes "taxonomies"
(See Edgar 2018, "Taxonomy annotation and guide tree errors in 16S rRNA databases", for more details and references.)
The above paper shows that roughly one in ten RDP taxonomy annotations are wrong, and roughly one in five annotations in Greengenes and SILVA are wrong.
Methods
If an environmental sequence is annotated as belonging to a taxon that is defined by traits, then this is a prediction which can always be checked in principle, and sometimes can be checked in practice. For example, according to Bergey's Manual, Salmonella has peritrichous flagella, is non-spore-forming and Gram-negative, and has cell lengths from 2 to 5 micrometers and diameters between about 0.7 and 1.5 micrometers. If an environmental sequence is annotated as Salmonella, this is a prediction that these traits are present. The prediction can be checked if the sequence is later found in a cultured strain.
Rhodococcus has distinctly different traits, e.g. it is non-motile and Gram-positive. If SILVA annotates a sequence as Salmonella and Greengenes annotates the same sequence as Rhodococcus, then it is certain that at least one of the annotations is objectively wrong, though verification or falsification is not possible until the sequence is found in cells which can be examined for traits.
Similarly, if SILVA leaves the genus blank and Greengenes annotates the same sequence as Salmonella, then either SILVA is a false negative or Greengenes is a false positive, and again at least one of the databases is certainly wrong.
Many environmental sequences do not belong to named genera. For a given sequence, this can be verified experimentally by finding the sequence in an isolated strain having traits that do not match anything in the standard.
Suppose SILVA and Greengenes agree that the genus is blank but give different family names. If the standard defines both families by traits, then at least one of them is certainly wrong. However, in microbial taxonomy, some higher taxa are defined simply as groups of lower taxa. In these cases, the annotation can be regarded as a prediction that the sequence is found below the lowest common ancestor node for the group in the true phylogenetic tree. Such a prediction is objectively true or false, but cannot be verified with certainty by any foreseeable method because the true tree is unknown. Regardless of these difficulties, if SILVA and Greengenes disagree on the annotation of a group that is predicted to be monophyletic, then at least one of them must be wrong.
Results
I found 784,242 sequences that are identical in SILVA and Greengenes, of which 732,048 (93%) had taxonomy annotations from the common nomenclature in one or both databases.
An example of a conflict between a pair of names in the common nomenclature is Greengenes 4366627, which has the same sequence as SILVA LOSM01000005.1106908.1108461. Greengenes assigns this sequence to family Pseudoalteromonadaceae while SILVA assigns it to family Vibrionaceae.
Conflicts were found at all ranks: 7,804 at phylum, 23,082 at class, 110,308 at order, 45,712 at family and 62,584 at genus, counting each sequence only at its highest conflicted rank, for a total of 249,490 conflicts.
An example of a phylum conflict in a sequence classified to genus level by both databases is Greengenes 851746, which is assigned to genus Streptococcus in phylum Firmicutes and has the same sequence as SILVA JZIA01000003.3571786.3573313, which is assigned to genus Xanthomonas in phylum Proteobacteria.
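The counting rule and the two examples above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `highest_conflicted_rank`, the dictionary representation of a lineage, and the treatment of a missing rank as blank are all assumptions, and the sketch does not model the restriction to names in the common nomenclature.

```python
# Ranks ordered from highest to lowest, as listed in the text.
RANKS = ["phylum", "class", "order", "family", "genus"]


def highest_conflicted_rank(lineage_a, lineage_b):
    """Return the highest rank at which both lineages give a name and the
    names differ, or None if there is no such rank.

    Hypothetical helper: a lineage is a dict mapping rank -> name, and a
    missing or empty entry means the rank is blank in that database.
    """
    for rank in RANKS:
        name_a = lineage_a.get(rank)
        name_b = lineage_b.get(rank)
        if name_a and name_b and name_a != name_b:
            return rank
    return None


# Family conflict example (only the family names are given in the text).
gg_4366627 = {"family": "Pseudoalteromonadaceae"}
silva_losm = {"family": "Vibrionaceae"}
print(highest_conflicted_rank(gg_4366627, silva_losm))

# Phylum conflict example: the genus also differs, but the sequence is
# counted once, at phylum, because that is the highest conflicted rank.
gg_851746 = {"phylum": "Firmicutes", "genus": "Streptococcus"}
silva_jzia = {"phylum": "Proteobacteria", "genus": "Xanthomonas"}
print(highest_conflicted_rank(gg_851746, silva_jzia))

# Conflict counts per rank as reported in the text.
conflicts_by_rank = {
    "phylum": 7804,
    "class": 23082,
    "order": 110308,
    "family": 45712,
    "genus": 62584,
}
total_conflicts = sum(conflicts_by_rank.values())
print(total_conflicts)
```

Because each sequence is counted at a single rank, the per-rank counts can be added without double counting.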
All conflicts are necessarily due to annotation errors in one or both databases.
The lower bound on the sum of the number of errors in both databases is
obtained when all conflicts are due to an error in one database but not the
other. This follows because cases where both databases make the same error
(so there is no disagreement, but the annotations are nevertheless wrong),
or a given sequence has errors in both databases, can only add to this sum.
Thus, a lower bound for the sum of the annotation error rates of SILVA and
Greengenes is (number of identical sequences with common nomenclature
conflicts) / (number of identical sequences annotated by the common
nomenclature) = 249,490 / 732,048 = 34%.
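The bound can also be written out as a short derivation. The symbols are introduced here for illustration and do not appear in the original: N is the number of identical sequences annotated by the common nomenclature, C is the set of those sequences with a conflict, and E_S and E_G are the sets with an annotation error in SILVA and in Greengenes respectively, with error rates e_S and e_G.

```latex
% Every conflict implies an error in at least one database:
\[
C \subseteq E_S \cup E_G
\quad\Longrightarrow\quad
|C| \le |E_S \cup E_G| \le |E_S| + |E_G| .
\]
% Dividing by N gives the lower bound on the sum of the error rates:
\[
e_S + e_G = \frac{|E_S| + |E_G|}{N}
\;\ge\; \frac{|C|}{N}
= \frac{249{,}490}{732{,}048} \approx 0.34 .
\]
```

Equality holds only if every error produces a conflict and no sequence is wrong in both databases, which is the case described above.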
The numbers of conflicts between each of the SILVA and Greengenes trees and an independent type-strain-based reference (the RDP training set) are similar, and neither tree contains substantially more pure taxa according to the RDP training set. These results imply that the annotation error rates of the two databases are similar, as might be expected.
If the error rates are similar and the sum is close to its lower bound of 34%, then
the annotation error rates of Greengenes and SILVA are both approximately
17%. If one database is somewhat better (say, its error rate is 15%) then
the error rate of the other must be correspondingly higher (19%) so that the
sum remains at least 34%. It seems very unlikely that either database could
have an error rate as low as 10% as this would imply that the other has an
error rate of at least 24%, and a difference of this magnitude should be
noticeable in the numbers of pure taxa. Noting that the sum may be somewhat
higher than its lower bound due to systematic errors reproduced in both
databases, a reasonable rough estimate is that one in five sequences in
Greengenes and SILVA have incorrect taxonomy annotations.
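The trade-off between the two error rates can be checked with a small calculation. This is a minimal sketch under the assumption that the only constraint is the 34% lower bound on the sum; the function name `min_other_error_rate` is hypothetical.

```python
# Lower bound on the sum of the two annotation error rates, from the text.
SUM_LOWER_BOUND = 0.34


def min_other_error_rate(error_rate, sum_lower_bound=SUM_LOWER_BOUND):
    """Smallest error rate the other database can have, given that the two
    rates must sum to at least sum_lower_bound (never below zero)."""
    return max(0.0, sum_lower_bound - error_rate)


# Scenarios discussed in the text: equal rates, one database somewhat
# better, and one database much better.
for rate in (0.17, 0.15, 0.10):
    other = min_other_error_rate(rate)
    print(f"if one database has {rate:.0%}, the other has at least {other:.0%}")
```

Since the true sum may exceed the lower bound, these values are minimums for the second database, not point estimates.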