Taxonomy errors in 16S databases

(See Edgar 2018, Taxonomy annotation and guide tree errors in 16S rRNA databases, for more details and references.)

The above paper shows that roughly one in ten RDP taxonomy annotations are wrong, and roughly one in five annotations in Greengenes and SILVA are wrong.

Methods
If an environmental sequence is annotated as belonging to a taxon which is defined by traits, then this is a prediction which can always be checked in principle, and sometimes can be checked in practice. For example, according to Bergey's Manual, Salmonella has peritrichous flagella, is non-spore-forming and Gram-negative, with cell lengths from 2 to 5 micrometers and diameters between about 0.7 and 1.5 micrometers. If an environmental sequence is annotated as Salmonella, it is a prediction that these traits are present. This prediction can be checked if the sequence is later found in a cultured strain.

Rhodococcus has distinctly different traits, e.g. it is non-motile and Gram-positive. If SILVA annotates a sequence as Salmonella and Greengenes annotates the same sequence as Rhodococcus, then it is certain that at least one of the annotations is objectively wrong, though verification or falsification is not possible until the sequence is found in cells which can be examined for traits.

Similarly, if SILVA leaves the genus blank and Greengenes annotates the same sequence as Salmonella, then either SILVA is a false negative or Greengenes is a false positive, and again at least one of the databases is certainly wrong.

Many environmental sequences do not belong to named genera. For a given sequence, this can be verified experimentally by finding the sequence in an isolated strain having traits that do not match anything in the standard.

Suppose SILVA and Greengenes agree that the genus is blank but give different family names. If the standard defines both families by traits, then at least one of them is certainly wrong. However, in microbial taxonomy, some higher taxa are defined simply as groups of lower taxa. In these cases, the annotation can be regarded as a prediction that the sequence is found below the lowest common ancestor node for the group in the true phylogenetic tree. Such a prediction is objectively true or false, but cannot be verified with certainty by any foreseeable method because the true tree is unknown. Regardless of these difficulties, if SILVA and Greengenes disagree on the annotation of a group that is predicted to be monophyletic, then at least one of them must be wrong.
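The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: the rank list, the dictionary representation of an annotation, and the function name highest_conflicted_rank are all assumptions made for the example. Following the argument in the text, a blank in one database against a name in the other is treated as a disagreement; whether the published counts handle every such case this way is not stated here.

```python
# Ranks from highest to lowest; this list is an assumption for the sketch.
RANKS = ("phylum", "class", "order", "family", "genus")


def highest_conflicted_rank(a, b):
    """Return the highest rank at which two annotations disagree, or None.

    Each annotation is a dict mapping rank name -> taxon name.
    A missing key, None or "" means the rank was left blank.
    A blank versus a name counts as a disagreement.
    """
    for rank in RANKS:
        name_a = a.get(rank) or None
        name_b = b.get(rank) or None
        if name_a != name_b:
            return rank
    return None


# Hypothetical records built only from names mentioned in the text.
greengenes = {"phylum": "Firmicutes", "genus": "Streptococcus"}
silva = {"phylum": "Proteobacteria", "genus": "Xanthomonas"}
print(highest_conflicted_rank(greengenes, silva))

# Blank genus in one database, named genus in the other.
print(highest_conflicted_rank({"genus": ""}, {"genus": "Salmonella"}))
```

Walking the ranks from phylum downward means each sequence is reported once, at its highest conflicted rank, which matches how conflicts are counted in the results.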

Results
I found 784,242 identical sequences in SILVA and Greengenes, of which 732,048 (93%) had taxonomy annotations from the common nomenclature in one or both databases.
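One way to find sequences shared by two databases is sketched below. It is a minimal illustration under stated assumptions, not the procedure used in the paper: the helper names (parse_fasta, normalize, identical_pairs) are hypothetical, "identical" is taken to mean an exact full-length match, and case and U versus T are normalized before comparison on the assumption that the two databases may use different alphabets.

```python
def parse_fasta(text):
    """Map label -> sequence for FASTA-formatted text."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the label.
            label = line[1:].split()[0]
            records[label] = []
        elif label is not None:
            records[label].append(line)
    return {k: "".join(v) for k, v in records.items()}


def normalize(seq):
    """Upper-case and map U to T so RNA and DNA spellings compare equal."""
    return seq.upper().replace("U", "T")


def identical_pairs(db_a, db_b):
    """Return (label_a, label_b) pairs whose normalized sequences match."""
    by_seq = {}
    for label, seq in db_a.items():
        by_seq.setdefault(normalize(seq), []).append(label)
    pairs = []
    for label_b, seq in db_b.items():
        for label_a in by_seq.get(normalize(seq), []):
            pairs.append((label_a, label_b))
    return pairs


# Toy data, not real database records.
toy_a = parse_fasta(">a1\nACGU\nACGU\n>a2\nGGGG\n")
toy_b = parse_fasta(">b1 some description\nacgtacgt\n>b2\nTTTT\n")
print(identical_pairs(toy_a, toy_b))
```

Indexing one database by sequence keeps the comparison linear in the total number of records rather than quadratic.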

An example of a conflict between a pair of names in the common nomenclature is Greengenes 4366627, which has the same sequence as SILVA LOSM01000005.1106908.1108461. Greengenes assigns this sequence to family Pseudoalteromonadaceae while SILVA assigns it to family Vibrionaceae.

Conflicts were found at all ranks: 7,804 at phylum, 23,082 at class, 110,308 at order, 45,712 at family and 62,584 at genus, counting each sequence only at its highest conflicted rank.

An example of a phylum conflict in a sequence classified to genus level by both databases is Greengenes 851746, which is assigned to genus Streptococcus in phylum Firmicutes and has the same sequence as SILVA JZIA01000003.3571786.3573313, which is assigned to genus Xanthomonas in phylum Proteobacteria.

All conflicts are necessarily due to annotation errors in one or both databases. The lower bound on the total number of errors in the two databases is obtained when every conflict is due to an error in one database but not the other. This follows because cases where both databases make the same error (so there is no disagreement, but the annotations are nevertheless wrong), or where a given sequence has errors in both databases, can only add to this total. Thus, a lower bound for the sum of the annotation error rates of SILVA and Greengenes is (number of identical sequences with common nomenclature conflicts) / (number of identical sequences annotated by the common nomenclature) = 249,490 / 732,048 = 34%, where 249,490 is the sum of the per-rank conflict counts given above.
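The arithmetic behind this bound can be checked directly. The snippet below is just a worked calculation using the figures quoted in the text; the variable names are chosen for the example.

```python
# Conflicts counted at the highest conflicted rank, as quoted in the text.
conflicts_by_rank = {
    "phylum": 7_804,
    "class": 23_082,
    "order": 110_308,
    "family": 45_712,
    "genus": 62_584,
}

# Identical sequences annotated by the common nomenclature.
annotated_pairs = 732_048

total_conflicts = sum(conflicts_by_rank.values())

# Each conflict implies at least one error somewhere, so this ratio is a
# lower bound on the SUM of the two databases' error rates.
lower_bound = total_conflicts / annotated_pairs

print(total_conflicts)        # 249490
print(f"{lower_bound:.1%}")   # 34.1%
```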

The number of conflicts between the SILVA and Greengenes trees and an independent type-strain-based reference (the RDP training set) is similar, and neither tree contains substantially more pure taxa according to the RDP training set. These results imply that the annotation error rates of the two databases are similar, as might be expected.

If the error rates are similar and the sum is close to its lower bound of 34%, then the annotation error rates of Greengenes and SILVA are both approximately 17%. If one database is somewhat better (say, its error rate is 15%) then the error rate of the other must be correspondingly higher (19%) so that the sum remains at least 34%. It seems very unlikely that either database could have an error rate as low as 10% as this would imply that the other has an error rate of at least 24%, and a difference of this magnitude should be noticeable in the numbers of pure taxa. Noting that the sum may be somewhat higher than its lower bound due to systematic errors reproduced in both databases, a reasonable rough estimate is that one in five sequences in Greengenes and SILVA have incorrect taxonomy annotations.
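The trade-off in this paragraph follows from the constraint that the two error rates must sum to at least 34%. A minimal sketch, with the function name min_other_rate chosen for the example:

```python
def min_other_rate(rate, sum_lower_bound=0.34):
    """Smallest error rate the other database can have, given that the
    two rates must sum to at least sum_lower_bound."""
    return max(0.0, sum_lower_bound - rate)


# The three scenarios discussed in the text.
for rate in (0.17, 0.15, 0.10):
    print(f"{rate:.0%} in one database -> at least {min_other_rate(rate):.0%} in the other")
```

This prints 17%, 19% and 24% for the other database, matching the figures in the text.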