Response to comments by the ARB/SILVA team

This is a response to comments on my preprint "Taxonomy annotation and guide tree errors in 16S rRNA databases", an updated version of which is now published in PeerJ.

"[Edgar] claims that the better performance of RDP vs. SILVA and Greengenes is due to the fact that RDP uses a fully-automated annotation system while SILVA and Greengenes are based on a semi-automated procedure involving manual curation by taxonomic experts".

The paper does not make this claim. On the contrary, I say (emphasis added) "the uncertainties in these estimates are too great to support a firm conclusion that annotations in any one of these databases are better or worse than another". I say the results "suggest" that predictions by the RDP Classifier are more accurate than those in SILVA and Greengenes, but I do not say this is because RDP is fully automated.

"SILVA is not only based on LPSN but, like RDP, uses the Bergey's Manual as the main source of taxonomy".

My mistake, I will correct this in the next revision.

"SILVA team has already made attempts to quantify and eliminate them see (Kozlov 2016)."

I appreciate this reference -- I wasn't aware of it and should have cited it; I will do that in the next revision.

"Edgar implies that only for RDP the history of genesis of the datasets is available."

Old SILVA trees are available, but to the best of my knowledge the authoritative subset is not documented for any release (i.e., the subset of sequences with taxonomy annotations that were considered reliable because they were based on expert classifications of cultured strains). My point here is that it is not possible to perform a blinded test of an old SILVA release compared with a new SILVA release.

"It is also a general misconception that we know the “true gene/phylogenetic tree” as stated on page 5, 13, 14 and 20."

My paper does not state that the true gene tree or true phylogenetic tree is known. On the contrary, I emphasize that the true trees are unknown and may not even be well-defined, and that estimated trees surely have many branching order errors compared to any of the possible "true" trees (gene tree, phylogenetic tree, etc.). Examples: p.5: "However, even if there is good reason to believe that the true gene tree would be an accurate guide to taxonomy, it is very challenging to estimate the gene tree from multiple sequence alignments that span vast evolutionary distances, and it is an open question whether estimated gene trees enable accurate taxonomy prediction in practice." p.6: "Note that the annotations of this subset tree are considered to be authoritative, but not necessarily the branching order." p.8: "Trees inferred from large sequence sets are unreliable guides to phylogeny due to incorrect branching orders. Naive assumptions that the tree is correct and the taxonomy is consistent with the tree are violated by branching order errors and also by taxa which are found to overlap by sequence".

"Consequently it is impossible to give a clear answer what is correct and what is wrong with respect to different taxonomic annotations."

If an environmental sequence is annotated as belonging to a taxon which is defined by traits, then this is a prediction which can always be checked in principle, and sometimes can be checked in practice. For example, Salmonella is non-spore-forming, gram-negative, predominantly motile, with cell diameters between about 0.7 and 1.5 µm, lengths from 2 to 5 µm, and peritrichous flagella all around the cell body. These are objective criteria which can be assessed by examining cells. If an environmental sequence is annotated as Salmonella, it is a prediction that these traits are present. This prediction can be checked if the sequence is later found in a cultured strain; this is what I call a blinded test. Rhodococcus has distinctly different traits, e.g. it is non-motile and gram-positive. If SILVA annotates a sequence as Salmonella and Greengenes annotates the same sequence as Rhodococcus, then it is certain that at least one of the annotations is objectively and verifiably wrong.

If SILVA leaves the genus blank and Greengenes annotates the same sequence as Salmonella, then either SILVA is a false negative or Greengenes is a false positive, and this can be verified. Again, in this scenario, at least one of the databases is certainly wrong, though we don't know which unless we find the sequence in cells which we can examine for traits.
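
The case analysis in the two paragraphs above can be sketched in a few lines of Python. This is only an illustration of the logic, not code from the paper: the function name `compare_genus` and its return labels are hypothetical, and it assumes that a missing annotation is represented by `None` or an empty string and that distinct genus names imply incompatible traits.

```python
from typing import Optional


def compare_genus(genus_a: Optional[str], genus_b: Optional[str]) -> str:
    """Classify how two genus annotations of the same sequence relate.

    Hypothetical helper for illustration; labels are not from the paper.
    """
    a = (genus_a or "").strip()
    b = (genus_b or "").strip()
    if a == b:
        # Same name, or both blank. Agreement says nothing about correctness.
        return "agree"
    if not a or not b:
        # One database is a false negative or the other is a false positive.
        return "blank_vs_named"
    # Two different names: at least one annotation is wrong.
    return "conflict"


print(compare_genus("Salmonella", "Rhodococcus"))
print(compare_genus(None, "Salmonella"))
```

In both the `conflict` and `blank_vs_named` cases at least one database is wrong, but the comparison alone cannot say which; that requires examining cells carrying the sequence.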

Many environmental sequences do not belong to named genera. For a given sequence, this can be verified experimentally by finding the sequence in an isolated strain having traits that do not match anything in the standard, say Bergey's. Suppose SILVA and Greengenes agree that the genus is blank but give different family names. If the standard (e.g., Bergey's) defines the families by their traits, then one of them is certainly wrong. However, in microbial taxonomy, some higher groups are defined simply by grouping lower taxa without explicitly listing characteristic traits. These groups are defined based on evidence that the lower taxa are related, and are subject to revision when new evidence suggests that they are polyphyletic. In these cases, the annotation can be regarded as a prediction that the sequence is found below the lowest common ancestor node for the group in the true phylogenetic tree. This prediction is objectively true or false providing that we assume that a true phylogenetic tree exists despite complications such as lateral gene transfer. If such predictions are objectively true or false, and if SILVA and Greengenes disagree on the annotation of this type of named group, then they cannot both be true.
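
The prediction that a sequence is found below the lowest common ancestor (LCA) node of a group can be stated precisely on a toy example. The sketch below assumes a rooted tree stored as a child-to-parent dictionary; the tree, the node names, and the function names are all hypothetical and chosen only for illustration. In practice the true tree is unknown, so this shows what the prediction means rather than how to verify it.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(parent, leaves):
    """Return the deepest node that is an ancestor of every leaf."""
    paths = [path_to_root(parent, leaf) for leaf in leaves]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first leaf, the first shared node is the LCA.
    for node in paths[0]:
        if node in common:
            return node
    raise ValueError("leaves do not share a root")


def is_below_group_lca(parent, group_leaves, query):
    """True if `query` descends from the LCA of the group's members."""
    lca = lowest_common_ancestor(parent, group_leaves)
    return lca in path_to_root(parent, query)


# Toy tree (hypothetical): root has children n1 and n2.
toy_parent = {
    "n1": "root", "n2": "root",
    "A": "n1", "B": "n1", "q_in": "n1",
    "C": "n2", "q_out": "n2",
}
group = ["A", "B"]  # members of a named group defined only by its lower taxa

print(is_below_group_lca(toy_parent, group, "q_in"))
print(is_below_group_lca(toy_parent, group, "q_out"))
```

If two databases assign the same query to different groups of this kind, and the groups' LCA nodes are not nested, then the query cannot be below both, which is why the two annotations cannot both be true.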

Taxonomy must be based on objective criteria, otherwise it is not scientifically useful or valid. Objective criteria are by definition true or false, though in some cases (e.g., monophyly) it may be impractical or impossible to definitively verify criteria experimentally. Which criteria should be used is debated, and there is plenty of room for subjective opinions, e.g. whether to split or lump new groups, or whether to revise the criteria for a clade. But regardless, the criteria used for classification should be consistent and objective.

"Taxonomy is never static, but rather a moving target influenced by the constant insertion of new sequences in the tree as well as substantially faster the tree calculation algorithms that allow to reconstruct trees with many more sequences."

The true gene tree of a given set of 16S sequences is static, though of course it is not known. If the SILVA tree for a given set of sequences is a moving target, this is due to errors in the tree.

The true taxonomy of a strain is static according to a given classification standard, e.g. Bergey's Manual, 7th edn. The true taxonomy of a sequence is therefore also static, though it is often not known because its traits are not known.

A taxonomic classification standard, e.g. Bergey's Manual, should be static as far as practically possible because the meaning of a name such as Salmonella should not change over time. Occasionally, the classification criteria (characteristic traits) for a taxon are revised; e.g., the description of a genus in the 7th edn. of Bergey's may be updated compared to the 6th edn., and in this limited sense taxonomy is a moving target. Also, assignments of taxa to higher ranks can change over time, typically because evidence from molecular sequences suggests that a group is polyphyletic. However, these issues account for only a small minority of conflicts in database annotations.

The predicted gene trees in SILVA and LTP are moving targets primarily because they have many errors, and the errors are different in each release. This surely also applies to Greengenes, but to the best of my knowledge only one Greengenes tree is publicly available so changes cannot be assessed. Possibly, predicted gene trees are gradually improving, though this is not certain because improvements in algorithms and computing power could be offset by the increased number of sequences, which makes alignment and tree inference more difficult. Regardless, the consistency of these trees with each other and with type strain taxonomy is very poor, almost certainly because they have pervasive branching order errors, and they are therefore not suitable as authoritative guides to phylogeny. The trees could nevertheless be useful for taxonomy prediction, but the results in my paper suggest that the RDP Classifier is probably better.

"[I]t is rather the rule than the exception to locate differences in the taxonomic annotations between databases... the results provided are neither new nor unexpected".

In (Kozlov 2016), a recent paper by the ARB/SILVA team, the Abstract states: "[A]n analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels". My paper shows that this is a drastic underestimate of the error rates, and the results are therefore both new and unexpected by the team's own standards.

Reference
Alexey M. Kozlov, Jiajie Zhang, Pelin Yilmaz, Frank Oliver Glöckner, Alexandros Stamatakis (2016). Phylogeny-aware identification and correction of taxonomically mislabeled sequences. Nucleic Acids Research, 44(11), pp. 5022-5033.