See also
Edgar, R.C. (2017), Updating the 97% identity threshold for 16S ribosomal RNA OTUs
Comments on Westcott and Schloss (2017) "OptiClust" paper
Comments on the the "MCC" measure proposed by Schloss & Westcott
Below is my response to Dr. Schloss's comments (blue text) posted at biorxiv.
"[U]sing taxonomy as the ground truth for assessing a threshold is problematic."
Taxonomy is based on observed traits deemed by experts to be characteristic of related groups of organisms, and thus provides an independent biological standard for assessment of methods based on sequence, unlike the MCC_sw metric advocated by Dr. Schloss which is entirely based on a fixed 97% identity threshold. I would agree that lumping and splitting is often arbitrary, and the species concept itself is problematic for prokaryotes. At 97% V4 identity, a pair of sequences are roughly equally likely to belong to the same genus, family, order or class (P ~ 0.2 for each rank; paper in submission). The solution to the inherent problems with defining species and OTUs is to use ZOTUs, which gives the best possible resolution of phenotype.
"[S]ome species are finely split...and others are lumped".
The paper repeatedly emphasizes that clustering is an approximation, that the goal of the work is to optimize the correspondence of clusters to species, and that any clustering method will inevitably lump and / or split some taxa, so these points are already thoroughly covered. I appreciate the references to prior work supporting the main conclusion that the 97% threshold is too low, and will be glad to cite them in the next update of the paper.
"Previously, Edgar has indicated that he thinks this variation is the result of sequencing artifacts or contamination [but they are actually due to] intragenomic variation".
Not true. My previous papers, and this one, clearly state that intragenomic variation is one of several issues to be considered in clustering. In the UNOISE2 paper cited by Dr. Schloss: "the reference database for the HMP mock community (21 strains) has 115 different 16S sequences, an average of 5.5 distinct 16S sequences per strain." In this paper: "I constructed a database... [which] ensure[s] that intra-genome variation between 16S rRNA paralogs was accurately represented."
"...the more restrictive threshold suggested by Edgar would generate 3 OTUs.... the method would split sequences from the same strain into different OTUs".
The paper proactively draws attention to this point. Some degree of lumping and/or splitting is unavoidable; the question is how best to deal with the issue. Splitting is a benign problem for most purposes because the biological conclusions of typical studies would be at least as good, and perhaps better, than lumping at 97%. For example, a random forest classifier would identity all three OTUs as predictive, or not, of the metadata; a Wilcoxon signed rank test would find, or not, that alpha diversity varies significantly between different healthy and diseased samples, and so on. However, if OTUs often lump distinct phenotypes, as will often be the case with the 97% OTUs advocated by Dr. Schloss, then strains with biological significance may be lumped with strains that are not significant and the signal could be diluted or lost. Thus, lumping can be harmful but splitting is generally benign.
"Although Edgar's Pcs calculations seem to account for intraspecies variation, it does not seem to factor in intrastrain variation."
This is incorrect. The reference databases were carefully constructed to model instrastrain variation as well as intra-species variation. As quoted above: "I constructed a database...[which] ensure[s] that intra-genome variation between 16S rRNA paralogs was accurately represented."
"Some [denoisers] are aggressive in removing rare sequences that may be true sequences..."
Clustering sequences at 97% is much more aggressive than denoising in removing variants -- even with short V4 sequences, variants with up to 7 or 8 differences are lumped into the same cluster and with longer sequences, more differences are allowed. In the case of UNOISE2, unique sequences with total abundance <8 over all samples are discarded by default, and Dr. Schloss is correct that some of these sequences will be valid. However, most of them are noise, and in a typical study with tens or hundreds of samples, a sequence with total abundance <8 will be a singleton or absent in most samples and it will not be possible to make statistically valid inferences about them.
"I am far more confident in the quality of sequences generated from fully overlapping 250 nt MiSeq reads for the V4 region..."
Fully overlapping V4 is effective for suppressing sequencer error, but not PCR error, contaminants or cross-talk. Denoisers are very effective at removing both sequencer error and PCR error, at least on mock community tests which is the most realistic model we have of the noise generated by the complete protocol from library preparation through sequencing. I showed in the UPARSE paper that highly accurate sequences could be generated by the R2 reads alone on mock community tests. The R2s were much noisier than the R1s, but UPARSE was nevertheless able to recover a 97% subset of the mock community sequences almost perfectly, showing that UPARSE is very robust against noisy reads without any overlapping. It is therefore better to sequence longer amplicons with partial overlap and use a state of the art algorithm such as UNOISE2, UPARSE or DADA2. This will give significantly better phylogenetic and phenotype resolution than the fully-overlapping V4 reads advocated by the Schloss lab in Kozich et al. 2013.
"...mothur is ... considerably fast[er] compared to [methods that use] pairwise alignment[s]."
Not true. In fact, UPARSE and UNOISE2 are both much faster than mothur for large datasets, and both methods construct pair-wise alignments using dynamic programming.
"The MCC_sw metric... evaluates how well an algorithm balances the need to split and lump similar 16S rRNA gene sequences when assigning sequences into a bin."
MCC_sw is based on the assumption that clustering at 97% is inherently meaningful and useful, which is (rightly) criticized by Dr. Schloss in his own comments. Most importantly, MCC_sw does not consider whether the splitting biologically correct or whether the sequences in an OTU are valid or noise. For example, MCC_sw does not penalize splitting paralogs from a single strain into two clusters (a significant error by Dr. Schloss's) 's own standards) or lumping different phenotypes into the same cluster. There is also no penalty for a cluster which is an experimental artifact, e.g. entirely composed of chimeric sequences. Dr. Schloss uses MCC_sw in a published benchmark test to compare clusters made with the OptiClust method in mothur against clusters made by other methods. Since OptiClust explicitly attempts to maximize MCC_sw, this benchmark is egregiously biased towards mothur.
MCC_sw is a defensible heuristic for making de novo 97% OTUs that works reasonably well in practice if stringent error filtering is performed before clustering. However, MCC_sw has a stronger tendency to split paralogs and species than UPARSE because UPARSE considers distance to be significant in assigning sequences to clusters (closer is better), while MCC_sw considers any assignment <97% to be equally good. This is shown by example #3 in my critique of MCC, where MCC_sw makes a sub-optimal assignment but UPARSE would correctly identify the best clusters. The MCC_sw metric is therefore inferior to the optimization metric defined by UPARSE.