Spurious OTUs in mock and real samples

See also
Tolstoy's paradox

I believe that for a given protocol, from sample collection through library preparation, sequencing and data analysis, the number of spurious OTUs is approximately independent of community structure.

For very noisy methods such as QIIME closed-reference, this can be supported (though not shown definitively) by looking for dominant OTUs (manuscript in preparation).

The following arguments show that the claim is plausible.

Real samples are like mock plus more
Imagine sorting the species in a sample by decreasing abundance. Call the top 20 the mock-like subset and the remainder the low-abundance tail. A given number of reads of the mock-like subset will contain a similar set of errors due to PCR and sequencing, regardless of whether a low-abundance tail was also sequenced. If the top 20 represent only a small fraction of the sample, then repeat the thought experiment with the top 20 in the low-abundance tail, and so on.

If errors are random, the probability of a spurious OTU per read is constant
Errors are approximately random. Each time a new read is added, the probability that it induces a new spurious OTU is therefore approximately constant: errors can differ in so many ways that the probability of reproducing a previous spurious OTU is small. A given number of reads will therefore generate approximately the same number of spurious OTUs, regardless of whether those reads were derived from few or many species.
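The argument above can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions that are not part of the original argument: species are equally abundant, each read carries an error with a fixed probability, each erroneous read is one of a large number of equally likely variants of its parent sequence, and every distinct erroneous sequence is counted as one spurious OTU (no clustering step is applied). The function name `count_spurious` and all parameter values are hypothetical.

```python
import random

def count_spurious(n_reads, n_species, error_rate=0.05,
                   n_variants=10**6, seed=0):
    """Count distinct erroneous sequences among n_reads simulated reads.

    Assumptions (illustrative only): uniform species abundance, a fixed
    per-read error probability, and n_variants equally likely erroneous
    variants per species.
    """
    rng = random.Random(seed)
    spurious = set()
    for _ in range(n_reads):
        species = rng.randrange(n_species)      # which species the read came from
        if rng.random() < error_rate:           # this read contains an error
            variant = rng.randrange(n_variants) # which of the many possible errors
            spurious.add((species, variant))    # repeats of the same error collapse
    return len(spurious)

# Same number of reads, very different community richness.
few = count_spurious(100_000, n_species=20)
many = count_spurious(100_000, n_species=2000)
print("spurious sequences, 20 species:  ", few)
print("spurious sequences, 2000 species:", many)
```

Under these assumptions the two counts should be close to each other and close to the number of erroneous reads, because the same error is rarely produced twice. Reducing `n_variants` makes repeated errors more likely and weakens the effect, which is the condition the argument depends on.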

Spurious OTUs are not an artifact of mock community tests
If the number of spurious OTUs does not strongly depend on the structure of the community, then we will get similar numbers of spurious OTUs from mock and real samples.

Reference (please cite)
R.C. Edgar (2017), Accuracy of microbial community diversity estimated by closed- and open-reference OTUs, PeerJ 5:e3889
  • QIIME closed- and open-reference clustering generates huge numbers of spurious OTUs
  • Closed-reference OTU assignment splits strains and species even when there are no sequence errors
  • Closed-reference fails to assign different hyper-variable regions to the same OTU
  • Closed-reference discards many well-known species that are present in Greengenes