Singletons are included by changing the -minsize option of cluster_otus
from 2 to 1. The expected error threshold is changed by setting the -fastq_maxee
option of fastq_filter to 2.0.
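To make the expected error threshold concrete, here is a minimal sketch of how the expected number of errors in a read can be computed from its quality string and compared with a threshold of 2.0. It assumes Phred+33 encoded qualities; the function names are hypothetical and this illustrates the calculation only, not the fastq_filter implementation.

```python
def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities for one read.

    Each quality character encodes a Phred score Q (ASCII code minus
    offset; Phred+33 is assumed here), and the error probability of
    that base is 10^(-Q/10).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_maxee(quality, maxee=2.0):
    """True if the read's expected errors do not exceed maxee."""
    return expected_errors(quality) <= maxee


if __name__ == "__main__":
    # Made-up quality strings, for illustration only.
    for name, qual in [("high_quality", "I" * 100), ("low_quality", "+" * 25)]:
        verdict = "keep" if passes_maxee(qual) else "discard"
        print(name, round(expected_errors(qual), 4), verdict)
```

Raising the threshold lets more reads through the filter: a read with about 1.5 expected errors passes at 2.0 but would be discarded by a hypothetical stricter setting of 1.0.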
Compare the otus.uparseref file in out/mock1 to the file we saw in out/mock
(discussed in Part 2 of this tutorial). We still see 19 perfect OTUs. We now
have a new good OTU for P.acnes with 99.6% identity to the reference sequence,
as intended.
However, we also get 19 additional OTUs which are classified as "other" by
uparse_ref because they are not close enough to the reference sequences.
For example:
Otu20  other  94.1  94.1  P.aeruginosa.1
Otu22  other  80.6  82.5  P.gingivalis.1
Otu23  other  95.7  95.7  B.cereus.1
Otu25  other  80.6  80.6  P.gingivalis.1
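Rows like these are easy to pull out with a short script. The sketch below assumes each record has exactly the five whitespace-separated fields shown in the example (OTU label, classification, two percentages, reference name). The field names pct_1 and pct_2 are placeholders, because the exact meaning of the two numeric columns is not spelled out here, and a real otus.uparseref file may contain extra columns that this sketch would skip.

```python
# The four example rows, one record per line.
ROWS = """\
Otu20 other 94.1 94.1 P.aeruginosa.1
Otu22 other 80.6 82.5 P.gingivalis.1
Otu23 other 95.7 95.7 B.cereus.1
Otu25 other 80.6 80.6 P.gingivalis.1
"""


def parse_rows(text):
    """Parse five-field rows into dicts; skip lines with any other shape."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # not a row in the assumed five-column layout
        label, cls, pct_1, pct_2, ref = fields
        records.append({
            "label": label,
            "class": cls,
            "pct_1": float(pct_1),  # placeholder name, meaning assumed
            "pct_2": float(pct_2),  # placeholder name, meaning assumed
            "ref": ref,
        })
    return records


def select_class(records, cls="other"):
    """Return only the records with the given classification."""
    return [r for r in records if r["class"] == cls]


if __name__ == "__main__":
    for rec in select_class(parse_rows(ROWS)):
        print(rec["label"], rec["ref"], rec["pct_1"], rec["pct_2"])
```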
What are these OTUs? Where do they come from? Most of them appear to be good
biological sequences with at most one or two incorrect bases: 16 out of 19 are
>99% identical to a Greengenes sequence. As with the Pseudomonas OTU we found
in Part 1, these appear to be contaminants, which may be caused by sample
cross-talk. So we should not conclude that the algorithms have a problem
because we got 39 OTUs (19 perfect, 1 good and 19 other) on a mock community
with 21 species: the algorithms cannot correct for cross-talk errors.
However, in other cases I have found that it is better to cluster with the
more stringent parameters because otherwise the number of spurious OTUs due
to read errors and undetected chimeras increases. I think it is always a
good idea to sequence a mock community and thoroughly understand the results
before making a final decision on what parameters to use on your "real"
samples. Another useful exercise is to check the error rates after quality
filtering. This is discussed in Part 4.