See also
OTU / denoising analysis pipeline
I recommend pooling samples, i.e. combining all reads for all samples, for generating OTUs and making an OTU table. This applies both to UPARSE (97% OTU clustering) and UNOISE (error-correction to obtain "ZOTUs", i.e. all biological sequences in the reads). The only exception is cross-talk, as explained next.
Cross-talk detection
To detect cross-talk manually or by using the UNCROSS algorithm, it is
important to include all reads from one sequencer run, even if the samples
are from different
environments. Each sequencing run should be analysed separately. If one run
contains unrelated samples, e.g. if your sequencing center did one run with
samples from several users, then you should get the reads for the other
samples for cross-talk analysis but not for other analyses as explained
below.
OTU clustering and denoising
For OTU clustering and
denoising, it is usually better to include reads from all related
samples, e.g. all samples from a given environment, even if they
were sequenced in more than one run or were sequenced together with other
environments (e.g., a sequencing center did one run with samples from
several users).
Comparing samples
Creating a single set of OTUs is the most natural and intuitive basis for sample
comparison, e.g. using a beta diversity metric. If you create separate
97% OTUs for
each sample, they are not directly comparable because the clustering will give
different results in each sample even if they contain mostly the same biological
sequences. With denoising, comparing OTUs from different samples is less of a
problem because if the same biological sequence is found in two samples, it
should be found in both cases. However, this doesn't always work correctly
and better results are obtained by pooling as explained under "Error detection"
below.
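As a small illustration of why a shared OTU set makes comparison straightforward, the sketch below computes Bray-Curtis dissimilarity between two samples from one OTU table. The OTU names and counts are invented for the example, and Bray-Curtis is just one possible beta diversity metric.

```python
# Toy OTU table built from pooled reads: every sample has a count for
# the same OTUs, so the two count vectors line up position by position.
# All names and counts here are made up for illustration.
otus = ["Otu1", "Otu2", "Otu3"]
table = {
    "S1": [10, 0, 5],
    "S2": [5, 5, 5],
}

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("count vectors must cover the same OTUs")
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 - 2.0 * shared / total

bc = bray_curtis(table["S1"], table["S2"])
print(f"Bray-Curtis(S1, S2) = {bc:.3f}")
```

If OTUs were made separately per sample, the two vectors would have no common index and a mapping between OTU sets would be needed first.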
Improved amplicon abundance estimation and singleton detection
If samples are pooled, then a sequence that appears as a singleton in one
sample may also appear in another sample. If singletons are discarded after
pooling (as usually recommended in order to reduce spurious OTUs), then more
low-abundance species will be retained compared with discarding singletons for
each sample separately.
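The effect can be seen in a toy example. The sketch below uses invented reads and a hypothetical helper, non_singletons, to compare discarding singletons per sample with discarding them after pooling.

```python
from collections import Counter

def non_singletons(reads):
    """Return the unique sequences that occur in at least two reads."""
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= 2}

# Invented reads for two samples.
samples = {
    "S1": ["AAAA", "AAAA", "CCCC"],
    "S2": ["AAAA", "CCCC", "GGGG"],
}

# Option 1: discard singletons within each sample, then combine.
kept_per_sample = set()
for reads in samples.values():
    kept_per_sample |= non_singletons(reads)

# Option 2: pool all reads first, then discard singletons.
pooled_reads = [seq for reads in samples.values() for seq in reads]
kept_pooled = non_singletons(pooled_reads)

print("per-sample:", sorted(kept_per_sample))
print("pooled:    ", sorted(kept_pooled))
```

Here CCCC is a singleton in each sample but has two reads after pooling, so it survives only in the pooled analysis. GGGG is a singleton either way and is discarded in both.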
Chimera detection
The UPARSE-OTU and UNOISE algorithms both require
that a chimera has lower read abundance than its parents. Chimeras are not
detected if a parent has the same number or fewer reads. This
most often happens with low-abundance parents, e.g. when a chimera and one of
its parents are both present in exactly two reads. If samples are pooled, parent
abundances usually increase because they are found in multiple samples, while
chimeras are only rarely reproduced so will usually be found only in a single
sample. Even if chimeras are reproduced, pooling will tend to increase both
chimera and parent abundances, leading to a more accurate reflection of amplicon
abundance so that parent abundances become greater than their chimeras.
Conversely, pooling is highly unlikely to increase the abundance of a chimera
relative to its parents. Pooling is therefore effective in reducing the number
of spurious OTUs due to chimeras.
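The abundance requirement can be sketched as follows. This only models the abundance condition described above, not the chimera detection itself. The sequence labels, counts and the helper abundance_allows_detection are invented for illustration.

```python
from collections import Counter

def abundance_allows_detection(counts, chimera, parents):
    """True if the chimera has strictly fewer reads than every parent."""
    return all(counts[chimera] < counts[p] for p in parents)

# Invented read counts. The chimera arises in sample S1 only.
s1 = Counter({"parentA": 5, "parentB": 2, "chimeraX": 2})
s2 = Counter({"parentA": 4, "parentB": 3})

# Pooling adds the counts for each unique sequence across samples.
pooled = s1 + s2

parents = ["parentA", "parentB"]
in_s1 = abundance_allows_detection(s1, "chimeraX", parents)
in_pool = abundance_allows_detection(pooled, "chimeraX", parents)

print("S1 alone:", in_s1)
print("pooled:  ", in_pool)
```

In S1 alone, parentB and the chimera both have two reads, so the condition fails. After pooling, parentB has five reads and the chimera still has two, so the condition holds.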
Error detection
The UNOISE algorithm
uses unique sequence abundances to detect bad reads. If a read (R) with low
abundance is very similar to a read with much higher abundance (H),
then R is probably a bad read with correct sequence H. This is most
effective when all samples are pooled together to give the highest possible
abundances for correct reads.
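A simplified sketch of this idea is shown below. It is not the actual UNOISE criterion: the thresholds max_diffs and min_ratio are arbitrary assumptions chosen for illustration, and the sequences and counts are invented.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def looks_like_bad_read(r_seq, r_count, h_seq, h_count,
                        max_diffs=1, min_ratio=8):
    """Flag R as a probable bad read of H.

    R must be very similar to H and much less abundant. Both thresholds
    are illustrative assumptions, not UNOISE parameters.
    """
    similar = hamming(r_seq, h_seq) <= max_diffs
    skewed = h_count >= min_ratio * r_count
    return similar and skewed

H = "ACGTACGT"  # correct sequence
R = "ACGTACGA"  # one difference from H

# H has 6 reads in each of three samples; R has 1 read in one sample.
flag_single_sample = looks_like_bad_read(R, 1, H, 6)
flag_pooled = looks_like_bad_read(R, 1, H, 6 * 3)

print("single sample:", flag_single_sample)
print("pooled:       ", flag_pooled)
```

With these toy numbers the abundance of H in one sample is too low to flag R, while the pooled abundance is high enough.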
When to pool in your pipeline
Samples should be combined after non-biological
sequences such as barcodes have been stripped from the reads, and before
dereplication. This is required so that dereplication reflects the abundances of
unique biological sequences across all samples.
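The order of operations can be sketched in a few lines. This is a toy model rather than a replacement for the real pipeline commands: the barcodes, reads and the helper strip_barcode are invented, and real reads would also need primer stripping and quality filtering.

```python
from collections import Counter

def strip_barcode(read, barcode):
    """Remove a leading barcode from a read (hypothetical helper)."""
    if not read.startswith(barcode):
        raise ValueError("read does not start with the expected barcode")
    return read[len(barcode):]

# Invented data: sample name -> (barcode, raw reads).
raw = {
    "S1": ("AAC", ["AACGGTTGG", "AACGGTTGG", "AACTTTTCC"]),
    "S2": ("GGA", ["GGAGGTTGG", "GGATTTTCC"]),
}

# 1. Strip barcodes, 2. pool all samples, 3. dereplicate.
pooled = [strip_barcode(read, barcode)
          for barcode, reads in raw.values()
          for read in reads]
uniques = Counter(pooled)

# For contrast: dereplicating before stripping keeps barcodes in the
# sequences, so identical biological sequences are counted separately.
uniques_unstripped = Counter(read for _, reads in raw.values()
                             for read in reads)

print(uniques.most_common())
print(uniques_unstripped.most_common())
```

After stripping, GGTTGG has three reads across both samples. Without stripping, the same biological sequence is split into two unique sequences with two reads and one read.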