See also: OTU / denoising pipeline
Quality control for OTUs
Types of control sample
There are three main types of control sample: (a) null controls such as distilled water, (b) artificial ("mock") communities containing a mix of known strains, and (c) single strains.
Null samples show contaminants and cross-talk. Cross-talk can be distinguished from contaminants because a cross-talk OTU that appears in the null sample usually has high abundance in one of the real samples. Contaminants on the flow-cell may appear with roughly equal abundance in all samples. The distribution of contaminants introduced earlier will depend on your sample and library preparation methods.
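The rule of thumb above can be sketched in Python. This is a minimal illustration, not a USEARCH feature: the function name, the `dominance` and `max_cv` thresholds, and the example counts are all hypothetical, and it assumes the samples have comparable read depth (or that counts have already been normalized).

```python
from statistics import mean, pstdev

def classify_null_otus(counts, null_sample, dominance=10.0, max_cv=0.5):
    """Label each OTU seen in the null control (hypothetical heuristic).

    counts: dict mapping OTU label -> dict mapping sample name -> read count.
    dominance and max_cv are illustrative thresholds, not recommended values.
    """
    labels = {}
    for otu, by_sample in counts.items():
        null_count = by_sample.get(null_sample, 0)
        if null_count == 0:
            continue  # OTU does not appear in the null control
        real = [c for s, c in by_sample.items() if s != null_sample]
        if not real:
            labels[otu] = "unresolved"
            continue
        everything = real + [null_count]
        cv = pstdev(everything) / mean(everything)
        if max(real) >= dominance * null_count:
            # Far more abundant in one real sample than in the null control
            labels[otu] = "probable cross-talk"
        elif cv <= max_cv:
            # Roughly equal abundance in all samples, including the null
            labels[otu] = "probable even contaminant"
        else:
            labels[otu] = "unresolved"
    return labels

# Made-up counts for illustration only
example_counts = {
    "Otu1": {"S1": 5000, "S2": 3, "S3": 0, "Null": 12},
    "Otu2": {"S1": 40, "S2": 35, "S3": 45, "Null": 38},
    "Otu3": {"S1": 900, "S2": 0, "S3": 0, "Null": 0},
    "Otu4": {"S1": 100, "S2": 0, "S3": 2, "Null": 30},
}
null_labels = classify_null_otus(example_counts, "Null")
print(null_labels)
```

OTUs labelled "unresolved" are the ones to inspect by hand, for example contaminants introduced before sequencing whose distribution depends on the preparation method.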
Mock samples are typically classified as "even" or "staggered", referring to the abundances of the strains. An even sample may be designed to contain equal numbers of cells for each strain, or equal concentrations of 16S DNA. These are quite different definitions of evenness because the number of 16S genes per genome varies between species. Mock communities for 16S can be ordered from BEI Resources (found via a web search).
Single strains provide a cleaner test in some ways, especially if the strain is chosen to have only one 16S sequence. A mock community, on the other hand, enables measurement of the chimera formation rate and chimera filtering efficiency, which cannot be measured with a single strain.
Staggered mock communities have low abundances for some strains, and therefore provide a better test of sensitivity. In theory, even communities enable measurement of abundance bias, but in practice this is dubious because the accuracy of the mixing is unknown unless you can verify it by shotgun sequencing, qPCR or some other method independent of amplicon sequencing. With staggered communities, accurate mixing is presumably more challenging.
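To illustrate why the two definitions of evenness differ, here is a small worked example in Python. The strain names and 16S copy numbers are invented for illustration, and the calculation ignores amplification bias and every other source of error; it only shows how equal cell counts translate into unequal shares of 16S gene copies.

```python
def expected_16s_fractions(cells, copies_per_genome):
    """Expected share of 16S gene copies per strain, ignoring all bias.

    cells: dict strain -> number of cells in the mix.
    copies_per_genome: dict strain -> 16S genes per genome.
    """
    gene_copies = {s: cells[s] * copies_per_genome[s] for s in cells}
    total = sum(gene_copies.values())
    return {s: n / total for s, n in gene_copies.items()}

# Hypothetical strains and copy numbers, for illustration only
copies = {"strain_A": 1, "strain_B": 4, "strain_C": 7}

# "Even" by cell count: the same number of cells of each strain
even_by_cells = {"strain_A": 1000, "strain_B": 1000, "strain_C": 1000}
fractions = expected_16s_fractions(even_by_cells, copies)
for strain, frac in fractions.items():
    print(f"{strain}: {frac:.3f}")
```

With these made-up numbers, the strain with seven 16S copies contributes seven times as many gene copies as the single-copy strain, even though the cell counts are identical. A mix that is even in 16S DNA concentration would instead need cell counts inversely proportional to copy number.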
Why control samples?
I believe that it is important to include control samples, preferably a few
replicates of a mock community sample, in every sequencing run. It is not
enough to do this once to validate a protocol, because conditions change: the
next run may be on a different machine with a different version of the
base-calling software, the PCR and library preparation protocol may be
different, and so on. Without control samples, it is very difficult to validate
that an analysis pipeline is working well, because you won't know how many of
your OTUs are spurious due to read errors, chimeras, cross-talk, contamination
and so on. You need replicates of the mock sample to determine the variation in
abundances between samples; otherwise you won't know whether differences in
diversity metrics reflect real biological variation or fluctuations due to
sampling effects and similar noise.
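One simple way to quantify the variation between mock replicates is the coefficient of variation (CV) of each OTU's relative frequency. The sketch below is an illustration rather than a method prescribed by this page: the function name and the counts are hypothetical, and it needs at least two replicates.

```python
from statistics import mean, stdev

def replicate_cv(replicates):
    """CV of each OTU's relative frequency across replicate samples.

    replicates: list of dicts, one per replicate, mapping OTU label -> count.
    Requires at least two replicates, each with a non-zero total count.
    """
    freqs = []
    for counts in replicates:
        total = sum(counts.values())
        freqs.append({otu: c / total for otu, c in counts.items()})
    otus = sorted({otu for f in freqs for otu in f})
    cv = {}
    for otu in otus:
        values = [f.get(otu, 0.0) for f in freqs]
        m = mean(values)
        cv[otu] = stdev(values) / m if m > 0 else float("nan")
    return cv

# Made-up counts for two mock replicates
mock_replicates = [
    {"Otu1": 60, "Otu2": 40},
    {"Otu1": 40, "Otu2": 60},
]
cv_by_otu = replicate_cv(mock_replicates)
print(cv_by_otu)
```

A difference between real samples that is no larger than the spread seen between mock replicates should be treated with caution.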
Reference database
You should get, or make, a reference database of the known sequences in your
single-strain and mock control samples. If you are using BEI Resources mock
samples, let me know and I will send you a high-quality database made using the
search_16s command from the latest finished genomes in GenBank.
Pool all samples, control and "real"
You should pool the mock samples together with the other samples sequenced in
the same run. Otherwise, you won't be able to see cross-talk. Your goal should
be to manually annotate an OTU table, as in this MiSeq example, so that it shows
where all of the OTUs come from: the expected species in the mock community,
cross-talk, chimeras, or spurious sequences with >3% error. In other words, you
should be able to account for every OTU sequence that has reads assigned to the
mock samples.
Classify the OTU sequences
The uparse_ref command and the annot command can be used to classify the OTU
sequences which appear in your mock samples. To get just those OTU sequences,
use the otutab_sample_subset command to keep only the mock samples, then
otutab_trim to discard OTUs with zero counts, then the Linux cut command
(cut -f1 otutab.txt) to get the OTU labels, then fastx_getseqs to get the
corresponding OTU sequences.
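The data flow of these four steps can be sketched in plain Python. This is not a replacement for the commands above and does not call them. It assumes a tab-separated OTU table whose header row holds the sample names and whose first column holds the OTU labels, and FASTA labels that match those OTU labels up to the first whitespace. The function names and example data are hypothetical.

```python
def mock_otu_labels(table_text, mock_samples):
    """Labels of OTUs with a non-zero total count in the mock samples."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    # Column indexes of the mock samples (the "sample subset" step)
    cols = [i for i, name in enumerate(header) if name in mock_samples]
    labels = []
    for line in lines[1:]:
        fields = line.split("\t")
        # Keep OTUs that are not all-zero in the mock columns (the "trim" step)
        if sum(int(fields[i]) for i in cols) > 0:
            labels.append(fields[0])  # first column, like cut -f1
    return labels

def select_fasta(fasta_text, labels):
    """Sequences whose FASTA label is in labels, as a dict label -> sequence."""
    wanted = set(labels)
    selected = {}
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            label = parts[0] if parts else ""
            current = label if label in wanted else None
            if current is not None:
                selected[current] = ""
        elif current is not None:
            selected[current] += line.strip()
    return selected

# Made-up table and sequences for illustration
example_table = (
    "#OTU ID\tSoil1\tMock1\tMock2\n"
    "Otu1\t10\t5\t0\n"
    "Otu2\t7\t0\t0\n"
    "Otu3\t0\t0\t2\n"
)
example_fasta = ">Otu1\nACGT\nACGT\n>Otu2\nGGGG\n>Otu3\nTTTT\n"

mock_labels = mock_otu_labels(example_table, {"Mock1", "Mock2"})
mock_seqs = select_fasta(example_fasta, mock_labels)
print(mock_labels)
print(mock_seqs)
```

The resulting sequences are the ones to classify against the reference database of known mock strains.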