Control samples

Types of control sample
There are three main types of control sample: (a) null controls such as distilled water, (b) artificial ("mock") communities containing a mix of known strains, and (c) single strains.

Null samples show contaminants and cross-talk. Cross-talk can be distinguished from contaminants because cross-talk OTUs that appear in the null sample usually have high abundance in one of the real samples. Contaminants on the flow-cell may appear with roughly equal abundance in all samples. The distribution of contaminants introduced earlier will depend on your sample and library preparation methods.
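The rule of thumb above can be sketched in a few lines of Python. This is an illustration only, not a USEARCH command: the function name, the sample names and the two thresholds (ratio and spread) are hypothetical and would need tuning for real data.

```python
def classify_null_otus(counts, null_sample, ratio=10.0, spread=3.0):
    """Tentatively label OTUs that appear in the null control.

    counts: dict mapping OTU label -> dict mapping sample name -> read count.
    ratio and spread are arbitrary illustrative thresholds (assumptions).
    """
    calls = {}
    for otu, by_sample in counts.items():
        null_n = by_sample.get(null_sample, 0)
        if null_n == 0:
            continue  # OTU not seen in the null control
        real = [n for s, n in by_sample.items() if s != null_sample]
        everything = real + [null_n]
        if real and max(real) >= ratio * null_n:
            # Much more abundant in some real sample: looks like cross-talk
            calls[otu] = "cross-talk?"
        elif min(everything) > 0 and max(everything) <= spread * min(everything):
            # Roughly equal counts everywhere: possible flow-cell contaminant
            calls[otu] = "flow-cell contaminant?"
        else:
            calls[otu] = "unclear"
    return calls


# Made-up counts for one null control and two real samples
counts = {
    "Otu1": {"null": 5, "S1": 4000, "S2": 3},
    "Otu2": {"null": 20, "S1": 25, "S2": 18},
    "Otu3": {"null": 0, "S1": 100, "S2": 90},
    "Otu4": {"null": 50, "S1": 0, "S2": 2},
}
calls = classify_null_otus(counts, "null")
for otu, call in calls.items():
    print(otu, call)
```

The question marks in the labels are deliberate: a call like this is a starting point for manual inspection, not a verdict.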

Mock samples are typically classified as "even" or "staggered", referring to the abundances of the strains. An even sample may be designed to contain equal numbers of cells for each strain, or equal concentrations of 16S DNA. These are quite different definitions of evenness because the number of 16S genes per genome varies between species. Mock communities for 16S can be ordered from BEI Resources (Google search).

Single strains provide a cleaner test in some ways, especially if the strain is chosen to have only one 16S sequence, while a mock community enables measurement of the chimera formation rate and chimera filtering efficiency, which cannot be measured with a single strain.

Staggered mock communities have low abundances for some strains, and therefore provide a better test of sensitivity. In theory, even communities enable measurement of abundance bias, but in practice this is dubious because the accuracy of the mixing is unknown unless you can verify it by shotgun sequencing, qPCR or some other method independent of amplicon sequencing. With staggered communities, accurate mixing is presumably more challenging.
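To see how far apart the two definitions of evenness can be, here is a small worked example in Python. The strain names and 16S copy numbers are invented for illustration; look up the real copy numbers for the strains in your own mock community.

```python
def expected_16s_fractions(cells, copies):
    """Expected fraction of 16S gene copies contributed by each strain.

    cells:  dict strain -> number of cells in the mix
    copies: dict strain -> 16S gene copies per genome
    Ignores extraction and amplification bias (a simplifying assumption).
    """
    total = sum(cells[s] * copies[s] for s in cells)
    return {s: cells[s] * copies[s] / total for s in cells}


# Hypothetical community that is "even" by cell count
cells = {"StrainA": 1000, "StrainB": 1000, "StrainC": 1000}
copies = {"StrainA": 1, "StrainB": 4, "StrainC": 7}

fractions = expected_16s_fractions(cells, copies)
for strain, frac in fractions.items():
    print(f"{strain}\t{frac:.3f}")
```

With these made-up numbers the three strains contribute 1/12, 4/12 and 7/12 of the 16S copies, so a community that is even by cell count is far from even by 16S gene count.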

Why control samples?
I believe that it is important to include control samples, preferably a few mock community sample replicates, in every sequencing run. It is not enough to do this once to validate a protocol, because conditions change -- the next run may be on a different machine with a different version of the base caller software, the PCR and library preparation protocol may be different, etc. Without control samples, it is very difficult to validate that an analysis pipeline is working well -- you won't know how many of your OTUs are spurious due to read errors, chimeras, cross-talk, contamination and so on. You need replicates of the mock sample to determine the variation in abundances between samples, otherwise you won't know whether differences in diversity metrics reflect real biological variation or fluctuations due to sampling effects etc.
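One simple way to quantify the variation between mock replicates is the coefficient of variation (standard deviation divided by mean) of each OTU's relative abundance. The sketch below is my choice of summary statistic for illustration, not something prescribed above; the counts are made up, and it assumes at least two replicates with non-zero read totals.

```python
from statistics import mean, stdev


def replicate_cv(replicates):
    """Coefficient of variation of relative abundance for each OTU.

    replicates: list of dicts (one per mock replicate), OTU label -> count.
    Requires two or more replicates, each with at least one read.
    """
    otus = sorted(set().union(*replicates))
    freqs = []
    for rep in replicates:
        total = sum(rep.values())
        # Convert counts to relative abundances within this replicate
        freqs.append({o: rep.get(o, 0) / total for o in otus})
    cv = {}
    for o in otus:
        vals = [f[o] for f in freqs]
        m = mean(vals)
        cv[o] = stdev(vals) / m if m > 0 else float("nan")
    return cv


# Three made-up replicates of the same mock community
reps = [
    {"Otu1": 500, "Otu2": 500},
    {"Otu1": 600, "Otu2": 400},
    {"Otu1": 400, "Otu2": 600},
]
cv = replicate_cv(reps)
for otu, value in cv.items():
    print(f"{otu}\t{value:.2f}")
```

Differences between real samples that are no larger than the spread seen between mock replicates should be treated with caution.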

Reference database
You should get, or make, a reference database of known sequences in your single strain and mock control samples. If you are using BEI Resources mock samples, let me know and I will send you a high quality database made using the search_16s command from the latest finished genomes in GenBank.

Pool all samples, control and "real"
You should pool the mock samples together with the other samples sequenced in the same run. Otherwise, you won't be able to see cross-talk. Your goal should be to account for all of the OTU sequences for the reads which are assigned to the mock samples, by manually annotating an OTU table like this MiSeq example, which shows where all of the OTUs come from: the expected species in the mock community, cross-talk, chimeras or spurious sequences with >3% error.

Classify the OTU sequences
The uparse_ref command and the annot command can be used to classify the OTU sequences which appear in your mock samples. To get just those OTU sequences, first use the otutab_sample_subset command to extract the mock samples from the OTU table, then otutab_trim to discard OTUs with zero counts, then the Linux cut command (cut -f1 otutab.txt) to get the OTU labels, and finally fastx_getseqs to get the corresponding OTU sequences.
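For readers who want to see what these steps do to the data, here is a rough pure-Python equivalent. It is a sketch and not a replacement for the USEARCH commands: it assumes a tab-separated OTU table with a header row of sample names, and FASTA header lines that match the OTU labels exactly. The function names and the toy data are hypothetical.

```python
def mock_otu_labels(otutab_lines, mock_samples):
    """Labels of OTUs with a non-zero total count across the mock samples."""
    header = otutab_lines[0].rstrip("\n").split("\t")
    cols = [i for i, name in enumerate(header) if name in mock_samples]
    labels = []
    for line in otutab_lines[1:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(header):
            continue  # skip blank or truncated lines
        if sum(int(fields[i]) for i in cols) > 0:
            labels.append(fields[0])
    return labels


def select_fasta(fasta_lines, labels):
    """Return dict label -> sequence for the requested labels only."""
    wanted = set(labels)
    seqs = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            label = line[1:]
            current = label if label in wanted else None
            if current is not None:
                seqs[current] = ""
        elif current is not None:
            seqs[current] += line  # sequences may span several lines
    return seqs


# Toy data: one real sample and two mock replicates
otutab = [
    "#OTU ID\tGut1\tMock1\tMock2",
    "Otu1\t120\t0\t0",
    "Otu2\t3\t950\t870",
    "Otu3\t0\t0\t4",
]
fasta = [">Otu1", "ACGT", ">Otu2", "GGCC", "TTAA", ">Otu3", "CCGG"]

labels = mock_otu_labels(otutab, {"Mock1", "Mock2"})
mock_seqs = select_fasta(fasta, labels)
for label, seq in mock_seqs.items():
    print(f">{label}\n{seq}")
```

The resulting sequences are the ones you would then pass to a classification step such as uparse_ref or annot.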