Part 2. Mock community analysis This analysis illustrates a very
important point that is often misunderstood in the literature:
it is NOT reasonable to validate an OTU clustering algorithm by testing
whether the number of OTUs is the same as the number of species in the mock
community. You MUST understand what the OTUs represent (expected species?
contaminants? read errors? chimeras?) and how many clusters can reasonably
be expected (are some species >97% similar? are some species absent from the
reads?). With the parameters we use here in Part 1, the number of OTUs is close to the
expected number (21 species), but a couple of them are misleading. In other studies, the
discrepancies are larger; for example it is quite common with HMP "Uneven"
mock communities for several species to be missing from the reads. Part 3 of
this tutorial will show that there are at least 40 species in the mock data,
so an algorithm cannot be criticized for making ~40 OTUs from these reads and
the fact that we get 20 here in Part 2 is partly due to luck.
Here we look at the results in misop/mock which were generated by the
run_mock.bash script. The otus.fa file contains the OTU representative
sequences. Notice that there are 20 OTUs, but there are 21 species in the
designed community. See misop/ref/HMP_MOCK.v35.fasta for the known 16S
sequences in those species.
The
uparse_ref command is used to compare OTUs with a reference database for
the mock community; HMP_MOCK.v35.fasta in our case. The command line in
run_mock1.bash is:
The output file is out/mock/otus.uparseref. Look at
this file using a Linux command like less, more or cat, or alternatively
open it in a text editor such as vi or emacs. See
uparseout option for description of the
file format. You'll see that most of the lines show a perfect 100% match to
a reference sequence, like this:
This is only 94% similar to the closest reference sequence. We can try to
identify it by searching a larger database of 16S sequences, e.g. the
gold UCHIME reference
(right-click on the link to download). Make a file otu20.fa (e.g. by copying
otus.fa and deleting all the other sequences), then run:
(The -notrunclabels option is needed because there are blanks in the
species names in gold.fa).
If you look in otu20.aln you will see a
100% match to AJ492829_S000145154, which is Pseudomonas poae. So this is
another perfect sequence, but not from the designed community. Normally I
would suspect a contaminant in the sample before sequencing, but in this
case it is because the mock and fecal samples were sequenced in the same run
as some soil samples. Pseudomonas is one of the most abundant OTUs in the
soil samples, so this is a contaminant which is probably explained by sample
cross-talk in the sequencing, e.g. due to barcode read errors.
So we
have 19 perfect OTUs from the expected community and one perfect contaminant
sequence. This leaves two species in the designed community which do not
have OTUs: P.acnes and S.aureus.
S.aureus drops out because it is
99.4% identical to S.epidermidis, so the S.aureus reads are assigned to the
S.epidermidis OTU. To check this, make a file pair.fa containing sequences
S.epidermidis.1 and S.aureus.1 from HMP_MOCK.v35.fasta and run:
usearch -cluster_fast pair.fa id 0.9 -alnout pair.aln
Look in pair.aln to see the alignment. You can find the aureus
reads by making a file s_aureus.fa with just the S.aureus reference
sequences and running this command:
usearch -usearch_global merged.fq -db s_aureus.fa -strand both -id 1.0 \
-userout s_aureus_reads.txt -userfields query Note the use of -id 1.0 to select reads that are
100% identical in order to filter out S.epidermidis reads. You will find 513
hits, so S.aureus is present but does not induce a separate OTU.
That
leaves P.acnes. We can perform the same exercise to see if P.acnes is
present in the reads. Make a file p_acnes.fa with just the P.acnes reference
sequence and run:
You will find three hits
in p_acnes_hits.txt, one with 100.0% identity and two with 99.6% identity.
If you do the same with the reverse reads (_R2_), you will find no hits. We
can ask if the P. acnes reads were merged by using the -tabbedout option of
fastq_mergepairs:
So P. acnes fell out of the analysis because the R2s failed to merge due
to having a large number of differences. They will be included if you set
the -fastq_maxdiffs and -fastq_maxdiffpct to sufficiently high values, e.g.
-fastq_maxdiffs 99 -fastq_maxdiffpct 99, though this would not usually be
recommended because you will tend to get many false-positive merges. Another
possible strategy is to make OTUs from the
forward reads only.