Problems with closed- and open-reference OTU assignment
See also
Closed-reference OTU algorithm
closed_ref command
I have performed a detailed assessment of
closed- and open-reference OTU assignment, which identified several problems summarized below. A manuscript is in submission. If you
are interested, please let me know and I will send you a copy. I have not
posted a preprint because I don't know if public preprints are allowed by the
journal. Unless otherwise stated, all results were obtained with QIIME v1.9
using default / recommended parameters.
Specific species and strains mentioned below were chosen because they are in the
HMP mock community, which has been sequenced as a control in several
different studies. This provides test data both with experimental error (the
reads) and without it (the known sequences, because high-quality full-length
genome sequences are available).
Error-free sequences for one species are often split over several OTUs
A single strain often contains two or more 16S paralogs, i.e.
separate copies of the 16S gene (see 16S copy number). Sometimes these
paralogs have different sequences, especially
when there is variation between different strains of the same species. I
found that the known sequences in single strains are often split over two or
more closed-reference OTUs. For example, the known V4 sequences in
C. beijerinckii strain NCIMB 8052 are split over two OTUs (555688
and 238205). This splitting occurs with correct sequences, so is not due to
sequencing or PCR errors.
Splitting is severe in practice, which can grossly inflate diversity estimates
When errors due to sequencing and PCR are present, a single strain
is often split over tens to hundreds of OTUs by the recommended QIIME
protocol. On Illumina V4 reads of a mock community with 22 strains
(Bokulich et al. 2013), 4,482 OTUs were generated. Even when rarefied to
1,000 reads per sample (using the -a option of
core_diversity_analysis.py), which is very shallow by current
standards, the number of OTUs was ~100, inflating richness by a factor of
~5 (see Figure).
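The two quantities discussed above, the number of OTUs a single strain is split over and the resulting inflation in richness, can be sketched as follows. This is a minimal illustration, not the procedure used for the assessment: it assumes the true strain of each read is already known (for example from mapping mock-community reads to the reference genomes), and the function names, read IDs, strain labels and OTU labels are all hypothetical.

```python
from collections import defaultdict


def otus_per_strain(read_to_otu, read_to_strain):
    """Count the distinct OTUs that the reads of each strain were assigned to.

    read_to_otu:    dict, read id -> OTU id (None if the read failed)
    read_to_strain: dict, read id -> known strain of origin (assumed known)
    """
    otus = defaultdict(set)
    for read_id, otu_id in read_to_otu.items():
        strain = read_to_strain.get(read_id)
        if otu_id is None or strain is None:
            continue  # skip fails and reads of unknown origin
        otus[strain].add(otu_id)
    return {strain: len(ids) for strain, ids in otus.items()}


def richness_inflation(observed_otus, true_strains):
    """Ratio of the observed OTU count to the true number of strains."""
    return observed_otus / true_strains


# Toy, made-up assignments for illustration only.
read_to_strain = {"r1": "strainA", "r2": "strainA", "r3": "strainA", "r4": "strainB"}
read_to_otu = {"r1": "otu1", "r2": "otu2", "r3": "otu2", "r4": "otu3"}
print(otus_per_strain(read_to_otu, read_to_strain))

# Using the figures quoted above: ~100 OTUs after rarefaction, 22 strains.
print(round(richness_inflation(100, 22), 1))
```

With the numbers quoted in the text, 100 / 22 is about 4.5, consistent with the ~5x inflation stated above.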
Well-known species often fail (are not assigned to a closed-reference OTU)
Sequences that are present in the full Greengenes
database before clustering are often not assigned to a closed-reference OTU.
These are called fails in QIIME terminology. With the HMP mock community, the
known sequences for one species (S. aureus) fail on the V4 region, and four
out of 21 species fail on the V3-V5 region (L. monocytogenes, S. mutans,
N. meningitidis and P. aeruginosa).
Non-overlapping regions in most species are assigned to different closed-reference OTUs
The QIIME documentation
states "You must use closed-reference OTU picking if you are comparing
non-overlapping amplicons, such as the V2 and the V4 regions of the 16S
rRNA" (http://qiime.org/tutorials/otu_picking.html, accessed 25 April 2017).
Presumably, this is based on
the assumption that non-overlapping tags for a given species will usually be
assigned to the same OTU by closed-reference. However, the probability that
two non-overlapping tags from the same full-length sequence will be assigned to
the same OTU is only ~30%. If three or more tags are used, the probability
is lower.
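One way to estimate such a probability is sketched below. It is a minimal illustration under stated assumptions, not necessarily how the ~30% figure was obtained: each full-length sequence is assumed to have been cut into two non-overlapping tags, each tag assigned separately by closed-reference, and only sequences where both tags were assigned (neither is a fail) are counted. The function name and all sequence and OTU identifiers are hypothetical.

```python
def same_otu_fraction(assign_a, assign_b):
    """Fraction of full-length sequences whose two tags map to the same OTU.

    assign_a, assign_b: dict, full-length sequence id -> OTU id assigned to
    the tag from region A or region B (None if that tag failed).
    Sequences with a failed tag are excluded (an assumption of this sketch).
    """
    both_assigned = [
        seq_id
        for seq_id in assign_a
        if assign_a[seq_id] is not None and assign_b.get(seq_id) is not None
    ]
    if not both_assigned:
        return 0.0
    same = sum(1 for seq_id in both_assigned if assign_a[seq_id] == assign_b[seq_id])
    return same / len(both_assigned)


# Toy, made-up assignments for illustration only.
v2_tags = {"seq1": "otuA", "seq2": "otuB", "seq3": "otuC", "seq4": None}
v4_tags = {"seq1": "otuA", "seq2": "otuX", "seq3": "otuY", "seq4": "otuZ"}
print(same_otu_fraction(v2_tags, v4_tags))  # 1 of 3 comparable sequences agree
```

Extending this to three or more regions would require all tags to agree, so the fraction cannot increase as regions are added.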
Invalid taxonomy
In
QIIME v1.9, taxonomy annotations for closed-reference OTUs are in
gg_13_8_otus/97_otu_taxonomy.txt. In these
annotations, 36 genera are placed in two or more families, violating the
structure required for a valid taxonomy. To give some examples, genus
Rhodospirillum is placed in family Rhodospirillaceae (e.g. in the annotation
for OTU 326714), which is correct according to Bergey's Manual, and also in
family Alcaligenaceae (OTU 119663). Genus Vibrio is placed in Vibrionaceae (OTU
9303, correct) and in Pseudoalteromonadaceae (OTU 1115975). Genus Flexibacter
is placed in three families: Cytophagaceae (OTU 1142767, correct),
Flammeovirgaceae (OTU 4447268), and Flavobacteriaceae (OTU 1136639).
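A check of this kind can be sketched in a few lines. The sketch assumes each line of the taxonomy file has the form OTU id, a tab, then a semicolon-separated lineage with rank prefixes such as f__ and g__. The function name is hypothetical, and the example lines are abbreviated illustrations built from the OTU ids and family names quoted above, not verbatim lines from the file.

```python
from collections import defaultdict


def genera_in_multiple_families(lines):
    """Return {genus: {family: [otu ids]}} for genera annotated in 2+ families.

    lines: iterable of strings such as
        "326714\tk__Bacteria; ...; f__Rhodospirillaceae; g__Rhodospirillum; s__"
    Lineages with an empty genus or family name are ignored.
    """
    families_by_genus = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        otu_id, lineage = line.split("\t", 1)
        ranks = {}
        for field in lineage.split(";"):
            field = field.strip()
            if "__" in field:
                prefix, name = field.split("__", 1)
                ranks[prefix] = name.strip()
        genus, family = ranks.get("g", ""), ranks.get("f", "")
        if genus and family:
            families_by_genus[genus][family].append(otu_id)
    return {
        genus: dict(families)
        for genus, families in families_by_genus.items()
        if len(families) >= 2
    }


# Abbreviated, illustrative lines (not copied from the real file).
example = [
    "326714\tk__Bacteria; f__Rhodospirillaceae; g__Rhodospirillum; s__",
    "119663\tk__Bacteria; f__Alcaligenaceae; g__Rhodospirillum; s__",
    "9303\tk__Bacteria; f__Vibrionaceae; g__Vibrio; s__",
]
for genus, families in genera_in_multiple_families(example).items():
    print(genus, sorted(families))
```

To run it on the real annotations, the lines could be read with open("gg_13_8_otus/97_otu_taxonomy.txt"), adjusting the path to wherever the file is installed.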