Problems with closed- and open-reference OTU assignment
See also
Closed-reference OTU algorithm
closed_ref command
I have performed a detailed assessment of closed- and open-reference OTU assignment, which identified several problems summarized below. A manuscript is in submission. If you are interested, please let me know and I will send you a copy. I have not posted a preprint because I don't know if public preprints are allowed by the journal. Unless otherwise stated, all results were obtained with QIIME v1.9 using default / recommended parameters.
Specific species and strains mentioned below were chosen because they are in the HMP mock community, which has been sequenced as a control in several different studies. This provides test data both with experimental error (the sequenced reads) and without it (high-quality full-length genome sequences are available).
Error-free sequences for one species are often split over several OTUs
A single strain often contains two or more 16S paralogs, i.e. separate copies of the 16S gene (see 16S copy number). Sometimes these paralogs have different sequences, especially when there is variation between different strains of the same species. I found that the known sequences in single strains are often split over two or more closed-reference OTUs. For example, the known V4 sequences in C. beijerinckii strain NCIMB 8052 are split over two OTUs (555688 and 238205). This splitting occurs with correct sequences, so it is not due to sequencing or PCR errors.
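The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the assessment: the function name `split_strains` and the input layout (one strain / OTU pair per known sequence) are assumptions, and "hypothetical strain X" with its OTU id is made up for contrast. Only the C. beijerinckii OTU ids come from the text.

```python
from collections import defaultdict

def split_strains(assignments):
    """Return strains whose known sequences land in more than one OTU.

    assignments: iterable of (strain, otu_id) pairs, one pair per known
    (error-free) sequence of that strain. This layout is an assumption.
    """
    otus_by_strain = defaultdict(set)
    for strain, otu_id in assignments:
        otus_by_strain[strain].add(otu_id)
    # A strain is "split" when its sequences map to two or more OTUs.
    return {s: sorted(o) for s, o in otus_by_strain.items() if len(o) > 1}

example = [
    ("C. beijerinckii NCIMB 8052", "555688"),  # OTU ids from the text
    ("C. beijerinckii NCIMB 8052", "238205"),
    ("hypothetical strain X", "111111"),       # made-up strain and OTU id
    ("hypothetical strain X", "111111"),
]
print(split_strains(example))
```

Only the first strain is reported, because both paralogs of the made-up strain fall in the same OTU.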
Splitting is severe in practice, which can grossly inflate diversity estimates
When errors due to sequencing and PCR are present, a single strain is often split over tens to hundreds of OTUs by the recommended QIIME protocol. On Illumina V4 reads of a mock community with 22 strains (Bokulich et al. 2013), 4,482 OTUs were generated. Even when rarefied to 1,000 reads per sample (using the -a option of core_diversity_analysis.py), which is very shallow by current standards, the number of OTUs was ~100, inflating richness by a factor of ~5 (Figure).
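To make the rarefaction step concrete, here is a rough sketch of subsampling a sample to a fixed depth and counting the OTUs that remain. It is not QIIME's implementation; the function name `rarefied_richness`, the dictionary input and the fixed seed are all assumptions made for illustration. The last line reproduces the arithmetic behind the inflation factor quoted above.

```python
import random

def rarefied_richness(otu_counts, depth, seed=0):
    """Count distinct OTUs after subsampling `depth` reads without replacement.

    otu_counts: dict mapping OTU id -> number of reads in one sample
    (an assumed input layout, not a QIIME file format).
    """
    # Expand the count table into one entry per read.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(reads) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return len(set(rng.sample(reads, depth)))

# Inflation factor from the numbers in the text: ~100 OTUs vs. 22 strains.
print(round(100 / 22, 1))
```

With real data, `otu_counts` would come from one column of the OTU table, and the result would be compared with the known number of strains in the mock community.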
Well-known species often fail (are not assigned to a closed-reference OTU)
Sequences that are present in the full Greengenes database before clustering are often not assigned to a closed-reference OTU. These are fails, in QIIME terminology. With the HMP mock community, the known sequences for one species (S. aureus) fail on the V4 region and four out of 21 species fail on the V3-V5 region (L. monocytogenes, S. mutans, N. meningitidis and P. aeruginosa).
Non-overlapping regions in most species are assigned to different closed-reference OTUs
The QIIME documentation states "You must use closed-reference OTU picking if you are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA" (http://qiime.org/tutorials/otu_picking.html accessed 25 April 2017). Presumably, this is based on the assumption that non-overlapping tags for a given species will usually be assigned to the same OTU by closed-reference. However, the probability that two non-overlapping tags from the same full-length sequence will be assigned to the same OTU is only ~30%. If three or more tags are used, the probability is lower.
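One way such a probability could be estimated is sketched below: for each full-length sequence, record the OTU assigned to each region and count how often all of them agree. The function name `same_otu_fraction`, the nested-dictionary input and the toy OTU labels are assumptions. The sketch also assumes that a fail (no OTU assigned) in any region counts as a disagreement, which may differ from how the figure above was computed.

```python
def same_otu_fraction(assignments, regions):
    """Fraction of sequences whose tags from all `regions` share one OTU.

    assignments: dict mapping sequence id -> {region name: OTU id or None},
    where None (or a missing region) marks a fail. Assumed layout.
    """
    if not assignments:
        raise ValueError("no sequences given")
    agree = 0
    for per_region in assignments.values():
        otus = [per_region.get(region) for region in regions]
        # Count the sequence only if every tag was assigned, all to one OTU.
        if None not in otus and len(set(otus)) == 1:
            agree += 1
    return agree / len(assignments)

# Toy data with made-up sequence ids and OTU labels.
toy = {
    "seq1": {"V2": "A", "V4": "A", "V6": "A"},
    "seq2": {"V2": "A", "V4": "B", "V6": "A"},
    "seq3": {"V2": None, "V4": "C", "V6": "C"},
    "seq4": {"V2": "D", "V4": "D", "V6": "E"},
}
print(same_otu_fraction(toy, ["V2", "V4"]))
print(same_otu_fraction(toy, ["V2", "V4", "V6"]))
```

Under this definition, adding a region can never raise the fraction, because every sequence whose three tags agree also has two tags that agree.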
Invalid taxonomy
In QIIME v1.9, taxonomy annotations for closed-reference OTUs are in gg_13_8_otus/97_otu_taxonomy.txt. In these annotations, 36 genera are placed in two or more families, violating the structure required for a valid taxonomy. To give some examples, genus Rhodospirillum is placed in family Rhodospirillaceae (e.g. in the annotation for OTU 326714), which is correct according to Bergey's Manual, and also in family Alcaligenaceae (OTU 119663). Genus Vibrio is in Vibrionaceae (OTU 9303, correct), and Pseudoalteromonadaceae (OTU 1115975). Genus Flexibacter is placed in three families: Cytophagaceae (OTU 1142767, correct), Flammeovirgaceae (OTU 4447268), and Flavobacteriaceae (OTU 1136639).
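A consistency check of this kind can be sketched as follows. The sketch assumes each line of the taxonomy file holds an OTU id, a tab, and a semicolon-separated lineage with rank prefixes such as `f__` and `g__`; the function name `genera_in_multiple_families` is hypothetical. The toy lines use the OTU ids, families and genera named above but omit the higher ranks, so they are abbreviated illustrations rather than copies of the real annotations. Names are compared exactly as written, so the count on the real file may depend on details such as bracketed names.

```python
from collections import defaultdict

def genera_in_multiple_families(lines):
    """Map each genus to its families when it appears in more than one.

    lines: iterable of 'OTU_ID<TAB>k__...; ...; f__Family; g__Genus; s__'
    strings (assumed format).
    """
    families_by_genus = defaultdict(set)
    for line in lines:
        _otu_id, _, lineage = line.rstrip("\n").partition("\t")
        ranks = {}
        for field in lineage.split(";"):
            prefix, sep, name = field.strip().partition("__")
            if sep:
                ranks[prefix] = name
        family, genus = ranks.get("f"), ranks.get("g")
        # Skip annotations that leave family or genus blank.
        if family and genus:
            families_by_genus[genus].add(family)
    return {g: sorted(f) for g, f in families_by_genus.items() if len(f) > 1}

# Abbreviated toy lines (higher ranks omitted).
toy_lines = [
    "326714\tf__Rhodospirillaceae; g__Rhodospirillum; s__",
    "119663\tf__Alcaligenaceae; g__Rhodospirillum; s__",
    "9303\tf__Vibrionaceae; g__Vibrio; s__",
    "1115975\tf__Pseudoalteromonadaceae; g__Vibrio; s__",
    "1142767\tf__Cytophagaceae; g__Flexibacter; s__",
]
print(genera_in_multiple_families(toy_lines))
```

To run it on the real file, pass an open file handle, e.g. `genera_in_multiple_families(open("gg_13_8_otus/97_otu_taxonomy.txt"))`, adjusting the path to wherever the Greengenes files are installed.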
Reference (please cite)
R.C. Edgar (2017), Accuracy of microbial community diversity estimated by closed- and open-reference OTUs, PeerJ 5:e3889
• QIIME closed- and open-reference clustering generates huge numbers of spurious OTUs
• Closed-reference OTU assignment splits strains and species even when there are no sequence errors
• Closed-reference fails to assign different hyper-variable regions to the same OTU
• Closed-reference discards many well-known species that are present in Greengenes