See also
Taxonomy benchmark home
Greengenes fragment test
Greengenes fragment test results
HMP sample test
HMP sample test results
Making OTUs from different gene regions
QIIME closed-reference OTU picking
The QIIME reference-based OTU picking strategies are described in Rideout et al. 2014. Closed-reference is described as follows:
"In closed-reference OTU picking, input sequences are aligned to pre-defined cluster centroids in a reference database. If the input sequence does not match any reference sequence at a user-defined percent identity threshold [default 97%], that sequence is excluded."
Claimed advantages of closed-reference
Rideout et al. claim that closed-reference OTUs (1) give accurate taxonomy assignments and (2) can be used to compare different regions of the same gene:
"[Closed reference picking] has the convenient feature that, because OTUs are defined by a pre-existing reference, there are typically high-quality taxonomic assignments for each OTU, and a high-quality phylogenetic tree, often based on full-length sequences rather than fragments, exists and describes the relationships among those OTUs. Furthermore, because input sequences are not compared directly to one another, but rather to an external reference, the input sequences need not overlap. This is essential, for example, if performing a meta-analysis including sequences derived from different amplification products of the same marker gene, such as the V2 and V4 regions of the 16S rRNA."
Claim 1: Closed-ref typically gives high-quality OTU taxonomy assignments
The database used by QIIME for closed-reference picking is constructed by
clustering full-length 16S sequences in Greengenes at 97% identity. All
Greengenes sequences have taxonomy assignments, but the large majority of these
sequences are not derived from well-characterized strains. Most taxonomies are
assigned by a combination of computational and manual analysis. In fact, the
Greengenes taxonomy is defined by these assignments and therefore
cannot be wrong! So what is the real biological content of this claim?
Even if we accept that a reference sequence has a "high-quality taxonomic assignment", it does not follow that the reads assigned to the corresponding OTU have the same taxonomy. All we know for sure is that each read has an alignment with at least 97% identity to that sequence. The read could belong to a different species, genus or even family -- it is not unusual for members of different genera or (less often) different families to have identity of 97% across one or two hyper-variable regions.
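The assignment rule under discussion can be illustrated with a minimal Python sketch. This is not QIIME's implementation: the function names `percent_identity` and `closed_reference_pick` are hypothetical, and identity is computed by simple position-by-position comparison of pre-aligned, equal-length strings, whereas a real pipeline uses an alignment-based search. The point is only that a read is placed in an OTU on the strength of an identity threshold alone.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned sequences.

    Assumption for this sketch: both sequences are already aligned and
    have the same non-zero length. Real tools compute an alignment first.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be pre-aligned, equal length, non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def closed_reference_pick(reads, reference, threshold=0.97):
    """Assign each read to its best-matching reference OTU.

    reads:     dict mapping read id -> sequence
    reference: dict mapping OTU id -> centroid sequence
    Returns (assigned, discarded): a dict read id -> OTU id, and a list
    of read ids with no reference hit at or above the threshold.
    """
    assigned = {}
    discarded = []
    for read_id, seq in reads.items():
        best_otu, best_identity = None, 0.0
        for otu_id, ref_seq in reference.items():
            identity = percent_identity(seq, ref_seq)
            # Ties keep the first reference encountered.
            if identity > best_identity:
                best_otu, best_identity = otu_id, identity
        if best_otu is not None and best_identity >= threshold:
            assigned[read_id] = best_otu
        else:
            # Closed-reference picking excludes reads without a hit.
            discarded.append(read_id)
    return assigned, discarded
```

In this sketch a read that is 98% identical to a centroid is placed in that OTU and would inherit whatever taxonomy label the centroid carries; nothing in the rule checks whether the read's true source organism shares that label.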
Benchmark results show that the QIIME "-m uclust" method, which is used by assign_taxonomy.py and pick_closed_reference_otus.py, has an error rate typically in the range 50% to 80%; see the benchmark results (the "QIIME" entries in those tables show results for the "-m uclust" method).
Claim 2: Closed-ref enables meta-analysis of different gene segments
Testing on Greengenes sequence fragments shows that fragments of the same gene are assigned to different OTUs by QIIME closed-ref in a large majority of cases (85%). This matters in practice -- test results on HMP samples show that the similarity (weighted Jaccard index) of closed-ref OTU sets from identical samples but different V-regions is less than 10% in all cases, and in one case is only 0.9%! By contrast, OTUs constructed from the same reads using the UTAX algorithm have 67% to 85% similarity across different V-regions.
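For readers unfamiliar with the similarity measure, here is a minimal Python sketch of a weighted Jaccard index between two OTU abundance tables. It assumes the common definition (sum of per-OTU minima divided by sum of per-OTU maxima); the exact variant and any normalization used in the HMP test are not stated here, and the function name `weighted_jaccard` is hypothetical.

```python
def weighted_jaccard(a: dict, b: dict) -> float:
    """Weighted Jaccard index of two OTU abundance tables.

    a, b: dicts mapping OTU id -> non-negative abundance (e.g. read count).
    Assumed definition: sum(min(a_i, b_i)) / sum(max(a_i, b_i)) over the
    union of OTU ids, with a missing OTU counted as abundance 0.
    """
    otus = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in otus)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in otus)
    # Two empty tables have no OTUs to compare; report 0.0 by convention.
    return shared / total if total else 0.0
```

Under this definition the index is 1.0 when both tables contain the same OTUs at the same abundances and 0.0 when they share no OTUs, so a value below 10% means the two V-regions of one sample produced almost entirely different OTU sets.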