USEARCH manual

QIIME closed-reference OTU picking

The QIIME reference-based OTU picking strategies are described in Rideout et al. 2014. Closed-reference is described as follows:

"In closed-reference OTU picking, input sequences are aligned to pre-defined cluster centroids in a reference database. If the input sequence does not match any reference sequence at a user-defined percent identity threshold [default 97%], that sequence is excluded."
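The procedure in the quotation can be sketched in a few lines of Python. This is only an illustration of the logic and not the QIIME or uclust implementation: the function names (toy_identity, closed_reference_pick) are hypothetical, and identity is computed by naive position-by-position comparison with no gaps, where a real pipeline would use a sequence aligner.

```python
from typing import Dict, List, Tuple


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence.

    Assumption: sequences are compared position by position without gaps.
    A real tool would compute identity from an alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def closed_reference_pick(
    reads: Dict[str, str],
    references: Dict[str, str],
    threshold: float = 0.97,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Assign each read to its best-matching reference sequence.

    Returns (otus, excluded): otus maps a reference id to the read ids
    assigned to it; excluded lists reads with no reference at or above
    the identity threshold.
    """
    otus: Dict[str, List[str]] = {}
    excluded: List[str] = []
    for read_id, seq in reads.items():
        best_ref, best_id = None, 0.0
        for ref_id, ref_seq in references.items():
            ident = toy_identity(seq, ref_seq)
            # Strict comparison: on ties the first reference seen wins.
            if ident > best_id:
                best_ref, best_id = ref_id, ident
        if best_ref is not None and best_id >= threshold:
            otus.setdefault(best_ref, []).append(read_id)
        else:
            excluded.append(read_id)
    return otus, excluded
```

Note that reads are never compared to each other, only to the reference, and that a read below the threshold is simply discarded rather than forming a new OTU.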

Claimed advantages of closed-reference
Rideout et al. claim that closed-reference OTUs (1) give accurate taxonomy assignments and (2) can be used to compare different regions of the same gene:

"[Closed reference picking] has the convenient feature that, because OTUs are defined by a pre-existing reference, there are typically high-quality taxonomic assignments for each OTU, and a high-quality phylogenetic tree, often based on full-length sequences rather than fragments, exists and describes the relationships among those OTUs. Furthermore, because input sequences are not compared directly to one another, but rather to an external reference, the input sequences need not overlap. This is essential, for example, if performing a meta-analysis including sequences derived from different amplification products of the same marker gene, such as the V2 and V4 regions of the 16S rRNA."

Claim 1: Closed-ref typically gives high-quality OTU taxonomy assignments
The database used by QIIME for closed-reference picking is constructed by clustering full-length 16S sequences in Greengenes at 97% identity. All Greengenes sequences have taxonomy assignments, but the large majority of these sequences are not derived from well-characterized strains. Most taxonomies are assigned by a combination of computational and manual analysis. In fact, the Greengenes taxonomy is defined by these assignments and therefore cannot be wrong! So what is the real biological content of this claim?

Even if we accept that a reference sequence has a "high-quality taxonomic assignment", it does not follow that the reads assigned to the corresponding OTU have the same taxonomy. All we know for sure is that the read has an alignment with at least 97% identity to that sequence. The read could belong to a different species, genus or even family -- it is not unusual for members of different genera or (less often) different families to have identity of 97% across one or two hyper-variable regions.

Benchmark results show that the QIIME "-m uclust" method, which is used by assign_taxonomy.py and pick_closed_reference_otus.py, has an error rate which is typically in the range 50% to 80%; see benchmark results here (the "QIIME" entries in these tables show results for the "-m uclust" method).

Claim 2: Closed-ref enables meta-analysis of different gene segments
Testing on Greengenes sequence fragments shows that fragments of the same gene are assigned to different OTUs by QIIME closed-ref in a large majority of cases (85%).

This matters in practice -- test results on HMP samples show that the similarity (weighted Jaccard index) of closed-ref OTU sets from identical samples but different V-regions is less than 10% in all cases, and in one case is only 0.9%! By contrast, OTUs constructed from the same reads using the UTAX algorithm have 67% to 85% similarity across different V-regions.
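For readers unfamiliar with the similarity measure, here is a minimal sketch of a weighted Jaccard index between two OTU tables. It assumes the common definition (sum of per-OTU minimum abundances divided by sum of per-OTU maximum abundances); the exact formula and any normalization used for the results above are not stated here, and the function name and example counts are hypothetical.

```python
from typing import Dict


def weighted_jaccard(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Weighted Jaccard similarity of two OTU abundance tables.

    Assumed definition: sum(min(a_i, b_i)) / sum(max(a_i, b_i)) over all
    OTUs present in either table; an absent OTU counts as abundance 0.
    """
    otu_ids = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in otu_ids)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in otu_ids)
    # Two empty tables have no shared content; report 0 by convention.
    return numerator / denominator if denominator else 0.0


# Made-up counts for the same sample sequenced over two regions.
region_a = {"otu1": 10, "otu2": 5}
region_b = {"otu2": 5, "otu3": 10}
print(weighted_jaccard(region_a, region_b))  # 0.2
```

With this definition, identical tables score 1.0 and tables with no OTUs in common score 0.0, so a value below 10% means the two OTU sets share very little abundance. If the samples have different read depths, converting counts to frequencies before comparing is a reasonable additional step.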