USEARCH manual

16S OTUs

Constructing OTUs by database matching
Ideally, given a set of 16S reads, we would like to identify all known and novel species. In practice, this is very challenging owing to several complications. Typically, many reads do not match a reference database well enough to allow a species assignment.

Identity threshold for species assignment
Traditionally, a 97% match has been considered sufficient for species assignment in 16S sequences, though it should be noted that this is only approximate: sometimes two different species have identical 16S sequences, and conversely a single species may have two copies of the 16S gene that are less than 97% identical (that is, they differ by more than 3%). With shorter reads, the 97% cutoff becomes an even poorer approximation.
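
To make the threshold concrete, here is a minimal Python sketch of a 97% identity check. It is an illustration only, not how USEARCH computes identity: it assumes the two sequences are ungapped and already the same length, and the function names (percent_identity, passes_species_threshold, max_mismatches) are hypothetical.

```python
import math


def percent_identity(a: str, b: str) -> float:
    """Percent of positions that are identical.

    Assumption: both sequences are ungapped and already the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def passes_species_threshold(a: str, b: str, threshold: float = 97.0) -> bool:
    """True if the pair meets the traditional identity cutoff."""
    return percent_identity(a, b) >= threshold


def max_mismatches(length: int, threshold: float = 97.0) -> int:
    """Largest number of differing positions still allowed at the cutoff."""
    return math.floor(length * (100.0 - threshold) / 100.0)
```

Under these assumptions, max_mismatches(1500) allows 45 differences across a full-length gene, while max_mismatches(250) allows only 7 in a short read, so each individual difference in a short read moves the identity by a much larger step.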

Constructing OTUs by de novo clustering
Usually, the best we can do with unmatched reads is to cluster them into groups that are 97% similar. For consistency, database matching is often done after clustering so that some OTUs are assigned to species and others are flagged as novel or unknown. Some of these clusters may contain reads of PCR artifacts such as undetected chimeras, and others may be due to gene duplications in known or novel species.
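
The cluster-then-match workflow described above can be sketched in Python as follows. This is a simplified greedy centroid scheme written for illustration, not the algorithm USEARCH uses. It assumes reads are ungapped, of equal length, and already sorted with the most abundant first. The names greedy_cluster and label_otus, the read identifiers, and the reference entry are all hypothetical.

```python
def percent_identity(a, b):
    """Percent identical positions; assumes ungapped, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_cluster(reads, threshold=97.0):
    """Assign each read to the first centroid it matches, else start a new OTU.

    reads: list of (read_id, sequence), assumed sorted most abundant first.
    """
    otus = []
    for read_id, seq in reads:
        for otu in otus:
            if percent_identity(seq, otu["centroid"]) >= threshold:
                otu["members"].append(read_id)
                break
        else:
            # No existing centroid is close enough: this read founds a new OTU.
            otus.append({"centroid": seq, "members": [read_id]})
    return otus


def label_otus(otus, reference, threshold=97.0):
    """Match each OTU centroid to a reference; flag the rest as unknown."""
    labels = []
    for otu in otus:
        best_name, best_identity = None, 0.0
        for name, ref_seq in reference.items():
            identity = percent_identity(otu["centroid"], ref_seq)
            if identity > best_identity:
                best_name, best_identity = name, identity
        labels.append(best_name if best_identity >= threshold else "unknown")
    return labels


# Toy data: r2 differs from r1 at 2 of 100 positions, r3 at 8 of 100.
base = "ACGT" * 25
reads = [("r1", base), ("r2", "TT" + base[2:]), ("r3", "T" * 10 + base[10:])]
otus = greedy_cluster(reads)
labels = label_otus(otus, {"hypothetical_species_A": base})
```

In this toy run, r1 and r2 fall into one OTU that matches the reference entry, while r3 founds a second OTU that is flagged as unknown. A real pipeline would also need alignment and chimera filtering, which this sketch omits.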

Do not expect a one-to-one correspondence between OTUs and species
Due to the complications discussed above, we cannot expect a 1:1 correspondence between OTUs and species. At best, we can aim for a 1:1 correspondence between OTUs and unique copies of the 16S gene, though this ideal is undermined by experimental error that is hard or impossible to eliminate, including sequencing errors and PCR artifacts.