See also
Lowest common rank probabilities
calc_lcr_probs command
Setting OTU threshold
The canonical clustering threshold is 97% identity. It was proposed in 1994, when few 16S rRNA sequences were available, which motivates a reassessment on current data.
Using a large set of high-quality 16S rRNA sequences from finished genomes, I assessed the correspondence of OTUs to species for five representative clustering algorithms using four accuracy metrics. All algorithms had comparable accuracy when tuned to a given metric.
Optimal identity thresholds were ~99% for full-length sequences and ~100% for the V4 hypervariable region.
This result is confirmed by calculating the lowest common rank probability for species as a function of identity, as shown in the figure below. At 97% identity, the probability that a pair of sequences is from the same species is almost zero, so the pair is more likely to be from different species in the same genus, or even from different genera.
Probability that a pair of sequences is from the same species as a function of identity.
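The idea behind a curve like this can be illustrated with a minimal sketch. It assumes the pairwise identities and species annotations are already available as tuples, bins the pairs by identity, and reports the fraction of pairs in each bin that share a species. The function name, the input layout, and the toy data are hypothetical and chosen only for illustration; this is not the calc_lcr_probs implementation and may differ from how that command estimates the probabilities.

```python
from collections import defaultdict


def same_species_prob_by_identity(pairs, bin_width=1.0):
    """Estimate P(same species) per identity bin.

    pairs: iterable of (identity_percent, species_a, species_b).
    Returns a dict mapping the lower edge of each identity bin to the
    fraction of pairs in that bin whose species names match.
    (Hypothetical helper for illustration only.)
    """
    totals = defaultdict(int)
    same = defaultdict(int)
    for identity, species_a, species_b in pairs:
        # Assign the pair to a bin by its lower edge, e.g. 97.4 -> 97.0.
        bin_start = (identity // bin_width) * bin_width
        totals[bin_start] += 1
        if species_a == species_b:
            same[bin_start] += 1
    return {b: same[b] / totals[b] for b in sorted(totals)}


# Made-up toy pairs, not real measurements.
toy_pairs = [
    (97.2, "Escherichia coli", "Salmonella enterica"),
    (97.8, "Escherichia coli", "Escherichia fergusonii"),
    (99.1, "Escherichia coli", "Escherichia coli"),
    (99.6, "Escherichia coli", "Escherichia fergusonii"),
    (100.0, "Bacillus subtilis", "Bacillus subtilis"),
]

probs = same_species_prob_by_identity(toy_pairs)
for bin_start, prob in probs.items():
    print(f"{bin_start:5.1f}%  P(same species) = {prob:.2f}")
```

With real data, each bin would need many pairs for the estimate to be stable, and a narrower bin_width may be more informative near 100% identity, where the curve changes fastest.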