The lowest common rank (LCR) of two sequences is the lowest rank at which both have the same taxon name. For example, Enterococcus avium and Pilibacter termitis belong to different genera in the Enterococcaceae family, and their LCR is therefore family.
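The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the rank list, the dictionary representation of a lineage, and the function name `lowest_common_rank` are all assumptions made for the example.

```python
# Ranks ordered from highest to lowest (an assumed, conventional set of ranks).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lowest_common_rank(lineage_a, lineage_b):
    """Return the lowest rank at which both lineages have the same taxon name.

    Each lineage is a dict mapping rank -> taxon name (hypothetical format).
    Returns None if the lineages already differ at the highest rank.
    """
    lcr = None
    for rank in RANKS:
        name_a = lineage_a.get(rank)
        name_b = lineage_b.get(rank)
        # Stop at the first rank that is missing or where the names disagree.
        if name_a is None or name_a != name_b:
            break
        lcr = rank
    return lcr


# The example from the text: two genera within the family Enterococcaceae.
e_avium = {
    "domain": "Bacteria", "phylum": "Firmicutes", "class": "Bacilli",
    "order": "Lactobacillales", "family": "Enterococcaceae",
    "genus": "Enterococcus", "species": "Enterococcus avium",
}
p_termitis = {
    "domain": "Bacteria", "phylum": "Firmicutes", "class": "Bacilli",
    "order": "Lactobacillales", "family": "Enterococcaceae",
    "genus": "Pilibacter", "species": "Pilibacter termitis",
}

print(lowest_common_rank(e_avium, p_termitis))  # family
```

The loop walks down from the highest rank and stops at the first disagreement, so a pair whose names happen to coincide at a lower rank but differ at a higher one is not counted as sharing that lower rank.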
The identity of a pair of sequences is an approximate guide to their LCR. For example, if their 16S rRNA identity is 92%, it is a reasonable guess that their LCR is family. With high confidence, the identity is too low for them to belong to the same species, and it is almost certain that the LCR is below phylum. The degree of certainty can be quantified by the probability that the LCR of a pair of sequences is a particular rank (e.g., family) given their pairwise sequence identity (e.g., 92%).
This probability depends on how sequences are selected, which can be specified by a frequency distribution over possible sequences. For a given taxonomy reference database, the simplest frequency distribution is defined by selecting pairs of sequences at random. However, this distribution usually has strong taxonomic biases and is likely to be quite different from the distribution encountered in practice.
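As a rough sketch of the simplest case described above, the probability of each LCR given identity can be estimated by sampling pairs at random from a reference database, binning them by identity, and counting how often each rank occurs in each bin. The function name `lcr_probabilities`, the 1% bin width, and the toy input pairs below are assumptions made for illustration. The numbers are invented and are not results from the source.

```python
from collections import Counter, defaultdict


def lcr_probabilities(pairs, bin_width=1.0):
    """Estimate P(LCR = rank | identity bin) from (identity, lcr) pairs.

    pairs: iterable of (percent_identity, lcr_rank) tuples, e.g. obtained by
           aligning randomly selected pairs of reference sequences.
    Returns {bin_start: {rank: probability}}.
    """
    counts = defaultdict(Counter)
    for identity, lcr in pairs:
        # Assign the pair to the identity bin [bin_start, bin_start + bin_width).
        bin_start = int(identity // bin_width) * bin_width
        counts[bin_start][lcr] += 1

    probs = {}
    for bin_start, counter in counts.items():
        total = sum(counter.values())
        probs[bin_start] = {rank: n / total for rank, n in counter.items()}
    return probs


# Toy data, invented purely to show the shape of the input and output.
toy_pairs = [
    (92.3, "family"), (92.7, "family"), (92.1, "order"), (92.9, "genus"),
    (98.5, "species"),
]
probs = lcr_probabilities(toy_pairs)
print(probs[92.0])
```

Because the estimate is a ratio of counts within each bin, it inherits whatever taxonomic bias the sampled pairs carry. Reweighting or resampling the pairs changes the frequency distribution and therefore the resulting probabilities.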
This approach confirms that the conventional 97% OTU threshold is too low.
Lowest common rank probability as a function of sequence identity
LCR probability for ranks from phylum to species for V4 and full-length 16S sequences.