Definitions of pair-wise sequence identity

Pair-wise identity varies by method
The pair-wise identity between two sequences depends on the alignment and the definition of identity. Alignments vary due to the use of different parameters such as gap penalties and substitution scores. Definitions of identity vary depending on the treatment of gaps.

Some popular definitions of %id
In all of the definitions below, terminal gaps are ignored; identity is calculated from the remaining columns. An "identity" is a column with two identical letters; a "mismatch" is a column with two different letters. An "indel" is a run of consecutive gaps in one sequence, so two or more consecutive gaps count as one indel. GAST is an SSU taxonomy assignment method that counts each indel, rather than each gap column, as one difference. USEARCH supports several definitions (--iddef option); the default is the CD-HIT definition. The MBL definition in USEARCH is the same as GAST's.

CD-HIT definition
Identities / (Length of shorter sequence)

BLAST definition
Identities / Columns

GAST definition
(Columns - Mismatches - Indels) / Columns
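
As a rough illustration, the three definitions can be computed from a single pairwise alignment as in the Python sketch below. This is not code from USEARCH, CD-HIT, BLAST or GAST; the function name percent_identities and the "-" gap character are hypothetical choices made for this example. It also assumes that "length of shorter sequence" means the full ungapped length of the shorter input sequence, and that no column has a gap in both rows.

```python
def percent_identities(row_a, row_b, gap="-"):
    """Return CD-HIT, BLAST and GAST identities (as fractions) for one
    pairwise alignment given as two equal-length gapped strings."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    cols = list(zip(row_a.upper(), row_b.upper()))

    # Terminal gaps are ignored: drop leading/trailing columns with a gap.
    while cols and gap in cols[0]:
        cols.pop(0)
    while cols and gap in cols[-1]:
        cols.pop()
    if not cols:
        raise ValueError("no aligned columns remain")

    identities = mismatches = indels = 0
    gapped_row = None  # row that held the gap in the previous column, if any
    for x, y in cols:
        if x == gap or y == gap:
            current = 0 if x == gap else 1
            if current != gapped_row:  # a new run of gaps starts here
                indels += 1
            gapped_row = current
        else:
            gapped_row = None
            if x == y:
                identities += 1
            else:
                mismatches += 1

    n_cols = len(cols)
    # Assumption: full ungapped length of the shorter input sequence.
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return {
        "cdhit": identities / shorter,
        "blast": identities / n_cols,
        "gast": (n_cols - mismatches - indels) / n_cols,
    }


ids = percent_identities("ACGTACGTAC", "ACG--CGTTC")
```

In this made-up alignment there are 10 columns, 7 identities, 1 mismatch and 1 indel spanning two gap columns, and the shorter sequence has 8 letters. The three definitions therefore give 7/8 = 87.5% (CD-HIT), 7/10 = 70% (BLAST) and 8/10 = 80% (GAST) for the same alignment.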

Problems with the CD-HIT definition
The CD-HIT definition treats the longer and shorter sequences asymmetrically: gaps in the longer sequence reduce %id, but gaps in the shorter sequence do not. Gappier alignments therefore tend to have higher identities under the CD-HIT definition than under other definitions, and the CD-HIT %id correlates less well with evolutionary distance. A measure of %id that counts gaps as differences is more robust to the choice of alignment parameters (gap penalties and substitution matrices). For these reasons, I now prefer the BLAST definition for most purposes and may make it the default in USEARCH v6.
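
The asymmetry can be seen in a small made-up example. The Python sketch below is illustrative only: the helper name cdhit_and_blast is hypothetical, "-" is assumed to be the gap character, and the shorter-sequence length is assumed to be the full ungapped length of the shorter input. It scores the same one-letter indel twice, once with the gap in the shorter sequence and once with the gap in the longer sequence.

```python
def cdhit_and_blast(row_a, row_b, gap="-"):
    """Return (CD-HIT id, BLAST id) as fractions for one pairwise alignment."""
    cols = list(zip(row_a, row_b))
    # Ignore terminal gaps: drop leading/trailing columns containing a gap.
    while cols and gap in cols[0]:
        cols.pop(0)
    while cols and gap in cols[-1]:
        cols.pop()
    identities = sum(1 for x, y in cols if x == y and x != gap)
    # Assumption: full ungapped length of the shorter input sequence.
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return identities / shorter, identities / len(cols)


# Gap in the shorter sequence (8 letters vs. 7 letters).
gap_in_shorter = cdhit_and_blast("AACCGGTT", "AAC-GGTT")

# Gap in the longer sequence (9 letters vs. 8 letters); the two trailing
# columns are terminal gaps and are ignored.
gap_in_longer = cdhit_and_blast("AAC-GGTTAA", "AACTGGTT--")
```

With the gap in the shorter sequence, CD-HIT gives 7/7 = 100% even though the alignment contains an indel, while BLAST gives 7/8 = 87.5%. With the gap in the longer sequence, both definitions give 7/8 = 87.5%.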

Example where CD-HIT id is 97% and USEARCH id is 86%

Example where CD-HIT id is 97% and USEARCH id is 95%