To calculate an identity, an alignment is required. In usearch, the alignment is almost always a pair-wise alignment between a query sequence and a target sequence in a database. In the case of clustering commands, the target sequence is a cluster centroid.
The BLAST definition of identity is used, which is the number of identities divided by the number of alignment columns (see below for discussion). In the case of a global alignment, columns containing terminal gaps are discarded, but internal gaps do count as differences.
Pair-wise identity varies by method
The pair-wise identity between two sequences depends on the alignment and
the definition of identity. Alignments vary due to the use of different
parameters such as gap penalties and substitution scores. Definitions of
identity vary depending on the treatment of gaps.
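As a rough illustration of how alignment parameters change the identity, here is a minimal sketch that aligns the same two sequences twice with different gap penalties. It makes several assumptions that are not part of USEARCH: a textbook Needleman-Wunsch global aligner with a linear gap penalty, arbitrary match/mismatch scores, and the hypothetical function names global_align and blast_identity. The identity is computed as identities divided by columns after dropping terminal gap columns.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch with a linear gap penalty; returns two aligned rows."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves on ties.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def blast_identity(row_a, row_b):
    """Identities / columns, after dropping terminal gap columns."""
    cols = list(zip(row_a, row_b))
    while cols and "-" in cols[0]:
        cols.pop(0)
    while cols and "-" in cols[-1]:
        cols.pop()
    identities = sum(1 for x, y in cols if x == y)
    return identities / len(cols)


# Same sequences, two gap penalties (values chosen only for illustration).
for gap_penalty in (-1, -5):
    row_a, row_b = global_align("AACGT", "ACGTT", gap=gap_penalty)
    print(gap_penalty, row_a, row_b, blast_identity(row_a, row_b))
```

With the harsher penalty the aligner keeps the sequences ungapped and accepts mismatches; with the milder penalty it opens gaps to line up more matching letters, so the reported identity is higher even though the sequences are unchanged.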
Some popular definitions of %id
Terminal gaps are ignored; identity is calculated from the remaining columns. An
"identity" is a column with two identical letters; a "mismatch" is a column with
two different letters. An "indel" is a consecutive series of gaps in one
sequence. In other words, two or more consecutive gaps count as one indel.
GAST is an SSU taxonomy
assignment method that counts each indel, rather than each gap column, as one
difference. USEARCH supports several definitions (--iddef option); in versions
6 and later the default is the BLAST definition. The MBL definition in USEARCH
is the same as GAST's.
BLAST definition
Identities / Columns
GAST definition
(Columns - Mismatches - Indels) / Columns
CD-HIT definition
Identities / (Length of shorter sequence)
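The three formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not USEARCH's code: the alignment is given as two equal-length rows with "-" for gaps, terminal gap columns are trimmed first, "length of shorter sequence" is taken to be the full ungapped length of the shorter input, and the names trim_terminal_gaps and identity_definitions are hypothetical.

```python
GAP = "-"


def trim_terminal_gaps(row_a, row_b):
    """Return alignment columns with leading and trailing gap columns removed."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    cols = list(zip(row_a.upper(), row_b.upper()))
    start, end = 0, len(cols)
    while start < end and GAP in cols[start]:
        start += 1
    while end > start and GAP in cols[end - 1]:
        end -= 1
    return cols[start:end]


def identity_definitions(row_a, row_b):
    """Compute BLAST, GAST and CD-HIT style identities for one alignment."""
    cols = trim_terminal_gaps(row_a, row_b)
    if not cols:
        raise ValueError("no aligned columns remain after trimming")
    identities = mismatches = indels = 0
    previous_gap_row = None  # row holding the gap in the previous column
    for a, b in cols:
        if GAP in (a, b):
            gap_row = 0 if a == GAP else 1
            # A run of consecutive gaps in the same row counts as one indel.
            if gap_row != previous_gap_row:
                indels += 1
            previous_gap_row = gap_row
        else:
            previous_gap_row = None
            if a == b:
                identities += 1
            else:
                mismatches += 1
    n_cols = len(cols)
    # Assumption: use the ungapped length of the shorter input sequence.
    shorter = min(len(row_a.replace(GAP, "")), len(row_b.replace(GAP, "")))
    return {
        "blast": identities / n_cols,
        "gast": (n_cols - mismatches - indels) / n_cols,
        "cdhit": identities / shorter,
    }


result = identity_definitions("TTACGTACGTAC", "--ACGT--GTAA")
print(result)
```

In this example, 10 columns remain after trimming: 7 identities, 1 mismatch and one indel spanning 2 gap columns. The BLAST definition gives 7/10 = 70%, GAST gives (10 - 1 - 1)/10 = 80%, and CD-HIT gives 7/8 = 87.5% because the shorter sequence has 8 letters.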
Problems with the CD-HIT definition
For historical reasons, versions 5 and earlier of usearch used the CD-HIT
definition of identity. In versions 6 and later, the BLAST definition is used. I
made the change because I felt the CD-HIT definition had several important
weaknesses. The CD-HIT definition is not symmetrical between the longer and
shorter sequence. Gaps in the longer sequence reduce %id but gaps in the shorter
sequence do not. Gappier alignments therefore tend to have higher identities
according to CD-HIT compared to other methods, and the CD-HIT %id correlates
less well with evolutionary distance. A measure of %id that counts gaps as
differences is more robust against the choice of alignment parameters (gap
penalties and substitution matrices). For these reasons, I prefer the BLAST
definition for most purposes, which is why it is used in USEARCH v6 and later.
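The asymmetry can be seen with two toy alignments, sketched here in Python. The sequences are made up for illustration, the function names blast_id and cdhit_id are hypothetical, and I assume "length of shorter sequence" means its full ungapped length. Both alignments have 8 identities in 10 columns after terminal gaps are trimmed; the only difference is whether the internal gap sits in the shorter or the longer sequence.

```python
GAP = "-"


def trimmed_columns(row_a, row_b):
    """Alignment columns with terminal gap columns removed."""
    cols = list(zip(row_a, row_b))
    while cols and GAP in cols[0]:
        cols.pop(0)
    while cols and GAP in cols[-1]:
        cols.pop()
    return cols


def count_identities(cols):
    """Number of columns holding two identical letters."""
    return sum(1 for a, b in cols if a == b and a != GAP)


def blast_id(row_a, row_b):
    """Identities / columns."""
    cols = trimmed_columns(row_a, row_b)
    return count_identities(cols) / len(cols)


def cdhit_id(row_a, row_b):
    """Identities / ungapped length of the shorter sequence (assumption)."""
    cols = trimmed_columns(row_a, row_b)
    shorter = min(len(row_a.replace(GAP, "")), len(row_b.replace(GAP, "")))
    return count_identities(cols) / shorter


# Case 1: the internal gap is in the shorter sequence (8 letters vs 10).
gap_in_shorter = ("ACGTACGTAC",
                  "ACGT--GTAC")

# Case 2: the internal gap is in the longer sequence (12 letters vs 10).
gap_in_longer = ("ACGT--ACGTACGT",
                 "ACGTTTACGT----")

for name, (long_row, short_row) in (("gap in shorter", gap_in_shorter),
                                    ("gap in longer", gap_in_longer)):
    print(name, blast_id(long_row, short_row), cdhit_id(long_row, short_row))
```

The BLAST definition reports 80% for both alignments. The CD-HIT definition reports 100% when the gap is in the shorter sequence and 80% when it is in the longer one, so the gapped alignment in the first case looks like a perfect match.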