Pair-wise identity varies by method
The pair-wise identity between two sequences depends on the
alignment and the definition of identity. Alignments vary due to the use
of different parameters such as gap penalties and substitution scores.
Definitions of identity vary depending on the treatment of gaps.
Some popular definitions of %id
Terminal gaps are ignored;
identity is calculated from the remaining columns. An
"identity" is a column with two identical letters; a
"mismatch" is a column with two different letters. An "indel"
is a consecutive series of gaps in one sequence. In other words, two or
more consecutive gaps count as one indel. GAST
is an SSU taxonomy assignment method that counts one indel, rather than
one gap column, as one difference. USEARCH supports several
definitions (--iddef option); the default is the CD-HIT
definition. The MBL definition in USEARCH is the same as GAST's.
CD-HIT definition: Identities / (Length of shorter sequence)
BLAST definition: Identities / Columns
GAST definition: (Columns - Mismatches - Indels) / Columns
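
The three definitions can be compared on a single alignment with a short Python sketch. This is an illustration only, not code from USEARCH, CD-HIT, BLAST or GAST. The function names are hypothetical, and the input is assumed to be two equal-length alignment rows that use "-" for gaps. Two further assumptions: the length of the shorter sequence is taken as its full letter count (terminal overhang included), and the GAST formula is applied literally as written above, with indels counted in both rows.

```python
def trim_terminal_gaps(row_a, row_b):
    """Return alignment columns with terminal gap columns removed."""
    cols = list(zip(row_a, row_b))
    start, end = 0, len(cols)
    while start < end and "-" in cols[start]:
        start += 1
    while end > start and "-" in cols[end - 1]:
        end -= 1
    return cols[start:end]


def count_indels(cols, row):
    """Count runs of consecutive gaps in one row (0 or 1)."""
    runs, in_gap = 0, False
    for col in cols:
        is_gap = col[row] == "-"
        if is_gap and not in_gap:
            runs += 1
        in_gap = is_gap
    return runs


def pairwise_identities(row_a, row_b):
    """Compute %id as a fraction under the three definitions."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    cols = trim_terminal_gaps(row_a, row_b)
    if not cols:
        raise ValueError("no aligned columns remain after trimming")
    n_cols = len(cols)
    ids = sum(1 for x, y in cols if x == y)
    mismatches = sum(1 for x, y in cols if x != y and "-" not in (x, y))
    indels = count_indels(cols, 0) + count_indels(cols, 1)
    # Assumption: shorter length is the full ungapped length of the row.
    shorter = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return {
        "cdhit": ids / shorter,
        "blast": ids / n_cols,
        "gast": (n_cols - mismatches - indels) / n_cols,
    }


result = pairwise_identities("ACGTACGTAC", "ACG--CGTTC")
print(result)  # {'cdhit': 0.875, 'blast': 0.7, 'gast': 0.8}
```

In this alignment there are 10 columns, 7 identities, 1 mismatch and 1 indel spanning two gap columns, and the shorter sequence has 8 letters, so the same alignment scores 87.5%, 70% and 80% under the three definitions.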
Problems with the CD-HIT definition
The CD-HIT definition is not symmetrical between the longer and shorter
sequence. Gaps in the longer sequence reduce %id but gaps in the shorter
sequence do not. Gappier alignments therefore tend to have higher
identities under the CD-HIT definition than under the others, and the CD-HIT
%id correlates less well with evolutionary distance. A measure of
%id that counts gaps as differences is more robust against the choice of
alignment parameters (gap penalties and substitution matrices). For
these reasons, I now prefer the BLAST definition for most purposes and
may make this the default in USEARCH v6.
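
The asymmetry can be seen in a small Python sketch. It is a toy illustration under stated assumptions, not the code of either program: the function name is hypothetical, rows use "-" for gaps, terminal gap columns are dropped, and the shorter sequence length is its full letter count. Both alignments below contain one two-column gap and eight identities, once with the gap in the shorter sequence and once with the gap in the longer one.

```python
def cdhit_and_blast_id(row_a, row_b):
    """Return (CD-HIT id, BLAST id) as fractions for one alignment."""
    cols = list(zip(row_a, row_b))
    # Ignore terminal gaps at both ends of the alignment.
    while cols and "-" in cols[0]:
        cols.pop(0)
    while cols and "-" in cols[-1]:
        cols.pop()
    ids = sum(1 for x, y in cols if x == y)
    # Assumption: shorter length is the full ungapped length of the row.
    shorter = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return ids / shorter, ids / len(cols)


# Gap in the shorter sequence (8 letters vs. 10 letters).
gap_in_shorter = cdhit_and_blast_id("ACGTTTACGT",
                                    "ACGT--ACGT")

# Gap in the longer sequence (12 letters vs. 10 letters).
gap_in_longer = cdhit_and_blast_id("ACGT--ACGTAAAA",
                                   "ACGTTTACGT----")

print(gap_in_shorter)  # (1.0, 0.8)
print(gap_in_longer)   # (0.8, 0.8)
```

The BLAST definition gives 80% in both cases, while the CD-HIT definition gives 100% when the gap falls in the shorter sequence and 80% when it falls in the longer one.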
Example where CD-HIT id is 97% and USEARCH id is 86%
Example where CD-HIT id is 97% and USEARCH id is 95%