USEARCH uses the BLAST
definition of sequence identity. Through version 5, USEARCH used
the CD-HIT definition by
default.
For a given alignment, BLAST identity <= CD-HIT
identity. This is because BLAST counts gaps as differences,
but CD-HIT sometimes does not. Insertions and deletions are generally less
probable than substitutions. Therefore, gaps should count at least
as much as substitutions as a measure of evolutionary distance, and
the BLAST definition is
more biologically realistic.
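The difference can be illustrated with a minimal sketch. It assumes the commonly cited forms of the two definitions, which are not spelled out above: BLAST identity as identical columns divided by all alignment columns (gap columns included), and CD-HIT identity as identical columns divided by the length of the shorter sequence. The function names and the toy alignment are hypothetical and are not part of USEARCH.

```python
def count_identical_columns(row_a: str, row_b: str) -> int:
    """Count alignment columns where both rows hold the same letter (not a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


def blast_identity(row_a: str, row_b: str) -> float:
    """Assumed BLAST-style identity: identical columns / all alignment columns."""
    return count_identical_columns(row_a, row_b) / len(row_a)


def cdhit_identity(row_a: str, row_b: str) -> float:
    """Assumed CD-HIT-style identity: identical columns / shorter sequence length."""
    len_a = len(row_a.replace("-", ""))  # ungapped length of first sequence
    len_b = len(row_b.replace("-", ""))  # ungapped length of second sequence
    return count_identical_columns(row_a, row_b) / min(len_a, len_b)


# Toy global alignment with a two-column gap and no substitutions.
row_a = "ACGTACGTAC"
row_b = "ACGT--GTAC"

print(blast_identity(row_a, row_b))  # gap columns lower the identity
print(cdhit_identity(row_a, row_b))  # gap columns are not penalized here
```

Under these assumed definitions, the number of alignment columns is never smaller than the length of the shorter sequence, so the BLAST value cannot exceed the CD-HIT value for the same alignment.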
Increased number of clusters
One effect of this change has been to increase the number of
clusters (smaller average size) in versions 6 and later compared to version 5 at a given identity threshold,
especially at high identities. This is due to the tendency for %id to
be reduced by the new definition, so fewer sequences match a given
centroid. In some applications, notably OTU picking for SSU rRNA
genes by clustering at 97% id, the number of clusters is sometimes
used as a measure of cluster quality, and the increased number of
clusters might then be interpreted as a reduction in cluster
quality. In fact, I believe that the clusters produced by the new definition
are better because the BLAST definition of identity is a
better estimate of
evolutionary distance.