At a 97% threshold, CD-HIT clusters many pairs whose %ids
are much lower than 97% according to USEARCH and other
programs. When the Costello et al. set was clustered
using CD-HIT at 97% id, 547 / 804 (68%) of pairs had < 97% id
according to USEARCH, but only 1 / 804 (0.1%) had > 97% id. This
means that clustering results from the two programs are not
comparable at a given %id threshold. See here for detailed
results.
There is also a striking
anomaly: a sub-cluster with 97% id according to CD-HIT but 86% id
according to USEARCH. See here for a typical
alignment from this sub-cluster.
There are four main reasons
why identities differ, summarized in the table below.
Issue | Discussion | Examples |
Alignment parameters | CD-HIT has a low mismatch score and low gap penalties, resulting in "gappy" alignments. With some definitions of %id (next), gappy alignments tend to have higher identities. | Gappy example 1. Gappy example 2. |
Definition of %id | The CD-HIT definition of %id is (number of alignment columns containing identical letters) / (length of shorter sequence). This is also the default definition in USEARCH through v5, though USEARCH does support other definitions through the -iddef option. With this definition, gaps in the shorter sequence do not decrease the reported %id, with the result that gappy alignments often give higher %ids than other definitions. | |
Banding errors | Both CD-HIT and USEARCH use banded dynamic programming for faster alignments. However, the strategies used by the two programs are significantly different, and CD-HIT is more prone to banding errors. | Banding error example. |
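
To make the effect of the %id definition concrete, here is a minimal Python sketch that scores one gappy alignment under two definitions. The first is the shorter-sequence definition given in the table. The second, identities divided by alignment columns, is a common alternative chosen here for contrast; it is an assumption and is not claimed to match any particular -iddef value. The function name, dictionary keys, and sequences are all made up for illustration.

```python
def percent_identity(row_a: str, row_b: str) -> dict:
    """Return %id for one pairwise alignment under two definitions.

    row_a and row_b are the two rows of the alignment: equal length,
    with '-' marking a gap. Hypothetical helper, for illustration only.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")

    pairs = list(zip(row_a, row_b))

    # Columns in which both rows carry the same letter (gaps never match).
    identical = sum(1 for a, b in pairs if a == b and a != "-")

    # Ungapped lengths of the two sequences.
    len_a = len(row_a.replace("-", ""))
    len_b = len(row_b.replace("-", ""))

    # Alignment columns, skipping any column that is a gap in both rows.
    columns = sum(1 for a, b in pairs if not (a == "-" and b == "-"))

    return {
        # Definition from the table: identities / length of shorter sequence.
        "shorter_length": 100.0 * identical / min(len_a, len_b),
        # Assumed alternative: identities / number of alignment columns.
        "alignment_columns": 100.0 * identical / columns,
    }


# A made-up gappy alignment: every letter of the shorter sequence matches,
# but four columns are gaps in the shorter sequence.
long_row = "ACGTTTACGTAAACGT"
short_row = "ACGT--ACGT--ACGT"

result = percent_identity(long_row, short_row)
print(result)  # {'shorter_length': 100.0, 'alignment_columns': 75.0}
```

The four gap columns cost nothing under the shorter-sequence definition, so the pair scores 100%. Under the column-based definition the same alignment scores 75%. This is how gappy alignments can pass a 97% threshold under one definition and fall well below it under another.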