Why do CD-HIT and USEARCH report very different %ids for the same sequences?

 
  
At a 97% threshold, CD-HIT clusters many pairs whose %ids are much lower than 97% according to USEARCH and other programs. When the Costello et al. set was clustered using CD-HIT at 97% id, 547 / 804 (68%) of pairs had < 97% id according to USEARCH, but only 1 / 804 (0.1%) had > 97% id. This means that clustering results from the two programs are not comparable at a given %id threshold. See here for detailed results.

There is also a striking anomaly: a sub-cluster with 97% id according to CD-HIT but 86% id according to USEARCH. See here for a typical alignment from this sub-cluster.

There are four main reasons why identities differ, summarized in the table below.
 
Issue: Alignment parameters
Discussion: CD-HIT has a low mismatch score and low gap penalties, resulting in "gappy" alignments. With some definitions of %id (see the next issue), gappy alignments tend to have higher identities.
Examples: Gappy example 1. Gappy example 2.

Issue: Definition of %id
Discussion: The CD-HIT definition of %id is (number of alignment columns containing identical letters) / (length of shorter sequence). This is also the default definition in USEARCH through v5, though USEARCH does support other definitions through the -iddef option. With this definition, gaps in the shorter sequence do not decrease the reported %id, with the result that gappy alignments often give higher %ids than other definitions.

Issue: Banding errors
Discussion: Both CD-HIT and USEARCH use banded dynamic programming for faster alignments. However, the strategies used by the two programs are significantly different, and CD-HIT is more prone to banding errors.
Examples: Banding error example.
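
To make the effect of the %id definition concrete, here is a minimal Python sketch that scores two alignments of the same pair of sequences under the shorter-sequence definition quoted above. The function name percent_identity, the "columns" definition (identical columns divided by total alignment columns, included only for contrast), and the toy sequences are illustrative assumptions; none of this is code from CD-HIT or USEARCH.

```python
def percent_identity(row_a, row_b, definition="shorter"):
    """Percent identity of two aligned rows, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    # Count columns where both rows hold the same letter (a gap never matches).
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    if definition == "shorter":
        # Definition quoted in the text: divide by the shorter sequence length.
        len_a = len(row_a) - row_a.count("-")
        len_b = len(row_b) - row_b.count("-")
        denominator = min(len_a, len_b)
    elif definition == "columns":
        # Assumed alternative for contrast: divide by the number of columns.
        denominator = len(row_a)
    else:
        raise ValueError(f"unknown definition: {definition}")
    return 100.0 * identical / denominator


# Two hypothetical alignments of the same toy pair (ACGTAC vs ACTGAC).
ungapped = ("ACGTAC", "ACTGAC")
gappy = ("AC-GTAC", "ACTG-AC")

for name, (a, b) in (("ungapped", ungapped), ("gappy", gappy)):
    for definition in ("shorter", "columns"):
        pct = percent_identity(a, b, definition)
        print(f"{name:9s} {definition:8s} {pct:.1f}%")
```

In this toy case the ungapped alignment scores 66.7% under both definitions. The gappy alignment scores 83.3% when divided by the shorter sequence length but only 71.4% when divided by the number of alignment columns, so the two inserted gaps raise the reported identity far more under the shorter-sequence definition.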