USEARCH quick start for CD-HIT users

 
Clustering commands
In CD-HIT, different programs are used depending on the sequence type (protein or nucleotide), with specialized variants for sequencing reads. In USEARCH, there is a single program with two clustering commands: cluster_fast and cluster_smallmem. Like most USEARCH commands, both accept protein and nucleotide sequences; the sequence type is detected automatically. Accept options provide a rich set of criteria for sequence matching. For OTU clustering, use cluster_otus.
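As a rough illustration of what a cluster_fast call looks like, the Python sketch below only builds the argument list and prints it; it does not run USEARCH. The file names, the 0.97 threshold, the helper name `cluster_fast_command` and the assumption that the binary is called `usearch` are all placeholders chosen for this example.

```python
import shlex


def cluster_fast_command(fasta_in, identity, centroids_out, uc_out,
                         usearch_bin="usearch"):
    """Build (but do not run) an argument list for a cluster_fast call.

    All paths are hypothetical; usearch_bin is assumed to be on PATH.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1]")
    return [
        usearch_bin,
        "-cluster_fast", fasta_in,      # input sequences (protein or nucleotide)
        "-id", str(identity),           # clustering threshold
        "-centroids", centroids_out,    # FASTA file of cluster representatives
        "-uc", uc_out,                  # tabbed cluster assignments
    ]


if __name__ == "__main__":
    cmd = cluster_fast_command("seqs.fasta", 0.97, "centroids.fasta", "clusters.uc")
    print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` unchanged if the command should actually be executed.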

See also Comparison of USEARCH and CD-HIT.

| CD-HIT program | Seq. type | USEARCH equivalent |
| --- | --- | --- |
| cd-hit | protein | Use cluster_fast or cluster_smallmem. |
| cd-hit-est | nucl. | Use cluster_fast or cluster_smallmem. |
| cd-hit-2d | protein | Database search. The -db2 option of cd-hit-2d is the query sequence and -db1 is the database. The equivalent in USEARCH is usearch_global. The uc output file can be used to identify query sequences that did not match the database (reported as N records). |
| cd-hit-454 | 454 reads | In cd-hit-454, additional clustering criteria include (1) the sequences must start at the same position, i.e. terminal gaps are not allowed at the left end of the alignment, and (2) gaps longer than one are not allowed. (1) can be implemented using the -leftjust or -idprefix accept options (-idprefix is preferred for faster speed). (2) can be implemented by disallowing internal gap extensions (see alignment options). However, I believe (2) makes little difference in practice and is not worth the trouble. |
| cd-hit-est-2d | nucl. | See comments for cd-hit-2d. |
| cd-hit-otu | 16S reads | Script for clustering 16S reads into OTUs. Use cluster_otus. |
| cd-hit-dup | nucl. | Dereplication. Use derep_fulllength or derep_prefix. |
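
For the database-search case, the unmatched queries can be pulled out of the uc file with a few lines of Python. This is a minimal sketch that assumes the tab-separated uc layout in which the first field is the record type and the ninth field is the query label; the function name `unmatched_queries` and the sample records are made up for illustration.

```python
def unmatched_queries(uc_lines):
    """Return query labels from N (no hit) records in uc-format lines.

    Assumes tab-separated fields with the record type in field 1 and
    the query label in field 9.
    """
    labels = []
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) >= 9 and fields[0] == "N":
            labels.append(fields[8])
    return labels


if __name__ == "__main__":
    # Two invented records: one hit (H) and one query with no hit (N).
    sample = [
        "H\t0\t250\t98.4\t+\t0\t0\t250M\tread1\tref7",
        "N\t*\t*\t*\t.\t*\t*\t*\tread2\t*",
    ]
    print(unmatched_queries(sample))
```

To process a real file, pass an open file handle instead of the sample list.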

Clustering threshold
The -c option of CD-HIT is roughly equivalent to the -id option of USEARCH, with two important differences. First, USEARCH and CD-HIT use different definitions of identity: USEARCH counts gaps as differences, but CD-HIT sometimes does not, so CD-HIT assigns systematically higher identities to alignments containing gaps. Second, CD-HIT has lower gap and mismatch penalties than other programs, so it tends to produce "gappier" alignments with more match columns, which also produces systematically higher identities. The net result is that the CD-HIT clustering threshold is not directly comparable with the USEARCH threshold. For example, I would estimate that -c 0.95 is roughly comparable to -id 0.97 in USEARCH, but it should be noted that the differences cannot be compensated by a re-scaling of identity.
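
The effect of the identity definition can be seen on a toy alignment. The sketch below compares two simplified definitions, one that counts gap columns as differences and one that ignores them. Neither function is claimed to be the exact formula used by either program; both function names and the example alignment are assumptions made for illustration.

```python
def _check_rows(row_a, row_b):
    """Validate that two aligned rows are non-empty and of equal length."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")


def identity_gaps_as_differences(row_a, row_b):
    """Matching columns divided by all alignment columns (gaps count against)."""
    _check_rows(row_a, row_b)
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return matches / len(row_a)


def identity_ignoring_gaps(row_a, row_b):
    """Matching columns divided by columns where neither row has a gap."""
    _check_rows(row_a, row_b)
    ungapped = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not ungapped:
        raise ValueError("alignment has no ungapped columns")
    return sum(1 for x, y in ungapped if x == y) / len(ungapped)


if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "ACG--CGTAC"   # one internal gap of length two
    print(identity_gaps_as_differences(a, b))  # 8 matches over 10 columns
    print(identity_ignoring_gaps(a, b))        # 8 matches over 8 ungapped columns
```

On this alignment the same pair of sequences passes a 0.97 threshold under one definition and fails it under the other, which is why the two thresholds cannot be compared directly.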
 
Alignment heuristics and banding
Both CD-HIT and USEARCH use fast heuristics to compute global alignments. However, there are important differences. Both programs use a technique called "banding" to limit the region of the dynamic programming matrix that is filled in, but the strategies are quite different, and the -band option of USEARCH is not equivalent to the -b option of CD-HIT. In CD-HIT, the band spans the entire alignment. USEARCH starts by finding HSPs using an X-drop algorithm similar to BLAST. Banding is used only for regions between HSPs, and the band width is set dynamically according to the length difference of the aligned regions. The -band option of USEARCH sets a minimum radius of the band (width = radius x 2 + 1), while the CD-HIT -b option sets the maximum width of the band. The net result is that CD-HIT alignments are much more prone to banding artifacts.
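
The relation between band radius and band width, and the saving a band gives, can be sketched as follows. This models only the simplest case, a fixed diagonal band across the whole matrix; it does not model HSP finding or the dynamic band width described above. The function names and the numeric values are arbitrary examples.

```python
def band_width(radius):
    """Width of a diagonal band with the given radius: radius * 2 + 1."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return 2 * radius + 1


def cells_in_band(rows, cols, radius):
    """Count dynamic programming cells (i, j) with |i - j| <= radius."""
    return sum(
        1
        for i in range(rows)
        for j in range(cols)
        if abs(i - j) <= radius
    )


if __name__ == "__main__":
    radius = 16                     # arbitrary example value
    rows = cols = 1000              # arbitrary sequence lengths
    print(band_width(radius))
    print(cells_in_band(rows, cols, radius), "of", rows * cols, "cells filled")
```

Any alignment path that strays more than `radius` cells from the diagonal cannot be found inside such a band, which is the source of the banding artifacts mentioned above.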