Clustering commands
In CD-HIT, different programs are used depending on the sequence
type (protein or nucleotide), with specialized variants for
sequencing reads. In USEARCH, there is only one program, which
provides two clustering commands: cluster_fast and cluster_smallmem.
Like most USEARCH commands, they support both protein and nucleotide
sequences; the sequence type is detected automatically. Accept
options provide a rich set of criteria for sequence matching. For
OTU clustering, use cluster_otus.
See also Comparison of USEARCH and CD-HIT.
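As a rough illustration of how the two clustering commands are invoked, here is a minimal Python sketch that assembles a command line. The binary name `usearch`, the file names and the helper `cluster_command` are placeholders chosen for this example. The `-centroids` and `-uc` output options are standard USEARCH options that this page does not itself describe.

```python
def cluster_command(command, fasta_in, identity, centroids_out, uc_out,
                    binary="usearch"):
    """Build an argument list for a USEARCH clustering run.

    `binary` and all file names are placeholders; adjust them to your setup.
    """
    if command not in ("cluster_fast", "cluster_smallmem"):
        raise ValueError("expected cluster_fast or cluster_smallmem")
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1]")
    return [
        binary,
        "-" + command, fasta_in,       # same input for protein or nucleotide
        "-id", str(identity),          # clustering threshold
        "-centroids", centroids_out,   # FASTA file of cluster centroids
        "-uc", uc_out,                 # cluster assignments in uc format
    ]


# The list can be passed to subprocess.run(...) where usearch is installed.
print(" ".join(cluster_command("cluster_fast", "seqs.fasta", 0.97,
                               "centroids.fasta", "clusters.uc")))
```

The same call works for either command because only the command name changes.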
CD-HIT program | Seq. type | USEARCH equivalent
cd-hit | protein | Use cluster_fast or cluster_smallmem.
cd-hit-est | nucl. | Use cluster_fast or cluster_smallmem.
cd-hit-2d | protein | Database search. The -db2 option of cd-hit-2d is the query sequence and -db1 is the database. The equivalent in USEARCH is usearch_global. The uc output file can be used to identify query sequences that did not match the database (reported as N records).
cd-hit-454 | 454 reads | In cd-hit-454, additional clustering criteria include (1) the sequences must start at the same position, i.e. terminal gaps are not allowed at the left end of the alignment, and (2) gaps longer than one are not allowed. (1) can be implemented using the -leftjust or -idprefix accept options (-idprefix is preferred for faster speed). (2) can be implemented by disallowing internal gap extensions (see alignment options). However, I believe (2) makes little difference in practice and is not worth the trouble.
cd-hit-est-2d | nucl. | See comments for cd-hit-2d.
cd-hit-otu | 16S reads | Script for clustering 16S reads into OTUs. Use cluster_otus.
cd-hit-dup | nucl. | Dereplication. Use derep_fulllength or derep_prefix.
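The cd-hit-2d row mentions using the uc file to find queries that did not match the database. A minimal Python sketch of that step follows. It assumes the usual uc layout of tab-separated fields, with the record type in the first field and the query label in the ninth. The function name `unmatched_queries` and the example records are made up for illustration.

```python
def unmatched_queries(uc_lines):
    """Return the labels of queries reported as N (no hit) records."""
    labels = []
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue                      # not a complete uc record
        if fields[0] == "N":              # record type: N = no match found
            labels.append(fields[8])      # ninth field: query label
    return labels


# Hypothetical records, standing in for the lines of a real uc file.
example = [
    "H\t0\t250\t98.4\t+\t0\t0\t250M\tread1\tref7",
    "N\t*\t*\t*\t.\t*\t*\t*\tread2\t*",
    "H\t3\t248\t97.2\t+\t0\t0\t248M\tread3\tref2",
    "N\t*\t*\t*\t.\t*\t*\t*\tread4\t*",
]
print(unmatched_queries(example))  # ['read2', 'read4']
```

With a real file, pass the open file object instead of the list, since the function only iterates over lines.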
Clustering threshold
The -c option of CD-HIT is roughly equivalent to the -id option of
USEARCH, but there are two important differences. First, USEARCH and
CD-HIT use different definitions of identity: USEARCH counts gaps as
differences, but CD-HIT sometimes does not, so CD-HIT assigns
systematically higher identities to alignments containing gaps.
Second, CD-HIT has lower gap and mismatch penalties than other
programs, so it tends to produce "gappier" alignments with more match
columns, which also gives systematically higher identities. The net
result is that the CD-HIT clustering threshold is not directly
comparable with the USEARCH threshold. For example, I would estimate
that -c 0.95 is roughly comparable to -id 0.97 in USEARCH, but no
simple re-scaling of identity compensates for the differences.
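To make the effect of the identity definition concrete, the following Python sketch scores one gapped alignment in two ways. The first function counts gap columns as differences, as described above for USEARCH. The second ignores gap columns entirely; it is a hypothetical stand-in for a more permissive definition, not CD-HIT's actual formula, which is not given here.

```python
def identity_counting_gaps(row_a, row_b):
    """Matches divided by all alignment columns (gaps are differences)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return matches / len(row_a)


def identity_ignoring_gaps(row_a, row_b):
    """Matches divided by ungapped columns only (illustrative assumption)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("alignment has no ungapped columns")
    matches = sum(1 for x, y in pairs if x == y)
    return matches / len(pairs)


# Ten columns: eight matches and one gap of length two.
a = "ACGTACGTAC"
b = "ACGT--GTAC"
print(identity_counting_gaps(a, b))  # 0.8
print(identity_ignoring_gaps(a, b))  # 1.0
```

The same alignment passes a 0.95 threshold under one definition and fails it under the other, which is why the two thresholds cannot be compared directly.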
Alignment heuristics and banding
Both CD-HIT and USEARCH use fast heuristics to compute global
alignments, but there are important differences. Both programs use a
technique called "banding" to limit the region of the dynamic
programming matrix that is filled in, but the strategies are quite
different, and the -band option of USEARCH is not equivalent to the
-b option of CD-HIT. In CD-HIT, the band spans the entire alignment.
USEARCH starts by finding high-scoring segment pairs (HSPs) using an
X-drop algorithm similar to BLAST. Banding is used only for regions
between HSPs, and the band width is set dynamically according to the
length difference of the aligned regions. The -band option of USEARCH
sets a minimum radius of the band (width = 2 x radius + 1), while the
CD-HIT -b option sets the maximum width of the band. The net result
is that CD-HIT alignments are much more prone to banding artifacts.
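A banding artifact can be sketched with a toy example in Python. The code below computes a global edit distance with unit costs, filling in only cells within a fixed radius of the main diagonal. This is a deliberately simplified model and is not the algorithm of either program; the function names and test sequences are invented for illustration.

```python
INF = float("inf")


def band_width(radius):
    """Band width for a given radius: width = 2 x radius + 1."""
    return 2 * radius + 1


def banded_edit_distance(a, b, radius):
    """Global edit distance using only cells with |i - j| <= radius.

    Returns None when the final cell lies outside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > radius:
        return None
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - radius), min(m, i + radius) + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i > 0:
                best = min(best, d[i - 1][j] + 1)          # gap in b
            if j > 0:
                best = min(best, d[i][j - 1] + 1)          # gap in a
            if i > 0 and j > 0:
                best = min(best, d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            d[i][j] = best
    return d[n][m]


core = "ACGGTCATTGCA"
a = "TT" + core   # two extra letters at the left end
b = core + "TT"   # two extra letters at the right end
print(band_width(2))                   # 5
print(banded_edit_distance(a, b, 0))   # 14: forced onto the main diagonal
print(banded_edit_distance(a, b, 2))   # 4: band is wide enough for two gaps
```

With radius 0, the best alignment lies outside the band, so the reported distance is far worse than the true value of 4. That is the kind of artifact a band that is too narrow produces.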