Results
The table below is an updated version of Table 2 from the USEARCH
paper (Edgar 2010), which compared CD-HIT v3 with the UCLUST
algorithm as implemented in USEARCH v1.1.570. It shows that CD-HIT v4
produces substantially different results from CD-HIT v3: it runs faster
and reports fewer clusters at every identity threshold tested. The input
is a set of 1.1M 16S rRNA reads from Costello et al. (2009).
The tests were run on a quad-core CPU, so performance improvements can
be expected with up to four threads. CD-HIT v3 and USEARCH v5 do not
support multi-threading for clustering, so their results are for a
single thread.
These results are misleading
Please note that I now consider this methodology to be highly
misleading, for reasons discussed here.
I am providing these results only for comparison with the paper to
indicate the improvements in speed in CD-HIT v4.
Greedy clustering is not recommended for 16S OTUs
I do not recommend using the UCLUST algorithm or CD-HIT for
generating OTUs, especially if a decreasing length sort is used with
USEARCH (CD-HIT-EST always does a length sort). There are several
problems with this approach; e.g., the longest sequence in a cluster
tends to be an outlier relative to an abundant biological sequence, so
it is not appropriate as a representative sequence, and the approach
tends to greatly overestimate the number of OTUs. I recommend using
otupipe for OTU clustering.
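To make the length-sort problem concrete, here is a minimal sketch of greedy centroid clustering with an optional decreasing-length sort. It is not the UCLUST or CD-HIT implementation: `identity` is a hypothetical stand-in (ungapped, position-by-position matching over the shorter sequence), and the function names and toy reads are assumptions made up for illustration.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence.

    Hypothetical stand-in for a real alignment-based identity.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(seqs, min_id, sort_by_length=True):
    """Assign each sequence to the first centroid with identity >= min_id;
    a sequence that matches no centroid becomes a new centroid."""
    order = sorted(seqs, key=len, reverse=True) if sort_by_length else list(seqs)
    clusters = {}  # centroid -> members, in order of creation
    for s in order:
        for c in clusters:
            if identity(s, c) >= min_id:
                clusters[c].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


abundant = "ACGTACGTAC"   # the abundant "biological" sequence (toy data)
variant = "TCGTACGTAC"    # one mismatch relative to `abundant`
outlier = "ACGTACGTTCG"   # longer read with a different mismatch
reads = [abundant, abundant, abundant, variant, outlier]

by_length = greedy_cluster(reads, 0.9, sort_by_length=True)
by_input = greedy_cluster(reads, 0.9, sort_by_length=False)

print(len(by_length), list(by_length))  # 2 clusters; the outlier is a centroid
print(len(by_input), list(by_input))    # 1 cluster centred on the abundant read
```

In this toy case the length sort makes the long outlier the first centroid. The variant is within 90% of the abundant sequence but only 80% of the outlier, so it is split off into a second cluster.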
| Min %id |          | USEARCH v5     | CD-HIT v3             | CD-HIT v4 defaults | CD-HIT v4 2 threads | CD-HIT v4 4 threads |
|---------|----------|----------------|-----------------------|--------------------|---------------------|---------------------|
| 70%     | Time     | 115 s (1m 55s) | -                     | -                  | -                   | -                   |
|         | Clusters | 258            | -                     | -                  | -                   | -                   |
| 75%     | Time     | 116 s (1m 54s) | -                     | -                  | -                   | -                   |
|         | Clusters | 543            | -                     | -                  | -                   | -                   |
| 80%     | Time     | 102 s (1m 42s) | 37,801 s (10h 30m 1s) | 1,078 s (17m 58s)  | 585 s (9m 45s)      | 305 s (5m 5s)       |
|         | Clusters | 1,143          | 1,987                 | 679                | 679                 | 679                 |
| 90%     | Time     | 88 s (1m 28s)  | 3,231 s (53m 51s)     | 1,152 s (19m 11s)  | 585 s (9m 45s)      | 345 s (5m 45s)      |
|         | Clusters | 5,398          | 6,366                 | 4,325              | 4,325               | 4,325               |
| 95%     | Time     | 121 s (2m 1s)  | 1,729 s (28m 49s)     | 1,102 s (18m 22s)  | 670 s (11m 10s)     | 239 s (3m 59s)      |
|         | Clusters | 16,289         | 16,304                | 13,257             | 13,257              | 13,257              |
| 97%     | Time     | 167 s (2m 47s) | 2,151 s (35m 51s)     | 1,794 s (29m 54s)  | 649 s (10m 49s)     | 346 s (5m 46s)      |
|         | Clusters | 29,586         | 28,446                | 24,899             | 24,899              | 24,899              |

- : CD-HIT cannot cluster < 80%.
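As a quick check on the multi-threading expectation, the CD-HIT v4 timings in the table above can be turned into speedup and parallel-efficiency figures. The sketch below assumes that the "defaults" column corresponds to a single thread, which the table does not state explicitly; the variable and function names are illustrative.

```python
# Wall-clock seconds copied from the table: (defaults, 2 threads, 4 threads).
# Assumption: the "defaults" run used one thread.
cdhit_v4_seconds = {
    80: (1078, 585, 305),
    90: (1152, 585, 345),
    95: (1102, 670, 239),
    97: (1794, 649, 346),
}


def speedup(baseline_s: float, threaded_s: float) -> float:
    """Ratio of the baseline time to the multi-threaded time."""
    return baseline_s / threaded_s


for min_id, (t1, t2, t4) in cdhit_v4_seconds.items():
    s2, s4 = speedup(t1, t2), speedup(t1, t4)
    # Efficiency = speedup divided by the number of threads.
    print(f"{min_id}%: 2 threads {s2:.2f}x (efficiency {s2 / 2:.0%}), "
          f"4 threads {s4:.2f}x (efficiency {s4 / 4:.0%})")
```

A ratio larger than the thread count (for example 1,102 / 239, about 4.6, for four threads at 95%) would suggest either that the single-thread assumption does not hold or that run-to-run variation is large, so these figures are best read as rough indications.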
Software and hardware versions
CD-HIT-EST v3.1.2 and v4.5.7.
USEARCH v5.2.13 (32-bit Linux i86).
CPU: Quad-core Xeon X5450 3.0GHz.
Command-lines
cd-hit-est -i costello.fasta -o costello -c 0.97 -M 0 [-T 4]
usearch -cluster costello.fasta -uc costello.uc -id 0.97
References
Edgar, R.C. (2010), Search and clustering orders of magnitude faster
than BLAST, Bioinformatics 26(19), 2460-2461.
doi: 10.1093/bioinformatics/btq461
Costello, E.K. et al. (2009), Bacterial community variation
in human body habitats across space and time, Science 326,
1694-1697.