USEARCH
performance home page
Results
See here for COSTELLO results.
Comments on comparing CD-HIT and the UCLUST algorithm in USEARCH.
Methods
The COSTELLO benchmark was described in Edgar (2010).
The following description is taken
from that paper. Here, UCLUST refers to the clustering algorithm in
USEARCH.
To illustrate typical improvements achieved by UCLUST compared to the
widely-used program CD-HIT (Li and Godzik, 2006), the most efficient
previous clustering method, a set of 1.1 x 10^6
pyrosequencing reads of length ~300 nucleotides was taken from a
recent microbial ecology study (Costello, et al., 2009) and clustered
at representative identities. UCLUST produced higher-quality
clusters, enabled clustering at lower identities, used substantially
less memory for large sets and was often one or more orders of
magnitude faster. CD-HIT generated clusters with larger, and thus
apparently better, average size at 99% identity, but this proved to be
an artifact of a bug in CD-HIT, as almost half (47%) of the reported
identities were 98% and therefore fell below the threshold by CD-HIT's
own measure. Many of these assignments were verified as being below
99% by creating independent alignments using MUSCLE (Edgar, 2004). ... To demonstrate the scalability of UCLUST, 100
copies of the Costello et al. set were concatenated, giving an input
file with 1.1 x 10^8 [more than one hundred million] sequences. Clustering a set of this
size with previous methods would require large-scale computational
resources, while UCLUST was able to generate high-quality clusters at
identities between 85% and 95% in three to five hours on a commodity
laptop computer in <100 Mb memory.
The average cluster size is used as a measure of sensitivity. For a
fixed input set, this is equivalent to measuring the number of clusters
(average size = number of sequences / number of clusters), which is
usually not a good measure of cluster quality. However, in the particular case of
comparing USEARCH and CD-HIT it is appropriate since the programs
use closely related algorithms (length sort followed by greedy list
removal) and essentially the same measure of %id (similar alignment
parameters and definition of sequence identity).
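The "length sort followed by greedy list removal" scheme and the average cluster size measure can be illustrated with a minimal Python sketch. This is not the USEARCH or CD-HIT implementation: the function names are hypothetical, and the identity function is a toy stand-in (fraction of matching positions, no alignment) that is assumed here only to keep the example self-contained.

```python
from typing import List


def toy_identity(a: str, b: str) -> float:
    """Toy identity (an assumption for illustration): fraction of matching
    positions over the shorter sequence, with no alignment or gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: List[str], threshold: float) -> List[List[str]]:
    """Length sort followed by greedy list removal.

    The longest remaining sequence becomes a cluster representative; every
    remaining sequence at or above the identity threshold is removed from
    the list and assigned to that cluster. Repeat until the list is empty.
    """
    remaining = sorted(seqs, key=len, reverse=True)  # length sort
    clusters: List[List[str]] = []
    while remaining:
        rep = remaining[0]
        members, rest = [rep], []
        for s in remaining[1:]:
            if toy_identity(rep, s) >= threshold:
                members.append(s)  # removed from the list
            else:
                rest.append(s)     # stays for a later round
        clusters.append(members)
        remaining = rest
    return clusters


def average_cluster_size(clusters: List[List[str]]) -> float:
    """Total number of sequences divided by the number of clusters."""
    return sum(len(c) for c in clusters) / len(clusters)
```

Because every input sequence ends up in exactly one cluster, the average cluster size for a fixed input is determined entirely by the number of clusters, which is why the two measures are interchangeable here.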
References
Costello, E.K., Lauber, C.L., Hamady, M., Fierer, N., Gordon, J.I. and Knight, R. (2009) Bacterial community variation in human body habitats across space and time, Science, 326, 1694-1697.
Edgar, R.C. (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics, 26(19), 2460-2461, doi: 10.1093/bioinformatics/btq461.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res, 32, 1792-1797.
Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 22, 1658-1659.