Nucleotide clustering benchmark (COSTELLO)

See here for COSTELLO results.
Comments on comparing CD-HIT and the UCLUST algorithm in USEARCH.


The COSTELLO benchmark was described in (Edgar, 2010). The following description is taken from that paper. Here, UCLUST refers to the clustering algorithm in USEARCH.

To illustrate typical improvements achieved by UCLUST compared to the widely used program CD-HIT (Li and Godzik, 2006), the most efficient previous clustering method, a set of 1.1 × 10^6 pyrosequencing reads of length ~300 nucleotides was taken from a recent microbial ecology study (Costello, et al., 2009) and clustered at representative identities. UCLUST produced higher-quality clusters, enabled clustering at lower identities, used substantially less memory for large sets and was often one or more orders of magnitude faster. CD-HIT generated clusters with larger, and thus apparently better, average size at 99% identity, but this proved to be an artifact of a bug in CD-HIT, as almost half (47%) of the reported identities were 98% and therefore fell below the threshold by CD-HIT's own measure. Many of these assignments were verified as being below 99% by creating independent alignments using MUSCLE (Edgar, 2004).... To demonstrate the scalability of UCLUST, 100 copies of the Costello et al. set were concatenated, giving an input file with 1.1 × 10^8 [more than one hundred million] sequences. Clustering a set of this size with previous methods would require large-scale computational resources, while UCLUST was able to generate high-quality clusters at identities between 85% and 95% in three to five hours on a commodity laptop computer in <100 Mb memory.
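
The threshold check described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the identity definition (matching columns divided by alignment columns) is one common convention and not necessarily the one used by CD-HIT, USEARCH or the paper, and the sample values are invented, not benchmark data.

```python
def alignment_identity(row_a: str, row_b: str) -> float:
    """Identity of two aligned rows ('-' marks a gap).

    Assumed definition: matching columns / alignment columns,
    ignoring columns where both rows have a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    columns = [(x, y) for x, y in zip(row_a, row_b)
               if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return matches / len(columns)


def fraction_below(identities, threshold: float) -> float:
    """Fraction of member-to-representative identities below the threshold."""
    identities = list(identities)
    if not identities:
        return 0.0
    return sum(1 for v in identities if v < threshold) / len(identities)


# Invented example values, for illustration only.
reported = [0.99, 0.98, 1.0, 0.98]
share_below = fraction_below(reported, 0.99)
```

A non-zero result at the clustering threshold would indicate assignments that should not have been made at that identity, which is the kind of discrepancy the paper reports for CD-HIT at 99%.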

The average cluster size is used as a measure of sensitivity. Because the number of input sequences is fixed, this is equivalent to measuring the number of clusters (a larger average size means fewer clusters), which is usually not a good measure of cluster quality. However, in the particular case of comparing USEARCH and CD-HIT it is appropriate, since the two programs use closely related algorithms (length sort followed by greedy list removal) and essentially the same measure of %id (similar alignment parameters and definition of sequence identity).
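
The general scheme named above (length sort followed by greedy list removal) and the average cluster size measure can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the implementation of either program: `toy_identity` is a hypothetical ungapped stand-in for the alignment-based %id that both programs compute, and all function names are invented for this example.

```python
def toy_identity(a: str, b: str) -> float:
    """Toy stand-in for %id: matching positions over the shorter length,
    with both sequences left-aligned and no gaps (an assumption)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(seqs, threshold: float):
    """Length sort, then greedy list removal."""
    # Longest sequences first; Python's sort is stable for ties.
    remaining = sorted(seqs, key=len, reverse=True)
    clusters = []
    while remaining:
        # The longest remaining sequence becomes the representative.
        rep, rest = remaining[0], remaining[1:]
        members = [rep]
        leftover = []
        for s in rest:
            if toy_identity(rep, s) >= threshold:
                members.append(s)   # removed from the list
            else:
                leftover.append(s)  # stays for a later round
        clusters.append(members)
        remaining = leftover
    return clusters


def average_cluster_size(clusters) -> float:
    """Number of input sequences divided by the number of clusters."""
    return sum(len(c) for c in clusters) / len(clusters)


reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGT", "TTTTGGGGCC", "TTTTGGGGC"]
clusters_90 = greedy_cluster(reads, 0.9)
clusters_100 = greedy_cluster(reads, 1.0)
```

With these five toy reads, lowering the threshold from 1.0 to 0.9 merges clusters, so the average cluster size rises while the total number of sequences stays the same.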

Costello, E.K., Lauber, C.L., Hamady, M., Fierer, N., Gordon, J.I. and Knight, R. (2009) Bacterial community variation in human body habitats across space and time, Science, 326, 1694-1697.
Edgar, R.C. (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics, 26(19), 2460-2461, doi: 10.1093/bioinformatics/btq461.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res, 32, 1792-1797.
Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 22, 1658-1659.