Comparing CD-HIT and USEARCH

<< CD-HIT analysis

CD-HIT and USEARCH clustering
In the USEARCH paper (Edgar, 2010) I reported a comparison of CD-HIT to USEARCH clustering (the UCLUST algorithm) by using a set of 16S rRNA reads from Costello et al. When I wrote the paper, I believed these two methods could be compared directly based on simple quality metrics (number of clusters and average identity) because both program use greedy list removal algorithms with the same definition of %id.

Problems comparing CD-HIT to USEARCH
I now realize that there are several complications, and a direct comparison is not possible using the number of clusters or average cluster identity. Despite having the same definition of %id, CD-HIT and USEARCH often disagree on the %id of a given pair of sequences. CD-HIT reports systematically higher identities, so that using CD-HIT at 97% is very roughly comparable to USEARCH at 95%. However, this problem cannot be solved by adjusting the clustering threshold. The distribution of differences is nothing like a normal distribution, so the mean difference is misleading. Sometimes CD-HIT reports lower rather than higher identities due to gross misalignments caused by banding errors. So if we assume a given USEARCH cluster is correct by some standard, then we expect CD-HIT to report many false positives at the same numerical value of the threshold due to over-estimating the identity, and perhaps also some false negatives due to banding errors. These two effects may tend to cancel each other with the result that the number of clusters and average identity are misleadingly similar, while the cluster quality is in fact very different. It is therefore important to consider not only the size of the cluster, but also the relatedness of the sequences within each cluster, which must be measured using a method that is independent of an alignment.

Alignment-free cluster quality metrics
Measures derived from alignments, e.g. %id, cannot be used to determine relatedness because this would bias results towards (against) methods with similar (different) alignment parameters such as gap penalties and substitution matrices, and would also bias towards (against) methods with similar (different) definitions of %id.. In the case of 16S, this could be done using a method similar to the RDP Naive Bayesian Classifier, which uses word-counting rather than alignments. Cluster quality could be assessed e.g. by measuring whether the taxonomic assignments are monophyletic. Unfortunately, using the RDP Classifier, er, naively, also has problems because it overpredicts with short reads. By this I mean that it often assigns a high P-value to a genus when several genera have the same sequence. So the P-value is a confidence indication that the predicted genus has the given sequence, while what we really want to know is whether the read has the predicted genus, which may have a much lower probability given the data.

Edgar,R.C. (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461.
doi: 10.1093/bioinformatics/btq461
Costello, E.K. et al. (2009), Bacterial community variation in human body habitats across space and time, Science 326, 1694-97.

Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. (2007), Naive Bayesian Classifier for Rapid Assignment of rRNA
   Sequences into the New Bacterial Taxonomy, Appl Environ Microbiol. 73(16):5261-7.