<< CD-HIT analysis
In the USEARCH
paper (Edgar, 2010) I reported a comparison of CD-HIT to USEARCH
clustering (the UCLUST algorithm) by using a set of 16S rRNA reads from
Costello et al. When I wrote the paper, I believed these two
methods could be compared directly based on simple quality metrics
(number of clusters and average identity) because both program use greedy list
removal algorithms with the same definition of %id.
comparing CD-HIT to USEARCH
I now realize that there are
several complications, and a direct comparison is not possible using the
number of clusters or average cluster identity. Despite
having the same definition of %id, CD-HIT
and USEARCH often disagree on the %id of a given pair of sequences.
CD-HIT reports systematically higher
identities, so that using CD-HIT at 97% is very roughly comparable
to USEARCH at 95%. However, this problem cannot be solved by adjusting the clustering threshold.
The distribution of differences is nothing like a normal distribution,
so the mean difference is misleading. Sometimes CD-HIT reports lower rather than
higher identities due to gross misalignments
caused by banding errors. So if we assume a given USEARCH cluster is
correct by some standard, then we expect CD-HIT to report many false
positives at the same numerical value of the threshold due to over-estimating the identity, and perhaps also
false negatives due to banding errors. These two effects may tend to cancel each
other with the result that the number of clusters and average identity
are misleadingly similar, while the cluster quality is in fact very different. It is therefore important to consider not only the size
of the cluster, but also the relatedness of the sequences within each
cluster, which must be measured using a method that is independent of an alignment.
cluster quality metrics
Measures derived from alignments, e.g. %id, cannot be used to determine
relatedness because this would bias results towards (against) methods
with similar (different) alignment parameters such as gap penalties and
substitution matrices, and would also bias towards (against) methods
with similar (different) definitions of %id.. In the case of 16S, this could be done using a
method similar to the
RDP Naive Bayesian Classifier, which uses word-counting rather than
alignments. Cluster quality could be assessed e.g. by measuring whether the
taxonomic assignments are monophyletic. Unfortunately, using the RDP
Classifier, er, naively, also has problems because it overpredicts with
short reads. By this I mean that it often assigns a high P-value to a
genus when several genera have the same sequence. So the P-value is a
confidence indication that the predicted genus has the given sequence,
while what we really want to know is whether the read has the predicted
genus, which may have a much lower probability given the data.
Edgar,R.C. (2010), Search and clustering orders of magnitude faster
than BLAST, Bioinformatics 26(19), 2460-2461.
Costello, E.K. et al. (2009), Bacterial community variation
in human body habitats across space and time, Science 326,
Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. (2007), Naive Bayesian
Classifier for Rapid Assignment of rRNA
Sequences into the New Bacterial Taxonomy, Appl
Environ Microbiol. 73(16):5261-7.