Pair-wise sequence identity varies significantly by method

The pair-wise identity between two sequences depends on both the alignment and the definition of identity. Alignments differ because programs use different parameters, such as gap penalties and substitution scores. Definitions of identity differ mainly in how gaps are treated.
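To make the role of the identity definition concrete, here is a minimal sketch that scores one fixed alignment under three denominators. The function name, the dictionary keys and the toy alignment are illustrative assumptions and are not taken from CD-HIT or USEARCH; the rows are assumed to have terminal gaps already removed.

```python
def percent_identity(row_a: str, row_b: str) -> dict:
    """Return %id for two aligned rows ('-' marks a gap) under three definitions."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(row_a.upper(), row_b.upper()))
    # Columns where both rows carry the same letter.
    matches = sum(1 for x, y in columns if x == y and x != "-")
    # Columns where neither row has a gap.
    ungapped = sum(1 for x, y in columns if x != "-" and y != "-")
    # Number of letters in the shorter of the two sequences.
    shorter = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return {
        # Every gap column counts as a difference.
        "all_columns": 100.0 * matches / len(columns),
        # Gap columns are ignored entirely.
        "ungapped_columns": 100.0 * matches / ungapped,
        # Gaps placed in the shorter sequence do not enlarge the denominator.
        "shorter_sequence": 100.0 * matches / shorter,
    }


# Toy alignment (hypothetical, not the 16S reads discussed on this page).
result = percent_identity("ACGTACG-TAC", "ACG--CGATTC")
for name, value in result.items():
    print(f"{name}: {value:.1f}%")
# all_columns: 63.6%
# ungapped_columns: 87.5%
# shorter_sequence: 77.8%
```

The same seven matching columns give three different %ids, so an identity value is only meaningful together with the definition that produced it.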

CD-HIT tends to produce gappy alignments because it uses low gap penalties and mismatch scores, and it combines them with a definition of %id that does not account for gaps in the shorter sequence. As a result, CD-HIT reports systematically higher %ids than USEARCH, which means that CD-HIT clusters are not directly comparable to USEARCH clusters at a given identity threshold.
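The effect of gap penalties can be sketched with a small Needleman-Wunsch global aligner. This is a toy illustration under stated assumptions, not the algorithm or scoring scheme of CD-HIT or USEARCH: the function names, the linear gap penalty, the scores (+1 match, -1 mismatch) and the two short sequences are all hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def count_matches(row_a, row_b):
    """Number of columns where both rows carry the same letter."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# Hypothetical sequences of equal length (9 letters each).
SEQ_A = "TGCACGTCA"
SEQ_B = "TGACGTGCA"

for label, gap in (("high gap penalty", -10), ("low gap penalty", -1)):
    row_a, row_b = global_align(SEQ_A, SEQ_B, gap=gap)
    matches = count_matches(row_a, row_b)
    shorter = min(len(SEQ_A), len(SEQ_B))
    print(label)
    print(" ", row_a)
    print(" ", row_b)
    print(f"  matches / all columns:      {100.0 * matches / len(row_a):.1f}%")
    print(f"  matches / shorter sequence: {100.0 * matches / shorter:.1f}%")
```

With the high gap penalty the rows stay ungapped and only four of nine columns match. With the low gap penalty the aligner inserts one gap in each sequence and recovers eight matches; dividing by the length of the shorter sequence then leaves those two gap columns out of the denominator, which raises the reported %id further.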

Below are alignments by CD-HIT, USEARCH and CLUSTALW of a pair of 16S rRNA reads (in FASTA format at bottom of this page). Terminal gaps are not shown. The pair has 97% identity according to CD-HIT and 86% according to USEARCH. See here for instructions on how to view CD-HIT alignments.

The following table summarizes the %ids assigned to this pair of reads by different methods. The CD-HIT value stands out as anomalously high: 97%, compared with the next-largest value of 93.5% (CLUSTALW alignment with CD-HIT definition). Results from the MUSCLE alignment are also reported (alignment not shown).
Alignment   %id by identity definition (details)
            (first column: CD-HIT definition)
CD-HIT      97.0%   89.6%   92.8%
USEARCH     86.1%   85.4%   86.7%
CLUSTALW    93.5%   87.4%   90.7%
MUSCLE      92.6%   85.6%   91.6%

How close are the reads taxonomically?
The RDP Naive Bayesian Classifier assigns these reads to different families in the same order. Since the RDP classifier uses an alignment-free method, we can assume that it is independent of alignment biases. In this example, the divergence reported by USEARCH is closer to the expected taxonomic divergence.