Below are two alignments of a pair of 16S reads (given in
FASTA format at the bottom of this page). The top alignment is by CD-HIT and the lower
alignment is by USEARCH. There is a highly conserved region at the end
of the reads that is grossly misaligned by CD-HIT (red letters, highlighted
in yellow below). The misalignment is caused by the banded dynamic
programming algorithm used by CD-HIT. If the length difference in a
segment between two conserved regions exceeds the band width, one or
both of the conserved regions will be misaligned.
If you're interested in reproducing these results, see here
for instructions on how to view
CD-HIT alignments.
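The effect of a fixed band can be illustrated with a minimal sketch. This is not CD-HIT's implementation: it is a plain edit-distance dynamic program restricted to cells with |i - j| <= band, and the function name and the toy sequences are made up for illustration. The two toy sequences share two conserved regions, but the segment between them is 4 letters longer in one sequence, so the second region sits on a diagonal that is shifted by 4.

```python
import math


def banded_edit_distance(a, b, band):
    """Edit distance using only DP cells with |i - j| <= band.

    Illustrative only; cells outside the band are treated as unreachable.
    Returns math.inf if the final cell lies outside the band.
    """
    n, m = len(a), len(b)
    # Row 0: reaching column j costs j insertions, if j is inside the band.
    prev = [math.inf] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


# Hypothetical toy data: two conserved regions with a 4-letter segment
# that is present between them in `a` but moved to the end in `b`.
conserved_1 = "ACGTACGA"
conserved_2 = "AAACCCGGGAAA"
variable = "TTTT"
a = conserved_1 + variable + conserved_2
b = conserved_1 + conserved_2 + variable

full = banded_edit_distance(a, b, max(len(a), len(b)))  # band covers everything
narrow = banded_edit_distance(a, b, 1)   # band narrower than the 4-letter shift
wide = banded_edit_distance(a, b, 4)     # band as wide as the shift

print("full:", full, "band=1:", narrow, "band=4:", wide)
```

With a band of 1 the path cannot reach the diagonal on which the second conserved region matches, so that region is forced into mismatches and the distance is higher than with the unrestricted computation. With a band of 4 the banded and unrestricted results agree.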
CD-HIT banding causes more errors than USEARCH banding
USEARCH also uses
banding by default, but the strategy is quite different, and the --band
option of USEARCH is not equivalent to the -b option of CD-HIT. In CD-HIT, the band
spans the entire alignment. In USEARCH, banding is used only for
regions between HSPs (high-scoring segment pairs), and
the band width is set dynamically according to the length difference of
the aligned regions. The --band option of USEARCH sets a minimum
width of the band, while the CD-HIT -b option sets the maximum
width of the band. USEARCH alignments are therefore much less prone to
banding artifacts.
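The difference between a maximum and a minimum band width can be sketched as follows. The exact rule USEARCH uses is not given here, so the `max(...)` formula below is an assumption chosen only to match the description (a band that grows with the length difference and never falls below a minimum). The function names and the numbers are hypothetical and are not the defaults of either program.

```python
def fixed_band_covers(len_a, len_b, max_band):
    """Fixed maximum band: True only if the length difference fits inside it."""
    return abs(len_a - len_b) <= max_band


def dynamic_band_width(len_a, len_b, min_band):
    """Assumed rule: widen the band to the length difference, never below min_band."""
    return max(min_band, abs(len_a - len_b))


# Hypothetical region between two HSPs: 40 letters in one read, 12 in the other.
region_a, region_b = 40, 12

print(fixed_band_covers(region_a, region_b, max_band=16))
print(dynamic_band_width(region_a, region_b, min_band=16))
```

Under this assumed rule, a length difference of 28 does not fit in a fixed band of 16, while the dynamic band widens to 28 and so always covers the diagonal shift within the region.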
USEARCH options for assessing and tuning banding
USEARCH alignment heuristics, including banding,
can be disabled by using --nofastalign, so that alignments with and
without heuristics can be compared. This lets USEARCH users
check for artifacts and tune the speed heuristics to trade alignment quality
for speed.
>M31Mout_50786
CTGGGCCGTGTCTCAGTCCCAGTGTGGCTGGTCATCCTCTCAGACCAGCTAGAGATCGTCGGCTTGGTGAGCCTTTACCT
CACCAACTACCTAATCCCACTTGGGCTCATCCTATGGCATGTGGCCCGAAGGTCCCACACTTTCATCTTCCGTACGTAAC
TTACCGTACCGGGTACGGTTAAGTTACGTACCTAACGTTTACCCGGTTTACCCGGTTTAACGTTTACCCCCTTCCCCCCT
ACCCTAAAGTAACTACGTAAGTTACCCTTAACCCGAACGACTTAA
>M22Fcsp_243936
CTGGGCCGTGTCTCAGTCCCAGTGTGGCTGGTCATCCTCTCAGACCAGCTAGAGATCGTCGGCTTGGTGAGCCTTTACCT
CACCAACTACCTAATCCCACTTGGGCTCATCCTATGGCATGTGGCCCGAAGGTCCCACACTTTCATCTCTCGATTCTACG
CGGTATTAACTACTACTTACCGTTTACCGGTTACGTTTACCCCTTCCCCCTACCTAATAACGTACGTAAGTTACCTTAAC
CCGAACGACTTAA