USEARCH manual

consensus sequences

If reads are approximately globally alignable to one biological sequence, then a multiple alignment of a biological sequence to its reads will look something like this. Read errors are highlighted.

The biological sequence can be estimated as the consensus sequence derived from the multiple alignment. In each column of the alignment, the most common letter is taken. If the column contains a gap, the column is discarded. In this example, the biological sequence is recovered correctly. In general, some errors might remain, but we expect the consensus sequence to be closer to the biological sequence than either the longest read or a randomly chosen read from the cluster.
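The column rule described above can be sketched in a few lines of Python. This is a minimal illustration, not USEARCH's implementation: the function name `consensus`, the gap characters `-` and `.`, and the example reads are all assumptions made for the sketch. It reads "contains a gap" literally, so a column is dropped if any row has a gap in it, and ties between equally common letters are broken by whichever letter appears first in the column.

```python
from collections import Counter


def consensus(aligned_rows, gap_chars="-."):
    """Estimate a consensus from the rows of a multiple alignment.

    aligned_rows: list of equal-length strings, one per aligned read.
    gap_chars: characters treated as gaps (an assumption of this sketch).
    """
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all rows of the alignment must have the same length")

    letters = []
    for column in zip(*aligned_rows):
        # Literal reading of the rule: discard any column that contains a gap.
        if any(ch in gap_chars for ch in column):
            continue
        # Otherwise keep the most common letter in the column.
        letter, _count = Counter(column).most_common(1)[0]
        letters.append(letter)
    return "".join(letters)


# Made-up alignment: one read has a substitution, another has a gap.
reads = [
    "ACGTTGCA",
    "ACGTAGCA",  # substitution in the fifth column
    "ACGTTGCA",
    "ACG-TGCA",  # gap in the fourth column
]
print(consensus(reads))
```

In this toy alignment the substitution is outvoted by the other three reads, while the gapped column is dropped entirely, so the consensus is one letter shorter than the alignment.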

Limitations of consensus sequences
The multiple alignment constructed by USEARCH is made using a method that is designed to be as fast as possible with reasonable accuracy. The alignments, which can be reviewed by using the msaout option, may be less accurate than those produced by popular multiple alignment programs like MUSCLE, especially at lower sequence identities. In USEARCH, consensus sequences are most appropriate at high identities (say, 99%), when the alignments contain few gaps. At lower identities, the accuracy of the multiple alignment will tend to degrade, giving lower quality consensus sequences.