Home Software Services About Contact     

Consensus sequences

See also
  consout option

If reads are approximately globally alignable to one biological sequence, then a multiple alignment of a biological sequence to its reads will look something like this. Read errors are highlighted.


The biological sequence can be estimated as the consensus sequence derived from the multiple alignment. In each column of the alignment, the most common letter is taken. If the column contains a gap, the column is discarded. In this example, the biological sequence is recovered correctly. In general, there might be some remaining errors but we expect the consensus sequence to be closer than the longest read or a randomly chosen read from the cluster.

Denoising is better!
For amplicon reads such as 16S and ITS tags, the denoised sequences generated by the unoise3 command will be much better predictions of biological sequences!

OTUs are better
For amplicon reads such as 16S and ITS tags, the centroid sequences generated by cluster_otus will be better predictions of biological sequences. I do not recommend using consensus sequences as OTU representatives.

Limitations of consensus sequences
The multiple alignment constructed by USEARCH is made using method that is designed to be as fast as possible with reasonable accuracy. The alignments, which can be reviewed by using the masout option. may be less accurate than popular multiple alignment programs like MUSCLE, especially at lower sequence identities. In USEARCH, consensus sequences are most appropriate high identities (say, 99%) when the alignments contain few gaps. At lower identities, the accuracy of the multiple alignment will tend to degrade, giving lower quality consensus sequences.