See also
consout option
If reads are approximately globally alignable to one biological sequence, then a multiple alignment of a biological sequence to its reads will look something like this. Read errors are highlighted.
The biological sequence can be estimated as the consensus sequence derived from the multiple alignment. In each column of the alignment, the most common letter is taken. If the column contains a gap, the column is discarded. In this example, the biological sequence is recovered correctly. In general, there might be some remaining errors but we expect the consensus sequence to be closer than the longest read or a randomly chosen read from the cluster.
OTUs are better
For amplicon reads such as 16S and ITS tags, the centroid sequences
generated by cluster_otus will be better
predictions of biological sequences. I do not recommend using consensus
sequences as OTU representatives.
Limitations of consensus sequences
The multiple alignment constructed by USEARCH is made using method that is
designed to be as fast as possible with reasonable accuracy. The alignments,
which can be reviewed by using the masout option.
may be less accurate than popular multiple alignment programs like
MUSCLE, especially at lower sequence
identities. In USEARCH, consensus sequences are most appropriate high identities
(say, 99%) when the alignments contain few gaps. At lower identities, the
accuracy of the multiple alignment will tend to degrade, giving lower quality
consensus sequences.