See also: OTU clustering
Preclustering is a method suggested by
Huse et al.
(2010). The key observation is that reads with errors (or
amplicons with errors) should be less abundant than correct reads (or correct
amplicons). This is because sequencer errors are unlikely to be reproduced by
chance, and amplicons with PCR errors will have undergone fewer rounds of
amplification.
This suggests the following technique: if a read (R) has
only a small number of differences from a read of higher abundance (H), then
assume H is the correct read corresponding to R, so add the abundance of R to H
and discard R.
This merging is performed by
cluster_otus, which simultaneously performs chimera filtering and greedy OTU
construction by considering reads in order of decreasing abundance.
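The merging strategy described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the cluster_otus implementation; the read sequences and abundances are made up, and ties and length differences are handled naively:

```python
# Sketch of abundance-ordered merging: each read is absorbed into a more
# abundant read within maxdiffs differences, otherwise kept as a new center.

def hamming(a, b):
    """Number of differing positions; sequences assumed equal length."""
    return sum(x != y for x, y in zip(a, b))

def precluster(reads, maxdiffs=1):
    """reads: dict mapping sequence -> abundance. Returns merged dict."""
    merged = {}
    # Consider reads in order of decreasing abundance.
    for seq, abund in sorted(reads.items(), key=lambda kv: -kv[1]):
        for center in merged:
            if len(center) == len(seq) and hamming(center, seq) <= maxdiffs:
                merged[center] += abund  # assume the center is the correct read
                break
        else:
            merged[seq] = abund  # no close, more abundant read: keep it
    return merged

reads = {"ACGTACGT": 10, "ACGTACGA": 1, "TTTTCCCC": 1}
print(precluster(reads))  # → {'ACGTACGT': 11, 'TTTTCCCC': 1}
```

The 1-difference singleton is merged into the abundant read; the unrelated singleton remains on its own.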
Preclustering can increase sensitivity
If you follow my recommendation
to discard singleton reads, sensitivity may be reduced because "lone
singletons" are lost, i.e. cases where the highest-abundance read for a given
species is a singleton. Some of these species can be preserved by a
preclustering step that merges very similar reads. If there is one correct read for a
species, there may be one other read of the same gene that has one error. In
this case, the species has two lone singleton reads, and these will be lost when
singletons are discarded. They can be preserved by merging reads that have only
a single difference, e.g. using this command:
usearch -cluster_smallmem derep.fa -id 0.99 -maxdiffs 1 -centroids preclustered.fa
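The lone-singleton scenario can be made concrete with a toy example. This is a hypothetical illustration (the sequences are invented), showing that a singleton filter loses the species before preclustering but keeps it afterwards:

```python
# Two singleton reads one difference apart: without preclustering, a
# discard-singletons filter loses the species; with merging, it survives.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

reads = {"ACGTACGTACGT": 1, "ACGTACGTACGA": 1}  # two singletons, 1 diff apart

survivors_without = {s: n for s, n in reads.items() if n > 1}
print(survivors_without)  # → {} : the species is lost

# Merge the pair (which read is kept as the center is arbitrary here).
seqs = list(reads)
merged = dict(reads)
if hamming(seqs[0], seqs[1]) <= 1:
    merged = {seqs[0]: reads[seqs[0]] + reads[seqs[1]]}

survivors_with = {s: n for s, n in merged.items() if n > 1}
print(survivors_with)  # → {'ACGTACGTACGT': 2} : the species survives
```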
Motivation for maxdiffs 1
The choice of -id 0.99 is arbitrary; an identity threshold must be given because cluster_smallmem requires one. The maximum of one difference, however, is not arbitrary: if two reads differ at only a single position, then most likely one of them is correct, because it is very unlikely that two erroneous reads would agree on all of their errors except one. If larger numbers of differences are allowed, then two erroneous reads may be merged, and the merged read is more likely to create a spurious OTU.
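A back-of-envelope calculation illustrates why agreement on an error is unlikely. The error model here is my assumption, not from the source: suppose each of two reads of length L carries exactly one random substitution, uniform over positions and over the three possible wrong bases:

```python
# Assumed toy model: one uniform random substitution per read.
# Probability that two independent erroneous reads made the SAME error
# (same position out of L, same wrong base out of 3) is 1/(3*L).
L = 250  # assumed read length for illustration
p_same_error = 1 / (3 * L)
print(p_same_error)  # → roughly 0.0013 for a 250 bp read
```

So two bad reads that agree everywhere are rare; a pair of reads differing at exactly one position is far more plausibly one correct read plus one single-error read.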