See also: OTU clustering
Preclustering is a method suggested by
Huse et al.
(2010). The key observation is that reads with errors (or
amplicons with errors) should be less abundant than correct reads (or correct
amplicons). This is because sequencer errors are unlikely to be reproduced by
chance, and amplicons with PCR errors will have undergone fewer rounds of
amplification.
This suggests the following technique: if a read (R) has
only a small number of differences from a read of higher abundance (H), then
assume H is the correct read corresponding to R, so add the abundance of R to H
and discard R.
This merging is performed by
cluster_otus, which simultaneously performs chimera filtering and greedy OTU
construction by considering reads in order of decreasing abundance.
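The merging strategy described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the cluster_otus implementation; the read sequences and abundances are made up, and ties and length differences are handled naively:

```python
# Sketch of abundance-ordered merging: each read is absorbed into a more
# abundant read within maxdiffs differences, otherwise kept as a new center.

def hamming(a, b):
    """Number of differing positions; sequences assumed equal length."""
    return sum(x != y for x, y in zip(a, b))

def precluster(reads, maxdiffs=1):
    """reads: dict mapping sequence -> abundance. Returns merged dict."""
    merged = {}
    # Consider reads in order of decreasing abundance.
    for seq, abund in sorted(reads.items(), key=lambda kv: -kv[1]):
        for center in merged:
            if len(center) == len(seq) and hamming(center, seq) <= maxdiffs:
                merged[center] += abund  # assume the center is the correct read
                break
        else:
            merged[seq] = abund  # no close, more abundant read: keep it
    return merged

reads = {"ACGTACGT": 10, "ACGTACGA": 1, "TTTTCCCC": 1}
print(precluster(reads))  # → {'ACGTACGT': 11, 'TTTTCCCC': 1}
```

The 1-difference singleton is merged into the abundant read; the unrelated singleton remains on its own.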
Preclustering can increase sensitivity
If you follow my recommendation
to discard singleton reads, sensitivity may be reduced because "lone
singletons" are lost, i.e. cases where the highest-abundance read for a given
species is a singleton. Some of these species can be preserved by a
preclustering step that merges very similar reads. If there is one correct read for a
species, there may be one other read of the same gene that has one error. In
this case, the species has two lone singleton reads, and these will be lost when
singletons are discarded. They can be preserved by merging reads that have only
a single difference, e.g. using this command:
usearch -cluster_smallmem derep.fa -id 0.99 -maxdiffs 1 -centroids preclustered.fa
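The lone-singleton scenario can be made concrete with a toy example. This is a hypothetical illustration (the sequences are invented), showing that a singleton filter loses the species before preclustering but keeps it afterwards:

```python
# Two singleton reads one difference apart: without preclustering, a
# discard-singletons filter loses the species; with merging, it survives.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

reads = {"ACGTACGTACGT": 1, "ACGTACGTACGA": 1}  # two singletons, 1 diff apart

survivors_without = {s: n for s, n in reads.items() if n > 1}
print(survivors_without)  # → {} : the species is lost

# Merge the pair (which read is kept as the center is arbitrary here).
seqs = list(reads)
merged = dict(reads)
if hamming(seqs[0], seqs[1]) <= 1:
    merged = {seqs[0]: reads[seqs[0]] + reads[seqs[1]]}

survivors_with = {s: n for s, n in merged.items() if n > 1}
print(survivors_with)  # → {'ACGTACGTACGT': 2} : the species survives
```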
Motivation for maxdiffs 1
The choice of -id 0.99 is arbitrary; an identity threshold must be given because cluster_smallmem requires one. The maximum of one difference, however, is not arbitrary: if two reads differ at only a single position, then most likely one of them is correct, because it is very unlikely that two erroneous reads would agree on all of their errors except one. If larger numbers of differences are allowed, then two erroneous reads may be merged, and the merged read is more likely to create a spurious OTU.
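A back-of-envelope calculation illustrates why agreement on an error is unlikely. The error model here is my assumption, not from the source: suppose each of two reads of length L carries exactly one random substitution, uniform over positions and over the three possible wrong bases:

```python
# Assumed toy model: one uniform random substitution per read.
# Probability that two independent erroneous reads made the SAME error
# (same position out of L, same wrong base out of 3) is 1/(3*L).
L = 250  # assumed read length for illustration
p_same_error = 1 / (3 * L)
print(p_same_error)  # → roughly 0.0013 for a 250 bp read
```

So two bad reads that agree everywhere are rare; a pair of reads differing at exactly one position is far more plausibly one correct read plus one single-error read.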