UPARSE OTUs with radius different from 3% (i.e., different from 97% identity)
In previous versions of USEARCH the cluster_otus command had an otu_radius_pct option for specifying a radius different from the default of 3%.
However, please note that it is not recommended to use non-default values.
The main reason is that chimera detection degrades. Each input sequence is run
through UPARSE-REF using the current set of
OTUs as a reference database. If the optimal model is chimeric, the sequence
is discarded. If an OTU radius > 3% is used, then chimera detection becomes
more difficult because more true biological sequences will also be discarded
when they don't create new OTUs. The set of OTU sequences becomes sparser,
and the correct parents of a chimera will more often be missing from the OTU
database. Chimeras can still be detected when there are OTUs which are
sufficiently close to their parents, but the false negative rate will tend
to increase.
Chimera detection also gets more difficult when the OTU radius is <3%.
This is because you get many more false positives due to "fake models" where
a correct biological sequence can be exactly reconstructed from segments of
two other valid sequences. This surprising result is explained in detail in
the UCHIME2 paper.
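As a toy illustration of what a fake model looks like, the sketch below uses
three short sequences that are invented for this example (they are not real
amplicon data, and the variable names are hypothetical). X differs from A at one
position and from B at one position, yet joining the first half of A to the
second half of B reproduces X exactly, so a two-parent chimeric model fits X
perfectly even though X is assumed here to be a genuine sequence.

```shell
# Toy sequences invented for illustration only (assumption: all three are
# genuine biological sequences, none is a real chimera).
A="ACGTACGTTTGACCGA"
B="ACGAACGTTTGACCTA"
X="ACGTACGTTTGACCTA"   # differs from A at position 15 and from B at position 4

# Build a two-segment model: bases 1-8 of A followed by bases 9-16 of B.
left=$(printf '%s\n' "$A" | cut -c1-8)
right=$(printf '%s\n' "$B" | cut -c9-)
model="${left}${right}"

# If the model equals X, then X looks like a perfect chimera of A and B.
if [ "$model" = "$X" ]; then
    echo "X is exactly reproduced by segments of A and B (fake model)"
else
    echo "X is not reproduced by this model"
fi
```

The more closely spaced the valid sequences in the database are, the more often
coincidences of this kind can occur.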
Recommended: make OTUs with 100% clustering identity
My current recommendation is to use the UNOISE error-correction (denoising)
algorithm to reconstruct the set of correct
biological sequences in the reads. These sequences are valid OTUs which I
call "ZOTUs" (zero-radius OTUs). This is better than traditional 97%
clustering because it gives better phenotype resolution: it lets you
distinguish species and strains which would be lumped together at 97%. See
the unoise2 command for details.
Recommended procedure for OTUs with clustering identity <100%
To make OTUs at identities different from 97%, the best method is to use
UNOISE followed by UCLUST, e.g. the unoise2 command followed by
cluster_smallmem. For example, to make OTUs at 100%, 99%, 97%, 95% and 90%
identity:
usearch -unoise2 uniques.fa -fastaout otus100.fa -minampsize 4
for id in 99 97 95 90
do
  usearch -cluster_smallmem otus100.fa -id 0.$id -centroids otus$id.fa
done
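One quick sanity check after running the commands above is to count the
sequences in each output file; the number of OTUs would be expected to shrink as
the identity threshold drops. The sketch below is only a suggestion: count_seqs
is a hypothetical helper name, and it assumes the FASTA files use the names from
the example and sit in the current directory. Files that are missing are
reported rather than counted.

```shell
# count_seqs (hypothetical helper): number of FASTA records in a file,
# counted as lines that begin with '>'.
count_seqs() {
    grep -c '^>' "$1"
}

# Report the OTU count for each identity level used in the example above.
for id in 100 99 97 95 90
do
    f="otus$id.fa"
    if [ -f "$f" ]; then
        echo "$id% identity: $(count_seqs "$f") OTUs"
    else
        echo "$id% identity: $f not found"
    fi
done
```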