Dereplication (finding unique sequences)
See also
UPARSE home page
UPARSE pipeline home page
Unique sequences and abundances must be identified
The input sequences to cluster_otus must be a set of unique sequences sorted in
order of decreasing abundance, with size annotations in the labels. The
fastx_uniques command can be used to find the unique sequences and add the size
annotations. I suggest you use -relabel Uniq so that the unique sequences are
labeled Uniq1, Uniq2 and so on. The input to fastx_uniques should be the reads
after any quality filtering or length trimming.
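To make the expected input concrete, here is a minimal Python sketch of what dereplication produces. It is an illustration only, not the fastx_uniques implementation: the function name dereplicate, the toy reads, and the tie-breaking rule (alphabetical by sequence) are assumptions made for the example. The label format follows the Uniq prefix and size annotation described above.

```python
from collections import Counter

def dereplicate(reads, prefix="Uniq"):
    """Return (label, sequence) pairs for unique sequences, most abundant first.

    Hypothetical helper for illustration; sequences are compared as exact strings.
    """
    counts = Counter(reads)
    # Sort by decreasing abundance. Ties are broken alphabetically by sequence
    # (an assumption, chosen only to make the output deterministic).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Attach a size annotation to each label, e.g. "Uniq1;size=3;".
    return [
        (f"{prefix}{i};size={n};", seq)
        for i, (seq, n) in enumerate(ordered, start=1)
    ]

# Toy reads, assumed to have already been quality filtered and trimmed.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC", "TTGA"]
uniques = dereplicate(reads)

# Print in FASTA format.
for label, seq in uniques:
    print(f">{label}")
    print(seq)
```

Here ACGT occurs three times, TTGA twice and GGCC once, so the records come out in that order with sizes 3, 2 and 1.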
Samples should be pooled before dereplication
I recommend pooling samples, i.e. concatenating reads for all samples that
were sequenced in the same run. This is important for getting the best
detection of chimeras and cross-talk, and for
getting the best sensitivity to low-abundance sequences that could be lost
if individual samples or subsets of samples are clustered separately. See
pooling samples for discussion.
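The effect of pooling on low-abundance sequences can be seen with a small Python sketch. The two samples and their reads are invented for illustration. The point is only that a sequence which is a singleton in every individual sample can have abundance greater than one after the reads are concatenated.

```python
from collections import Counter

# Hypothetical reads from two samples sequenced in the same run.
sample_a = ["ACGT", "ACGT", "TTGA"]
sample_b = ["ACGT", "TTGA", "GGCC"]

# Abundance of each unique sequence when samples are processed separately.
per_sample = [Counter(sample_a), Counter(sample_b)]

# Abundance after pooling, i.e. concatenating the reads before dereplication.
pooled = Counter(sample_a + sample_b)

for seq in sorted(pooled):
    separate = [counts[seq] for counts in per_sample]
    print(seq, "per-sample:", separate, "pooled:", pooled[seq])
```

In this toy data TTGA is a singleton in each sample, so it would be lost if singletons were discarded per sample, but it has size 2 in the pooled data and is kept.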
Singleton uniques should be discarded or ignored
I recommend excluding singletons from the clustering
step. There are a couple of different ways to do this.
1. Use the -minsize 2 option of the sortbysize command to discard them.
2. Use the -minsize 2 option of the cluster_otus command (this requires v8.1.1803 or later). This option ignores singleton input sequences, saving the step of making a separate file with singletons deleted.
It is especially important to discard singletons if you do not have quality scores, e.g. because you have reads in FASTA format that have already undergone some processing.
Most of the singletons will probably be recovered when you
make the OTU table.
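The effect of a minimum size of 2 can be sketched in Python as follows. This is an illustration of the filtering rule only, not how sortbysize or cluster_otus are implemented. The helper names get_size and drop_small and the example records are assumptions, and the labels are assumed to carry size annotations of the form size=N.

```python
import re

# Matches a size annotation such as ";size=3;" or a leading "size=3".
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)")

def get_size(label):
    """Extract the abundance from a label; hypothetical helper."""
    match = SIZE_RE.search(label)
    if match is None:
        raise ValueError(f"no size annotation in label: {label!r}")
    return int(match.group(1))

def drop_small(records, minsize=2):
    """Keep only records whose size is at least minsize."""
    return [(label, seq) for label, seq in records if get_size(label) >= minsize]

# Example dereplicated records (invented for illustration).
records = [
    ("Uniq1;size=3;", "ACGT"),
    ("Uniq2;size=2;", "TTGA"),
    ("Uniq3;size=1;", "GGCC"),  # singleton
]

kept = drop_small(records, minsize=2)
for label, seq in kept:
    print(f">{label}")
    print(seq)
```

With minsize=2 the singleton Uniq3 is dropped and the other two records pass through unchanged, in their original order.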