
Dereplication (finding unique sequences)

See also
 
UPARSE home page
  UPARSE pipeline home page

Unique sequences and abundances must be identified
The input to cluster_otus must be a set of unique sequences, sorted in order of decreasing abundance, with size annotations in the labels. The fastx_uniques command can be used to find the unique sequences and add the size annotations. I suggest you use -relabel Uniq so that the unique sequences are labeled Uniq1, Uniq2 and so on. The input to fastx_uniques should be the reads after any quality filtering or length trimming.
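As a rough illustration of what dereplication produces, the Python sketch below counts identical sequences, sorts them by decreasing abundance, and writes labels of the form Uniq1;size=N;. It is not how fastx_uniques is implemented. The function name dereplicate, the case-insensitive comparison, the tie-breaking rule and the exact label format are assumptions made for the example.

```python
from collections import Counter

def dereplicate(reads, prefix="Uniq"):
    """Return (label, sequence) pairs for unique sequences, most abundant first."""
    # Count identical sequences; treating upper and lower case as equal is an assumption.
    counts = Counter(seq.upper() for seq in reads)
    # Sort by decreasing abundance; ties are broken by sequence so the order is reproducible.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Label each unique sequence Uniq1, Uniq2, ... with an assumed ;size=N; annotation.
    return [
        (f"{prefix}{rank};size={count};", seq)
        for rank, (seq, count) in enumerate(ordered, start=1)
    ]

# Hypothetical toy reads, standing in for filtered and trimmed input.
reads = ["ACGT", "ACGT", "GGCC", "ACGT", "TTAA", "GGCC"]
for label, seq in dereplicate(reads):
    print(f">{label}")
    print(seq)
```

In this toy input, ACGT occurs three times, GGCC twice and TTAA once, so TTAA is a singleton unique.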

Samples should be pooled before dereplication
I recommend pooling samples, i.e. concatenating reads for all samples that were sequenced in the same run. This is important for getting the best detection of chimeras and cross-talk, and for getting the best sensitivity to low-abundance sequences that could be lost if individual samples or subsets of samples are clustered separately. See pooling samples for discussion.
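Pooling in this sense is plain concatenation of the per-sample read files, which could be sketched in Python as follows. The function name pool_fasta and the file names are hypothetical. The only extra step is adding a final newline if a file lacks one, so that the last line of one sample is not joined to the first line of the next.

```python
import tempfile
from pathlib import Path

def pool_fasta(sample_paths, pooled_path):
    """Concatenate per-sample read files into a single pooled file."""
    with open(pooled_path, "w") as out:
        for path in sample_paths:
            with open(path) as src:
                for line in src:
                    # Add a newline if the last line of a file lacks one.
                    out.write(line if line.endswith("\n") else line + "\n")

# Demo with two tiny hypothetical sample files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_a = Path(tmp, "sampleA.fa")
    sample_b = Path(tmp, "sampleB.fa")
    sample_a.write_text(">A.1\nACGT\n")
    sample_b.write_text(">B.1\nGGCC")  # no trailing newline
    pooled = Path(tmp, "pooled.fa")
    pool_fasta([sample_a, sample_b], pooled)
    print(pooled.read_text(), end="")
```

The pooled file is then the input to dereplication, so abundances are counted across all samples in the run.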

Singleton uniques should be discarded or ignored
I recommend excluding singletons from the clustering step. The cluster_otus and unoise3 commands do this automatically with default options.

It is especially important to discard singletons if you do not have quality scores, e.g. because your reads are in FASTA format and have already undergone some processing.

Most of the singletons will probably be recovered when you make the OTU table (otutab command).
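If you need to drop singletons yourself, for example in a custom script, the filter amounts to reading the size annotation from each label and keeping sequences with size 2 or more. The Python sketch below assumes labels carry a ;size=N annotation and treats a label without one as size 1. The function name drop_singletons and the min_size parameter are hypothetical and are not usearch options.

```python
import re

# Matches an assumed size annotation such as ";size=12;" or ";size=12".
SIZE_RE = re.compile(r";size=(\d+)")

def drop_singletons(records, min_size=2):
    """Keep (label, sequence) pairs whose size annotation is at least min_size."""
    kept = []
    for label, seq in records:
        match = SIZE_RE.search(label)
        # Assumption: a label without a size annotation counts as size 1.
        size = int(match.group(1)) if match else 1
        if size >= min_size:
            kept.append((label, seq))
    return kept

# Hypothetical dereplicated records.
uniques = [
    ("Uniq1;size=3;", "ACGT"),
    ("Uniq2;size=2;", "GGCC"),
    ("Uniq3;size=1;", "TTAA"),
]
for label, seq in drop_singletons(uniques):
    print(f">{label}")
    print(seq)
```

Here Uniq3 is excluded from clustering, but its read can still be counted later if it maps to an OTU when the OTU table is built.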