See also
UPARSE pipeline
OTU benchmark results
Making an OTU table (mapping reads to OTUs)
Should I use UPARSE or UNOISE?
The cluster_otus command performs 97% OTU clustering using the UPARSE-OTU algorithm.
For most purposes, I consider 97% OTU clustering obsolete. It is better to use the unoise command to recover the full set of biological sequences in the reads. These are also valid OTUs; I call them "ZOTUs" for zero-radius OTUs, to emphasize this. See the UNOISE paper for full discussion.
Input to cluster_otus is a FASTA file containing quality filtered, globally trimmed and dereplicated reads from a marker gene amplicon sequencing experiment, e.g. 16S or ITS. It is generally recommended that singleton reads should be discarded. See UPARSE pipeline for discussion of how to prepare reads before clustering.
Reads must be globally trimmed before finding unique sequences. See global trimming.
Input sequence labels must have size annotations giving the abundance of the unique sequence. Size annotations are generated by the -sizeout option of clustering commands; typically fastx_uniques is used.
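As a quick sanity check before clustering, the following Python sketch scans FASTA labels for size annotations. It assumes the usual USEARCH label convention of a `;size=N;` field (for example `>Uniq1;size=250;`); the function names and the example labels are hypothetical and are not part of usearch.

```python
import re

# Assumed annotation format: "size=N" delimited by semicolons, e.g. "Uniq1;size=250;".
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")

def parse_size(label):
    """Return the abundance from a FASTA label, or None if there is no size annotation."""
    m = SIZE_RE.search(label)
    return int(m.group(1)) if m else None

def labels_missing_size(fasta_lines):
    """Return the labels (without the leading '>') that lack a size annotation."""
    missing = []
    for line in fasta_lines:
        if line.startswith(">"):
            label = line[1:].strip()
            if parse_size(label) is None:
                missing.append(label)
    return missing

# Hypothetical input: two annotated unique sequences and one without an annotation.
example = [">Uniq1;size=250;", "ACGT", ">Uniq2;size=3;", "ACGA", ">Uniq3", "ACGG"]
print(labels_missing_size(example))
```

Any label reported by this check would need to be regenerated with size annotations before the file is passed to cluster_otus.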
The -sizein and -sizeout options are no longer supported by cluster_otus because they were misleading for evaluating the results. To determine the number of reads in each OTU, it is better to make an OTU table using reads before quality filtering and deleting singletons, which recovers many (usually, most) of the reads that were discarded. Using -sizein and -sizeout can give the impression that UPARSE discards a large fraction of the reads, which is usually not the case if you use my recommended approach.
The -minsize option can be used to specify a minimum abundance; for example you can use -minsize 2 to discard singletons.
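The effect of a minimum abundance threshold can be illustrated with a short Python sketch. This is not how usearch implements -minsize; it only shows which input records would pass a given threshold, assuming labels carry a `;size=N;` annotation. The function name and records are hypothetical.

```python
import re

def filter_min_size(records, minsize=2):
    """Keep (label, sequence) pairs whose size annotation is at least minsize."""
    kept = []
    for label, seq in records:
        m = re.search(r"(?:^|;)size=(\d+)(?:;|$)", label)
        if m is None:
            raise ValueError(f"no size annotation in label: {label}")
        if int(m.group(1)) >= minsize:
            kept.append((label, seq))
    return kept

# Hypothetical dereplicated reads; the last one is a singleton.
records = [
    ("Uniq1;size=250;", "ACGT"),
    ("Uniq2;size=2;", "ACGA"),
    ("Uniq3;size=1;", "ACGG"),
]
kept = filter_min_size(records, minsize=2)
print([label for label, _ in kept])
```

With minsize=2 the singleton is dropped and the other two records are retained.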
The -otu_radius_pct option specifies the OTU "radius" as a percentage, i.e. the maximum difference between an OTU member sequence and the representative sequence of that OTU. Default is 3.0, corresponding to a minimum identity of 97%. It is not recommended to use a non-default value; see UPARSE OTU radius for discussion and a method for making OTUs at different identities.
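The relationship between the radius and the identity threshold is simple arithmetic: identity = 100% minus radius. The Python sketch below makes the conversion explicit; the function name is hypothetical, and the fractional form is shown because the -id option in the example at the end of this page is written as a fraction (0.97).

```python
def radius_pct_to_identity(radius_pct):
    """Convert an OTU radius in percent to a minimum identity as a fraction."""
    if not 0.0 <= radius_pct <= 100.0:
        raise ValueError("radius must be between 0 and 100 percent")
    return (100.0 - radius_pct) / 100.0

# The default radius of 3.0 corresponds to 97% identity.
print(radius_pct_to_identity(3.0))
```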
The -otus option specifies a FASTA output file for the OTU representative sequences. By default, OTU labels are taken from the input file, with size annotations stripped. The -relabel option specifies a string that is used to re-label OTUs. If -relabel xxx is specified, then the labels are xxx followed by 1, 2 ... up to the number of OTUs. OTU identifiers in the labels are required for making an OTU table using usearch_global.
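The labelling scheme described for -relabel can be sketched in Python as follows. This only mimics the naming pattern (prefix followed by 1, 2, ...); the function name and input labels are hypothetical, and the order in which usearch numbers the OTUs is not specified here.

```python
def relabel_otus(labels, prefix):
    """Return new labels: prefix followed by 1, 2, ... up to the number of OTUs."""
    return [f"{prefix}{i}" for i in range(1, len(labels) + 1)]

# Hypothetical representative sequence labels from the input file.
old_labels = ["Uniq1;size=250;", "Uniq7;size=41;", "Uniq12;size=9;"]
print(relabel_otus(old_labels, "OTU"))
```

With the prefix OTU, as in the example command on this page, the three labels become OTU1, OTU2 and OTU3.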
The -uparseout option specifies a tabbed text output file documenting how the input sequences were classified.
The -uparsealnout option specifies a text file containing a human-readable alignment of each query sequence to its UPARSE-REF model.
Parsimony score options are supported.
Alignment parameters and heuristics are supported.
Example
usearch -cluster_otus derep.fa -otus otus.fa -uparseout out.up -relabel OTU -minsize 2
usearch -usearch_global all_reads.fa -db otus.fa -strand plus -id 0.97 -otutabout otu_table.txt