See also
OTU clustering
UPARSE pipeline
cluster_otus command
Output from cluster_otus is a FASTA file containing OTU representative sequences. Further analysis often requires an OTU table, which requires assigning reads to OTUs.
I recommend creating OTUs from pooled samples, i.e. by concatenating reads for all samples that were sequenced in the same run. This is important for getting the best detection of chimeras and cross-talk, and for getting the best sensitivity to low-abundance sequences that could be lost if individual samples or subsets of samples are clustered separately.
One method for assigning a read to an OTU is to find the OTU representative sequence that has the highest identity with the read. There may be ties, in which case the assignment is ambiguous. This is a database search task: the reads are the query sequences and the OTU representative sequences are the database to be searched. A threshold of 97% identity is typically used. Reads that do not match any OTU at this identity or higher are discarded.
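The assignment rule described above can be sketched in a few lines of Python. This is only an illustration, not how usearch_global works internally: the `identity` function here is a simplified stand-in that compares equal-length sequences position by position, whereas a real search computes identity from an alignment. The names `identity` and `assign_read` are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumption: a stand-in for alignment identity)."""
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def assign_read(read: str, otus: dict[str, str], min_id: float = 0.97):
    """Return (otu_id, identity, tied) for the best OTU, or None if the read is discarded."""
    best_id, best_otus = 0.0, []
    for otu_id, rep in otus.items():
        ident = identity(read, rep)
        if ident > best_id:
            # New best hit: forget any earlier candidates.
            best_id, best_otus = ident, [otu_id]
        elif ident == best_id and best_otus:
            # Same identity as the current best: the assignment is ambiguous.
            best_otus.append(otu_id)
    if not best_otus or best_id < min_id:
        return None  # no OTU reaches the threshold
    return best_otus[0], best_id, len(best_otus) > 1
```

When `tied` is true, this sketch simply reports the first of the tied OTUs; how ties should be resolved in practice is a separate choice.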
The usearch_global command supports generating OTU tables using the options described below.
Sequence labels must contain sample identifiers (in the input set) and OTU identifiers (in the database), as explained later on this page. This means that you cannot use the input file for cluster_otus in this step: several samples often share the same unique sequence, so the dereplicated (unique) sequence labels either have no sample identifier or have a misleading one, because the same sequence may be found in other samples. The usual way to deal with this is to go back to the "raw" reads after merging or truncating to a fixed length. See sample identifiers for ways to add sample identifiers to the read labels.
-otutabout filename
QIIME classic tabbed text format.
-biomout filename
BIOM v1.0 format (JSON). The biom utility can be used to convert to BIOM v2.1 format (HDF5).
-mothur_shared_out filename
Mothur "shared" file.
The OTU sequences must have OTU identifiers in the labels
See OTU identifiers for details.
Reads must have sample identifiers in the labels
See sample identifiers for details.
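To see why both kinds of identifier are needed, here is a minimal Python sketch that tallies read-to-OTU assignments into a tab-separated table of counts. It rests on several assumptions for illustration: the read labels carry a `sample=NAME;` annotation, the table starts with a `#OTU ID` header line, and `sample_of` and `build_otu_table` are hypothetical helper names. It is not a reimplementation of the -otutabout option.

```python
import re
from collections import Counter

# Assumption: sample identifiers appear in read labels as "sample=NAME;".
SAMPLE_RE = re.compile(r"sample=([^;]+);")


def sample_of(label: str) -> str:
    """Extract the sample identifier from a read label."""
    m = SAMPLE_RE.search(label)
    if m is None:
        raise ValueError(f"no sample identifier in label: {label!r}")
    return m.group(1)


def build_otu_table(assignments) -> str:
    """Build a tabbed count table from (read_label, otu_id) pairs of mapped reads."""
    counts = Counter((otu, sample_of(label)) for label, otu in assignments)
    otus = sorted({otu for otu, _ in counts})
    samples = sorted({sample for _, sample in counts})
    # Header layout is an assumption: one column per sample after "#OTU ID".
    lines = ["#OTU ID\t" + "\t".join(samples)]
    for otu in otus:
        row = [str(counts[(otu, sample)]) for sample in samples]
        lines.append(otu + "\t" + "\t".join(row))
    return "\n".join(lines) + "\n"
```

Each row is an OTU, each column is a sample, and each cell is the number of reads from that sample assigned to that OTU. A read label without a sample identifier cannot be placed in any column, which is why the sketch raises an error in that case.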
Singletons and low-quality reads
You can (probably should) include singletons and reads which did not pass the quality filter. If they are 97% similar to an OTU sequence, they are probably good enough to count even if they do have some sequencer or PCR error.
Reads should be trimmed
The reads should be trimmed in the same way (if any) as the input sequences you used for cluster_otus.
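As one illustration of consistent trimming, suppose the cluster_otus input was truncated to a fixed length. The Python sketch below applies the same truncation to the reads before mapping. The function name `truncate_reads` is hypothetical, and skipping reads shorter than the target length is an assumption of this sketch, not a statement about what any usearch command does.

```python
def truncate_reads(reads, length: int):
    """Yield (label, sequence) pairs truncated to `length` bases.

    Assumption: reads shorter than `length` are skipped rather than padded.
    """
    for label, seq in reads:
        if len(seq) >= length:
            yield label, seq[:length]
```

For example, `list(truncate_reads([("a", "ACGTAC"), ("b", "ACG")], 4))` keeps read `a` as `ACGT` and skips read `b`.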
Typical command to generate an OTU table
With correctly formatted labels, the OTU table is generated using a command like this.
usearch -usearch_global reads.fa -db otus.fa -strand plus -id 0.97 \
  -otutabout otu_table.txt -biomout otu_table.json