See also
otupipe
What is an OTU?
Finding species
SSU reference databases
Typical commands for OTU clustering

Preprocessing
Raw reads should be filtered before processing with USEARCH. The procedure depends on whether the reads are paired (Illumina) or unpaired (454 or Illumina). Currently, USEARCH does not support "raw" read files such as flowgrams (sff format) or FASTQ, so preprocessing must be done using third-party tools. If you think USEARCH could do a better job than existing software, or you can't find a suitable tool for the job, let me know.
Preprocessing steps include (1) quality filtering
and (2) global trimming. In the case of
paired reads, global trimming is accomplished by matching to both primers and
overlapping.
Preprocessing paired reads
Discard reads that (1) do not match the primers, (2) have ambiguous bases (Ns), or (3) do not have perfectly matching overlaps.
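To illustrate the overlap requirement, here is a minimal Python sketch of merging a read pair with a zero-mismatch overlap. It is not part of USEARCH; the function names and the minimum overlap length are illustrative choices, and real paired-read mergers handle quality scores and mismatches.

```python
def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[c] for c in reversed(seq))

def merge_pair(fwd, rev, min_overlap=16):
    """Merge a read pair, requiring a perfect (zero-mismatch) overlap
    between the forward read and the reverse-complemented reverse read.
    Returns the merged sequence, or None if the pair is discarded."""
    r = revcomp(rev)
    # Try the longest possible overlap first, down to min_overlap
    for ov in range(min(len(fwd), len(r)), min_overlap - 1, -1):
        if fwd[-ov:] == r[:ov]:
            return fwd + r[ov:]
    return None
```

A pair with any mismatch in the overlap region (or no overlap of at least min_overlap bases) returns None and would be discarded.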
Preprocessing unpaired reads
Discard reads that (1) have low average quality scores or (2) contain ambiguous bases (Ns). Truncate each read at the first low-quality base, then truncate reads to a fixed length and discard reads shorter than the selected length.
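The unpaired-read steps above can be sketched in a few lines of Python. This is an illustration only, not USEARCH code; the threshold values (average quality, truncation quality, fixed length) are placeholders you would tune for your data.

```python
def preprocess_read(seq, quals, min_avg_q=20, trunc_q=10, fixed_len=200):
    """Quality filtering and global trimming for one unpaired read.
    quals is a list of per-base Phred quality scores.
    Returns the processed sequence, or None if the read is discarded."""
    if sum(quals) / len(quals) < min_avg_q:
        return None                       # low average quality
    if "N" in seq:
        return None                       # ambiguous bases
    for i, q in enumerate(quals):
        if q < trunc_q:                   # truncate at first low-quality base
            seq = seq[:i]
            break
    if len(seq) < fixed_len:
        return None                       # too short after truncation
    return seq[:fixed_len]                # global trim to fixed length
```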
OTU clustering
The recommended steps for creating OTUs from preprocessed reads are summarized below, with example command-line options for each step. See also typical USEARCH commands for OTU clustering.
Dereplicate
Discard duplicated sequences, annotate with cluster sizes and sort by decreasing cluster size. The derep_fulllength command can do this in a single pass. With unpaired Illumina reads, I suggest setting a minimum cluster size by using the minuniquesize option. The goal is to reduce the number of reads with errors. I typically use -minuniquesize 4, but so far there is no good benchmark for tuning this parameter, so the value 4 is just an educated guess.
Example: -derep_fulllength reads.fa -sizeout -output derep.fa
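Conceptually, dereplication is just exact-match counting. Here is a toy Python sketch of what derep_fulllength with -sizeout and a minimum unique size does; the label format mimics the ;size=N; annotations, but the function itself is illustrative, not USEARCH.

```python
from collections import Counter

def dereplicate(reads, min_unique_size=1):
    """Full-length dereplication: collapse identical reads, annotate
    with cluster sizes, sort by decreasing size, and drop clusters
    smaller than min_unique_size (analogous to -minuniquesize)."""
    counts = Counter(reads)
    uniques = [(seq, n) for seq, n in counts.items() if n >= min_unique_size]
    uniques.sort(key=lambda x: -x[1])
    # Emit FASTA-style labels with size annotations, as -sizeout would
    return [(f"Uniq{i + 1};size={n};", seq) for i, (seq, n) in enumerate(uniques)]
```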
Denoise
Cluster at 99% identity. This has the effect of removing most sequences with up to 1% errors; setting a minimum cluster size (next step) tends to discard reads with >1% errors.
Example: -cluster_smallmem derep.fa -id 0.99 -centroids denoised.fa -sizein -sizeout
Abundance sort
Sort denoised reads in order of decreasing abundance, as required by uchime_denovo. The minimum cluster size should be set based on test data, e.g. a mock community. I find a value of 3 typically works well for pyrosequencing (454) reads.
Example: -sortbysize denoised.fa -output denoised.fas -minsize 3
Chimera filter
The uchime_denovo and uchime_ref commands should both be used.
Example: -uchime_denovo denoised.fas -nonchimeras nonch_denovo.fa
Abundance sort
Use the sortbysize command to sort by decreasing abundance. More abundant sequences make better centroids.
Example: -sortbysize nonch_ref.fa -output sorted.fa
Clustering
Use cluster_smallmem at the desired identity threshold, which is typically 97%. You should not use cluster_fast because it re-sorts the input sequences.
Example: -cluster_smallmem sorted.fa -id 0.97 -sizein -sizeout -centroids otus.fa
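The reason input order matters is that greedy centroid-based clustering assigns each sequence to the first centroid it matches, so the most abundant (and therefore most likely error-free) sequences should be seen first. Here is a toy Python sketch of that strategy. It is not the actual USEARCH algorithm: real clustering computes identity from an alignment, while this toy identity function only compares equal-length sequences position by position.

```python
def identity(a, b):
    """Toy identity for equal-length sequences (no alignment)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, id_threshold=0.97):
    """Greedy centroid-based clustering in input order: each sequence
    joins the first centroid it matches at >= id_threshold; otherwise
    it becomes a new centroid. Input should already be sorted by
    decreasing abundance."""
    centroids = []
    assignments = []
    for s in seqs:
        for ci, c in enumerate(centroids):
            if identity(s, c) >= id_threshold:
                assignments.append(ci)
                break
        else:
            centroids.append(s)
            assignments.append(len(centroids) - 1)
    return centroids, assignments
```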
OTU size threshold (optional)
If desired, OTUs smaller than a minimum size threshold (M) can be discarded. I suggest keeping small clusters only if they match a suitable reference database, e.g. 16S sequences from isolates. Smaller clusters are more likely to be spurious due to read errors, undetected chimeras and other artifacts. The size threshold M controls sensitivity vs. specificity in a similar way to a BLAST E-value. With small M, you will tend to get more valid biological OTUs, but also more spurious OTUs. With large M, you will tend to get fewer spurious OTUs but lower sensitivity. A size threshold can be imposed using the minsize option of sortbysize. See OTU size threshold for discussion of how to set M.
Example: -sortbysize otus.fa -minsize 4 -output otus_minsiz4.fa
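The size-threshold step amounts to reading the ;size=N; annotation on each OTU label and dropping records below M. A minimal Python sketch, assuming size annotations in the label (the function name and record format are illustrative, not USEARCH):

```python
import re

def filter_by_size(fasta_records, min_size):
    """Drop OTUs whose ;size=N; label annotation is below min_size
    (what -sortbysize ... -minsize does, minus the sorting).
    fasta_records is a list of (label, sequence) pairs."""
    kept = []
    for label, seq in fasta_records:
        m = re.search(r"size=(\d+)", label)
        n = int(m.group(1)) if m else 1   # unannotated records count as 1
        if n >= min_size:
            kept.append((label, seq))
    return kept
```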