OTU clustering
 
See also
  otupipe
  What is an OTU?
  Finding species
  SSU reference databases
  Typical commands for OTU clustering
Preprocessing
Raw reads should be filtered before processing with USEARCH. The procedure depends on whether the reads are paired (Illumina) or unpaired (454 or Illumina). Currently, USEARCH does not support "raw" read files such as flowgrams (sff format) or FASTQ, so preprocessing must be done using third-party tools. If you think USEARCH could do a better job than existing software, or you can't find a suitable tool for the job, let me know.

Preprocessing steps include (1) quality filtering and (2) global trimming. In the case of paired reads, global trimming is accomplished by matching to both primers and overlapping.

Preprocessing paired reads
Discard reads that (1) do not match the primers, (2) have ambiguous bases (Ns), or (3) do not have perfectly matching overlaps.
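
The perfect-overlap requirement can be illustrated with a toy Python sketch. Note that USEARCH itself does not merge pairs (see the preprocessing note above), so this is purely illustrative, and merge_pair / revcomp are hypothetical names, not part of any real tool:

```python
# Toy sketch of the "perfect matching overlap" requirement for read pairs.
COMP = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def revcomp(seq):
    return "".join(COMP[b] for b in reversed(seq))

def merge_pair(fwd, rev, min_overlap=5):
    """Return the merged sequence if the 3' end of fwd perfectly
    overlaps the reverse complement of rev, else None (discard)."""
    rc = revcomp(rev)
    # Try the longest possible overlap first.
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        if fwd[-olen:] == rc[:olen]:
            return fwd + rc[olen:]
    return None
```

A pair whose overlap region contains even one mismatch is rejected, which is the point of the filter: disagreement between the two reads signals a sequencing error.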

Preprocessing unpaired reads
Discard reads that (1) have low average quality scores or (2) contain ambiguous bases (Ns). Truncate each read at the first low-quality base, then truncate reads to a fixed length and discard reads shorter than the selected length.
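
The unpaired-read filter described above can be sketched in Python. The thresholds here (min_avg_q, trunc_q, trunc_len) are illustrative placeholders, not USEARCH defaults:

```python
# Sketch of the unpaired-read filter: discard on Ns or low average
# quality, truncate at the first low-quality base, then global-trim
# to a fixed length. Thresholds are illustrative only.
def filter_read(seq, quals, min_avg_q=20, trunc_q=10, trunc_len=200):
    if "N" in seq:
        return None                      # discard reads with ambiguous bases
    if sum(quals) / len(quals) < min_avg_q:
        return None                      # discard low average quality
    for i, q in enumerate(quals):        # truncate at first low-quality base
        if q < trunc_q:
            seq = seq[:i]
            break
    if len(seq) < trunc_len:
        return None                      # too short after truncation
    return seq[:trunc_len]               # global trim to fixed length
```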

OTU clustering
The recommended steps for creating OTUs from preprocessed reads are summarized in the table below. See also typical USEARCH commands for OTU clustering.

Each step below gives a description followed by example command-line options.
Dereplicate: Discard duplicated sequences, annotate with cluster sizes and sort by decreasing cluster size. The derep_fulllength command can do this in a single pass. With unpaired Illumina reads, I suggest setting a minimum cluster size by using the minuniquesize option. The goal is to reduce the number of reads with errors. I typically use -minuniquesize 4, but so far there is no good benchmark for tuning this parameter, so the value 4 is just an educated guess.

  -derep_fulllength reads.fa -sizeout -output derep.fa

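Dereplication with -sizeout-style annotations amounts to counting identical sequences and sorting by count. A minimal Python sketch (the (label, sequence) record format and the Uniq labels are assumptions for illustration):

```python
# Sketch of full-length dereplication: count identical sequences,
# annotate labels with ";size=N;" and sort by decreasing cluster size.
from collections import Counter

def derep_fulllength(reads, minuniquesize=1):
    counts = Counter(seq for _, seq in reads)
    # Sort unique sequences by decreasing cluster size.
    uniques = sorted(counts.items(), key=lambda kv: -kv[1])
    return [(f"Uniq{i + 1};size={n};", seq)
            for i, (seq, n) in enumerate(uniques) if n >= minuniquesize]
```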
Denoise: Cluster at 99% identity. This has the effect of removing most sequences with up to 1% errors; setting a minimum cluster size (next step) tends to discard reads with >1% errors.

  -cluster_smallmem derep.fa -id 0.99 -centroids denoised.fa -sizein -sizeout
Abundance sort: Sort denoised reads in order of decreasing abundance, as required by uchime_denovo. The minimum cluster size should be set based on test data, e.g. a mock community. I find values of 3 typically work well for pyrosequencing reads (454).

  -sortbysize denoised.fa -output denoised.fas -minsize 3
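
Sorting by the ;size=N; annotation with a minimum-size cutoff can be sketched as follows (the (label, sequence) record format is an assumption for illustration):

```python
# Sketch of sortbysize: order records by the ";size=N;" annotation,
# dropping clusters below minsize (labels follow the -sizeout convention).
import re

def sortbysize(records, minsize=1):
    def size(label):
        m = re.search(r"size=(\d+)", label)
        return int(m.group(1)) if m else 1
    kept = [r for r in records if size(r[0]) >= minsize]
    return sorted(kept, key=lambda r: -size(r[0]))
```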
Chimera filter: The uchime_denovo and uchime_ref commands should both be used.

  -uchime_denovo denoised.fas -nonchimeras nonch_denovo.fa
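
As a heavily simplified illustration of the de novo idea: a chimera looks like a hybrid of two more-abundant parent sequences joined at a breakpoint. The toy check below only handles exact, end-anchored segments and is nothing like real UCHIME scoring; looks_chimeric is a hypothetical name:

```python
# Toy illustration of de novo chimera detection: flag a query whose
# left segment matches one parent and whose right segment matches a
# different parent, while no single parent explains the whole query.
def looks_chimeric(query, parents):
    if query in parents:
        return False                     # a whole-sequence match is not chimeric
    for i in range(1, len(query)):       # try every possible breakpoint
        left, right = query[:i], query[i:]
        left_hits = [p for p in parents if p.startswith(left)]
        right_hits = [p for p in parents if p.endswith(right)]
        if any(l is not r for l in left_hits for r in right_hits):
            return True
    return False
```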
Abundance sort: Use the sortbysize command to sort by decreasing abundance. More abundant sequences make better centroids.

  -sortbysize nonch_ref.fa -output sorted.fa

Clustering: Use cluster_smallmem at the desired threshold, which is typically 97%. You should not use cluster_fast because it re-sorts the input sequences.

  -cluster_smallmem sorted.fa -id 0.97 -sizein -sizeout -centroids otus.fa
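
The greedy strategy behind this kind of clustering can be sketched as follows. Input must arrive sorted by decreasing abundance; each sequence either joins an existing centroid at or above the identity threshold or becomes a new centroid. The naive per-position identity function is a stand-in for USEARCH's alignment-based identity and assumes equal-length sequences; the function names are hypothetical:

```python
# Toy greedy centroid clustering in the spirit of cluster_smallmem.
def identity(a, b):
    """Naive per-position identity; real USEARCH uses alignments."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    centroids = []
    for s in seqs:                       # seqs sorted by decreasing abundance
        if not any(identity(s, c) >= threshold for c in centroids):
            centroids.append(s)          # no match: found a new centroid
    return centroids
```

Because the input is abundance-sorted, the most abundant (and so presumably lowest-error) sequence in each cluster becomes the centroid, which is why the abundance sort in the previous step matters.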
OTU size threshold (optional): If desired, OTUs smaller than a minimum size threshold (M) can be discarded. I suggest keeping small clusters only if they match a suitable reference database, e.g. 16S sequences from isolates. Smaller clusters are more likely to be spurious due to read errors, undetected chimeras and other artifacts. The size threshold M controls sensitivity vs. specificity in a similar way to a BLAST E-value. With small M, you will tend to get more valid biological OTUs, but also more spurious OTUs. With large M, you will tend to get lower sensitivity but fewer bad OTUs. A size threshold can be imposed using the minsize option of sortbysize. See OTU size threshold for discussion of how to set M.

  -sortbysize otus.fa -minsize 4 -output otus_minsiz4.fa