See also
UPARSE home page
OTU benchmark results
UPARSE algorithm
Example UPARSE command lines
A UPARSE pipeline clusters NGS amplicon reads into OTUs using the cluster_otus command. This page discusses the pre- and post-processing steps that are typically required to get the best results from cluster_otus in practice.
It is not possible to give a single set of command lines that will work for all reads because there are many variations in the input data, especially in read layout. This page summarizes the steps that are usually performed by a pipeline and provides links to further discussion and details of the command lines that can be used. The example command lines linked above are a good starting point.
Reads with Phred (quality) scores
I strongly recommend starting from "raw" reads, i.e. the reads
originally provided by the sequencing machine base-calling software. Phred
scores should be retained, and you should do quality filtering with USEARCH
rather than using
reads that have already been filtered by third-party software. Start by converting to FASTQ format, if needed. If you
have 454 reads in FASTA +
QUAL format,
you can use the
faqual2fastq.py script to convert.
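For example, a conversion step might look like the sketch below; the file names are placeholders, and I am assuming the script takes the FASTA and QUAL file names as arguments and writes FASTQ to standard output (check the script's usage message for your copy).

  # Convert 454 FASTA + QUAL files to FASTQ (file names are placeholders)
  python faqual2fastq.py reads.fasta reads.qual > reads.fastq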
Sample pooling
I recommend combining reads from as many samples as possible if they contain
similar communities and / or if you are planning to compare samples using
measures such as beta diversity. See sample pooling
for discussion.
Read quality filtering
Quality
filtering of the reads should be done using USEARCH because I
believe that the maximum expected error
filtering method is much better than most other filters, e.g. those based on
average Q scores. I recommend
choosing quality filtering parameters
manually for each run based on Phred score statistics.
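As a rough sketch, a filtering command might look like the following; the truncation length of 250 and the expected error threshold of 1.0 are placeholder values that should be chosen from the quality statistics of your own run, not copied blindly.

  # Truncate to a fixed length and discard reads with more than
  # 1.0 expected errors (parameter values are placeholders)
  usearch -fastq_filter reads.fastq -fastq_trunclen 250 -fastq_maxee 1.0 -fastaout filtered.fa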
Demultiplexed Illumina reads
See here for discussion of adding sample labels to
demultiplexed Illumina reads, i.e. reads that
are already split into separate FASTQ files by barcode/sample identifier.
Flowgram denoising
If you have 454 reads, then as an alternative to quality filtering with
USEARCH you can generate FASTA format reads by denoising flowgrams using a
third-party algorithm [Pubmed:20805793,
Pubmed:19668203]. This
may give a small improvement in OTU sequence accuracy compared to Phred score
quality filtering, but denoising can be very computationally intensive and I
generally don't consider it worth the effort. If you do choose to use denoising,
then you should convert the output from the denoising program so that
size annotations are added to the labels in
USEARCH format, remove barcodes (adding sample identifiers to the read labels),
and then skip ahead to the abundance sort step below. If you use a denoising
package (e.g. AmpliconNoise)
that includes a chimera filter (Perseus in the case of AN), then you should turn
off the chimera filter, i.e., extract denoised reads before any chimera
filtering step.
FASTA reads
See "Flowgram denoising" above if the FASTA reads were produced by a
denoising program. If "raw" reads or reads that have been quality-filtered by a
third-party program are only available in FASTA format, then you should start by
trimming them to a fixed length, unless the reads contain full-length amplicons,
in which case this step may not be necessary. See
global trimming for discussion. Since quality information is not available,
you cannot choose the trim length based on predicted error rates. Instead, you
could choose a value that is, say, a few percent shorter than the average length in
order to maximize the number of bases retained. However, you should be cautious
here because quality tends to get worse towards the end of a read. For example,
if you have 454 reads that are, say, 400 bases or longer, then it might be
better to truncate to a shorter length, e.g. 250 or 300 bases, as this could
substantially reduce the error rate.
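If your USEARCH version supports the fastx_truncate command, a trimming step might look like the sketch below; the trim length of 250 and the file names are placeholders.

  # Trim FASTA reads to a fixed length (250 is a placeholder value)
  usearch -fastx_truncate reads.fa -trunclen 250 -fastaout trimmed.fa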
Length trimming
Trimming to a fixed position is
critically important for achieving the best results. For unpaired
reads, trim to a fixed length. For overlapping paired reads, the reverse read
should start at an amplification primer, which achieves an equivalent result.
The important point is that identical or very similar reads must be globally
alignable with no terminal gaps. See
global trimming for discussion.
Paired reads
Paired reads should be merged using the
fastq_mergepairs command before quality filtering. The only quality
filtering that should be done at this stage is truncating reads at the first low
score using the -fastq_truncqual option to fastq_mergepairs; otherwise, you may
find that many pairs fail to align to each other due to poor-quality tails in the reads.
After merging, you can use fastq_filter with a
maximum expected error threshold. Length truncation is typically not needed
since the merged pairs usually cover full-length amplicons (see
global trimming for discussion).
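As a sketch, merging and filtering might look like the following; the file names, the truncation quality of 3 and the expected error threshold of 1.0 are placeholder values.

  # Merge read pairs, truncating each read at the first base with Q<3
  usearch -fastq_mergepairs fwd.fastq -reverse rev.fastq -fastq_truncqual 3 -fastqout merged.fastq

  # Quality filter the merged reads by maximum expected errors
  usearch -fastq_filter merged.fastq -fastq_maxee 1.0 -fastaout filtered.fa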
Barcodes
Barcodes and any other non-biological sequence must be stripped from the
reads before dereplication. This can be done using the
fastq_strip_barcode_relabel.py script or any other convenient method. The
barcode must be removed before dereplication so that identical biological
sequences from different samples can be recognized as duplicates. The barcode sequence or a sample label should be inserted
into the read label so that the sample of origin is known when the
OTU table is constructed later. It is recommended to strip barcodes and other
non-biological sequence before quality filtering.
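A demultiplexing step using the script might look like the sketch below; the file names, the primer sequence and the label prefix are placeholders, and the argument order is my assumption, so check the script's usage message for your copy.

  # Strip barcode and primer and add a sample label to each read label
  # (arguments shown are assumptions; check the script's usage message)
  python fastq_strip_barcode_relabel.py reads.fastq GTGCCAGCMGCCGCGGTAA barcodes.fa Sample > relabeled.fastq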
Dereplication
Input to dereplication is a set of reads in
FASTA format with non-biological sequences such as barcodes stripped. The reads
should be globally trimmed before
dereplication, and quality filtered if
possible as described above. You should use the
derep_fulllength command with the -sizeout option. I recommend
pooling samples before dereplication.
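A dereplication command might look like the sketch below; filtered.fa and derep.fa are placeholder file names, and depending on your USEARCH version the output option may be -fastaout rather than -output.

  # Dereplicate the reads, adding size= annotations to the labels
  usearch -derep_fulllength filtered.fa -sizeout -output derep.fa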
Abundance sort
Use the sortbysize command to sort the
dereplicated reads by decreasing abundance. To discard
singletons (usually recommended), use the
-minsize 2 option.
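For example (file names are placeholders, and the output option may be -fastaout in some versions):

  # Sort by decreasing abundance and discard singletons
  usearch -sortbysize derep.fa -minsize 2 -output sorted.fa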
OTU clustering
To create OTUs, run the cluster_otus command
with abundance-sorted reads as input. This will generate a set of OTU
representative sequences.
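For example, with placeholder file names:

  # Cluster the abundance-sorted reads into OTUs
  usearch -cluster_otus sorted.fa -otus otus.fa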
Chimera filtering
The cluster_otus command discards reads that have chimeric models built from
more abundant reads. However, a few chimeras may be missed, especially if they
have parents that are absent from the reads or are present with very low
abundance. It is therefore recommended to perform a reference-based chimera filtering
step using UCHIME if a suitable database is
available. Use the uchime_ref command for this
step with the OTU representative sequences as input and the
-nonchimeras option to get a chimera-filtered set of OTU sequences. For
the 16S gene, I recommend the
gold database (do
not use a large 16S database like Greengenes). For
the ITS region, you could try using the UNITE
database as a reference.
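A reference-based filtering step might look like the sketch below; gold.fa stands for the reference database file and the other file names are placeholders.

  # Reference-based chimera filtering of the OTU sequences
  usearch -uchime_ref otus.fa -db gold.fa -strand plus -nonchimeras otus_chimfree.fa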
Labeling OTUs
At this stage the OTU sequence labels are usually the original read label
with a size annotation appended. Note that this is
the size of the dereplication cluster, i.e. the number of reads having this
unique sequence, not the number of reads assigned to the OTU (see Creating
an OTU table below). It is therefore useful to generate a new set of labels for
the OTUs, e.g. OTU_1, OTU_2 ... OTU_N where N is the number of OTUs. This can be
done using the fasta_number.py script.
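For example, assuming the script takes the input file and a label prefix as arguments and writes to standard output (file names are placeholders):

  # Relabel the OTU sequences as OTU_1, OTU_2, ...
  python fasta_number.py otus_chimfree.fa OTU_ > otus_labeled.fa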
Creating an OTU table
To create an OTU table, you should first map
reads to OTUs. Then you can use the
uc2otutab.py
script to generate the OTU table.
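As a sketch, the mapping and table-building steps might look like the following; reads.fa stands for the reads with sample labels in their headers, the 97% identity threshold matches the default OTU radius, and the file names are placeholders.

  # Map reads to the OTU representative sequences
  usearch -usearch_global reads.fa -db otus_labeled.fa -strand plus -id 0.97 -uc map.uc

  # Convert the read-to-OTU mapping into an OTU table
  python uc2otutab.py map.uc > otu_table.txt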
Taxonomy assignment
You can use the utax command
(requires USEARCH version 8) to assign taxonomy
to the OTU representative sequences.
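As a rough sketch only, a utax command might look like the following; the database file and output options shown are assumptions that depend on the USEARCH version, so check the utax documentation before using them.

  # Assign taxonomy to the OTU sequences (options are assumptions;
  # check the utax documentation for your USEARCH version)
  usearch -utax otus_labeled.fa -db 16s_ref.udb -strand both -utaxout tax.txt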