See also
UNOISE paper
A UNOISE pipeline recovers biological sequences from an amplicon sequencing experiment by performing error-correction (denoising) of Illumina reads. UNOISE is not designed for other sequencing technologies, e.g. 454 pyrosequencing reads. The UNOISE algorithm is implemented in the unoise command.
See Tutorials for example scripts & data.
Reads in FASTQ format
I strongly recommend starting from "raw" reads, i.e. the reads originally
provided by the sequencing machine base-calling software. You should do quality
filtering with USEARCH rather than using reads that have already been filtered
by third-party software.
Reads in FASTA format
The unoise command also supports reads in FASTA format. You may need to use
FASTA input if your reads have already been quality filtered by some other
method and you don't have access to the original FASTQ reads.
Sample pooling
I recommend combining reads from as many samples as possible. See sample pooling
for discussion.
Read quality filtering
Quality filtering of the reads should be done using USEARCH because the
maximum expected errors filtering method is much more effective at suppressing
reads with high error rates than other filters, e.g. those based on
average Q scores. A maximum expected errors threshold of 1.0
is a good default choice (-fastq_maxee 1.0 option of
fastq_filter or -fastq_merge_maxee 1.0 option
of fastq_mergepairs). You can use
fastx_learn to estimate the error rate after
filtering.
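To illustrate why expected errors separate good reads from bad ones better than average Q does, here is a minimal Python sketch. It assumes Phred+33 quality encoding, and the function names are hypothetical; it is not the USEARCH implementation, only the arithmetic behind the idea (the expected number of errors is the sum of the per-base error probabilities, P = 10^(-Q/10)).

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, P = 10^(-Q/10)."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality)


def average_q(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in the read."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def passes_maxee(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read only if its expected errors do not exceed max_ee."""
    return expected_errors(quality) <= max_ee


# Hypothetical 100-base read: 90 bases at Q40 ('I') and 10 bases at Q2 ('#').
qual = "I" * 90 + "#" * 10
print(round(average_q(qual), 1))        # average Q of the read
print(round(expected_errors(qual), 2))  # expected number of errors
print(passes_maxee(qual))               # would a threshold of 1.0 keep it?
```

For this made-up read the average Q is 36.2, which looks excellent, yet roughly 6.3 errors are expected, so a threshold of 1.0 rejects it. A few very low-quality bases barely move the average but dominate the expected error count.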
Global trimming
You should trim reads
to a fixed length unless the sequences are contigs generated by a paired
read assembler, in which case it may not be necessary. You should also trim
any primer-binding sequences at the ends of the reads. See
global trimming for discussion.
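As a rough illustration of what trimming to a fixed length means, the following Python sketch strips a leading primer and then truncates each read to the same length, discarding reads that are too short. It assumes the primer sits at the start of the read and has a known, fixed length; the function name and parameters are hypothetical and do not correspond to any particular USEARCH option.

```python
from typing import Optional


def global_trim(seq: str, length: int, primer_len: int = 0) -> Optional[str]:
    """Strip primer_len leading bases, then truncate to exactly `length` bases.

    Returns None when the read is too short, so that every surviving
    read starts at the same position and has the same length.
    """
    trimmed = seq[primer_len:]
    if len(trimmed) < length:
        return None  # too short: discard rather than keep a shorter read
    return trimmed[:length]


# Toy reads: a 2-base "primer" followed by the amplicon sequence.
reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGTACGT"]
kept = [t for t in (global_trim(r, 5, primer_len=2) for r in reads) if t]
print(kept)
```

Discarding short reads instead of keeping them matters here because reads covering the same template must be comparable base by base for abundances of identical sequences to be counted correctly.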
Unique sequences
Get the set of
unique sequences with abundances using the
fastx_uniques command with the -sizeout option. This will be the input
file for the unoise command.
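The idea of unique sequences with abundances can be sketched in a few lines of Python. This is only a conceptual illustration, not how fastx_uniques is implemented: the function name and the "Uniq" label prefix are assumptions, while the ;size=N; annotation follows the USEARCH size annotation convention.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def unique_with_sizes(seqs: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse identical sequences and annotate each with its abundance.

    Returns (label, sequence) pairs ordered by decreasing abundance;
    ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(s.upper() for s in seqs)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (f"Uniq{i};size={n};", seq)
        for i, (seq, n) in enumerate(ordered, start=1)
    ]


reads = ["ACGT", "acgt", "TTTT", "ACGT", "GGGG", "TTTT"]
for label, seq in unique_with_sizes(reads):
    print(f">{label}\n{seq}")
```

The abundance carried in the label is what the denoising step relies on, which is why the -sizeout option matters.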
Creating an OTU table
Denoised sequences are valid OTUs (the clustering identity is 100%, if you
like) and can be used to generate an OTU table
in just the same way as 97% OTUs. Reads must have
sample identifiers for this to work. The
simplest way to do this is usually to use the -relabel @ option of fastq_filter
or fastq_mergepairs.
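To show why sample identifiers in the read labels are needed, here is a small Python sketch that tallies an OTU table from (read label, OTU) assignments. It assumes read labels of the form sample.number, e.g. SampleA.17, with the sample name being the text before the first period; that layout and all names in the code are assumptions made for illustration, not a description of how USEARCH parses labels.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sample_of(read_label: str) -> str:
    """Sample identifier: text before the first period (assumed label layout)."""
    return read_label.split(".", 1)[0]


def otu_table(
    assignments: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, int]]:
    """Count reads per OTU and per sample from (read_label, otu) pairs."""
    table: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for read_label, otu in assignments:
        table[otu][sample_of(read_label)] += 1
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {otu: dict(samples) for otu, samples in table.items()}


# Hypothetical read-to-OTU assignments.
hits = [("A.1", "Otu1"), ("A.2", "Otu1"), ("B.1", "Otu1"), ("B.2", "Otu2")]
print(otu_table(hits))
```

Without a sample identifier in each label there would be no way to decide which column of the table a read belongs to.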
Example commands
The following commands are for typical Illumina reads with one
pair of FASTQ files (R1 and R2) per sample.
usearch -fastq_mergepairs *_R1*.fastq -relabel @ -fastqout reads.fq
usearch -fastq_filter reads.fq -fastq_maxee 1.0 -fastaout filtered.fa
usearch -fastx_uniques filtered.fa -fastaout uniques.fa -sizeout
usearch -unoise uniques.fa -tabbedout out.txt -fastaout denoised.fa
usearch -usearch_global reads.fq -db denoised.fa -strand plus -id 0.97 -otutabout otu_table.txt