See also
UNOISE paper
The UNOISE algorithm performs error-correction (denoising) on amplicon reads. It is implemented in the unoise command. UNOISE is designed for Illumina reads, not earlier technologies such as 454 pyrosequencing.
Correct biological sequences are recovered from the reads, resolving distinct sequences down to a single difference (sometimes) or two or more differences (almost always). I consider this approach superior to traditional OTU clustering at 97% identity because OTUs may merge different species (or more generally, different phenotypes) with distinct sequences while denoising gives the best possible resolution.
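As a rough illustration of the resolution argument above, the sketch below computes percent identity for two hypothetical, pre-aligned amplicon sequences that differ at two positions. The sequences and the helper name `percent_identity` are invented for this example. Real OTU clustering uses alignment and a greedy centroid procedure rather than a single pairwise comparison.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical 100-nt amplicon (assumption, not real data).
seq_a = "ACGT" * 25
# Variant with two substitutions: position 0 (A -> T) and position 50 (G -> A).
seq_b = "T" + seq_a[1:50] + "A" + seq_a[51:]

identity = percent_identity(seq_a, seq_b)

# At a 97% threshold the two sequences would fall within the same OTU radius,
# while treating each distinct sequence as its own OTU keeps them apart.
same_otu_at_97 = identity >= 97.0
same_otu_at_100 = identity >= 100.0
```

Here the two sequences are 98% identical, so a 97% threshold would lump them together even though they are distinct sequences.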
Errors are corrected as follows:
- Reads with sequencing error are identified and removed.
- Abundances are corrected (when the OTU table is generated).
- Chimeras are removed.
- PhiX sequences are removed.
- Low-complexity sequences due to Illumina artifacts are removed.
Using denoised sequences as OTUs has two possible drawbacks: a single species may be split into two OTUs due to different strains or paralogs, and the sensitivity is slightly lower because UPARSE can make robust OTUs from unique sequences with abundance as low as 2 while the minimum abundance for UNOISE is around 4. I consider splitting of strains to be a good thing, because they may have different phenotypes and hence different ecological roles. Splitting due to paralogs is relatively benign (what does it matter?), and is not solved by clustering at 97% identity because paralogs have identities <97% in some cases. Splitting or lumping is unavoidable regardless of whether the clustering identity is 97% or 100%, so I would argue that it is better to resolve as many distinct biological sequences as possible. Sensitivity to unique sequences with abundance <8 (summed over all samples) is rarely important in practice.
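To make the abundance thresholds concrete, here is a minimal sketch of counting unique sequences summed over all samples and discarding those below a minimum abundance. It is not the unoise command's implementation. The sample names, reads, and function names are hypothetical, and the thresholds 2 and 4 are simply the values quoted in the text.

```python
from collections import Counter


def unique_abundances(samples: dict[str, list[str]]) -> Counter:
    """Count each unique read sequence, summed over all samples."""
    totals = Counter()
    for reads in samples.values():
        totals.update(reads)
    return totals


def filter_by_min_abundance(totals: Counter, min_abundance: int) -> dict[str, int]:
    """Keep only unique sequences with total abundance >= min_abundance."""
    return {seq: n for seq, n in totals.items() if n >= min_abundance}


# Hypothetical toy reads (assumption, not real data).
samples = {
    "S1": ["AAAA"] * 5 + ["CCCC"] * 2,
    "S2": ["AAAA"] * 4 + ["CCCC"] * 1 + ["GGGG"] * 1,
}

totals = unique_abundances(samples)
kept_at_2 = filter_by_min_abundance(totals, 2)  # threshold quoted for UPARSE
kept_at_4 = filter_by_min_abundance(totals, 4)  # approximate threshold quoted for UNOISE
```

In this toy data the sequence with total abundance 3 survives the lower threshold but not the higher one, which is the kind of low-abundance sensitivity difference described above.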
Denoised sequences are valid OTUs (the clustering identity is 100%, if you like) and can be used to generate an OTU table in just the same way as 97% OTUs.
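The following sketch shows one simplified way to think about an OTU table built from denoised sequences: each read is assigned to a denoised sequence by exact match and counted per sample. This is an assumption made for illustration. Exact matching does not perform the abundance correction mentioned earlier, and the OTU identifiers, sequences, and function name are hypothetical.

```python
def build_otu_table(
    denoised: dict[str, str], samples: dict[str, list[str]]
) -> dict[str, dict[str, int]]:
    """Count, per sample, the reads that exactly match each denoised sequence."""
    seq_to_otu = {seq: otu_id for otu_id, seq in denoised.items()}
    table = {otu_id: {sample: 0 for sample in samples} for otu_id in denoised}
    for sample, reads in samples.items():
        for read in reads:
            otu_id = seq_to_otu.get(read)
            if otu_id is not None:  # reads with no exact match are skipped
                table[otu_id][sample] += 1
    return table


# Hypothetical denoised sequences and reads (assumption, not real data).
denoised = {"Otu1": "ACGTACGT", "Otu2": "ACGTACGA"}
samples = {
    "S1": ["ACGTACGT", "ACGTACGT", "ACGTACGA"],
    "S2": ["ACGTACGA", "ACGTTTTT"],
}

table = build_otu_table(denoised, samples)
```

The two denoised sequences differ at a single position and are still counted as separate OTUs, which corresponds to a clustering identity of 100%.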