UCHIME algorithm

Please note: I now consider de novo UCHIME obsolete for OTU clustering pipelines because UPARSE is a superior approach.

About UCHIME
UCHIME is an algorithm for detecting chimeric sequences. It is implemented in the uchime_ref and uchime_denovo commands. See UCHIME in practice for a detailed discussion of the assumptions and limitations of the method.

The fundamental step in UCHIME is a search for a 3-way alignment of a query sequence (Q) with two parent sequences (A and B) such that one parent is more similar to one segment of the query and the other parent is more similar over another segment, as in the figure below. A score is calculated from the alignment. Higher scores indicate a stronger chimeric signal. A score cutoff set by the ‑minh option (0.28 by default) determines whether the query is classified as a chimera.
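The scoring function itself is not given on this page, so the sketch below is only an illustration of the idea. It assumes the vote-based form recalled from the UCHIME paper: each informative alignment column votes yes, no or abstain, each segment scores Y / (beta * (N + n) + A), and the two segment scores are multiplied. The constants beta=8.0 and n=1.4, the handling of columns where the query differs from both parents, the use of >= for the cutoff, and all function names are assumptions made for this example; the sequences are assumed to be pre-aligned strings of equal length.

```python
MINH = 0.28  # default -minh cutoff quoted in the text above


def segment_votes(query, near, far):
    """Count (yes, no, abstain) votes over the aligned columns of one segment.

    `near` is the parent proposed for this segment, `far` is the other parent.
    """
    yes = no = abstain = 0
    for q, n, f in zip(query, near, far):
        if q == n == f:
            continue        # uninformative column: all three agree
        if q == n:
            yes += 1        # query agrees with the proposed parent only
        elif q == f:
            no += 1         # query agrees with the other parent only
        else:
            abstain += 1    # query differs from both parents (assumed abstain)
    return yes, no, abstain


def segment_score(yes, no, abstain, beta=8.0, pseudo=1.4):
    """Assumed segment score: Y / (beta * (N + n) + A)."""
    return yes / (beta * (no + pseudo) + abstain)


def chimera_score(query, left_parent, right_parent, crossover):
    """Score the model 'left_parent up to crossover, right_parent after it'."""
    left = segment_votes(query[:crossover], left_parent[:crossover],
                         right_parent[:crossover])
    right = segment_votes(query[crossover:], right_parent[crossover:],
                          left_parent[crossover:])
    return segment_score(*left) * segment_score(*right)


def best_chimera_score(query, parent_a, parent_b):
    """Try every crossover and both parent orders; keep the highest score."""
    best = 0.0
    for x in range(1, len(query)):
        best = max(best,
                   chimera_score(query, parent_a, parent_b, x),
                   chimera_score(query, parent_b, parent_a, x))
    return best


def is_chimera(score, minh=MINH):
    """Apply the score cutoff (>= is an assumption of this sketch)."""
    return score >= minh


# Toy data: the query is the first half of A joined to the second half of B.
parent_a = "A" * 40
parent_b = "AC" * 20
query = parent_a[:20] + parent_b[20:]
print(is_chimera(best_chimera_score(query, parent_a, parent_b)))
```

A query identical to one parent collects no yes votes in one of the two segments under either parent order, so its score is zero and it falls below any positive cutoff.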

This search can be performed with a reference database of parent sequences provided by the user, or the database can be constructed de novo from the query sequences. In de novo mode, parent sequences are assumed to be more abundant than their chimeras because the parent amplicons will have undergone more rounds of amplification.

Parameter tuning
UCHIME parameters are optimized for detection of very low-divergence chimeras. In typical applications such as 16S OTU picking from next-gen reads, chimeras with divergence less than the OTU radius may not be important, in which case it may be better to retune the parameters. This can be done by increasing ‑mindiv and reducing ‑minh.

Reference database mode
The reference database should contain high-quality sequences that are believed to be chimera-free. See UCHIME downloads for some suggested 16S reference databases.

De novo mode
In de novo mode, abundance skew is used to distinguish chimeras from parents. Input should be estimated amplicon sequences with integer abundances specified using size annotations, e.g.:

>FQ23BBGZ5;size=23;
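A minimal sketch of reading such a label in a script, assuming the annotation always has the form size=N delimited by semicolons as in the example above. The helper parse_size is hypothetical and is not part of USEARCH.

```python
import re

# Matches a size=N field that is delimited by semicolons or string boundaries.
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")


def parse_size(header):
    """Return (label, abundance) for a FASTA header with a size annotation."""
    text = header.lstrip(">").strip()
    match = SIZE_RE.search(text)
    if match is None:
        raise ValueError(f"no size annotation in header: {header!r}")
    label = text[:match.start()]      # everything before the size field
    return label, int(match.group(1))


label, abundance = parse_size(">FQ23BBGZ5;size=23;")
print(label, abundance)
```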

The minimum abundance skew is specified by the ‑abskew parameter, which defaults to 2.0 (because one round of PCR doubles the abundance).

Abundance is a measure of how many amplicons with a given unique sequence were present in the sample after amplification by PCR. One way to estimate this is to sum the number of reads in the cluster used to estimate the given amplicon sequence. UCHIME uses only ratios of abundances, so the absolute value does not matter. However, the number of reads is a useful indicator in its own right: for example, a cluster containing one read is likely to be spurious.

Amplicon sequences and abundances can be estimated using USEARCH, or by using another algorithm such as Chris Quince's PyroNoise or AmpliconNoise. When using de novo mode, sequences should be estimated amplicons from one sequencing run (strictly, one PCR amplification stage), otherwise abundances may not be directly comparable.
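The abundance skew rule can be sketched as a simple filter over candidate parents. This is an illustration only: the function name, the labels and the abundances are made up, and treating a ratio exactly equal to the skew as passing (>=) is an assumption.

```python
def candidate_parents(query_label, abundances, abskew=2.0):
    """Return labels whose abundance is at least abskew times the query's.

    `abundances` maps sequence label -> integer abundance, for example
    taken from size annotations.
    """
    query_abundance = abundances[query_label]
    return [
        label
        for label, abundance in abundances.items()
        if label != query_label and abundance >= abskew * query_abundance
    ]


# Hypothetical abundances for four amplicon sequences.
abundances = {"seq1": 100, "seq2": 46, "seq3": 23, "seq4": 30}
print(candidate_parents("seq3", abundances))

# Only ratios matter: scaling every abundance by the same factor
# selects the same candidate parents.
scaled = {label: 10 * n for label, n in abundances.items()}
print(candidate_parents("seq3", scaled))
```

With the default skew of 2.0, a sequence with abundance 23 can only have parents with abundance 46 or more, so the sequence with abundance 30 is not considered.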

Reference
Edgar,RC, Haas,BJ, Clemente,JC, Quince,C, Knight,R (2011) UCHIME improves sensitivity and speed of chimera detection, Bioinformatics doi: 10.1093/bioinformatics/btr381 [PMID 21700674].