A database file of nucleotide sequences must be specified using the -db option. The database may be in FASTA or UDB format. UDB format is faster to load. The reference database should include sequences that might appear as parents in the query set.
Don't use UCHIME(2) for OTU clustering or denoising!
I do not recommend using uchime2_ref or uchime2_denovo in an OTU clustering pipeline because of the risk of false positives. The cluster_otus and unoise commands have built-in de novo chimera filtering, which works very well for most data.
Reference database
In most cases it is strongly recommended to use the largest possible database, e.g. SILVA for 16S or UNITE for ITS. The advice to use a small, high-quality database in the first UCHIME paper and in previous versions of the USEARCH manual was wrong!
The following output files are supported:
-uchimeout out.txt (tabbed text)
-chimeras ch.fa (FASTA file with predicted chimeras)
-notmatched not.fa (FASTA file with sequences not matched to the database)
-uchimealnout aln.txt (alignments)
The -nonchimeras option is no longer supported. This is because it is not possible to determine that a sequence is non-chimeric; the best we can say is that it is found or not found in the database (the reasons are explained in the UCHIME2 paper). The -notmatched output is the equivalent for uchime2_ref, but as with uchime_ref you should not interpret the output as containing non-chimeric sequences!
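As a rough way to summarize a run, the FASTA outputs described above can be tallied by counting header lines. The sketch below is only an illustration: the function names count_records and count_fasta_file are hypothetical, and the file names ch.fa and not.fa are simply the ones used in the option list. Note that the count for the -notmatched file is a count of sequences not found in the database, not a count of non-chimeric sequences.

```python
def count_records(lines):
    """Count FASTA records in an iterable of text lines.

    A record is assumed to start with a header line beginning with '>'.
    """
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count FASTA records in the file at 'path' (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return count_records(handle)
```

For example, count_fasta_file("ch.fa") would give the number of predicted chimeras and count_fasta_file("not.fa") the number of sequences not matched to the database.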
The -mode option is required and must be one of:
-mode high_confidence
Reports chimera predictions with high confidence, at the expense of a high false-negative rate.
-mode specific
Reports chimera predictions with confidence, at the expense of a high false-negative rate. Similar to high_confidence mode, but less stringent, so the false-negative rate is lower but the false-positive rate may be higher. Gives results similar to the old UCHIME algorithm.
-mode balanced
Attempts to balance false negatives and false positives to minimize the overall error rate on typical data. Of course, the rates are highly data-dependent.
-mode sensitive
Emphasizes high sensitivity at the expense of a high false-positive rate.
-mode denoised
Reports all perfect chimeric models. Mostly used for designing and validating algorithms; this mode is rarely, if ever, useful in practice because the database is implicitly assumed to be complete (i.e., all parent sequences are exactly present) and the query set and database are both assumed to have no errors. A single difference will prevent the model from being reported, causing false negatives. Conversely, fake models are common, causing false positives (see the UCHIME2 paper for details).
The -strand option is required. Currently this must be specified as -strand plus because searching on both strands is not supported.
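The required and optional settings described above can be assembled programmatically. The following is a minimal sketch, not part of USEARCH: the helper name build_uchime2_ref_command and the assumption that the executable is called usearch and is on the PATH are illustrative only. It uses only options named on this page, rejects any -mode value not listed above, and always passes -strand plus.

```python
# Mode names as listed on this page.
VALID_MODES = ("high_confidence", "specific", "balanced", "sensitive", "denoised")


def build_uchime2_ref_command(query, db, mode, uchimeout=None, chimeras=None,
                              notmatched=None, uchimealnout=None, exe="usearch"):
    """Return an argument list for a uchime2_ref run (hypothetical helper).

    'exe' is an assumed executable name; adjust it to the local install.
    """
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of: " + ", ".join(VALID_MODES))

    # -db, -strand and -mode are required; only -strand plus is supported.
    cmd = [exe, "-uchime2_ref", query, "-db", db,
           "-strand", "plus", "-mode", mode]

    # Append only the output files that were requested.
    outputs = (("-uchimeout", uchimeout), ("-chimeras", chimeras),
               ("-notmatched", notmatched), ("-uchimealnout", uchimealnout))
    for flag, path in outputs:
        if path is not None:
            cmd += [flag, path]
    return cmd
```

The returned list could be passed to subprocess.run(cmd, check=True). Building a list rather than a single string avoids shell quoting problems with file names that contain spaces.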
Multithreading is supported.
The -self option specifies that a reference sequence matching the query sequence should be ignored. This is useful for estimating the false-positive rate using a database of sequences known to be free of chimeras, in which case -self performs a leave-one-out test. The -self option requires that the query and database are the same file.
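The leave-one-out test can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, -self is assumed to be a flag that takes no value, and the executable is assumed to be called usearch. Because the database is taken to be chimera-free, every sequence reported as a chimera is counted as a false positive, so the estimated rate is the number of flagged sequences divided by the number of sequences in the database.

```python
def self_test_command(ref_fasta, chimeras_out, mode, exe="usearch"):
    """Build a leave-one-out command (hypothetical helper).

    The same FASTA file is given as both query and database, as -self requires.
    """
    return [exe, "-uchime2_ref", ref_fasta, "-db", ref_fasta, "-self",
            "-strand", "plus", "-mode", mode, "-chimeras", chimeras_out]


def false_positive_rate(n_flagged, n_total):
    """Fraction of chimera-free reference sequences reported as chimeric."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_flagged <= n_total:
        raise ValueError("n_flagged must be between 0 and n_total")
    return n_flagged / n_total
```

Here n_flagged would be the number of records in the -chimeras output and n_total the number of records in the reference file. Since the rates are highly data-dependent, the estimate is only indicative for the chosen database and mode.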
Paper
R. C. Edgar (2016), UCHIME2: Improved chimera detection for amplicon sequences, http://dx.doi.org/10.1101/074252
Example
usearch -uchime2_ref reads.fasta -db 16s_ref.udb -uchimeout out.txt -strand plus -mode sensitive