USEARCH manual

Taxonomy annotations

Taxonomy annotations specify the taxonomy of a sequence in a reference database. A taxonomy annotation is specified as a tax=names field in the sequence label. The tax=names field is separated from other fields in the labels by a semi-colon, for example:

>AB008314;tax=d:Bacteria,p:Firmicutes,c:Bacilli,o:Lactobacillales,f:Streptococcaceae,g:Streptococcus;

Taxon names are separated by commas. A name must start with a single letter specifying the taxonomic level followed by a colon, e.g. d:Bacteria. Supported levels are shown in the table below.

k	Kingdom	o	Order
d	Domain	f	Family
p	Phylum	g	Genus
c	Class	s	Species

It is not required that the same set of levels appear in all labels. Levels may be omitted, and there may be different lowest levels in different labels. So the following label could appear in the same reference database as the example above:

>M59123;tax=d:Bacteria,p:Firmicutes,c:Clostridia,f:Halobacteroidaceae;

Note that order and genus are not present.
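The label format described above can be parsed in a few lines of Python. This is a minimal sketch, not part of USEARCH; the function name parse_tax is hypothetical, and it assumes the label is well-formed (every taxon has a level letter and a colon).

```python
def parse_tax(label):
    """Return a list of (level, name) pairs from the tax= field of a label.

    Returns an empty list if the label has no tax= field.
    Raises ValueError if a taxon has no colon (malformed annotation).
    """
    # Fields in a label are separated by semi-colons.
    for field in label.lstrip(">").split(";"):
        if field.startswith("tax="):
            pairs = []
            # Taxon names are separated by commas.
            for taxon in field[len("tax="):].split(","):
                # Split on the first colon only, since names may contain colons.
                level, name = taxon.split(":", 1)
                pairs.append((level, name))
            return pairs
    return []


for level, name in parse_tax(">M59123;tax=d:Bacteria,p:Firmicutes,c:Clostridia,f:Halobacteroidaceae;"):
    print(level, name)
```

Because levels may be omitted, code that consumes the result should look levels up by letter rather than by position in the list.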

Commas and semi-colons are not permitted in taxon names. They could be replaced by another punctuation character, e.g. an underscore. Otherwise, any character other than end-of-line is allowed, including colons. Parentheses ( ... ) are allowed but are discouraged as they may confuse scripts which parse taxonomy predictions.
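A name taken from another source can be cleaned before it is written into an annotation. The sketch below is one possible approach, assuming an underscore is an acceptable replacement; the helper name sanitize_taxon_name is hypothetical.

```python
def sanitize_taxon_name(name, replacement="_"):
    """Replace the two characters that are not permitted in a taxon name."""
    return name.replace(",", replacement).replace(";", replacement)


print(sanitize_taxon_name("Clostridium, group XI"))
```

Parentheses are left alone here because they are permitted; a stricter pipeline might choose to replace them as well.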

Names which are entirely blank or empty must not be included; a missing name is indicated by omitting the rank entirely. By contrast, the Greengenes convention is to include all ranks but leave the name unspecified, as in:

f__Mycobacteriaceae; g__Mycobacterium; s__

(Higher levels are omitted for brevity.) In UTAX notation, the species would be omitted:

tax=f:Mycobacteriaceae,g:Mycobacterium;
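The conversion from a Greengenes-style lineage to this notation can be sketched as follows. This is an illustration only: the function name greengenes_to_utax is hypothetical, and the code assumes ranks are separated by semi-colons and written as a level letter, two underscores, then the name.

```python
def greengenes_to_utax(lineage):
    """Convert e.g. 'f__Mycobacteriaceae; g__Mycobacterium; s__' to a tax= field."""
    taxa = []
    for rank in lineage.split(";"):
        rank = rank.strip()
        if not rank:
            continue
        level, sep, name = rank.partition("__")
        name = name.strip()
        # Ranks with an empty name (such as 's__') are omitted entirely.
        if sep and name:
            taxa.append(f"{level}:{name}")
    return "tax=" + ",".join(taxa) + ";"


print(greengenes_to_utax("f__Mycobacteriaceae; g__Mycobacterium; s__"))
# tax=f:Mycobacteriaceae,g:Mycobacterium;
```

Names containing commas would still need to be cleaned before use, since commas separate taxa in the output.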

White space (blanks and tabs) is allowed within a taxon name, but otherwise is not allowed in the taxonomy annotation, so for example a space after a comma is not allowed. If white space is present, the -notrunclabels option must be used when the reference database is in FASTA format. The -notrunclabels option is not required if the database has been converted to .udb format using the makeudb_utax command.
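The formatting rules for an annotation can be checked with a short script before building a database. The sketch below is not a USEARCH feature; check_tax_field is a hypothetical helper, and it makes one assumption beyond the text: white space at the very start or end of a name is treated as white space outside the name and reported as an error.

```python
import re

# One taxon: a supported level letter, a colon, then a non-empty name that
# contains no comma or semi-colon and does not start or end with white space.
TAXON_RE = re.compile(r"[kdpcofgs]:[^\s,;]([^,;\r\n]*[^\s,;])?")


def check_tax_field(tax_field):
    """Return a list of problems found in a 'tax=...' field (empty if none).

    The field is expected without its terminating semi-colon.
    """
    if not tax_field.startswith("tax="):
        return ["field does not start with 'tax='"]
    problems = []
    for taxon in tax_field[len("tax="):].split(","):
        if not TAXON_RE.fullmatch(taxon):
            problems.append(f"bad taxon: {taxon!r}")
    return problems


print(check_tax_field("tax=d:Bacteria,g:Candidatus Pelagibacter"))  # no problems
print(check_tax_field("tax=d:Bacteria, p:Firmicutes"))              # space after comma
```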

Non-ASCII characters (8-bit or Unicode) should not be used. These are sometimes found in clade names, e.g. genus Heteroepichloë in UNITE (notice the dots over the last e). Like most command-line informatics programs, USEARCH assumes plain old 7-bit ASCII text files and does not understand other formats such as UTF-8 which have ASCII as a subset. See here for a Python script that finds non-ASCII characters in a text file. I use this script to find names with non-ASCII characters and then edit them manually, e.g. replacing ë with e.
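For illustration, a check of this kind can be sketched in a few lines. This is not the script referred to above; the function names find_non_ascii and to_ascii are hypothetical. The first function reports lines containing any byte above 127, and the second is an optional shortcut that strips accents using Unicode decomposition.

```python
import unicodedata


def find_non_ascii(lines):
    """Yield (line_number, line) for each bytes line containing a byte above 127."""
    for number, line in enumerate(lines, start=1):
        if any(byte > 127 for byte in line):
            yield number, line


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII parts.

    Characters with no ASCII base letter are silently removed, so the
    result should be reviewed by eye.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def report_non_ascii(path):
    """Print the line number and content of each offending line in a file."""
    # Open in binary mode so that no encoding is assumed.
    with open(path, "rb") as handle:
        for number, line in find_non_ascii(handle):
            print(number, line.decode("utf-8", errors="replace").rstrip())
```

Calling report_non_ascii on a FASTA file lists the labels that need attention; to_ascii("Heteroepichloë") returns "Heteroepichloe".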

The complete set of clade names and their parent-child relationships is implicitly specified by the annotations in a FASTA file. No separate file is required to specify other information about the taxonomy. The taxonomy need not be fully consistent with a tree, meaning that some taxa may have more than one parent, e.g. a given family name may appear in two different orders. This allows the use of names such as "incertae sedis", "sp." and "unknown" which do not correspond to true taxa. However, such names should be excluded for training; see training on user data for discussion.
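The parent-child relationships implied by a set of labels can be collected as sketched below, which also lists taxa that have more than one parent. This is an illustration, not how USEARCH builds its taxonomy internally; the function names are hypothetical, and the code assumes the parent of a taxon is the taxon immediately before it in the same label, even when intermediate levels are omitted.

```python
from collections import defaultdict


def parents_by_taxon(labels):
    """Map each taxon (e.g. 'f:Streptococcaceae') to the set of its parents."""
    parents = defaultdict(set)
    for label in labels:
        for field in label.lstrip(">").split(";"):
            if not field.startswith("tax="):
                continue
            parent = None  # the first taxon in a label has no parent
            for taxon in field[len("tax="):].split(","):
                if parent is not None:
                    parents[taxon].add(parent)
                parent = taxon
    return parents


def multi_parent_taxa(labels):
    """Return the taxa with more than one parent, with parents sorted."""
    return {
        taxon: sorted(found)
        for taxon, found in parents_by_taxon(labels).items()
        if len(found) > 1
    }


labels = [
    ">s1;tax=d:Bacteria,o:A,f:X;",
    ">s2;tax=d:Bacteria,o:B,f:X;",
]
print(multi_parent_taxa(labels))
```

In this made-up example, family X appears under two different orders, so it is reported with both parents.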