See also
utax command
makeudb_utax command
How important is it to re-train for my region / length?
Why not use a larger reference database like Greengenes or SILVA?
Introduction
To create your own reference data, there are three main steps:
1. Create a FASTA file with utax-compatible taxonomy annotations.
2. Trim the sequences to the appropriate region.
3. Train parameters on a subset without placeholder names (like sp. for species or incertae sedis).
Do you really need to train your own parameters for 16S or ITS?
If you have 16S or ITS data for a region or read length which is included in
the pre-trained taxconfs files, there may be no need to train your own
parameters. See Should I re-train for my region / length? for discussion.
Preparing a UTAX-compatible FASTA file
You'll need to write some code to convert existing sequence and taxonomy
data to UTAX-compatible FASTA. See utax taxonomy annotations
for discussion of the format requirements.
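The conversion usually amounts to rewriting FASTA headers into the ;tax=...; form. As a rough sketch, here is how such a converter might look in Python; the input layout (a dict mapping sequence labels to rank names ordered domain to species) is an assumption for illustration, not a fixed format, so adapt it to your source data:

```python
# Sketch: convert a plain FASTA file plus a taxonomy lookup into
# UTAX-compatible FASTA. The lookup layout is an assumption -- adapt
# it to however your taxonomy data is stored.

RANKS = "dpcofgs"  # domain, phylum, class, order, family, genus, species

def utax_label(seq_id, names):
    """Build a UTAX header like >ID;tax=d:Bacteria,...,s:Genus_species;
    'names' is a list of rank names ordered domain -> species; empty
    entries are skipped, so partial taxonomies are allowed."""
    pairs = [f"{r}:{n.replace(' ', '_')}"
             for r, n in zip(RANKS, names) if n]
    return f">{seq_id};tax={','.join(pairs)};"

def read_fasta(path):
    """Minimal FASTA reader yielding (label, sequence) tuples."""
    seq_id, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

def convert(fasta_in, tax_table, fasta_out):
    """tax_table maps seq_id -> list of rank names (domain..species)."""
    with open(fasta_out, "w") as out:
        for seq_id, seq in read_fasta(fasta_in):
            names = tax_table.get(seq_id)
            if names is None:
                continue  # no taxonomy known; drop the sequence
            out.write(utax_label(seq_id, names) + "\n" + seq + "\n")
```
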
Within reason, it is ok to have identical sequences in the reference set, and it is also ok if they have different taxonomies. However, if there are large numbers of identical sequences this can degrade performance. I therefore recommend that you dereplicate the database and assign the consensus taxonomy to each unique sequence (i.e., delete ranks that are not identical in all copies of a unique sequence). You can use the -constax option of derep_fulllength or derep_prefix to do this:
usearch -derep_prefix ref.fa -constax -fastaout ref_uniques.fa -constax_report report.txt
If you trim the database to a shorter region, then you should dereplicate after trimming (because non-identical sequences may become identical when they are trimmed).
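For intuition, the consensus taxonomy computed by -constax can be pictured as the longest prefix of ranks on which all copies of a unique sequence agree. The following Python sketch illustrates that idea only; it is not a re-implementation of the usearch option:

```python
# Sketch of the consensus-taxonomy idea behind -constax: keep each rank
# only while all copies of a unique sequence agree, and truncate at the
# first disagreement. Illustration only, not the usearch implementation.

def parse_tax(annot):
    """Split 'd:Bacteria,p:Firmicutes,...' into rank:name strings."""
    return annot.split(",") if annot else []

def consensus_tax(annots):
    """Longest common rank prefix over the annotations of one
    unique sequence (zip truncates at the shortest annotation)."""
    lists = [parse_tax(a) for a in annots]
    out = []
    for ranks in zip(*lists):
        if len(set(ranks)) != 1:
            break  # copies disagree at this rank; stop here
        out.append(ranks[0])
    return ",".join(out)
```
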
Trimming the FASTA file
Typically, the sequences you want to classify are shorter than your
reference sequences. If so, you should train UTAX parameters on sequences which
have been trimmed to the relevant region. For classification, you can use
full-length or trimmed sequences, but for training you should use trimmed
sequences.
I can suggest a couple of different methods for trimming:
1. The search_pcr command, using the PCR primers.
2. The qsegout option of usearch_global, using the reads as the search database and the reference sequences as the query.
Trimming using search_pcr
If you have the PCR primer sequences, and the sequences you want to classify
cover the full length of an amplicon, then you can use the ampout option of
search_pcr to extract the amplified region, like this:
usearch -search_pcr ref.fa -db primers.fa -strand plus -ampout ref_trimmed.fa -pcr_strip_primers
If your reads are shorter than the amplicons so don't reach the reverse primer, then you can use the -userout file to get the coordinates of the primer-matching region in the reference sequence and write a script to extract the reference sequence segment starting at this position.
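Such a script might look like the following Python sketch. The column layout of the -userout file (label, then the 1-based start and end of the primer match, chosen via -userfields) is an assumption here, so adjust the indexes to match the fields you actually request:

```python
# Sketch: given a tab-separated coordinates file and a reference
# sequence, extract the segment starting just after the forward-primer
# match. Column layout (label, match start, match end; 1-based,
# inclusive) is an assumption -- match it to your -userfields choice.

def load_coords(path):
    """Map sequence label -> end coordinate of the primer match."""
    ends = {}
    with open(path) as fh:
        for line in fh:
            label, lo, hi = line.rstrip("\n").split("\t")[:3]
            ends[label] = int(hi)
    return ends

def trim_from(seq, primer_end):
    """Return the segment starting just after a 1-based inclusive
    end coordinate (so the primer itself is stripped)."""
    return seq[primer_end:]
```
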
Trimming using usearch_global
An alternative which can be convenient, especially if you don't know the
primer sequences, is to align reference sequences to the reads and extract the
aligned segment using the qsegout option of
usearch_global. To make a search database of
tractable size from the reads, you could
dereplicate using the derep_fulllength
command and discard singletons by using -minuniquesize 2. If the database is
still too large, you could cluster at a lower identity, say 97%, using
cluster_fast or set a higher abundance
threshold, say -minuniquesize 10. Once you have a tractable read database, you
can use usearch_global to extract the matching segment of the reference
sequences, e.g.:
usearch -usearch_global ref.fa -db readdb.fa -strand both -id 0.7 -qsegout ref_trimmed.fa
Here, ref.fa is a FASTA file containing the reference sequences with taxonomy annotations and readdb.fa is the database constructed from the reads. Typically the reads will only match a fraction of the reference database, so you should use as diverse a set of reads as possible (e.g., don't use a mock community). You can extend the subset by repeating the process using the trimmed set as a query:
usearch -usearch_global ref.fa -db ref_trimmed.fa -strand both -id 0.7 -qsegout ref_trimmed2.fa
You can iterate this several times. Another approach for increasing the coverage is to extend the set of reads. For example, you can do this:
usearch -usearch_global all_reads.fa -db readdb.fa -strand plus -id 0.7 -notmatched not.fa
This gives a new set not.fa which is <70% similar to the sequences in the original readdb.fa database. Now you can dereplicate the reads in not.fa and use this as a second reference:
usearch -derep_fulllength not.fa -fastaout notd.fa -minuniquesize 2
usearch -usearch_global ref.fa -db notd.fa -strand both -id 0.7 -qsegout ref_trimmed_new.fa
You would then need to merge ref_trimmed_new.fa with ref_trimmed.fa to delete duplicated sequences, which you can do using derep_fulllength.
Exclude placeholder names
For training, sequences with taxonomic names which do not correspond to
monophyletic clades should
be excluded. The most common such name is "sp" for "unknown / unclassified
species". These will confuse the training algorithm, which will assume that two
sequences annotated as, say, s:Clostridium_sp are the same species, when in fact
they could be the same species or two different species. Other common
placeholders are incertae
sedis and (less common) variants such as
incerti ordinis and
incerti subordinis.
In the RDP 16S training set, Chloroplast degrades parameters (I'm not sure whether the clades are polyphyletic or there is some other reason), so it should be excluded.
Placeholder names should be excluded for training, but do not degrade classification accuracy, so for best results training should be done with placeholders excluded and classification should be done with the full set.
The fastx_getseqs command can be used to obtain a subset without placeholder names. The -label_words option specifies the name of a text file containing the words to be excluded, the -label_field tax option specifies that only the "tax=..." taxonomy annotation should be matched, and the -notmatched option gives the sequences which do not match any of the words. For example:
usearch -fastx_getseqs ref.fa -label_words placeholder_names.txt -label_field tax -notmatched refx.fa -matched excluded.fa
A suggested starting point for a list of placeholder names is:
sp
incertae
incerti
unknown
unclassified
chloroplast
Let me know if you have other suggestions. You should also look for the warning "taxonomy nodes have >1 parent" when training. If you do see it, check the log file (-log filename) to get the names. These names are probably placeholders, and either way should be excluded.
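If you prefer to do the filtering in a script rather than with fastx_getseqs, the word matching could be approximated as below. The exact matching rules of -label_words may differ (this sketch splits the tax=...; annotation into alphanumeric words and compares case-insensitively), so treat it as an approximation:

```python
# Sketch of word-based placeholder filtering. A "word" here is a
# maximal run of letters/digits inside the tax=...; annotation,
# compared case-insensitively -- an approximation of -label_words.

import re

PLACEHOLDERS = {"sp", "incertae", "incerti", "unknown",
                "unclassified", "chloroplast"}

def has_placeholder(label):
    """True if the tax= annotation of a FASTA label contains any
    placeholder word."""
    m = re.search(r"tax=([^;]*);", label)
    if m is None:
        return False
    words = re.findall(r"[A-Za-z0-9]+", m.group(1).lower())
    return any(w in PLACEHOLDERS for w in words)
```
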
Non-unique names
As with placeholder
names, non-unique names cause problems for training. In bacteria a given
species name is sometimes found in more than one genus, for example
Salinispora tropica and Pseudonocardia tropica. If you are
training to species level this will cause the warning "taxonomy nodes have >1
parent". You can check the log file (-log usearch.log option) to see which
taxa have this problem. The way I usually fix this is to add the genus name
to the species name, like this:
>AF374480;tax=g:Salinispora,s:Salinispora_tropica;
This is only needed during training, it doesn't matter either way if non-unique names are present in your reference database for classification.
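A small Python sketch of this fix-up, assuming the standard comma-separated rank:name annotation shown above (it leaves already-prefixed species names unchanged):

```python
# Sketch: make species names unique by prefixing the genus, so that
# g:Salinispora,s:tropica becomes g:Salinispora,s:Salinispora_tropica.
# Assumes the comma-separated rank:name annotation format.

def prefix_genus(tax):
    """Rewrite one tax=... annotation so species include the genus."""
    pairs = [p.split(":", 1) for p in tax.split(",")]
    genus = next((n for r, n in pairs if r == "g"), None)
    out = []
    for rank, name in pairs:
        if rank == "s" and genus and not name.startswith(genus + "_"):
            name = f"{genus}_{name}"
        out.append(f"{rank}:{name}")
    return ",".join(out)
```
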
Adding "decoy" sequences
With the
standard training procedure, the prediction accuracy of the topmost levels
may be degraded if there are no similar sequences to act as "decoys"
(potential false positives that have detectable similarity to the training
set). For example, with ITS, all training sequences belong to the Fungi
domain and the only decoys are reversed sequences which have very low
similarity. With 16S, all sequences are Bacteria or Archaea. A lack of decoy
examples can cause a classifier to assign over-optimistic confidence values.
A striking example of this is the RDP Classifier, which often assigns high
confidence to false-positive domain predictions on sequences that in fact
are not homologous to any SSU region.
To mitigate this problem, you can add a decoy set. This should be chosen to be as closely related as possible to the top level (or levels) in your training set. For example, if you are training 16S, then you could add 18S sequences because 18S is homologous to 16S. You should only annotate the top level(s) where you need decoys. For example, if you are adding 18S to a 16S training set, then the annotation for the 18S sequences should be tax=d:eukaryota; without lower levels such as phylum or class. Then d:eukaryota will act as a decoy for d:Bacteria and d:Archaea, but parameters for lower levels will be specific to 16S.
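One way to strip the lower ranks from the decoy annotations is a small script such as this Python sketch; which top level(s) to keep is passed in, and the assumed header layout is the standard ;tax=...; annotation:

```python
# Sketch: restrict decoy annotations to the top level(s) only, e.g.
# turn a full 18S taxonomy into tax=d:Eukaryota; before merging decoys
# into a 16S training set. The default keep set is an assumption.

def decoy_label(label, keep=("d",)):
    """Drop all ranks except those in 'keep' from a UTAX header line."""
    head, _, rest = label.partition("tax=")
    tax, _, tail = rest.partition(";")
    kept = [p for p in tax.split(",") if p.split(":", 1)[0] in keep]
    return f"{head}tax={','.join(kept)};{tail}"
```
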
Training on the trimmed set with
placeholders excluded
The next step is to generate a taxconfs file by training on the
trimmed set without placeholder names. Use the
utax_train command with the -taxconfsout option to save the parameters, for
example:
usearch -utax_train refx.fa -taxconfsout user.tc -utax_splitlevels NVpcofgs -utax_trainlevels dpcofgs
See taxonomy parameter training for discussion of how to set the utax_splitlevels and utax_trainlevels options.
Building a .udb database
The final step is to build a .udb database
using the makeudb_utax command. You can use
trimmed or full-length sequences here; I recommend using trimmed sequences as
they may give slightly better results. Placeholder names do no harm here, so you
should include them to get the best possible coverage. Use the -taxconfsin
option with the parameter file you generated in the previous step (training on
trimmed sequences with placeholders excluded). Since you are using -taxconfsin,
the makeudb_utax command does not re-train and you don't need to specify the -utax_splitlevels
or -utax_trainlevels options. A typical command-line looks like this:
usearch -makeudb_utax ref.fa -taxconfsin user.tc -output ref.udb
This database is now ready for running the utax command, e.g.:
usearch -utax reads.fastq -db ref.udb -strand both -utaxout reads.utax