Search commands
In BLAST, different programs are used
depending on the sequence types (protein or nucleotide) of the
query and database sequences. USEARCH is a single program
that supports several commands. All
search commands support protein-protein, nucleotide-nucleotide and
nucleotide-protein (translated) search. The sequence types
of the query and database are detected automatically. Different
USEARCH commands are used for different alignment styles (local or global) and for different
underlying search algorithms: USEARCH for top hit(s) at higher
identities, or UBLAST, which is
slower but sensitive to lower identities.
BLAST program | Query | DB | Comments |
BLASTN | nucl. | nucl. | The usearch_global (most often) and usearch_local (rarely) commands are commonly used as replacements for BLASTN and MEGABLAST. The ublast command can also be used if it is important to find as many hits as possible. See comments below about nucleotide searches. BLASTN always searches both plus and minus strands; in USEARCH, the -strand option allows searching on the plus strand only (faster). |
BLASTP | protein | protein | The usearch_global and ublast commands are most often used as replacements for BLASTP. Use usearch_global for high-identity top-hit(s) searches, e.g. for orthologs. Use ublast if you need good sensitivity to low-identity hits, e.g. <50%, and/or if it is important to find as many database hits as possible. |
BLASTX | nucl. | protein | See translated search. The ublast command is commonly used as a fast replacement for BLASTX in applications such as searching for genes in metagenomic shotgun reads. BLASTX attempts to extend alignments through frameshifts. USEARCH does not do this, though frameshifts can be inferred from adjacent hits in different frames. Let me know if this feature would be important to you. |
TBLASTX | nucl. | nucl. | Search with translated sequences. This type of translated search is rarely used in practice and is not directly supported by USEARCH. If you have a need for this, please let me know. Searches of this type can be handled by using the findorfs command to get ORFs, i.e. possible coding sequences. Use the -xlat option to get translated sequences. Then you can use a straightforward protein search with ublast (low-identity) or usearch_global or usearch_local (high identity). |
TBLASTN | protein | nucl. | See comments above for TBLASTX. |
MEGABLAST | nucl. | nucl. | The usearch_global (most often) and usearch_local commands are commonly used as replacements for MEGABLAST. The ublast command can also be used if it is important to find as many hits as possible. |
Local and global alignments
BLAST supports only local searches, while USEARCH supports both
local and global search. In some
applications, global alignments can be more effective. For example,
in 16S community analysis, sequence identity is used as a simple
measure of evolutionary distance, with rules of thumb such as >97%
identity indicating the same species and >95% the same genus. Here, sequence
identities are better measured from a global alignment because a
local alignment may not extend through hypervariable regions,
so the reported identity overestimates the identity of the full sequence. See
also database trimming.
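
The overestimate can be illustrated with a toy calculation. This is only a sketch: the sequences are invented, and identity is taken here as matching columns divided by alignment length, which is one of several possible definitions and an assumption on my part.

```python
def identity(row_a: str, row_b: str) -> float:
    """Fraction of alignment columns in which both rows hold the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as a match only if the letters agree and are not gaps.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return matches / len(row_a)


# Invented example: two conserved flanks around a divergent middle block,
# standing in for a hypervariable region.
query = "ACGTACGTAC" + "GGGTTTAAAC" + "TTGACCTGAA"
target = "ACGTACGTAC" + "CCCAAATTTG" + "TTGACCTGAA"

# Global alignment: every column of both sequences is included.
global_id = identity(query, target)

# A local alignment that stops before the divergent block
# would cover only the first conserved flank.
local_id = identity(query[:10], target[:10])

print(f"global identity: {global_id:.3f}")
print(f"local identity:  {local_id:.3f}")
```

In this example the global alignment has 20 matching columns out of 30, while the local alignment reports 10 out of 10, so the local figure says nothing about the divergent block.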
Database files
BLAST requires
the database to be formatted using formatdb or makeblastdb. USEARCH
commands allow the database file to be
provided in FASTA or UDB format. The
filename is specified using the -db option; the format is
detected automatically. FASTA is convenient because it saves the
makeudb step, but UDB files are faster
to load and take less memory.
Output file compatibility
USEARCH supports the tabbed output file format of BLAST (-m8 or
-outfmt 6 option) with the blast6out
option. In most respects, the format is identical. At the time of
writing, the only difference I'm aware of is that USEARCH does not
sort all hits for a given nucleotide sequence by E-value in the
case of translated search.
This is because each ORF is treated internally by USEARCH as a
separate query; hits for a given ORF are sorted by E-value. This
may be changed in future USEARCH releases, if I ever get around to
it (doubtful, unless you can convince me that this is a real
problem in practice).
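
If the ordering matters for your pipeline, it can be restored after the fact. The sketch below reads the twelve standard tabbed columns and re-sorts hits by E-value per parent sequence. The function names, the field names and the "|orf" label convention in the demo are hypothetical, chosen for illustration; check how ORF queries are actually labeled in your own output before relying on `parent_of`.

```python
import csv
import io
from collections import defaultdict

# The twelve columns of BLAST tabbed output (-m8 / -outfmt 6).
# The short field names are my own, not taken from either program.
FIELDS = ["query", "target", "pct_id", "aln_len", "mismatches", "gap_opens",
          "q_start", "q_end", "t_start", "t_end", "evalue", "bits"]


def read_hits(handle):
    """Yield one dict per hit from a tab-separated hits file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        yield hit


def sort_by_parent(hits, parent_of=lambda label: label):
    """Group hits by parent sequence and sort each group by E-value.

    parent_of maps a query label to the label of the sequence it came from.
    The default treats every query label as its own parent.
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[parent_of(hit["query"])].append(hit)
    return {parent: sorted(group, key=lambda h: h["evalue"])
            for parent, group in grouped.items()}


# Invented demo rows; the "|orfN" suffix is a hypothetical label convention.
demo = (
    "read1|orf1\tprotA\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-5\t60.0\n"
    "read1|orf2\tprotB\t95.0\t60\t3\t0\t1\t60\t1\t60\t1e-20\t120.0\n"
)
by_read = sort_by_parent(read_hits(io.StringIO(demo)),
                         parent_of=lambda label: label.split("|")[0])
for hit in by_read["read1"]:
    print(hit["query"], hit["target"], hit["evalue"])
```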
E-value threshold
In BLAST, the E-value threshold
defaults to 10. This threshold is presumably intended to maximize
sensitivity at the expense of a very high error rate. This
threshold will produce many false positive hits, and may cause slow
execution due to the large number of gapped extensions that must be
attempted and perhaps also due to writing large output files.
USEARCH does not have a default E-value
threshold; one must be specified by the user with the
-evalue option. It is not possible to calculate E-values for global
alignments. For global alignments, an identity threshold must usually be
specified. Unlike BLAST, USEARCH supports a rich set of "accept options" providing additional
criteria to decide if an alignment is a hit.
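
The threshold rule can be captured in a small helper that assembles a command line before it is run. This is a sketch under assumptions: the binary is assumed to be named usearch, identity thresholds are assumed to be passed with -id as a fraction, and the helper name and file names are made up. It builds the argument list only and does not run anything.

```python
LOCAL_COMMANDS = {"ublast", "usearch_local"}   # local alignments: E-values apply
GLOBAL_COMMANDS = {"usearch_global"}           # global alignments: no E-values


def build_search_argv(command, query, db, out,
                      evalue=None, identity=None, strand=None):
    """Return an argument list for a search, enforcing the threshold rule."""
    if command in LOCAL_COMMANDS:
        if evalue is None:
            raise ValueError(f"{command} needs an explicit E-value threshold")
    elif command in GLOBAL_COMMANDS:
        if evalue is not None:
            raise ValueError("E-values cannot be calculated for global alignments")
        if identity is None:
            raise ValueError(f"{command} needs an identity threshold")
    else:
        raise ValueError(f"unknown search command: {command}")

    argv = ["usearch", f"-{command}", query, "-db", db, "-blast6out", out]
    if strand is not None:
        argv += ["-strand", strand]       # e.g. "plus" for plus strand only
    if identity is not None:
        argv += ["-id", str(identity)]    # assumed flag, fraction such as 0.97
    if evalue is not None:
        argv += ["-evalue", str(evalue)]
    return argv


# Hypothetical file names, for illustration only.
local_argv = build_search_argv("ublast", "reads.fa", "prot.udb", "hits.b6",
                               evalue=1e-9)
global_argv = build_search_argv("usearch_global", "otus.fa", "ref.fa", "hits.b6",
                                identity=0.97, strand="plus")
print(" ".join(local_argv))
print(" ".join(global_argv))
```

The resulting list can be passed to subprocess.run once the paths point at real files.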
Search termination
USEARCH
allows a search to be terminated if a sufficiently strong
hit is found, saving the time needed to search the rest of the
database. See also weak
hits.
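
The idea of stopping early can be shown with a generic loop. This is not USEARCH's actual algorithm, only a minimal sketch of the principle: the function name, the scoring callback and the toy scores are all invented for illustration.

```python
def search_until_strong(candidates, score, strong_threshold):
    """Scan candidates in order and stop at the first sufficiently strong hit.

    Returns the best (candidate, score) pair seen and the number of
    candidates examined before the scan stopped.
    """
    best = None
    examined = 0
    for candidate in candidates:
        examined += 1
        s = score(candidate)
        if best is None or s > best[1]:
            best = (candidate, s)
        if s >= strong_threshold:
            break  # strong enough: skip the rest of the database
    return best, examined


# Invented identities for three database sequences.
toy_scores = {"seqA": 0.80, "seqB": 0.99, "seqC": 0.995}
best, examined = search_until_strong(toy_scores, toy_scores.get, 0.97)
print(best, examined)
```

Note the trade-off this makes visible: seqC scores higher than seqB but is never examined, so early termination saves time at the cost of possibly missing a better hit later in the database.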
Nucleotide searches
Nucleotide
sequence homology cannot be reliably detected below roughly 75%
identity. Below 50%, most hits are probably noise. Most nucleotide
searches are therefore medium- or high-identity by USEARCH
standards, and the USEARCH algorithm
is usually effective (usearch_global and usearch_local commands). The ublast command might be preferred if it is
important to find all possible hits. The important limitations of
USEARCH for nucleotide search are related to sequence length (see
below).
Chromosomes and other long sequences
USEARCH is not designed for long database or
query sequences. So while BLASTN can be used to find local
similarities between a pair of chromosomes, USEARCH cannot do this
directly. See long sequences.