See also
Search flowchart
The UBLAST algorithm searches a database for local alignments below
an E-value threshold. UBLAST is fundamentally different from the
USEARCH algorithm, which is
designed for high-identity searches. UBLAST is most often used for
protein or translated
searches, where low-similarity alignments can be informative.
Nucleotide searches are also supported by UBLAST, though USEARCH is
usually more appropriate because nucleotide homology is only
detectable at high sequence similarity.
See also: ublast command.
UBLAST is designed to be
sensitive to more distant sequence relationships where USEARCH has
low sensitivity, e.g. below 50% identity for proteins. When
sequence identity is low, a query and a database sequence (target)
might only have a single short matching word, as shown in the
figure above. This matching word is called a "seed".
Seed extensions A
local alignment is constructed "extending" the seed, i.e. by adding
columns to the left and right of the seed, using fast heuristics
that attempt to maximize the total score. This is done in two
stages: ungapped extension, followed by gapped extension if a
sufficiently high-scoring gapless alignment (high-scoring segment
pair, or HSP) is found. The alignment construction phase of the
UBLAST algorithm is similar to gapped BLAST. See alignment heuristics.
Non-exact seeds
When identity is low, it often happens that there are no exact
matches of the required length between homologous sequences.
Searches based on exact word seeds therefore suffer reduced
sensitivity at lower sequence identities. UBLAST exploits two
techniques for improving sensitivity in this regime by using
non-exact seeds: (1) patterns, also
called spaced seeds, and (2) compressed amino acid alphabets.
Patterns can be used for both nucleotide and protein databases;
compressed alphabets are for proteins only. See also indexing options.
Search
acceleration With seeds that are sensitive enough to detect
low-identity hits, there are usually many false positives, i.e.
matching seeds found in pairs of sequences that are not homologous.
Constructing and rejecting the resulting alignments is
computationally expensive. UBLAST uses an unpublished, proprietary
method to reduce the number of alignments that are constructed.
This is controlled by the ‑accel parameter, which defaults to 0.8
and can have values between 0 (no extensions are attempted, so no hits will be
found) and 1
(all matching seeds in the index are extended). Values < 1 can give dramatic
improvements in speed with only a relatively small loss of
sensitivity. Adjusting the -accel value allows the user to tune the
trade-off between speed and sensitivity.
Seed index UBLAST
uses an index of seeds in the database.
This enables faster searches compared to BLAST, which does not use
an index and must therefore parse every target sequence in the
database to find matching seeds. (It is a common misconception that
the formatdb
and makeblastdb
commands in BLAST create seed indexes, but in fact their main
function is to reduce the size of the database by using <8 bits
per letter. Only the makembindex command for MEGABLAST
actually indexes the database). Index options allow fine tuning of
speed, sensitivity and memory use. The index can be stored in a
UDB file, which can be advantageous
when a large database will be searched repeatedly. Alternatively,
the index will be constructed on the fly if the database is
provided in FASTA format.
|