UBLAST algorithm

The UBLAST algorithm searches a database for local alignments below an E-value threshold. UBLAST is fundamentally different from the USEARCH algorithm, which is designed for high-identity searches. UBLAST is most often used for protein or translated searches, where low-similarity alignments can be informative. Nucleotide searches are also supported by UBLAST, though USEARCH is usually more appropriate because nucleotide homology is only detectable at high sequence similarity.

See also: ublast command.

UBLAST is designed to be sensitive to more distant sequence relationships where USEARCH has low sensitivity, e.g. below 50% identity for proteins. When sequence identity is low, a query and a database sequence (target) might only have a single short matching word, as shown in the figure above. This matching word is called a "seed".

Seed extensions
A local alignment is constructed "extending" the seed, i.e. by adding columns to the left and right of the seed, using fast heuristics that attempt to maximize the total score. This is done in two stages: ungapped extension, followed by gapped extension if a sufficiently high-scoring gapless alignment (high-scoring segment pair, or HSP) is found. The alignment construction phase of the UBLAST algorithm is similar to gapped BLAST. See alignment heuristics.

Non-exact seeds
When identity is low, it often happens that there are no exact matches of the required length between homologous sequences. Searches based on exact word seeds therefore suffer reduced sensitivity at lower sequence identities. UBLAST exploits two techniques for improving sensitivity in this regime by using non-exact seeds: (1) patterns, also called spaced seeds, and (2) compressed amino acid alphabets. Patterns can be used for both nucleotide and protein databases; compressed alphabets are for proteins only. See also indexing options.

Search acceleration
With seeds that are sensitive enough to detect low-identity hits, there are usually many false positives, i.e. matching seeds found in pairs of sequences that are not homologous. Constructing and rejecting the resulting alignments is computationally expensive. UBLAST uses an unpublished, proprietary method to reduce the number of alignments that are constructed. This is controlled by the ‑accel parameter, which defaults to 0.8 and can have values between 0 (no extensions are attempted, so no hits will be found) and 1 (all matching seeds in the index are extended). Values < 1 can give dramatic improvements in speed with only a relatively small loss of sensitivity. Adjusting the -accel value allows the user to tune the trade-off between speed and sensitivity.

Seed index
UBLAST uses an index of seeds in the database. This enables faster searches compared to BLAST, which does not use an index and must therefore parse every target sequence in the database to find matching seeds. (It is a common misconception that the formatdb and makeblastdb commands in BLAST create seed indexes, but in fact their main function is to reduce the size of the database by using <8 bits per letter. Only the makembindex command for MEGABLAST actually indexes the database). Index options allow fine tuning of speed, sensitivity and memory use. The index can be stored in a UDB file, which can be advantageous when a large database will be searched repeatedly. Alternatively, the index will be constructed on the fly if the database is provided in FASTA format.