UBLAST algorithm

The UBLAST algorithm searches a database for local alignments below an E-value threshold. UBLAST is fundamentally different from the USEARCH algorithm, which is designed for high-identity searches. UBLAST is most often used for protein or translated searches, where low-similarity alignments can be informative. Nucleotide searches are also supported by UBLAST, though USEARCH is usually more appropriate because nucleotide homology is only detectable at high sequence similarity.

See also: ublast command.


UBLAST is designed to be sensitive to more distant sequence relationships where USEARCH has low sensitivity, e.g. below 50% identity for proteins. When sequence identity is low, a query and a database sequence (target) might only have a single short matching word, as shown in the figure above. This matching word is called a "seed".
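The idea of a seed can be sketched in a few lines of Python. This is only an illustration of finding exact shared words between a query and a target; the function names `build_word_index` and `find_seeds`, the word length and the example sequences are assumptions made for this sketch and do not describe how UBLAST's index is implemented.

```python
from collections import defaultdict


def build_word_index(target, k):
    """Map each word of length k in the target to its start positions."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seeds(query, target, k):
    """Return (query_pos, target_pos) pairs where both sequences share a word of length k."""
    index = build_word_index(target, k)
    seeds = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        # .get() avoids adding empty entries to the defaultdict
        for tpos in index.get(word, ()):
            seeds.append((qpos, tpos))
    return seeds


# Two otherwise dissimilar sequences that share the single word "VLAAG"
seeds = find_seeds("MKVLAAGIW", "QVLAAGTT", k=5)
```

Here the only shared word of length 5 starts at position 2 in the query and position 1 in the target, so a single seed is reported.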

Seed extensions
A local alignment is constructed by "extending" the seed, i.e. by adding columns to the left and right of the seed, using fast heuristics that attempt to maximize the total score. This is done in two stages: ungapped extension, followed by gapped extension if a sufficiently high-scoring gapless alignment (high-scoring segment pair, or HSP) is found. The alignment construction phase of the UBLAST algorithm is similar to gapped BLAST. See alignment heuristics.

Non-exact seeds
When identity is low, homologous sequences often share no exact match of the required length. Searches based on exact word seeds therefore suffer reduced sensitivity at lower sequence identities. UBLAST improves sensitivity in this regime by using two kinds of non-exact seeds: (1) patterns, also called spaced seeds, and (2) compressed amino acid alphabets. Patterns can be used for both nucleotide and protein databases; compressed alphabets are for proteins only. See also indexing options.
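Both techniques can be sketched briefly. In the sketch below, a pattern is written as a string in which `1` marks a position that must match and `0` marks a position that is ignored, and the compressed alphabet maps each amino acid to a representative of its group. The pattern notation, the helper names and the particular grouping of amino acids are assumptions chosen for illustration; they are not the patterns or alphabets that UBLAST uses.

```python
def spaced_word(seq, pos, pattern):
    """Return the letters at the '1' positions of the pattern, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


# Hypothetical reduced alphabet: each group of similar amino acids
# is represented by its first letter.
GROUPS = ["LVIM", "C", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"]
COMPRESS = {aa: group[0] for group in GROUPS for aa in group}


def compress(seq):
    """Rewrite a protein sequence in the reduced alphabet (unknown letters become X)."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq)


# The two nucleotide words differ at the third letter, which the pattern ignores.
same_spaced = spaced_word("ACGTAC", 0, "1101") == spaced_word("ACCTAC", 0, "1101")

# The two protein words share no letter, but are identical after compression.
same_compressed = compress("LKDA") == compress("IRNG")
```

In both cases two words that would fail an exact comparison produce the same index key, so they can still act as a seed. The cost is that unrelated sequences also match more often, which is why this kind of seeding needs a way to limit the number of extensions.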

Search acceleration
With seeds that are sensitive enough to detect low-identity hits, there are usually many false positives, i.e. matching seeds found in pairs of sequences that are not homologous. Constructing and rejecting the resulting alignments is computationally expensive. UBLAST uses an unpublished, proprietary method to reduce the number of alignments that are constructed. This is controlled by the -accel parameter, which defaults to 0.8 and can have values between 0 (no extensions are attempted, so no hits will be found) and 1 (all matching seeds in the index are extended). Values < 1 can give dramatic improvements in speed with only a relatively small loss of sensitivity. Adjusting the -accel value allows the user to tune the trade-off between speed and sensitivity.