Protein search benchmarks (PDB90-R and PDB90-X) |
USEARCH
performance home page
The standard method for testing the accuracy of database search algorithms such as BLASTP is use an ASTRAL subset, usually PDB40 or PDB90 (Brenner et al., 1998). Hits to the same superfamily or fold are considered true positives, hits to different superfamilies or folds are considered false positives. In practice, results are very similar whether folds or superfamilies are used, so the choice is not important. I used folds for the PDB90-R and PDB90-X tests.
I used the PDB90 set from ASTRAL v1.75, which has 15,545 sequences (2.9Mb FASTA file). This set is too small for testing high-throughput methods such as UBLAST or RAPsearch (Ye et al., 2011) that create indexes on words length length >~ 5 because the expected frequency of a given k-mer for k>=5 is close to one. I therefore added random sequences to create a larger search database using the amino acid background frequencies. For technical reasons related to the design of the indexes used by UBLAST and RAPsearch, it is important to do this because it creates a more realistic challenge for the search algorithms. For example, A (Alanine) is ~6x more common than W (Tryptophan), so AAA is 6^3 = 213x more common than WWW. This means that there is a large skew in the frequency of different words, which creates highly unbalanced indexes, especially in the case of RAPsearch which uses a reduced amino acid alphabet. If uniform frequencies were used, then the skew would be minimal and the search speed would be unrealistic. I added 10^8 (one hundred million) random letters in 667,715 sequences of randomly selected lengths 100 - 200aa, which is the typical length range of the domains in PDB90. Hits to random sequences were counted as false positives. I used an E-value threshold of 1 for all tests reported here. This threshold is an arbitrary choice; different thresholds gave similar results (not shown).
To measure accuracy by %id, I used SSEARCH, a Smith-Waterman implementation, as a reference. If a homologous pair had an SSEARCH E-value < 1, then I took the %id from the SSEARCH alignment. This procedure is not ideal as it could bias towards programs having similar alignment parameters to SSEARCH (substitution matrix, gap parameters, Karlin-Altschul statistics parameters, etc.); it would be better to measure %id from structural alignments. However, it was more convenient for me to use SSEARCH and I believe the bias is probably minimal because all the tested programs use similar parameters to SSEARCH (BLOSUM62 etc.). For algorithms that do comprehensive search (no U-sorting), sensitivity is measured by considering all hits, as in Brenner et al.. If U-sorting is enabled, sensitivity is measured using the top hit or top ten hits.
To create a query set for the translated search test (PDB90-X), I reverse-translated the PDB90 sequences using randomly-selected codons for degenerate amino acids. I then selected a random subsequence of length 350 - 450nt from each domain to simulate next-generation sequencer reads (assuming a modest improvement in read lengths from those achievable at the time the test was designed). Using shorter sequences was necessary as RAPsearch consistently crashed with longer query sequences. Read errors were not simulated.
References Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. (2004), The ASTRAL compendium in 2004. NAR 32:D189-D192. Yuzhen Ye, Jeong-Hyeon Choi and Haixu Tang. RAPSearch: a Fast Protein Similarity Search Tool for Short Reads (2011), BMC Bioinformatics 12:159. |