Search benchmark design is challenging
Designing fair, informative homology search benchmarks is very challenging as there are many complications to consider.
SCOP approach to homology search benchmarking
Perhaps the most rigorous published approach is the SCOP methodology developed by Brenner et al. (http://www.pnas.org/content/95/11/6073.short). Search algorithms such as BLASTP are evaluated on an all-vs-all comparison of the sequences of protein structures which have been classified into homologous superfamilies by human experts. Sensitivity and error rates at a given E-value threshold are measured by counting the number of true-positive hits (same superfamily), false-positive hits (different superfamilies) and false negatives (homologous pairs which are not reported).
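The accounting above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the paper; the function name `evaluate` and the input shapes (hits as (query, target, E-value) tuples, a dict mapping each sequence to its superfamily) are my own assumptions.

```python
# Hypothetical sketch of the all-vs-all TP/FP/FN accounting. 'hits' is a list
# of (query, target, evalue) tuples reported by the search program;
# 'superfamily' maps each sequence id to its expert-assigned superfamily.
def evaluate(hits, superfamily, evalue_threshold):
    """Count true/false positives at a threshold; false negatives are
    homologous pairs never reported below the threshold."""
    tp = fp = 0
    reported = set()
    for query, target, evalue in hits:
        # Self-hits are trivially correct and excluded from the counts.
        if query == target or evalue > evalue_threshold:
            continue
        reported.add((query, target))
        if superfamily[query] == superfamily[target]:
            tp += 1
        else:
            fp += 1
    # All homologous ordered pairs in the benchmark (same superfamily).
    ids = list(superfamily)
    homologous = {(a, b) for a in ids for b in ids
                  if a != b and superfamily[a] == superfamily[b]}
    fn = len(homologous - reported)
    return tp, fp, fn
```

Sensitivity is then tp / (tp + fn) and the error rate can be defined from fp in whichever way the benchmark specifies.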
Homology search for next-generation sequencing
Unfortunately, the metrics used by Brenner et al. are not relevant to most next-generation sequencing experiments. There, searches are typically performed on very large datasets, with millions of sequences in both the database and the query set, so computational cost in time and memory is a critical limiting factor which should be measured by the benchmark.
Sensitivity and error rate metrics
In next-gen experiments, it is usually sufficient to find one or a few good hits for a read. Finding more distant homologs is not needed and may actually be a disadvantage if finding and reporting these hits requires significant additional resources (time, memory, disk space etc.). If the only homologs present in the search database are very distant, then they may be of limited value relative to the goals of the experiment. For example, if a protein hit is a local alignment with identity in the twilight zone, then it may not be possible to make useful inferences about the function of the gene. A sensitivity measure that gives equal weight to hits of high and low identities may therefore be misleading. With longer reads or contigs, there may be more than one gene in the sequence, complicating the design of a sensitivity measure even further. Bottom line, I don't know how to design a general-purpose, rigorous measure of sensitivity.
Execution time scaling and database size
The execution time of BLAST scales linearly with database size because a database sequence requires the same amount of processing regardless of the rest of the database. Other algorithms, including UBLAST, have sub-linear scaling where, say, doubling the database size may increase execution time by a factor much less than two. It is therefore not valid to compare execution times on a small database like the PDB90 or PDB40 subsets of SCOP because the relative times may be very different on larger databases. Even if a large database is used, results may not be predictive if the search database size or content is quite different from the benchmark's.
Using larger databases is also problematic because structures are only known for a relatively small number of proteins (about 100k in 2014), so homology cannot be determined except by sequence methods. If sequence methods are used to create a benchmark, then the test may be biased in favor of algorithms that use related data or parameters (e.g., substitution matrix, gap penalties, Karlin-Altschul K and lambda parameters). There is no completely satisfactory solution to this problem. One approach I have used is to embed subsets of PDB90 into the query set and search database, adding a large number of other proteins to the database. Sensitivity and errors can then be assessed only on PDB pairs, assuming that they are typical for all query-database pairs.
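The embedding approach can be sketched as follows. This is a hedged illustration of the idea, not a published protocol; the function names and data shapes are my own assumptions.

```python
# Sketch of the embedding approach: a subset of structure-annotated sequences
# (e.g. from PDB90) is mixed into a much larger background database, and the
# sensitivity/error accounting is restricted to hits where both sequences are
# annotated, since only those pairs have structure-based homology labels.
def build_database(pdb_subset, background):
    """Search database = annotated subset embedded in background proteins."""
    return list(pdb_subset) + list(background)

def filter_pdb_pairs(hits, pdb_ids):
    """Keep only hits where both query and target are in the annotated set,
    so metrics are computed on pairs whose homology is known from structure."""
    return [(q, t, e) for (q, t, e) in hits
            if q in pdb_ids and t in pdb_ids]
```

The filtered hit list can then be fed to the same TP/FP/FN accounting used for a pure structural benchmark.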
Algorithm parameters
Search algorithms often have many parameters which can be used to tune performance measures such as sensitivity, execution time and memory use according to the user's needs. Using default parameters may give a misleading impression of the capabilities of a program and the results that could be achieved in practice. For example, the UBLAST algorithm has a parameter called 'accel' which defaults to 0.8. Tuning this value can give dramatic reductions in execution times with minimal loss in sensitivity for some applications. Other parameters such as the word length and pattern (spaced seed) used for indexing can also be important. Here, you would ideally discuss with the author(s) of the program how to choose the best parameters for the given database, query set and chosen performance metrics (e.g. sensitivity, error rates, time & memory).
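A simple tuning harness for a single parameter like 'accel' might look like the sketch below. The `run_search` callable is a stand-in I have invented for whatever actually launches the program and scores the results; only the trade-off logic is the point here.

```python
# Illustrative parameter sweep. 'run_search' is an assumed callable that runs
# the search at one parameter value and returns (sensitivity, seconds); it is
# a placeholder, not a real program interface.
def sweep(run_search, values):
    """Run the search at each candidate value and collect the results."""
    results = []
    for v in values:
        sensitivity, seconds = run_search(v)
        results.append((v, sensitivity, seconds))
    return results

def best_within_loss(results, max_loss=0.01):
    """Fastest setting whose sensitivity is within max_loss of the best."""
    top = max(s for _, s, _ in results)
    acceptable = [r for r in results if top - r[1] <= max_loss]
    return min(acceptable, key=lambda r: r[2])
```

For example, sweeping accel over a few values below the 0.8 default and accepting up to 1% sensitivity loss would pick out the fastest acceptable setting.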
Search threshold parameters
In general, different algorithms will report different measures of similarity for a given pair of sequences (e.g., E-value or identity). This is because programs use different alignment parameters (e.g., substitution matrix and gap penalties) and different Karlin-Altschul statistics parameters (K and lambda, plus corrections for length, composition etc.). It is therefore not valid to compare performance at a single E-value threshold. The comparison should either consider a reasonable range of E-values (e.g., by making a ROC curve and integrating over a useful range of error rates, say <1%) or optimize a single E-value threshold for each program according to the benchmark metrics.
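One simple threshold-free summary in this spirit is the maximum sensitivity achievable at any E-value cutoff whose error rate stays below a cap. This is a sketch under my own assumptions: the error rate is defined here as FP/(TP+FP), which is only one of several reasonable choices, and `is_homolog` is an assumed labeling function.

```python
# Sweep the E-value threshold over all reported hits (sorted best-first) and
# track the best sensitivity reached while the false-positive fraction
# FP/(TP+FP) stays below max_error. 'n_pairs' is the total number of
# homologous pairs in the benchmark; 'is_homolog' labels each hit.
def sensitivity_at_error(hits, is_homolog, n_pairs, max_error=0.01):
    tp = fp = 0
    best = 0.0
    for q, t, e in sorted(hits, key=lambda h: h[2]):
        if is_homolog(q, t):
            tp += 1
        else:
            fp += 1
        if fp / (tp + fp) <= max_error:
            best = max(best, tp / n_pairs)
    return best
```

Because each program's hits are swept over its own score scale, the comparison no longer depends on the programs agreeing about what a given E-value means.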
Marker gene searches
In marker gene metagenomics (e.g., 16S, ITS or COI), the database is usually a single gene. Databases such as Greengenes and RDP contain hundreds of thousands of sequences for a single gene or region. This search problem is distinctly different from general-purpose homolog recognition, and different performance metrics are needed. Note for example that most or all query sequences are homologous to the entire database. Here, very different parameters or even a different underlying algorithm may be appropriate. For example, the USEARCH algorithm is usually faster and more accurate than UBLAST for searching 16S reference databases.
Database load time and execution time measurement
With very large databases, the time required to load a database file into memory may be quite significant, perhaps on the order of minutes. With the large query sets usually found in practice, the load time is only a small fraction of the total time; to be representative, a benchmark should therefore subtract it out of execution time measurements so that the average search time per query sequence can be measured in a meaningful way. The simplest approach is to ensure that the query set is large enough that the search time is much greater than the load time. More generally, you should check that execution time is dominated by the search itself rather than file input/output operations, which may be slow, e.g. due to a slow or busy local network.
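If total time is roughly load_time + n_queries * per_query_time, timing two runs with different query counts and solving the two equations separates the terms. A minimal sketch, assuming this linear model holds and the runs are otherwise identical:

```python
# Estimate fixed load time and per-query search time from two timed runs.
# Assumes total_time = load + n_queries * per_query, i.e. the same database
# and settings in both runs, differing only in the number of queries.
def split_load_time(n1, t1, n2, t2):
    """Return (load_seconds, seconds_per_query) from runs of n1 and n2
    queries taking t1 and t2 seconds respectively."""
    per_query = (t2 - t1) / (n2 - n1)
    load = t1 - n1 * per_query
    return load, per_query
```

For example, a run of 10,000 queries taking 80 s and a run of 100,000 queries taking 260 s imply a 60 s load time and 2 ms per query.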