SCOP40 benchmark
SCOP40 is the de-facto standard benchmark for protein homology search.
It is based on domains in the SCOP database clustered at 40% amino acid identity.
To measure homolog detection accuracy, an algorithm performs an all-vs-all search of SCOP40 domains, discarding
trivial self-hits. At a given E-value or alignment score cutoff, the number of
true positivs (TPs) and false postives (FPs) is measured. To show the range of trade-offs that can be achieved,
the results are usually summarized in a plot which shows sensitivity on the x axis and a measure
which includes errors on the y axis.
Sensitivity may be called "recall" or "true positive rate", these are the same thing. If the y axis is
false positive rate (FPR), i.e. the fraction of hits which are FPs, then the graph is called a
Receiver Operator Characteristic plot (ROC).
Alternatively, the y axis may be can be precision, which is the fraction of hits above the threshold which are TPs,
this is a precision-recall (P-R) plot.
While P-R curves have been used to report SCOP40 results in recent papers, these plots are problematic for a few reasons.
They were designed for benchmarking binary classifiers, but it is a bit of a stretch to consider a protein search
algorithm as a binary classifier. The number of TPs is a very small fraction of the database, and may be zero. Also,
the ubiquitous use of E-values shows that biologists prefer to control errors by limiting the number of expected errors
per query. For example, if with an E-value threshold of 10 there should be roughly 10 errors per query according
to the algorithm's estimate. But if you mix TPs and FPs by using a measure such as precision or FPR, you can't
assess sensitivity as a function of false-positive errors per query (FPEPQ). Also, ROC and P-R plots do not
scale, so accuracy measured on SCOP40 does not predict accuracy in a much bigger database such as
AFDB. These points are discussed in detail in the Reseek paper.
To address these issues, the developers of SCOP40 (Brenner et al 1998) proposed a new type of plot, Coverage versus Error (CVE).
Here, the y axis is FPEPQ. If the search algorithm estimates a perfect E-value, then FPEPQ = E.
Sensitivity vs. errors for homolog detection on SCOP40.
TPs are hits to the same superfamily, FPs are
hits to different superfamilies. Higher accuracy is reflected by fewer errors at a given sensitivity,
which gives a curve lower and to the right. This shows that Reseek has substantially
higher sensitivity than previous methods including DALI, TM-align and Foldseek.
Inset is reported E-value vs. FPEPQ, which shows
that Reseek E-values are in good agreement with measured error rates, while Foldseek E-values are underestimated.
For example, at Foldseek E-value threshold 1E-6 (circled), the measured FPEPQ is ~0.1, i.e. five orders of magnitude higher than the estimate.
Reference
Brenner, S.E., Chothia, C. and Hubbard, T.J., 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proceedings of the National Academy of Sciences, 95(11), pp.6073-6078.
https://www.pnas.org/doi/pdf/10.1073/pnas.95.11.6073