E-values
What is an E-value and how do you interpret it?
How to choose your E-value threshold
The -dbsize option
Above and left is the Reseek alignment of 3beg (SR protein kinase 1)
to 1sco (a scorpion toxin), which covers almost
all amino acids in the toxin.
It was found by using 1sco as a query to search the SCOP40 database,
which has 11,211 domains. The superposition looks good,
with one alpha helix and three beta strands in similar conformation.
The AQ is 0.3949, the E-value is 7.87 and the
RMSD is low (1.6 Angstroms).
AQ is designed to be roughly comparable to the
TM score. The range is from zero to one;
AQ>0.5 suggests a homologous alignment and AQ<0.5 suggests a spurious alignment due to chance
similarity. This is a rule of thumb which of course is far from perfect. In this case, the rule of thumb
gives the correct answer: this alignment is spurious, and there is no homology or functional relationship despite the
fact that the alignment covers the full-length toxin protein with low RMSD. However, TM=0.63,
so the TM>0.5 rule suggested by the TM-align authors gives the wrong answer in this example.
An E-value provides more information to guide whether or not you should believe the hit.
It is an estimate of how many hits with at least this AQ will occur
just by chance. The E-value in this example is ~8, which means that Reseek estimates
that you will get ~8 hits as good as this, or better, due to spurious similarity which
has nothing to do with evolutionary or functional relatedness.
While AQ is an intrinsic property of the alignment (you don't need anything other
than the alignment to calculate it), E-value also depends on the database. In particular,
a larger database will tend to give more spurious hits, and the E-value of a given
alignment therefore increases in proportion to the database size.
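The proportionality can be sketched in a few lines of Python. This is only an illustration of the scaling described above, not Reseek's own code: the function name `rescale_evalue` and the one-million-structure database are assumptions made up for the example.

```python
def rescale_evalue(evalue: float, old_dbsize: int, new_dbsize: int) -> float:
    """Rescale an E-value, assuming it is directly proportional to database size."""
    if old_dbsize <= 0 or new_dbsize <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * new_dbsize / old_dbsize


# The example hit: E = 7.87 against SCOP40 (11,211 domains).
e_scop40 = 7.87

# Hypothetical larger database of one million structures (an assumption).
e_larger = rescale_evalue(e_scop40, 11_211, 1_000_000)
print(f"E-value in the larger database: {e_larger:.0f}")
```

The same alignment with the same AQ would be expected to have roughly a hundred times as many chance hits at least as good in a database that is roughly a hundred times larger.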
E-values don't consider what you are looking for (true positives).
You could be interested in a particular protein sub-family (say, Coronavirus polymerases), a
family (virus polymerases), a homologous superfamily (RNA and DNA-binding palm domains), or
a larger, non-homologous group whose members are probably convergent (Ferredoxin-like folds,
to which palm domains belong). E-values don't know anything about categories like this.
Sometimes, it is reasonable to develop rough AQ or E-value thresholds for
identifying categories, say E<1e-9 for Coronavirus polymerases, E<1e-3 for virus polymerases
and E<10 for Ferredoxin-like folds. However, rules of thumb
like this are category- and database-specific;
you cannot make a reliable rule that says something like "E<1e-6 is a good threshold" without considering
both the category you want to find and the database.
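As a minimal sketch, category-specific thresholds could be kept in a lookup table and applied to a hit list. The cutoffs are the illustrative numbers from the text, and the category keys, the `filter_hits` helper and the hit list are all made up for this example; none of them are part of Reseek.

```python
# Hypothetical per-category E-value cutoffs, copied from the illustrative
# numbers in the text. Real cutoffs must be calibrated for each category
# and each database.
EVALUE_CUTOFFS = {
    "coronavirus_polymerase": 1e-9,
    "virus_polymerase": 1e-3,
    "ferredoxin_like_fold": 10.0,
}


def filter_hits(hits, category):
    """Keep (target, evalue) pairs whose E-value is below the category cutoff."""
    cutoff = EVALUE_CUTOFFS[category]
    return [(target, evalue) for target, evalue in hits if evalue < cutoff]


# Made-up hits for illustration only (not real search output).
hits = [("hitA", 3e-12), ("hitB", 2e-5), ("hitC", 7.87)]

for category in EVALUE_CUTOFFS:
    kept = [target for target, _ in filter_hits(hits, category)]
    print(category, kept)
```

The same hit list gives a different answer for each category, which is why a single universal threshold cannot be reliable.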
In typical biological sequence and structure search tasks, the number of true
positives is at best a tiny fraction of the database (maybe there are no TPs at all).
The large majority of proteins in the database are unrelated to the query; any of them reported as a hit is a false
positive (FP), and their scores give a background distribution looking roughly
like a bell curve. The right-hand tail of this background curve probably overlaps with the tiny TP distribution that you
want to capture, as shown in the sketch above, which makes it tricky to decide where to set an AQ threshold.
If a hit is under the bell curve or in the left-hand tail,
then it is very likely to be a false positive, regardless of what you are looking for.
The danger is that if you set the AQ cutoff just slightly too low, you quickly get into the fatter part of
the orange bell curve where most of your hits are FPs. E-values address this dilemma by allowing the user to limit
the expected number of errors.
How can an E-value estimate false positives without knowing anything about how true positives are defined?
The shape of the background curve does not depend on the TPs and can be predicted. If an algorithm does a good job of this,
then its E-values will be good estimates of false positive rates.
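To make the idea concrete, here is a toy calculation that treats the background as an exact normal distribution and multiplies its right-hand tail probability by the database size. This is a sketch under stated assumptions, not Reseek's actual method: the normal shape, the mean and standard deviation, and the function names are all hypothetical, and real tools fit the tail of the background more carefully.

```python
import math


def normal_tail(score: float, mean: float, sd: float) -> float:
    """P(X >= score) for a normal background with the given mean and sd."""
    z = (score - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def estimated_evalue(score: float, mean: float, sd: float, dbsize: int) -> float:
    """Expected number of chance hits scoring >= score in a database of dbsize."""
    return dbsize * normal_tail(score, mean, sd)


# Assumed background: mean 0 and sd 1 (scores expressed as z-scores),
# searched against a database the size of SCOP40 (11,211 domains).
for score in (2.0, 3.0, 4.0):
    print(score, estimated_evalue(score, mean=0.0, sd=1.0, dbsize=11_211))
```

Nothing in the calculation refers to true positives: it needs only the shape of the background and the number of database entries.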
All of this was figured out a long time ago for sequence search, where E-values were first introduced
in BLAST and are now widely used in other popular search packages such as HMMER. Setting
an E-value threshold limits the number of false positives; implicitly, sensitivity is considered a secondary issue
compared to controlling for false positive errors.
If the E-value is the expected number of errors, then you may be wondering why it is not an integer and how it can
be <<1. This is because "expectation" is defined by averaging (conceptually, at least) over a large
number of randomized experiments. If only one in a million of these experiments gives a false positive,
then the E-value is 1e-6.
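A small simulation shows how averaging integer counts gives a non-integer expectation. It is a toy model, not how Reseek computes E-values: the database size, the per-entry chance of a false hit and the function name are assumptions chosen so that the expected count is 0.5.

```python
import random


def simulate_false_positive_counts(dbsize, p_false_hit, n_experiments, seed=0):
    """Count chance hits in each of n_experiments randomized searches."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_experiments):
        # Each database entry independently gives a false hit with p_false_hit.
        count = sum(1 for _ in range(dbsize) if rng.random() < p_false_hit)
        counts.append(count)
    return counts


# Assumed toy values: 200 entries, each with a 0.25% chance of a false hit.
counts = simulate_false_positive_counts(dbsize=200, p_false_hit=0.0025,
                                        n_experiments=5000)
mean_count = sum(counts) / len(counts)

print(f"expected by construction: {200 * 0.0025}")
print(f"simulated average:        {mean_count:.3f}")
print(f"searches with no false positive: {counts.count(0) / len(counts):.1%}")
```

Every individual search gives a whole number of false positives (often zero), but the average over many searches is a fraction.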