Reseek

How to choose your E-value threshold

Sometimes it is reasonable to develop rough AQ or E-value thresholds for identifying categories, say E<1e-9 for Coronavirus polymerases, E<1e-3 for virus polymerases and E<10 for Ferredoxin-like folds. However, rules of thumb like these are category-specific; you cannot make a reliable rule such as "E<1e-6 is a good threshold" without knowing which category you want to find and which database you are searching. Contrary to some popular misconceptions, a robust AQ or E-value threshold can only be set for (1) a particular category of proteins and (2) a particular database.

Setting an empirical threshold
If you have a protein family, say Coronavirus polymerases, then you can figure out a threshold empirically by doing an all-vs-all search of the family.

When you do this, you must set the -dbsize option to the size of the database you will be searching in practice; otherwise, the E-value you get for a given alignment in the test will differ from the E-value you get in production.

The highest E-value reported is then a reasonable guess at a good threshold for searching a large database.
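To see why the database size setting matters, here is a minimal Python sketch. It assumes the conventional definition in which an E-value scales linearly with the number of database entries; whether Reseek's calculation follows exactly this form is an assumption, and the function name `rescale_evalue` and the sizes below are made up for illustration.

```python
def rescale_evalue(evalue, dbsize_used, dbsize_target):
    """Rescale an E-value from one database size to another.

    ASSUMPTION: the E-value is proportional to database size, as in the
    conventional definition. This is not taken from the Reseek source.
    """
    if dbsize_used <= 0 or dbsize_target <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * dbsize_target / dbsize_used


# Hypothetical numbers: an alignment scored against a 1,000-entry family
# test set, then the same alignment in a 1,000,000-entry production search.
test_evalue = 1e-9
production_evalue = rescale_evalue(test_evalue, 1_000, 1_000_000)
print(f"test: {test_evalue:g}  production: {production_evalue:g}")
```

Under this assumption the same alignment looks a thousand times less significant in production, which is why a threshold tuned without the production database size would not transfer.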

You can improve this threshold by searching your family against a negative set containing members of closely related but different families that would be false positives from your perspective, say Mesonivirus polymerases and Euronivirus polymerases. Quite likely, the lowest E-value from your negative set will be lower than the highest E-value from your positive set; you would then choose your threshold somewhere in between, depending on your desired trade-off between false positives and false negatives.
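The trade-off can be tabulated with a few lines of Python. This is only a sketch: the E-values are made up, the helper `error_counts` is hypothetical, and it assumes you have already collected the E-values from the positive (all-vs-all) and negative searches into plain lists. A hit is treated as accepted when its E-value is at or below the threshold.

```python
def error_counts(threshold, positive_evalues, negative_evalues):
    """Return (false_positives, false_negatives) for a candidate threshold.

    A hit is accepted when its E-value is <= threshold.
    """
    # Family members that would be missed at this threshold.
    false_negatives = sum(e > threshold for e in positive_evalues)
    # Members of other families that would be accepted at this threshold.
    false_positives = sum(e <= threshold for e in negative_evalues)
    return false_positives, false_negatives


# Made-up E-values for illustration only.
positives = [1e-30, 1e-18, 1e-12, 1e-8, 1e-6]  # all-vs-all search of the family
negatives = [1e-9, 1e-7, 1e-3, 0.5]            # family vs. negative set

highest_positive = max(positives)
lowest_negative = min(negatives)
print(f"highest positive E-value: {highest_positive:g}")
print(f"lowest negative E-value:  {lowest_negative:g}")

# Candidate thresholds from strict to permissive.
for threshold in (1e-10, 3e-8, 1e-6):
    fp, fn = error_counts(threshold, positives, negatives)
    print(f"E <= {threshold:g}: {fp} false positives, {fn} false negatives")
```

In this toy data the two sets overlap, so no threshold gives zero errors of both kinds; the table makes it easier to pick the compromise you can live with.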

Setting a threshold when you don't have training data
If you don't have training data, then I would suggest the following strategy. Set a high threshold (the default of 10 would be a good choice to start) and look at the E-values of the top hits. If they are very small, say 1e-6 or less, then you are in good shape (or the hits are... 😜); you can be confident that the structures are homologous and will have related function, though you can't be sure how close the functions are without more information. If the E-values are high, say around 0.01 or greater, then you should start to worry that the hits are spurious; the alignment may reflect similarity without homology or functional relatedness. You can't be sure unless you can check for family-specific motifs in the structures.
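As a rough sketch of this triage in Python: the cutoffs 1e-6 and 0.01 come from the text above, while the function name, the labels, the handling of the in-between range (which the text does not classify) and the example hits are all assumptions made for illustration.

```python
CONFIDENT_MAX = 1e-6  # "very small" E-values, from the text above
WORRY_MIN = 0.01      # "high" E-values, from the text above


def triage(evalue):
    """Label a top hit by E-value using the rough cutoffs above."""
    if evalue <= CONFIDENT_MAX:
        return "likely homologous"
    if evalue >= WORRY_MIN:
        return "possibly spurious, check family-specific motifs"
    # ASSUMPTION: the text gives no guidance for this range.
    return "intermediate, needs more evidence"


# Hypothetical top hits as (target label, E-value) pairs.
top_hits = [("hit_A", 3e-12), ("hit_B", 4e-4), ("hit_C", 0.7)]
for label, evalue in top_hits:
    print(f"{label}\t{evalue:g}\t{triage(evalue)}")
```

The labels are only a first pass; as noted above, a small E-value supports homology but does not tell you how close the functions are.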