How to choose your E-value threshold
Sometimes it is reasonable to develop rough AQ or E-value thresholds for
identifying categories, say E<1e-9 for Coronavirus polymerases, E<1e-3 for virus polymerases
and E<10 for Ferredoxin-like folds. However, rules of thumb like this are category-specific;
you cannot make a reliable rule that says something like "E<1e-6 is a good threshold" without knowing
more about the category you want to find and which database you are searching.
Keep in mind that, contrary to a popular misconception, you cannot set
a robust threshold for AQ scores or E-values unless you fix both (1) a particular category of
proteins and (2) a particular database.
Setting an empirical threshold
If you have a protein family, say Coronavirus polymerases, then you can figure out
a threshold empirically by doing an all-vs-all search
of the family.
When you do this, you must set the -dbsize option
to the size of the database you will be searching in practice; otherwise
the E-value you get for a given alignment in the test will be different from the E-value
you get in production.
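To see why the database size matters, here is a minimal sketch under one loudly stated assumption: that an E-value is an expected count of chance hits and therefore grows in proportion to the number of database entries. That proportionality is a common convention, not a confirmed description of how the program computes its E-values, and the function name and all numbers are hypothetical.

```python
def rescale_evalue(evalue, test_dbsize, production_dbsize):
    """Rescale an E-value from one database size to another.

    ASSUMPTION: E-values are proportional to database size.
    This is an illustration, not the program's actual formula.
    """
    return evalue * production_dbsize / test_dbsize


# Hypothetical numbers: an alignment scores E=1e-9 when the test search
# is sized for a 500-member family, but production searches 1,000,000 entries.
e_production = rescale_evalue(1e-9, test_dbsize=500, production_dbsize=1_000_000)
print(f"{e_production:.1e}")  # roughly 2e-06 under the stated assumption
```

Under that assumption the same alignment looks three orders of magnitude weaker in production, which is why a threshold calibrated at the wrong database size would be misleading.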
The highest E-value reported is then a reasonable guess at a good
threshold for searching a large database.
You can improve this threshold by searching your family against a negative set
containing members of closely related but different families that would be
false positives from your perspective, say Mesonivirus polymerases and Euronivirus polymerases.
Quite likely, the lowest E-value from your negative set will be lower than the highest E-value
from your positive set; you would then choose your threshold somewhere in between,
depending on your desired trade-off between false positives and false negatives.
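The trade-off can be tabulated with a short script. This is a sketch, not part of the program: it assumes you have already extracted the E-values of the positive-set hits and the negative-set hits into two lists, and that a hit is accepted when its E-value is at or below the threshold. The function name and the example values are made up for illustration.

```python
def sweep_thresholds(pos_evalues, neg_evalues):
    """List (threshold, false_negatives, false_positives) per candidate.

    Candidates are the observed E-values themselves; a hit is accepted
    when its E-value is <= the threshold.
    """
    candidates = sorted(set(pos_evalues) | set(neg_evalues))
    rows = []
    for t in candidates:
        fn = sum(1 for e in pos_evalues if e > t)    # family members missed
        fp = sum(1 for e in neg_evalues if e <= t)   # non-members accepted
        rows.append((t, fn, fp))
    return rows


# Made-up E-values: the lowest negative (1e-9) is below the highest
# positive (1e-7), so no threshold separates the two sets perfectly.
pos = [1e-30, 1e-18, 1e-12, 1e-7]
neg = [1e-9, 1e-4, 0.5]

for t, fn, fp in sweep_thresholds(pos, neg):
    print(f"E<={t:g}: {fn} false negatives, {fp} false positives")
```

With these values, a threshold of 1e-12 misses one family member but accepts no negatives, while 1e-7 finds every family member at the cost of one false positive; which is preferable depends on your application.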
Setting a threshold when you don't have training data
If you don't have training data, then I would suggest the following strategy.
Set a high threshold (the default of 10 would be a good choice to start) and look
at the E-values of the top hits. If they are very small, say 1e-6 or less, then
you are in good shape (or the hits are... 😜); you can be confident that the
structures are homologous and will have related function, though you can't
be sure how close the functions are without more information. If the E-values
are high, say around 0.01 or greater, then you should start to worry that
the hits are spurious; the alignment may reflect similarity without homology or
functional relatedness. You can't be sure unless you can check for family-specific
motifs in the structures.
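As a rough illustration of this triage, the sketch below labels hits using the two example cut-offs mentioned above (1e-6 and 0.01). The function name, the label strings, and the "intermediate" band between the two cut-offs are assumptions added for illustration; the cut-offs themselves are only examples, not recommended universal thresholds.

```python
def triage(evalue, confident=1e-6, doubtful=0.01):
    """Label a hit by E-value using illustrative, adjustable cut-offs."""
    if evalue <= confident:
        return "likely homologous"
    if evalue >= doubtful:
        return "possibly spurious"
    # The text gives no guidance for this range; treat it as undecided.
    return "intermediate"


# Hypothetical top hits as (target name, E-value) pairs.
hits = [("hitA", 3e-12), ("hitB", 2e-4), ("hitC", 0.8)]
for name, evalue in hits:
    print(f"{name}\t{evalue:g}\t{triage(evalue)}")
```

Hits labeled "possibly spurious" or "intermediate" are the ones worth checking for family-specific motifs before drawing conclusions.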