Can SINTAX predict species?
On the SINTAX data
downloads page you will find a link to a version of the
RDP training set with species names. If you use this database with the
sintax command, you will usually get a species
prediction with a bootstrap confidence. (Not always, because there are species annotations
only for the 11,424 of 13,212 sequences where I found a 100% identical match to
a named strain).
You should be skeptical of species predictions because it
is possible, probably common, to get a high bootstrap value for an incorrect
prediction. This happens with genus too, but the problem is worse for species.
If you're using short tags like V4, then it often
happens that two or more species have identical tag sequences, making it impossible to
identify which species you're looking at. This scenario might not be detectable from
the database because the vast majority of species have not been named by
taxonomists and do not appear in the RDP training set, so there could be a
novel species with an identical sequence. In other words, the reference
database is sparse: it has missing data --- lots of missing data.
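One way to see the identical-tag problem within a reference database is to group its tag sequences and list those that map to more than one species. The following is a minimal sketch under stated assumptions: the function name `ambiguous_tags` and the toy records are hypothetical, and it assumes you have already extracted the tag region (for example V4) from each reference sequence. It can only reveal collisions between species that are in the database; a novel species with an identical tag stays invisible.

```python
from collections import defaultdict

def ambiguous_tags(records):
    """Group (species, tag) pairs by tag sequence and return only the
    tags that are shared by two or more species."""
    species_by_tag = defaultdict(set)
    for species, tag in records:
        species_by_tag[tag.upper()].add(species)
    return {
        tag: sorted(names)
        for tag, names in species_by_tag.items()
        if len(names) > 1
    }

# Hypothetical toy data: two species share the same tag sequence.
records = [
    ("Genus_a species_1", "ACGTACGTAC"),
    ("Genus_a species_2", "ACGTACGTAC"),
    ("Genus_b species_3", "ACGTTTGTAC"),
]

for tag, names in ambiguous_tags(records).items():
    print(tag, "->", ", ".join(names))
```

Any tag reported here cannot be resolved to a single species, whatever bootstrap value a classifier assigns.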
If you use a "top-hit" classifier like SINTAX or RDP with a sparse reference database, then you get a problem with
over-classification as shown in the figure below
(taken from the SINTAX paper). This is what happens:
suppose the top hit for your query sequence has 95% identity. Then the query
probably belongs to a different species than the top hit. Now suppose the second-best hit has
much lower identity, say 90%. The bootstrapping in SINTAX and RDP repeatedly
takes a sub-sample of words (8-mers) in the query sequence and checks the
top taxonomy when considering only the subset. If there is a big drop in
identity between the top hit and second-best hit, then you will get the same
top hit every time even if it has relatively low identity, and the result is
a high confidence for all ranks in the taxonomy. This is most obvious when
the genus is a singleton, i.e. has only one sequence in the database (call
it S). Then S is very likely to be the top hit for any species in that
genus, even under sub-sampling of the 8-mers, in which case you'll get a
spuriously high confidence that the species of your query is the same as S
if it belongs to the same genus. About half (1,157 / 2,472) of the genera in
the RDP training set v16 are singletons, so this is a significant concern.
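The effect described above can be reproduced with a toy simulation. This is a simplified sketch, not the actual SINTAX or RDP implementation: the subsample size of 32 words, the 100 iterations, the shared-word count used to pick the top hit, and all sequence and taxon names are assumptions made for illustration. The query is a random sequence; the database holds a singleton S at roughly 95% identity and a second reference at roughly 90% identity.

```python
import random
from collections import Counter

K = 8             # word length from the text (8-mers)
SUBSAMPLE = 32    # words drawn per iteration (assumed value)
ITERATIONS = 100  # number of bootstrap iterations (assumed value)

def kmers(seq, k=K):
    """Return all overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mutate(seq, step):
    """Substitute every step-th base, giving roughly (1 - 1/step) identity."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] if i % step == 0 else b for i, b in enumerate(seq))

def bootstrap_confidence(query, database, rng):
    """Return the most frequent top hit and the fraction of iterations
    in which it was the top hit."""
    ref_words = [(name, set(kmers(seq))) for name, seq in database]
    words = kmers(query)
    votes = Counter()
    for _ in range(ITERATIONS):
        # Draw a random subset of query words (with replacement).
        sample = rng.choices(words, k=SUBSAMPLE)
        # Top hit = reference sharing the most sampled words.
        best = max(ref_words, key=lambda ref: sum(w in ref[1] for w in sample))
        votes[best[0]] += 1
    name, count = votes.most_common(1)[0]
    return name, count / ITERATIONS

rng = random.Random(1)
query = "".join(rng.choice("ACGT") for _ in range(240))
database = [
    ("Genus_x species_s", mutate(query, 20)),  # singleton S, ~95% identity
    ("Genus_y species_o", mutate(query, 10)),  # second-best hit, ~90% identity
]

species, confidence = bootstrap_confidence(query, database, rng)
print(species, confidence)
```

Because S keeps far more of the query's 8-mers than the second-best reference, it wins almost every iteration, so the reported confidence for the species of S is high even though 95% identity suggests the query is a different species.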