USEARCH manual

Can SINTAX predict species?

On the SINTAX data downloads page you will find a link to a version of the RDP training set with species names. If you use this database with the sintax command, you will usually get a species prediction with a bootstrap confidence. Not always, though: species annotations are present only for the 11,424 of 13,212 sequences (about 86%) where I found a 100% identical match to a named strain.
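To look at species calls and their bootstrap values, it can help to pull them out of the sintax output. The following is a minimal Python sketch, not part of USEARCH. It assumes a tab-separated output line whose first field is the query label and whose second field is a comma-separated prediction such as d:Name(1.0000),...,s:Name(0.6100). The function names, the example line, and the taxon names in it are made up for illustration; check the field layout against your own output file.

```python
import re

# One rank entry, e.g. "s:Genus_x_species_y(0.6100)" (layout is an assumption).
ENTRY = re.compile(r"^([a-z]):(.+)\(([0-9.]+)\)$")

def parse_prediction(field):
    """Map rank letter -> (name, bootstrap confidence) for one prediction field."""
    ranks = {}
    for entry in field.split(","):
        m = ENTRY.match(entry.strip())
        if m:
            ranks[m.group(1)] = (m.group(2), float(m.group(3)))
    return ranks

def species_call(line):
    """Return (query label, (species, confidence)), or None for the species
    if the prediction has no s: entry."""
    fields = line.rstrip("\n").split("\t")
    label, prediction = fields[0], fields[1]
    return label, parse_prediction(prediction).get("s")

# Hypothetical example line; the names and values are invented.
example = "Otu1\td:Bacteria(1.0000),g:Genus_x(0.9500),s:Genus_x_species_y(0.6100)\t+"
label, species = species_call(example)
print(label, species)
```

A query whose best-matching reference has no species annotation would come back with None for the species under these assumptions.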

You should be skeptical of species predictions because it is possible, probably common, to get a high bootstrap value for an incorrect prediction. This happens with genus too, but the problem is worse for species.

If you're using short tags like V4, then it often happens that two or more species have identical tag sequences, making it impossible to identify which species you're looking at. This scenario might not be detectable from the database, because the vast majority of species have not been named by taxonomists and do not appear in the RDP training set, so there could be a novel species with an identical sequence. In other words, the reference database is sparse: it has missing data, lots of missing data.
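The part of this problem that is detectable can be checked directly. The sketch below groups reference species by tag sequence and reports tags shared by more than one species. The function name, the species labels, and the short sequences are invented for illustration; in practice the tags would be extracted from your reference database with your primers.

```python
from collections import defaultdict

def ambiguous_tags(tags_by_species):
    """Return {tag sequence: sorted species names} for tags shared by 2+ species."""
    species_by_tag = defaultdict(set)
    for species, tag in tags_by_species.items():
        species_by_tag[tag.upper()].add(species)
    return {tag: sorted(names)
            for tag, names in species_by_tag.items()
            if len(names) > 1}

# Toy data: species_A and species_B have identical tags.
toy_tags = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTACGTAC",
    "species_C": "ACGTTCGTAC",
}
shared = ambiguous_tags(toy_tags)
print(shared)
```

An empty result does not show that a tag is species-specific. Because the database is sparse, an unnamed species with the same tag would not appear in this check.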

If you use a "top-hit" classifier like SINTAX or RDP with a sparse reference database, then you get a problem with over-classification, as shown in the figure below (taken from the SINTAX paper). This is what happens. Suppose the top hit for your query sequence has 95% identity; at that identity the query probably belongs to a different species from the top hit. Now suppose the second-best hit has much lower identity, say 90%. The bootstrapping in SINTAX and RDP repeatedly takes a sub-sample of the words (8-mers) in the query sequence and checks which taxonomy is the top hit when only that subset is considered. If there is a big drop in identity between the top hit and the second-best hit, then you will get the same top hit every time even if it has relatively low identity, and the result is a high confidence for all ranks in the taxonomy.

This is most obvious when the genus is a singleton, i.e. has only one sequence in the database (call it S). Then S is very likely to be the top hit for any species in that genus, even under sub-sampling of the 8-mers, so if your query belongs to the same genus as S you will get a spuriously high confidence that it is also the same species as S. About half (1,157 / 2,472, or 47%) of the genera in the RDP training set v16 are singletons, so this is a significant concern.
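The effect can be reproduced with a toy simulation. The Python sketch below is a simplified stand-in for the bootstrap described above, not the SINTAX or RDP implementation. The sub-sample size (32 words), the number of iterations (100), the random 250-base sequences, and all function names are assumptions chosen for illustration. The query differs from S at 12 of 250 positions (about 95% identity), and the only other reference differs from the query at 25 positions (90% identity).

```python
import random
from collections import Counter

K = 8  # word length, as in the text

def words(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mutate(seq, n_changes, rng):
    """Return a copy of seq with exactly n_changes substituted positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_changes):
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)

def bootstrap_top_hit(query, refs, iterations=100, subsample=32, rng=None):
    """Return (most frequent top hit, fraction of iterations it was the top hit)."""
    if rng is None:
        rng = random.Random(0)
    query_words = sorted(words(query))
    ref_words = {name: words(seq) for name, seq in refs.items()}
    wins = Counter()
    for _ in range(iterations):
        # Sub-sample query words with replacement.
        sample = [rng.choice(query_words) for _ in range(subsample)]
        # Top hit = reference sharing the most sampled words.
        best = max(ref_words,
                   key=lambda name: sum(w in ref_words[name] for w in sample))
        wins[best] += 1
    name, count = wins.most_common(1)[0]
    return name, count / iterations

rng = random.Random(1)
S = "".join(rng.choice("ACGT") for _ in range(250))  # singleton genus reference
query = mutate(S, 12, rng)       # about 95% identical to S
other = mutate(query, 25, rng)   # 90% identical to the query
refs = {"S (singleton genus)": S, "other genus": other}

top_name, confidence = bootstrap_top_hit(query, refs, rng=rng)
print(top_name, confidence)
```

Under these assumptions, S should be the top hit in most iterations, so the reported confidence is high even though a query at about 95% identity is probably a different species from S.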