UTAX or SINTAX? Which taxonomy database?

Use a small database with authoritative classifications
The database is the easier decision: I believe it's clear you should use a small set of authoritatively classified sequences, e.g. for 16S that could be the RDP training set or the SILVA LTP subset.

With the big databases (SILVA, Greengenes, RDP), you have the problem of unknown rates of annotation error and ambiguous blank names. Is a blank name a high-confidence prediction that the group is not named, or is it probably a known name with a confidence that is too low to annotate but high enough that you would want to know about it, say P=0.8? It could be either.
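To make the blank-name ambiguity concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `annotate` function, the lineage names, the confidences and the 0.9 cutoff are made up, and this is not how any of the big databases actually assigns its annotations. It only shows that two quite different situations can end up with the same blank.

```python
# Toy illustration only: names, confidences and the cutoff are hypothetical.
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]

def annotate(lineage, cutoff):
    """Return one name per rank, with a blank ("") where nothing is reported.

    lineage: list of (name, confidence) pairs, one per rank. name is None
             when the group genuinely has no name at that rank.
    """
    names = []
    for name, confidence in lineage:
        if name is None or confidence < cutoff:
            names.append("")  # blank: either unnamed or below the cutoff
        else:
            names.append(name)
    return names

shared = [("Bacteria", 1.0), ("Firmicutes", 1.0), ("Bacilli", 1.0),
          ("Bacillales", 1.0), ("Bacillaceae", 0.99)]

# Case 1: high confidence that the genus has no name.
unnamed_genus = shared + [(None, 0.99)]
# Case 2: probably a known genus, but P=0.8 is below the cutoff.
low_conf_genus = shared + [("Bacillus", 0.8)]

a = annotate(unnamed_genus, cutoff=0.9)
b = annotate(low_conf_genus, cutoff=0.9)

for rank, x, y in zip(RANKS, a, b):
    print(rank, repr(x), repr(y))
print("identical annotations:", a == b)
```

Once only the names are stored, there is no way to tell the two cases apart.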

The full RDP database is definitely not a good choice because the taxonomies were predicted by the RDP Classifier, which has a high rate of over-classification errors on full-length sequences (see SINTAX paper). If you want predictions using Bergey's nomenclature, then I would recommend using the RDP training set with SINTAX or UTAX.

Bottom line: when you add a second layer of taxonomy prediction (your sequences) on top of ambiguous or error-prone predictions in a big database, the results are hard or impossible to interpret in a meaningful way.

UTAX and SINTAX have different strengths and weaknesses
SINTAX is brand new so I don't have much experience with it yet (this was written just after version 9 was released). On short 16S tags like V4, SINTAX and the RDP Classifier have very similar performance. On longer 16S sequences and on ITS sequences, SINTAX is better than the RDP Classifier. SINTAX is simpler because it doesn't need training, while training UTAX or the RDP Classifier is quite challenging if you want to use your own database. UTAX is the only algorithm that tries to account for sparse reference data, and it has the lowest over-classification rate of any algorithm (except possibly the k-nearest-neighbor method in mothur, but knn has low sensitivity in general). However, UTAX sometimes has lower sensitivity than SINTAX to known taxa. Neither algorithm is a clear winner over the other.
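The trade-off between sensitivity and over-classification can be sketched with a toy calculation. The definitions below are simplified assumptions for illustration, not necessarily the exact ones used in the SINTAX paper: sensitivity is taken as the fraction of queries whose true name is in the reference that get the right name, and the over-classification rate as the fraction of queries whose true name is missing from the reference that get a name anyway. The two "classifiers" and all the data are hypothetical and do not represent measured UTAX or SINTAX results.

```python
# Toy sketch under assumed, simplified definitions; all data is made up.

def evaluate(queries, reference_names):
    """Return (sensitivity, over_classification_rate) for a list of queries.

    Each query is a dict with "true" (the real name) and "predicted"
    (the predicted name, or None when the classifier declines to predict).
    """
    known = [q for q in queries if q["true"] in reference_names]
    novel = [q for q in queries if q["true"] not in reference_names]

    correct = sum(1 for q in known if q["predicted"] == q["true"])
    over = sum(1 for q in novel if q["predicted"] is not None)

    sensitivity = correct / len(known) if known else float("nan")
    over_rate = over / len(novel) if novel else float("nan")
    return sensitivity, over_rate

reference_names = {"Bacillus", "Escherichia"}
true_names = ["Bacillus", "Escherichia", "Bacillus", "NovelGenusA", "NovelGenusB"]

# A cautious classifier: misses one known genus, never names a novel one.
cautious = ["Bacillus", None, "Bacillus", None, None]
# An eager classifier: finds every known genus, but names one novel query.
eager = ["Bacillus", "Escherichia", "Bacillus", "Bacillus", None]

def as_queries(predictions):
    return [{"true": t, "predicted": p} for t, p in zip(true_names, predictions)]

cautious_sens, cautious_over = evaluate(as_queries(cautious), reference_names)
eager_sens, eager_over = evaluate(as_queries(eager), reference_names)

print("cautious:", cautious_sens, cautious_over)
print("eager:   ", eager_sens, eager_over)
```

In this toy data the cautious classifier wins on over-classification and the eager one wins on sensitivity, which is the sense in which neither is a clear winner.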