Use a small database with authoritative classifications
The database is the easier decision: I believe you should use a small set
of authoritatively classified sequences. For 16S, that could be the RDP
training set or the SILVA LTP subset.
With the big databases (SILVA, Greengenes, RDP), you have two problems:
unknown rates of annotation error, and blank names that are ambiguous. Is a
blank name a high-confidence prediction that the group is not named, or is
it probably a known name whose confidence is too low to annotate but high
enough that you would want to know about it, say P=0.8? It could be either.
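To make the blank-name ambiguity concrete, here is a minimal Python sketch. It assumes a database curator blanks every rank below some confidence cutoff; the `annotate` function, the cutoff values and the example lineage are all hypothetical and are not taken from any real database or tool.

```python
from typing import List, Tuple

# Hypothetical lineage: (rank, name, confidence) triples, highest rank first.
# The names and confidence values are invented for illustration only.
Lineage = List[Tuple[str, str, float]]


def annotate(lineage: Lineage, cutoff: float) -> List[Tuple[str, str]]:
    """Keep names until the first rank whose confidence is below the cutoff;
    that rank and every rank below it get a blank name."""
    result = []
    confident = True
    for rank, name, conf in lineage:
        if confident and conf < cutoff:
            confident = False
        result.append((rank, name if confident else ""))
    return result


example = [
    ("family", "Lachnospiraceae", 0.99),
    ("genus", "Blautia", 0.80),
]

# Same underlying prediction, two different (assumed) curation cutoffs.
strict = annotate(example, cutoff=0.9)   # genus name is blanked
lenient = annotate(example, cutoff=0.8)  # genus name is kept
print(strict)
print(lenient)
```

With the stricter cutoff the genus comes out blank even though the underlying prediction had P=0.8, and someone reading the database afterwards cannot tell that blank apart from "this group has no genus name".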
The full RDP database is definitely not a good choice because its
taxonomies were predicted by the RDP Classifier, which has a high rate of
over-classification errors on full-length sequences (see the SINTAX paper).
If you want predictions using Bergey's nomenclature, then I would recommend
using the RDP training set with SINTAX or UTAX.
Bottom line: when you add a second layer of taxonomy prediction (your
sequences) on top of ambiguous or error-prone predictions in a big
database, the results are hard, if not impossible, to interpret in a
meaningful way.
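A rough way to see why layering hurts, as a sketch only: assume a prediction for your sequence is correct only if the reference annotation it relies on is correct and the classification step is also correct, and assume (simplistically) that the two are independent. The function name and all numbers below are hypothetical.

```python
def combined_accuracy(p_reference: float, p_classifier: float) -> float:
    """Probability that a layered prediction is correct when both the
    reference annotation and the classification step must be correct,
    under the simplifying assumption that the two are independent."""
    return p_reference * p_classifier


# Hypothetical numbers: the reference accuracy is unknown in practice,
# so try a range of values with a fixed classifier accuracy of 0.90.
for p_reference in (0.95, 0.80, 0.60):
    print(p_reference, round(combined_accuracy(p_reference, 0.90), 3))
```

Because the error rate of the big database is unknown, you do not know which of these rows applies to your data, which is what makes the final predictions hard to interpret.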
UTAX and SINTAX have different strengths and weaknesses
SINTAX is brand new so I don't have
much experience with it yet (this was written just after version 9 was
released). On short 16S tags like V4, SINTAX and RDP have very similar
performance. On longer 16S sequences and on ITS sequences, SINTAX is better
than RDP. SINTAX is simpler because it doesn't need training, while training
UTAX or RDP is quite challenging if you want to use your own database. UTAX
is the only algorithm that tries to account for sparse reference data, and it
has the lowest over-classification rate of any algorithm (except possibly
the k-nearest-neighbor method in mothur, but knn has low sensitivity in
general). However, UTAX is sometimes less sensitive than SINTAX to known
taxa. Neither algorithm is a clear winner over the other.