TAXXI dataset names
FASTA filenames that start with sp_ have species annotations; in all other
files the lowest rank is genus. Both variants are needed to test species
predictions by methods that give different predictions when trained on
different ranks, e.g. the RDP Classifier, for which genus is the default
training rank for 16S.
To ensure compatibility with as many methods
as possible, all taxonomy annotations in a given dataset have exactly the
same set of ranks (no missing ranks and no intermediate ranks such as
sub-order).
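
The "same set of ranks" property can be checked mechanically. The following is a minimal sketch only. It assumes annotations are comma-separated rank:name pairs such as d:Bacteria,p:Firmicutes, which is an assumption about the annotation syntax and is not stated on this page. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def rank_letters(taxonomy: str) -> Tuple[str, ...]:
    """Return the rank prefixes, in order, from a string like 'd:Bacteria,p:Firmicutes'.

    The 'rank:name' comma-separated layout is an assumed format.
    """
    return tuple(field.split(":", 1)[0] for field in taxonomy.split(",") if field)


def inconsistent_annotations(taxonomies: Iterable[str]) -> List[str]:
    """Return the annotations whose rank sequence differs from the first one seen."""
    expected = None
    bad = []
    for tax in taxonomies:
        ranks = rank_letters(tax)
        if expected is None:
            expected = ranks  # the first annotation defines the reference ranks
        elif ranks != expected:
            bad.append(tax)
    return bad
```

An empty result means every annotation carries the same ranks in the same order. An annotation with a missing rank or an extra intermediate rank would be returned in the list.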
rdp_16s = RDP 16S training set v16
ncbi_16s = BLAST16S (BLAST 16S database)
ten_16s = BLAST16S/10 (BLAST 16S database with max 10 sequences per genus)
rdp_its = WITS (RDP Warcup training set v2)
Filenames are formatted as setname.pctid or setname_segment.pctid, where
the _segment suffix is _v35 or _v4 to indicate the hypervariable region(s).
If no segment is given, the sequences are full-length.