TAXXI dataset names

FASTA filenames that start with sp_ have species annotations; in all other files the lowest rank is genus. Both variants are needed to test species predictions of methods that give different predictions when trained on different ranks, e.g. the RDP Classifier, for which genus is the default training rank for 16S.

To ensure compatibility with as many methods as possible, all taxonomy annotations in a given dataset have exactly the same set of ranks (no missing ranks and no intermediate ranks such as sub-order).
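The same-ranks property can be checked mechanically. The following is a minimal sketch, not part of the TAXXI distribution. It assumes, purely for illustration, that each annotation is a comma-separated string of "code:name" fields such as "d:Bacteria,g:Bacillus"; the function names rank_codes and has_uniform_ranks are hypothetical.

```python
def rank_codes(annotation):
    """Return the rank codes of one annotation, in order.

    Assumed (hypothetical) format: comma-separated "code:name" fields,
    e.g. "d:Bacteria,p:Firmicutes,g:Bacillus".
    """
    return tuple(field.split(":", 1)[0] for field in annotation.split(",") if field)


def has_uniform_ranks(annotations):
    """Return True if every annotation has exactly the same ranks in the same order."""
    distinct = {rank_codes(a) for a in annotations}
    # Zero or one distinct rank tuple means no missing and no extra ranks.
    return len(distinct) <= 1
```

An annotation that skips a rank or adds an intermediate rank produces a different tuple of codes, so the check fails for that dataset.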

rdp_16s = RDP 16S training set v16
ncbi_16s = BLAST16S (BLAST 16S database)
ten_16s = BLAST16S/10 (BLAST 16S database with max 10 sequences per genus)
rdp_its = WITS (RDP Warcup training set v2)
 
Filenames are formatted as setname.pctid or setname_segment.pctid, where segment is v35 or v4 (giving the suffix _v35 or _v4) to indicate the hypervariable region(s). If no segment is given, the sequences are full-length.
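The naming scheme can be decoded with a few string operations. The sketch below is illustrative only. It assumes that the sp_ prefix is prepended to the same setname pattern, that the filename has no further extension after pctid, and that pctid can be kept as an uninterpreted string. The function name parse_dataset_name and the example filenames are hypothetical.

```python
# Set names and descriptions as listed above.
SET_NAMES = {
    "rdp_16s": "RDP 16S training set v16",
    "ncbi_16s": "BLAST16S (BLAST 16S database)",
    "ten_16s": "BLAST16S/10 (BLAST 16S database with max 10 sequences per genus)",
    "rdp_its": "WITS (RDP Warcup training set v2)",
}
SEGMENTS = ("v35", "v4")


def parse_dataset_name(filename):
    """Split a dataset filename into set name, segment, pctid and species flag."""
    stem, _, pctid = filename.rpartition(".")
    if not stem or not pctid:
        raise ValueError(f"expected setname.pctid, got {filename!r}")

    # Assumption: species-annotated files carry sp_ in front of the set name.
    has_species = stem.startswith("sp_")
    if has_species:
        stem = stem[len("sp_"):]

    # An optional _v35 or _v4 suffix names the hypervariable region(s).
    segment = None
    for candidate in SEGMENTS:
        suffix = "_" + candidate
        if stem.endswith(suffix):
            segment = candidate
            stem = stem[: -len(suffix)]
            break

    if stem not in SET_NAMES:
        raise ValueError(f"unknown set name {stem!r} in {filename!r}")

    return {
        "set": stem,
        "description": SET_NAMES[stem],
        "segment": segment,  # None means full-length sequences
        "pctid": pctid,      # kept as a string, not interpreted
        "species": has_species,
    }
```

For example, parse_dataset_name("sp_rdp_16s_v4.90") would report the set rdp_16s, segment v4, pctid "90" and species annotations, while a name with no segment suffix is reported as full-length.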