Calculates lowest common rank (LCR) probabilities from a distance matrix for sequences with taxonomy annotations.
The distance matrix can be calculated using the calc_distmx command.
Output is written to a tabbed text file specified by the -tabbedout option.
Weighting
By default, probabilities are calculated
assuming that pairs of sequences are selected by choosing an entry in the
distance matrix at random with uniform probability. This method can have
substantial taxonomic bias because reference databases often have highly
over-represented species and genera: common pathogens such as E. coli may
have many sequences while other species or genera have only a single
sequence, and many unnamed genera are absent entirely. This bias can be
mitigated by weighting. For example, if weighting is done at genus rank,
sequences are selected by first choosing a genus with uniform
probability (with replacement), then selecting a sequence from that genus
(again, with uniform probability and with replacement). Weighting is specified
by the -weight_rank option; the value of the option is a single letter
representing the rank, e.g. s for species or g for genus (see taxonomy annotations
for valid letters).
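The difference between the two selection schemes can be illustrated with a short Python sketch. This is not how usearch is implemented; it only simulates the sampling described above on a made-up dataset. The sequence labels, the genus_of mapping, and the function names are all hypothetical, and the sketch draws single sequences rather than pairs to keep it short.

```python
import random
from collections import defaultdict


def group_by_rank(taxon_of):
    """Map each taxon name to the list of sequence labels assigned to it."""
    groups = defaultdict(list)
    for label, taxon in taxon_of.items():
        groups[taxon].append(label)
    return dict(groups)


def sample_unweighted(labels, rng):
    """Pick one sequence uniformly from all sequences (no weighting)."""
    return rng.choice(labels)


def sample_weighted(groups, taxa, rng):
    """Pick a taxon uniformly, then pick a sequence uniformly within it."""
    taxon = rng.choice(taxa)
    return rng.choice(groups[taxon])


# Hypothetical toy database: one genus with nine sequences, one with a single sequence.
genus_of = {f"ecoli_{i}": "Escherichia" for i in range(9)}
genus_of["bsub_0"] = "Bacillus"

labels = sorted(genus_of)
groups = group_by_rank(genus_of)
taxa = sorted(groups)

rng = random.Random(1)  # fixed seed so repeated runs give the same draws
n = 10000

# Fraction of draws that land in the under-represented genus under each scheme.
unweighted = sum(
    genus_of[sample_unweighted(labels, rng)] == "Bacillus" for _ in range(n)
) / n
weighted = sum(
    genus_of[sample_weighted(groups, taxa, rng)] == "Bacillus" for _ in range(n)
) / n

print(f"unweighted fraction Bacillus: {unweighted:.3f}")
print(f"genus-weighted fraction Bacillus: {weighted:.3f}")
```

In this toy setup the unweighted fraction should be close to 1/10, because nine of the ten sequences belong to one genus, while the genus-weighted fraction should be close to 1/2, because each genus is equally likely to be chosen first.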
Example
usearch -calc_distmx tax_16s.fa -maxdist 0.2 -termdist 0.3 -tabbedout distmx_16s.txt
usearch -calc_lcr_probs distmx_16s.txt -weight_rank g -tabbedout lcr_16s.txt