 New in v11 

calc_lcr_probs command

Calculates lowest common rank (LCR) probabilities from a distance matrix for sequences with taxonomy annotations.
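As a rough illustration of the term, the sketch below finds the lowest common rank of two annotated labels. It assumes that LCR means the lowest rank at which both sequences carry the same taxon name, that annotations look like tax=d:Bacteria,...,g:Escherichia; and that ranks are ordered d, k, p, c, o, f, g, s. The function names are hypothetical and this is not how usearch itself is implemented.

```python
# Rank letters from highest to lowest. This ordering is an assumption made
# for illustration; check the taxonomy annotation docs for the valid letters.
RANKS = "dkpcofgs"


def parse_tax(label):
    """Return {rank_letter: name} parsed from a 'tax=...;' field in a label."""
    start = label.find("tax=")
    if start < 0:
        return {}
    # The field runs from just after 'tax=' to the next semicolon.
    field = label[start + 4:].split(";", 1)[0]
    tax = {}
    for item in field.split(","):
        rank, _, name = item.partition(":")
        if rank and name:
            tax[rank] = name
    return tax


def lowest_common_rank(label_a, label_b):
    """Return the lowest rank letter where both labels share a name, or None."""
    tax_a, tax_b = parse_tax(label_a), parse_tax(label_b)
    lcr = None
    for rank in RANKS:
        # Later (lower) ranks overwrite earlier ones, so the last match wins.
        if rank in tax_a and tax_a[rank] == tax_b.get(rank):
            lcr = rank
    return lcr
```

Two sequences annotated to different species of the same genus would have LCR g under these assumptions; two sequences that share only a domain would have LCR d.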

The distance matrix can be calculated using the calc_distmx command.

Output is written to a tabbed text file specified by the -tabbedout option.

By default, probabilities are calculated assuming that pairs of sequences are selected by choosing an entry in the distance matrix at random with uniform probability. This method can have substantial taxonomic bias because reference databases often contain highly over-represented species and genera. For example, common pathogens such as E. coli may have many sequences while other species or genera have only one, and many unnamed genera are absent altogether.

This bias can be mitigated by weighting. With weighting at genus rank, for example, each sequence is selected by first choosing a genus with uniform probability (with replacement), then choosing a sequence from that genus (again with uniform probability and with replacement). Weighting is specified by the -weight_rank option; its value is a single letter representing the rank, e.g. s for species or g for genus (see taxonomy annotations for valid letters).
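The two-stage selection described above can be sketched as follows. This is a minimal illustration of the sampling idea only, not the usearch implementation. For simplicity the unweighted picker draws single sequences rather than distance matrix entries, and all function names and the toy data in the usage note are hypothetical.

```python
import random
from collections import defaultdict


def group_by_taxon(taxon_of):
    """Group sequence ids by taxon name at the weighting rank (e.g. genus)."""
    groups = defaultdict(list)
    for seq_id, taxon in taxon_of.items():
        groups[taxon].append(seq_id)
    return dict(groups)


def pick_unweighted(taxon_of, rng):
    """Pick a sequence uniformly, so large taxa dominate the draws."""
    return rng.choice(sorted(taxon_of))


def pick_weighted(groups, rng):
    """Pick a taxon uniformly, then a sequence uniformly within that taxon."""
    taxon = rng.choice(sorted(groups))
    return rng.choice(groups[taxon])
```

With a toy database of 98 Escherichia sequences plus one Bacillus and one Vibrio sequence, pick_unweighted returns an Escherichia sequence about 98% of the time, while pick_weighted returns one about a third of the time, because each of the three genera is equally likely to be chosen first.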


usearch -calc_distmx tax_16s.fa -maxdist 0.2 -termdist 0.3 -tabbedout distmx_16s.txt

usearch -calc_lcr_probs distmx_16s.txt -weight_rank g -tabbedout lcr_16s.txt

References (please cite)
R.C. Edgar (2018), Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences, PeerJ 6:e4652
  • Cross-validation by identity, novel benchmark strategy enabling realistic accuracy estimates

  • Genus accuracy of best methods is 50% on V4 sequences

  • Recent algorithms do not improve on RDP Classifier or SINTAX

R.C. Edgar (2018), Taxonomy annotation and guide tree errors in 16S rRNA databases, PeerJ 6:e5030
  • Approx. one in five SILVA and Greengenes taxonomy annotations are wrong

  • SILVA and Greengenes trees have pervasive conflicts with type strain taxonomies

R.C. Edgar (2017), Updating the 97% identity threshold for 16S ribosomal RNA OTUs, Bioinformatics 34(14) 2371-2375
  • Standard 97% OTU identity threshold is too low

  • Optimal OTU threshold is 99% for full-length 16S, 100% for V4