
 New in v11 

distmx_split_identity command

The distmx_split_identity command divides sequences into subsets such that the top-hit identity between the subsets falls in a given range. This is used to create test-training pairs for cross-validation by identity. Input is a distance matrix created by the calc_distmx command.

The range of allowed top-hit identities is set by the -mindist and -maxdist options, both of which must be specified. Identities are given as distances, in the same form as they appear in the matrix. For example, -mindist 0.025 -maxdist 0.035 specifies an identity range from 97.5% down to 96.5%.
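The relationship between the distance options and percent identity can be sketched as follows. This assumes that distance is one minus the fractional identity, which is consistent with the example above. The function names are illustrative and are not part of USEARCH.

```python
def dist_to_pct_id(dist):
    """Convert a fractional distance to percent identity.

    Assumption: distance = 1 - fractional identity, as in the example
    -mindist 0.025 -maxdist 0.035 (97.5% to 96.5%).
    """
    return round(100.0 * (1.0 - dist), 6)


def pct_id_to_dist(pct_id):
    """Convert percent identity to a fractional distance (same assumption)."""
    return round(1.0 - pct_id / 100.0, 6)


def identity_range(mindist, maxdist):
    """Return (lowest, highest) percent identity for a -mindist/-maxdist pair."""
    return dist_to_pct_id(maxdist), dist_to_pct_id(mindist)
```

Note that the smaller distance gives the higher identity, so -mindist sets the upper end of the identity range and -maxdist sets the lower end.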

Output is a tab-separated text file, named by the -tabbedout option. Fields are:

#1 Subset name
#2 Label1
#3 Label2
#4 Dist

There are four subsets with names 1, 2, 1x and 2x. Label1 is the label of a sequence in the subset given by #1. Label2 is the top hit in the other subset (1 or 2), and Dist is the distance between Label1 and Label2.
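As a rough sketch, the output file could be read and grouped by subset name as shown below. It assumes four tab-separated fields per line in the order listed above and no header line. The function name and the example records are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def read_subsets(handle):
    """Group records by subset name (1, 2, 1x or 2x).

    Assumes four tab-separated fields per line (subset, Label1, Label2,
    Dist) and no header line; shorter lines are skipped.
    """
    subsets = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # blank or malformed line
        name, label1, label2, dist = row[:4]
        subsets[name].append((label1, label2, float(dist)))
    return subsets


# Made-up records, for illustration only.
example = (
    "1\tseqA\tseqB\t0.030\n"
    "2\tseqB\tseqA\t0.030\n"
    "1x\tseqC\tseqB\t0.080\n"
)
subsets = read_subsets(io.StringIO(example))
```

For a real run, the same function could be given an open file, for example `open("subsets.txt", newline="")`.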

Subsets 1 and 2 have top hits to each other in the specified range.

Subset 1x has lower identities with subset 2, and can therefore be added to the training set if subset 2 is the query set. Similarly, subset 2x has lower identities with subset 1 and can be added to the training set if subset 1 is the query.
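The rule for combining subsets into a training set and a query set might be expressed as in the sketch below. The input is assumed to be a dictionary from subset name to a list of sequence labels. The function name and that data layout are hypothetical, chosen only to illustrate the rule described above.

```python
def split_labels(labels_by_subset, query="2"):
    """Return (training_labels, query_labels) as sets.

    If subset 2 is the query, the training set is subset 1 plus 1x;
    if subset 1 is the query, the training set is subset 2 plus 2x.
    """
    if query not in ("1", "2"):
        raise ValueError("query must be '1' or '2'")
    train = "1" if query == "2" else "2"
    query_labels = set(labels_by_subset.get(query, []))
    # The 'x' subset paired with the training subset is added to it.
    train_labels = set(labels_by_subset.get(train, []))
    train_labels |= set(labels_by_subset.get(train + "x", []))
    return train_labels, query_labels
```

With this layout, swapping the `query` argument between "1" and "2" gives the two test-training pairs for the chosen identity range.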


usearch -calc_distmx tax_16s.fa -maxdist 0.2 -termdist 0.3 -tabbedout distmx_16s.txt

usearch -distmx_split_identity distmx_16s.txt -mindist 0.025 -maxdist 0.035 \
  -tabbedout subsets.txt