Generate a distance matrix from an input file in FASTA or FASTQ format. See also calc_distmx_smallmem.
The distance matrix filename is specified by the -distmxout option.
The matrix format is specified by the -format option, which can be tabbed_pairs (default), square, phylip_square or phylip_lower_triangular. See distance matrix for details of these file formats.
Distance values range from zero (identical sequences) to one (no similarity), corresponding to the range 100% identity to 0% identity.
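The relationship between identity and distance can be sketched in a few lines of Python. This assumes the mapping is linear, i.e. distance = 1 - fractional identity, which matches the end points given above; the function names are illustrative and are not part of USEARCH.

```python
def identity_to_distance(identity_pct):
    """Map percent identity (0 to 100) to a distance in [0, 1].

    Assumes a linear mapping: distance = 1 - identity / 100.
    """
    if not 0.0 <= identity_pct <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    return 1.0 - identity_pct / 100.0


def distance_to_identity(distance):
    """Inverse mapping: distance in [0, 1] back to percent identity."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be between 0 and 1")
    return (1.0 - distance) * 100.0


print(identity_to_distance(100.0))  # identical sequences -> 0.0
print(identity_to_distance(0.0))    # no similarity -> 1.0
```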
Multithreading is supported.
Clusters can be generated from a distance matrix with the cluster_aggd command.
By default, pairs are prioritized by the U-sort heuristic as used in the USEARCH algorithm. This means that pairs are considered in decreasing order of the number of unique words (U) they have in common. Since U correlates with identity, this means that pairs are considered in approximately increasing order of distance. U-sorting can be turned off using the -nousort option. U-sorting plus additional heuristics used to find HSPs can all be disabled using the -distmx_brute option, which forces all pairs of sequences to be aligned. This is guaranteed to give a complete matrix, but can be much slower for large datasets. Note that low-identity pairs generally have little effect on clustering or tree topology, so the additional "accuracy" of a brute force calculation often has little biological value.
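The idea behind U-sorting can be illustrated with a toy Python sketch. It is not the USEARCH implementation, which uses an index and further heuristics; the function names, the example sequences and the word length of 3 are all assumptions chosen for illustration. The sketch counts the distinct words two sequences share (U) and lists pairs in decreasing order of U, dropping pairs with no word in common.

```python
from itertools import combinations


def unique_words(seq, wordlength):
    """Return the set of distinct words (k-mers) of the given length in seq."""
    return {seq[i:i + wordlength] for i in range(len(seq) - wordlength + 1)}


def u_sorted_pairs(seqs, wordlength=3):
    """Return (U, label1, label2) for pairs sharing at least one word,
    in decreasing order of U (the number of shared unique words)."""
    words = {label: unique_words(s, wordlength) for label, s in seqs.items()}
    pairs = []
    for a, b in combinations(seqs, 2):
        u = len(words[a] & words[b])
        if u > 0:  # pairs with no common word are never considered
            pairs.append((u, a, b))
    # Python's sort is stable, so ties keep their original order.
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs


# Made-up example sequences.
example_seqs = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTTTGGGG",
    "s4": "ACGTTTTT",
}
print(u_sorted_pairs(example_seqs))
# [(4, 's1', 's2'), (2, 's1', 's4'), (2, 's2', 's4'), (1, 's3', 's4')]
```

In this toy example the pairs s1/s3 and s2/s3 share no words and never appear, which is the situation a brute-force calculation avoids at the cost of aligning every pair.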
The -sparsemx_minid option gives the minimum identity which should be written to a matrix in tabbed_pairs format.
An identity threshold for terminating the calculation can be specified using the -termid option, which is in the range 0.0 to 1.0, where 1.0 means identical sequences (100% sequence id). This is a speed optimization that saves time by skipping alignments of low-identity pairs. If a pair is encountered with fractional identity < termid, the calculation is stopped. Because U-sorted order does not correlate perfectly with identity, you should set -termid somewhat lower than the minimum identity that you care about. For example, if you want all pairs with >80% id to appear in the matrix, then you might set -termid 0.7. Tests on small datasets can be used to tune -termid to a reasonable value. By default, -termid is set to 0 and the calculation continues for all pairs that have at least one word in common. The word length is set by the -wordlength option.
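The following Python sketch shows why the termination threshold should sit below the reporting threshold. It is an illustration only, not USEARCH code: the function name, the candidate list and its identity values are made up, and the pairs are assumed to arrive already in U-sorted order, which only approximates decreasing identity.

```python
def collect_pairs(candidates, termid=0.0, minid=0.0):
    """Walk candidate pairs in U-sorted order and keep those worth writing.

    candidates: iterable of (label1, label2, fractional_identity).
    termid:     stop as soon as a pair has identity below this value.
    minid:      only keep pairs with identity at or above this value.
    Returns a list of (label1, label2, distance).
    """
    kept = []
    for a, b, identity in candidates:
        if identity < termid:
            break  # termination: remaining pairs are never examined
        if identity >= minid:
            kept.append((a, b, 1.0 - identity))
    return kept


# Hypothetical U-sorted candidates; note that 0.78 comes before 0.85,
# because U-sorted order is not perfectly ordered by identity.
example_candidates = [
    ("a", "b", 0.95),
    ("a", "c", 0.78),
    ("b", "c", 0.85),
    ("a", "d", 0.60),
]

# Terminating at the same threshold as the one you care about loses b/c.
print(len(collect_pairs(example_candidates, termid=0.8, minid=0.8)))  # 1
# Terminating lower (0.7) recovers it, at the cost of one extra alignment.
print(len(collect_pairs(example_candidates, termid=0.7, minid=0.8)))  # 2
```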
Examples
usearch -calc_distmx seqs.fa -distmxout mx.txt -sparsemx_minid 0.8 -termid 0.7
usearch -calc_distmx seqs.fa -distmxout dist.tree -format phylip_lower_triangular