Cluster linkage

In agglomerative clustering, linkage specifies how the distance between two clusters is calculated. If the clustering is used to construct a tree, linkage determines the order internal nodes are created and hence the tree topology. If the clustering is used to find groups at a given distance threshold, linkage determines where the final cluster boundaries occur. In USEARCH, linkage is specified using the linkage option.

USEARCH supports three types of linkage in which the distance between two clusters is calculated by
considering pairs of sequences (one from each cluster), and taking the **
maximum**, **minimum** or **average** distance over all pairs.

If the final clusters are determined by a fixed distance threshold, e.g. 97% identity, then:

**Maximum linkage = complete linkage
Minimum linkage = single linkage**

**Maximum linkage** means that *all* pairs of
sequences in a cluster must be closer than the threshold. This means that
clusters tend to be smaller compared to single or average linkage, because a new
sequence must be close enough to all existing sequences, not just one as in the
case of single linkage. With complete linkage, the diameter of the cluster
(the maximum distance between a pair) cannot be larger than the distance threshold. Complete linkage is
equivalent to maximum
linkage and to finding the
cliques of a
graph where the edges include all pair-wise distances within the threshold.

**Minimum linkage** means that a sequence should be
included in a cluster if the distance to any other sequence is below the
threshold. This means that clusters tend to grow large as new sequences are
introduced, because a new sequence only needs to be close to one existing
sequence inside the cluster. This means that with single linkage there is no
upper bound on the diameter of a cluster.
Single linkage is equivalent to minimum linkage and to finding the
connected components of a graph where the edges include
all pair-wise distances within the threshold.

**Average linkage** is harder to visualize. In the figure
below, black lines represent distances below the threshold. The blue dots in the
figure for average linkage represent "average sequences" for the two existing
clusters {ABC} and {EF}, and the dotted line represents the distance between
them. The clusters are joined if this distance is less than the threshold. This
picture is designed to give an intuitive idea of how clusters are created, but
it should not be taken too seriously. In fact, the average is calculated over
all pairs. In this example, the average is (AE + AF + BE + BF + CE + CF)/6,
where XY means the distance between sequences X and Y. With average linkage,
clusters tend to be larger than maximum linkage and smaller than minimum
linkage. The diameter of a cluster can be larger than the threshold.

Using average linkage is equivalent to the UPGMA algorithm that was introduced for making OTUs in classical numerical taxonomy before the development of phylogenetic tree reconstruction methods such as Neighbor-Joining and next-generation clustering algorithms such as UPARSE. Today, UPGMA should be considered obsolete for OTU clustering of biological sequences.