cluster sizes

See also: smaller cluster sizes in v6.

In some applications, sequences are clustered in two or more passes by different USEARCH commands and/or by other programs. Sometimes the size of a cluster is needed in terms of the number of sequences that were provided to the first stage of the pipeline. For example, 16S reads might be dereplicated and then clustered at 97% by cluster_smallmem.

To handle multi-step clustering, USEARCH provides a mechanism to propagate cluster size annotations. If the -sizein option is specified, input sequences are required to have a size annotation. If the -sizeout option is specified, size annotations are added to the output labels. If both -sizein and -sizeout are given, then the output size for a cluster takes into account the input sizes. If the -sizein option is not specified, input sizes default to 1.
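The effect of these options on a single cluster can be traced with a small Python sketch. It is an illustration only, not USEARCH's implementation. It assumes that annotations appear in labels as a semicolon-delimited `size=N` field, that the output size is the sum of the members' input sizes, and that any old annotation on the centroid label is replaced. The helper names `input_size` and `output_label` are hypothetical.

```python
import re

# Assumed annotation format: a "size=N" field delimited by semicolons.
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")


def input_size(label, sizein):
    """Return the size one input sequence contributes to its cluster."""
    if not sizein:
        return 1  # without -sizein, every input counts as one sequence
    match = SIZE_RE.search(label)
    if match is None:
        # With -sizein, an annotation is required on every input label.
        raise ValueError(f"missing size annotation: {label!r}")
    return int(match.group(1))


def output_label(centroid_label, member_labels, sizein, sizeout):
    """Build the output label for one cluster (hypothetical helper)."""
    if not sizeout:
        return centroid_label
    total = sum(input_size(m, sizein) for m in member_labels)
    # Remove any existing annotation before appending the new one.
    base = SIZE_RE.sub(";", centroid_label).strip(";")
    return f"{base};size={total};"


# Two dereplicated reads that fall into the same cluster.
members = ["ReadA;size=6;", "ReadB;size=3;"]
with_sizein = output_label(members[0], members, sizein=True, sizeout=True)
without_sizein = output_label(members[0], members, sizein=False, sizeout=True)
print(with_sizein)     # ReadA;size=9;
print(without_sizein)  # ReadA;size=2;
```

With -sizein the cluster size reflects the nine original reads; without it, only the two input sequences are counted.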

Typical use is:

1. First clustering or dereplication step in the pipeline uses -sizeout.
2. Subsequent clustering steps use both -sizein and -sizeout.

If another program is used before the first USEARCH step, then it is up to you to write scripts to produce size annotations for USEARCH.
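As one possible starting point for such a script, the Python sketch below collapses identical sequences in a FASTA file and writes labels carrying a size annotation. It assumes the semicolon-delimited `size=N` label format, treats sequences as identical regardless of letter case, and keeps the first label seen for each unique sequence. The function names `read_fasta` and `write_with_sizes` are hypothetical, and a real upstream program may need different handling.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (label, sequence) pairs from a FASTA file handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def write_with_sizes(in_handle, out_handle):
    """Collapse identical sequences and write size-annotated labels."""
    counts = Counter()
    first_label = {}
    for label, seq in read_fasta(in_handle):
        seq = seq.upper()
        counts[seq] += 1
        first_label.setdefault(seq, label)  # keep the first label seen
    # most_common() orders by decreasing count; ties keep first-seen order.
    for seq, n in counts.most_common():
        out_handle.write(f">{first_label[seq]};size={n};\n{seq}\n")


# In-memory example; real use would pass open file handles instead.
raw = io.StringIO(">r1\nACGT\n>r2\nacgt\n>r3\nGGCC\n>r4\nACGT\n")
out = io.StringIO()
write_with_sizes(raw, out)
print(out.getvalue(), end="")
# >r1;size=3;
# ACGT
# >r3;size=1;
# GGCC
```

The resulting file can then be given to the first USEARCH step with -sizein, so that later sizes count the original sequences.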