

Non-redundant databases
A "non-redundant" (NR) database contains only one representative of a given type of sequence. Dereplication removes identical sequences. Clustering at a lower threshold, e.g. 90%, may reduce the database size, enabling faster searches with only a small loss in sensitivity. See also database optimization.
OTU construction
In marker gene metagenomics, reads of markers such as small-subunit ribosomal RNA genes (16S, 18S), the internal transcribed spacer (ITS) and cytochrome oxidase I (COI) are often clustered into groups called Operational Taxonomic Units (OTUs), typically at a 97% identity threshold. The UPARSE pipeline achieves the best throughput and highest published biological accuracy at the time of writing (Nature Methods, Aug 2013). The UCHIME algorithm can be used for stand-alone chimera filtering in an OTU pipeline. UCHIME is implemented in the uchime_ref and uchime_denovo commands.
Amplicon diversity
Clustering of amplicon reads, e.g. from 16S marker genes, antibody or T-cell receptor (TCR) immune system repertoire sequencing, can be used to estimate the biological diversity represented in the amplicons.


UCLUST is a general-purpose clustering algorithm that is significantly faster and more sensitive than CD-HIT and other alternative algorithms (see benchmarks). The UCLUST algorithm is implemented in the cluster_fast and cluster_smallmem commands.
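To make the idea of clustering at an identity threshold concrete, here is a minimal Python sketch of generic greedy centroid clustering. It is not UCLUST itself and does not reproduce the behaviour of cluster_fast or cluster_smallmem. The function names and the identity measure are assumptions made for illustration: identity here is a crude ungapped, left-aligned match fraction, whereas a real tool would compute identity from an alignment.

```python
def identity(a, b):
    """Crude stand-in for sequence identity (assumption: no alignment).

    Counts matching positions when both sequences are compared from
    their first letter, divided by the length of the longer sequence.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first centroid it matches.

    Sequences are processed in input order. A sequence that matches no
    existing centroid at or above the threshold becomes a new centroid.
    Returns a dict mapping each centroid to its list of members.
    """
    clusters = {}
    for s in seqs:
        for centroid in clusters:
            if identity(s, centroid) >= threshold:
                clusters[centroid].append(s)
                break
        else:
            # No centroid was close enough, so start a new cluster.
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAACC"]
    for centroid, members in greedy_cluster(reads, threshold=0.9).items():
        print(centroid, len(members))
```

Because the result depends on input order, sequences are usually sorted (for example by length or abundance) before a greedy pass of this kind, so that the preferred sequences become centroids.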
UPARSE
UPARSE is an algorithm for constructing OTUs from amplicon reads. A full implementation of UPARSE requires a pipeline which takes FASTQ reads and generates clusters. The cluster_otus command performs the clustering step after quality filtering and length trimming of the reads.
Dereplication
Dereplication reports one copy of every unique sequence in the input data. This is a special case of clustering at 100% identity, which can be implemented more efficiently using specialized algorithms. USEARCH supports both full-length and prefix dereplication, which are implemented in the derep_fulllength and derep_prefix commands, respectively.
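The two kinds of dereplication can be illustrated with a short Python sketch. This is not the USEARCH implementation, and the function names simply mirror the command names for readability. The sketch assumes that full-length dereplication groups sequences that are identical over their whole length, and that prefix dereplication additionally discards any sequence that is a prefix of a longer one. It does not track which longer sequence a prefix is merged into.

```python
from collections import Counter


def derep_fulllength(seqs):
    """Return (sequence, count) pairs for each unique sequence.

    Case is ignored. Results are sorted by decreasing count, with ties
    broken alphabetically so the output is deterministic.
    """
    counts = Counter(s.upper() for s in seqs)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def derep_prefix(seqs):
    """Return unique sequences that are not a prefix of another sequence.

    After lexicographic sorting, a sequence is a prefix of some other
    sequence exactly when its immediate successor starts with it, so a
    single pass over the sorted list is enough.
    """
    uniq = sorted({s.upper() for s in seqs})
    kept = []
    for i, s in enumerate(uniq):
        is_last = i + 1 == len(uniq)
        if is_last or not uniq[i + 1].startswith(s):
            kept.append(s)
    return kept


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ACG", "TTGA", "acgt"]
    print(derep_fulllength(reads))
    print(derep_prefix(reads))
```

Both functions avoid all-against-all comparison, using hashing in the first case and sorting in the second, which is why exact dereplication can be much cheaper than general clustering.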