Dereplication

Dereplication is the identification of unique sequences so that only one copy of each sequence is reported. USEARCH supports dereplication with the derep_fulllength and derep_prefix commands.

Different definitions of duplicate sequences are possible, as shown in the figure.

The full-length definition is the easiest to understand and also the easiest to implement in an algorithm: if two or more sequences are identical, all except one are discarded, for example:

A = GATTACA
B = GATTACA
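
A minimal Python sketch of the full-length definition, assuming the sequences are plain strings held in memory. The function name dereplicate_full_length is hypothetical, and this illustrates the idea only; it is not the implementation used by derep_fulllength.

```python
def dereplicate_full_length(seqs):
    """Keep the first occurrence of each distinct sequence, in input order."""
    seen = set()      # sequences already reported
    unique = []
    for s in seqs:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique


# A and B from the example above are identical, so only one copy is kept.
print(dereplicate_full_length(["GATTACA", "GATTACA", "GATTAC"]))
# ['GATTACA', 'GATTAC']
```

Note that GATTAC survives here: under the full-length definition a shorter sequence is never a duplicate of a longer one.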

With prefixes, a sequence A is discarded if it is a prefix of some other sequence B in the set, for example:

A = GATTAC
B = GATTACA
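
A minimal Python sketch of the prefix definition, again assuming plain strings in memory. The function name dereplicate_prefix is hypothetical, and the sort-based approach is my choice for illustration, not a description of how derep_prefix works internally. It relies on the fact that, after sorting distinct strings lexicographically, a sequence that is a prefix of any other sequence is also a prefix of its immediate successor.

```python
def dereplicate_prefix(seqs):
    """Discard every sequence that is a prefix of another sequence in the set.

    Exact duplicates are collapsed first. Results come back in sorted order,
    not input order.
    """
    ordered = sorted(set(seqs))
    kept = []
    for i, s in enumerate(ordered):
        is_last = (i + 1 == len(ordered))
        # Only the next sequence in sorted order needs to be checked.
        if is_last or not ordered[i + 1].startswith(s):
            kept.append(s)
    return kept


# GAT and GATTAC are prefixes of GATTACA, so both are discarded.
print(dereplicate_prefix(["GATTAC", "GATTACA", "GATTACA", "GAT", "CATT"]))
# ['CATT', 'GATTACA']
```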

With substrings, a sequence A is discarded if it is a substring of some other sequence B, for example:

A = -ATTAC-
B = GATTACA
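
A minimal Python sketch of the substring definition, assuming the dashes in the example are only alignment padding, so that A is the sequence ATTAC. The function name dereplicate_substring is hypothetical. The comparison is quadratic in the number of sequences, so this is suitable only for small sets.

```python
def dereplicate_substring(seqs):
    """Discard every sequence that is a substring of another sequence in the set.

    Sequences are visited longest first, so each one only needs to be
    compared against the sequences already kept.
    """
    kept = []
    for s in sorted(seqs, key=len, reverse=True):
        if not any(s in k for k in kept):
            kept.append(s)
    return kept


# ATTAC and GATTAC are both contained in GATTACA; CCC is not.
print(dereplicate_substring(["ATTAC", "GATTACA", "GATTAC", "CCC"]))
# ['GATTACA', 'CCC']
```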

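All prefixes are substrings, and full-length matches are both prefixes and substrings. So any sequence discarded under the full-length definition is also discarded under the prefix definition, and any sequence discarded under the prefix definition is also discarded under the substring definition. In terms of the number of matches found:

substrings >= prefixes >= full-length matches.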

USEARCH supports full-length and prefix dereplication, but currently not substring dereplication. In next-generation sequencing applications, prefix dereplication is useful for quality-trimmed reads, which are often truncated at their right-hand end, where quality scores tend to fall, but rarely if ever at their left-hand end. So far, I haven't come across a case where substring dereplication is needed; if you know of one, please let me know.