Home Software Services About Contact usearch manual
Dereplication
 
Dereplication is the identification of unique sequences so that only one copy of each sequence is reported. USEARCH supports dereplication with the derep_fulllength and derep_prefix commands.

See also
 
search_exact command
 
Substring dereplication using version 5.
  Global trimming

Different definitions of duplicate sequences are possible, as shown in the figure.

The full-length definition is the easiest to understand and also the easiest to implement in an algorithm: if two or more sequences are identical, all except one are kept, for example:

A = GATTACA
B = GATTACA

With prefixes, a sequence A is discarded if it is a prefix of some other sequence B in the set, for example:

A = GATTAC
B = GATTACA

With substrings, a sequence A is discarded if it is a substring of some other sequence B, for example:

A = -ATTAC-
B = GATTACA

All prefixes are substrings, and full-length matches are both prefixes and substrings. So:

  substrings >= prefixes >= full length matches.

USEARCH supports full-length and prefix dereplication, but currently not substring. In next-generation sequencing applications, prefix dereplication is useful for quality-trimmed reads, which are often truncated at their right-hand end where quality scores tend to fall, but rarely if ever at their left-hand ends. So far, I haven't come across a case where substring dereplication is needed -- if you know of one, please let me know.