The -id option is an accept option that specifies the minimum sequence identity of a hit. It is expressed as a fraction between 0.0 and 1.0, corresponding to 0% to 100%. It is supported by most search and clustering commands. Identity is the fraction of columns in an alignment in which the letters match.
Example
usearch -cluster_fast reads.fasta -centroids c.fasta -id 0.90
Rules for wildcards and matching letters (version 8 and later)
Case is ignored for calculating identity, so an upper case letter can match a lower case letter.
(See Masking for discussion of lower case in
indexing.) Wildcards match, so for example in an amino acid alignment, a column containing
AX is an identity, and in a nucleotide alignment AN and AW are identities
(because W is the IUPAC ambiguity symbol for A or
T). Two wildcard letters match each other if they represent at least one identical residue, so
for example NN matches in a nucleotide alignment, and MR matches in a nucleotide
alignment (because both M and R include A). Identical letters always
match, even if they are not part of a known alphabet. These rules for matching
wildcards give an upper bound on the identity of the true sequences when
wildcards are replaced by fully specified residues. Other rules are possible,
e.g. always considering wildcards to be mismatches (which would give a lower
bound), or ignoring columns containing wildcards. There is no one best rule for
dealing with wildcards; all possible rules have advantages and disadvantages in
different situations. Previous versions of USEARCH had different rules for
handling wildcards which were not always consistently applied in different
situations.
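The nucleotide matching rules above can be sketched in a few lines of Python. This is an illustration only, not USEARCH's actual implementation: the names IUPAC_NT and letters_match are hypothetical, the table holds the standard IUPAC DNA ambiguity codes, and amino acid wildcards are not covered.

```python
# Illustrative sketch of the matching rules described above (nucleotides only).
# IUPAC_NT and letters_match are hypothetical names, not part of USEARCH.

# Standard IUPAC DNA codes mapped to the bases each symbol can represent.
IUPAC_NT = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def letters_match(a: str, b: str) -> bool:
    """Return True if two alignment letters count as an identity."""
    a, b = a.upper(), b.upper()  # case is ignored
    if a == b:
        # Identical letters always match, even outside the known alphabet.
        return True
    bases_a = IUPAC_NT.get(a)
    bases_b = IUPAC_NT.get(b)
    if bases_a is None or bases_b is None:
        # Different letters, at least one unknown: treated as a mismatch here.
        return False
    # Two symbols match if they share at least one possible base.
    return bool(set(bases_a) & set(bases_b))


if __name__ == "__main__":
    for pair in ("AN", "AW", "NN", "MR", "aA", "AC", "MK"):
        print(pair, letters_match(pair[0], pair[1]))
```

Under this rule MR matches because both symbols include A, while MK does not because M (A or C) and K (G or T) share no base.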
Identity in global alignments
In global alignments, columns containing
terminal gaps are discarded before identity is
calculated, while internal gaps always count as differences. The example below
has a terminal gap of length 3 at the end of the alignment, so the identity is
calculated over the remaining seven columns, which contain six matches,
giving an identity of 6/7 = 0.86.
GATTACA---
||| |||
GATAACAATC
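The calculation for the alignment above can be reproduced with a short Python sketch. It is an illustration under simplifying assumptions, not USEARCH's code: the function name global_identity is hypothetical, gaps are written as "-", and letters are compared by plain case-insensitive equality, so the wildcard rules are left out.

```python
# Illustrative sketch: identity of a global alignment with terminal gaps
# discarded and internal gaps counted as differences.
# global_identity is a hypothetical name, not part of USEARCH.

def global_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Return fractional identity for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    n = len(row_a)

    # Length of the terminal gap run at each end, taken over both rows.
    lead = max(n - len(row_a.lstrip(gap)), n - len(row_b.lstrip(gap)))
    trail = max(n - len(row_a.rstrip(gap)), n - len(row_b.rstrip(gap)))

    start, end = lead, n - trail
    if end <= start:
        raise ValueError("no columns remain after discarding terminal gaps")

    matches = 0
    for a, b in zip(row_a[start:end], row_b[start:end]):
        # A column with an internal gap never counts as a match.
        if a != gap and b != gap and a.upper() == b.upper():
            matches += 1
    return matches / (end - start)


if __name__ == "__main__":
    identity = global_identity("GATTACA---", "GATAACAATC")
    print(round(identity, 2))  # 0.86
```

Because the result is a fraction, it can be compared directly with the value given to -id; for example, this alignment would fail a threshold of 0.90.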
Fractional identity vs. percentage identity
To convert between fractional identity and percentage identity, multiply or
divide by 100, as appropriate. Since percentage identity is much more commonly
used in practice, using fractional identity was a minor design mistake -- it
would have been better to use percentage. The historical reason is that the
USEARCH code began with UCLUST, motivated as an attempt to improve on CD-HIT,
and CD-HIT is one of the few programs to use fractional identity (its -c
option).