Masking is a procedure that identifies low-complexity sequence. See
also masking options.
Low-complexity sequences are simple
repeats such as ATATATATAT or regions that are highly enriched for
just one letter, e.g. AAACAAAAAAAAGAAAAAAC. Protein segments with
only a few amino acids are also considered to be low complexity,
e.g. PPCDPPPPPKDKKKKDDGPP. This could align with a high score to
another region with many Ps and Ks, but that would not necessarily
indicate an evolutionary relationship.
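As a rough illustration of what "low complexity" means, the sketch below computes the Shannon entropy of the letter composition for the three example sequences above. This is only an illustrative measure chosen for this example, not the algorithm used by any of the USEARCH masking methods; the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Return the Shannon entropy (bits per letter) of the letter composition."""
    counts = Counter(seq.upper())
    total = len(seq)
    # Sum of -p * log2(p) over the letters that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The three example sequences from the text above.
for seq in ("ATATATATAT", "AAACAAAAAAAAGAAAAAAC", "PPCDPPPPPKDKKKKDDGPP"):
    print(seq, round(shannon_entropy(seq), 2))
```

Composition alone does not capture repeat structure, so a practical masker would normally look at windows and word patterns rather than at whole-sequence entropy.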
Why mask?
Repetitive and
low-complexity sequences cause problems for search and clustering
algorithms based on matching words or patterns. Low-complexity sequences cause
certain words to have high frequencies, which can cause performance
problems if they are not masked. For example, words that are mostly
or all composed of a single letter such as AAAAAA or TTTTCTTT often
have high frequencies. For UBLAST, most of these words would be false
positives if used as alignment seeds. For USEARCH, they are expensive to count and
degrade the correlation between word count and sequence
identity.
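The effect on word frequencies can be seen with a small counting sketch. It assumes a fixed word length `k` and a made-up test sequence containing a run of A's; `count_words` is a hypothetical helper and does not reflect how USEARCH counts words internally.

```python
from collections import Counter

def count_words(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Made-up sequence: ordinary flanks around a low-complexity run of A's.
seq = "ACGTTGCA" + "A" * 12 + "GCTAGGTC"
counts = count_words(seq, 6)

# The single-letter word dominates; words from the flanks occur once each.
for word, n in counts.most_common(3):
    print(word, n)
```

In an unmasked database, a word such as AAAAAA would match many unrelated sequences, which is why such words make poor seeds.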
Soft and hard masking
Soft masking indicates masked regions by using lower-case letters.
Hard masking (-hardmask option) overwrites masked regions with a wildcard letter, N
for nucleotides or X for proteins.
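The relationship between the two conventions can be sketched as a conversion from a soft-masked sequence to a hard-masked one. The function `hard_mask` below is a hypothetical helper written for this example; it is not part of USEARCH.

```python
def hard_mask(seq: str, wildcard: str = "N") -> str:
    """Replace every lower-case (soft-masked) letter with the wildcard letter."""
    return "".join(wildcard if c.islower() else c for c in seq)

# Nucleotide sequence: masked letters become N.
print(hard_mask("ACGTaaaaaaGCTA"))
# Protein sequence: masked letters become X.
print(hard_mask("MKVppppkkLT", wildcard="X"))
```

Soft masking keeps the original letters, so it can be reversed by upper-casing the sequence; hard masking discards them.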
Masking excludes words and seeds
In USEARCH, masking is used only for one purpose: for excluding
seeds or word matches. In the case of making an index, a word or
seed is not indexed if it contains one or more masked letters.
Similarly, a word or seed in the query sequence is not considered
if it has any masked letters.
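The exclusion rule can be sketched as follows, assuming soft masking (lower-case letters mark masked positions) and a fixed word length `k`. The function `index_words` and the dictionary-based index are assumptions made for illustration; the source does not describe USEARCH's actual index structure.

```python
from collections import defaultdict

def index_words(seq: str, k: int) -> dict:
    """Map each unmasked word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Skip the word if it contains one or more masked (lower-case) letters.
        if any(c.islower() for c in word):
            continue
        index[word].append(i)
    return dict(index)

# Only words lying entirely outside the masked "aa" are indexed.
print(index_words("ACGTaaGCTA", 4))
```

The same filter would apply to words taken from a query sequence before they are looked up.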
Masked regions are included in the alignment score
An alignment will not be initiated in a
masked region (because seeds are excluded), but may extend through
a masked region. In USEARCH, masked regions are always included in
the score. Hard masking can be used to exclude them from the score
(because a wildcard letter has zero substitution score against all
letters).
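The difference between soft and hard masking in scoring can be illustrated with a toy ungapped score. The match and mismatch values and the function `ungapped_score` are assumptions chosen for this sketch, not USEARCH's scoring parameters; the only behaviour taken from the text is that a wildcard letter scores zero against every letter.

```python
def ungapped_score(a: str, b: str, match: int = 1, mismatch: int = -2,
                   wildcards: str = "NX") -> int:
    """Score two equal-length sequences column by column, without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    score = 0
    # Case is ignored, so soft-masked letters are scored like any other letter.
    for x, y in zip(a.upper(), b.upper()):
        if x in wildcards or y in wildcards:
            continue  # a wildcard contributes zero to the score
        score += match if x == y else mismatch
    return score

target = "ACGTAAAAGC"
print(ungapped_score("ACGTaaaaGC", target))  # soft-masked query
print(ungapped_score("ACGTNNNNGC", target))  # hard-masked query
```

With soft masking the four masked columns still add to the score; with hard masking they add nothing.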
Masking methods
USEARCH supports four masking algorithms, as shown in the table.
Method | Type | Description
fastamino | protein | Unpublished method. Default for proteins.
fastnucleo | nucleotide | Unpublished method. Default for nucleotides.
seg | protein | Entropy-based method as used by BLASTP.
dust | nucleotide | Ad-hoc method as used by BLASTN.
The fastamino and fastnucleo
methods were developed because the seg and dust methods used by
BLAST are slow enough to have a significant impact on search times
with the faster algorithms used by USEARCH. These masking methods
emphasize detection of simple repeats and tend to mask less than
dust and seg. In my experience, they are effective for most
applications where USEARCH is commonly used.