masking

Masking is a procedure that identifies low-complexity sequence. See also masking options.

Low-complexity sequences are simple repeats such as ATATATATAT or regions that are highly enriched for just one letter, e.g. AAACAAAAAAAAGAAAAAAC. Protein segments with only a few amino acids are also considered to be low complexity, e.g. PPCDPPPPPKDKKKKDDGPP. This could align with a high score to another region with many Ps and Ks, but that would not necessarily indicate an evolutionary relationship.

Why mask?
Repetitive and low-complexity sequences cause problems for search and clustering algorithms based on matching words or patterns. Low-complexity sequences cause certain words to have high frequencies, which can cause performance problems if they are not masked. For example, words that are mostly or all composed of a single letter such as AAAAAA or TTTTCTTT often have have high frequencies. For UBLAST, most of these words would be false positives if used as alignment seeds. For USEARCH, they are expensive to count and degrade the correlation between word count and sequence identity.

Soft and hard masking
Soft masking indicates masked regions by using lower-case letters. Hard masking (-hardmask option) overwrites masked regions with a wildcard letter, N for nucleotides or X for proteins.

Masking excludes words and seeds
In USEARCH, masking is used only for one purpose: for excluding seeds or word matches. In the case of making an index, a word or seed is not indexed if it contains one or more masked letters. Similarly, a word or seed in the query sequence is not considered if it has any masked letters.

Masked regions are included in the alignment score
An alignment will not be initiated in a masked region (because seeds are excluded), but may extend through a masked region. In USEARCH, masked regions are always included in the score. Hard masking can be used to exclude them from the score (because a wildcard letter has zero substitution score against all letters).

Masking methods
USEARCH supports four masking algorithms as shown in the table.

Method	Type	Description
fastamino	protein	Unpublished method. Default for proteins.
fastnucleo	nucleotide	Unpublished method. Default for nucleotides.
seg	protein	Entropy-based method as used by BLASTP.
dust	nucleotide	Ad-hoc method as used by BLASTN.

The fastamino and fastnucleo methods were developed because the seg and dust methods used by BLAST are slow enough to have a significant impact on search times with the faster algorithms used by USEARCH. These masking methods emphasize detection of simple repeats and tend to mask less than dust and mask. In my experience, they are effective for most applications where USEARCH is commonly used.