See also
Default index parameter values
Memory requirements
Most USEARCH commands use a database index to enable fast
searching. There are two types of index: one for finding matching
seeds for the UBLAST algorithm, and
another for fast calculation of common word counts for the USEARCH algorithm. Clustering uses a USEARCH-style
index. Indexing parameters apply to both types of index.
During search and clustering, indexes are
always accessed directly in memory rather than being retrieved from
a disk file, in order to maximize speed. The amount of RAM required
to store the index is approximately the same as the size of a UDB
file created with the same sequences and options. The physical RAM
in the computer should be larger than the index; otherwise, virtual
memory paging will make execution much slower.
Indexes are constructed in one of three ways:
(1) Loaded from a UDB file.
(2) Built from a FASTA file.
(3) Built dynamically during clustering. The index is initially empty, then grows as centroid sequences are added to the database.
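As an illustration of case (3), the following is a minimal Python sketch of an index that starts empty and grows as centroids are added, with common word counts used to find a candidate centroid. It is not USEARCH's implementation: the class `GrowingIndex`, the function `cluster`, and the `min_shared_fraction` threshold are hypothetical names introduced here, and the shared-word threshold stands in for the alignment-based identity check that a real clustering run would perform.

```python
from collections import defaultdict


def words(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class GrowingIndex:
    """Toy word index: maps each k-mer to the centroids that contain it."""

    def __init__(self, k):
        self.k = k
        self.postings = defaultdict(list)  # word -> list of centroid ids
        self.centroids = []                # centroid id -> sequence

    def common_word_counts(self, query):
        """Count the distinct words the query shares with each centroid."""
        counts = defaultdict(int)
        for w in words(query, self.k):
            for cid in self.postings.get(w, ()):
                counts[cid] += 1
        return dict(counts)

    def add_centroid(self, seq):
        """Add a new centroid and index all of its words."""
        cid = len(self.centroids)
        self.centroids.append(seq)
        for w in words(seq, self.k):
            self.postings[w].append(cid)
        return cid


def cluster(seqs, k=3, min_shared_fraction=0.5):
    """Assign each sequence to a centroid, creating centroids as needed.

    Assumption: a sequence joins the centroid with the most shared words
    if that count covers at least min_shared_fraction of its own words.
    """
    index = GrowingIndex(k)  # the index is initially empty
    assignments = []
    for seq in seqs:
        counts = index.common_word_counts(seq)
        n_words = len(words(seq, k))
        best = max(counts, key=counts.get) if counts else None
        if best is not None and counts[best] / n_words >= min_shared_fraction:
            assignments.append(best)
        else:
            assignments.append(index.add_centroid(seq))  # index grows here
    return index, assignments
```

For example, `cluster(["ACGTACGT", "ACGTACGA", "TTTTGGGG"])` creates two centroids in this sketch: the second sequence shares most of its 3-mers with the first and joins it, while the third shares none and becomes a new centroid.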
Indexing options
In the following table, "word" refers generically to the fixed-length
segment of the database sequence that is indexed. It may be a k-mer
or a pattern. The effective word length is the length of the k-mer or
the number of 1s in the pattern.
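The definition of effective word length can be made concrete with a short Python sketch. The function names `effective_word_length` and `pattern_words` are hypothetical and chosen for illustration. The sketch assumes that a 1 in the pattern means "this position contributes to the word" and a 0 means "this position is skipped", which is the usual convention for spaced seeds.

```python
def effective_word_length(pattern):
    """Number of 1s in a pattern of 1s and 0s."""
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 1s and 0s")
    return pattern.count("1")


def pattern_words(seq, pattern):
    """Yield the word at each position of seq under the given pattern.

    Only letters aligned with a 1 are kept, so every word has
    effective_word_length(pattern) letters.
    """
    span = len(pattern)
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + i] for i in keep)
```

With the pattern `10111011`, each word spans 8 sequence positions but has an effective word length of 6. An all-ones pattern such as `11111` yields ordinary 5-mers.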
Option | Value | Description |
-wordlength | N | Word length. If this is given, an all-ones pattern is assumed and the -pattern option may not be given. For long word lengths, the -slots option can be used to reduce memory use. |
-pattern | string | A pattern specified as a string of 1s and 0s. A pattern of all ones is equivalent to a k-mer of that length and can also be specified by the -wordlength option. It is not valid to specify both -wordlength and -pattern. The default for protein ublast is 10111011. For long patterns, the -slots option can be used to reduce memory use. |
-alpha | string | Alphabet. Either nt (nucleotide), aa (20-letter amino acid alphabet), or a compressed amino acid alphabet expressed as a string containing the 20 standard letters with groups separated by commas. Default for protein ublast is the 10-group alphabet A,KR,DENQ,C,G,H,ILVM,FYW,P,ST. Other indexed search and clustering commands default to the full 20-letter alphabet but support compressed alphabets as an option. |
-dbstep | N | Specifies that every Nth database word should be indexed. Default is N=1, meaning that all words are indexed. Similar to the stride parameter of MEGABLAST. Setting N>1 saves memory by reducing the size of the index, roughly by a factor of N for large databases. |
-dbaccelpct | N | Specifies an acceleration parameter in the range 0 to 100, expressed as an integer percentage, similar to the -accel parameter of ublast. Usually it is more effective to use -accel than -dbaccelpct, though this may depend on the database. The main advantage of -dbaccelpct is reduced memory and UDB file size. This parameter can only be used for database file indexes; it is not valid for clustering. |
-dbmask | method | See masking options for supported methods. A word with one or more masked letters is not indexed. Default is fastnucleo (nucleotide) or fastamino (protein). |
-slots | N | Use a hashed index with the given number of slots (table entries at the top level of the index). A prime number is recommended, as this reduces the frequency of hash collisions. Each slot requires a minimum of several bytes, even if the word corresponding to that slot is not found in the database. By default, if the alphabet size is A and the effective word length is w, the index has A^w slots. This is the fastest way to do a word lookup, but can use too much memory for long word lengths. For example, a word length of 16 for proteins would require 20^16, or roughly 10^21, slots. A hash table index can save memory by using fewer slots, enabling longer word lengths to be used. Index operations become somewhat slower, though the difference in overall search speed is often negligible. |
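The slot arithmetic described for the -slots option can be checked with a small Python sketch. All function names below are hypothetical. The letter-to-group mapping follows the -alpha description in the table, and reducing a word code modulo the slot count is only an assumed stand-in for whatever hash function USEARCH actually uses.

```python
# Default compressed alphabet for protein ublast, as given in the table.
PROTEIN_UBLAST_ALPHABET = "A,KR,DENQ,C,G,H,ILVM,FYW,P,ST"


def parse_alphabet(alpha):
    """Map each letter to a group number for nt, aa, or a compressed alphabet."""
    if alpha == "nt":
        groups = list("ACGT")
    elif alpha == "aa":
        groups = list("ACDEFGHIKLMNPQRSTVWY")
    else:
        groups = alpha.split(",")
    return {letter: g for g, group in enumerate(groups) for letter in group}


def alphabet_size(letter_to_group):
    """Alphabet size A: the number of distinct groups."""
    return len(set(letter_to_group.values()))


def direct_slot_count(a, w):
    """Slots in a direct (non-hashed) index: A to the power w."""
    return a ** w


def word_code(word, letter_to_group, a):
    """Encode a word as a base-A integer in the range 0 .. A^w - 1."""
    code = 0
    for letter in word:
        code = code * a + letter_to_group[letter]
    return code


def slot_for(word, letter_to_group, a, slots=None):
    """Slot of a word: its code directly, or reduced modulo a slot count.

    Assumption: modulo is used here only to illustrate hashing into
    fewer slots than A^w; distinct words may then share a slot.
    """
    code = word_code(word, letter_to_group, a)
    return code if slots is None else code % slots
```

In this sketch, `direct_slot_count(20, 16)` evaluates to 655360000000000000000, which is the roughly 10^21 figure quoted for protein words of length 16. With the 10-group compressed alphabet and an effective word length of 6, the direct index needs only 10^6 slots, which suggests why a compressed alphabet and a spaced pattern keep a protein index manageable.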