USEARCH manual
indexing options
 
See also
  Default index parameter values
  Memory requirements

Most USEARCH commands use a database index to enable fast searching. There are two types of index: one for finding matching seeds for the UBLAST algorithm, and another for fast calculation of common word counts for the USEARCH algorithm. Clustering uses a USEARCH-style index. Indexing parameters apply to both types of index.

During search and clustering, indexes are always accessed directly in memory rather than being read from a disk file, in order to maximize speed. The amount of RAM required to store the index is approximately the same as the size of a UDB file created with the same sequences and options. The computer's physical RAM should be larger than the index; otherwise, virtual memory paging will make execution much slower.

Indexes are constructed in three different ways:

(1) Loaded from a UDB file.

(2) Built from a FASTA file.

(3) Built dynamically during clustering. The index is initially empty, then grows as centroid sequences are added to the database.
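
As a rough illustration of case (3), the sketch below shows an index that starts empty and grows as centroids are added, and that can report common word counts for a query. It is a toy model in Python, not USEARCH's actual data structure: the class name GrowingWordIndex, its methods, and the use of plain k-mers stored in a dictionary are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class GrowingWordIndex:
    """Toy in-memory index: maps each k-mer to the centroids that contain it.

    Hypothetical illustration only; not the USEARCH implementation.
    """

    def __init__(self, k):
        self.k = k
        self.word_to_centroids = defaultdict(set)
        self.centroids = []  # the index starts empty

    def words(self, seq):
        """Return the set of distinct k-mers in seq."""
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def add_centroid(self, seq):
        """Add a new centroid and index all of its words."""
        cid = len(self.centroids)
        self.centroids.append(seq)
        for w in self.words(seq):
            self.word_to_centroids[w].add(cid)
        return cid

    def common_word_counts(self, query):
        """Count distinct words shared between the query and each centroid."""
        counts = Counter()
        for w in self.words(query):
            for cid in self.word_to_centroids.get(w, ()):
                counts[cid] += 1
        return counts


index = GrowingWordIndex(k=3)
index.add_centroid("ACGTACGT")
index.add_centroid("TTTTGGGG")
counts = index.common_word_counts("ACGTTT")
print(dict(counts))
```

In this toy example the query shares two distinct words (ACG, CGT) with the first centroid and one (TTT) with the second, so the first centroid would be the better candidate for a full comparison.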

Indexing options
In the following table, "word" refers generically to the fixed-length segment of the database sequence that is indexed. It may be a k-mer or a pattern. The effective word length is the length of the k-mer or the number of 1s in the pattern.
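
To make "effective word length" concrete, here is a minimal Python sketch. It assumes the usual spaced-seed reading of a pattern, where a 1 marks a position whose letter is part of the word and a 0 marks a position that is skipped; the function names are hypothetical and are not part of USEARCH.

```python
def effective_word_length(pattern):
    """Return the number of 1s in a pattern of 1s and 0s."""
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 1s and 0s")
    return pattern.count("1")


def extract_words(seq, pattern):
    """Slide the pattern along seq, keeping only the letters under a 1.

    Assumption: 0 positions are ignored when forming the word.
    """
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]


print(effective_word_length("10111011"))        # 6
print(effective_word_length("11111"))           # 5, same as a k-mer with k=5
print(extract_words("MKVLAAGIVG", "10111011"))  # ['MVLAGI', 'KLAAIV', 'VAAGVG']
```

The pattern 10111011 spans 8 sequence positions but contributes only 6 letters to each word, which is why its effective word length is 6.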

Option Value Description
-wordlength N Word length. If this is given, an all-ones pattern is assumed and the -pattern option may not be given. For long word lengths, the -slots option can be used to reduce memory use.
-pattern string A pattern specified as a string of 1s and 0s. A pattern of all ones is equivalent to a k-mer of that length and can also be specified by the -wordlength option. It is not valid to specify both -wordlength and -pattern. The default for protein ublast is 10111011. For long patterns, the -slots option can be used to reduce memory use.
-alpha string Alphabet. Either nt (nucleotide), aa (20-letter amino acid alphabet), or a compressed amino acid alphabet expressed as a string containing the 20 standard letters with groups separated by commas. Default for protein ublast is the 10-group alphabet A,KR,DENQ,C,G,H,ILVM,FYW,P,ST. Other indexed search and clustering commands default to the full 20-letter alphabet but support compressed alphabets as an option.
-dbstep N Specifies that every Nth database word should be indexed. Default is N=1, meaning that all words are indexed. Similar to the stride parameter of MEGABLAST. Setting N>1 saves memory by reducing the size of the index, roughly by a factor of N for large databases.
-dbaccelpct N Specifies an acceleration parameter in the range 0 to 100, similar to the -accel parameter of ublast. Expressed as an integer percentage. Usually it is more effective to use -accel than -dbaccelpct, though this may depend on the database. The main advantage of -dbaccelpct is reduced memory and UDB file size. This parameter can only be used for database file indexes; it is not valid for clustering.
-dbmask method See masking options for supported methods. A word with one or more masked letters is not indexed. Default is fastnucleo or fastamino.
-slots N Use a hashed index with the given number of slots (table entries at the top level of the index). A prime number is recommended, as this reduces the frequency of hash collisions. Each slot requires a minimum of several bytes, even if the word corresponding to that slot is not found in the database. By default, if the alphabet size is A and the effective word length is w, the index has A^w slots. This is the fastest way to do a word lookup, but can use too much memory for long word lengths. For example, a word length of 16 for proteins would require about 10^21 slots (20^16 is roughly 6.6 x 10^20). A hash table index can save memory by using fewer slots, enabling longer word lengths to be used. Index operations become somewhat slower, though the difference in overall search speed is often negligible.
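
The slot arithmetic above can be checked with a short Python sketch. It parses a compressed alphabet string, computes the A^w slot count of a direct index, and shows one possible way to map a word onto a smaller, prime number of slots. The function names and the hash itself (base-A encoding of the word, modulo the slot count) are assumptions chosen for illustration; the hash function USEARCH actually uses is not described here.

```python
def parse_alphabet(spec):
    """Map each letter to its group index for a spec such as 'A,KR,DENQ'."""
    groups = spec.split(",")
    return {letter: g for g, group in enumerate(groups) for letter in group}


def direct_slots(alphabet_size, effective_word_length):
    """Number of slots in a direct (non-hashed) index: A ** w."""
    return alphabet_size ** effective_word_length


def hashed_slot(word, letter_to_group, n_slots):
    """Illustrative hash (an assumption): base-A word code modulo n_slots."""
    a = max(letter_to_group.values()) + 1
    code = 0
    for letter in word:
        code = code * a + letter_to_group[letter]
    return code % n_slots


# Default protein ublast alphabet from the table above: 10 groups.
compressed = parse_alphabet("A,KR,DENQ,C,G,H,ILVM,FYW,P,ST")
n_groups = max(compressed.values()) + 1

print(n_groups)                   # 10
print(direct_slots(n_groups, 6))  # 1000000 (pattern 10111011 has six 1s)
print(direct_slots(20, 16))       # 655360000000000000000, about 10^21
# 1000003 is prime, in line with the recommendation to use a prime slot count.
print(hashed_slot("MKVLAG", compressed, 1_000_003))  # 616604
```

With the 10-group alphabet and an effective word length of 6, a direct index needs only a million slots, whereas 16-letter words over the full 20-letter alphabet are far beyond what memory allows, which is the case the hashed index is meant for.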