See also
 Indexing options
 Memory requirements
 
The estimated memory required for a database index is:

  Bytes = S + 4D + 8W

Here, S is the size of the FASTA file containing the sequences, D is the 
number of words or seeds in the database that are indexed, and W is either the 
number of possible words, or the number of slots if a hashed index is used 
(-slots option). If the database is very large, then the total will be 
dominated by the terms that grow with it, S and especially 4D. If the word 
length is long, then it will be dominated by W = A^k, where A is the size of 
the alphabet (4 for nucleotides, 20 for the standard amino acid alphabet, or 
less if a compressed alphabet is used), and k is the word length (or effective 
word length, if a pattern is used).
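As a worked example (with purely illustrative numbers): for a 1 GB FASTA file 
indexed with the default nucleotide 8mers, suppose D = 5 x 10^8 words are 
indexed and the full word table is used, so W = 4^8 = 65,536. Then

  Bytes = 10^9 + 4 x (5 x 10^8) + 8 x 65,536, which is approximately 3 GB,

with the 4D term accounting for about two-thirds of the total.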
To significantly reduce the amount of memory needed, we can use any 
combination of the following strategies.

Reduce the database size by clustering
If your database has a lot of redundancy, then it may be reasonable to 
reduce its size by clustering. For example, 16S rRNA gene reference databases 
often have many sequences that are 100% identical, or very close, especially if 
the sequences are trimmed to sequencing primers 
(which is generally recommended).
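For example, a command along these lines replaces each group of identical or 
near-identical sequences with a single centroid sequence (a sketch using the 
cluster_fast command; the 100% identity threshold and file names are 
illustrative):

  usearch -cluster_fast db.fasta -id 1.0 -centroids db_nr.fasta

The smaller db_nr.fasta file is then used as the search database, reducing 
both S and D.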
Reduce the database size by splitting
You can split the database into pieces and run the same query on each piece 
separately (serially, i.e. one after the other, or in parallel, e.g. on a 
cluster). A drawback of this strategy is that search and clustering time in 
USEARCH often scales sub-linearly with database size, so searching half the 
database takes more than half the time. If you split a database into two 
pieces, the two searches together may therefore take noticeably longer than a 
single search of the full database.
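As a sketch (the file names, identity threshold and output option are 
illustrative; split the FASTA file with whatever tool you prefer), the same 
query is searched against each piece and the result files are combined 
afterwards:

  usearch -usearch_global query.fasta -db db_part1.fasta -id 0.97 -strand plus -blast6out hits_part1.b6
  usearch -usearch_global query.fasta -db db_part2.fasta -id 0.97 -strand plus -blast6out hits_part2.b6
  cat hits_part1.b6 hits_part2.b6 > hits.b6

If you need only the best hit per query, remember to reconcile hits across 
the pieces after merging.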
Reduce the number of indexed words or seeds
For high-identity search and clustering, we can index only a subset of the 
words/seeds in the database with only a small loss in sensitivity. This is the 
strategy used by MEGABLAST, and a similar strategy can be used in USEARCH with 
the -dbstep N option, which corresponds to the stride parameter of MEGABLAST. 
This specifies that only every Nth word should be indexed. So for example, 
with -word_length 16 -dbstep 16, every 16th 16mer will be indexed and the 
database will be covered completely with non-overlapping words.
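For example (a sketch; the identity threshold and file names are 
illustrative), this search indexes every 16th 16mer, reducing D, and hence the 
4D term, by a factor of 16:

  usearch -usearch_global query.fasta -db db.fasta -id 0.97 -strand plus -word_length 16 -dbstep 16 -blast6out hits.b6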
Hashed indexes
With large databases and high-identity searches, it can be advantageous to 
use longer word lengths. For example, we might use nucleotide 16mers instead of 
the default 8mers. Then W = A^k = 4^16 = 4.3 x 10^9, and we need 34 GB just 
for the 8W term in the index. This is where hashing can save a substantial 
amount of memory. Use of a hashed index is specified by the -slots option, 
which sets the number of index slots; this is generally chosen to be << the 
number of different possible words. The number of slots should ideally be (i) 
a prime number and (ii) large enough that hash collisions are rare. For (i), 
the Prime Pages site has a handy page that you can use to find a prime close 
to a desired number. Condition (ii) is trickier, but in practice there's not 
much point in worrying about it because the amount of available RAM is a 
stronger constraint. The rule of thumb is: use as many slots as you can.
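For example, 999999937 is the largest prime below 10^9, so a 16mer search 
could be run with roughly 8 GB for the 8W term like this (a sketch; the 
identity threshold and file names are illustrative):

  usearch -usearch_global query.fasta -db db.fasta -id 0.97 -strand plus -word_length 16 -slots 999999937 -blast6out hits.b6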
Reducing the maximum sequence length
In some cases, long sequences can cause excessive memory use in version 6.0. 
This is being addressed, and will hopefully be improved in v6.1. In the 
meantime, setting a lower value of -maxseqlength may reduce memory 
requirements if the input data has long sequences, especially if many threads 
are being run in parallel sharing the same memory space.
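For example, a clustering run that caps the maximum sequence length might 
look like this (a sketch; the length cutoff, identity threshold and file 
names are illustrative):

  usearch -cluster_fast reads.fasta -id 0.97 -centroids centroids.fasta -maxseqlength 5000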
 