See also
Reducing memory
32-bit and 64-bit binaries
Indexed and non-indexed commands
The most commonly used commands in USEARCH, including ublast, usearch_global, cluster_fast, cluster_smallmem, uchime_ref and uchime_denovo, use a database index to improve search speed. With these commands, memory requirements are usually dominated by the space required to store the index. An exception is cluster_fast, whose memory use may be dominated by the input size when the data has high redundancy.
With non-indexed commands such as sortbylength, search_global and search_local, memory requirements are usually dominated by loading the FASTA file into RAM. In these cases, the RAM required for large datasets is approximately the size of the FASTA file.
Memory used for a database index is shared between threads, so there is no significant additional memory required for multi-threading.
Index size = UDB file size
The easiest way to determine the size of an index is to make a UDB file using makeudb_ublast or makeudb_usearch. The size of the UDB file is very close to the amount of RAM needed to store the index.
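For example, the index size can be checked from the command line (a sketch; seqs.fasta and seqs.udb are placeholder filenames):

```shell
# Build a UDB index from a FASTA file, then inspect the file size.
# The index will need roughly this much RAM when loaded.
usearch -makeudb_usearch seqs.fasta -output seqs.udb
ls -lh seqs.udb
```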
Memory use for clustering
The cluster_fast command loads the entire input set into memory, so the memory required is the size of the input set plus the size of the index (see below). The cluster_smallmem command uses a different strategy that stores only the index in memory. With high-redundancy datasets, this can substantially reduce the amount of memory needed.
Index size for clustering
With clustering commands, the database is the set of centroid sequences. The size of the index therefore scales with the number of clusters, not the size of the input dataset. If the input has low redundancy (average cluster size close to 1), then the size of the index can be estimated from the input size. If the input has high redundancy (average cluster size >> 1), then the memory requirements can be estimated from the size of the centroid database (-centroids option). Of course, to do that you must first run the command, and then there is no need to estimate because you can see the memory use in the USEARCH progress display or with a system monitor (e.g. the top command under Linux/OSX or Task Manager under Windows). Estimating memory in advance therefore requires a prior estimate of the average cluster size.
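This advance estimate can be sketched in Python (an illustration, not part of USEARCH; it applies the index-size formula from the next section with S = D = centroid database size, and assumes the default nucleotide word space of 4^8 slots):

```python
def estimate_cluster_index_bytes(input_fasta_bytes, avg_cluster_size,
                                 possible_words=4 ** 8):
    """Rough advance estimate of clustering index memory.

    The centroid database shrinks roughly in proportion to the average
    cluster size, and the index scales with the centroid database.
    """
    centroid_db_bytes = input_fasta_bytes / avg_cluster_size
    # Index estimate with S = D = centroid database size (next section).
    return 5 * centroid_db_bytes + 8 * possible_words

# A 1 Gb input with average cluster size 10 needs roughly 500 Mb of index:
print(estimate_cluster_index_bytes(10 ** 9, 10))  # -> 500524288.0
```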
Estimating index memory size
An index on a database is usually significantly larger than a FASTA file containing the sequences, typically 3x to 5x larger. The size can be controlled using indexing options, which apply to both search and clustering commands. The size of the index for a large database can be estimated using the following formula:
Bytes = S + 4 D + 8 W
Here, S is the size of the FASTA file containing the sequences, D is the number of words or seeds in the database that are indexed, and W is either the number of possible words, or the number of slots if hashing is used (-slots option). The number of possible words is W = A^k, where A is the size of the alphabet (4 for nucleotides, 20 for the standard amino acid alphabet, or less if a compressed alphabet is used), and k is the word length (or effective word length, if a pattern is used). If all the words or seeds in the database are indexed, then S = D (close enough), and the estimate simplifies to:
Bytes = 5 D + 8 W
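The two formulas can be written as a small Python sketch (illustrative only; the function names are not part of USEARCH):

```python
def index_bytes(S, D, W):
    """Estimated index size in bytes: S = FASTA file size, D = number of
    indexed words or seeds, W = number of possible words (or hash slots,
    if the -slots option is used)."""
    return S + 4 * D + 8 * W

def index_bytes_all_words(D, W):
    """Simplified estimate when every word is indexed, so S = D."""
    return 5 * D + 8 * W

# W = A**k: e.g. A = 4, k = 8 for the default nucleotide index.
W = 4 ** 8
```

index_bytes_all_words is the form used in the worked examples below.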
Example: Nucleotide database
RFAM is a nucleotide database. As a FASTA file, it's about 40 Mb at the time of writing. With the default nucleotide word length k = 8, we have S = D = 40 x 10^6 and W = 4^8 = 65,536, so our estimated size is:

Bytes = 5 x 40 x 10^6 + 8 x 65,536 = 200 Mb

The actual size is 155 Mb, 3.9x bigger than the FASTA file.
Example: Protein database
The PDB90X database is a 100 Mb FASTA file, so S = D = 100 x 10^6. Assuming default parameters for a ublast-compatible index, the effective word length is 6 and the alphabet size is 10, so W = A^k = 10^6 and the total size estimate is:

Bytes = 5 x 100 x 10^6 + 8 x 10^6 = 500 Mb

The actual size is 487 Mb, 4.9x bigger than the FASTA file.
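The arithmetic in both examples can be checked directly (a sketch, with sizes taken from the text above):

```python
MB = 10 ** 6

# RFAM nucleotide example: S = D = 40 x 10^6, W = 4^8 = 65,536.
rfam_estimate = 5 * 40 * MB + 8 * 4 ** 8
print(rfam_estimate)    # -> 200524288, about 200 Mb

# PDB90X protein example: S = D = 100 x 10^6, W = 10^6.
pdb90x_estimate = 5 * 100 * MB + 8 * 10 ** 6
print(pdb90x_estimate)  # -> 508000000, about 500 Mb
```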