Memory requirements for storing a database index

Indexed and non-indexed commands
The most commonly used commands in USEARCH, including ublast, usearch_global, cluster_fast, cluster_smallmem, uchime_ref, cluster_otus and uchime_denovo, use a database index to improve search speeds. With these commands, memory requirements are usually dominated by the space required to store the index.

With non-indexed commands such as sortbylength, search_global and search_local, memory requirements are usually dominated by loading the database FASTA file into RAM. In these cases, the RAM required for large datasets is approximately the size of the database FASTA file.

For most indexed commands, input is read sequentially and the size of the input set does not contribute to the total memory required. However, the cluster_fast, uchime_ref, uchime_denovo and cluster_otus commands load the entire input set into memory, so the input dataset size must be added to the memory required by the index.

Memory used for a database index is shared between threads, so there is no significant additional memory required for multi-threading.

Index size = UDB file size
The easiest way to determine the size of an index is to make a UDB file using makeudb_ublast or makeudb_usearch. The size of the UDB file is very close to the amount of RAM that is needed to store the index.
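As a minimal sketch of this check, the Python snippet below reads the size of an existing UDB file and compares it with a RAM budget. The function names and the file name db.udb are hypothetical, and the comparison assumes, as stated above, that the UDB file size is close to the RAM needed for the index.

```python
import os


def udb_size_mb(udb_path):
    """Return the size of a UDB file in megabytes (10^6 bytes)."""
    return os.path.getsize(udb_path) / 1e6


def index_fits_in_ram(udb_path, available_bytes):
    """Rough check: treat the UDB file size as the RAM needed for the index."""
    return os.path.getsize(udb_path) <= available_bytes
```

For example, index_fits_in_ram("db.udb", 8 * 10**9) would report whether the index is expected to fit in 8 GB. Memory for loading the input set, where a command requires it, is not included.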

Memory use for clustering
The cluster_fast command loads the entire input set into memory, so the memory required is the size of the input set plus the size of the index (see below). The cluster_smallmem command uses a different strategy which stores only the index in memory. With high-redundancy datasets, this can substantially reduce the amount of memory needed.

Index size for clustering
With clustering commands, the database is the set of centroid sequences. The size of the index therefore scales with the number of clusters, not the size of the input dataset. If the input has low redundancy (average cluster size close to 1), then the size of the index can be estimated from the input size. If the input has high redundancy (average cluster size >> 1), then the memory requirements can be estimated from the size of the centroid database (-centroids option). Of course, the centroid database only exists after the command has been run, and at that point there is no need to estimate, because the memory use is shown in the USEARCH progress display or by a system monitor (e.g. the top command under Linux/OSX or taskmgr under Windows). Estimating memory in advance therefore requires a prior estimate of the average cluster size.
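The reasoning above can be turned into a rough advance estimate. The sketch below is not part of USEARCH. It assumes that centroids have about the same average length as the input sequences, so that the centroid FASTA size is roughly the input size divided by the average cluster size, and it assumes an index-to-FASTA ratio of 4 as a mid-range guess. All function and parameter names are hypothetical.

```python
def estimate_centroid_bytes(input_fasta_bytes, avg_cluster_size):
    """Estimate the centroid FASTA size from a prior guess of the average cluster size.

    Assumption: centroids have about the same average length as the input sequences.
    """
    if avg_cluster_size < 1:
        raise ValueError("average cluster size cannot be less than 1")
    return input_fasta_bytes / avg_cluster_size


def estimate_clustering_bytes(input_fasta_bytes, avg_cluster_size,
                              index_to_fasta_ratio=4.0, loads_input=True):
    """Estimate memory for a clustering run.

    index_to_fasta_ratio is an assumed index size relative to the centroid FASTA size.
    loads_input=True models a command that keeps the whole input set in memory
    (as described for cluster_fast); False models an index-only strategy
    (as described for cluster_smallmem).
    """
    centroid_bytes = estimate_centroid_bytes(input_fasta_bytes, avg_cluster_size)
    index_bytes = index_to_fasta_ratio * centroid_bytes
    return index_bytes + (input_fasta_bytes if loads_input else 0)
```

With a 1 GB input and an assumed average cluster size of 10, this gives an index estimate of about 0.4 GB, so roughly 1.4 GB when the input is also held in memory and 0.4 GB when it is not. The result is only as good as the guess for the average cluster size.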

Estimating index memory size
An index on a database is usually significantly larger than a FASTA file containing the sequences; typically 3x to 5x larger. The size can be controlled using indexing options, which apply to both search and clustering commands. The size of the index for a large database can be estimated using the following formula:

Bytes = S + 4 D + 8 W

Here, S is the size of the FASTA file containing the sequences, D is the number of words or seeds in the database that are indexed, and W is either the number of possible words, or the number of slots if hashing is used (-slots option). The number of possible words is W = A^k, where A is the size of the alphabet (4 for nucleotides, 20 for the standard amino acid alphabet, or fewer if a compressed alphabet is used), and k is the word length (or effective word length, if a pattern is used). If all the words or seeds in the database are indexed, then S is approximately equal to D, and the estimate is:

Bytes = 5 D + 8 W
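The two formulas translate directly into code. The following Python sketch uses hypothetical function names and takes S, D and W as plain numbers. It does not inspect a database or read any USEARCH options.

```python
def possible_words(alphabet_size, word_length):
    """W = A^k: the number of distinct words of length k over an alphabet of size A."""
    return alphabet_size ** word_length


def estimate_index_bytes(fasta_bytes, indexed_words, word_slots):
    """General estimate: Bytes = S + 4 D + 8 W."""
    return fasta_bytes + 4 * indexed_words + 8 * word_slots


def estimate_full_index_bytes(indexed_words, word_slots):
    """Estimate when every word is indexed, so S is taken equal to D: Bytes = 5 D + 8 W."""
    return 5 * indexed_words + 8 * word_slots
```

When hashing is used, the number of slots would be passed as word_slots in place of possible_words(A, k).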

Example: Nucleotide database
RFAM is a nucleotide database. As a FASTA file, it is about 40 MB at the time of writing. With the default nucleotide word length k = 8, we have S = D = 40 x 10⁶ and W = 4⁸ = 65,536, so the estimated size is:

Bytes = 5 x 40 x 10⁶ + 8 x 65,536 ≈ 200 MB

The actual size is 155 MB, which is 3.9x the size of the FASTA file.

Example: Protein database
The PDB90X database is a 100 MB FASTA file, so S = D = 100 x 10⁶. Assuming default parameters for a ublast-compatible index, the effective word length is 6 and the alphabet size is 10, so W = A^k = 10⁶ and the total size estimate is:

Bytes = 5 x 100 x 10⁶ + 8 x 10⁶ = 508 x 10⁶, or roughly 500 MB

The actual size is 487 MB, which is 4.9x the size of the FASTA file.
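To see how the estimates compare with the measured sizes, the two examples above can be reproduced in a few lines of Python. The figures are the ones quoted in the text, with MB taken as 10⁶ bytes, and the helper name summarize is hypothetical.

```python
# (name, FASTA bytes, alphabet size A, word length k, actual index bytes)
EXAMPLES = [
    ("RFAM", 40_000_000, 4, 8, 155_000_000),
    ("PDB90X", 100_000_000, 10, 6, 487_000_000),
]


def summarize(fasta_bytes, alphabet_size, word_length, actual_bytes):
    """Return (estimated bytes, actual/FASTA ratio, estimate/actual ratio)."""
    d = fasta_bytes                       # all words indexed, so D is taken equal to S
    w = alphabet_size ** word_length      # W = A^k
    estimate = 5 * d + 8 * w              # Bytes = 5 D + 8 W
    return estimate, actual_bytes / fasta_bytes, estimate / actual_bytes


for name, s, a, k, actual in EXAMPLES:
    estimate, growth, overshoot = summarize(s, a, k, actual)
    print(f"{name}: estimate {estimate / 1e6:.0f} MB, "
          f"actual is {growth:.1f}x FASTA, estimate/actual = {overshoot:.2f}")
```

In both cases the formula overestimates the measured index size, by a wider margin for the nucleotide example than for the protein example.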