See also
Reducing memory
32-bit and 64-bit binaries
Indexed and non-indexed commands
The most commonly used commands in USEARCH, including ublast, usearch_global, cluster_fast, cluster_smallmem, uchime_ref and uchime_denovo, use a database index to improve search speed. With these commands, memory requirements are usually dominated by the space required to store the index. An exception is cluster_fast, whose memory use may be dominated by the input size when the data has high redundancy.
With non-indexed commands such as sortbylength, search_global and search_local, memory requirements are usually dominated by loading the FASTA file into RAM. In these cases, the RAM required for large datasets is approximately the size of the FASTA file.
Memory used for a database index is shared between threads, so there is no significant additional memory required for multi-threading.
Index size = UDB file size
The easiest way to determine the size of an index is to make a UDB file using makeudb_ublast or makeudb_usearch. The size of the UDB file is very close to the amount of RAM needed to store the index.
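For example, the index size can be checked from the command line (a sketch; seqs.fasta and seqs.udb are placeholder filenames):

```shell
# Build a UDB index from a FASTA file, then inspect the file size.
# The index will need roughly this much RAM when loaded.
usearch -makeudb_usearch seqs.fasta -output seqs.udb
ls -lh seqs.udb
```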
Memory use for clustering
The cluster_fast command loads the entire input set into memory, so the memory required is the size of the input set plus the size of the index (see below). The cluster_smallmem command uses a different strategy that stores only the index in memory. With high-redundancy datasets, this can substantially reduce the amount of memory needed.
Index size for clustering
With clustering commands, the database is the set of centroid sequences. The size of the index therefore scales with the number of clusters, not the size of the input dataset. If the input has low redundancy (average cluster size close to 1), then the size of the index can be estimated from the input size. If the input has high redundancy (average cluster size >> 1), then the memory requirements can be estimated from the size of the centroid database (-centroids option). Of course, to do that you must first run the command, and then there is no need to estimate because you can see the memory use in the USEARCH progress display or with a system monitor (e.g. the top command under Linux/OSX or Task Manager under Windows). Estimating memory in advance therefore requires a prior estimate of the average cluster size.
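This advance estimate can be sketched in Python (an illustration, not part of USEARCH; it applies the index-size formula from the next section with S = D = centroid database size, and assumes the default nucleotide word space of 4^8 slots):

```python
def estimate_cluster_index_bytes(input_fasta_bytes, avg_cluster_size,
                                 possible_words=4 ** 8):
    """Rough advance estimate of clustering index memory.

    The centroid database shrinks roughly in proportion to the average
    cluster size, and the index scales with the centroid database.
    """
    centroid_db_bytes = input_fasta_bytes / avg_cluster_size
    # Index estimate with S = D = centroid database size (next section).
    return 5 * centroid_db_bytes + 8 * possible_words

# A 1 Gb input with average cluster size 10 needs roughly 500 Mb of index:
print(estimate_cluster_index_bytes(10 ** 9, 10))  # -> 500524288.0
```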
Estimating index memory size
An index on a database is usually significantly larger than a FASTA file containing the sequences, typically 3x to 5x larger. The size can be controlled using indexing options, which apply to both search and clustering commands. The size of the index for a large database can be estimated using the following formula:
Bytes = S + 4 D + 8 W
Here, S is the size of the FASTA file containing the sequences, D is the number of words or seeds in the database that are indexed, and W is either the number of possible words, or the number of slots if hashing is used (-slots option). The number of possible words is W = A^k, where A is the size of the alphabet (4 for nucleotides, 20 for the standard amino acid alphabet, or less if a compressed alphabet is used), and k is the word length (or effective word length, if a pattern is used). If all the words or seeds in the database are indexed, then S = D (close enough), and the estimate simplifies to:
Bytes = 5 D + 8 W
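The two formulas can be written as a small Python sketch (illustrative only; the function names are not part of USEARCH):

```python
def index_bytes(S, D, W):
    """Estimated index size in bytes: S = FASTA file size, D = number of
    indexed words or seeds, W = number of possible words (or hash slots,
    if the -slots option is used)."""
    return S + 4 * D + 8 * W

def index_bytes_all_words(D, W):
    """Simplified estimate when every word is indexed, so S = D."""
    return 5 * D + 8 * W

# W = A**k: e.g. A = 4, k = 8 for the default nucleotide index.
W = 4 ** 8
```

index_bytes_all_words is the form used in the worked examples below.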
Example: Nucleotide database
RFAM is a nucleotide database. As a FASTA file, it's about 40 Mb at the time of writing. With the default nucleotide word length k = 8, we have S = D = 40 x 10^6 and W = 4^8 = 65,536, so our estimated size is:

Bytes = 5 x 40 x 10^6 + 8 x 65,536 = 200 Mb

The actual size is 155 Mb, 3.9x bigger than the FASTA file.
Example: Protein database
The PDB90X database is a 100 Mb FASTA file, so S = D = 100 x 10^6. Assuming default parameters for a ublast-compatible index, the effective word length is 6 and the alphabet size is 10, so W = A^k = 10^6 and the total size estimate is:

Bytes = 5 x 100 x 10^6 + 8 x 10^6 = 500 Mb

The actual size is 487 Mb, 4.9x bigger than the FASTA file.
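The arithmetic in both examples can be checked directly (a sketch, with sizes taken from the text above):

```python
MB = 10 ** 6

# RFAM nucleotide example: S = D = 40 x 10^6, W = 4^8 = 65,536.
rfam_estimate = 5 * 40 * MB + 8 * 4 ** 8
print(rfam_estimate)    # -> 200524288, about 200 Mb

# PDB90X protein example: S = D = 100 x 10^6, W = 10^6.
pdb90x_estimate = 5 * 100 * MB + 8 * 10 ** 6
print(pdb90x_estimate)  # -> 508000000, about 500 Mb
```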