Builds a reference genome index in UFI (Ultra-Fast Index) format. Input is a FASTA file containing the reference genome.
Currently, urmap is not alt-aware, so I recommend removing alt chromosomes from the human reference before building the index.
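One possible way to remove alt chromosomes is sketched below in Python. It assumes the alt sequences can be recognized by a name ending in "_alt", as in UCSC-style hg38 names such as chr1_KI270762v1_alt; that suffix, the function names and the file names are assumptions of this sketch, not part of urmap, so check the labels in your own FASTA first.

```python
from typing import Iterable, Iterator


def drop_alt_records(lines: Iterable[str], alt_suffix: str = "_alt") -> Iterator[str]:
    """Yield FASTA lines, skipping every record whose name ends with alt_suffix.

    The "_alt" suffix is an assumption based on UCSC-style hg38 naming.
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = not name.endswith(alt_suffix)
        if keep:
            yield line


def filter_fasta(in_path: str, out_path: str) -> None:
    """Copy in_path to out_path without the alt records (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_alt_records(src))
```

For example, filter_fasta("hg38.fa", "hg38.noalt.fa") would write a filtered copy which can then be given to -make_ufi.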
Example
urmap -make_ufi hg38.fa -veryfast -output hg38.ufi
Description
The UFI index file is roughly 8 to 10x larger than the
FASTA file. The UFI file must fit into available memory, so as a rule of thumb
your RAM size should be at least 10x the genome size. For the human genome,
32 GB of RAM should be enough with default options, though the index may build
more quickly if you have more RAM.
For the human genome, it typically takes half an hour to an hour to build the index, depending on CPU and disk speed.
Currently, the maximum supported genome size is 4Gb. Support for larger genomes is in development; let me know if you need this feature.
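The rule of thumb and the size limit can be checked before starting a build. The Python sketch below makes two assumptions that are not stated explicitly above: "genome size" is read as the number of bases (about one byte per base in the FASTA file), and "4Gb" is read as 4,000,000,000 bases. The function names are hypothetical.

```python
MAX_GENOME_BASES = 4_000_000_000  # "4Gb" limit, read here as 4e9 bases (assumption)
RAM_FACTOR = 10                   # rule of thumb: RAM of at least 10x the genome size


def count_bases(lines):
    """Count sequence characters in FASTA lines, ignoring '>' header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def estimate_ram_bytes(genome_bases):
    """Rough RAM estimate in bytes, assuming about one byte per base."""
    return genome_bases * RAM_FACTOR


def within_size_limit(genome_bases):
    """True if the genome is no larger than the assumed 4e9-base limit."""
    return genome_bases <= MAX_GENOME_BASES


if __name__ == "__main__":
    human = 3_100_000_000  # approximate size of the human genome in bases
    print(f"estimated RAM: {estimate_ram_bytes(human) / 1e9:.0f} GB")
    print("within size limit:", within_size_limit(human))
```

For a genome of about 3.1 billion bases this estimate comes to about 31 GB, in line with the 32 GB figure quoted for human.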
Options
The FASTA filename for the
reference genome must be specified following -make_ufi.
The UFI file name is specified by the -output option (required).
The -veryfast option optimizes the index for use with the -veryfast option of the mapping commands (map and map2). Note that an index built with -veryfast will be less accurate regardless of whether -veryfast is specified while mapping. If you want to map both with and without -veryfast, you should create two separate indexes.
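As an illustration of building two separate indexes, the Python sketch below assembles the two command lines using only the options described here (-make_ufi, -veryfast, -output). The output file names and the helper function are hypothetical; the commands are printed rather than run.

```python
import shlex


def make_ufi_command(fasta, output, veryfast=False):
    """Return the urmap argument list for building one index."""
    cmd = ["urmap", "-make_ufi", fasta]
    if veryfast:
        cmd.append("-veryfast")
    cmd += ["-output", output]
    return cmd


if __name__ == "__main__":
    # One index for default mapping, one for mapping with -veryfast.
    # The .ufi file names below are placeholders.
    for cmd in (
        make_ufi_command("hg38.fa", "hg38.ufi"),
        make_ufi_command("hg38.fa", "hg38.veryfast.ufi", veryfast=True),
    ):
        print(shlex.join(cmd))
```

To actually run a build, the argument list could be passed to subprocess.run(cmd, check=True).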
Obscure / advanced options you probably don't need
The -wordlength option sets the k-mer length. Default 24.
The -maxix option is an integer value setting the maximum abundance of a k-mer hash which is indexed. Default 32, or 3 if -veryfast is used.
The -load_factor option is a floating-point number < 1 specifying the fraction of the hash table slots which will be occupied when the index has been built. Default is 0.6. For technical reasons, this value is lower than typically used in hash tables. Increasing the load factor to, say, 0.7 significantly degrades index performance. Using values smaller than 0.6 may give some improvement in performance (probably not much) at the expense of increased index size.
The -slots option specifies the size of the hash table.
This should be a prime number substantially larger than the genome size
(number of bases). It is usually better to use the -load_factor option to
specify the hash table size. Default is to calculate a suitable prime number
given the genome size and desired load factor.
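To show how a slot count relates to genome size and load factor, here is a minimal Python sketch that picks the smallest prime at or above genome size / load factor. This is only one plausible reading of "a suitable prime number"; the calculation urmap actually performs may differ, and the function names are hypothetical.

```python
import math


def is_prime(n):
    """Trial-division primality test, adequate for numbers near 1e10."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def suggest_slots(genome_bases, load_factor=0.6):
    """Smallest prime >= genome_bases / load_factor (illustrative only)."""
    if not 0 < load_factor < 1:
        raise ValueError("load_factor must be between 0 and 1")
    n = math.ceil(genome_bases / load_factor)
    while not is_prime(n):
        n += 1
    return n


if __name__ == "__main__":
    print(suggest_slots(3_100_000_000))  # slot count for a human-sized genome
```

With the default load factor of 0.6 the table has roughly 1.67 slots per base, which illustrates why lowering the load factor increases index size.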