Build a pangenome or panproteome
Build pangenome
Build a pangenome out of a set of genomes. The construction consists of two steps: laying out the structure of the De Bruijn graph, and adding localization information to the graph.
build_pangenome
has been parallelized to increase performance. The level of parallelism is controlled by the --threads
option (see below). Sequence nodes are localized in parallel, and updates to
the localization database are cached to disk, then written by a configurable number of
database writer threads (see the --num-db-writer-threads
option below).
Because each database writer thread reads the contents of only a single
bucket into memory at a time, memory usage is reduced. The temporary directory holding the cached updates can be set with the --scratch-directory
option. Frequently accessed node properties are kept in an in-memory cache whose size is controlled by the --cache-size
option. To calculate
the heap space the cache will occupy, multiply the maximum size of the
cache by 128 bytes, e.g. for the default cache size of 10,000,000 PanTools
will need an additional 10,000,000 * 128 B = 1.28 GB of heap space.
- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
<genomesFile>
A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.
- Options
--kmer-size
Size of k-mers; must be in the range [6..255]. If this option is not given, an optimal k-mer size is calculated automatically.
--threads / -t
Number of parallel working threads. Default is the number of cores, or 8, whichever is lower.
--scratch-directory
Temporary directory for storing localization update files. If not set, a temporary directory will be created inside the default temporary-file directory. On most Linux distributions this default temporary-file directory will be
/tmp/
, on macOS typically /var/folders/
. If a scratch directory is set, it will be created if it does not exist. If it does exist, PanTools will verify that the directory is empty and, if not, raise an exception.
--num-buckets
Number of buckets for sorting. Default is 200. During the localization phase, updates are cached to disk and sorted into a number of files called buckets. This reduces the memory needed to store all localization updates: instead of keeping them all in memory, buckets can be read back with a given level of parallelism (see the
--num-db-writer-threads
option), and Neo4j is updated with each bucket's contents instead. The more buckets are used, the lower the memory usage. However, please make sure PanTools can keep a file open for each bucket during localization by setting the file descriptor limit to an appropriate value. For the default of 200 buckets, we advise setting the limit to 1024, like so:
ulimit -n 1024
. For larger numbers of buckets, set the limit to around 1,000 plus the number of buckets.
--transaction-size
Number of localization updates to pack into a single Neo4j transaction. Default is 10,000. To increase throughput to Neo4j, localization updates are batched into transactions. The greater the number of updates per transaction, the higher the throughput (up to a point), but also the higher the memory usage.
In our experiments we have found 10,000 to provide a good balance between memory usage and performance.
--num-db-writer-threads
Number of threads to use for writing to Neo4j. Default is 2. After localization updates are sorted into buckets (see the
--num-buckets
option), the buckets are read in parallel by the specified number of Neo4j database writer threads. With the default of two threads, the contents of two buckets are kept in memory at the same time and written to Neo4j with a given transaction size (see the --transaction-size
option). In our experiments on SSD and network-backed storage we saw little additional increase in performance from using more than two threads.
--cache-size
Maximum number of items in the node property cache. Default is 10,000,000. During localization, several properties of nucleotide (sequence) nodes are accessed frequently. To avoid loading these from Neo4j every time, the specified number of most frequently used items is cached. The cache can be disabled entirely by setting the cache size to zero.
--keep-intermediate-files
Do not delete intermediate localization files after the command finishes. Disabled by default, i.e., files are deleted automatically after the command finishes.
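The sizing rules above can be sketched as shell arithmetic. The values for the bucket count and cache size below are illustrative, not recommendations: the file descriptor limit follows the "1,000 plus the number of buckets" rule for --num-buckets, and the extra heap for --cache-size is roughly 128 bytes per cached item.

```shell
# Illustrative sizing arithmetic; NUM_BUCKETS and CACHE_SIZE are example values.

# File descriptors: about 1,000 plus the number of buckets.
NUM_BUCKETS=2000
FD_LIMIT=$(( 1000 + NUM_BUCKETS ))
echo "suggested file descriptor limit: ulimit -n $FD_LIMIT"

# Extra heap for the node property cache: about 128 bytes per item.
CACHE_SIZE=10000000
HEAP_MB=$(( CACHE_SIZE * 128 / 1000000 ))
echo "extra heap needed: ${HEAP_MB} MB"
```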
- Example genomes file
/always/genome1.fasta
/use_the/genome2.fasta
/full_path/genome3.fasta
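A genomes file like the one above can be assembled with standard shell tools. The directory and file names here are hypothetical; the point is that PanTools expects one full path per line.

```shell
# Hypothetical setup: a directory containing two FASTA files.
mkdir -p /tmp/pan_demo
touch /tmp/pan_demo/genome1.fasta /tmp/pan_demo/genome2.fasta

# Collect absolute paths, one per line, into the genomes file.
find /tmp/pan_demo -name '*.fasta' | sort > /tmp/pan_demo/genomes.txt
cat /tmp/pan_demo/genomes.txt
```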
- Example commands
$ pantools build_pangenome tomato_DB tomato_3.txt
$ pantools build_pangenome --kmer-size=15 tomato_DB tomato_3.txt
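For larger builds, the tuning options described above can be combined in a single invocation; the values below are illustrative, not recommendations:

```
$ pantools build_pangenome --threads=16 --num-buckets=500 --num-db-writer-threads=2 --transaction-size=10000 tomato_DB tomato_3.txt
```

Following the rule of 1,000 plus the number of buckets, raise the file descriptor limit (here, `ulimit -n 1500` for 500 buckets) before running.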
- Relevant literature
PanTools: representation, storage and exploration of pan-genomic data
Add genomes
Add additional genomes to an existing pangenome.
- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
<genomesFile>
A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.
- Example genomes file
/use_the/genome4.fasta
/full_path/genome5.fasta
- Example commands
$ pantools add_genomes pangenome_DB extra_genomes.txt
Build panproteome
Build a panproteome out of a set of protein sequences. Because only protein sequences are included, the usable functionalities are limited to protein-based analyses; please see the differences between a pangenome and a panproteome. No additional proteins can be added to a panproteome later; it needs to be rebuilt completely.
- Parameters
<databaseDirectory>
Path to the database root directory.
<proteomesFile>
A text file containing paths to FASTA files of proteins to be added to the panproteome; each on a separate line.
- Example proteomes file
/always/proteins1.fasta
/use_the/proteins2.fasta
/full_path/proteins3.faa
- Example commands
$ pantools build_panproteome proteome_DB proteins.txt