Build a pangenome or panproteome
Build pangenome
Build a pangenome out of a set of genomes. The construction consists of two steps: laying out the structure of the De Bruijn graph, and adding localization information to the graph.
build_pangenome has been parallelized to increase performance. The level of
parallelism is controlled by the --threads option (see below). Sequence nodes
are localized in parallel, and updates to the localization database are cached
to disk and later written to Neo4j by a number of database writer threads (see
the --num-db-writer-threads option below). Because each database writer thread
reads the contents of only a single bucket into memory at a time, memory usage
is reduced. The cached update files are stored in a temporary directory, which
can be set with the --scratch-directory option. Frequently used node properties
are kept in a cache whose maximum size is set with the --cache-size option. To
calculate the heap space the cache will occupy, multiply the maximum size of
the cache by 128 bytes, e.g. for the default cache size of 10,000,000 PanTools
will need an additional 10,000,000 * 128 B = 1.28 GB of heap space.
- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
<genomesFile>
A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.
- Options
--kmer-size
Size of k-mers. Should be in the range [6..255]. If this argument is not given, the optimal k-mer size is calculated automatically.
--threads/-t
Number of parallel working threads, default is the number of cores or 8, whichever is lower.
--scratch-directory
Temporary directory for storing localization update files. If not set, a temporary directory will be created inside the default temporary-file directory. On most Linux distributions this default temporary-file directory is /tmp/; on macOS it is typically under /var/folders/. If a scratch directory is set, it will be created if it does not exist. If it does exist, PanTools will verify that the directory is empty and, if not, raise an exception.
--num-buckets
Number of buckets for sorting, default is 200. During the localization phase, updates are cached to disk and sorted into a number of files called buckets. This reduces the memory needed for localization updates: instead of keeping them all in memory, buckets are read with a given level of parallelism (see the --num-db-writer-threads option) and Neo4j is updated with each bucket’s contents. The more buckets are available, the lower the memory usage. However, please make sure PanTools can keep a file open for each bucket during localization by setting the file descriptor limit to an appropriate value. For the default of 200 buckets, we advise setting the limit to 1024, like so: ulimit -n 1024. For a larger number of buckets, set the limit to around 1,000 plus the number of buckets.
--transaction-size
Number of localization updates to pack into a single Neo4j transaction, default is 10,000. To increase throughput to Neo4j, localization updates are packed into transactions. The greater the number of updates per transaction, the higher the throughput (up to a point), but also the higher the memory usage. In our experiments we have found 10,000 to provide a good balance between memory usage and performance.
--num-db-writer-threads
Number of threads to use for writing to Neo4j, default is 2. After localization updates have been sorted into buckets (see the --num-buckets option), the buckets are read in parallel by the specified number of Neo4j database writer threads. With the default of two threads, the contents of two buckets are kept in memory at the same time and written to Neo4j with a given transaction size (see the --transaction-size option). In our experiments on SSD and network-backed storage we saw little additional increase in performance from using more than two threads.
--cache-size
Maximum number of items in the node properties cache, default is 10,000,000. During localization, several properties of nucleotide (sequence) nodes are accessed frequently. To avoid loading these from Neo4j every time, the specified number of most frequently used items are cached. The cache can be disabled entirely by setting the cache size to zero.
--keep-intermediate-files
Do not delete intermediate localization files after the command finishes. Disabled by default, i.e., files are deleted automatically after the command finishes.
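The two sizing rules described for --cache-size and --num-buckets can be worked out with a little shell arithmetic. The following is a minimal sketch, not part of PanTools: the function names cache_heap_bytes and suggested_fd_limit are hypothetical, and returning 1024 for any bucket count up to 200 is an assumption based on the advice for the default.

```sh
# Heap needed by the node-property cache: 128 bytes per cached item.
cache_heap_bytes() {
    echo $(( $1 * 128 ))
}

# Suggested open-file limit for a given number of buckets.
# Assumption: 1024 is used for up to 200 buckets (the documented advice
# covers the default of 200); beyond that, roughly 1,000 plus the
# number of buckets.
suggested_fd_limit() {
    if [ "$1" -le 200 ]; then
        echo 1024
    else
        echo $(( 1000 + $1 ))
    fi
}

cache_heap_bytes 10000000    # default cache size
suggested_fd_limit 500       # a larger bucket count
```

The second result can be passed to ulimit -n in the shell that will start PanTools; whether the limit can actually be raised depends on the hard limit configured on the system.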
- Example genomes file
/always/genome1.fasta
/use_the/genome2.fasta
/full_path/genome3.fasta
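Because the file is expected to hold one full path per line, it can be generated instead of typed by hand. This is a minimal sketch, assuming all FASTA files sit in a single directory and share a suffix; the function name make_paths_file and its arguments are hypothetical and not part of PanTools.

```sh
# Write the absolute path of every file in DIR ending in SUFFIX to OUT,
# one path per line.
make_paths_file() {
    local dir="$1" suffix="$2" out="$3" abs f
    abs=$(cd "$dir" && pwd) || return 1   # resolve DIR to an absolute path
    : > "$out"                            # create or truncate the output file
    for f in "$abs"/*"$suffix"; do
        if [ -f "$f" ]; then              # skips the unexpanded pattern when nothing matches
            printf '%s\n' "$f" >> "$out"
        fi
    done
}
```

For example, `make_paths_file genomes .fasta genomes.txt` would list every .fasta file in a (hypothetical) genomes directory. The same function works for a proteomes file by passing a protein suffix such as .faa.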
- Example commands
$ pantools build_pangenome tomato_DB tomato_3.txt
$ pantools build_pangenome --kmer-size=15 tomato_DB tomato_3.txt
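When a scratch directory is passed with --scratch-directory, PanTools raises an exception if that directory already exists and is not empty. A pre-flight check along the following lines can catch this before a long run starts. It is only a sketch: the function name scratch_dir_ok is hypothetical, and treating an existing non-directory path as an error is an assumption.

```sh
# Succeeds when DIR is usable as a scratch directory:
# either it does not exist yet, or it is an empty directory.
scratch_dir_ok() {
    local dir="$1"
    if [ ! -e "$dir" ]; then
        return 0                      # missing: PanTools will create it
    fi
    if [ ! -d "$dir" ]; then
        return 1                      # exists but is not a directory
    fi
    [ -z "$(ls -A "$dir")" ]          # true only when the directory is empty
}
```

A typical use is `scratch_dir_ok "$dir" || echo "scratch directory is not empty" >&2` just before invoking build_pangenome.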
- Relevant literature
PanTools: representation, storage and exploration of pan-genomic data
Add genomes
Add additional genomes to an existing pangenome.
- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
<genomesFile>
A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.
- Example genomes file
/use_the/genome4.fasta
/full_path/genome5.fasta
- Example commands
$ pantools add_genomes pangenome_DB extra_genomes.txt
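Before adding genomes to an existing database, it can be worth confirming that every line of the genomes file is a full path to an existing file. The following is a minimal sketch under that assumption; check_paths_file is a hypothetical helper, not a PanTools command, and it checks only that the files exist, not that they are valid FASTA.

```sh
# Report every line of LIST that is not an absolute path to an existing file.
# Returns non-zero if any problem was found.
check_paths_file() {
    local list="$1" path status=0
    while IFS= read -r path || [ -n "$path" ]; do
        if [ -z "$path" ]; then
            continue                                  # ignore blank lines
        fi
        case "$path" in
            /*) ;;                                    # absolute path, keep checking
            *)  echo "not an absolute path: $path" >&2
                status=1
                continue ;;
        esac
        if [ ! -f "$path" ]; then
            echo "file not found: $path" >&2
            status=1
        fi
    done < "$list"
    return "$status"
}
```

Running `check_paths_file extra_genomes.txt` prints one message per problematic line to standard error and returns a non-zero status if any were found.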
Build panproteome
Build a panproteome out of a set of proteins. Because only protein sequences are included, the available functionality is limited to protein-based analyses; please see differences pangenome and panproteome. Additional proteins cannot be added to an existing panproteome; it must be rebuilt completely.
- Parameters
<databaseDirectory>
Path to the database root directory.
<proteomesFile>
A text file containing paths to FASTA files of proteins to be added to the panproteome; each on a separate line.
- Example proteomes file
/always/proteins1.fasta
/use_the/proteins2.fasta
/full_path/proteins3.faa
- Example commands
$ pantools build_panproteome proteome_DB proteins.txt