Build a pangenome or panproteome

Build pangenome

Build a pangenome out of a set of genomes. The construction consists of two steps: laying out the structure of the De Bruijn graph, and adding localization information to the graph.

Optimized localization

The localization step of build_pangenome has been parallelized to increase performance. The level of parallelism is controlled by the --threads option (see below). Sequence nodes are localized in parallel, and updates to the localization database are cached to disk.

Localization updates are then sorted into a number of different files, called buckets, whose contents are written to Neo4j by a number of database writer threads in parallel (see the --num-db-writer-threads option below). Because each database writer thread reads the contents of only a single bucket into memory at a time, memory usage is reduced.

To cache localization updates on disk, PanTools needs a scratch directory for temporary storage. PanTools creates this directory automatically, or it can be set to a specific directory using the --scratch-directory option.
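Because the description of the `--scratch-directory` option states that an existing scratch directory must be empty, it can be useful to check a directory before starting a long build. The following is a minimal sketch of such a check; `scratch_dir_ok` is a hypothetical helper and not part of PanTools.

```shell
# Hypothetical helper (not part of PanTools): succeeds if DIR can be used as a
# scratch directory, i.e. it does not exist yet or it exists and is empty.
scratch_dir_ok() {
    dir=$1
    # A path that does not exist is fine: PanTools creates it.
    [ -e "$dir" ] || return 0
    # An existing path must be a directory ...
    [ -d "$dir" ] || return 1
    # ... and must contain no entries, hidden ones included.
    [ -z "$(ls -A "$dir")" ]
}
```

Assuming the option accepts the same `--option=value` form as `--kmer-size`, the check could be combined with a build as `scratch_dir_ok /path/to/scratch && pantools build_pangenome --scratch-directory=/path/to/scratch tomato_DB tomato_3.txt`, where `/path/to/scratch` is a placeholder.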

Lastly, an in-memory cache has been introduced to store frequently accessed properties of nucleotide (sequence) nodes. The cache automatically retains the most frequently used properties and evicts the least frequently used items. This significantly increases performance by reducing Neo4j I/O. The size of the cache can be controlled with the --cache-size option. To calculate the heap space the cache will occupy, multiply the maximum size of the cache by 128 bytes, e.g. for the default cache size of 10,000,000 PanTools will need an additional 10,000,000 * 128 B = 1.28 GB of heap space.
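The heap estimate above is easy to recompute for other cache sizes. The sketch below only applies the 128 bytes per item figure from the text; `cache_heap_bytes` is a hypothetical helper name, and the second cache size is an arbitrary illustration.

```shell
# Estimate the extra heap space (in bytes) used by the node property cache,
# using the figure of 128 bytes per cached item given in the text.
cache_heap_bytes() {
    echo $(( $1 * 128 ))
}

cache_heap_bytes 10000000   # default cache size: prints 1280000000 (1.28 GB)
cache_heap_bytes 25000000   # hypothetical larger cache
```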

Required software

KMC 3.1.0 or higher

Parameters

<databaseDirectory>	Path to the database root directory.
<genomesFile>	A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.

Options

`--kmer-size`	Size of k-mers. Should be in range [6..255]. If this argument is omitted, the optimal k-mer size is calculated automatically.
`--threads`/`-t`	Number of parallel working threads, default is the number of cores or 8, whichever is lower.
`--scratch-directory`	Temporary directory for storing localization update files. If not set, a temporary directory will be created inside the default temporary-file directory. On most Linux distributions this default temporary-file directory is `/tmp/`; on macOS it is typically under `/var/folders/`. If a scratch directory is set, it will be created if it does not exist. If it does exist, PanTools will verify that the directory is empty and raise an exception if it is not.
`--num-buckets`	Number of buckets for sorting, default is 200. During the localization phase, updates are cached to disk and sorted into a number of files called buckets. This reduces memory usage: instead of keeping all localization updates in memory, PanTools reads buckets with a given level of parallelism (see the `--num-db-writer-threads` option) and updates Neo4j with each bucket’s contents. The more buckets are available, the lower the memory usage. However, make sure PanTools can keep a file open for each bucket during localization by setting the file descriptor limit to an appropriate value. For the default of 200 buckets, we advise setting the limit to 1024, like so: `ulimit -n 1024`. For a larger number of buckets, set the limit to around 1,000 plus the number of buckets.
`--transaction-size`	Number of localization updates to pack into a single Neo4j transaction, default is 10,000. To increase throughput to Neo4j, multiple localization updates are packed into a single transaction. The greater the number of updates per transaction, the higher the throughput (up to a point), but also the higher the memory usage. In our experiments, we found 10,000 to provide a good balance between memory usage and performance.
`--num-db-writer-threads`	Number of threads to use for writing to Neo4j, default is 2. After sorting localization updates into buckets (see the `--num-buckets` option), buckets are read in parallel by the specified number of Neo4j database writer threads. With the default of two threads, the contents of two buckets will be kept in memory at the same time, and written to Neo4j with a given transaction size (see the `--transaction-size` option). In our experiments on SSD and network-backed storage we saw little additional increase in performance by using more than two threads.
`--cache-size`	Maximum number of items in the node properties cache, default is 10,000,000. During localization, several properties of nucleotide (sequence) nodes are accessed frequently. To avoid loading these from Neo4j every time, the specified number of most frequently used items is cached. The cache can be disabled entirely by setting the cache size to zero.
`--keep-intermediate-files`	Do not delete intermediate localization files after the command finishes. Disabled by default, i.e., files are deleted automatically after the command finishes.
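The file descriptor advice given for `--num-buckets` can be turned into a small calculation. This is a sketch only: `fd_limit_for_buckets` is a hypothetical helper, and returning 1024 for bucket counts below the default of 200 is an assumption, since the text only gives advice for 200 buckets and more.

```shell
# Hypothetical helper: suggest a file descriptor limit for a bucket count.
# 1024 for up to 200 buckets (the documented advice covers the default of 200;
# smaller counts are an assumption), otherwise 1,000 plus the number of buckets.
fd_limit_for_buckets() {
    buckets=$1
    if [ "$buckets" -le 200 ]; then
        echo 1024
    else
        echo $(( 1000 + buckets ))
    fi
}
```

For example, before a build with 500 buckets the limit could be raised with `ulimit -n "$(fd_limit_for_buckets 500)"`, which sets it to 1500, provided the hard limit of the system allows that value.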

Example genomes file

/always/genome1.fasta
/use_the/genome2.fasta
/full_path/genome3.fasta
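As the example paths suggest, the genomes file should list full paths. A minimal sketch for generating such a file from a directory of FASTA files follows; `write_genomes_file` is a hypothetical helper, and the `.fasta` extension is an assumption that may need adjusting to the naming of your files.

```shell
# Hypothetical helper: write the absolute path of every *.fasta file in
# directory $1 to the genomes file $2, one path per line.
write_genomes_file() {
    src_dir=$1
    out_file=$2
    # Resolve the directory once so that every listed path is absolute.
    abs_dir=$(cd "$src_dir" && pwd) || return 1
    # Start from an empty output file.
    : > "$out_file"
    for f in "$abs_dir"/*.fasta; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        printf '%s\n' "$f" >> "$out_file"
    done
}
```

A call such as `write_genomes_file ./genomes tomato_3.txt` would produce a file that can be passed as the `<genomesFile>` argument; `./genomes` is a placeholder directory.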

Example commands

$ pantools build_pangenome tomato_DB tomato_3.txt
$ pantools build_pangenome --kmer-size=15 tomato_DB tomato_3.txt

Relevant literature

PanTools: representation, storage and exploration of pan-genomic data

Add genomes

Add genomes to an existing pangenome.

Required software

KMC 3.1.0 or higher

Parameters

<databaseDirectory>	Path to the database root directory.
<genomesFile>	A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.

Example genomes file

/use_the/genome4.fasta
/full_path/genome5.fasta

Example commands

$ pantools add_genomes pangenome_DB extra_genomes.txt

Build panproteome

Build a panproteome out of a set of proteins. Because only protein sequences are included, the usable functionalities are limited to protein-based analyses; please see the differences between a pangenome and a panproteome. No additional proteins can be added to an existing panproteome; it needs to be rebuilt completely.

Parameters

<databaseDirectory>	Path to the database root directory.
<proteomesFile>	A text file containing paths to FASTA files of proteins to be added to the panproteome; each on a separate line.

Example proteomes file

/always/proteins1.fasta
/use_the/proteins2.fasta
/full_path/proteins3.faa

Example commands

$ pantools build_panproteome proteome_DB proteins.txt