Technical setup validation

Performance is important when running PanTools on larger datasets. Therefore, it is recommended to test your PanTools installation on one or more of the test data sets below, before applying it to your own data.

PanTools pangenome construction commands perform a large number of database transactions, so the database should be built on a location that can sustain this load for larger pangenomes. We recommend using an SSD or a RAM-disk (/dev/shm/, available on Linux machines). If pangenome construction takes too long, the location of the database might be the issue. For reference, building a small yeast data set of 10 genomes takes a little over two minutes on our RAM-disk, while it takes nearly 30 minutes on a network disk. Larger datasets widen this gap in runtime.

The disk space needed for construction depends on the input data (genome size and number of genomes): roughly 20 GB should suffice for 8 Arabidopsis genomes, and roughly 250 GB for 3 lettuce genomes. The Java heap space needed depends on the same parameters. In general, we recommend setting it higher than you expect to need, because PanTools crashes when Java runs out of heap space (e.g. for lettuce we used -Xmx200g -Xms200g).
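Before building, it can help to check that the intended database location has enough free space. The following is a minimal sketch, not part of PanTools: the helper names free_gib and pick_db_dir are hypothetical, and /tmp is only an assumed fallback location.

```shell
# Hypothetical helper: print the free space (in whole GiB) of the
# filesystem that holds the given directory.
free_gib() {
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte
    # blocks; column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Hypothetical helper: print the first candidate directory that exists,
# is writable and has at least the requested amount of free space.
# $1 = required free space in GiB, remaining arguments = candidates
pick_db_dir() {
    required_gib=$1
    shift
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ] &&
           [ "$(free_gib "$dir")" -ge "$required_gib" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example: prefer the RAM-disk and fall back to /tmp (an assumption),
# requiring about 20 GiB as quoted above for 8 Arabidopsis genomes.
DB_DIR=$(pick_db_dir 20 /dev/shm /tmp) || echo "no suitable location found" >&2
```

Keep in mind that files on a RAM-disk occupy memory, so space used under /dev/shm competes with the Java heap for the same RAM.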

Warning

If you want to do everything in a single directory, make sure to download the data to the disk where you want to write the output. If the input data is on a network disk, adjust the commands so that the pangenome database is written to a local disk (RAM-disk or SSD).
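If you are unsure whether a directory lives on a network disk, the filesystem type can be inspected. The sketch below assumes GNU coreutils stat (as found on most Linux machines). The function names are hypothetical and the list of network filesystem types is an illustrative, non-exhaustive assumption.

```shell
# Print the type of the filesystem that holds the given path
# (GNU coreutils stat; prints e.g. tmpfs for /dev/shm on Linux).
fs_type() {
    stat -f -c %T "$1"
}

# Succeed if a filesystem type name looks like a network filesystem.
# The list is an assumption and certainly not complete.
is_network_type() {
    case "$1" in
        nfs*|cifs|smb*|lustre|gpfs|afs|ceph) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical check: warn when the intended database directory
# appears to be on a network filesystem.
warn_if_network() {
    fstype=$(fs_type "$1") || return 2
    if is_network_type "$fstype"; then
        echo "warning: $1 is on a network filesystem ($fstype)" >&2
        return 1
    fi
    echo "$1 is on $fstype"
}

# Usage: warn_if_network /dev/shm
```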

Ten yeast genomes

Download the data from: https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz

$ wget https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
$ tar zxvf yeast_10strains.tgz

Build the pangenome:

$ pantools -Xms20g -Xmx50g build_pangenome -t16 yeast_db yeast_10strains/metadata/genome_locations.txt

Add annotations:

$ pantools -Xms20g -Xmx50g add_annotations yeast_db yeast_10strains/metadata/annotation_locations.txt

Group proteins:

$ pantools -Xms20g -Xmx50g group yeast_db -t16 --relaxation=3
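The three yeast steps above can also be chained in a small script so that the database location and thread count are set in one place. This is only a sketch: run_yeast_test and the PANTOOLS variable are hypothetical names, and the pantools options are exactly those used in the commands above.

```shell
# PANTOOLS can be overridden, e.g. PANTOOLS=echo for a dry run that
# only prints the arguments instead of running PanTools.
PANTOOLS=${PANTOOLS:-pantools}

# Hypothetical wrapper: run the three yeast steps in order and stop
# at the first step that fails.
# $1 = database directory, $2 = extracted yeast_10strains directory,
# $3 = number of threads
run_yeast_test() {
    db=$1
    data=$2
    threads=$3
    "$PANTOOLS" -Xms20g -Xmx50g build_pangenome "-t$threads" "$db" \
        "$data/metadata/genome_locations.txt" &&
    "$PANTOOLS" -Xms20g -Xmx50g add_annotations "$db" \
        "$data/metadata/annotation_locations.txt" &&
    "$PANTOOLS" -Xms20g -Xmx50g group "$db" "-t$threads" --relaxation=3
}

# Usage, writing the database to the RAM-disk:
#   run_yeast_test /dev/shm/yeast_db yeast_10strains 16
```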

Run time statistics (16 threads on RAM-disk):

command            running time (h:m:s)    CPU time (s)
build_pangenome    0:02:23                 276.50
add_annotations    0:00:35                 51.82
group              0:01:58                 1351.73

Two Arabidopsis genomes

Start this use case with a bash script that downloads the data. Download the script from here: https://www.bioinformatics.nl/pangenomics/data/download_2_ara.sh

To download the Arabidopsis data, run the script, replacing <threads> with the number of threads to use:

$ bash download_2_ara.sh <threads>

Build the pangenome:

$ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=17 ara_db a_thaliana_2genomes/metadata/genome_locations.txt

Add annotations:

$ pantools -Xms40g -Xmx80g add_annotations ara_db a_thaliana_2genomes/metadata/annotation_locations.txt

Group proteins:

$ pantools -Xms40g -Xmx80g group ara_db -t50 --relaxation=3

Run time statistics (50 threads on RAM-disk):

command            running time (h:m:s)    CPU time (s)
build_pangenome    0:27:38                 3102.41
add_annotations    0:00:52                 93.56
group              0:02:25                 507.60

Three tomato genomes

Download the data from: https://www.bioinformatics.nl/pangenomics/data/tomato_3genomes.tar.gz

Extract the tomato data:

$ tar xvzf tomato_3genomes.tar.gz

Build the pangenome:

$ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=13 tomato_db tomato_3genomes/metadata/genome_locations.txt

Add annotations:

$ pantools -Xms40g -Xmx80g add_annotations tomato_db tomato_3genomes/metadata/annotation_locations.txt

Group proteins:

$ pantools -Xms40g -Xmx80g group tomato_db -t50 --relaxation=3

Run time statistics (50 threads on RAM-disk):

command            running time (h:m:s)    CPU time (s)
build_pangenome    20:28:33                257610.72
add_annotations    0:01:35                 181.19
group              0:03:02                 5861.83
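When comparing your own timings with the run time statistics above, it can be useful to relate CPU time to running time: dividing the former by the latter gives the average number of cores kept busy. The sketch below assumes that the reported CPU time is summed over all threads. The helper names hms_to_seconds and avg_cores are hypothetical.

```shell
# Convert an h:m:s running time, as printed in the tables, to seconds.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Average number of busy cores = CPU time / running time,
# rounded to one decimal.
# $1 = running time (h:m:s), $2 = CPU time in seconds
avg_cores() {
    wall=$(hms_to_seconds "$1")
    awk -v cpu="$2" -v wall="$wall" 'BEGIN { printf "%.1f\n", cpu / wall }'
}

avg_cores 0:02:23 276.50       # yeast build_pangenome, 16 threads: prints 1.9
avg_cores 0:01:58 1351.73      # yeast group, 16 threads: prints 11.5
avg_cores 20:28:33 257610.72   # tomato build_pangenome, 50 threads: prints 3.5
```

If your own run shows a much lower ratio for the same command and data set, the process is probably waiting on something other than the CPU. The location of the database is then a likely place to look first.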