Part 2. Build your own pangenome using PanTools

To demonstrate the main functionalities of PanTools, we use a small chloroplast dataset, which keeps construction times short.

Genome	Chloroplast genome	Accession	Length	Genes	tRNAs
1	Cucumis sativus (cucumber)	NC_007144.1	155,293 bp	85	37
2	Oryza sativa Indica 93-11 (rice)	NC_008155.1	134,496 bp	100	40
3	Solanum lycopersicum (tomato)	NC_007898.3	155,461 bp	87	45
4	Solanum tuberosum (potato)	NC_008096.2	155,296 bp	84	45
5	Zea mays (maize)	NC_001666.2	140,384 bp	111	38

Download the chloroplast FASTA and GFF files with wget and unpack the archive.

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar -xvzf chloroplasts.tar.gz #unpack the archive

We assume a PanTools alias was set during the installation. This allows PanTools to be executed with pantools rather than pantools/target/pantools-3.4.0.jar. If you don’t have an alias, either set one or replace the pantools command with the full path to the .jar file in the tutorials.
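
If no alias exists yet, one way to define it for the current shell session is sketched below. The java -jar invocation and the YOUR_PATH placeholder are assumptions, so adjust them to your installation and add any Java memory options your setup needs.

```shell
# Assumption: the jar sits under YOUR_PATH/pantools/target/ (replace YOUR_PATH).
alias pantools='java -jar YOUR_PATH/pantools/target/pantools-3.4.0.jar'

# Print the definition to confirm that the alias is registered.
alias pantools
```

To keep the alias across sessions, the same alias line can be added to your shell start-up file (for example ~/.bashrc).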

BUILD, ANNOTATE and GROUP

We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:

YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta

Make sure that ‘YOUR_PATH’ is the full path to the input files! Then run PanTools with the build_pangenome function and include the text file:

$ pantools build_pangenome -dp chloroplast_DB -gf genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve built your first pangenome! If not, make sure your Java version is up to date and kmc is executable. The text file should contain only full paths to FASTA files, with no additional spaces or empty lines.
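
The requirements on the text file can be checked before running PanTools. The function below is a minimal sketch, not part of PanTools: check_locations is a hypothetical helper name, and it treats any whitespace in a line as an error, which is stricter than necessary if your paths legitimately contain spaces.

```shell
# check_locations FILE: report lines that are empty, contain whitespace,
# are not full paths (do not start with "/"), or point to a missing file.
# Returns 0 when every line passes, 1 otherwise.
check_locations() {
  rc=0
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    case "$line" in
      "")            echo "line $n: empty line"; rc=1 ;;
      *[[:space:]]*) echo "line $n: contains whitespace"; rc=1 ;;
      /*)            [ -f "$line" ] || { echo "line $n: file not found: $line"; rc=1; } ;;
      *)             echo "line $n: not a full path: $line"; rc=1 ;;
    esac
  done < "$1"
  return "$rc"
}
```

Calling check_locations genome_locations.txt prints one message per problematic line and prints nothing when the file looks fine.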

Adding additional genomes

PanTools can add genomes to an existing pangenome. To test this function, prepare a text file containing the path to the maize chloroplast genome. Call your text file fifth_genome_location.txt and include the following line in the file:

YOUR_PATH/Z_mays.fasta

Run PanTools on the new text file and use the add_genomes function:

$ pantools add_genomes -dp chloroplast_DB -gf fifth_genome_location.txt

Adding annotations

To include gene annotations in the pangenome, prepare a text file containing paths to the GFF files. Call your text file annotation_locations.txt and include the following lines in the file:

YOUR_PATH/C_sativus.gff3
YOUR_PATH/O_sativa.gff3
YOUR_PATH/S_lycopersicum.gff3
YOUR_PATH/S_tuberosum.gff3
YOUR_PATH/Z_mays.gff3
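
Because the GFF files in this dataset share their base names with the FASTA files, the annotation list can also be derived from the two genome lists. This is only a sketch: make_annotation_list is a hypothetical helper, and it assumes the GFF files sit in the same directory as the FASTA files and that the annotation file uses the path-only format shown above.

```shell
# make_annotation_list FILE...: print every FASTA path from the given list
# files with its extension swapped to .gff3, preserving the input order.
make_annotation_list() {
  sed 's/\.fasta$/.gff3/' "$@"
}

# Hypothetical usage (the four original genomes first, then the added one):
# make_annotation_list genome_locations.txt fifth_genome_location.txt > annotation_locations.txt
```

Passing the list files in the order in which the genomes were added keeps the annotation paths in the same order as the genome numbers.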

Run PanTools using the add_annotations function and include the new text file:

$ pantools add_annotations -dp chloroplast_DB -af annotation_locations.txt -ca

PanTools has attached the annotations to the nucleotide nodes, so the annotated genes can now be clustered.

Homology grouping

PanTools can infer homology between the protein sequences of a pangenome and cluster them into homology groups. Multiple parameters can be set to influence the sensitivity, but for now we use the group functionality with default settings.

$ pantools group -dp chloroplast_DB

Adding phenotypes (requires PanTools v3)

Phenotype values can be Integer, Double, String or Boolean values. Create a text file phenotypes.txt with the following content:

Genome,Solanum
1,false
2,false
3,true
4,true
5,false

Then use add_phenotypes to add the information to the pangenome:

$ pantools add_phenotypes -dp chloroplast_DB -ph phenotypes.txt
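
A malformed phenotype file is easy to produce by hand, so a quick consistency check can help. The sketch below is not a PanTools feature: check_phenotypes is a hypothetical helper that only assumes the comma-separated layout shown above, with a header line followed by one row per genome number.

```shell
# check_phenotypes FILE: succeed only if every row has as many
# comma-separated fields as the header and starts with an integer genome number.
check_phenotypes() {
  awk -F',' '
    NR == 1           { cols = NF; next }   # remember the header width
    NF != cols        { print "line " NR ": expected " cols " fields, found " NF; bad = 1 }
    $1 !~ /^[0-9]+$/  { print "line " NR ": genome number must be an integer"; bad = 1 }
    END               { exit bad }
  ' "$1"
}
```

Running check_phenotypes phenotypes.txt prints nothing and returns 0 for the example file above.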

RETRIEVE functions

Now that the construction is complete, let's quickly validate that it was successful and that the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and include the following for each region, separated by single spaces: genome number, contig number, start position and stop position.

1 1 200 500
1 1 300 700
1 1 1 10000
1 1 1 10000 -
1 1 9999 15000
1 1 100000 110000
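
The layout of the regions file can be checked with a few lines of awk. This is a sketch under assumptions, not a PanTools feature: check_regions is a hypothetical helper, it expects the four space-separated fields described above plus an optional fifth field (such as the - in the example), and it assumes that a start position should not exceed the stop position.

```shell
# check_regions FILE: succeed only if every line has 4 or 5 fields,
# the first four are non-negative integers, and start <= stop.
check_regions() {
  awk '
    NF < 4 || NF > 5 { print "line " NR ": expected 4 or 5 fields, found " NF; bad = 1; next }
    $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
      print "line " NR ": non-numeric field"; bad = 1; next
    }
    $3 + 0 > $4 + 0  { print "line " NR ": start is greater than stop"; bad = 1 }
    END              { exit bad }
  ' "$1"
}
```

Calling check_regions regions.txt prints one message per problematic line and returns a non-zero status if any line fails.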

Now run the retrieve_regions function and include the new text file:

$ pantools retrieve_regions -dp chloroplast_DB --regions-file regions.txt

Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.

To retrieve entire genomes, prepare a text file genome_numbers.txt and include each genome number on a separate line in the file:

1
3
5

Use the retrieve_regions function again, but include the new text file:

$ pantools retrieve_regions -dp chloroplast_DB -rf genome_numbers.txt

Genome files are written to the same directory as before. Take a look at one of the three genomes you have just retrieved.

In part 3 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.