Part 2. Build your own pangenome using PanTools

To demonstrate the main functionalities of PanTools, we use a small chloroplast dataset to avoid long construction times.

Genome  Chloroplast genome                Accession    Length      Genes  tRNAs
1       Cucumis sativus (cucumber)        NC_007144.1  155,293 bp  85     37
2       Oryza sativa Indica 93-11 (rice)  NC_008155.1  134,496 bp  100    40
3       Solanum lycopersicum (tomato)     NC_007898.3  155,461 bp  87     45
4       Solanum tuberosum (potato)        NC_008096.2  155,296 bp  84     45
5       Zea mays (maize)                  NC_001666.2  140,384 bp  111    38

Download the chloroplast FASTA and GFF files via wget and unpack the archive:

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar -xvzf chloroplasts.tar.gz #unpack the archive

We assume a PanTools alias was set during the installation. This allows PanTools to be executed with pantools rather than the full path to the jar file. If you don’t have an alias, either set one or replace the pantools command with the full path to the .jar file in the tutorials.
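As a rough sketch of such a shortcut, the snippet below defines a shell function instead of an alias, since aliases are not expanded inside non-interactive scripts. The jar location stored in PANTOOLS_JAR is a hypothetical placeholder, not a path taken from the installation guide; replace it with the location of your own PanTools jar.

```shell
# Hypothetical location of the PanTools jar; replace with your own path.
PANTOOLS_JAR="${PANTOOLS_JAR:-$HOME/pantools/pantools.jar}"

# Wrapper so that `pantools <function> ...` runs the jar with all arguments.
pantools() {
    java -jar "$PANTOOLS_JAR" "$@"
}
```

Placing these lines in your shell startup file (for example ~/.bashrc) makes the pantools command available in every new terminal.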


BUILD, ANNOTATE and GROUP

We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:

YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta

Make sure that ‘YOUR_PATH’ is the full path to the input files! Then run PanTools with the build_pangenome function, passing the database directory and the text file:

$ pantools build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve built your first pangenome! If not, make sure your Java version is up to date and kmc is executable. The text file should contain only full paths to FASTA files, with no additional spaces or empty lines.
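The requirements on the location file (full paths only, no extra spaces, no empty lines) can be checked before running PanTools. The function below is a minimal sketch of such a check; check_locations is a hypothetical helper name and is not part of PanTools.

```shell
# check_locations FILE: verify that every line of FILE is a full path to an
# existing file, with no empty lines and no trailing whitespace.
check_locations() {
    file="$1"
    status=0
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            "")           echo "empty line found" >&2; status=1 ;;
            *[[:space:]]) echo "trailing whitespace: '$line'" >&2; status=1 ;;
            /*)           [ -f "$line" ] || { echo "missing file: $line" >&2; status=1; } ;;
            *)            echo "not a full path: $line" >&2; status=1 ;;
        esac
    done < "$file"
    return "$status"
}
```

Calling check_locations genome_locations.txt returns a non-zero status and prints a message for each problematic line, so it can be used as a guard in front of the build_pangenome command.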

Adding additional genomes

PanTools can add genomes to an already existing pangenome. To test this function, prepare a text file containing the path to the maize chloroplast genome. Call your text file fifth_genome_location.txt and include the following line in the file:

YOUR_PATH/Z_mays.fasta

Run PanTools with the add_genomes function on the new text file:

$ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations

To add gene annotations to the pangenome, prepare a text file containing paths to the GFF files. Call your text file annotation_locations.txt and include the following lines in the file, each starting with the genome number followed by a space:

1 YOUR_PATH/C_sativus.gff3
2 YOUR_PATH/O_sativa.gff3
3 YOUR_PATH/S_lycopersicum.gff3
4 YOUR_PATH/S_tuberosum.gff3
5 YOUR_PATH/Z_mays.gff3
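Because each line pairs a genome number with a path, the file can also be generated rather than typed. The loop below is a sketch that assumes the GFF files were unpacked into a directory named chloroplasts under the current working directory; DATA_DIR is a hypothetical variable, so point it at your own location.

```shell
# Directory holding the unpacked GFF files; an assumption, adjust as needed.
DATA_DIR="${DATA_DIR:-$PWD/chloroplasts}"

# The order of names must match the order in which the genomes were added,
# so that the running counter equals the genome number (1 to 5).
n=0
for name in C_sativus O_sativa S_lycopersicum S_tuberosum Z_mays; do
    n=$((n + 1))
    printf '%s %s/%s.gff3\n' "$n" "$DATA_DIR" "$name"
done > annotation_locations.txt
```

The maize genome comes last because it was added to the pangenome after the first four genomes and therefore received genome number 5.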

Run PanTools using the add_annotations function and include the new text file:

$ pantools add_annotations --connect chloroplast_DB annotation_locations.txt

PanTools has attached the annotations to the nucleotide nodes, so the annotated genes can now be clustered.

Homology grouping

PanTools can infer homology between the protein sequences of a pangenome and cluster them into homology groups. Multiple parameters can be set to influence the sensitivity, but for now we use the group function with default settings.

$ pantools group chloroplast_DB

Adding phenotypes (requires PanTools v3)

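Phenotype values can be integers, doubles, strings or Booleans. Create a text file phenotypes.txt with a header line followed by one line per genome. The example below records whether each genome belongs to the genus Solanum: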

Genome,Solanum
1,false
2,false
3,true
4,true
5,false

Then use add_phenotypes to add the information to the pangenome:

$ pantools add_phenotypes chloroplast_DB phenotypes.txt
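A small consistency check can catch typos in the phenotype file before it is loaded. The sketch below assumes the layout shown above (a header line, then one comma-separated line per genome) and checks that every row has as many columns as the header and that each genome number from 1 to N appears exactly once. The function name check_phenotypes and these rules are assumptions made for this tutorial file, not documented PanTools requirements.

```shell
# check_phenotypes FILE N: check a comma-separated phenotype file that has a
# header line followed by one row for each of the N genomes.
check_phenotypes() {
    awk -F',' -v n="$2" '
        NR == 1 { cols = NF; next }
        NF != cols {
            printf "line %d: expected %d columns\n", NR, cols; bad = 1
        }
        $1 !~ /^[0-9]+$/ || $1 < 1 || $1 > n {
            printf "line %d: invalid genome number\n", NR; bad = 1; next
        }
        seen[$1]++ { printf "line %d: duplicate genome %s\n", NR, $1; bad = 1 }
        END {
            # Every genome from 1 to n should have been seen once.
            for (i = 1; i <= n; i++)
                if (!(i in seen)) { printf "genome %d is missing\n", i; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

For the five chloroplast genomes, check_phenotypes phenotypes.txt 5 returns a zero exit status when the file is consistent and lists the offending lines otherwise.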

RETRIEVE functions

Now that the construction is complete, let’s quickly check that it was successful and that the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and include the following for each region, separated by single spaces: genome number, contig number, start position and stop position. A trailing minus sign, as on the fourth line below, requests the reverse complement of the region.

1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000

Now run the retrieve_regions function and include the new text file:

$ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.
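To know what to expect when inspecting the output, the length of each requested region can be computed from regions.txt. The sketch below assumes that start and stop positions are 1-based and inclusive, so a region spans stop - start + 1 bases; that convention is an assumption here, so treat a mismatch with the extracted sequences as a reason to check the PanTools manual rather than as an error. The function name region_lengths is hypothetical.

```shell
# region_lengths FILE: for each region print
#   genome contig start stop strand length
# where strand is "-" if the line ends in a minus sign and "+" otherwise.
# Malformed lines are reported and lead to a non-zero exit status.
region_lengths() {
    awk '
        NF < 4 || NF > 5 || (NF == 5 && $5 != "-") {
            printf "line %d: malformed region\n", NR; bad = 1; next
        }
        $3 > $4 {
            printf "line %d: start exceeds stop\n", NR; bad = 1; next
        }
        {
            strand = (NF == 5) ? "-" : "+"
            # Assumes inclusive coordinates: length = stop - start + 1.
            printf "%s %s %s %s %s %d\n", $1, $2, $3, $4, strand, $4 - $3 + 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Run it as region_lengths regions.txt and compare the last column with the sequence lengths in the retrieved files.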

To retrieve entire genomes, prepare a text file genome_numbers.txt and include each genome number on a separate line in the file:

1
3
5

Use the retrieve_regions function again, but this time include the new text file:

$ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to the same directory as before. Take a look at one of the three genomes you have just retrieved.
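One quick way to inspect a retrieved genome is to count its bases and compare the result with the length listed for that genome at the start of this part. The function below is a minimal sketch using awk; fasta_length is a hypothetical helper name, and the file you pass to it is whichever FASTA file appeared in the retrieval directory.

```shell
# fasta_length FILE: print the total number of sequence characters in FILE,
# skipping header lines (those starting with ">") and ignoring whitespace.
fasta_length() {
    awk '
        !/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
        END   { print total + 0 }
    ' "$1"
}
```

If the retrieval worked, the cucumber genome (genome 1) should report 155293, the tomato genome (genome 3) 155461 and the maize genome (genome 5) 140384.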

In part 3 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.