Part 1. Build your own pangenome using PanTools

Install PanTools from bioconda following the install instructions.

To demonstrate the main functionality of PanTools, we use a small chloroplast dataset, which keeps construction times short.

Genome	Chloroplast genome	Accession	Length	Genes	tRNAs
1	Cucumis sativus (cucumber)	NC_007144.1	155,293 bp	85	37
2	Oryza sativa Indica 93-11 (rice)	NC_008155.1	134,496 bp	100	40
3	Solanum lycopersicum (tomato)	NC_007898.3	155,461 bp	87	45
4	Solanum tuberosum (potato)	NC_008096.2	155,296 bp	84	45
5	Zea mays (maize)	NC_001666.2	140,384 bp	111	38

Download the archive with the chloroplast FASTA and GFF files, for example via wget:

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar xvzf chloroplasts.tar.gz #unpack the archive
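As a quick sanity check on the download, you can compare the sequence lengths with the Length column of the table above. The sketch below is not part of PanTools: fasta_length is a hypothetical helper name, and it assumes that each FASTA file holds a single sequence.

```shell
# fasta_length FILE: print the number of sequence characters in a FASTA file.
# Header lines (starting with '>') are skipped; line wrapping and Windows
# line endings do not affect the count.
fasta_length() {
    awk '!/^>/ { sub(/\r$/, ""); n += length($0) } END { print n + 0 }' "$1"
}
```

For example, calling fasta_length on the cucumber FASTA file should report 155293 if the file matches accession NC_007144.1 in the table.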

BUILD, ANNOTATE and GROUP

We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:

YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta

Make sure that ‘YOUR_PATH’ is the full path to the input files! The text file should contain only full paths to FASTA files, one per line, with no additional spaces or empty lines. Then run PanTools with the build_pangenome function and pass it the text file:

$ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve built your first pangenome!
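The location file can also be generated rather than typed by hand, which avoids relative paths and stray blank lines. The following is a minimal sketch in plain POSIX shell; write_locations is a hypothetical helper name, not a PanTools command.

```shell
# write_locations OUTFILE FILE...
# Write the absolute path of each FILE to OUTFILE, one per line, with no
# blank lines or trailing spaces. Fails if one of the files does not exist.
write_locations() (
    out=$1
    shift
    : > "$out"                                  # start with an empty file
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "write_locations: no such file: $f" >&2
            exit 1
        fi
        dir=$(cd "$(dirname "$f")" && pwd)      # absolute directory of the file
        printf '%s/%s\n' "$dir" "$(basename "$f")" >> "$out"
    done
)
```

Run from the directory that holds the unpacked files, `write_locations genome_locations.txt C_sativus.fasta O_sativa.fasta S_lycopersicum.fasta S_tuberosum.fasta` would produce the four lines shown above with YOUR_PATH filled in.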

Adding additional genomes

PanTools can add genomes to an existing pangenome. To try this, prepare a text file containing the path to the maize chloroplast genome. Call your text file fifth_genome_location.txt and put the following line in it:

YOUR_PATH/Z_mays.fasta

Run PanTools with the add_genomes function on the new text file:

$ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations

To add gene annotations to the pangenome, prepare a text file containing the paths to the GFF files. Call your text file annotation_locations.txt and put the following lines in it:

YOUR_PATH/C_sativus.gff3
YOUR_PATH/O_sativa.gff3
YOUR_PATH/S_lycopersicum.gff3
YOUR_PATH/S_tuberosum.gff3
YOUR_PATH/Z_mays.gff3

Run PanTools using the add_annotations function and include the new text file. Also add the --connect flag to attach the annotations to the nucleotide nodes; this is useful for exploring the pangenome in the Neo4j browser later in the tutorial.

$ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
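The location files used so far follow the same rules: one full path per line, no empty lines and no additional spaces. A small check like the one below can catch mistakes before a PanTools run. It is a sketch under a strict reading of those rules (any space or tab in a line is rejected), and check_locations is a hypothetical helper name.

```shell
# check_locations FILE: succeed only if every line of FILE is an absolute
# path to an existing file, with no empty lines and no spaces or tabs.
check_locations() (
    tab=$(printf '\t')
    status=0
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '')
                echo "empty line" >&2; status=1 ;;
            *' '*|*"$tab"*)
                echo "whitespace in: $line" >&2; status=1 ;;
            /*)
                [ -f "$line" ] || { echo "file not found: $line" >&2; status=1; } ;;
            *)
                echo "not a full path: $line" >&2; status=1 ;;
        esac
    done < "$1"
    exit "$status"
)
```

For example, `check_locations annotation_locations.txt && echo OK` prints OK only when all listed files exist.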

Homology grouping

PanTools can infer homology between the protein sequences of a pangenome and cluster them into homology groups with the group function. The --relaxation parameter must be set and controls the sensitivity of the grouping. If you don’t know the best relaxation setting, you can use the busco_protein and optimal_grouping functions instead, but for now we will use a relaxation of 4.

$ pantools group --relaxation=4 chloroplast_DB

Adding phenotypes

Phenotype values can be integers, doubles, strings or Booleans. Create a text file phenotypes.txt with the following content:

Genome,Solanum
1,false
2,false
3,true
4,true
5,false
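Before loading the file, you may want to check that it is well formed. The sketch below mirrors the layout of the example above: a comma-separated table whose header starts with Genome and whose first column holds genome numbers. check_phenotypes is a hypothetical helper name, and the checks are assumptions drawn from this example rather than the full PanTools input rules.

```shell
# check_phenotypes FILE: succeed only if FILE has a header starting with
# "Genome", every row has as many comma-separated fields as the header,
# and the first column of every row is a positive integer.
check_phenotypes() {
    awk -F',' '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            if ($1 != "Genome") {
                print "header must start with Genome" > "/dev/stderr"; bad = 1
            }
            ncol = NF
            next
        }
        NF != ncol {
            print "line " NR ": expected " ncol " fields" > "/dev/stderr"; bad = 1
        }
        $1 !~ /^[1-9][0-9]*$/ {
            print "line " NR ": genome number is not a positive integer" > "/dev/stderr"; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `check_phenotypes phenotypes.txt && echo OK` prints OK for the file shown above.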

Then use add_phenotypes to add this information to the pangenome:

$ pantools add_phenotypes chloroplast_DB phenotypes.txt
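The construction steps of this part can be collected in one function so that the whole database can be rebuilt in a single call. The commands are the ones used above. The wrapper itself, the name construct_pangenome and the PANTOOLS variable are assumptions made for this sketch.

```shell
# construct_pangenome DB: run the construction steps of this part in order,
# stopping at the first command that fails. The input text files are expected
# in the current directory under the names used in this tutorial.
# Set PANTOOLS=echo to print the commands instead of running them.
construct_pangenome() (
    db=$1
    pantools_cmd=${PANTOOLS:-pantools}
    "$pantools_cmd" -Xmx5g build_pangenome "$db" genome_locations.txt &&
    "$pantools_cmd" add_genomes "$db" fifth_genome_location.txt &&
    "$pantools_cmd" add_annotations --connect "$db" annotation_locations.txt &&
    "$pantools_cmd" group --relaxation=4 "$db" &&
    "$pantools_cmd" add_phenotypes "$db" phenotypes.txt
)
```

A dry run with `PANTOOLS=echo construct_pangenome chloroplast_DB` lists the five commands without touching the database.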

RETRIEVE functions

Now that the construction is complete, let’s quickly validate that it was successful and that the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and include the following for each region, separated by single spaces: genome number, contig number, start position and stop position.

1 200 500
1 300 700
1 1 10000
1 1 10000 -
1 9999 15000
1 100000 110000

Now run the retrieve_regions function on the new text file:

$ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.
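When inspecting the extracted regions, it helps to know how long each one should be. The sketch below computes the expected length of every region as stop - start + 1. It assumes that coordinates are 1-based and inclusive, which this tutorial does not state, and it treats a trailing "-" as a strand marker. Start and stop are taken as the last two numbers on each line, so the result does not depend on the identifying columns before them. region_lengths is a hypothetical helper name.

```shell
# region_lengths FILE: print each line of a regions file followed by a tab
# and the expected sequence length, stop - start + 1.
# Assumption: coordinates are 1-based and inclusive.
region_lengths() {
    awk '
        NF == 0 { next }                 # skip empty lines
        {
            n = NF
            if ($n == "-") n--           # ignore a trailing strand marker
            if (n < 2) next              # no start and stop on this line
            stop = $n
            start = $(n - 1)
            printf "%s\t%d\n", $0, stop - start + 1
        }
    ' "$1"
}
```

If the sequences in chloroplast_DB/retrieval/regions/ have different lengths than `region_lengths regions.txt` reports, either the coordinate convention assumed here does not hold or the retrieval needs a closer look.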

To retrieve entire genomes, prepare a text file genome_numbers.txt with each genome number on a separate line:

1
3
5

Use the retrieve_regions function again, this time with the new text file:

$ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to the same directory as before. Take a look at one of the three genomes you have just retrieved.
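A retrieved genome should contain the same sequence as the FASTA file it was built from, so comparing the two is a simple end-to-end test of the database. The sketch below ignores headers, line wrapping and letter case. fasta_sequence and same_sequence are hypothetical helper names, and both assume one sequence per file.

```shell
# fasta_sequence FILE: print the sequence of FILE as a single upper-case
# line, dropping header lines, line wrapping and carriage returns.
fasta_sequence() {
    awk '!/^>/ { sub(/\r$/, ""); printf "%s", toupper($0) } END { print "" }' "$1"
}

# same_sequence A B: succeed if both FASTA files contain the same sequence.
same_sequence() {
    [ "$(fasta_sequence "$1")" = "$(fasta_sequence "$2")" ]
}
```

For example, pass YOUR_PATH/C_sativus.fasta and the file retrieved for genome 1 to same_sequence; a zero exit status means the sequences agree.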

In part 2 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.