Part 1. Build your own pangenome using PanTools

Install PanTools from bioconda following the install instructions.

To demonstrate the main functionality of PanTools, we use a small dataset of five chloroplast genomes, which keeps construction times short.

Genome  Chloroplast genome                Accession    Length      Genes  tRNAs
1       Cucumis sativus (cucumber)        NC_007144.1  155,293 bp  85     37
2       Oryza sativa Indica 93-11 (rice)  NC_008155.1  134,496 bp  100    40
3       Solanum lycopersicum (tomato)     NC_007898.3  155,461 bp  87     45
4       Solanum tuberosum (potato)        NC_008096.2  155,296 bp  84     45
5       Zea mays (maize)                  NC_001666.2  140,384 bp  111    38

Download the archive containing the chloroplast FASTA and GFF files with wget and unpack it:

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar xvzf chloroplasts.tar.gz #unpack the archive

BUILD, ANNOTATE and GROUP

We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:

YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta

Make sure that ‘YOUR_PATH’ is the full path to the input files! The text file should contain only full paths to FASTA files, one per line, with no additional spaces or empty lines. Then run PanTools with the build_pangenome function, passing the database directory and the text file:

$ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve built your first pangenome!
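Typing full paths by hand is error-prone, so the location file can also be generated. The following is a minimal sketch, assuming it is run from the directory that holds the unpacked FASTA files (the tutorial does not state where the archive unpacks to, so adjust the hypothetical variable fasta_dir as needed). It writes one absolute path per line, in the same order as the listing above, and warns about any file that is missing.

```shell
# Assumption: the FASTA files are in the current directory; change fasta_dir if not.
fasta_dir="$PWD"

# Start from an empty file so no stale or blank lines remain.
: > genome_locations.txt

for name in C_sativus O_sativa S_lycopersicum S_tuberosum; do
    path="$fasta_dir/$name.fasta"
    # Warn (but keep going) if the file is not where we expect it.
    [ -f "$path" ] || echo "warning: $path not found" >&2
    printf '%s\n' "$path" >> genome_locations.txt
done

# Show the result for a quick visual check.
cat genome_locations.txt
```

Keeping the loop order identical to the listing matters if, as the numbered annotation file later in this tutorial suggests, genome numbers follow the order of the lines.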

Adding additional genomes

PanTools can add genomes to an existing pangenome. To try this out, prepare a text file containing the path to the maize chloroplast genome. Call your text file fifth_genome_location.txt and include the following line in the file:

YOUR_PATH/Z_mays.fasta

Run PanTools with the add_genomes function and the new text file:

$ pantools add_genomes chloroplast_DB fifth_genome_location.txt
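As a small safeguard, the one-line file can be written by a script and compared against the first location file, so that the same genome is not listed twice. This is only a sketch: it assumes Z_mays.fasta sits in the current directory, and the duplicate check is a convenience added here, not something the tutorial says PanTools requires.

```shell
# Assumption: Z_mays.fasta is in the current directory.
printf '%s/Z_mays.fasta\n' "$PWD" > fifth_genome_location.txt

# If the first location file exists, report any path that appears in both files.
# grep -F: fixed strings, -x: whole-line match, -f: read patterns from a file.
if [ -f genome_locations.txt ] && grep -Fxf fifth_genome_location.txt genome_locations.txt; then
    echo "warning: the path above is already listed in genome_locations.txt" >&2
fi
```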

Adding annotations

To add gene annotations to the pangenome, prepare a text file containing the genome numbers and the paths to the corresponding GFF files. Call your text file annotation_locations.txt and include the following lines in the file:

1 YOUR_PATH/C_sativus.gff3
2 YOUR_PATH/O_sativa.gff3
3 YOUR_PATH/S_lycopersicum.gff3
4 YOUR_PATH/S_tuberosum.gff3
5 YOUR_PATH/Z_mays.gff3

Run PanTools using the add_annotations function and include the new text file. Also add the --connect flag to attach the annotations to the nucleotide nodes; this is useful for exploring the pangenome in the Neo4j browser later in the tutorial.

$ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
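Because each line of annotation_locations.txt pairs a genome number with a GFF file, the numbering has to follow the order in which the genomes were added (four in the initial build, maize fifth). A minimal sketch that generates the file in that order is shown below; it assumes the GFF files are in the current directory, and the variable names are illustrative.

```shell
# Assumption: the .gff3 files are in the current directory; change gff_dir if not.
gff_dir="$PWD"

n=1
: > annotation_locations.txt

# Same order as the genomes were added to the pangenome.
for name in C_sativus O_sativa S_lycopersicum S_tuberosum Z_mays; do
    # One line per genome: "<genome number> <full path to GFF file>".
    printf '%d %s/%s.gff3\n' "$n" "$gff_dir" "$name" >> annotation_locations.txt
    n=$((n + 1))
done

cat annotation_locations.txt
```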

Homology grouping

PanTools can infer homology between the protein sequences of a pangenome and cluster them into homology groups with the group function. The --relaxation parameter must be set; it controls the sensitivity of the grouping. If you don’t know the best relaxation setting, you can use the busco_protein and optimal_grouping functions instead, but for now we will use a relaxation of 4.

$ pantools group --relaxation=4 chloroplast_DB

Adding phenotypes

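Phenotype values can be Integer, Double, String or Boolean values. Create a text file phenotypes.txt with the following content. The first column holds the genome number, and the second column holds a Boolean phenotype named Solanum that is true for the two Solanum genomes (tomato and potato):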

Genome,Solanum
1,false
2,false
3,true
4,true
5,false

Then use add_phenotypes to add the information to the pangenome:

$ pantools add_phenotypes chloroplast_DB phenotypes.txt
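A malformed phenotype file is easy to produce by hand, so a quick consistency check can be useful. The sketch below writes the file shown above and then verifies, with awk, that every row has the same number of columns as the header and that the Solanum column contains only true or false. The check itself is an illustration added here and is not part of PanTools.

```shell
# Write the phenotype table exactly as shown in the tutorial.
cat > phenotypes.txt <<'EOF'
Genome,Solanum
1,false
2,false
3,true
4,true
5,false
EOF

# Check column counts against the header and the Boolean values in column 2.
awk -F',' '
    NR == 1 { n = NF; next }
    NF != n { printf "line %d: expected %d columns, found %d\n", NR, n, NF; bad = 1 }
    $2 != "true" && $2 != "false" { printf "line %d: not a Boolean: %s\n", NR, $2; bad = 1 }
    END { exit bad + 0 }
' phenotypes.txt && echo "phenotypes.txt looks consistent"
```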

RETRIEVE functions

Now that the construction is complete, let’s quickly check that it was successful and that the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and include the following for each region: genome number, contig number, start position and stop position, separated by single spaces:

1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000

Now run the retrieve_regions function with the new text file:

$ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.
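If a region file grows beyond a few lines, it can help to sanity-check it before a run. The following sketch recreates regions.txt and checks each line with awk: four or five fields, a known genome number, start not after stop, and a stop position within the genome length listed in the table at the top of this tutorial. It assumes each chloroplast genome consists of a single contig, so that the table length applies to contig 1. The optional fifth field (the "-" in the example) is accepted, but its meaning is not checked.

```shell
# Recreate the region file from the tutorial.
cat > regions.txt <<'EOF'
1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000
EOF

awk '
    BEGIN {
        # Genome lengths in bp, taken from the table at the start of the tutorial.
        len[1] = 155293; len[2] = 134496; len[3] = 155461
        len[4] = 155296; len[5] = 140384
    }
    NF < 4 || NF > 5   { printf "line %d: expected 4 or 5 fields\n", NR; bad = 1; next }
    !($1 in len)       { printf "line %d: unknown genome %s\n", NR, $1; bad = 1; next }
    $3 + 0 > $4 + 0    { printf "line %d: start is after stop\n", NR; bad = 1 }
    $4 + 0 > len[$1]   { printf "line %d: stop exceeds genome length %d\n", NR, len[$1]; bad = 1 }
    END { exit bad + 0 }
' regions.txt && echo "regions.txt looks well-formed"
```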

To retrieve entire genomes, prepare a text file genome_numbers.txt and put each genome number on a separate line:

1
3
5

Use the retrieve_regions function again, this time with the new text file:

$ pantools retrieve_regions chloroplast_DB genome_numbers.txt

The genome files are written to the same directory as before. Take a look at one of the three genomes you have just retrieved.
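One simple way to inspect a retrieved genome is to compare its sequence length with the table at the start of this tutorial. The helper below is a generic sketch, not a PanTools feature: fasta_lengths is a hypothetical function name, and because the tutorial does not give the names of the output files, you need to pass the path of a file from chloroplast_DB/retrieval/regions/ yourself. The small demo file only illustrates the output format.

```shell
# Print "<record name><TAB><sequence length>" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            # New record: report the previous one, then reset the counter.
            if (name != "") print name "\t" len
            name = substr($0, 2); len = 0; next
        }
        {
            # Sequence line: drop whitespace and carriage returns, then count.
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Demo on a tiny hand-made file (two sequence lines, six bases in total).
printf '>demo\nACGT\nAC\n' > demo.fasta
fasta_lengths demo.fasta
```

For the retrieved cucumber genome (genome 1), for example, the reported length should match the 155,293 bp given in the table.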

In part 2 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.