Part 2. Build your own pangenome using PanTools
To demonstrate the main functionalities of PanTools, we use a small chloroplast dataset to avoid long construction times.
Genome | Chloroplast genome | Accession | Length | Genes | tRNAs |
---|---|---|---|---|---|
1 | Cucumis sativus (cucumber) | | 155,293 bp | 85 | 37 |
2 | Oryza sativa Indica 93-11 (rice) | | 134,496 bp | 100 | 40 |
3 | Solanum lycopersicum (tomato) | | 155,461 bp | 87 | 45 |
4 | Solanum tuberosum (potato) | | 155,296 bp | 84 | 45 |
5 | Zea mays (maize) | | 140,384 bp | 111 | 38 |
Download the chloroplast FASTA and GFF files from the address below, for example with wget, and unpack the archive.
$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar -xvzf chloroplasts.tar.gz #unpack the archive
We assume a PanTools alias was set during the installation. This allows PanTools to be executed with pantools rather than pantools/target/pantools-3.4.0.jar. If you don’t have an alias, either set one or replace the pantools command with the full path to the .jar file in the tutorials.
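If no alias is available, a small shell function can serve the same purpose inside scripts. The following is only a sketch: the variable name PANTOOLS_JAR and its default location are assumptions, and the path must be adjusted to wherever the jar was built on your system.

```shell
# Assumption: PANTOOLS_JAR points at the jar built during installation;
# the default below is only a placeholder location.
PANTOOLS_JAR="${PANTOOLS_JAR:-$HOME/pantools/target/pantools-3.4.0.jar}"

# Wrapper so that "pantools <function> <options>" runs the jar with Java.
pantools() {
    java -jar "$PANTOOLS_JAR" "$@"
}
```

In an interactive shell an alias defined in your shell start-up file has the same effect; the function form is used here because aliases are normally not expanded in non-interactive scripts.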
BUILD, ANNOTATE and GROUP
We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:
YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta
Make sure that ‘YOUR_PATH’ is the full path to the input files! Then run PanTools with the build_pangenome function and include the text file:
$ pantools build_pangenome -dp chloroplast_DB -gf genome_locations.txt
Did the program run without any error messages? Congratulations, you’ve built your first pangenome! If not, make sure your Java version is up to date and that KMC is executable. The text file should contain only full paths to FASTA files, with no additional spaces or empty lines.
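Most build failures of this kind come from the location file, so it can help to generate and check it with a few lines of shell. This is a sketch under assumptions: DATA_DIR is a hypothetical variable for the directory that holds the unpacked FASTA files (the archive layout is not described here), and check_locations is an illustrative helper, not part of PanTools.

```shell
# Assumption: the FASTA files live in ./chloroplasts; adjust DATA_DIR if not.
DATA_DIR="${DATA_DIR:-$PWD/chloroplasts}"

# Write one full path per line, with no blank lines or trailing spaces.
: > genome_locations.txt
for name in C_sativus O_sativa S_lycopersicum S_tuberosum; do
    printf '%s/%s.fasta\n' "$DATA_DIR" "$name" >> genome_locations.txt
done

# check_locations FILE: report lines that are empty, contain a space,
# are not full paths, or point to files that do not exist.
check_locations() {
    rc=0
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|*' '*) echo "empty line or stray space: '$line'" >&2; rc=1 ;;
            /*) [ -f "$line" ] || { echo "file not found: $line" >&2; rc=1; } ;;
            *) echo "not a full path: $line" >&2; rc=1 ;;
        esac
    done < "$1"
    return "$rc"
}

check_locations genome_locations.txt || echo "fix genome_locations.txt before running build_pangenome"
```

The check is deliberately strict: a path that contains a space is reported as a problem too, in line with the advice above to avoid additional spaces.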
Adding additional genomes
PanTools can add genomes to an existing pangenome. To test this function, prepare a text file containing the path to the maize chloroplast genome. Call your text file fifth_genome_location.txt and include the following line in the file:
YOUR_PATH/Z_mays.fasta
Run PanTools on the new text file and use the add_genomes function:
$ pantools add_genomes -dp chloroplast_DB -gf fifth_genome_location.txt
Adding annotations
To include gene annotations in the pangenome, prepare a text file containing paths to the GFF files. Each line starts with the genome number, followed by a space and the path. Call your text file annotation_locations.txt and include the following lines in the file:
1 YOUR_PATH/C_sativus.gff3
2 YOUR_PATH/O_sativa.gff3
3 YOUR_PATH/S_lycopersicum.gff3
4 YOUR_PATH/S_tuberosum.gff3
5 YOUR_PATH/Z_mays.gff3
Run PanTools using the add_annotations function and include the new text file:
$ pantools add_annotations -dp chloroplast_DB -af annotation_locations.txt -ca
PanTools has attached the annotations to the nucleotide nodes, so we can now cluster the annotated protein sequences.
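The annotation location file used above pairs each GFF file with a genome number, and the numbers follow the order in which the genomes entered the pangenome (1 to 4 from genome_locations.txt, 5 from add_genomes). A minimal sketch for writing it with a counter, again assuming a hypothetical DATA_DIR variable for the directory with the unpacked files:

```shell
# Assumption: the GFF files live in ./chloroplasts; adjust DATA_DIR if not.
DATA_DIR="${DATA_DIR:-$PWD/chloroplasts}"

# The loop order must match the order in which the genomes were added.
n=0
: > annotation_locations.txt
for name in C_sativus O_sativa S_lycopersicum S_tuberosum Z_mays; do
    n=$((n + 1))
    printf '%d %s/%s.gff3\n' "$n" "$DATA_DIR" "$name" >> annotation_locations.txt
done

cat annotation_locations.txt
```

Keeping the species list in a single loop makes it harder to attach an annotation to the wrong genome number by accident.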
Homology grouping
PanTools can infer homology between the protein sequences of a pangenome and cluster them into homology groups. Multiple parameters can be set to influence the sensitivity but for now we use the group functionality with default settings.
$ pantools group -dp chloroplast_DB
Adding phenotypes (requires PanTools v3)
Phenotype values can be Integer, Double, String or Boolean. Create a text file phenotypes.txt with the following content:
Genome,Solanum
1,false
2,false
3,true
4,true
5,false
Then use add_phenotypes to add the information to the pangenome.
$ pantools add_phenotypes -dp chloroplast_DB -ph phenotypes.txt
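The phenotype file shown above is small enough to write by hand, but a quick consistency check becomes useful once more phenotype columns are added. The sketch below writes the same file and verifies that every row has as many comma-separated columns as the header; the check is an illustration and not something PanTools requires.

```shell
# Write the phenotype table from the tutorial.
cat > phenotypes.txt <<'EOF'
Genome,Solanum
1,false
2,false
3,true
4,true
5,false
EOF

# Every row should have as many comma-separated columns as the header.
awk -F',' '
    NR == 1 { cols = NF }
    NF != cols { printf "line %d has %d columns, expected %d\n", NR, NF, cols; bad = 1 }
    END { exit bad }
' phenotypes.txt && echo "phenotypes.txt looks consistent"
```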
RETRIEVE functions
Now that the construction is complete, let's quickly validate that it was successful and that the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and include the following for each region, separated by single spaces: genome number, contig number, start position and stop position. A minus sign at the end of a line requests the reverse complement of that region.
1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000
Now run the retrieve_regions function and include the new text file:
$ pantools retrieve_regions -dp chloroplast_DB --regions-file regions.txt
Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.
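To know what to expect in the output, the region file can be checked and summarised before it is passed to PanTools. This is a sketch with two assumptions that are not stated in the tutorial: coordinates are treated as 1-based and inclusive, so a region spans stop - start + 1 bases, and check_regions is a hypothetical helper name.

```shell
# Write the region file from the tutorial.
cat > regions.txt <<'EOF'
1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000
EOF

# check_regions FILE: validate each line and print the expected region length.
# Assumption: coordinates are 1-based and inclusive.
check_regions() {
    awk '
        NF < 4 || NF > 5     { printf "line %d: expected 4 or 5 fields\n", NR; bad = 1; next }
        NF == 5 && $5 != "-" { printf "line %d: fifth field must be -\n", NR; bad = 1; next }
        $3 + 0 > $4 + 0      { printf "line %d: start is larger than stop\n", NR; bad = 1; next }
        { printf "genome %s, contig %s: %d bp%s\n", $1, $2, $4 - $3 + 1, (NF == 5 ? " (reverse complement)" : "") }
        END { exit bad }
    ' "$1"
}

check_regions regions.txt
```

Under the inclusive assumption the first region (200 to 500) should come out at 301 bases, which can be compared with the sequence written to the retrieval directory.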
To retrieve entire genomes, prepare a text file genome_numbers.txt and include each genome number on a separate line in the file
1
3
5
Use the retrieve_regions function again but include the new text file:
$ pantools retrieve_regions -dp chloroplast_DB -rf genome_numbers.txt
Genome files are written to the same directory as before. Take a look at one of the three genomes you have just retrieved.
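One simple way to check a retrieved genome is to count its bases and compare the result with the length listed in the table at the top of this part. The helper below is a sketch; fasta_length is a hypothetical name, and because the output file names are not given here, the example runs on a tiny demo file instead.

```shell
# fasta_length FILE: number of sequence characters, ignoring header lines
# and line breaks.
fasta_length() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# Tiny demo file with a 4-base and a 3-base sequence line.
printf '>demo\nACGT\nACG\n' > demo.fasta
fasta_length demo.fasta   # prints 7
```

List chloroplast_DB/retrieval/regions/ to find the actual file names, then call fasta_length on one of them; a complete copy of genome 1, for example, should contain 155,293 bases.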
In part 3 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.