Part 1. Build your own pangenome using PanTools
Install PanTools from bioconda following the install instructions.
To demonstrate the main functionalities of PanTools we use a small dataset of five chloroplast genomes, which keeps construction times short.
| Genome | Chloroplast genome | Accession | Length | Genes | tRNAs |
|---|---|---|---|---|---|
| 1 | Cucumis sativus (cucumber) | | 155,293 bp | 85 | 37 |
| 2 | Oryza sativa Indica 93-11 (rice) | | 134,496 bp | 100 | 40 |
| 3 | Solanum lycopersicum (tomato) | | 155,461 bp | 87 | 45 |
| 4 | Solanum tuberosum (potato) | | 155,296 bp | 84 | 45 |
| 5 | Zea mays (maize) | | 140,384 bp | 111 | 38 |
Download the chloroplast FASTA and GFF files from the link below, for example with wget, and unpack the archive.
$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar xvzf chloroplasts.tar.gz #unpack the archive
BUILD, ANNOTATE and GROUP
We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:
YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta
Make sure that ‘YOUR_PATH’ is the full path to the input files! The text file should contain only full paths to FASTA files, one per line, with no additional spaces or empty lines. Then run PanTools with the build_pangenome function, passing the database directory and the text file:
$ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt
Did the program run without any error messages? Congratulations, you’ve built your first pangenome!
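If the build fails, a wrong or relative path in genome_locations.txt is a likely cause. As a minimal sketch, the file can be generated from the FASTA files themselves so that every line is a full path. The helper name write_locations is hypothetical and not part of PanTools.

```shell
# Hypothetical helper, not part of PanTools: write one absolute path per
# line to OUTPUT_FILE, with no blank lines or trailing spaces.
# Usage: write_locations OUTPUT_FILE FASTA...
write_locations() (
    out=$1
    shift
    : > "$out"    # create or truncate the output file
    for fasta in "$@"; do
        if [ ! -f "$fasta" ]; then
            echo "missing file: $fasta" >&2
            exit 1    # leaves only this subshell, not the calling shell
        fi
        # Turn the directory part into an absolute path.
        dir=$(cd "$(dirname "$fasta")" && pwd)
        printf '%s/%s\n' "$dir" "$(basename "$fasta")" >> "$out"
    done
)
```

For example, `write_locations genome_locations.txt YOUR_PATH/C_sativus.fasta YOUR_PATH/O_sativa.fasta YOUR_PATH/S_lycopersicum.fasta YOUR_PATH/S_tuberosum.fasta` should write the four lines shown earlier. The function returns a non-zero status if any of the files is missing.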
Adding additional genomes
PanTools can add genomes to an existing pangenome. To try this, prepare a text file containing the path to the maize chloroplast genome. Call your text file fifth_genome_location.txt and add the following line to it:
YOUR_PATH/Z_mays.fasta
Run PanTools with the add_genomes function on the new text file:
$ pantools add_genomes chloroplast_DB fifth_genome_location.txt
Adding annotations
To add gene annotations to the pangenome, prepare a text file containing paths to the GFF files. Each line holds a genome number followed by the full path to the matching GFF file. Call your text file annotation_locations.txt and add the following lines to it:
1 YOUR_PATH/C_sativus.gff3
2 YOUR_PATH/O_sativa.gff3
3 YOUR_PATH/S_lycopersicum.gff3
4 YOUR_PATH/S_tuberosum.gff3
5 YOUR_PATH/Z_mays.gff3
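Because each line pairs a genome number with a path, a small check before running PanTools can catch typos early. The following is a sketch only. The function name check_annotation_locations is hypothetical, and the rules it enforces (a whole number, then a full path to an existing file) are taken from the example above rather than from a PanTools specification.

```shell
# Hypothetical helper, not part of PanTools: verify that every line of the
# given file looks like "<genome number> <full path to an existing file>".
# Usage: check_annotation_locations annotation_locations.txt
check_annotation_locations() (
    status=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r number path || [ -n "$number" ]; do
        case $number in
            ''|*[!0-9]*)
                echo "bad genome number: '$number'" >&2
                status=1 ;;
        esac
        case $path in
            /*)
                if [ ! -f "$path" ]; then
                    echo "file not found: $path" >&2
                    status=1
                fi ;;
            *)
                echo "not a full path: '$path'" >&2
                status=1 ;;
        esac
    done < "$1"
    exit "$status"    # becomes the return status of the function
)
```

Calling `check_annotation_locations annotation_locations.txt` prints one message per problem to standard error and returns 0 only when no problems were found. Blank lines are reported as errors too.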
Run PanTools using the add_annotations function and include the new text file. Also add the
--connect flag to attach the annotations to the nucleotide nodes;
this is useful for exploring the pangenome in the Neo4j browser later
in the tutorial.
$ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
Homology grouping
PanTools can infer homology between the protein sequences of a pangenome
and cluster them into homology groups with the group function. The
--relaxation parameter must be set and controls the sensitivity of the
grouping. If you don’t know the best relaxation setting, you can use the
busco_protein and optimal_grouping functions instead, but for now we will
use a relaxation of 4.
$ pantools group --relaxation=4 chloroplast_DB
Adding phenotypes
Phenotype values can be integers, doubles, strings or booleans. Create a text file phenotypes.txt with a header line followed by one line per genome:
Genome,Solanum
1,false
2,false
3,true
4,true
5,false
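As a sketch, the file can be written from the command line with a here-document and then checked for rows with a missing or extra value. The consistency check is an added convenience and not something PanTools requires you to run.

```shell
# Write the phenotype table from this tutorial to phenotypes.txt.
cat > phenotypes.txt <<'EOF'
Genome,Solanum
1,false
2,false
3,true
4,true
5,false
EOF

# Report rows whose number of comma-separated fields differs from the header.
# awk exits with status 0 when every row matches and 1 otherwise.
awk -F',' '
    NR == 1 { n = NF; next }
    NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
    END     { exit bad + 0 }
' phenotypes.txt
```

With the table above the awk command prints nothing, since every row has two fields like the header.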
Then use add_phenotypes to add this information to the pangenome:
$ pantools add_phenotypes chloroplast_DB phenotypes.txt
RETRIEVE functions
Now that construction is complete, let’s quickly check that it was successful and that the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and list one region per line: genome number, contig number, start position and stop position, separated by single spaces:
1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000
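A malformed line in regions.txt is easy to overlook, so the file can be checked before it is passed to PanTools. The sketch below makes some assumptions that are inferred from the example rather than taken from a PanTools specification: fields are whole numbers, start is not larger than stop, and an optional fifth field may only be `-` (as on the fourth example line, presumably selecting the reverse strand). The function name check_regions is hypothetical.

```shell
# Hypothetical helper, not part of PanTools: check the layout of a regions
# file. Assumed rules: 4 or 5 single-space-separated fields, the first four
# are whole numbers, start <= stop, and an optional fifth field equal to "-".
# Usage: check_regions regions.txt
check_regions() {
    awk '
        /  / || /\t/ {
            printf "line %d: use single spaces between fields\n", NR
            bad = 1; next
        }
        NF != 4 && NF != 5 {
            printf "line %d: expected 4 or 5 fields, found %d\n", NR, NF
            bad = 1; next
        }
        {
            for (i = 1; i <= 4; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not a whole number\n", NR, i
                    bad = 1; next
                }
        }
        $3 + 0 > $4 + 0 {
            printf "line %d: start is larger than stop\n", NR
            bad = 1
        }
        NF == 5 && $5 != "-" {
            printf "line %d: fifth field should be -\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

`check_regions regions.txt` prints one message per problem and returns 0 only when the whole file passes, so it can be chained as `check_regions regions.txt && pantools retrieve_regions chloroplast_DB regions.txt`.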
Now run the retrieve_regions function on the new text file:
$ pantools retrieve_regions chloroplast_DB regions.txt
Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.
To retrieve entire genomes, prepare a text file genome_numbers.txt with one genome number per line:
1
3
5
Use the retrieve_regions function again, this time with the new text file:
$ pantools retrieve_regions chloroplast_DB genome_numbers.txt
Genome files are written to the same directory as before. Take a look at one of the three genomes you have just retrieved.
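One quick way to inspect a retrieved file is to print the length of every sequence it contains and compare it with the table at the top of this page. The following is a minimal sketch that assumes the output is plain FASTA. The function name fasta_lengths is hypothetical, and the exact file names inside the output directory are not listed here, so substitute the ones you find.

```shell
# Hypothetical helper, not part of PanTools: print "<header><TAB><length>"
# for every sequence in a FASTA file.
# Usage: fasta_lengths FILE
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: report the previous sequence, then reset.
            if (name != "") print name "\t" len
            name = substr($0, 2)
            len = 0
            next
        }
        { len += length($0) }    # add up the residues on sequence lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

For instance, running `fasta_lengths` on the file retrieved for genome 1 should report a length of 155,293 if the whole cucumber chloroplast genome was written out.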
In part 2 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.