Part 2. Build your own pangenome using PanTools
===============================================

To demonstrate the main functionalities of PanTools we use a small
chloroplast dataset to avoid long construction times.

.. csv-table::
   :file: /tables/chloroplast_datasets.csv
   :header-rows: 1
   :delim: ;

Download the chloroplast FASTA and GFF files `here `_ or via wget.

.. code:: bash

   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
   $ tar -xvzf chloroplasts.tar.gz #unpack the archive

We assume a PanTools alias was set during the
:ref:`installation `. This allows PanTools to be executed with
``pantools`` rather than the full path to the jar file. If you don’t
have an alias, either set one or replace the ``pantools`` command with
the full path to the .jar file in the tutorials.

--------------

BUILD, ANNOTATE and GROUP
-------------------------

We start by building a pangenome from four of the five chloroplast
genomes. For this you need a text file which directs PanTools to the
FASTA files. Call your text file **genome_locations.txt** and include
the following lines:

.. code:: text

   YOUR_PATH/C_sativus.fasta
   YOUR_PATH/O_sativa.fasta
   YOUR_PATH/S_lycopersicum.fasta
   YOUR_PATH/S_tuberosum.fasta

Make sure that ‘*YOUR_PATH*’ is the full path to the input files! Then
run PanTools with the :ref:`build_pangenome ` function and include the
text file.

.. code:: bash

   $ pantools build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve
built your first pangenome! If not, make sure your Java version is up to
date and kmc is executable. The text file should contain only full paths
to FASTA files, with no additional spaces or empty lines.

Adding additional genomes
~~~~~~~~~~~~~~~~~~~~~~~~~

PanTools can add genomes to an already existing pangenome. To test this
function, prepare a text file containing the path to the maize
chloroplast genome.
Call your text file **fifth_genome_location.txt** and include the
following line in the file:

.. code:: text

   YOUR_PATH/Z_mays.fasta

Run PanTools on the new text file using the :ref:`add_genomes `
function.

.. code:: bash

   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations
~~~~~~~~~~~~~~~~~~

To add gene annotations to the pangenome, prepare a text file
containing paths to the GFF files. Call your text file
**annotation_locations.txt** and include the following lines in the
file:

.. code:: text

   1 YOUR_PATH/C_sativus.gff3
   2 YOUR_PATH/O_sativa.gff3
   3 YOUR_PATH/S_lycopersicum.gff3
   4 YOUR_PATH/S_tuberosum.gff3
   5 YOUR_PATH/Z_mays.gff3

Run PanTools using the :ref:`add_annotations ` function and include the
new text file.

.. code:: bash

   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt

PanTools attached the annotations to our nucleotide nodes so now we can
cluster them.

Homology grouping
~~~~~~~~~~~~~~~~~

PanTools can infer homology between the protein sequences of a
pangenome and cluster them into homology groups. Multiple parameters
can be set to influence the sensitivity, but for now we use the
:ref:`group ` functionality with default settings.

.. code:: bash

   $ pantools group chloroplast_DB

--------------

Adding phenotypes (requires PanTools v3)
----------------------------------------

Phenotype values can be Integer, Double, String or Boolean values.
Create a text file **phenotypes.txt**.

.. code:: text

   Genome,Solanum
   1,false
   2,false
   3,true
   4,true
   5,false

Then use :ref:`add_phenotypes ` to add the information to the
pangenome.

.. code:: bash

   $ pantools add_phenotypes chloroplast_DB phenotypes.txt

--------------

RETRIEVE functions
------------------

Now that the construction is complete, let’s quickly validate that it
was successful and that the database can be used. To retrieve some
genomic regions, prepare a text file containing genomic coordinates.
Create the file **regions.txt** and include the following for each
region: genome number, contig number, start position and stop position,
separated by single spaces.

.. code:: text

   1 1 200 500
   2 1 300 700
   3 1 1 10000
   3 1 1 10000 -
   4 1 9999 15000
   5 1 100000 110000

Now run the :ref:`retrieve_regions ` function and include the new text
file.

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the
**chloroplast_DB/retrieval/regions/** directory.

To retrieve entire genomes, prepare a text file **genome_numbers.txt**
and include each genome number on a separate line in the file.

.. code:: text

   1
   3
   5

Use the **retrieve_regions** function again but include the new text
file.

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to the same directory as before. Take a look
at one of the three genomes you have just retrieved.

In :doc:`part 3 ` of the tutorial we explore the pangenome you just
built using the Neo4j browser and the Cypher language.
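
As an optional sanity check on the retrieved regions, the length of each
sequence record can be compared with the coordinates listed in
**regions.txt**. The sketch below is not part of PanTools:
``fasta_lengths`` is a hypothetical helper name, and the loop assumes
that the files in **chloroplast_DB/retrieval/regions/** are plain FASTA,
which this tutorial does not state explicitly.

.. code:: bash

   # Hypothetical helper (not a PanTools command): print the name and
   # sequence length of every record in a FASTA file, tab-separated.
   fasta_lengths() {
       awk '
           { sub(/\r$/, "") }                    # tolerate Windows line endings
           /^>/ {
               if (seen) print name "\t" len     # emit the previous record
               name = substr($0, 2); len = 0; seen = 1
               next
           }
           seen { gsub(/[ \t]/, ""); len += length($0) }
           END { if (seen) print name "\t" len } # emit the last record
       ' "$1"
   }

   # Assumption: every regular file in this directory is FASTA output.
   for f in chloroplast_DB/retrieval/regions/*; do
       if [ -f "$f" ]; then
           echo "== $f"
           fasta_lengths "$f"
       fi
   done

If a reported length differs from what the start and stop positions
suggest, check the corresponding line in **regions.txt** first.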