Part 1. Build your own pangenome using PanTools
===============================================

Install PanTools from bioconda following the :ref:`install ` instructions.

To demonstrate the main functionalities of PanTools we use a small
chloroplast dataset to avoid long construction times.

.. csv-table::
   :file: /tables/chloroplast_datasets.csv
   :header-rows: 1
   :delim: ;

Download the chloroplast fasta and gff files `here `_ or via wget.

.. code:: bash

   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
   $ tar xvzf chloroplasts.tar.gz # unpack the archive

--------------

BUILD, ANNOTATE and GROUP
-------------------------

We start with building a pangenome using four of the five chloroplast
genomes. For this you need a text file which directs PanTools to the
FASTA files. Call your text file **genome_locations.txt** and include
the following lines:

.. code:: text

   YOUR_PATH/C_sativus.fasta
   YOUR_PATH/O_sativa.fasta
   YOUR_PATH/S_lycopersicum.fasta
   YOUR_PATH/S_tuberosum.fasta

Make sure that ‘*YOUR_PATH*’ is the full path to the input files! The
text file should only contain full paths to FASTA files, with no
additional spaces or empty lines. Then run PanTools with the
:ref:`build_pangenome ` function and include the text file.

.. code:: bash

   $ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve
built your first pangenome!

Adding additional genomes
~~~~~~~~~~~~~~~~~~~~~~~~~

PanTools can add genomes to an already existing pangenome. To test this
function, prepare a text file containing the path to the Maize
chloroplast genome. Call your text file **fifth_genome_location.txt**
and include the following line in the file:

.. code:: text

   YOUR_PATH/Z_mays.fasta

Run PanTools on the new text file using the :ref:`add_genomes ` function.


.. code:: bash

   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations
~~~~~~~~~~~~~~~~~~

To add gene annotations to the pangenome, prepare a text file containing
paths to the GFF files. Call your text file **annotation_locations.txt**
and include the following lines in the file:

.. code:: text

   1 YOUR_PATH/C_sativus.gff3
   2 YOUR_PATH/O_sativa.gff3
   3 YOUR_PATH/S_lycopersicum.gff3
   4 YOUR_PATH/S_tuberosum.gff3
   5 YOUR_PATH/Z_mays.gff3

Run PanTools using the :ref:`add_annotations ` function and include the
new text file. Also add the ``--connect`` flag to attach the annotations
to the nucleotide nodes; this is useful for exploring the pangenome in
the Neo4j browser later in the tutorial.

.. code:: bash

   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt

Homology grouping
~~~~~~~~~~~~~~~~~

PanTools can infer homology between the protein sequences of a pangenome
and cluster them into homology groups with the :ref:`group ` function.
The ``--relaxation`` parameter must be set; it controls the sensitivity
of the grouping. If you don't know the best relaxation setting, you can
use the :ref:`busco_protein ` and :ref:`optimal grouping ` functions
instead, but for now we will use a relaxation of 4.

.. code:: bash

   $ pantools group --relaxation=4 chloroplast_DB

--------------

Adding phenotypes
-----------------

Phenotype values can be Integer, Double, String or Boolean. Create a
text file **phenotypes.txt**.

.. code:: text

   Genome,Solanum
   1,false
   2,false
   3,true
   4,true
   5,false

Then use :ref:`add_phenotypes ` to add the information to the pangenome.

.. code:: bash

   $ pantools add_phenotypes chloroplast_DB phenotypes.txt

--------------

RETRIEVE functions
------------------

Now that the construction is complete, let's quickly check that it was
successful and that the database can be used. To retrieve some genomic
regions, prepare a text file containing genomic coordinates.
Create the file **regions.txt** and include the following for each
region: genome number, contig number, start and stop position, each
separated by a single space.

.. code:: text

   1 1 200 500
   2 1 300 700
   3 1 1 10000
   3 1 1 10000 -
   4 1 9999 15000
   5 1 100000 110000

Now run the :ref:`retrieve_regions ` function and include the new text
file.

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the
**chloroplast_DB/retrieval/regions/** directory.

To retrieve entire genomes, prepare a text file **genome_numbers.txt**
and include each genome number on a separate line in the file.

.. code:: text

   1
   3
   5

Use the **retrieve_regions** function again but include the new text
file.

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to the same directory as before. Take a look at
one of the three genomes you have just retrieved.

In :doc:`part 2 ` of the tutorial we explore the pangenome you just
built using the Neo4j browser and the Cypher language.
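
Because the input text files in this part are sensitive to stray spaces, it can help to check a regions file before passing it to ``retrieve_regions``. The sketch below is not part of PanTools: ``check_regions`` is a hypothetical helper, and the layout it enforces (four integers separated by single spaces, with an optional trailing ``-`` as in the fourth example line of **regions.txt**) is an assumption taken only from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not a PanTools command.
# Assumed layout per line: genome contig start stop [-], single spaces only.
check_regions() {
    # Print each line that does not match the assumed layout, prefixed
    # with its line number; exit non-zero if any such line was found.
    awk '
        !/^[0-9]+ [0-9]+ [0-9]+ [0-9]+( -)?$/ { print NR ": " $0; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Small demonstration on a temporary file; the third line contains a
# double space and is therefore reported.
demo=$(mktemp)
printf '%s\n' '1 1 200 500' '3 1 1 10000 -' '4  1 9999 15000' > "$demo"
check_regions "$demo" || echo "fix the lines listed above"
rm -f "$demo"
```

To use it on the tutorial file, call ``check_regions regions.txt`` and only run ``retrieve_regions`` when nothing is reported.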