Part 2. Build your own pangenome using PanTools
===============================================

To demonstrate the main functionalities of PanTools, we use a small
chloroplast dataset, which keeps construction times short.

.. csv-table::
   :file: /tables/chloroplast_datasets.csv
   :header-rows: 1
   :delim: ;

Download the chloroplast FASTA and GFF files
`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_
or fetch and unpack them with wget:

.. code:: bash

   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
   $ tar -xvzf chloroplasts.tar.gz #unpack the archive

We assume a PanTools alias was set during the
:ref:`installation <getting_started/install:set pantools alias>`. This allows
PanTools to be executed with ``pantools`` rather than the full path to the jar
file. If you don’t have an alias, either set one or replace the ``pantools``
command in the tutorials with the full path to the .jar file.

--------------

BUILD, ANNOTATE and GROUP
-------------------------

We start by building a pangenome from four of the five chloroplast
genomes. For this you need a text file that directs PanTools to the
FASTA files. Call your text file **genome_locations.txt** and include
the following lines:

.. code:: text

   YOUR_PATH/C_sativus.fasta
   YOUR_PATH/O_sativa.fasta
   YOUR_PATH/S_lycopersicum.fasta
   YOUR_PATH/S_tuberosum.fasta

Make sure that ‘*YOUR_PATH*’ is the full path to the input files! Then
run PanTools with the :ref:`build_pangenome <construction/build:build
pangenome>` function and pass it the text file:

.. code:: bash

   $ pantools build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve
built your first pangenome! If it did not, make sure your Java version is
up to date and kmc is executable. The text file should contain only full
paths to FASTA files, with no additional spaces or empty lines.
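
If typing the paths by hand proves error-prone, the text file can also be
generated from the shell. The following is a minimal sketch, not part of
PanTools: ``DATA_DIR`` is a hypothetical variable that should point to the
directory holding the unpacked FASTA files (it defaults to the current
directory here). The sketch writes absolute paths and then reports empty
lines, trailing spaces and missing files.

.. code:: bash

   # DATA_DIR is an assumption: set it to the directory with the FASTA files.
   DATA_DIR="${DATA_DIR:-$PWD}"
   abs_dir="$(cd "$DATA_DIR" && pwd)"

   # Write one absolute path per line, in the genome order of this tutorial.
   for name in C_sativus O_sativa S_lycopersicum S_tuberosum; do
       printf '%s/%s.fasta\n' "$abs_dir" "$name"
   done > genome_locations.txt

   # Report empty lines or lines ending in whitespace.
   if grep -nE '^$|[[:space:]]$' genome_locations.txt; then
       echo "genome_locations.txt has empty lines or trailing spaces" >&2
   fi

   # Report listed files that do not exist.
   while IFS= read -r path; do
       [ -f "$path" ] || echo "missing: $path" >&2
   done < genome_locations.txt
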

Adding additional genomes
~~~~~~~~~~~~~~~~~~~~~~~~~
PanTools can add genomes to an already existing pangenome. To try this
out, prepare a text file containing the path to the maize chloroplast
genome. Call your text file **fifth_genome_location.txt** and add the
following line to the file:

.. code:: text

   YOUR_PATH/Z_mays.fasta

Run PanTools on the new text file using the
:ref:`add_genomes <construction/build:add genomes>` function:

.. code:: bash

   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations
~~~~~~~~~~~~~~~~~~
To add gene annotations to the pangenome, prepare a text file
containing the paths to the GFF files, each preceded by its genome
number. Call your text file **annotation_locations.txt** and add the
following lines to the file:

.. code:: text

   1 YOUR_PATH/C_sativus.gff3
   2 YOUR_PATH/O_sativa.gff3
   3 YOUR_PATH/S_lycopersicum.gff3
   4 YOUR_PATH/S_tuberosum.gff3
   5 YOUR_PATH/Z_mays.gff3
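
A file like the one above can also be generated from the shell. The
following is a minimal sketch under two assumptions: ``DATA_DIR`` is a
hypothetical variable pointing to the directory with the GFF files, and
the genome numbers follow the order in which the genomes were added in
this tutorial (the four initial genomes, then maize).

.. code:: bash

   # DATA_DIR is an assumption: set it to the directory with the GFF files.
   DATA_DIR="${DATA_DIR:-$PWD}"
   abs_dir="$(cd "$DATA_DIR" && pwd)"

   # Write "<genome number> <absolute path>" for each annotation file.
   n=1
   for name in C_sativus O_sativa S_lycopersicum S_tuberosum Z_mays; do
       printf '%d %s/%s.gff3\n' "$n" "$abs_dir" "$name"
       n=$((n + 1))
   done > annotation_locations.txt

   # Show the result for a visual check.
   cat annotation_locations.txt
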

Run PanTools using the :ref:`add_annotations <construction/annotate:add
annotations>` function and pass it the new text file:

.. code:: bash

   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt

PanTools has now attached the annotations to the nucleotide nodes, so we
can cluster the annotated protein sequences.

Homology grouping
~~~~~~~~~~~~~~~~~

PanTools can infer homology between the protein sequences of a
pangenome and cluster them into homology groups. Multiple parameters
can be set to influence the sensitivity, but for now we use the
:ref:`group <construction/group:group>` functionality with default
settings:

.. code:: bash

   $ pantools group chloroplast_DB

--------------

Adding phenotypes (requires PanTools v3)
----------------------------------------

Phenotype values can be of type Integer, Double, String or Boolean.
Create a text file **phenotypes.txt** with the following content:

.. code:: text

   Genome,Solanum
   1,false
   2,false
   3,true
   4,true
   5,false
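
The phenotype file can be created and sanity-checked from the shell. The
following is a minimal sketch, not a PanTools feature: it writes the
content shown above and uses ``awk`` to confirm that every row has as
many comma-separated fields as the header line.

.. code:: bash

   # Write the phenotype table exactly as shown in this tutorial.
   cat > phenotypes.txt <<'EOF'
   Genome,Solanum
   1,false
   2,false
   3,true
   4,true
   5,false
   EOF

   # Compare the field count of every row with that of the header.
   awk -F',' '
       NR == 1 { n = NF }
       NF != n { printf "line %d has %d fields, expected %d\n", NR, NF, n; bad = 1 }
       END     { exit bad }
   ' phenotypes.txt
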

Then use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add
the information to the pangenome:

.. code:: bash

   $ pantools add_phenotypes chloroplast_DB phenotypes.txt

--------------

RETRIEVE functions
------------------

Now that the construction is complete, let's quickly check that it was
successful and that the database can be used. To retrieve some genomic
regions, prepare a text file containing genomic coordinates. Create the
file **regions.txt** and include the following for each region: genome
number, contig number, start position and stop position, separated by
single spaces:

.. code:: text

   1 1 200 500
   2 1 300 700
   3 1 1 10000
   3 1 1 10000 -
   4 1 9999 15000
   5 1 100000 110000
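
Before running the retrieval, the coordinates can be checked from the
shell. The following is a minimal sketch, not a PanTools feature. It
assumes, based only on the example above, that each line holds four
fields with an optional fifth one, and it flags lines whose start
position is larger than the stop position.

.. code:: bash

   # Write the regions exactly as shown in this tutorial.
   cat > regions.txt <<'EOF'
   1 1 200 500
   2 1 300 700
   3 1 1 10000
   3 1 1 10000 -
   4 1 9999 15000
   5 1 100000 110000
   EOF

   # Flag lines with an unexpected field count or with start > stop.
   awk '
       (NF != 4 && NF != 5) || ($3 + 0 > $4 + 0) {
           printf "check line %d: %s\n", NR, $0
           bad = 1
       }
       END { exit bad }
   ' regions.txt
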

Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>`
function and pass it the new text file:

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the
**chloroplast_DB/retrieval/regions/** directory.

To retrieve entire genomes, prepare a text file **genome_numbers.txt**
and put each genome number on a separate line in the file:

.. code:: text

   1
   3
   5

Use the **retrieve_regions** function again, this time with the new text
file:

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to the same directory as before. Take a look at
one of the three genomes you have just retrieved.
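
For a quick overview of the retrieved files, the number of sequence
records per file can be counted from the shell. The following is a
minimal sketch that assumes the output files are in FASTA format, so
that each record starts with a ``>`` header line.
``count_fasta_records`` is a hypothetical helper defined here, not a
PanTools command.

.. code:: bash

   # Print each file in a directory with its number of FASTA header lines.
   count_fasta_records() {
       for f in "$1"/*; do
           [ -f "$f" ] || continue
           printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
       done
   }

   # Output directory named in this tutorial; prints nothing if it is absent.
   count_fasta_records chloroplast_DB/retrieval/regions
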

In :doc:`part 3 <tutorial_part3>` of the tutorial we explore the
pangenome you just built using the Neo4j browser and the Cypher language.