Part 1. Build your own pangenome using PanTools
===============================================

Install PanTools from bioconda following the
:ref:`install <getting_started/install:Install from bioconda>`
instructions.

To demonstrate the main functionalities of PanTools, we use a small
chloroplast dataset, which keeps construction times short.

.. csv-table::
   :file: /tables/chloroplast_datasets.csv
   :header-rows: 1
   :delim: ;

Download the chloroplast FASTA and GFF files
`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_
or with ``wget``:

.. code:: bash

   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
   $ tar xvzf chloroplasts.tar.gz  # unpack the archive

--------------

BUILD, ANNOTATE and GROUP
-------------------------

We start by building a pangenome from four of the five chloroplast
genomes. For this you need a text file that points PanTools to the
FASTA files. Call your text file **genome_locations.txt** and include
the following lines:

.. code:: text

   YOUR_PATH/C_sativus.fasta
   YOUR_PATH/O_sativa.fasta
   YOUR_PATH/S_lycopersicum.fasta
   YOUR_PATH/S_tuberosum.fasta

Make sure that ‘*YOUR_PATH*’ is the full path to the input files! The text
file should contain only full paths to FASTA files, one per line, with no
additional spaces or empty lines. Then run PanTools with the
:ref:`build_pangenome <construction/build:build pangenome>` function and
pass it the text file:

.. code:: bash

   $ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve
built your first pangenome!
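
Typing full paths by hand is error-prone. As a minimal sketch,
**genome_locations.txt** could also be generated with a small shell
function. ``write_locations`` is a hypothetical helper, not part of
PanTools, and the ``chloroplasts/`` directory in the usage comment is an
assumption about where the archive was unpacked.

.. code:: bash

   # Hypothetical helper (not a PanTools command): write the absolute path
   # of every given FASTA file to a locations file, one path per line.
   # Usage: write_locations OUTPUT_FILE FASTA_FILE...
   write_locations() (
       out=$1
       shift
       : > "$out"    # start with an empty file
       for f in "$@"; do
           if [ ! -f "$f" ]; then
               echo "missing input file: $f" >&2
               exit 1    # leaves the subshell, so the function returns 1
           fi
           # Resolve the directory of the file to an absolute path
           dir=$(cd "$(dirname "$f")" && pwd -P)
           printf '%s/%s\n' "$dir" "$(basename "$f")" >> "$out"
       done
   )

   # Example call, assuming the files were unpacked into ./chloroplasts:
   #   write_locations genome_locations.txt \
   #       chloroplasts/C_sativus.fasta chloroplasts/O_sativa.fasta \
   #       chloroplasts/S_lycopersicum.fasta chloroplasts/S_tuberosum.fasta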

Adding additional genomes
~~~~~~~~~~~~~~~~~~~~~~~~~

PanTools can add genomes to an existing pangenome. To try this, prepare
a text file containing the path to the maize chloroplast genome. Call
your text file **fifth_genome_location.txt** and include the following
line in the file:

.. code:: text

   YOUR_PATH/Z_mays.fasta

Run PanTools with the
:ref:`add_genomes <construction/build:add genomes>` function and the new
text file:

.. code:: bash

   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations
~~~~~~~~~~~~~~~~~~

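To add gene annotations to the pangenome, prepare a text file
containing the paths to the GFF files. Each line starts with the genome
number (genomes are numbered in the order in which they were added),
followed by a space and the full path to the annotation file. Call your
text file **annotation_locations.txt** and include the following lines
in the file: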

.. code:: text

   1 YOUR_PATH/C_sativus.gff3
   2 YOUR_PATH/O_sativa.gff3
   3 YOUR_PATH/S_lycopersicum.gff3
   4 YOUR_PATH/S_tuberosum.gff3
   5 YOUR_PATH/Z_mays.gff3
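
Because the genome numbers follow the order in which the genomes were
added, this file could be derived from the two genome lists. The sketch
below assumes that each GFF file sits next to its FASTA file and differs
only in the extension. ``number_annotations`` is a hypothetical helper,
not a PanTools command.

.. code:: bash

   # Hypothetical helper: read one or more genome location files and print
   # "<genome number> <path to .gff3>" for every non-empty line.
   # Assumes the annotation has the same base name as the FASTA file.
   number_annotations() {
       awk 'NF { sub(/\.fasta$/, ".gff3"); print ++n, $0 }' "$@"
   }

   # Example call with the two lists created earlier in this tutorial:
   #   number_annotations genome_locations.txt fifth_genome_location.txt \
   #       > annotation_locations.txt

The counter keeps running across input files, so the genome added later
receives number 5.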

Run PanTools using the :ref:`add_annotations <construction/annotate:add
annotations>` function and include the new text file. Also add the
``--connect`` flag to attach the annotations to the nucleotide nodes;
this is useful for exploring the pangenome in the Neo4j browser later
in the tutorial.

.. code:: bash

   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt

Homology grouping
~~~~~~~~~~~~~~~~~

PanTools can infer homology between the protein sequences of a pangenome
and cluster them into homology groups with the
:ref:`group <construction/group:group>` function. The ``--relaxation``
parameter, which controls the sensitivity of the grouping, must be set.
If you don't know the best relaxation setting, you can use the
:ref:`busco_protein <construction/group:busco protein>` and
:ref:`optimal_grouping <construction/group:optimal grouping>` functions
instead, but for now we will use a relaxation of 4.

.. code:: bash

   $ pantools group --relaxation=4 chloroplast_DB

--------------

Adding phenotypes
-----------------

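Phenotype values can be integers, doubles, strings or Booleans.
Create a text file **phenotypes.txt** with the following content. The
first column holds the genome number and each further column holds one
phenotype, here whether the genome belongs to *Solanum*.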

.. code:: text

   Genome,Solanum
   1,false
   2,false
   3,true
   4,true
   5,false
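
As a sketch, the layout of this file can be checked before it is added
to the pangenome. The hypothetical ``check_phenotypes`` function below is
not part of PanTools. It assumes the layout of the example: a header that
starts with ``Genome`` and one row per genome number.

.. code:: bash

   # Hypothetical helper: report rows that do not match the header layout.
   # Returns 0 when no problem is found and 1 otherwise.
   check_phenotypes() {
       awk -F',' '
           NR == 1 {
               ncol = NF    # every later row should have this many fields
               if ($1 != "Genome") {
                   print "line 1: first column should be named Genome"
                   bad = 1
               }
               next
           }
           NF != ncol {
               print "line " NR ": expected " ncol " fields, found " NF
               bad = 1
           }
           $1 !~ /^[0-9]+$/ {
               print "line " NR ": genome number is not a whole number"
               bad = 1
           }
           END { exit bad + 0 }
       ' "$1"
   }

   # Example call:
   #   check_phenotypes phenotypes.txt && echo "layout looks fine"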

Then use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add
the information to the pangenome.

.. code:: bash

   $ pantools add_phenotypes chloroplast_DB phenotypes.txt

--------------

RETRIEVE functions
------------------

Now that the construction is complete, let's quickly check that it was
successful and that the database can be used. To retrieve some genomic
regions, prepare a text file containing genomic coordinates. Create the
file **regions.txt** and include the following for each region: genome
number, contig number, start position and stop position, separated by
single spaces. A trailing ``-``, as on the fourth line of the example,
requests the reverse complement of the region.

.. code:: text

   1 1 200 500
   2 1 300 700
   3 1 1 10000
   3 1 1 10000 -
   4 1 9999 15000
   5 1 100000 110000

Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>`
function with the new text file:

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the
**chloroplast_DB/retrieval/regions/** directory.
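
A typo in **regions.txt** is easy to make. As a sketch, the hypothetical
``check_regions`` function below (not part of PanTools) checks the layout
of such a file. It assumes that the start position must not exceed the
stop position and that an optional fifth field is a strand sign, ``+`` or
``-``; only ``-`` appears in the example.

.. code:: bash

   # Hypothetical helper: validate "genome contig start stop [strand]" lines.
   # Returns 0 when no problem is found and 1 otherwise.
   check_regions() {
       awk '
           NF < 4 || NF > 5 {
               print "line " NR ": expected 4 or 5 fields, found " NF
               bad = 1
               next
           }
           {
               # The first four fields must be whole numbers
               for (i = 1; i <= 4; i++)
                   if ($i !~ /^[0-9]+$/) {
                       print "line " NR ": field " i " is not a whole number"
                       bad = 1
                   }
           }
           $3 + 0 > $4 + 0 {
               print "line " NR ": start is larger than stop"
               bad = 1
           }
           NF == 5 && $5 != "-" && $5 != "+" {
               print "line " NR ": unexpected fifth field " $5
               bad = 1
           }
           END { exit bad + 0 }
       ' "$1"
   }

   # Example call:
   #   check_regions regions.txt && echo "layout looks fine"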

To retrieve entire genomes, prepare a text file **genome_numbers.txt**
and include each genome number on a separate line in the file:

.. code:: text

   1
   3
   5

Use the **retrieve_regions** function again, this time with the new text
file:

.. code:: bash

   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to the same directory as before. Take a look at
one of the three genomes you have just retrieved.
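
One quick way to inspect a retrieved genome is to list the length of each
sequence and compare it with the input FASTA file. The sketch below uses
a hypothetical ``fasta_lengths`` helper. The names of the output files are
not given in this tutorial, so the path in the usage comment is a
placeholder.

.. code:: bash

   # Hypothetical helper: print "<sequence header><TAB><length>" for every
   # record in a FASTA file. Whitespace inside sequence lines is ignored.
   fasta_lengths() {
       awk '
           /^>/ {
               if (name != "") print name "\t" len
               name = substr($0, 2)    # header without the leading ">"
               len = 0
               next
           }
           { gsub(/[ \t\r]/, ""); len += length($0) }
           END { if (name != "") print name "\t" len }
       ' "$1"
   }

   # Example calls (RETRIEVED_FILE is a placeholder for a file found in
   # the retrieval directory):
   #   fasta_lengths YOUR_PATH/C_sativus.fasta
   #   fasta_lengths chloroplast_DB/retrieval/regions/RETRIEVED_FILE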

In :doc:`part 2 <tutorial_part2>` of the tutorial we explore the
pangenome you just built using the Neo4j browser and the Cypher language.