Annotate the pangenome graph
============================

Structural annotations
----------------------

Add annotations
^^^^^^^^^^^^^^^

Construct or expand the annotation layer of an existing pangenome. The
layer consists of genomic features like genes, mRNAs, proteins, tRNAs
etc. PanTools is only able to read General Feature Format (**GFF**)
files.

Multiple annotations can be assigned to a single genome; however, only
one annotation at a time can be included in an analysis. The most recently
added annotation of a genome is used by default, unless a
different annotation is specified via ``--annotations-file``. This annotation
file contains only annotation identifiers, each on a separate line. The most
recent annotation is used for genomes where no annotation number is specified
in the file. Below is an example where the third annotation of genome 1 is
selected and the second annotation of genomes 2 and 3.

.. code:: text

   1_3
   2_2
   3_2
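
As an illustration of the identifier format, the sketch below parses such a
selection file into a mapping from genome number to annotation number. It is
not PanTools code; the function name ``parse_annotation_ids`` is hypothetical.

.. code:: python

   def parse_annotation_ids(text):
       """Map genome number -> selected annotation number.

       Each non-empty line is expected to look like '<genome>_<annotation>'.
       """
       selection = {}
       for line in text.splitlines():
           line = line.strip()
           if not line:
               continue  # ignore blank lines
           genome, annotation = line.split("_")
           selection[int(genome)] = int(annotation)
       return selection


   print(parse_annotation_ids("1_3\n2_2\n3_2\n"))  # {1: 3, 2: 2, 3: 2}

Genomes that are missing from the mapping would fall back to their most recent
annotation, as described above.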

| **Note on GFF files**
| GFF files are notoriously difficult to parse. PanTools uses the Java
  library htsjdk to parse GFF files. Because the annotation has to be
  stored in the graph database, features may not always be added
  correctly. This is especially true for non-standard GFF
  files and annotated organellar genomes. If you encounter problems with
  a GFF file, please check whether it conforms to the
  `GFF3 specification <https://github.com/The-Sequence-Ontology/Specifications/
  blob/master/gff3.md>`_.
  PanTools should be able to handle all valid GFF3 files, with one
  exception: for a trans-spliced gene that has alternative
  splicing, only one mRNA is annotated.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <annotationsFile>
       - A text file in which each line contains a genome number and the full
         path to the corresponding annotation file, separated by a space.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--connect``
       - Connect the annotated genomic features to nucleotide nodes in the
         DBG.
     * - ``--ignore-invalid-features``
       - Ignore GFF3 features that do not match the fasta.
     * - ``--assume-one-mrna-per-cds``
       - Only relevant for features in GFF files that lack an mRNA between CDS
         and gene. By default, PanTools will assume that all CDS features belong
         to the same mRNA. If this option is set, PanTools will assume that each
         CDS feature belongs to a separate mRNA. For most GFF files this option
         should not be set.


**Example commands**
  .. code:: bash

     $ pantools add_annotations tomato_DB annotations.txt
     $ pantools add_annotations --connect tomato_DB annotations.txt

**Output**
  The annotated features are incorporated in the graph. Output files are
  written to the database directory.

  -  **annotation_overview.txt**, a summary of the GFF files incorporated
     in the pangenome.
  -  **annotation.log**, a list of misannotated feature identifiers.

**Example input file**
  Each line of the file starts with the genome number followed by the full
  path to the annotation file. The genome numbers match the line number of
  the file that you used to construct the pangenome.

  .. code:: text

     1 /always/genome1.gff
     2 /use_the/genome2.gff
     3 /full_path/genome3.gff
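
  A file like this can be generated rather than typed by hand. The following
  is a minimal sketch, assuming the GFF files are listed in the same order as
  the genomes were when the pangenome was constructed; the helper name
  ``annotation_lines`` and the file names are hypothetical.

  .. code:: python

     from pathlib import Path


     def annotation_lines(gff_paths):
         """Return '<genome number> <absolute path>' lines, numbered from 1."""
         return [
             f"{number} {Path(path).resolve()}"
             for number, path in enumerate(gff_paths, start=1)
         ]


     lines = annotation_lines(["genome1.gff", "genome2.gff", "genome3.gff"])
     print("\n".join(lines))

  Because the two fields are separated by a space, this sketch is only
  suitable for paths that do not contain spaces themselves.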

**GFF3 file format**
  The GFF3 format consists of one line per feature, each containing nine
  tab-separated columns of data, plus optional track definition lines.
  Please use the proper feature hierarchy,
  **gene** -> **mRNA** -> **CDS**, where *gene* is the parent of *mRNA*
  and *mRNA* is the parent of the *CDS* feature. The following example
  from *Saccharomyces cerevisiae* YJM320 (GCA_000975885) displays a
  correctly formatted gene entry:

  .. code:: text

     CP004621.1      Genbank gene    44836   45753   .       -       .       ID=gene99;Name=RPL23A;end_range=45753,.;gbkey=Gene;gene=RPL23A;gene_biotype=protein_coding;locus_tag=H754_YJM320B00023;partial=true;start_range=.,44836
     CP004621.1      Genbank mRNA    44836   45753   .       -       .       ID=rna99;Parent=gene99;gbkey=mRNA;gene=RPL23A;product=Rpl23ap
     CP004621.1      Genbank exon    45712   45753   .       -       .       ID=id112;Parent=rna99;gbkey=mRNA;gene=RPL23A;product=Rpl23ap
     CP004621.1      Genbank exon    44836   45207   .       -       .       ID=id113;Parent=rna99;gbkey=mRNA;gene=RPL23A;product=Rpl23ap
     CP004621.1      Genbank CDS     45712   45753   .       -       0       ID=cds92;Parent=rna99;Dbxref=SGD:S000000183,NCBI_GP:AJQ01854.1;Name=AJQ01854.1;Note=corresponds to s288c YBL087C;gbkey=CDS;gene=RPL23A;product=Rpl23ap;protein_id=AJQ01854.1
     CP004621.1      Genbank CDS     44836   45207   .       -       0       ID=cds92;Parent=rna99;Dbxref=SGD:S000000183,NCBI_GP:AJQ01854.1;Name=AJQ01854.1;Note=corresponds to s288c YBL087C;gbkey=CDS;gene=RPL23A;product=Rpl23ap;protein_id=AJQ01854.1
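
  The hierarchy can be checked before running ``add_annotations``. The sketch
  below is not part of PanTools and is far less thorough than a real GFF3
  validator: it only tests for nine tab-separated columns and for the
  gene -> mRNA -> CDS parent chain, and it assumes every feature has a single
  ``Parent``. The name ``check_hierarchy`` is hypothetical.

  .. code:: python

     EXPECTED_PARENT = {"mRNA": "gene", "CDS": "mRNA"}


     def check_hierarchy(gff_text):
         """Return a list of problems with the gene -> mRNA -> CDS nesting."""
         feature_type = {}  # feature ID -> feature type (column 3)
         children = []      # (line number, type, Parent ID) of mRNA/CDS lines
         problems = []
         for number, line in enumerate(gff_text.splitlines(), start=1):
             if not line.strip() or line.startswith("#"):
                 continue  # skip blank lines, directives and comments
             columns = line.split("\t")
             if len(columns) != 9:
                 problems.append(f"line {number}: expected 9 columns")
                 continue
             # Column 9 holds 'key=value' pairs separated by semicolons.
             attributes = dict(
                 pair.split("=", 1)
                 for pair in columns[8].split(";")
                 if "=" in pair
             )
             if "ID" in attributes:
                 feature_type[attributes["ID"]] = columns[2]
             if columns[2] in EXPECTED_PARENT:
                 children.append((number, columns[2], attributes.get("Parent")))
         for number, kind, parent in children:
             if feature_type.get(parent) != EXPECTED_PARENT[kind]:
                 problems.append(
                     f"line {number}: {kind} needs a {EXPECTED_PARENT[kind]} parent"
                 )
         return problems


     gene = "chr1\tsrc\tgene\t1\t90\t.\t+\t.\tID=gene1"
     mrna = "chr1\tsrc\tmRNA\t1\t90\t.\t+\t.\tID=rna1;Parent=gene1"
     cds = "chr1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=cds1;Parent=rna1"
     print(check_hierarchy("\n".join([gene, mrna, cds])))  # [] means no problems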

| **Select specific annotations for analysis**
| Only **one** annotation per genome is considered by any PanTools
  functionality. When multiple annotations are included, the last added
  annotation of a genome is automatically selected unless an
  ``--annotations-file`` is included specifying which annotations to use.
  This annotation file contains only annotation identifiers, each on a
  separate line. The most recent annotation is used for genomes where no
  annotation number is specified in the file. Below is an example where
  the third annotation of genome 1 is selected and the second annotation
  of genomes 2 and 3.

.. code:: text

   1_3
   2_2
   3_2

-----------------------

Remove annotations
^^^^^^^^^^^^^^^^^^

Remove all the genomic features that belong to annotations, such as
*gene*, *mRNA*, *exon*, *tRNA*, and *feature* nodes. Functional
annotation nodes are not removed with this function but can be removed
with :ref:`remove_functions <construction/annotate:remove functions>`. Removing
annotations can be done in two ways:

1. Select genomes with ``--include`` or ``--exclude``, for which all
   annotation features will be removed.
2. Remove specific annotations by providing a text file with identifiers
   via the ``--annotations-file`` argument.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  Requires **one** of ``--include``\|\ ``--exclude``\|\ ``--annotations-file``.

  .. list-table::
     :widths: 30 70

     * - ``--include``/``-i``
       - A selection of genomes for which all annotations will be removed.
     * - ``--exclude``/``-e``
       - A selection of genomes excluded from the removal of annotations.
     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations to be removed,
         each on a separate line.

**Example annotations file**
  The annotations file should contain one annotation identifier per
  line (genome number, annotation number). The following example will
  remove the first annotation of genomes 1, 2 and 3 and the second
  annotation of genome 1.

  .. code:: text

     1_1
     1_2
     2_1
     3_1

**Example commands**
  .. code:: bash

     $ pantools remove_annotations --exclude=3,4,5 tomato_DB
     $ pantools remove_annotations -A annotations.txt tomato_DB

--------------------

Functional annotations
----------------------

PanTools is able to incorporate functional annotations into the
pangenome by reading output from various functional annotation tools.

Add functions
^^^^^^^^^^^^^

This function can integrate different functional annotations from a
variety of annotation files. Currently available functional annotations:
**Gene Ontology**, **Pfam**, **InterPro**, **TIGRFAM**, **Phobius**,
**SignalP** and **COG**. The first time this function is executed, the
Pfam, TIGRFAM, GO, and InterPro databases are integrated into the
pangenome. Phobius, SignalP and COG annotations do not have separate
nodes and are directly annotated on 'mRNA' nodes in the pangenome.

Gene names (or identifiers) from the input file are used to identify
gene nodes in the pangenome. Only genes with an exactly matching
name/identifier can be connected to functional annotation nodes! Use the
same FASTA and GFF3 files that were used to construct the pangenome database.
(It is best to use the protein fasta files in the ``proteins`` directory of the
database.)

| **Functional databases**
| If the required databases are not available, PanTools downloads and
  extracts them (Pfam, TIGRFAM, GO and InterPro are downloaded from the web).
  Prior to v4.2.0, PanTools came with these databases pre-downloaded. This is
  no longer the case, as it limited the distribution of PanTools as a single
  binary file. We strongly suggest setting the ``-F`` option to an easily
  accessible location to prevent unnecessary downloads from the internet.

| PanTools has been tested with the following versions of the databases:

.. list-table::
   :widths: 50 50
   :header-rows: 1

   * - Database type
     - Version
   * - GO
     - 2021-12-15
   * - Pfam
     - 35.0
   * - TIGRFAM
     - 15.0
   * - InterPro
     - 87.0

| The exact filenames PanTools checks for are:

.. csv-table::
   :file: /tables/functional_databases.csv
   :header-rows: 1
   :delim: ;

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <functionsFile>
       - A text file in which each line contains a genome number and the full
         path to the corresponding annotation file, separated by a space.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations to be included,
         each on a separate line. The most recent annotation is selected for
         genomes without an identifier.
     * - ``--functional-databases-directory``/``-F``
       - Path to the directory containing the functional databases. If the
         databases are not present, they are downloaded automatically. (Default
         location is "functional_databases" in the database directory.)

**Example commands**
  .. code:: bash

     $ pantools add_functions -F ~/function_databases tomato_DB f_annotations.txt
     $ pantools add_functions -F ~/function_databases -A annotations.txt tomato_DB f_annotations.txt

**Output**
  Functional annotations are incorporated in the graph. A log file is
  written to the **log** directory.

  -  **add_functional_annotations.log**, a log file with the number of
     added functions per type and the identifiers of functions that could
     not be included.

**Example function files**
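  The <functionsFile> must be formatted like an annotation input
  file. Each line of the file starts with the genome number followed by
  the full path to an annotation file. As the last two lines of the
  example below show, the file type can also be stated after the path
  when it cannot be recognized from the file name.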

  .. list-table::
     :widths: 40 60
     :header-rows: 1

     * - File type
       - Recognized by pattern in file name
     * - InterProScan
       - interpro & .gff
     * - eggNOG-mapper
       - eggnog
     * - Phobius
       - phobius
     * - SignalP
       - signalp
     * - Custom file
       - custom

  .. code:: text

     1 /mnt/scratch/interpro_results_genome_1.gff
     1 /mnt/scratch/custom_annotation_1.txt
     1 /mnt/scratch/phobius_1.txt
     2 /mnt/scratch/signalp.txt
     2 /mnt/scratch/eggnog_genome_2.annotations
     2 /mnt/scratch/transmembrane_annotations.txt phobius
     3 /mnt/scratch/ipro_results_genome_3.annot custom
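
  The table and example above can be read as the following rule of thumb. This
  is only a sketch of the documented patterns, not the PanTools
  implementation; in particular, the case-insensitive matching and the names
  ``guess_file_type`` and ``parse_functions_line`` are assumptions.

  .. code:: python

     from pathlib import Path


     def guess_file_type(path, explicit_type=None):
         """Return the file type for one entry of the functions file."""
         if explicit_type:
             return explicit_type  # a type stated after the path wins
         name = Path(path).name.lower()
         if "interpro" in name and ".gff" in name:
             return "interproscan"
         for pattern in ("eggnog", "phobius", "signalp", "custom"):
             if pattern in name:
                 return pattern
         return None  # not recognized: state the type explicitly


     def parse_functions_line(line):
         """Split '<genome> <path> [type]' into (genome, path, file type)."""
         fields = line.split()
         explicit_type = fields[2] if len(fields) > 2 else None
         return int(fields[0]), fields[1], guess_file_type(fields[1], explicit_type)


     print(parse_functions_line("1 /mnt/scratch/phobius_1.txt"))
     print(parse_functions_line("3 /mnt/scratch/ipro_results_genome_3.annot custom"))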

**Annotation file types**
  PanTools can recognize functional annotations in different output
  formats.

  Phobius and SignalP are not standard analyses of the InterProScan
  pipeline and require some additional steps during the InterProScan
  installation. Please take a look at
  :ref:`our InterProScan install instruction <developer_guide/install:install interproscan>`
  to verify if the tools are part of the prediction pipeline. Phobius 1.01

  .. list-table::
     :widths: 20 80
     :header-rows: 1

     * - Function type
       - Allowed annotation file
     * - GO
       - InterProScan .gff & custom annotation file
     * - Pfam
       - InterProScan .gff & custom annotation file
     * - InterPro
       - InterProScan .gff & custom annotation file
     * - TIGRFAM
       - InterProScan .gff & custom annotation file
     * - Phobius
       - InterProScan .gff & Phobius 1.01 output
     * - SignalP
       - InterProScan .gff, SignalP 4.1 output, SignalP 5.0 output
     * - COG
       - eggNOG-mapper

  InterProScan gff file:

  .. code:: text

     ##gff-version 3
     ##interproscan-version 5.52-86.0
     AT4G21230.1   ProSiteProfiles protein_match 333 620 39.000664   +   .   date=06-10-2021;Target=mRNA.AT4G21230.1 333 620;Ontology_term="GO:0004672","GO:0005524","GO:0006468";ID=match$42_333_620;signature_desc=Protein kinase domain profile.;Name=PS50011;status=T;Dbxref="InterPro:IPR000719"
     AT3G08980.5   TIGRFAM protein_match         25  101 3.7E-14     +   .   date=06-10-2021;Target=mRNA.AT3G08980.5 25 101;Ontology_term="GO:0006508","GO:0008236","GO:0016020";ID=match$66_25_101;signature_desc=sigpep_I_bact: signal peptidase I;Name=TIGR02227;status=T;Dbxref="InterPro:IPR000223"
     AT2G17780.2   Phobius protein_match         338 354 .           +   .   date=06-10-2021;Target=AT2G17780.2 338 354;ID=match$141_338_354;signature_desc=Region of a membrane-bound protein predicted to be embedded in the membrane.;Name=TRANSMEMBRANE;status=T
     AT2G17780.2   Phobius protein_match         1   337 .           +   .   date=06-10-2021;Target=AT2G17780.2 1 337;ID=match$142_1_337;signature_desc=Region of a membrane-bound protein predicted to be outside the membrane, in the extracellular region.;Name=NON_CYTOPLASMIC_DOMAIN;status=T
     AT3G11780.2   SignalP_EUK protein_match     1   24  .           +   .   date=06-10-2021;Target=mRNA.AT3G11780.2 1 24;ID=match$230_1_24;Name=SignalP-noTM;status=T
     AT1G04300.2   CDD protein_match             40  114 1.54717E-13 +   .   date=06-10-2021;Target=mRNA.AT1G04300.2 40 114;Ontology_term="GO:0005515";ID=match$212_40_114;signature_desc=MATH;Name=cd00121;status=T;Dbxref="InterPro:IPR002083"

  eggNOG-mapper (tab separated) file:

  .. code:: text

     #query_name     seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score best_tax_level Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction taxonomic scope eggNOG OGs best eggNOG OG COG Functional cat. eggNOG free text desc.
     ATKYO-2G54530.1 3702.AT2G35130.2     1.9e-179             636.0               Brassicales     GO:0003674,GO:0003676,GO:0003723,GO:0003824,GO:0004518,GO:0004519,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0006139,GO:0006725,GO:0006807,GO:0008150,GO:0008152,GO:0009451,GO:0009987,GO:0016070,GO:0016787,GO:0016788,GO:0034641,GO:0043170,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0043412,GO:0044237,GO:0044238,GO:0044424,GO:0044464,GO:0046483,GO:0071704,GO:0090304,GO:0090305,GO:0097159,GO:1901360,GO:1901363                                           Viridiplantae   37R67@33090,3GAUT@35493,3HNDD@3699,KOG4197@1,KOG4197@2759   NA|NA|NA    E   Pentacotripeptide-repeat region of PRORP
     ATKYO-UG22500.1 3712.Bo02269s010.1   7.5e-35              153.7               Brassicales                                                 Viridiplantae   29I9W@1,2RRH4@2759,383W6@33090,3GWQZ@35493,3I1A9@3699   NA|NA|NA
     ATKYO-1G60060.1 3702.AT1G48090.1     0.0                  6241.0              Brassicales             ko:K19525                   ko00000             Viridiplantae   37IJB@33090,3GAN0@35493,3HQ90@3699,COG5043@1,KOG1809@2759   NA|NA|NA    U   Vacuolar protein sorting-associated protein
     ATKYO-3G74720.1 3702.AT3G52120.1     7.2e-245             852.8               Brassicales             ko:K13096                   ko00000,ko03041             Viridiplantae   37QYY@33090,3G9VU@35493,3HRDK@3699,KOG0965@1,KOG0965@2759   NA|NA|NA    L   SWAP (Suppressor-of-White-APricot) surp domain-containing protein D111 G-patch domain-containing protein
     ATKYO-4G41660.1 3702.AT4G16340.1     0.0                  3392.1              Brassicales     GO:0003674,GO:0005085,GO:0005088,GO:0005089,GO:0005488,GO:0005515,GO:0005575,GO:0005622,GO:0005623,GO:0005634,GO:0005737,GO:0005783,GO:0005829,GO:0005886,GO:0006810,GO:0008064,GO:0008150,GO:0008360,GO:0009605,GO:0009606,GO:0009628,GO:0009629,GO:0009630,GO:0009958,GO:0009966,GO:0009987,GO:0010646,GO:0010928,GO:0012505,GO:0016020,GO:0016043,GO:0016192,GO:0017016,GO:0017048,GO:0019898,GO:0019899,GO:0022603,GO:0022604,GO:0023051,GO:0030832,GO:0031267,GO:0032535,GO:0032956,GO:0032970,GO:0033043,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0044422,GO:0044424,GO:0044425,GO:0044432,GO:0044444,GO:0044446,GO:0044464,GO:0048583,GO:0050789,GO:0050793,GO:0050794,GO:0050896,GO:0051020,GO:0051128,GO:0051179,GO:0051234,GO:0051493,GO:0065007,GO:0065008,GO:0065009,GO:0070971,GO:0071840,GO:0071944,GO:0090066,GO:0098772,GO:0110053,GO:1902903     ko:K21852                   ko00000,ko04131             Viridiplantae   37QIM@33090,3G8RK@35493,3HSFN@3699,KOG1997@1,KOG1997@2759   NA|NA|NA    T   Belongs to the DOCK family

  A custom input file must consist of two tab or comma separated columns.
  The first column should contain a gene/mRNA id, the second an identifier
  from one of four functional annotation databases: GO, Pfam, InterPro or
  TIGRFAM.

  .. code:: text

     AT5G23090.4,GO:0046982
     AT5G23090.4,IPR009072
     AT1G27540.2,PF03478
     AT2G18450.1,TIGR01816
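
  Before adding a custom file, it can help to check that every identifier in
  the second column looks like one of the four supported types. The sketch
  below does this with regular expressions based on the usual accession
  formats (for example ``GO:`` plus seven digits); the patterns and the
  function names are assumptions and do not reflect how PanTools itself
  validates the file.

  .. code:: python

     import re

     # Assumed accession formats, matching the example identifiers above.
     PATTERNS = {
         "GO": re.compile(r"GO:\d{7}"),
         "InterPro": re.compile(r"IPR\d{6}"),
         "Pfam": re.compile(r"PF\d{5}"),
         "TIGRFAM": re.compile(r"TIGR\d{5}"),
     }


     def classify(identifier):
         """Return the database an identifier seems to belong to, or None."""
         for database, pattern in PATTERNS.items():
             if pattern.fullmatch(identifier):
                 return database
         return None


     def parse_custom_line(line):
         """Split a tab or comma separated line and classify its identifier."""
         gene_id, identifier = re.split(r"[\t,]", line.strip(), maxsplit=1)
         return gene_id, identifier, classify(identifier)


     print(parse_custom_line("AT5G23090.4,GO:0046982"))
     print(parse_custom_line("AT1G27540.2\tPF03478"))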

  Phobius 1.01 'short' (tab separated) functions file:

  .. code:: text

     SEQENCE ID                     TM SP PREDICTION
     mRNA-YPR204W                    0  0 o
     mRNA-ndhB-2_1                   6  Y n5-16c21/22o37-57i64-83o89-113i134-156o168-189i223-246o

  Phobius 1.01 'long' (tab separated) functions file:

  .. code:: text

     ID   mRNA-YPR204W
     FT   DOMAIN        1   1032       NON CYTOPLASMIC.
     //
     ID   mRNA-ndhB-2_1
     FT   SIGNAL        1     21
     FT   DOMAIN        1      4       N-REGION.
     FT   DOMAIN        5     16       H-REGION.
     FT   DOMAIN       17     21       C-REGION.
     FT   DOMAIN       22     36       NON CYTOPLASMIC.
     FT   TRANSMEM     37     57
     FT   DOMAIN       58     63       CYTOPLASMIC.
     FT   TRANSMEM     64     83
     FT   DOMAIN       84     88       NON CYTOPLASMIC.
     FT   TRANSMEM     89    113
     FT   DOMAIN      114    133       CYTOPLASMIC.
     FT   TRANSMEM    134    156
     FT   DOMAIN      157    167       NON CYTOPLASMIC.
     FT   TRANSMEM    168    189
     FT   DOMAIN      190    222       CYTOPLASMIC.
     FT   TRANSMEM    223    246
     FT   DOMAIN      247    253       NON CYTOPLASMIC.
     //

  SignalP 4.1 'short' (tab separated) functions file:

  .. code:: text

     # name                     Cmax  pos  Ymax  pos  Smax  pos  Smean   D     ?  Dmaxcut    Networks-used
     mRNA-rpl2-3                0.148  20  0.136  20  0.146   3  0.126   0.131 N  0.450      SignalP-noTM
     mRNA-cox2                  0.107  25  0.132  12  0.270   4  0.162   0.148 N  0.450      SignalP-noTM
     mRNA-cox2_1                0.850  17  0.776  17  0.785   2  0.717   0.753 Y  0.500      SignalP-TM

  SignalP 5.0 'short' (tab separated) functions file:

  .. code:: text

     # SignalP-5.0 Organism:   Eukarya     Timestamp: 20211122233246
     # ID          Prediction  SP(Sec/SPI) OTHER    CS Position
     AT3G26880.1   SP(Sec/SPI) 0.998803    0.001197 CS pos: 21-22. VYG-KK. Pr: 0.9807
     mRNA-rpl2-3   OTHER       0.001227    0.998773

**Relevant literature**
  -  `Expansion of the Gene Ontology knowledgebase and resources
     <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210579/>`_
  -  `InterPro in 2019: improving coverage, classification and access to
     protein sequence annotations
     <https://academic.oup.com/nar/article/47/D1/D351/5162469>`_
  -  `TIGRFAMs and Genome Properties in 2013
     <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531188/>`_
  -  `A Combined Transmembrane Topology and Signal Peptide Prediction
     Method
     <https://www.sciencedirect.com/science/article/abs/pii/S0022283604002943?
     via%3Dihub>`_
  -  `Expanded microbial genome coverage and improved protein family
     annotation in the COG database
     <https://academic.oup.com/nar/article/43/D1/D261/2439462>`_

--------------

Remove functions
^^^^^^^^^^^^^^^^

Remove functional annotation features from the graph database. Functional
annotations include the *GO*, *pfam*, *tigrfam* and *interpro* nodes as well as
*mRNA* node properties for *COG*, *phobius* and *signalp*. There are multiple
modes available:

- 'all' removes all functional annotation nodes and properties.
- 'nodes' removes all *GO*, *pfam*, *tigrfam* and *interpro* nodes.
- 'properties' removes all *COG*, *phobius* and *signalp* properties
  from *mRNA* nodes.
- 'COG' removes all *COG* properties from *mRNA* nodes.
- 'phobius' removes all *phobius* properties from *mRNA* nodes.
- 'signalp' removes all *signalp* properties from *mRNA* nodes.
- 'bgc' removes all antiSMASH BGC nodes and relationships.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--mode``/``-m``
       - Mode for which annotations to remove (default: all). Can be one of
         'all', 'nodes', 'properties', 'COG', 'phobius', 'signalp' or 'bgc'. See
         above for more information.

**Example commands**
  .. code:: bash

     $ pantools remove_functions tomato_DB
     $ pantools remove_functions --mode nodes tomato_DB

-----------------

Add antiSMASH
^^^^^^^^^^^^^

Read antiSMASH output and incorporate **Biosynthetic Gene Cluster**
(BGC) nodes into the pangenome database. A 'bgc' node holds the gene
cluster product and the cluster address, and has a relationship to all gene
nodes of the cluster. For this function to work, antiSMASH should be
run with the same FASTA and GFF3 files used for building the
pangenome. antiSMASH output will not match the identifiers of the
pangenome when no GFF file was included.

As of PanTools v3.3.4, the required antiSMASH version is 6.0.0. Gene
cluster information is parsed from the JSON file that is generated in
each run. We try to keep the parser compatible with newer versions;
please contact us if it no longer works.

.. list-table::
   :widths: 35 30 35
   :header-rows: 1

   * -
     - Version
     - Version Date
   * - antiSMASH
     - 6.0.0
     - 21-02-2021

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <antiSMASHFile>
       - A text file in which each line contains a genome number and the full
         path to the corresponding antiSMASH output file, separated by a space.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations to be included,
         each on a separate line. The most recent annotation is selected for
         genomes without an identifier.

**Example antiSMASH file**
  The <antiSMASHFile> must be formatted like a regular annotation
  input file. Each line of the file starts with the genome number followed
  by the full path to the **JSON** file.

  .. code:: text

     1 /mnt/scratch/IPO3844/antismash/IPO3844.json
     4 /home/user/IPO3845/antismash/IPO3845.json

**Example commands**
  .. code:: bash

     $ pantools add_antismash tomato_DB clusters.txt
     $ pantools add_antismash -A annotations.txt tomato_DB clusters.txt

--------------

Function overview
^^^^^^^^^^^^^^^^^

Creates several summary files for each type of functional annotation
present in the database: GO, PFAM, InterPro, TIGRFAM, COG, Phobius, and
biosynthetic gene clusters from antiSMASH. In addition to the functions
that must be added via
:ref:`add_functions <construction/annotate:add functions>`,
this function also requires proteins to be clustered by
:ref:`group <construction/group:group>`.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the pangenome database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes.
     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations that should be used.
         The most recent annotation is selected for genomes without an
         identifier.

**Example commands**
  .. code:: bash

     $ pantools function_overview tomato_DB
     $ pantools function_overview --include=2-4 tomato_DB

**Output**
  Output files are written to the *function* directory in the database. The
  overview CSV files are tables in which each row holds a function identifier
  and its frequency per genome.

  -  **functions_per_group_and_mrna.csv**, overview of all homology groups
     and the associated functions.
  -  **function_counts_per_group.csv**,
  -  **go_overview.csv**, overview of the GO terms in the pangenome.
  -  **pfam_overview.csv**, overview of the PFAM domains in the pangenome.
  -  **tigrfam_overview.csv**, overview of the TIGRFAMs in the pangenome.
  -  **interpro_overview.csv**, overview of the InterPro domains in the
     pangenome.
  -  **bgc_overview.csv**, overview of the added biosynthetic gene
     clusters from antiSMASH in the pangenome.
  -  **phobius_signalp_overview.csv**, overview of the included Phobius
     transmembrane topology and signal peptide predictions in the
     pangenome.
  -  **cog_overview.csv**, overview of the functional COG categories in
     the pangenome.
  -  **cog_per_class.R**, an R script to plot the distribution of COG
     categories over the core, accessory and unique homology groups.

  .. figure:: /figures/COG_abundance.png
     :width: 600
     :align: center

     *Example output of*\   **cog_per_class.R**\ *. The proportion of COG
     functional categories assigned to homology groups.*

--------------------

Phenotypes
----------

Add phenotypes
^^^^^^^^^^^^^^

Include phenotype data in the pangenome, which allows the
identification of phenotype-specific genes, SNPs, functions, etc.
The data can be altered by rerunning the command with an updated CSV
file.

| **Data types**
| Each phenotype node contains a genome number and can hold the
  following data types: **String**, **Integer**, **Float** or
  **Boolean**.

-  Values recognized as whole numbers are converted to an **Integer**;
   values with one or more decimals are converted to a **Double**.
-  **Boolean** types are identified by checking if the value matches
   'true' or 'false', ignoring capitalization of letters.
-  **String** values remain completely unaltered except for spaces and
   quote characters. Spaces are changed into an underscore ('\_')
   character and quotes are completely removed.
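
The conversion rules above can be sketched as follows. This is an illustration
only, not the PanTools implementation: the order of the checks, the handling of
both single and double quotes, and the name ``convert_value`` are assumptions.
A Python ``float`` stands in for the **Double** type.

.. code:: python

   def convert_value(raw):
       """Convert one CSV field following the documented rules."""
       if raw.lower() in ("true", "false"):
           return raw.lower() == "true"  # Boolean, ignoring capitalization
       try:
           return int(raw)  # whole number -> Integer
       except ValueError:
           pass
       try:
           return float(raw)  # number with decimals -> Double
       except ValueError:
           pass
       # String: spaces become underscores, quote characters are dropped.
       return raw.replace(" ", "_").replace('"', "").replace("'", "")


   for field in ["15", "0.1", "TRUE", 'New "York" city', "+"]:
       print(repr(field), "->", repr(convert_value(field)))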

| **Bin numerical values**
| When using numerical values, two genomes are only considered to share
  a phenotype if the value is identical. PanTools therefore creates an
  alternative version of these phenotypes by binning the values. Taking
  'Pathogenicity' from the example below, the integers range from 3
  to 15. Using these two extreme values, three bins are created for a
  new phenotype 'Pathogenicity_binned': 3-6.33, 6.34-11.66 and 11.67-15.
  The number of bins is controlled through ``--bins``. For skewed data,
  consider creating the bins manually and including them as a string
  phenotype.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <phenotypesFile>
       - A CSV file containing the phenotype information.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--scratch-directory``
       - Temporary directory for storing localization update files. If not set,
         a temporary directory will be created inside the default temporary-file
         directory. On most Linux distributions this default temporary-file
         directory will be ``/tmp``, on macOS typically ``/var/folders/``.

         If a scratch directory is set, it will be created if it does not exist.
         If it does exist, PanTools will verify the directory is empty and, if
         not, raise an exception.
     * - ``--append``
       - Do not remove existing phenotype nodes but only add new
         properties to them. If a property already exists, values from
         the new file will overwrite the old.
     * - ``--bins``
       - Number of bins used to group numerical values of a phenotype
         (default: 3).

**Example phenotypes file**
  The input file needs to be in CSV format, a plain text file where each
  value is separated by a comma. The first **row** should start with
  'Genome,' followed by the phenotype names and/or identifiers. The first
  **column** must contain the genome numbers corresponding to those in
  your pangenome. Phenotypes and metadata must be placed on the same line
  as their genome number. A field can remain empty when the phenotype for
  a genome is missing or unknown. Below is an example of five genomes
  with six phenotypes:

  .. code:: text

     Genome,Gram,Region,Pathogenicity,Boolean,float,species
     1,+,NL,3,True,0.1,Species
     2,+,BE,,False,0.1,Species3
     3,+,LUX,7,true,0.1,Species3
     4,+,NL,9,false,0.1,Species3
     5,+,BE,15,TRUE,0.1,Species1
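
  A quick sanity check of the layout described above could look like the
  sketch below. It only verifies the 'Genome' header, the number of fields per
  row and that the first field is a number; it is not how PanTools reads the
  file, and the name ``check_phenotype_csv`` is hypothetical.

  .. code:: python

     import csv
     import io


     def check_phenotype_csv(text):
         """Return a list of layout problems found in a phenotype CSV."""
         rows = [row for row in csv.reader(io.StringIO(text)) if row]
         if not rows:
             return ["file is empty"]
         header, problems = rows[0], []
         if header[0] != "Genome":
             problems.append("first header field should be 'Genome'")
         for number, row in enumerate(rows[1:], start=1):
             if len(row) != len(header):
                 problems.append(f"data row {number}: expected {len(header)} fields")
             if not row[0].isdigit():
                 problems.append(f"data row {number}: missing genome number")
         return problems


     sample = (
         "Genome,Gram,Region,Pathogenicity,Boolean,float,species\n"
         "1,+,NL,3,True,0.1,Species\n"
         "2,+,BE,,False,0.1,Species3\n"
     )
     print(check_phenotype_csv(sample))  # [] means no problems were found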

**Example commands**
  .. code:: bash

     $ pantools add_phenotypes tomato_DB pheno.csv
     $ pantools add_phenotypes --append tomato_DB pheno.csv

**Output**
  Phenotype information is stored in 'phenotype' nodes in the graph. An
  output file is written to the database directory.

  -  **phenotype_overview.txt**, a summary of the available phenotypes in
     the pangenome.

---------------------

Remove phenotypes
^^^^^^^^^^^^^^^^^

Delete **phenotype** nodes or remove specific phenotype information from
the nodes. The specific phenotype property needs to be specified with
``--phenotype``. When this argument is not included, *phenotype* nodes
are removed.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--include``/``-i``
       - Only remove nodes of the selected genomes.
     * - ``--exclude``/``-e``
       - Do not remove nodes of the selected genomes.
     * - ``--phenotype``/``-p``
       - Name of the phenotype. All information of the given phenotype is
         removed from 'phenotype' nodes.

**Example commands**
  .. code:: bash

     $ pantools remove_phenotypes tomato_DB
     $ pantools remove_phenotypes --phenotype=color tomato_DB
     $ pantools remove_phenotypes --phenotype=color --exclude=11,12 tomato_DB

----------------------

Genomic variation
-----------------

Add genomic variation to the pangenome database. These functions can
handle SNP (single nucleotide polymorphism)/InDel (insertion/deletion) and PAV
(presence/absence variation) information but will only consider genic variation
when adding the information to the database. For SNP/InDel information, VCF
(variant call format) files are required. For PAV information, a tab-separated
file with 1s and 0s describing presence and absence, respectively, is required.

--------------

Add Variants
^^^^^^^^^^^^

Add variants to the pangenome database. The function will only consider
genomic variation that is present in the mRNA features of the pangenome.
The SNP/InDel information will be used to create a consensus sequence for each
mRNA feature. For each accession and mRNA feature, a new variant node will be
created to hold this consensus sequence.

Several temporary files will be created during the process: a fasta file
containing the original mRNA sequences and fasta files containing the consensus
mRNA sequences for each sample. These files will be deleted after the process
is finished unless the ``--keep-intermediate-files`` option is used.
By default, the location of these files will be at ``/tmp`` for Linux and
``/var/folders`` for macOS. The location can be changed with the
``--scratch-directory`` option.

NB: VCF files that are not indexed with tabix will be indexed automatically in
their original location!

**Required software**
  -  `bcftools <https://samtools.github.io/bcftools/bcftools.html>`_
  -  `tabix <https://www.htslib.org/doc/tabix.html>`_

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <vcfsFile>
       - A text file with on each line a genome number and the full path
         to a corresponding VCF file, separated by a space.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--threads``/``-t``
       - Number of threads to use. Default: total number of cores
         available or 8, whichever is lower.
     * - ``--scratch-directory``
       - Temporary directory for storing intermediate files. If not set, a
         temporary directory will be created inside the default temporary-file
         directory. On most Linux distributions this default temporary-file
         directory will be ``/tmp/``, on macOS typically ``/var/folders/``.

         If a scratch directory is set, it will be created if it does not exist.
         If it does exist, PanTools will verify the directory is empty and, if
         not, raise an exception.
     * - ``--keep-intermediate-files``
       - Keep intermediate consensus fasta and corresponding log files.
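
The scratch-directory rule described above can be checked before starting a
long run. The Python sketch below only mirrors that rule and is not PanTools
code: ``prepare_scratch_directory`` is a hypothetical helper name and the
exception types are an illustrative choice.

.. code:: python

   from pathlib import Path


   def prepare_scratch_directory(path):
       """Create the directory if missing; refuse to reuse a non-empty one."""
       scratch = Path(path)
       if not scratch.exists():
           # Hypothetical pre-check: create the directory, including parents.
           scratch.mkdir(parents=True)
       elif not scratch.is_dir():
           raise NotADirectoryError(f"{scratch} exists but is not a directory")
       elif any(scratch.iterdir()):
           # An existing scratch directory must be empty.
           raise FileExistsError(f"scratch directory {scratch} is not empty")
       return scratch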

**Example VCFs file list**
  .. code:: text

      1 /path/to/LA1547.vcf.gz
      1 /path/to/LA1557.vcf.gz
      4 /path/to/LA1582.vcf.gz

**Example commands**
  .. code:: bash

      $ pantools add_variants tomato_DB vcf_locations.txt
      $ pantools add_variants -t 4 tomato_DB vcf_locations.txt
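
As the example file list shows, several VCF files can point to the same genome
number. The Python sketch below is a hypothetical pre-check, not part of
PanTools: ``read_locations`` is an illustrative name, and skipping blank lines
is an assumption made for convenience. It groups the paths per genome number
and rejects lines that do not follow the ``<genome number> <path>`` layout.

.. code:: python

   from collections import defaultdict


   def read_locations(lines):
       """Group file paths by genome number from '<genome> <path>' lines."""
       locations = defaultdict(list)
       for line_number, line in enumerate(lines, start=1):
           line = line.strip()
           if not line:
               continue  # assumption: blank lines are ignored in this pre-check
           parts = line.split(maxsplit=1)
           if len(parts) != 2 or not parts[0].isdigit():
               raise ValueError(
                   f"line {line_number}: expected '<genome number> <path>'"
               )
           locations[int(parts[0])].append(parts[1])
       return dict(locations)


   example = [
       "1 /path/to/LA1547.vcf.gz",
       "1 /path/to/LA1557.vcf.gz",
       "4 /path/to/LA1582.vcf.gz",
   ]
   for genome, paths in read_locations(example).items():
       print(genome, paths)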

--------------

Remove variants
^^^^^^^^^^^^^^^

Remove variants from the pangenome database. This function will remove all
VCF information from the database. All variant nodes created by the
``add_variants`` function will be removed. The VCF information will be
removed from the accession nodes. If there is no variant information
left for an accession node, the node will be removed.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Example commands**
  .. code:: bash

      $ pantools remove_variants tomato_DB

--------

Add PAVs
^^^^^^^^

Add PAVs to the pangenome database. PAV information can only be added
for mRNA features. For each accession and mRNA feature, PAV information can
be stored in the database. Only values of 1 and 0 are allowed in the
PAV file. A value of 1 indicates that the gene is present in the sample
and a value of 0 indicates that the gene is absent in the sample.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <pavsFile>
       - A text file with on each line a genome number and the full path
         to a corresponding PAV file, separated by a space.

**Example PAVs file list**
  .. code:: text

      1 /path/to/LA1547.pav.tsv
      4 /path/to/LA1582.pav.tsv

**Example PAV file**
  .. code:: text

      mrnaID  accession102  accession103  accession104
      LA1547_00001  1  1  1
      LA1547_00002  1  1  0
      LA1547_00003  1  1  1
      LA1547_00004  1  0  1
      LA1547_00005  1  1  1
      LA1547_00006  0  0  1
      LA1547_00007  0  0  0

**Example commands**
  .. code:: bash

      $ pantools add_pavs tomato_DB pav_locations.txt
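
Since only 1 and 0 are allowed in a PAV file, a quick check before running
``add_pavs`` can catch formatting mistakes early. The Python sketch below is
not PanTools code: ``check_pav_table`` is a hypothetical helper, and it assumes
a tab-separated table whose first column holds the mRNA identifier and whose
header row names the accessions, as in the example file.

.. code:: python

   def check_pav_table(lines):
       """Validate a tab-separated PAV table and return the accession names."""
       header = lines[0].rstrip("\n").split("\t")
       accessions = header[1:]
       if not accessions:
           raise ValueError("header needs at least one accession column")
       for line_number, line in enumerate(lines[1:], start=2):
           fields = line.rstrip("\n").split("\t")
           if len(fields) != len(header):
               raise ValueError(
                   f"line {line_number}: expected {len(header)} columns"
               )
           # Every value after the mRNA identifier must be exactly 0 or 1.
           bad = [value for value in fields[1:] if value not in ("0", "1")]
           if bad:
               raise ValueError(
                   f"line {line_number}: only 0 and 1 are allowed, found {bad}"
               )
       return accessions


   table = [
       "mrnaID\taccession102\taccession103\taccession104",
       "LA1547_00001\t1\t1\t1",
       "LA1547_00002\t1\t1\t0",
   ]
   print(check_pav_table(table))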

-----------

Remove PAVs
^^^^^^^^^^^

Remove PAVs from the pangenome database. This function will remove all
PAV information from the database. All variant nodes created by the
``add_pavs`` function will be removed. The PAV information will be
removed from the accession nodes. If there is no variant information
left for an accession node, the node will be removed.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Example commands**
  .. code:: bash

      $ pantools remove_pavs tomato_DB

--------------

Variation overview
^^^^^^^^^^^^^^^^^^

Create a readable overview of the variation in the pangenome database. The
overview will be written to a text file. Per genome, this overview will contain
the number of genes with PAV and/or VCF information and their sample names.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Example commands**
  .. code:: bash

      $ pantools variation_overview tomato_DB

**Output**
  The output file will be written to the **variation** directory in the
  database as a text file.

  - **variation_overview.txt**, a summary of available variation in the
    pangenome.


--------------------------

Phased pangenomics
------------------

Add phasing
^^^^^^^^^^^

.. warning::
 This is a novel function and has not yet undergone testing by external users.
 Please report any bugs or issues to the PanTools team so we can improve it.

Add phasing information to the pangenome. A chromosome number combined
with a phasing letter makes a phasing identifier. Currently, a phasing
identifier must be unique; phasing-related PanTools functionalities may
therefore only be useful with chromosome-scale, fully phased assemblies.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <phasingFile>
       - A text file with phasing information of sequences.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--assume-unphased``
       - All chromosomes without a letter will be considered unphased.

**Example commands**
  .. code:: bash

    $ pantools add_phasing tomato_DB phasing_info.txt

**Example input**
  The text file should have two columns, separated by a tab, space or comma. The
  first column can only contain sequence identifiers. The second column can be
  formatted in two different ways.

  | **Input format 1. Chromosome numbers**
  | The second column contains only (chromosome) numbers. This number becomes the
    chromosome number. To obtain the phasing letters, we count the number of
    sequences from the same genome within one cluster. The sequence order
    determines the phasing letter.

  Taking the example below, for the second chromosome: genome 1 has 4 sequences,
  genome 2 has 3 sequences, and genome 3 has 1 sequence. The assigned
  identifiers are:

  -  Genome 1 - 2_A, 2_B, 2_C, 2_D
  -  Genome 2 - 2_A, 2_B, 2_C
  -  Genome 3 - 2_unphased

  .. code:: text

      1_1    1
      1_2    1
      1_3    1
      1_4    1
      2_1    1
      2_2    1
      2_3    1
      2_4    1
      3_1    1
      1_5    2
      1_6    2
      1_7    2
      1_8    2
      2_5    2
      2_6    2
      2_7    2
      3_2    2

  This file format is generated by running TreeCluster.py on a sequence-level
  k-mer distance tree.

  .. code:: bash

      $ TreeCluster.py -i sequence_kmer_distance.tree -m avg_clade -t 0.03 > phasing_info.txt
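
  The letter assignment described for input format 1 can be sketched in Python
  as follows. This is an illustration, not PanTools code:
  ``assign_phasing_identifiers`` is a hypothetical name, the genome number is
  assumed to be the part of the sequence identifier before the underscore, and
  a genome with a single sequence in a cluster is labelled
  ``<chromosome>_unphased`` because that is what the example above shows.

  .. code:: python

     from collections import defaultdict
     from string import ascii_uppercase


     def assign_phasing_identifiers(rows):
         """Map sequence identifiers to phasing identifiers.

         rows: (sequence_id, chromosome_number) pairs in file order, where
         sequence_id looks like '<genome>_<sequence>'.
         """
         # Collect sequences per (chromosome, genome), preserving file order.
         groups = defaultdict(list)
         for sequence_id, chromosome in rows:
             genome = sequence_id.split("_")[0]
             groups[(chromosome, genome)].append(sequence_id)

         identifiers = {}
         for (chromosome, _genome), sequences in groups.items():
             if len(sequences) == 1:
                 # Assumption taken from the example: one sequence is unphased.
                 identifiers[sequences[0]] = f"{chromosome}_unphased"
                 continue
             if len(sequences) > len(ascii_uppercase):
                 raise ValueError("more sequences than available letters")
             # The order of the sequences determines the phasing letter.
             for letter, sequence_id in zip(ascii_uppercase, sequences):
                 identifiers[sequence_id] = f"{chromosome}_{letter}"
         return identifiers


     rows = [("1_5", "2"), ("1_6", "2"), ("1_7", "2"), ("1_8", "2"),
             ("2_5", "2"), ("2_6", "2"), ("2_7", "2"), ("3_2", "2")]
     for sequence_id, identifier in assign_phasing_identifiers(rows).items():
         print(sequence_id, identifier)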


  | **Input format 2. Directly assign identifiers**
  | Example file that will directly assign phasing identifiers to sequences.
    The identifiers are identical to the example above.

  .. code:: text

     1_1,1_A
     1_2,1_B
     1_3,1_C
     1_4,1_D
     2_1,1_A
     2_2,1_B
     2_3,1_C
     2_4,1_D
     3_1,unphased
     1_5,2_A
     1_6,2_B
     1_7,2_C
     1_8,2_D
     2_5,2_A
     2_6,2_B
     2_7,2_C
     3_2,unphased

-------------------

Repetitive elements
-------------------

Add repeats
^^^^^^^^^^^

.. warning::
 This is a novel function and has not yet undergone testing by external users.
 Please report any bugs or issues to the PanTools team so we can improve it.

Add repeat annotations to an existing pangenome. PanTools is only able to read
General Feature Format (GFF) files. Each line is read as a separate feature, so
the hierarchical levels of the GFF format are ignored. The repeat type is taken
from the third column.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <annotationsFile>
       - A text file with on each line a genome number and the full path to the
         corresponding annotation file, separated by a space.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--connect``
       - Connect the annotated genomic features to nucleotide nodes in the DBG.
     * - ``--strict``
       - Stop the annotation if sequences or repeat coordinates do not match to
         the database.

**Example commands**
  .. code:: bash

     $ pantools add_repeats tomato_DB repeats.txt
     $ pantools add_repeats potato_DB repeats.txt --connect --strict


**Example input file**
  In the required input file each line starts with the genome number followed by
  the full path to a GFF file, separated by a space.

  .. code:: text

      1 /always/genome1.gff
      2 /use_the/genome2.gff3
      3 /full_path/genome3.gff

  The GFF format consists of one line per feature, each containing 9 columns of
  data (plus optional track definition lines), that must be tab separated.
  Currently, we identify the repeat type through the 3rd column.

  .. code:: text

      ##seqid source sequence_ontology start end score strand phase attributes
      chr1A   EDTA    repeat_region   350 8207    .   ?   .   ID=repeat_region_1;Name=TE_00012440;Classification=LTR/unknown;Sequence_ontology=SO:0000657;ltr_identity=0.9995;Method=structural;motif=TGCA;tsd=CCTGG
      chr1A   EDTA    target_site_duplication 350 354 .   ?   .   ID=lTSD_1;Parent=repeat_region_1;Name=TE_00012440;Classification=LTR/unknown;Sequence_ontology=SO:0000434;ltr_identity=0.9995;Method=structural;motif=TGCA;tsd=CCTGG
      chr1A   EDTA    long_terminal_repeat    355 2216    .   ?   .   ID=lLTR_1;Parent=repeat_region_1;Name=TE_00012440;Classification=LTR/unknown;Sequence_ontology=SO:0000286;ltr_identity=0.9995;Method=structural;motif=TGCA;tsd=CCTGG
      chr1A   EDTA    LTR_retrotransposon 355 8202    .   ?   .   ID=LTRRT_1;Parent=repeat_region_1;Name=TE_00012440;Classification=LTR/unknown;Sequence_ontology=SO:0000186;ltr_identity=0.9995;Method=structural;motif=TGCA;tsd=CCTGG
      chr1A   EDTA    helitron    2843    3627    4150    +   .   ID=TE_homo_0;Name=TE_00001861;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.819;Method=homology
      chr1A   EDTA    helitron    3812    3914    360 +   .   ID=TE_homo_1;Name=TE_00001914;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.822;Method=homology
      chr1A   EDTA    Mutator_TIR_transposon  5076    5627    4956    +   .   ID=TE_homo_2;Name=TE_00010497;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.985;Method=homology
      chr1A   EDTA    hAT_TIR_transposon  5801    6148    3156    -   .   ID=TE_homo_3;Name=TE_00003074;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.997;Method=homology
      chr1A   EDTA    long_terminal_repeat    6342    8202    .   ?   .   ID=rLTR_1;Parent=repeat_region_1;Name=TE_00012440;Classification=LTR/unknown;Sequence_ontology=SO:0000286;ltr_identity=0.9995;Method=structural;motif=TGCA;tsd=CCTGG
      chr1A   EDTA    target_site_duplication 8203    8207    .   ?   .   ID=rTSD_1;Parent=repeat_region_1;Name=TE_00012440;Classification=LTR/unknown;Sequence_ontology=SO:0000434;ltr_identity=0.9995;Method=structural;motif=TGCA;tsd=CCTGG
      chr1A   EDTA    Gypsy_LTR_retrotransposon   8203    8764    5107    +   .   ID=TE_homo_4;Name=TE_00012288_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.993;Method=homology
      chr1A   EDTA    LTR_retrotransposon 8865    10542   11862   -   .   ID=TE_homo_5;Name=TE_00009031_LTR;Classification=LTR/unknown;Sequence_ontology=SO:0000186;Identity=0.932;Method=homology
      chr1A   EDTA    Copia_LTR_retrotransposon   10643   10979   2849    +   .   ID=TE_homo_6;Name=TE_00005676_LTR;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.967;Method=homology
      chr1A   EDTA    CACTA_TIR_transposon    10978   11061   501 +   .   ID=TE_homo_7;Name=TE_00006381;Classification=DNA/DTC;Sequence_ontology=SO:0002285;Identity=0.866;Method=homology
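
  Because the repeat type comes from the third column and every line is treated
  on its own, a short script can show which repeat types a GFF file would
  contribute. The Python sketch below is not PanTools code;
  ``count_repeat_types`` is a hypothetical helper that expects tab-separated
  GFF lines.

  .. code:: python

     from collections import Counter


     def count_repeat_types(gff_lines):
         """Count features per repeat type (GFF column 3), line by line."""
         counts = Counter()
         for line in gff_lines:
             if not line.strip() or line.startswith("#"):
                 continue  # skip blank lines, comments and directives
             columns = line.rstrip("\n").split("\t")
             if len(columns) != 9:
                 raise ValueError(
                     f"expected 9 tab-separated columns, got {len(columns)}"
                 )
             # Parent/child relations in column 9 are deliberately ignored.
             counts[columns[2]] += 1
         return counts


     lines = [
         "##gff-version 3",
         "chr1A\tEDTA\thelitron\t2843\t3627\t4150\t+\t.\tID=TE_homo_0",
         "chr1A\tEDTA\thelitron\t3812\t3914\t360\t+\t.\tID=TE_homo_1",
         "chr1A\tEDTA\tLTR_retrotransposon\t355\t8202\t.\t?\t.\tID=LTRRT_1",
     ]
     print(count_repeat_types(lines).most_common())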

Repeat overview
^^^^^^^^^^^^^^^

Calculate the frequency and overlap of repeats in the genome (split into
windows) and gene regions.
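
To make the window idea concrete, the Python sketch below splits a sequence
into fixed-length windows and sums the repeat bases that fall inside each one.
It is an illustration under stated assumptions and not the PanTools
implementation: ``bases_overlapped_per_window`` is a hypothetical name,
coordinates are 1-based and inclusive as in GFF, and overlapping repeats are
counted separately rather than merged.

.. code:: python

   def bases_overlapped_per_window(sequence_length, repeats, window_length=50000):
       """Return (window_start, window_end, overlapped_bases) per window.

       repeats: list of (start, end) pairs, 1-based and inclusive.
       """
       totals = []
       for window_start in range(1, sequence_length + 1, window_length):
           # The last window may be shorter than window_length.
           window_end = min(window_start + window_length - 1, sequence_length)
           overlapped = 0
           for start, end in repeats:
               # Length of the intersection between the repeat and the window.
               overlapped += max(
                   0, min(end, window_end) - max(start, window_start) + 1
               )
           totals.append((window_start, window_end, overlapped))
       return totals


   for window in bases_overlapped_per_window(100, [(10, 20), (45, 60)], 50):
       print(window)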

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--include``/``-i``
       - Only include a selection of genomes. This automatically lowers the
         threshold for core genes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes. This automatically lowers the
         threshold for core genes.
     * - ``--selection-file``
       - Text file with rules to use a specific set of genomes and sequences.
         This automatically lowers the threshold for core genes.
     * - ``--window-length``
       - Set the window length (default: 50000).
     * - ``--upstream``
       - Set the gene upstream region (default: 1000).
     * - ``--downstream``
       - Set the gene downstream region (default: 1000).
     * - ``--exclude-repeats``
       - Text file to only include (or exclude) certain repeat types for the
         analysis.

**Example commands**
  .. code:: bash

     $ pantools repeat_overview tomato_DB
     $ pantools repeat_overview tomato_DB --selection-file sequence_selection.txt
     $ pantools repeat_overview tomato_DB --window-length 1000000 --upstream 5000 --downstream 5000

**Example input files**
  The ``--exclude-repeats`` file must be a single-line text file that includes
  or excludes a selection of repeat types. The repeat types must be separated
  by commas.

  .. code:: text

     INCLUDE = LTR_retrotransposon, LINE_element, Copia_LTR_retrotransposon

  .. code:: text

     EXCLUDE = Gypsy_LTR_retrotransposon

**Output**
  Output files are written to the **repeats** directory in the database.

  -  **windows_all_sequences.csv**, holds the calculated repeat frequency and
     bases overlapped per repeat type for all sequences combined.
  -  **statistics_genomes_sequences.csv**, per genome and sequence, holds the
     calculated repeat frequency and bases overlapped per repeat type and all
     repeat types combined.
  -  **repeats_in_genes.csv** provides repeat statistics for individual genes.
  -  **coverage_plot.R** creates a coverage plot for each sequence.
  -  **coverage_plot.R** creates a coverage plot for every sequence pair.
  -  **density_plot.R** creates a density and density abundance plot for each
     sequence.
  -  **density_plot_two_sequences.R** creates a density and % density plot for
     every sequence pair.

  Additional output files named after each sequence identifier are available
  in the **repeats/windows** directory. Per window, these hold the calculated
  repeat frequency and bases overlapped per repeat type and all repeat types
  combined.

-----------------------

Removing data
-------------

Remove nodes
^^^^^^^^^^^^

Remove a selection of nodes and their relationships from the pangenome.
For a pangenome database the following nodes should never be removed:
*nucleotide*, *pangenome*, *genome*, *sequence*. When using a
panproteome, *mRNA* nodes cannot be removed.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  Requires **one** of ``--nodes``\|\ ``--label``. ``--include`` and
  ``--exclude`` only work with ``--label``.

  .. list-table::
     :widths: 30 70

     * - ``--include``/``-i``
       - Only remove nodes of the selected genomes.
     * - ``--exclude``/``-e``
       - Do not remove nodes of the selected genomes.
     * - ``--nodes``/``-n``
       - One or multiple node identifiers, separated by a comma.
     * - ``--label``
       - A node label; all nodes matching the label are removed.

**Example commands**
  .. code:: bash

     $ pantools remove_nodes --nodes=10348734,10348735,10348736 tomato_DB
     $ pantools remove_nodes --label=busco --include=2-6 tomato_DB