Grouping proteins
=================

Group
^^^^^

Generate homology groups based on similarity of protein sequences. The
resulting homology groups connect similar sequences in the pangenome
database. Homology groups contain not only orthologous pairs, but also
pairs of homologs duplicated after the speciation of the two species,
so-called in-paralogs. The sizes of the groups are controlled by the
``--relaxation`` parameter that can be set very strict or more lenient,
depending on the evolutionary distance of the genomes. When you are
unsure which relaxation setting is most suitable for your dataset,
running the :ref:`optimal_grouping <construction/group:optimal grouping>`
functionality is recommended.

Be aware that not every sequence within a homology group has to be
similar to the other sequences. For example, two non-similar protein
sequences each have a high-similarity hit with the same protein sequence
but align to a different region, one at the start and one near the end
of the sequence.

When you want to run **group** another time but with different
parameters, the currently active grouping must first either be moved or
removed. This can be achieved with the
:ref:`remove <construction/group:remove grouping>` or
:ref:`deactivate grouping <construction/group:deactivate grouping>`
functions.

| **Method**
| Here, we explain a simplified version of the original algorithm,
  please take a look at our publication for an extensive explanation.
  First, potential similar sequences are identified by counting shared
  *k*-mer (protein) sequences. Similarity between the selected protein
  sequences is calculated through (local) Smith-Waterman alignments.
  When the (normalized) similarity score of two sequences is above a
  given threshold (controlled by ``--relaxation``), the proteins are
  connected with each other in the similarity graph. Every similarity
  component is then passed to the MCL (Markov clustering) algorithm to
  be possibly broken into several homology groups.

| **Relaxation**
| The ``relaxation`` parameter is a combination of four sub-parameters:
  ``intersection rate``, ``similarity threshold``, ``mcl inflation``
  and ``contrast``. The values for these parameters for each relaxation
  setting can be seen in the table below. We strongly recommend using the
  ``--relaxation`` option to control the grouping, but advanced users still
  have the option to control the individual sub-parameters.

  .. csv-table::
     :file: /tables/relaxation.csv
     :header-rows: 1
     :delim: ;

**Required software**
  `MCL <https://micans.org/mcl/>`_

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--threads``/``-t``
       - Number of parallel working threads, default is the number of
         available cores or 8, whichever is lower.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes.
     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations to be included,
         each on a separate line. The most recent annotation is selected
         for  genomes without an identifier.
     * - ``--longest``
       - Only cluster protein sequences of the longest transcript per gene.
     * - ``--scoring-matrix``
       - The scoring matrix used, default is BLOSUM62.
     * - ``--relaxation``
       - The relaxation in homology calls. Should be in range [1-8], from
         strict to relaxed. This argument automatically sets
         the four remaining arguments stated below.
     * - ``--intersection-rate``
       - The fraction of *k*-mers that needs to be shared by two intersecting
         proteins. Should be in range [0.001,0.1].
     * - ``--similarity-threshold``
       - The minimum normalized similarity score of two proteins. Should be in
         range [1..99].
     * - ``--mcl-inflation``
       - The MCL inflation. Should be in range [1,19].
     * - ``--contrast``
       - The contrast factor. Should be in range [0,10].

**Example commands**
  .. code:: bash

     $ pantools group -t=12 -r=4 tomato_DB
     $ pantools group --intersection-rate=0.05 --similarity-threshold=65  --mcl-inflation=7.2 --contrast=5 tomato_DB

**Output**
  -  **pantools_homology_groups.txt**, overview of the created homology
     groups. Each line represents one homology group, starting with the
     homology group (database) identifier followed by a colon (:) and mRNA
     identifiers (from GFF) that are separated by a space. To ensure all
     identifiers are unique in this file, the mRNA ids are extended by a
     hash symbol (#) and a genome number. The following line is example
     output of an homology group with two genes from genome 1 and 146:

  .. code:: text

     14001754: DLACAPHP_00001_mRNA#1 OPJEMMMF_03822_mRNA#146

**Relevant literature**
  `Efficient inference of homologs in large eukaryotic pan-proteomes
  <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/
  s12859-018-2362-4>`_

--------------

BUSCO protein
^^^^^^^^^^^^^

BUSCO is used in PanTools to estimate the optimal settings for
grouping relaxation, in conjunction with the
:ref:`optimal_grouping <construction/group:optimal grouping>` function.

BUSCO attempts to provide a quantitative assessment of the completeness
in terms of expected gene content of a genome assembly. Proteins are
placed into categories of Complete and **single-copy** (S), Complete and
**duplicated** (D), **fragmented** (F), or **missing** (M). This
function is able to run BUSCO **v3**, **v4** or **v5** against protein
sequences of the pangenome.

The number of reported duplicated genes in eukaryotes is often to high
as different protein isoforms are counted multiple times. To adjust the
imprecise duplication score, include the ``--longest-transcripts``
argument to the command.

| **What BUSCO benchmark set to use**

  -  When using **BUSCO v3**, go to https://busco.ezlab.org, download a odb9
     set, and untar it with ``tar -xvzf``. Include the entire directory in
     the command using the ``--input-file`` argument.
  -  For **BUSCO v4** or **v5**, you only have to provide the odb10 database
     name with the ``--input-file`` argument, the database is downloaded
     automatically. To get a full list of the available datasets, run
     ``busco --list-datasets``.

**Required software**
  BUSCO must be set to your $PATH. For v3, test if the
  ``which run_BUSCO.py`` command displays the full path so it can accessed
  anywhere. For v4 and v5, test if ``busco`` is executable.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**

  .. list-table::
     :widths: 30 70

     * - ``--threads``/``-t``
       - Number of parallel working threads, default is the number of
         available cores or 8, whichever is lower.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes.
     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations to be included,
         each on a separate line. The most recent annotation is selected
         for genomes without an identifier.
     * - ``--busco-version``/``-v``
       - The BUSCO version. Select either 4 or 5 (default).
         (default).
     * - ``--odb10``
       - An odb10 benchmark dataset name.
     * - ``--longest``
       - Only search against the longest protein-coding transcript of
         genes.
     * - ``skip-busco``
       - A list of questionable BUSCOs. The completeness score is
         recalculated by skipping these genes.

**Example commands**
  .. code:: bash

     $ pantools busco_protein --odb10=bacteria_odb10 bacteria_DB
     $ pantools busco_protein --skip-busco=POG093P01OY,POG093P0009,POG093P022K,POG093P027M,POG093P00Z2,POG093P013J bacteria_DB

**Output**
  The BUSCO scores are stored inside **BUSCO** nodes of the pangenome
  graph. Output files are written to the *busco* directory inside the
  database.

  -  **busco_scores.txt**, overview of the BUSCO scores per genome.
     Average and median statistics are calculated per category.
  -  **busco_overview.csv**, a table which combines the completeness
     scores per genome together with the duplicated, fragmented and
     missing BUSCO genes.
  -  **hmm_overview.txt**, a list of BUSCO genes showing the assigned
     categories per genome.

------------------

Optimal grouping
^^^^^^^^^^^^^^^^

Finding the most suitable settings for :ref:`group <construction/group:group>`
can be difficult and is always dependent on evolutionary distance of the
genomes in the pangenome. This functionality runs **group** on all eight
``--relaxation`` settings, from strictest (d1) to the most relaxed (d8).
To find the optimal setting, complete and **non-duplicated BUSCO** genes
that are present in all genomes are used to validate each setting.

| **Method**
| A perfect clustering of the sequences would place each BUSCO in a
  separate homology group with one representative protein per genome.
  When BUSCO is run against the pangenome, the proteins corresponding to
  the BUSCO HMMs have been identified. For each BUSCO, the
  representative proteins are checked whether these are clustered into a
  single or multiple groups. These groups are searched to identify
  sequences other than the current BUSCO. The highest number of
  correctly clustered BUSCOs present in one group are true positives
  (**tp**). Any other gene clustered inside this group is considered a
  false positive (**fp**) The remaining BUSCO genes outside this best
  group are counted as false negative (**fn**). The summation of tps fps
  and fns are defined as **TP**, **FP** and **FN**, respectively. From
  these scores recall, precision and F-score measures are calculated as
  follows:

  .. math::

     Recall &= \frac{TP}{TP + FN}

     Precision &= \frac{TP}{TP + FP}

     F-score &= 2 \frac{Recall * Precision}{Recall + Precision}

  .. figure:: /figures/true_false_positives.png
     :width: 600
     :align: center

     *Proteins of three distinct homology groups are represented as
     triangles, circles and squares. Green shapes are true positives (tp)
     which have been assigned to the true group; red shapes are false
     positives (fp) for the group they have been incorrectly assigned to, and
     false negatives (fn) for their true group*

| **Choosing the optimal setting**
| Choosing the correct setting is usually a trade-off between TPs and
  FNs. The most strict grouping results in a significantly higher number
  of clusters as the more relaxed settings. With stringent settings,
  related proteins could get separated; however, a high number of false
  positives is (usually) prevented (FN > FP). When you would go for a
  more loose setting, the related proteins are likely to part of the
  same group, but other sequences could be included as well (FN < FP).

| **Note on active grouping**
| No grouping is active after running this function. Use the generated
 output files to identify a suitable grouping. Activate this grouping
 using :ref:`change_grouping <construction/group:change grouping>`. An
 overview of the available groupings and used settings is stored in the
 'pangenome' node (inside the database), or can be created by running
 :ref:`grouping_overview <construction/group:grouping overview>`.

**Required software**
  `MCL <https://micans.org/mcl/>`_

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <buscoDirectory>
       - The output directory created by the
         :ref:`busco_protein <construction/group:busco protein>` function.
         This directory is found **inside** the pangenome database, in the
         *busco* directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--threads``/``-t``
       - Number of parallel working threads, default is the number of
         available cores or 8, whichever is lower.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes.
     * - ``--annotations-file``/``-A``
       - A text file with the identifiers of annotations to be included,
         each on a separate line. The most recent annotation is selected
         for  genomes without an identifier.
     * - ``--fast``
       - Assume the optimal grouping is found when the F1-score drops
         compared to the previous clustering round.
     * - ``--longest``
       - Only cluster protein sequences of the longest transcript per gene.
     * - ``--scoring-matrix``
       - The scoring matrix used, default is BLOSUM62.
     * - ``--relaxation``
       - Only consider a selection of relaxation settings (1-8 allowed).

**Example commands**
  .. code:: bash

     $ pantools optimal_grouping bacteria_DB bacteria_DB/busco/bacteria_odb9
     $ pantools optimal_grouping -t=12 --fast bacteria_DB bacteria_DB/busco/bacteria_odb9
     $ pantools optimal_grouping -tn=12 --relaxation=1,2,3 bacteria_DB bacteria_DB/busco/bacteria_odb9

     $ Rscript optimal_grouping.R

**Output**
  After each clustering round, homology groups are incorporated in the
  graph. A text file with homology group and gene identifiers is stored in
  the **group** directory in the pangenome database. This file is named
  after the used sequence similarity threshold (25-95). Each line
  represents one homology group, starting with the homology group
  (database) identifier followed by a colon (:) and mRNA identifiers (from
  GFF) that are separated by a space. The mRNA identifiers are extended by
  a hash (#) and their genome number. The following line is example output
  of an homology group with two genes from genome 1 and 146:

  .. code:: text

     14001754: DLACAPHP_00001_mRNA#1 OPJEMMMF_03822_mRNA#146

  Output files are written to **optimal_grouping** directory inside the
  database.

  -  **grouping_overview.csv**, a summary of the benchmark statistics. Use
     this file to find the most suitable grouping for your pangenome.
  -  **optimal_grouping.R**, Rscript to plot FN and FP values per
     grouping.
  -  **counts_per_busco.info**, a log file of the scoring. Shows in which
     homology groups the BUSCO genes were placed for the different
     groupings.

  .. figure:: /figures/best_grouping.png
     :width: 300
     :align: center

     *Example output of optimal_grouping.R. The number of FN and FP for all
     eight relaxation settings.*

--------------

Change grouping
^^^^^^^^^^^^^^^

Only a single homology grouping can be active in the pangenome. Use this
function to change the active grouping version. Information of the
available groupings and used settings is stored in the 'pangenome' node
(inside the database) and can be created by running
:ref:`grouping_overview <construction/group:grouping overview>`.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--grouping-version``/``-v``
       - Required. The version of homology grouping to become active.

**Example commands**
  .. code:: bash

     $ pantools change_grouping -v=5 tomato_DB

---------------------

Remove grouping
^^^^^^^^^^^^^^^

Delete all 'homology_group' nodes and 'is_similar' relations between
'mRNA' nodes from the database.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--fast``
       - Do not remove the 'is_similar' relationships between mRNA nodes.
         This does not influence the next grouping.
     * - ``--grouping-version``/``-v``
       - Select a specific grouping version to be removed. Should be either a
         grouping number, 'all' for all groupings or 'all_inactive' for
         all inactive groupings.

**Example commands**
  .. code:: bash

     $ pantools remove_grouping --version=1 tomato_DB
     $ pantools remove_grouping --version=all --fast tomato_DB
     $ pantools remove_grouping --version=all_inactive tomato_DB

------------------------

Deactivate grouping
^^^^^^^^^^^^^^^^^^^

Relabel 'homology_group' nodes to 'inactive_homology_group'. The
moved grouping can be activated again with
:ref:`change_grouping <construction/group:change grouping>`.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--fast``
       - Do not remove the 'is_similar' relationships between mRNA nodes.
         This does not influence the next grouping.

**Example commands**
  .. code:: bash

     $ pantools deactivate_grouping tomato_DB
     $ pantools deactivate_grouping --fast tomato_DB

------------------------

Grouping overview
^^^^^^^^^^^^^^^^^

Reports the content of all (active & inactive) homology groups for the
different groupings in the pangenome. Include ``--fast`` into the
command to get a quick overview of the available groupings and the
settings that were used.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the pangenome database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--fast``
       - Only show which grouping is active and which groupings can be
         activated.

**Example commands**
  .. code:: bash

     $ pantools grouping_overview tomato_DB
     $ pantools grouping_overview --fast tomato_DB

**Output**
  Output files are written to */database_directory/group/*

  -  **grouping_overview.txt**, all homology groups in the pangenome. For
     each homology group, the total number of members and the number of
     members per per genome is reported.
  -  **current_pantools_homology_groups.txt**, overview of the active
     homology groups. Each line represents one homology group. The line
     starts with the homology group (database) identifier followed by a
     colon and the rest are mRNA IDs (from gff/genbank) seperated by a
     space.