Synteny
=======

Calculate synteny
^^^^^^^^^^^^^^^^^

.. warning::
 This is a novel function and has not yet undergone testing by external users.
 Please report any bugs or issues to the PanTools team so we can improve it.

Estimate synteny between sequences of the pangenome using MCScanX. PanTools
generates the **GFF** and **.homology** input files that MCScanX requires for
every pairwise comparison. The GFF holds the gene identifiers and genomic
coordinates. The .homology file is a two-column, tab-separated file that lists
pairs of homologous genes. Two genes are considered homologous when they are
part of the same homology group and have a protein similarity greater than
the threshold set during :ref:`group <construction/group:group>`.
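
As an illustration of the .homology layout, the following Python sketch expands
homology groups into gene pairs and writes them as two tab-separated columns.
The group contents, the gene names and the ``write_homology`` helper are
hypothetical. The sketch ignores the similarity threshold and is not PanTools'
own implementation.

.. code:: python

   import io
   from itertools import combinations


   def homology_pairs(groups):
       """Yield every unordered gene pair that shares a homology group."""
       for genes in groups.values():
           for gene_a, gene_b in combinations(sorted(genes), 2):
               yield gene_a, gene_b


   def write_homology(groups, handle):
       """Write one tab-separated gene pair per line."""
       for gene_a, gene_b in homology_pairs(groups):
           handle.write(f"{gene_a}\t{gene_b}\n")


   # Hypothetical homology groups: group identifier -> member genes.
   groups = {
       "group_1": ["geneA", "geneB", "geneC"],
       "group_2": ["geneD", "geneE"],
   }

   buffer = io.StringIO()
   write_homology(groups, buffer)
   homology_text = buffer.getvalue()
   print(homology_text, end="")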

| **Executing MCScanX**
| PanTools executes MCScanX_h with default settings when the ``--run``
  argument is included. Pairwise comparisons are divided over the number of
  ``--threads`` provided by the user. Because the number of comparisons grows
  quickly with the number of sequences, consider analysing a subset by using
  ``--selection-file``. The output of each comparison is written to a separate
  folder. Once all threads are finished, every output (**.collinearity**) file
  is collected and combined into a single file.
| To visualize synteny blocks, we recommend using
  `Synvisio <https://synvisio.github.io/>`__ or
  `Accusyn <https://accusyn.usask.ca/>`__.
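
To get a feeling for the workload before using ``--run``, the number of
comparisons can be estimated up front. The sketch below assumes that every
unordered pair of sequences is compared exactly once. Whether PanTools also
compares a sequence with itself is not stated here, so treat the result as a
rough estimate. The function names are hypothetical.

.. code:: python

   import math


   def pairwise_comparisons(n_sequences):
       """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
       return math.comb(n_sequences, 2)


   def comparisons_per_thread(n_sequences, n_threads):
       """Upper bound on the comparisons handled by a single thread."""
       return math.ceil(pairwise_comparisons(n_sequences) / n_threads)


   # Print: sequences, total comparisons, comparisons per thread (24 threads).
   for n in (12, 100, 676):
       print(n, pairwise_comparisons(n), comparisons_per_thread(n, 24))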

| **Sequence identifiers change**
| Both MCScanX and the online visualization tools have issues with the
  regular identifiers (1_1, 1_2, 2_1 etc.). To be able to work with all
  tools, PanTools changes the identifiers into a two-letter combination
  followed by a number. The current limit is 676 (26 × 26) genomes with
  ``--genome`` or 676 sequences with ``--sequence``. PanTools will try to give
  each genome a unique first letter, but this is only possible with 26 or
  fewer genomes and at most 26 sequences per genome.
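
The limit of 676 is the number of possible two-letter prefixes. The sketch
below enumerates them in alphabetical order. The order in which PanTools
actually assigns prefixes is not documented here, so the enumeration is an
assumption for illustration only; the real mapping is written to
**synteny_identifiers.csv**.

.. code:: python

   from itertools import product
   from string import ascii_lowercase


   def two_letter_prefixes():
       """Return all two-letter lowercase combinations: aa, ab, ..., zz."""
       return ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]


   prefixes = two_letter_prefixes()
   print(len(prefixes))  # 676

   # Only 26 distinct first letters exist, which is why a unique first
   # letter per genome is limited to 26 genomes.
   print(len({prefix[0] for prefix in prefixes}))  # 26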

**Required software**
  `MCScanX <https://github.com/wyp1125/MCScanX>`__ must be manually
  :ref:`installed <getting_started/install:download mcscanx>` and added to
  your ``$PATH``, since it is unavailable on conda.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--include``/``-i``
       - Only include a selection of genomes. This automatically lowers the
         threshold for core genes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes. This automatically lowers the threshold
         for core genes.
     * - ``--selection-file``
       - Text file with rules to use a specific set of genomes and sequences.
         This automatically lowers the threshold for core genes.
     * - ``--run``
       - Run MCScanX_h (default: false).
     * - ``--threads``/``-t``
       - Number of parallel threads to be used.
     * - ``--sequence``
       - Calculate synteny between sequences (from the same genome) instead of
         genomes.

**Example commands**
  .. code:: bash

     $ pantools calculate_synteny tomato_DB
     $ pantools calculate_synteny tomato_DB --sequence
     $ pantools calculate_synteny tomato_DB --sequence --run -t=24

**Output**
  Output files are written to the **synteny** directory in the database.

  -  **mcscanx.gff**, the genomic coordinates of all genes included in the
     analysis.
  -  **mcscanx.homology**, all pairwise homology relationships in the
     analysis.
  -  **synteny_identifiers.csv**, table with the original and synteny
     identifiers.

  When ``--run`` is included:

  -  **mcscanx.collinearity**, main output file of MCScanX_h. Contains
     synteny blocks that consist of collinear gene pairs. Usable
     by Synvisio and Accusyn in combination with mcscanx.gff. Can be added
     to the pangenome with
     :ref:`add_synteny <construction/synteny:add synteny>`.
  -  A .gff, .homology and .collinearity file for every sequence combination.
     Files are written to a folder named after the combination of the two
     sequence identifiers. This folder also holds two .html files that
     visualize collinear blocks and duplication depth between the two
     sequences.

**Relevant literature**
  -  `MCScanX: a toolkit for detection and evolutionary analysis of gene synteny
     and collinearity <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326336/>`__
  -  `Interactive Exploration of Genomic Conservation (Synvisio)
     <https://graphicsinterface.org/wp-content/uploads/gi2020-9.pdf>`__
  -  `Using Simulated Annealing to Declutter Genome Visualizations (Accusyn)
     <https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS20/paper/view/18433/17546>`__

-----------------------

Add synteny
^^^^^^^^^^^

.. warning::
 This is a novel function and has not yet undergone testing by external users.
 Please report any bugs or issues to the PanTools team so we can improve it.

Add synteny information to the pangenome.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.
     * - <collinearityFile>
       - An MCScanX .collinearity file.

**Example commands**
  .. code:: bash

     $ pantools add_synteny tomato_DB tomato_DB/synteny/mcscanx.collinearity

**Example input**
  Required input is a .collinearity file generated by MCScanX. The example below
  shows the first three syntenic blocks calculated within an apple genome
  (sequence 2_3 compared to sequence 2_28).

  .. code:: text

     ############### Parameters ###############
     # MATCH_SCORE: 50
     # MATCH_SIZE: 5
     # GAP_PENALTY: -1
     # OVERLAP_WINDOW: 5
     # E_VALUE: 1e-05
     # MAX GAPS: 25
     ############### Statistics ###############
     # Number of collinear genes: 217123, Percentage: 80.16
     # Number of all genes: 270847
     ##########################################
     ## Alignment 0: score=251.0 e_value=1.1e-10 N=6 bk1&cj1 plus
       0-  0:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006370_mRNA1          0
       0-  1:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006500_mRNA1          0
       0-  2:    2_3#Mdg_11A016230_mRNA1 2_28#Mdg_06B006510_mRNA1          0
       0-  3:    2_3#Mdg_11A016330_mRNA1 2_28#Mdg_06B006570_mRNA1          0
       0-  4:    2_3#Mdg_11A016390_mRNA1 2_28#Mdg_06B006590_mRNA1          0
       0-  5:    2_3#Mdg_11A016400_mRNA1 2_28#Mdg_06B006660_mRNA1          0
     ## Alignment 1: score=258.0 e_value=4.1e-11 N=6 bk1&cj1 minus
       1-  0:    2_3#Mdg_11A015930_mRNA1 2_28#Mdg_06B006850_mRNA1          0
       1-  1:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006770_mRNA1          0
       1-  2:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006760_mRNA1          0
       1-  3:    2_3#Mdg_11A016230_mRNA1 2_28#Mdg_06B006570_mRNA1          0
       1-  4:    2_3#Mdg_11A016330_mRNA1 2_28#Mdg_06B006510_mRNA1          0
       1-  5:    2_3#Mdg_11A016390_mRNA1 2_28#Mdg_06B006500_mRNA1          0
     ## Alignment 2: score=252.0 e_value=3.1e-12 N=6 bk1&cj1 minus
       2-  0:    2_3#Mdg_11A015930_mRNA1 2_28#Mdg_06B006880_mRNA1          0
       2-  1:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006870_mRNA1          0
       2-  2:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006860_mRNA1          0
       2-  3:    2_3#Mdg_11A016230_mRNA1 2_28#Mdg_06B006770_mRNA1          0
       2-  4:    2_3#Mdg_11A016390_mRNA1 2_28#Mdg_06B006660_mRNA1          0
       2-  5:    2_3#Mdg_11A016400_mRNA1 2_28#Mdg_06B006590_mRNA1          0
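
  For downstream scripting, the block headers and gene pairs in such a file
  can be read with a few lines of Python. The sketch below is not part of
  PanTools. The ``parse_collinearity`` function and its field names are
  hypothetical, and it assumes the layout shown above: a ``## Alignment``
  header followed by gene pair lines whose columns are separated by whitespace
  after the colon. The embedded example is shortened, so the number of pairs
  does not match ``N``.

  .. code:: python

     import re

     HEADER = re.compile(
         r"## Alignment (\d+): score=(\S+) e_value=(\S+) N=(\d+) (\S+) (plus|minus)"
     )


     def parse_collinearity(lines):
         """Return a list of blocks, each with header fields and gene pairs."""
         blocks = []
         for line in lines:
             match = HEADER.match(line.strip())
             if match:
                 blocks.append({
                     "id": int(match.group(1)),
                     "score": float(match.group(2)),
                     "e_value": float(match.group(3)),
                     "sequences": match.group(5),
                     "orientation": match.group(6),
                     "pairs": [],
                 })
             elif blocks and not line.lstrip().startswith("#") and ":" in line:
                 # Gene pair line: "<block>-<index>: <gene 1> <gene 2> <value>"
                 fields = line.split(":", 1)[1].split()
                 blocks[-1]["pairs"].append((fields[0], fields[1]))
         return blocks


     example = """\
     # Number of all genes: 270847
     ## Alignment 0: score=251.0 e_value=1.1e-10 N=6 bk1&cj1 plus
       0-  0:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006370_mRNA1          0
       0-  1:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006500_mRNA1          0
     ## Alignment 1: score=258.0 e_value=4.1e-11 N=6 bk1&cj1 minus
       1-  0:    2_3#Mdg_11A015930_mRNA1 2_28#Mdg_06B006850_mRNA1          0
     """

     blocks = parse_collinearity(example.splitlines())
     for block in blocks:
         print(block["id"], block["orientation"], len(block["pairs"]))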

---------------------------

Synteny overview
^^^^^^^^^^^^^^^^

.. warning::
 This is a novel function and has not yet undergone testing by external users.
 Please report any bugs or issues to the PanTools team so we can improve it.

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Example commands**
  .. code:: bash

     $ pantools synteny_overview tomato_DB

**Output**
  - **synteny_blocks_statistics.csv**, statistics about homology
    relationships and synteny blocks between sequences.
  - **blocks_overview.csv**, overview of all synteny blocks.

In the **synteny** directory:
  - **sequence_overlap_per_sequence.csv**, statistics of overlap in genes
    between multiple synteny blocks with other sequences.
  - **sequence_overlap_per_block.csv**, statistics of overlap in genes
    between multiple synteny blocks.

In the **synteny/statistics** directory:
  - **gene_frequency.csv**, overview of which genes belong to which synteny
    blocks.
  - **genome_overlap.csv**, overview of the synteny blocks in each genome.