Multiple Sequence Alignments
============================

This page is entirely dedicated to performing Multiple Sequence
Alignments (MSA) with PanTools.

MSA
^^^

Performs multiple sequence alignment (MSA) with **MAFFT** on sets of sequences.
These alignments can either be:

-  per homology group
-  multiple homology groups
-  regions
-  with all sequences containing a functional domain

| **Select a method**
| By default, this function will make a MSA for each homology group. (It can
  still be specified with ``--method=per-group``.) Using another option
  from the list above requires use of the ``--method`` argument. For
  aligning multiple homology groups, please use
  ``--method=multiple-groups`` with the homology groups specified in a
  csv file on a single line (can be added with
  ``--homology-file=/path/to/hm.csv`` or ``--homology-groups=2,3,4``).
  For aligning regions, please use ``--method=regions`` with a regions
  file that is added with ``--regions-file=/path/to/rf.txt``. For aligning
  sequences based on a functional domain, please use
  ``--method=functions`` together with a functional domain that is added
  with ``--functions=<domain>``.

| **Alignment behavior**
| Three flags exist to change the alignment behavior: ``--align-nucleotide``,
  ``--align-protein``, ``--align-variants``. By default, only nucleotide
  sequences (CDS) are aligned for pangenomes and only protein sequences are
  aligned for panproteomes. If variants were added using
  :ref:`add_variants <construction/annotate:add variants>`, PanTools
  will align complete mRNA sequences (including all variants) instead of CDS
  sequences; see below for more information. PAV can be taken into account using
  the ``--pavs`` option.
| By default, the MSA is performed in two rounds. In the first round, the
  sequences are aligned as is. Next, the sequences are trimmed
  based on the longest start and end gap of the initial alignment and aligned
  again. After the second round, the variable and parsimony informative sites
  are identified. When the ``--no-trimming`` argument is included, the variable
  and parsimony informative sites are identified from the initial alignment and
  no trimming is performed (thus, only one round of aligning).
| For each round, a ML phylogeny is created with **FastTree**. The option
  ``--no-fasttree`` can be used to skip running FastTree, which is used for
  generating a tree from each alignment.
| We know this behavior is not optimal and will be revised in future versions.

.. _msa-for-phylogenetics:

| **Phylogenetics**
| For phylogenetics, it is common to use a protein alignment for the trimming
  of nucleotide sequences for the nucleotide alignment. However, as this mixes
  two different types of data, it requires two separate runs using a pangenome.
  Please follow the following steps for this:

    1. Run the MSA with ``--align-protein``.
    2. Run the MSA with ``--align-nucleotide`` and ``--trim-using-proteins``.

  **NB**: These two steps are also required for correctly running
  :ref:`core_phylogeny <analysis/phylogeny:core phylogeny>` and
  :ref:`consensus_tree <analysis/phylogeny:consensus tree>`.

| **Identify phenotype shared or specific variation**
| Shared SNPs or amino acid substitutions can be found among the members
  of a phenotype when ``--phenotype`` is included. As homology groups can highly
  differ in size, the threshold for a phenotype shared or specific
  SNP/substitution is based on the number of sequences (from a certain
  phenotype) of an homology group instead of the number of genomes in the
  pangenome. For example, the pangenome holds 500 genomes but the homology group
  consists of only 100 sequences. The threshold can be lowered by including
  ``--phenotype-threshold=<phenotypeThreshold>``, which lowers the original
  threshold by multiplying it to a given percentage.

| **Sequence identity and similarity**
| - The percentage **identity** of two sequences is calculated based on
  the number of exactly matching characters divided by the alignment
  length minus the positions were both sequences have a gap.
| - The **similarity** (protein only) is calculated from the number of
  identical matches, increased by the number of similar amino acids
  (according to the BLOSUM 62 matrix), divided by the alignment length
  minus the shared gap positions. The calculated percentage of
  similarity is dependant on the BLOSUM matrix set by ``--blosum``.
  Choose a larger BLOSUM number BLOSUM less divergent sequences.

| **More information on variants**
| With the ``--variants`` argument, gene variants as added to the
  pangenome with
  :ref:`add_variants <construction/annotate:add variants>` can be
  included in the alignment. For these variants, the consensus sequence for this
  variant is added to the input sequences for the alignment. Also,
  presence/absence variations as added to the pangenome with
  :ref:`add_pavs <construction/annotate:add pavs>` will be used for
  determining whether or not to include a variant in the alignment. **NB**: This
  option uses full mRNA sequences for alignment, not CDS sequences.

**Required software**
  -  `MAFFT <https://mafft.cbrc.jp/alignment/software/>`_
  -  `FastTree <http://microbesonline.org/fasttree/>`_

**Parameters**
  .. list-table::
     :widths: 30 70

     * - <databaseDirectory>
       - Path to the database root directory.

**Options**
  .. list-table::
     :widths: 30 70

     * - ``--threads``/``-t``
       - Number of threads for MAFFT and IQ-tree, default is the number
         of cores or 8, whichever is lower.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
       - Exclude a selection of genomes.
     * - ``--homology-file``/``-H``
       - A text file with homology group node identifiers, separated by a comma.
         Default is all homology groups. (Mutually exclusive with
         ``--homology-groups``.)
     * - ``--homology-groups``/``-G``
       - A comma separated list of homology group node identifiers.
         Default is all homology groups. (Mutually exclusive with
         ``--homology-file``.)
     * - ``--regions-file``/``-R``
       - A text file containing genome locations with on each line: a genome
         number, sequence number, begin and end position, separated by a space.
     * - ``--method``
       - The kind of alignment to make. Can be either ``per-group``,
         ``multiple-groups``, ``regions`` or ``functions``.
     * - ``--align-nucleotide``
       - Align nucleotide sequences (default: depends on database type).
     * - ``--align-protein``
       - Align protein sequences (default: depends on database type).
     * - ``--functions``
       - For specifying one or multiple functional domains (Only used when
         ``--method=functions``).
     * - ``--phenotype``/``-p``
       - Identify phenotype shared/specific/exclusive positions in the
         alignment.
     * - ``--blosum``
       - A BLOSUM matrix number to control MAFFT’s sensitivity and the
         similarity calculation. Allowed values are 45, 62 and 80 (default: 62).
     * - ``--phenotype-threshold``
       - Threshold for phenotype specific SNPs (default: 100%).
     * - ``--[no-]trimming``
       - Align the sequences only once. Trimming is on by default.
     * - ``--trim-using-proteins``
       - Trim the (nucleotide) alignment using the protein alignment of a
         previous run. This option is only available when the database is a
         pangenome.
     * - ``--[no-]fasttree``
       - Run FastTree (default: true).
     * - ``--variants``/``-v``
       - Use variants in the alignment. This option does not work with the
         ``--method=regions`` option. This option will include variant sequences
         in the alignment from ``pantools add_variants``.
     * - ``--pavs``
       - Use PAVs in the alignment. This option does not work with the
         ``--method=regions`` option and can only be used if ``--variants`` is
         set. This option will include PAVs as marked by ``pantools add_pavs``.

**Example regions file**
  Each line must have a genome number, sequence number, begin and end
  positions that are separated by a space. Place a minus symbol behind a
  region to extract the reverse complement sequence.

  .. code:: text

     1 1 1 10000
     195 1 477722 478426
     71 10 17346 18056 -
     138 47 159593 160300 -

**Example commands**
  .. code:: bash

     $ pantools msa tomato_DB
     $ pantools msa --align-protein -H=hmgroups.txt tomato_DB
     $ pantools msa --align-nucleotide --trim-using-proteins -H=hmgroups.txt tomato_DB
     $ pantools msa --no-trimming --align-nucleotide -H=hmgroups.txt tomato_DB
     $ pantools msa --phenotype --phenotype-threshold=99 -H=hmgroups.txt tomato_DB
     $ pantools msa --method=multiple-groups tomato_DB
     $ pantools msa --method=multiple-groups --align-protein -H=hmgroups.txt tomato_DB
     $ pantools msa --method=multiple-groups --phenotype --phenotype-threshold=95 -H=hmgroups.txt tomato_DB
     $ pantools msa --method=regions -R=regions.txt tomato_DB
     $ pantools msa --method=functions --functions=PF10137 tomato_DB
     $ pantools msa --variants tomato_DB
     $ pantools msa --variants --pavs --threads=10 --method=functions --functions=PF10137 tomato_DB

**Output**
  Output files are stored in *database_directory/alignments/msa\_?/grouping_v?/*
  A separate directory is created for each alignment which holds the input
  and output files.

  The ‘input’ directory contains the input files for the alignments.

  -  **prot/nuc/var(_trimmed).fasta**, original and trimmed input sequences.
  -  **prot/nuc/var_trimmed.info**, number of trimmed positions per sequence.
  -  **sequences.info**, relevant gene information of sequences in group:
     gene names, mRNA node id, address, strand orientation.
  -  **genome_order.info**, information about the order of genomes in the
     alignment.
  -  **nuc/var.structure.tsv**, nucleotide structure information of the
     alignment. Tab-separated file with the following columns:
     ``sequence_id``, ``start``, ``stop``, ``strand``, ``feature``, ``mRNA``.
  -  **skipped_genomes.info**, genomes that were skipped.

  The alignments and output files are written to the ‘output’ directory.

  -  **prot/nuc/var(_trimmed).afa**, the initial and second (trimmed)
     alignment in CLUSTAL format.
  -  **prot/nuc/var(_trimmed).fasta**, the initial and second (trimmed)
     alignment in FASTA format.
  -  **prot/nuc/var(_trimmed).newick**, FastTree ML tree inferred from the
     initial and second (trimmed) alignment.
  -  **prot/nuc/var(_trimmed)_alignment.info**, some statistics about the
     initial and second (trimmed) alignment: alignment length, number of
     conserved, variable and parsimony informative sites

  Sequence identity and similarity output files.

  -  **prot/nuc/var(_trimmed)_identity.csv**, table with the sequence
     identity scores.
  -  **prot(_trimmed)_similarity.csv**, table with similarity of the
     protein sequences.

  Variable and parsimony informative sites output files.

  -  **informative_prot/nuc/var(_trimmed)_distance.csv**, table with
     distances between sequences based on parsimony informative sites in
     the alignment.
  -  **variable_prot/nuc/var(_trimmed)_distance.csv**, table with distances
     between sequences based on variable sites in the alignment.
  -  **informative_prot/nuc/var(_trimmed)_sites.csv**, table with the number
     shared parsimony informative sites between sequences.
  -  **variable_prot/nuc/var(_trimmed)_sites.csv**, table with the number of
     shared variable sites between sequences.

  When a ``--phenotype`` is included.

  - File named after phenotype property. A CSV table of all variable positions
    in the alignment, with counts of the letters per phenotype value and whether
    a letter is actually shared/specific/exclusive.
  - **groups/functions/regions_with_phenotype_changes.txt**, a summary
    of the alignments having at least one phenotype shared/specific/exclusive
    position. This file is generated once per analysis, not per alignment.