Multiple Sequence Alignments
============================

This page is entirely dedicated to performing Multiple Sequence
Alignments (MSA) with PanTools.

MSA
---

Performs multiple sequence alignments with **MAFFT** on sets of sequences.
These alignments can either be:

-  per homology group
-  multiple homology groups
-  regions
-  with all sequences containing a functional domain

The alignment consists of two rounds: After the initial alignment,
protein sequences are trimmed based on the longest start and end gap of
the alignment. The number of trimmed amino acids is multiplied by three
to trim the correct number of nucleotides. If only nucleotide sequences
are aligned, the nucleotide sequence alignment is used for trimming. The
trimmed sequences are aligned a second time to identify variable and
parsimony informative sites. For each round, a ML phylogeny will be
created with **FastTree**.

| **Select a method**
| By default, this function will make a MSA per homology group. (It can
  still be specified with ``--method=per-group``.) Using another option
  from the list above requires use of the ``--method`` argument. For
  aligning multiple homology groups, please use
  ``--method=multiple-groups`` with the homology groups specified in a
  csv file on a single line (can be added with
  ``--homology-file=/path/to/hm.csv`` or ``--homology-groups=2,3,4,4``).
  For aligning regions, please use ``--method=regions`` with a regions
  file that is added with ``--regions-file=/path/to/rf.txt``. For aligning
  sequences based on a functional domain, please use
  ``--method=functions`` together with a functional domain that is added
  with ``--functions=<domain>``.

| **Other options**
| In case you are only interested in the alignment of the nucleotide or
  protein sequences, use ``--mode=nucleotide`` or ``--mode=protein``.
  When the ``--no-trimming`` argument is included, the variable and
  parsimony informative sites are identified from the initial alignment
  and no trimming is performed (thus, only one round of aligning). The
  option ``--no-fasttree`` can be used to skip running FastTree, which is used
  for generating a tree from the alignment.

| **Identify phenotype shared or specific variation**
| Shared SNPs or amino acid substitutions can be found among the members
  of a phenotype when ``--phenotype=<phenotype>`` is included. As
  homology groups can highly differ in size, the threshold for a
  phenotype shared or specific SNP/substitution is based on the number
  of sequences (from a certain phenotype) of an homology group instead
  of the number of genomes in the pangenome. For example, the pangenome
  holds 500 genomes but the homology group consists of only 100
  sequences. The threshold can be lowered by including
  ``--phenotype-threshold=<phenotypeThreshold>``, which lowers the original
  threshold by multiplying it to a given percentage.

| **Sequence identity and similarity**
| - The percentage **identity** of two sequences is calculated based on
  the number of exactly matching characters divided by the alignment
  length minus the positions were both sequences have a gap.
| - The **similarity** (protein only) is calculated from the number of
  identical matches, increased by the number of similar amino acids
  (according to the BLOSUM 62 matrix), divided by the alignment length
  minus the shared gap positions. The calculated percentage of
  similarity is dependant on the BLOSUM matrix set by ``--blosum``.
  Choose a larger BLOSUM number BLOSUM less divergent sequences.

Required software
~~~~~~~~~~~~~~~~~

-  `MAFFT <https://mafft.cbrc.jp/alignment/software/>`_
-  `FastTree <http://microbesonline.org/fasttree/>`_

Parameters
~~~~~~~~~~

.. list-table::
   :widths: 30 70

   * - <databaseDirectory>
     - Path to the database root directory.

Options
~~~~~~~

.. list-table::
   :widths: 30 70

   * - ``--threads``/``-t``
     - Number of threads for MAFFT and IQ-tree, default is the number
       of cores or 8, whichever is lower.
   * - ``--include``/``-i``
     - Only include a selection of genomes.
   * - ``--exclude``/``-e``
     - Exclude a selection of genomes.
   * - ``--homology-file``/``-H``
     - A text file with homology group node identifiers, separated by a comma.
       Default is all homology groups. (Mutually exclusive with ``--homology-groups``.)
   * - ``--homology-groups``/``-G``
     - A comma separated list of homology group node identifiers.
       Default is all homology groups. (Mutually exclusive with ``--homology-file``.)
   * - ``--regions-file``/``-R``
     - A text file containing genome locations with on each line: a genome
       number, sequence number, begin and end position, separated by a space.
   * - ``--method``
     - The kind of alignment to make. Can be either ``per-group``,
       ``multiple-groups``, ``regions`` or ``functions``.
   * - ``--mode``/``-m``
     - Choose to only align **nucleotide** or **protein** sequences
       (default is both).
   * - ``--functions``
     - For specifying one or multiple functional domains (Only used when
       ``--method=functions``).
   * - ``--phenotype``/``-p``
     - A phenotype name, used to identify phenotype specific SNPs/substitutions.
   * - ``--blosum``
     - A BLOSUM matrix number to control MAFFT’s sensitivity and the similarity
       calculation. Allowed values are 45, 62 and 80 (default: 62).
   * - ``--phenotype-threshold``
     - Threshold for phenotype specific SNPs (default: 100%).
   * - ``--[no-]trimming``
     - Align the sequences only once. Trimming is on by default.
   * - ``--[no-]fasttree``
     - Run FastTree (default: true).
   * - ``--variants``
     - Use variation (default: false).

Example regions file
~~~~~~~~~~~~~~~~~~~~

Each line must have a genome number, sequence number, begin and end
positions that are separated by a space. Place a minus symbol behind a
region to extract the reverse complement sequence.

.. code:: text

   1 1 1 10000
   195 1 477722 478426 
   71 10 17346 18056 -
   138 47 159593 160300 -

Example commands
~~~~~~~~~~~~~~~~

.. code:: bash

   $ pantools msa tomato_DB
   $ pantools msa --mode=protein -H=hmgroups.txt tomato_DB
   $ pantools msa --no-trimming --mode=nucleotide -H=hmgroups.txt tomato_DB
   $ pantools msa --phenotype=resistance --phenotype-threshold=99 -H=hmgroups.txt tomato_DB
   $ pantools msa --method=multiple-groups tomato_DB
   $ pantools msa --method=multiple-groups --mode=protein -H=hmgroups.txt tomato_DB
   $ pantools msa --method=multiple-groups --phenotype=resistance --phenotype-threshold=95 -H=hmgroups.txt tomato_DB
   $ pantools msa --method=regions -R=regions.txt tomato_DB
   $ pantools msa --method=functions --functions=PF10137 tomato_DB

Output files
~~~~~~~~~~~~

Output files are stored in *database_directory/alignments/msa\_/grouping_v?/*
A separate directory is created for each alignment which holds the input
and output files.

The ‘input’ directory contains the input files for the alignments.

-  **nuc/prot(_trimmed).fasta**, original and trimmed input sequences.
-  **trimmed.info**, number of trimmed positions per sequence.
-  **sequences.info**, relevant gene information of sequences in group:
   gene names, mRNA node id, address, strand orientation.

The alignments and output files are written to the ‘output’ directory.

-  **nuc/prot(_trimmed).afa**, the initial and second (trimmed)
   alignment in CLUSTAL format.
-  **nuc/prot(_trimmed).fasta**, the initial and second (trimmed)
   alignment in FASTA format.
-  **nuc/prot(_trimmed).newick**, FastTree ML tree inferred from the
   initial and second (trimmed) alignment.
-  **nuc/prot(_trimmed)_alignment.info**, some statistics about the
   initial and second (trimmed) alignment: alignment length, number of
   conserved, variable and parsimony informative sites

Sequence identity and similarity output files.

-  **nuc/prot(_trimmed)_identity.csv**, table with the sequence
   identity scores.
-  **prot(_trimmed)_similarity.csv**, table with similarity of the
   protein sequences.

Variable and parsimony informative sites output files.

-  **informative_nuc/prot(_trimmed)_distance.csv**, table with
   distances between sequences based on parsimony informative sites in
   the alignment.
-  **variable_nuc/prot(_trimmed)_distance.csv**, table with distances
   between sequences based on variable sites in the alignment.
-  **informative_nuc/prot(_trimmed)_sites.csv**, table with the number
   shared parsimony informative sites between sequences.
-  **variable_nuc/prot(_trimmed)_sites.csv**, table with the number of
   shared variable sites between sequences.

When a ``--phenotype`` is included.

-  **phenotype_specific_changes_nuc/prot_groups.csv**, the node
   identifiers of homology groups with phenotype specific substitutions.
-  **phenotype_specific_changes_nuc/prot.txt**, the positions of
   phenotype specific substitutions in the alignments.
-  **phenotype_disrupted_nuc/prot.txt**, shows how many sequences of
   different phenotypes prevented a SNP/substitution from becoming
   phenotype specific.