Multiple Sequence Alignments

This page is entirely dedicated to performing Multiple Sequence Alignments (MSA) with PanTools.

Sequence alignments

Alignment of homology groups

Performs multiple sequence alignments with MAFFT on sets of sequences. These alignments can either be:

  • per homology group

  • multiple homology groups

  • regions

  • with all sequences containing a functional domain

The alignment consists of two rounds: After the initial alignment, protein sequences are trimmed based on the longest start and end gap of the alignment. The number of trimmed amino acids is multiplied by three to trim the correct number of nucleotides. If only nucleotide sequences are aligned, the nucleotide sequence alignment is used for trimming. The trimmed sequences are aligned a second time to identify variable and parsimony informative sites. For each round, a ML phylogeny will be created with FastTree.

Select a method
By default, this function will make a MSA per homology group. (It can still be specified with --method per_group.) Using another option from the list above requires use of the --method argument. For aligning multiple homology groups, please use --method multiple_groups with the homology groups specified in a csv file on a single line (can be added with -hm /path/to/hm.csv. For aligning regions, please use --method regions with a regions file that is added with -rf /path/to/rf.txt. For aligning sequences based on a functional domain, please use --method functions together with a functional domain that is added with --name <domain>.
Other options
In case you are only interested in the alignment of the nucleotide or protein sequences, use --mode nucleotide or --mode protein. When the --no-trimming argument is included, the variable and parsimony informative sites are identified from the initial alignment and no trimming is performed (thus, only one round of aligning). The option --fast can be used to skip running FastTree, which is used for generating a tree from the alignment.
Identify phenotype shared or specific variation
Shared SNPs or amino acid substitutions can be found among the members of a phenotype when --phenotype <phenotype> is included. As homology groups can highly differ in size, the threshold for a phenotype shared or specific SNP/substitution is based on the number of sequences (from a certain phenotype) of an homology group instead of the number of genomes in the pangenome. For example, the pangenome holds 500 genomes but the homology group consists of only 100 sequences. The threshold can be lowered by including --phenotype-threshold <threshold>, which lowers the original threshold by multiplying it to a given percentage.
Sequence identity and similarity
- The percentage identity of two sequences is calculated based on the number of exactly matching characters divided by the alignment length minus the positions were both sequences have a gap.
- The similarity (protein only) is calculated from the number of identical matches, increased by the number of similar amino acids (according to the BLOSUM 62 matrix), divided by the alignment length minus the shared gap positions. The calculated percentage of similarity is dependant on the BLOSUM matrix set by --blosum. Choose a larger BLOSUM number BLOSUM less divergent sequences.

Required software

Required arguments

--database-path/-dp Path to the database

Optional arguments

--method The kind of alignment to make. Can be either per_group, multiple_groups, regions or functions.
--homology-groups/-hm Text file with homology group node identifiers. Default is all groups!
--phenotype/-ph a phenotype name, used to identify phenotype specific SNPs/substitutions.
--phenotype-threshold Threshold for phenotype specific SNPs (default is 100%).
--skip and --reference/-ref Skip over a selection of genomes.
--threads-number/-tn The number of parallel working threads for MAFFT and FastTree (Highly recommended! default is 1).
--mode nucleotide or --mode protein Choose to only align nucleotide or protein sequences (default is both).
--no-trimming Align the sequences only once.
--fast Don’t run FastTree.
--name For specifying one or multiple functional domains. (Only used when --method functions.)
--regions-file/-rf Regions file for aligning regions. (Only used when --method regions.)
--blosum a BLOSUM matrix number to control MAFFT’s sensitivity and the similarity calculation. Allowed values: 45, 62 (default), 80.

Example regions file

Each line must have a genome number, sequence number, begin and end positions that are separated by a space. Place a minus symbol behind a region to extract the reverse complement sequence.

1 1 1 10000
195 1 477722 478426
71 10 17346 18056 -
138 47 159593 160300 -

Example commands

$ pantools msa -dp tomato_DB
$ pantools msa -dp tomato_DB -hm hmgroups.txt --mode protein
$ pantools msa -dp tomato_DB -hm hmgroups.txt --mode nucleotide --no-trimming
$ pantools msa -dp tomato_DB -hm hmgroups.txt --phenotype resistance --phenotype-threshold 99
$ pantools msa --method multiple_groups -dp tomato_DB
$ pantools msa --method multiple_groups -dp tomato_DB -hm hmgroups.txt --mode protein
$ pantools msa --method multiple_groups -dp tomato_DB -hm hmgroups.txt --phenotype resistance --phenotype-threshold 95
$ pantools msa --method regions -dp tomato_DB -rf regions.txt
$ pantools msa --method functions -dp tomato_DB --name PF10137

Output files

Output files are stored in database_directory/alignments/msa_/grouping_v?/ A separate directory is created for each alignment which holds the input and output files.

The ‘input’ directory contains the input files for the alignments.

  • nuc/prot(_trimmed).fasta, original and trimmed input sequences.

  • trimmed.info, number of trimmed positions per sequence.

  • sequences.info, relevant gene information of sequences in group: gene names, mRNA node id, address, strand orientation.

The alignments and output files are written to the ‘output’ directory.

  • nuc/prot(_trimmed).afa, the initial and second (trimmed) alignment in CLUSTAL format.

  • nuc/prot(_trimmed).fasta, the initial and second (trimmed) alignment in FASTA format.

  • nuc/prot(_trimmed).newick, FastTree ML tree inferred from the initial and second (trimmed) alignment.

  • nuc/prot(_trimmed)_alignment.info, some statistics about the initial and second (trimmed) alignment: alignment length, number of conserved, variable and parsimony informative sites

Sequence identity and similarity output files.

  • nuc/prot(_trimmed)_identity.csv, table with the sequence identity scores.

  • prot(_trimmed)_similarity.csv, table with similarity of the protein sequences.

Variable and parsimony informative sites output files.

  • informative_nuc/prot(_trimmed)_distance.csv, table with distances between sequences based on parsimony informative sites in the alignment.

  • variable_nuc/prot(_trimmed)_distance.csv, table with distances between sequences based on variable sites in the alignment.

  • informative_nuc/prot(_trimmed)_sites.csv, table with the number shared parsimony informative sites between sequences.

  • variable_nuc/prot(_trimmed)_sites.csv, table with the number of shared variable sites between sequences.

When a --phenotype is included.

  • phenotype_specific_changes_nuc/prot_groups.csv, the node identifiers of homology groups with phenotype specific substitutions.

  • phenotype_specific_changes_nuc/prot.txt, the positions of phenotype specific substitutions in the alignments.

  • phenotype_disrupted_nuc/prot.txt, shows how many sequences of different phenotypes prevented a SNP/substitution from becoming phenotype specific.