Multiple Sequence Alignments
This page is entirely dedicated to performing Multiple Sequence Alignments (MSA) with PanTools.
MSA
Performs multiple sequence alignment (MSA) with MAFFT on sets of sequences. These alignments can either be:
per homology group
multiple homology groups
regions
with all sequences containing a functional domain
--method=per-group
.) Using another option
from the list above requires use of the --method
argument. For
aligning multiple homology groups, please use
--method=multiple-groups
with the homology groups specified in a
csv file on a single line (can be added with
--homology-file=/path/to/hm.csv
or --homology-groups=2,3,4
).
For aligning regions, please use --method=regions
with a regions
file that is added with --regions-file=/path/to/rf.txt
. For aligning
sequences based on a functional domain, please use
--method=functions
together with a functional domain that is added
with --functions=<domain>
.--align-nucleotide
,
--align-protein
, --align-variants
. By default, only nucleotide
sequences (CDS) are aligned for pangenomes and only protein sequences are
aligned for panproteomes. If variants were added using
add_variants, PanTools
will align complete mRNA sequences (including all variants) instead of CDS
sequences; see below for more information. PAV can be taken into account using
the --pavs
option.--no-trimming
argument is included, the variable
and parsimony informative sites are identified from the initial alignment and
no trimming is performed (thus, only one round of aligning).--no-fasttree
can be used to skip running FastTree, which is used for
generating a tree from each alignment.
Run the MSA with
--align-protein
.Run the MSA with
--align-nucleotide
and--trim-using-proteins
.NB: These two steps are also required for correctly running core_phylogeny and consensus_tree.
--phenotype
is included. As homology groups can highly
differ in size, the threshold for a phenotype shared or specific
SNP/substitution is based on the number of sequences (from a certain
phenotype) of an homology group instead of the number of genomes in the
pangenome. For example, the pangenome holds 500 genomes but the homology group
consists of only 100 sequences. The threshold can be lowered by including
--phenotype-threshold=<phenotypeThreshold>
, which lowers the original
threshold by multiplying it to a given percentage.--blosum
.
Choose a larger BLOSUM number BLOSUM less divergent sequences.--variants
argument, gene variants as added to the
pangenome with
add_variants can be
included in the alignment. For these variants, the consensus sequence for this
variant is added to the input sequences for the alignment. Also,
presence/absence variations as added to the pangenome with
add_pavs will be used for
determining whether or not to include a variant in the alignment. NB: This
option uses full mRNA sequences for alignment, not CDS sequences.- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--threads
/-t
Number of threads for MAFFT and IQ-tree, default is the number of cores or 8, whichever is lower.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--homology-file
/-H
A text file with homology group node identifiers, separated by a comma. Default is all homology groups. (Mutually exclusive with
--homology-groups
.)--homology-groups
/-G
A comma separated list of homology group node identifiers. Default is all homology groups. (Mutually exclusive with
--homology-file
.)--regions-file
/-R
A text file containing genome locations with on each line: a genome number, sequence number, begin and end position, separated by a space.
--method
The kind of alignment to make. Can be either
per-group
,multiple-groups
,regions
orfunctions
.--align-nucleotide
Align nucleotide sequences (default: depends on database type).
--align-protein
Align protein sequences (default: depends on database type).
--functions
For specifying one or multiple functional domains (Only used when
--method=functions
).--phenotype
/-p
Identify phenotype shared/specific/exclusive positions in the alignment.
--blosum
A BLOSUM matrix number to control MAFFT’s sensitivity and the similarity calculation. Allowed values are 45, 62 and 80 (default: 62).
--phenotype-threshold
Threshold for phenotype specific SNPs (default: 100%).
--[no-]trimming
Align the sequences only once. Trimming is on by default.
--trim-using-proteins
Trim the (nucleotide) alignment using the protein alignment of a previous run. This option is only available when the database is a pangenome.
--[no-]fasttree
Run FastTree (default: true).
--variants
/-v
Use variants in the alignment. This option does not work with the
--method=regions
option. This option will include variant sequences in the alignment frompantools add_variants
.--pavs
Use PAVs in the alignment. This option does not work with the
--method=regions
option and can only be used if--variants
is set. This option will include PAVs as marked bypantools add_pavs
.- Example regions file
Each line must have a genome number, sequence number, begin and end positions that are separated by a space. Place a minus symbol behind a region to extract the reverse complement sequence.
1 1 1 10000 195 1 477722 478426 71 10 17346 18056 - 138 47 159593 160300 -
- Example commands
$ pantools msa tomato_DB $ pantools msa --align-protein -H=hmgroups.txt tomato_DB $ pantools msa --align-nucleotide --trim-using-proteins -H=hmgroups.txt tomato_DB $ pantools msa --no-trimming --align-nucleotide -H=hmgroups.txt tomato_DB $ pantools msa --phenotype --phenotype-threshold=99 -H=hmgroups.txt tomato_DB $ pantools msa --method=multiple-groups tomato_DB $ pantools msa --method=multiple-groups --align-protein -H=hmgroups.txt tomato_DB $ pantools msa --method=multiple-groups --phenotype --phenotype-threshold=95 -H=hmgroups.txt tomato_DB $ pantools msa --method=regions -R=regions.txt tomato_DB $ pantools msa --method=functions --functions=PF10137 tomato_DB $ pantools msa --variants tomato_DB $ pantools msa --variants --pavs --threads=10 --method=functions --functions=PF10137 tomato_DB
- Output
Output files are stored in database_directory/alignments/msa_?/grouping_v?/ A separate directory is created for each alignment which holds the input and output files.
The ‘input’ directory contains the input files for the alignments.
prot/nuc/var(_trimmed).fasta, original and trimmed input sequences.
prot/nuc/var_trimmed.info, number of trimmed positions per sequence.
sequences.info, relevant gene information of sequences in group: gene names, mRNA node id, address, strand orientation.
genome_order.info, information about the order of genomes in the alignment.
nuc/var.structure.tsv, nucleotide structure information of the alignment. Tab-separated file with the following columns:
sequence_id
,start
,stop
,strand
,feature
,mRNA
.skipped_genomes.info, genomes that were skipped.
The alignments and output files are written to the ‘output’ directory.
prot/nuc/var(_trimmed).afa, the initial and second (trimmed) alignment in CLUSTAL format.
prot/nuc/var(_trimmed).fasta, the initial and second (trimmed) alignment in FASTA format.
prot/nuc/var(_trimmed).newick, FastTree ML tree inferred from the initial and second (trimmed) alignment.
prot/nuc/var(_trimmed)_alignment.info, some statistics about the initial and second (trimmed) alignment: alignment length, number of conserved, variable and parsimony informative sites
Sequence identity and similarity output files.
prot/nuc/var(_trimmed)_identity.csv, table with the sequence identity scores.
prot(_trimmed)_similarity.csv, table with similarity of the protein sequences.
Variable and parsimony informative sites output files.
informative_prot/nuc/var(_trimmed)_distance.csv, table with distances between sequences based on parsimony informative sites in the alignment.
variable_prot/nuc/var(_trimmed)_distance.csv, table with distances between sequences based on variable sites in the alignment.
informative_prot/nuc/var(_trimmed)_sites.csv, table with the number shared parsimony informative sites between sequences.
variable_prot/nuc/var(_trimmed)_sites.csv, table with the number of shared variable sites between sequences.
When a
--phenotype
is included.File named after phenotype property. A CSV table of all variable positions in the alignment, with counts of the letters per phenotype value and whether a letter is actually shared/specific/exclusive.
groups/functions/regions_with_phenotype_changes.txt, a summary of the alignments having at least one phenotype shared/specific/exclusive position. This file is generated once per analysis, not per alignment.