PanTools version 4.0.0

PanTools is a toolkit for comparative analysis of large number of genomes. It is developed in the Bioinformatics Group of Wageningen University, the Netherlands. Please cite the relevant publication(s) from the list of publications if you use PanTools in your research.

Licence

PanTools has been licensed under GNU GENERAL PUBLIC LICENSE version 3.

Publications

Functionalities

PanTools currently provides these functionalities:

  • Construction of a panproteome

  • Adding new genomes to the pangenome

  • Adding structural/functional annotations to the genomes

  • Detecting homology groups based on similarity of proteins

  • Optimization of homology grouping using BUSCO

  • Read mapping

  • Gene classification

  • Phylogenetic methods

Requirements

  • Java Virtual Machine version 1.8 or higher, Add path to the java executable to your OS path environment variable.

  • KMC: A disk-based k-mer counter, After downloading the appropriate version (linux, macos or windows), add path to the kmc and kmc_tools executables to your OS path environment variable.

  • MCL: The Markov Clustering Algorithm, After downloading and compiling the software, add path to the mcl executable to your OS path environment variable.

For installing and configuring all required software, please see our Installing and configuring the required software page.

Running the program

Add the path to the java archive of PanTools, located in the pantools/target subdirectory, to the OS path environment variable. Then run PanTools from the command line by:

$ java <JVM options> -jar pantools-4.0.0.jar <subcommand> <arguments>
Useful JVM options
- -server : To optimize JIT compilations for higher performance
- -Xmn(a number followed by m/g) : Minimum heap size in mega/giga bytes
- -Xmx(a number followed by m/g) : Maximum heap size in mega/giga bytes

Options

All options except for --version also apply to all subcommands.

--version/-V

Display version info.

--help/-h

Display a help message for pantools or any subcommand.

--manual/-M

Open the manual for pantools or any subcommand in a local browser.

--force/-f

Force. For instance, force overwrite a database with build_pangenome.

--no-input

Ignore prompts or any other interactive user input.

--debug/-d

Show debug messages in the console.

--quiet/-q

Only show warnings or higher level logging messages in the console.

Contents

Installing and configuring the required software

  1. Download PanTools

  2. Install Neo4j

  3. Install dependencies, either manually or through conda.

For PanTools developers:

  1. Installing pre-commit hooks

Download PanTools

The preferred option is to download the .jar file from https://git.wur.nl/bioinformatics/pantools/-/releases and put it in a directory named “pantools/target”.

Alternatively, follow the installation and compilation instructions from the README.md file in the desired version (e.g. https://git.wur.nl/bioinformatics/pantools/-/tree/v4.0.0 ).

Test if PanTools is executable:

$ java -jar /YOUR_FULL_PATH/pantools/target/pantools-4.0.0.jar

If the help page does not appear this (likely) means you don’t have a properly working Java version 8. Java is included in the PanTools conda environment, please consider to first install the environment. To manually download Java, follow the instructions at https://www.java.com/en/download.

Set PanTools alias

To avoid typing long command line arguments every time, we suggest setting an alias to your profile. Set an alias in your ~/.bashrc using the following command. Always include the full path to PanTools’ .jar file.

If Java is set to your $PATH.

$ echo "alias pantools='java -Xms20g -Xmx50g -jar /YOUR_FULL_PATH/pantools/target/pantools-4.0.0.jar'" >> ~/.bashrc

If Java is not set to your $PATH, include the full path in the alias. Replace ‘YOUR_PATH’ 2x with the correct directory structure.

$ echo "alias pantools='/YOUR_PATH/jdk1.8.0_161/bin/java -Xms20g -Xmx50g -jar /YOUR_PATH/pantools/target/pantools-4.0.0.jar'" >> ~/.bashrc

Source your ~/.bashrc and test if the alias works.

$ source ~/.bashrc
pantools --version

PanTools uses picocli command line parsing. Tab autocompletion for the command line can be enabled for an alias with the following tutorial.


Install Neo4j

Although Neo4j is not needed for any of the PanTools functionalities, it is required to be able to start up a database and use cypher queries. In the PanTools versions up to 3.2 we use Neo4j 3.5.3 libraries, whereas newer releases use Neo4j 3.5.30. Neo4j version 3.5.30 is compatible with all earlier PanTools versions.

Download the Neo4j 3.5.30 community edition from the Neo4j website or download the binaries directly from our server.

$ wget http://www.bioinformatics.nl/pangenomics/tutorial/neo4j-community-3.5.30-unix.tar.gz
$ tar -xvzf neo4j-community-*

# Edit your ~/.bashrc to include Neo4j to your $PATH
$ echo "export PATH=/YOUR_PATH/neo4j-community-3.5.30/bin:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ neo4j status # test if Neo4j is executable

Official Neo4j 3.5 manual: https://neo4j.com/docs/operations-manual/3.5/


Dependencies

Some of PanTools functionalities require additional software to be installed. Installing every dependency will take a considerate amount of time, therefore we highly recommend to use Mamba. Mamba efficiently manages Conda environments allowing the installation of all required tools into a separate environment. Instructions for creating the Mamba environment or installing the tools manually are found in the sections below.

Install dependencies using Conda

Instructions on how to install and use conda can be found in the conda manual page. Once conda is installed, we suggest to install Mamba into the Conda base environment to enable much faster dependency solving.

To install all dependencies into a separate environment, run the following commands. Please choose the conda_linux.yml or conda_macos.yml file depending on your operating system. These files be found in the release. The difference between the two files is that the linux file contains BUSCO v5.2.2, which is not compatible with the other dependencies on macOS.

$ conda install mamba -n base -c conda-forge #install mamba into the base environment

$ cd /YOUR_PATH/pantools #replace YOUR_PATH with the correct path on your computer
$ mamba env create -n pantools -f conda_linux.yml #for Linux
$ mamba env create -n pantools -f conda_macos.yml #for macOS

$ conda activate pantools # activate the environment before using PanTools
$ conda deactivate # deactivate when you are done

Run the following commands when you do not want to install every dependency, but only specific ones for the analysis that you’re interested in.

$ conda create -n pantools python=3.6 kmc=3.0 mcl # Creates an environment that is able to construct the pangenome and cluster protein sequences
$ conda install -n pantools mafft iqtree fasttree blast mash fastani busco=5.2.2 r-ggplot2 r-ape graphviz aster=1.3 # include tools you want to install via conda

Manual installation of dependencies

All tools must be set to your $PATH so PanTools is able to use them on any location. The instructions below are based on a linux machine.

Install KMC

PanTools requires KMC v2.3 or 3.0 for k-mer counting during the constructing of the pangenome graph. KMC v3.0 is fastest, but v2.3 should also be compatible with PanTools. The KMC3 binaries can be downloaded from https://github.com/refresh-bio/KMC/releases.

$ tar -xvzf KMC* #uncompress the KMC binaries

# Edit your ~/.bashrc to include KMC to your PATH
$ echo "export PATH=/YOUR_PATH/KMC/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ kmc # test if KMC is executable
$ kmc_tools # test if kmc_tools is executable

Install MCL

The MCL (Markov clustering) algorithm is required for the homology grouping of PanTools. The software can be found on https://micans.org/mcl under License & software.

$ wget https://micans.org/mcl/src/mcl-14-137.tar.gz
$ tar -xvzf mcl-*
$ cd mcl-14-137
$ ./configure --prefix=/YOUR_PATH/mcl-14-137/shared #replace YOUR_PATH with the correct path on your computer
$ make install

# Edit your ~/.bashrc to include MCL to your PATH
$ echo "export PATH=/YOUR_PATH/mcl-14-137/bin/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ mcl -h # test if MCL is executable

Install BUSCO

BUSCO v3 to v5 can be run against the pangenome to estimate annotation completeness. The versions require a different Python release and need to be installed in a different way. We suggest to install BUSCO v5, follow the instructions at https://gitlab.com/ezlab/busco/.


Install FastTree

FastTree is used to infer approximately-maximum-likelihood phylogenetic trees from the alignments of nucleotide or protein sequences which are extracted from the pangenome. An executable can be found on the FastTree website: http://www.microbesonline.org/fasttree/.

$ wget http://www.microbesonline.org/fasttree/FastTree
$ chmod +x FastTree
$ ./FastTree # test if FastTree is executable

# Edit your ~/.bashrc to include FastTree to your PATH
$ echo "export PATH=/YOUR_PATH:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc

Install R

R and some additional R packages are required to execute R scripts (files with .R extension) that create plots and construct Neighbor-Joining phylogenies. In most cases, R is already installed on a server. If this is not the case, install it through the instructions on the website https://cran.r-project.org/, or compile it by using following steps.

mkdir R
mkdir R/R_LIBS
cd R
wget https://cran.r-project.org/src/base/R-4/R-4.0.2.tar.gz #version number might have changed already
tar -xvf R-4.0.2.tar.gz
cd R-4.0.2/
./configure --prefix=/YOUR_PATH/R/  #replace YOUR_PATH with the correct path on your computer
make

# Edit your ~/.bashrc to include R to your PATH
$ echo "export PATH=/YOUR_PATH/R/bin/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ R --help # test if R is executable

When R_LIB is set to your $PATH, R scripts know the location of the libraries and are able to install additional R packages to the selected directory.

$ echo "R_LIBS=/YOUR_PATH/R/R_LIBS/" >> ~/.bashrc
$ echo "export R_LIBS" >> ~/.bashrc
$ echo $R_LIBS # validate if the path to the R libraries can be found

Install MAFFT

MAFFT is required for all the alignment functionalities, such as the alignment of homology groups and inferring the core SNP phylogeny. The full manual is available at https://mafft.cbrc.jp/alignment/software/.

$ git clone https://github.com/GSLBiotech/mafft.git
$ cd mafft/core

# Edit the first line of Makefile to change the desired install location, from 'PREFIX = /usr/local' to 'PREFIX = /YOUR_DESIRED_PATH/mafft/'
# Make sure the 'ENABLE_MULTITHREAD = -Denablemultithread' line is uncommented, to enable multithreading

# Edit your ~/.bashrc to include MAFFT to your $PATH
$ echo "export PATH=/YOUR_PATH/mafft/bin/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ mafft --help # test if MAFFT is executable

Install IQ-tree

Using IQ-tree we infer phylogenetic trees by maximum likelihood. Information about the tool can found on their webpage https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload

wget https://github.com/Cibiv/IQ-TREE/releases/download/v1.6.12/iqtree-1.6.12-Linux.tar.gz
tar -xvf iqtree-1.6.12-Linux

# Edit your ~/.bashrc to include IQ-tree to your $PATH
$ echo "export PATH=/YOUR_PATH/iqtree-1.6.12-Linux/bin/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ iqtree -h # test if IQ-tree is executable

Install fastANI or MASH

To be able to construct a Neighbor-Joining phylogeny using ANI-scores, either fastANI or MASH is required. The manual for fastANI is available at https://github.com/ParBLiSS/FastANI/. The manual for MASH can be found at https://mash.readthedocs.io/en/latest/.

$ wget https://github.com/marbl/Mash/releases/download/v2.2/mash-Linux64-v2.2.tar
$ tar -xvf mash-Linux64-v2.2.tar
$ mv mash-Linux64-v2.2/mash .

$ wget https://github.com/ParBLiSS/FastANI/releases/download/v1.32/fastANI-Linux64-v1.32.zip #
$ unzip fastANI-Linux64-v1.32.zip

# Edit your ~/.bashrc to include MASH and FastANI to your $PATH
$ echo "export PATH=/YOUR_PATH/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ mash -h # test if MASH is executable
$ fastANI -h # test if FastANI is executable

Install BLAST

BLAST is only required by one function, where the sequences are blasted against a database to obtain their COG category. Information about BLAST can be found at https://www.ncbi.nlm.nih.gov/books/NBK279690/?report=classic.

$ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.10.1+-x64-linux.tar.gz
$ tar -xvf ncbi-blast-2.10.1+-x64-linux.tar.gz

# Edit your ~/.bashrc to include BLAST to your $PATH
$ echo "export PATH=/YOUR_PATH/ncbi-blast-2.10.1+/bin/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ blastp -help # test if BLAST is executable

Install ASTER

ASTER is required for creating a phylogenetic tree based on both orthologs and paralogs with astral-pro. The manual for ASTER can be found at https://github.com/chaoszhang/ASTER.

$ git clone https://github.com/chaoszhang/ASTER.git
$ cd ASTER
$ git checkout v1.3
$ make

# Edit your ~/.bashrc to include ASTER to your $PATH
$ echo "export PATH=/YOUR_PATH/ASTER/bin/:\$PATH" >> ~/.bashrc #replace YOUR_PATH with the correct path on your computer
$ source ~/.bashrc
$ astral-pro -h # test if ASTER is executable

Install InterProScan

Not required by any function, but the .GFF3 output of InterProScan can be read to include functional annotations to the database. The installation itself can be quite tricky as it uses many different third-party binaries and each having their own dependencies. Please check https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload and take a look at the install requirements as well. Installation of the Panther models is not required.

Phobius via InterProScan

Phobius predictions can be performed during the InterProScan analysis but it is not part of the standard set of predictions. To allow these predictions, https://phobius.sbc.su.se/, place the entire directory in the InterProScan/bin/ directory and edit the interproscan.properties configuration file. More information about including Phobius into the InterProScan analysis is found at https://interproscan-docs.readthedocs.io/en/latest/ActivatingLicensedAnalyses.html.

Install eggNOGmapper

Not required by any function, but the .annotations output of eggNOG-mapper can be read to include functional annotations to the database. Information about this tool can be found on http://eggnog-mapper.embl.de/

git clone https://github.com/eggnogdb/eggnog-mapper.git

Installing pre-commit hooks

First install the pre-commit Python package by following the installation instructions.

Then, inside the root directory of the repository, run:

pre-commit install

This step you will need to run only once after cloning the repository. The hooks will be installed in your local repository’s configuration under .git/hooks/pre-commit.

After installation of the hooks they will be triggered at each commit if any Java files have changed. Should any of the pre-commit hooks fail, git will not allow you to create the commit. The output of the pre-commit hooks should tell you what failed, allowing you to fix any problems and to re-add the affected files for another commit attempt.

Pre-commit hooks can be run manually as well with:

pre-commit run

Construct a pangenome

Build pangenome

Build a pangenome out of a set of genomes.

Required software

KMC 2.3 or 3.0

Parameters

<databaseDirectory>

Path to the database root directory.

<genomesFile>

A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.

Options

--kmer-size

Size of k-mers. Should be in range [6..255]. By not giving this argument, the most optimal k-mer size is calculated automatically.

Example genomes file
/always/genome1.fasta
/use_the/genome2.fasta
/full_path/genome3.fasta
Example commands
$ pantools build_pangenome tomato_DB tomato_3.txt
$ pantools build_pangenome --kmer-size=15 tomato_DB tomato_3.txt
Relevant literature

Add annotations

Construct or expand the annotation layer of an existing pangenome. The layer consists of genomic features like genes, mRNAs, proteins, tRNAs etc. PanTools is only able to read General Feature Format (GFF) files.

Multiple annotations can be assigned to a single genome; however, only one annotation a time can be included in an analysis. The most recently included annotation of a genome is included as default, unless a different annotation is specified via --annotations-file, see the explanation below.

NB: GFF files are notoriously difficult to parse. PanTools uses htsjdk to parse GFF files, which is a Java library. Since we need to put this annotation in the graph database, it can be that the features are not correctly added. This is especially true for non-standard GFF files and annotated organellar genomes. If you encounter problems with a gff file, please check whether it is valid to the GFF3 specification. Also, our code should be able to handle all valid GFF3 files, but if the GFF3 file contains a trans-spliced gene that has alternative splicing, it will not be able to handle it (it will only annotate one mRNA).

Parameters

<databaseDirectory>

Path to the database root directory.

<annotationsFile>

A text file with on each line a genome number and the full path to the corresponding annotation file, separated by a space.

Options

--connect

Connect the annotated genomic features to nucleotide nodes in the DBG.

Example commands
$ pantools add_annotations tomato_DB annotations.txt
$ pantools add_annotations --connect tomato_DB annotations.txt
Output

The annotated features are incorporated in the graph. Output files are written to the database directory.

  • annotation_overview.txt, a summary of the GFF files incorporated in the pangenome

  • annotation.log, a list of misannotated feature identifiers.

Example input file

Each line of the file starts with the genome number followed by the full path to the annotation file. The genome numbers match the line number of the file that you used to construct the pangenome.

1 /always/genome1.gff
2 /use_the/genome2.gff
3 /full_path/genome3.gff
GFF3 file format
The GFF format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines, that must be tab separated. Please use the proper hierarchy for the feature: gene -> mRNA -> CDS. Where gene is the parent of mRNA and mRNA is the parent of the CDS feature. The following example from Saccharomyces cerevisiae YJM320 (GCA_000975885) displays a correctly formatted gene entry:
CP004621.1      Genbank gene    44836   45753   .       -       .       ID=gene99;Name=RPL23A;end_range=45753,.;gbkey=Gene;gene=RPL23A;gene_biotype=protein_coding;locus_tag=H754_YJM320B00023;partial=true;start_range=.,44836
CP004621.1      Genbank mRNA    44836   45753   .       -       .       ID=rna99;Parent=gene99;gbkey=mRNA;gene=RPL23A;product=Rpl23ap
CP004621.1      Genbank exon    45712   45753   .       -       .       ID=id112;Parent=rna99;gbkey=mRNA;gene=RPL23A;product=Rpl23ap
CP004621.1      Genbank exon    44836   45207   .       -       .       ID=id113;Parent=rna99;gbkey=mRNA;gene=RPL23A;product=Rpl23ap
CP004621.1      Genbank CDS     45712   45753   .       -       0       ID=cds92;Parent=rna99;Dbxref=SGD:S000000183,NCBI_GP:AJQ01854.1;Name=AJQ01854.1;Note=corresponds to s288c YBL087C;gbkey=CDS;gene=RPL23A;product=Rpl23ap;protein_id=AJQ01854.1
CP004621.1      Genbank CDS     44836   45207   .       -       0       ID=cds92;Parent=rna99;Dbxref=SGD:S000000183,NCBI_GP:AJQ01854.1;Name=AJQ01854.1;Note=corresponds to s288c YBL087C;gbkey=CDS;gene=RPL23A;product=Rpl23ap;protein_id=AJQ01854.1
Select specific annotations for analysis

Only one annotation per genome is considered by any PanTools functionality. When multiple annotations are included, the last added annotation of a genome is automatically selected unless an --annotations-file is included specifying which annotations to use. This annotation file contains only annotation identifiers, each on a separate line. The most recent annotation is used for genomes where no annotation number is specified in the file. Below is an example where the third annotation of genome 1 is selected and the second annotation of genome 2 and 3.

1_3
2_2
3_2

Grouping proteins

Group

Generate homology groups based on similarity of protein sequences. The resulting homology groups connect similar sequences in the pangenome database. Homology groups contain not only orthologous pairs, but also pairs of homologs duplicated after the speciation of the two species, so-called in-paralogs. The sizes of the groups are controlled by the --relaxation parameter that can be set very strict or more lenient, depending on the evolutionary distance of the genomes. When you are unsure which relaxation setting is most suitable for your dataset, running the optimal_grouping functionality is recommended.

Be aware that not every sequence within a homology group has to be similar to the other sequences. For example, two non-similar protein sequences each have a high-similarity hit with the same protein sequence but align to a different region, one at the start and one near the end of the sequence.

When you want to run group another time but with different parameters, the currently active grouping must first either be moved or removed. This can be achieved with the move or remove grouping functions.

Method

Here, we explain a simplified version of the original algorithm, please take a look at our publication for an extensive explanation. First, potential similar sequences are identified by counting shared k-mer (protein) sequences. Similarity between the selected protein sequences is calculated through (local) Smith-Waterman alignments. When the (normalized) similarity score of two sequences is above a given threshold (controlled by --relaxation), the proteins are connected with each other in the similarity graph. Every similarity component is then passed to the MCL (Markov clustering) algorithm to be possibly broken into several homology groups.

Relaxation

The relaxation parameter is a combination of four sub-parameters: intersection rate, similarity threshold, mcl inflation and contrast. The values for these parameters for each relaxation setting can be seen in the table below. We strongly recommend using the --relaxation option to control the grouping, but advanced users still have the option to control the individual sub-parameters.

Relaxation

1

2

3

4

5

6

7

8

Intersection rate

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0.01

Similarity threshold

95

85

75

65

55

45

35

25

Mcl inflation

10.8

9.6

8.4

7.2

6.0

4.8

3.6

2.4

Contrast

8

7

6

5

4

3

2

1

Required software

MCL

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--threads/-t

Number of parallel working threads, default is the number of available cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--annotations-file/-A

A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.

--longest

Only cluster protein sequences of the longest transcript per gene.

--relaxation

The relaxation in homology calls. Should be in range [1-8], from strict to relaxed. This argument automatically sets the four remaining arguments stated below.

--intersection-rate

The fraction of k-mers that needs to be shared by two intersecting proteins. Should be in range [0.001,0.1].

--similarity-threshold

The minimum normalized similarity score of two proteins. Should be in range [1..99].

--mcl-inflation

The MCL inflation. Should be in range [1,19].

--contrast

The contrast factor. Should be in range [0,10].

Example commands
$ pantools group -t=12 -r=4 tomato_DB
$ pantools group --intersection-rate=0.05 --similarity-threshold=65  --mcl-inflation=7.2 --contrast=5 tomato_DB
Output
  • pantools_homology_groups.txt, overview of the created homology groups. Each line represents one homology group, starting with the homology group (database) identifier followed by a colon (:) and mRNA identifiers (from GFF) that are separated by a space. To ensure all identifiers are unique in this file, the mRNA ids are extended by a hash symbol (#) and a genome number. The following line is example output of an homology group with two genes from genome 1 and 146:

14001754: DLACAPHP_00001_mRNA#1 OPJEMMMF_03822_mRNA#146
Relevant literature

Optimal grouping

Finding the most suitable settings for group can be difficult and is always dependent on evolutionary distance of the genomes in the pangenome. This functionality runs group on all eight --relaxation settings, from strictest (d1) to the most relaxed (d8). To find the optimal setting, complete and non-duplicated BUSCO genes that are present in all genomes are used to validate each setting.

Method
A perfect clustering of the sequences would place each BUSCO in a separate homology group with one representative protein per genome. When BUSCO is run against the pangenome, the proteins corresponding to the BUSCO HMMs have been identified. For each BUSCO, the representative proteins are checked whether these are clustered into a single or multiple groups. These groups are searched to identify sequences other than the current BUSCO. The highest number of correctly clustered BUSCOs present in one group are true positives (tp). Any other gene clustered inside this group is considered a false positive (fp) The remaining BUSCO genes outside this best group are counted as false negative (fn). The summation of tps fps and fns are defined as TP, FP and FN, respectively. From these scores recall, precision and F-score measures are calculated as follows:
\[ \begin{align}\begin{aligned}Recall &= \frac{TP}{TP + FN}\\Precision &= \frac{TP}{TP + FP}\\F-score &= 2 \frac{Recall * Precision}{Recall + Precision}\end{aligned}\end{align} \]
_images/true_false_positives.png

Proteins of three distinct homology groups are represented as triangles, circles and squares. Green shapes are true positives (tp) which have been assigned to the true group; red shapes are false positives (fp) for the group they have been incorrectly assigned to, and false negatives (fn) for their true group

Choosing the optimal setting
Choosing the correct setting is usually a trade-off between TPs and FNs. The most strict grouping results in a significantly higher number of clusters as the more relaxed settings. With stringent settings, related proteins could get separated; however, a high number of false positives is (usually) prevented (FN > FP). When you would go for a more loose setting, the related proteins are likely to part of the same group, but other sequences could be included as well (FN < FP).

No grouping is active after running this function. Use the generated output files to identify a suitable grouping. Activate this grouping using change_grouping. An overview of the available groupings and used settings is stored in the ‘pangenome’ node (inside the database), or can be created by running grouping_overview.

Required software

MCL

Parameters

<databaseDirectory>

Path to the database root directory.

<buscoDirectory>

The output directory created by the busco_protein function. This directory is found inside the pangenome database, in the busco directory.

Options

--threads/-t

Number of parallel working threads, default is the number of available cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--annotations-file/-A

A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.

--fast

Assume the optimal grouping is found when the F1-score drops compared to the previous clustering round.

--longest

Only cluster protein sequences of the longest transcript per gene.

--relaxation

Only consider a selection of relaxation settings (1-8 allowed).

Example commands
$ pantools optimal_grouping bacteria_DB bacteria_DB/busco/bacteria_odb9
$ pantools optimal_grouping -t=12 --fast bacteria_DB bacteria_DB/busco/bacteria_odb9
$ pantools optimal_grouping -tn=12 --relaxation=1,2,3 bacteria_DB bacteria_DB/busco/bacteria_odb9

$ Rscript optimal_grouping.R
Output

After each clustering round, homology groups are incorporated in the graph. A text file with homology group and gene identifiers is stored in the group directory in the pangenome database. This file is named after the used sequence similarity threshold (25-95). Each line represents one homology group, starting with the homology group (database) identifier followed by a colon (:) and mRNA identifiers (from GFF) that are separated by a space. The mRNA identifiers are extended by a hash (#) and their genome number. The following line is example output of an homology group with two genes from genome 1 and 146:

14001754: DLACAPHP_00001_mRNA#1 OPJEMMMF_03822_mRNA#146

Output files are written to optimal_grouping directory inside the database.

  • grouping_overview.csv, a summary of the benchmark statistics. Use this file to find the most suitable grouping for your pangenome.

  • optimal_grouping.R, Rscript to plot FN and FP values per grouping.

  • counts_per_busco.info, a log file of the scoring. Shows in which homology groups the BUSCO genes were placed for the different groupings.

_images/best_grouping.png

Example output of optimal_grouping.R. The number of FN and FP for all eight relaxation settings.


Change grouping

Only a single homology grouping can be active in the pangenome. Use this function to change the active grouping version. Information of the available groupings and used settings is stored in the ‘pangenome’ node (inside the database) and can be created by running grouping_overview.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--grouping-version/-v

Required. The version of homology grouping to become active.

Example commands
$ pantools change_grouping -v=5 tomato_DB

Build panproteome

Build a panproteome out of a set of proteins. By only including protein sequences, the usable functionalities are limited to a protein-based analysis, please see differences pangenome and panproteome. No additional proteins can be added to the panproteome, it needs to be rebuilt completely.

Parameters

<databaseDirectory>

Path to the database root directory.

<proteomesFile>

A text file containing paths to FASTA files of proteins to be added to the panproteome; each on a separate line.

Example proteomes file
/always/proteins1.fasta
/use_the/proteins2.fasta
/full_path/proteins3.faa
Example commands
$ pantools build_panproteome proteome_DB proteins.txt

Add genomes

Add additional genomes to an existing pangenome.

Required software

KMC 2.3 or 3.0

Parameters

<databaseDirectory>

Path to the database root directory.

<genomesFile>

A text file containing paths to FASTA files of genomes to be added to the pangenome; each on a separate line.

Example genomes file
/use_the/genome4.fasta
/full_path/genome5.fasta
Example commands
$ pantools add_genomes pangenome_DB extra_genomes.txt

Add phenotypes

Including phenotype data to the pangenome which allows the identification of phenotype specific genes, SNPs, functions, etc.. Altering the data is done by rerunning the command with an updated CSV file.

Data types
Each phenotype node contains a genome number and can hold the following data types: String, Integer, Float or Boolean.
  • Values recognized as round number are converted to an Integer and to a Double when having one or multiple decimals.

  • Boolean types are identified by checking if the value matches ‘true’ or ‘false’, ignoring capitalization of letters.

  • String values remain completely unaltered except for spaces and quotes characters. Spaces are changed into an underscore (’_’) character and quotes are completely removed.

Bin numerical values
When using numerical values, two genomes are only considered to share a phenotype if the value is identical. PanTools creates an alternative version for these phenotypes by binning the values. Taking ‘Pathogenicity’ from the example below we see the integers between 3 and 15. Using these two extreme values three bins are created for a new phenotype ‘Pathogenicity_binned’: 3-6.33, 6.34-11.66 and 11.67-15. The number of bins is controlled through --bins. For skewed data, consider making the bins manually and include this as string phenotype.
Parameters

<databaseDirectory>

Path to the database root directory.

<phenotypesFile>

A CSV file containing the phenotype information.

Options

--append

Do not remove existing phenotype nodes but only add new properties to them. If a property already exists, values from the new file will overwrite the old.

--bins

Number of bins used to group numerical values of a phenotype (default: 3).

Example phenotypes file

The input file needs to be in .CSV format, a plain text file where each value is separated by a comma. The first row should start with ‘Genome,’ followed by the phenotype names and/or identifiers. The first column must start with genome numbers corresponding to the one in your pangenome. Phenotypes and metadata must be placed on the same line as their genome number. A field can remain empty when the phenotype for a genome is missing or unknown. Here below is an example of five genomes contains six phenotypes:

Genome,Gram,Region,Pathogenicity,Boolean,float,species
1,+,NL,3,True,0.1,Species
2,+,BE,,False,0.1,Species3
3,+,LUX,7,true,0.1,Species3
4,+,NL,9,false,0.1,Species3
5,+,BE,15,TRUE,0.1,Species1
Example command
$ pantools add_phenotypes tomato_DB pheno.csv
$ pantools add_phenotypes --append tomato_DB pheno.csv
Output

Phenotype information is stored in ‘phenotype’ nodes in the graph. An output file is written to the database directory.

  • phenotype_overview.txt, a summary of the available phenotypes in the pangenome


BUSCO protein

BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly. Proteins are placed into categories of Complete and single-copy (S), Complete and duplicated (D), fragmented (F), or missing (M). This function is able to run BUSCO v3, v4 or v5 against protein sequences of the pangenome.

The number of reported duplicated genes in eukaryotes is often to high as different protein isoforms are counted multiple times. To adjust the imprecise duplication score, include the --longest-transcripts argument to the command.

You don’t have a benchmark set?

  • When using BUSCO v3, go to https://busco.ezlab.org, download a odb9 set, and untar it with tar -xvzf. Include the entire directory in the command using the --input-file argument.

  • For BUSCO v4 and v5, you only have to provide the odb10 database name with the --input-file argument, the database is downloaded automatically. To get a full list of the available datasets, run busco --list-datasets.

Required software

BUSCO must be set to your $PATH. For v3, test if the which run_BUSCO.py command displays the full path so it can accessed anywhere. For v4 and v5, test if busco is executable.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

Requires one of --busco9|--busco10.

--threads/-t

Number of parallel working threads, default is the number of available cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--annotations-file/-A

A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.

--busco-version/-v

The BUSCO version. Select either ‘busco3’, ‘busco4’ or ‘busco5’ (default).

--busco9

An odb9 benchmark dataset file.

--busco10

An odb10 benchmark dataset name.

--longest

Only search against the longest protein-coding transcript of genes.

skip-busco

A list of questionable BUSCOs. The completeness score is recalculated by skipping these genes.

Example commands
$ pantools busco_protein --busco10=bacteria_odb10 bacteria_DB
$ pantools busco_protein -v=busco3 --busco9=busco_sets/bacteria_odb9/ bacteria_DB
$ pantools busco_protein --busco9=busco_sets/bacteria_odb9/ --skip-busco=POG093P01OY,POG093P0009,POG093P022K,POG093P027M,POG093P00Z2,POG093P013J bacteria_DB
Output

The BUSCO scores are stored inside BUSCO nodes of the pangenome graph. Output files are written to the busco directory inside the database.

  • busco_scores.txt, overview of the BUSCO scores per genome. Average and median statistics are calculated per category.

  • busco_overview.csv, a table which combines the completeness scores per genome together with the duplicated, fragmented and missing BUSCO genes.

  • hmm_overview.txt, a list of BUSCO genes showing the assigned categories per genome.


Add functional annotations

PanTools is able to incorporate functional annotations into the pangenome by reading output from various functional annotation tools.

Add functions

This function can integrate different functional annotations from a variety of annotation files. Currently available functional annotations: Gene Ontology, Pfam, InterPro, TIGRFAM, Phobius, SignalP and COG. The first time this function is executed, the Pfam, TIRGRAM, GO, and InterPro databases are integrated into the pangenome. Phobius, SignalP and COG annotations do not have separate nodes and are directly annotated on ‘mRNA’ nodes in the pangenome.

Gene names (or identifiers) from the input file are used to identify gene nodes in the pangenome. Only genes with an exactly matching name/identifier can be connected to functional annotation nodes! Use the same FASTA and GFF3 files that were used to construct the pangenome database.

Functional databases

Database versions in PanTools repository

Version

Download date (dd-mm-yyyy)

Gene ontology

2021-12-15

20-12-2021

Pfam

35.0

20-12-2021

TIGRFAM

15.0

01-10-2020

InterPro

87+

Not included in repository

We regularly check and update the four functional database. To update the functional database manually, download the following files and replace the old ones in the /pantools/addons/ directory. The TIGRFAM.info files are bundled in the TIGRFAMs_15.0_INFO.tar.gz file; download the file to addons/tigrfam and uncompress the tarball first. The first time running this function .INFO files are combined into a new file COMBINATION_INFO_FILES and removed afterwards.

File

Database type

Required directory

Download link

go.basic.obo

GO

addons

http://purl.obolibrary.org/obo/go/go-basic.obo

gene_ontology.txt

Pfam

addons

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases//Pfam35.0/database_files/gene_ontology.txt.gz

Pfam-A.clans.tsv

Pfam

addons

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases//Pfam35.0/Pfam-A.clans.tsv.gz

interpro.xml

InterPro

addons

https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/interpro.xml.gz

TIGRFAMS_GO_LINK

TIGRFAM

addons/tigrfam

https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMS_GO_LINK

TIGRFAMS_ROLE_LINK

TIGRFAM

addons/tigrfam

https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMS_ROLE_LINK

TIGR_ROLE_NAMES

TIGRFAM

addons/tigrfam

https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGR_ROLE_NAMES

TIGR00001.INFO to TIGR04571.INFO

TIGRFAM

addons/tigrfam

https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMs_15.0_INFO.tar.gz

Parameters

<databaseDirectory>

Path to the database root directory.

<functionsFile>

A text file with on each line a genome number and the full path to the corresponding annotation file, separated by a space.

Options

`--annotations-file/-A

A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.

Example commands
$ pantools add_functions tomato_DB f_annotations.txt
$ pantools add_functions -A annotations.txt tomato_DB f_annotations.txt
Output

Functional annotations are incorporated in the graph. A log file is written to the log directory.

  • add_functional_annotations.log, a log file with the the number of added functions per type and the identifiers of functions that could not be included.

Example function files

The <functionsFile> requires to be formatted like an annotation input file. Each line of the file starts with the genome number followed by the full path to an annotation file.

File type

Recognized by pattern in file name

InterProScan

interpro & .gff

eggNOG-mapper

eggnog

Phobius

phobius

SignalP

signalp

Custom file

custom

1 /mnt/scratch/interpro_results_genome_1.gff
1 /mnt/scratch/custom_annotation_1.txt
1 /mnt/scratch/phobius_1.txt
2 /mnt/scratch/signalp.txt
2 /mnt/scratch/eggnog_genome_2.annotations
2 /mnt/scratch/transmembrane_annotations.txt phobius
3 /mnt/scratch/ipro_results_genome_3.annot custom
Annotation file types
PanTools can recognize functional annotations in different output formats.

Phobius and SignalP are not standard analyses of the InterProScan pipeline and require some additional steps during the InterProScan installation. Please take a look at our InterProScan install instruction to verify if the tools are part of the prediction pipeline. Phobius 1.01

Function type

Allowed annotation file

GO

InterProscan .gff & custom annotation file

Pfam

InterProscan .gff & custom annotation file

InterPro

InterProscan .gff & custom annotation file

TIGRFAM

InterProscan .gff & custom annotation file

Phobius

InterProscan .gff & Phobius 1.01 output

SignalP

InterProscan .gff, signalP 4.1 output, signalP 5.0 output

COG

eggNOG-mapper

InterProScan gff file:

##gff-version 3
##interproscan-version 5.52-86.0
AT4G21230.1   ProSiteProfiles protein_match 333 620 39.000664   +   .   date=06-10-2021;Target=mRNA.AT4G21230.1 333 620;Ontology_term="GO:0004672","GO:0005524","GO:0006468";ID=match$42_333_620;signature_desc=Protein kinase domain profile.;Name=PS50011;status=T;Dbxref="InterPro:IPR000719"
AT3G08980.5   TIGRFAM protein_match         25  101 3.7E-14     +   .   date=06-10-2021;Target=mRNA.AT3G08980.5 25 101;Ontology_term="GO:0006508","GO:0008236","GO:0016020";ID=match$66_25_101;signature_desc=sigpep_I_bact: signal peptidase I;Name=TIGR02227;status=T;Dbxref="InterPro:IPR000223"
AT2G17780.2   Phobius protein_match         338 354 .           +   .   date=06-10-2021;Target=AT2G17780.2 338 354;ID=match$141_338_354;signature_desc=Region of a membrane-bound protein predicted to be embedded in the membrane.;Name=TRANSMEMBRANE;status=T
AT2G17780.2   Phobius protein_match         1   337 .           +   .   date=06-10-2021;Target=AT2G17780.2 1 337;ID=match$142_1_337;signature_desc=Region of a membrane-bound protein predicted to be outside the membrane, in the extracellular region.;Name=NON_CYTOPLASMIC_DOMAIN;status=T
AT3G11780.2   SignalP_EUK protein_match     1   24  .           +   .   date=06-10-2021;Target=mRNA.AT3G11780.2 1 24;ID=match$230_1_24;Name=SignalP-noTM;status=T
AT1G04300.2   CDD protein_match             40  114 1.54717E-13 +   .   date=06-10-2021;Target=mRNA.AT1G04300.2 40 114;Ontology_term="GO:0005515";ID=match$212_40_114;signature_desc=MATH;Name=cd00121;status=T;Dbxref="InterPro:IPR002083"

eggNOG-mapper (tab separated) file:

#query_name     seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score best_tax_level Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction taxonomic scope eggNOG OGs best eggNOG OG COG Functional cat. eggNOG free text desc.
ATKYO-2G54530.1 3702.AT2G35130.2     1.9e-179             636.0               Brassicales     GO:0003674,GO:0003676,GO:0003723,GO:0003824,GO:0004518,GO:0004519,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0006139,GO:0006725,GO:0006807,GO:0008150,GO:0008152,GO:0009451,GO:0009987,GO:0016070,GO:0016787,GO:0016788,GO:0034641,GO:0043170,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0043412,GO:0044237,GO:0044238,GO:0044424,GO:0044464,GO:0046483,GO:0071704,GO:0090304,GO:0090305,GO:0097159,GO:1901360,GO:1901363                                           Viridiplantae   37R67@33090,3GAUT@35493,3HNDD@3699,KOG4197@1,KOG4197@2759   NA|NA|NA    E   Pentacotripeptide-repeat region of PRORP
ATKYO-UG22500.1 3712.Bo02269s010.1   7.5e-35              153.7               Brassicales                                                 Viridiplantae   29I9W@1,2RRH4@2759,383W6@33090,3GWQZ@35493,3I1A9@3699   NA|NA|NA
ATKYO-1G60060.1 3702.AT1G48090.1     0.0                  6241.0              Brassicales             ko:K19525                   ko00000             Viridiplantae   37IJB@33090,3GAN0@35493,3HQ90@3699,COG5043@1,KOG1809@2759   NA|NA|NA    U   Vacuolar protein sorting-associated protein
ATKYO-3G74720.1 3702.AT3G52120.1     7.2e-245             852.8               Brassicales             ko:K13096                   ko00000,ko03041             Viridiplantae   37QYY@33090,3G9VU@35493,3HRDK@3699,KOG0965@1,KOG0965@2759   NA|NA|NA    L   SWAP (Suppressor-of-White-APricot) surp domain-containing protein D111 G-patch domain-containing protein
ATKYO-4G41660.1 3702.AT4G16340.1     0.0                  3392.1              Brassicales     GO:0003674,GO:0005085,GO:0005088,GO:0005089,GO:0005488,GO:0005515,GO:0005575,GO:0005622,GO:0005623,GO:0005634,GO:0005737,GO:0005783,GO:0005829,GO:0005886,GO:0006810,GO:0008064,GO:0008150,GO:0008360,GO:0009605,GO:0009606,GO:0009628,GO:0009629,GO:0009630,GO:0009958,GO:0009966,GO:0009987,GO:0010646,GO:0010928,GO:0012505,GO:0016020,GO:0016043,GO:0016192,GO:0017016,GO:0017048,GO:0019898,GO:0019899,GO:0022603,GO:0022604,GO:0023051,GO:0030832,GO:0031267,GO:0032535,GO:0032956,GO:0032970,GO:0033043,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0044422,GO:0044424,GO:0044425,GO:0044432,GO:0044444,GO:0044446,GO:0044464,GO:0048583,GO:0050789,GO:0050793,GO:0050794,GO:0050896,GO:0051020,GO:0051128,GO:0051179,GO:0051234,GO:0051493,GO:0065007,GO:0065008,GO:0065009,GO:0070971,GO:0071840,GO:0071944,GO:0090066,GO:0098772,GO:0110053,GO:1902903     ko:K21852                   ko00000,ko04131             Viridiplantae   37QIM@33090,3G8RK@35493,3HSFN@3699,KOG1997@1,KOG1997@2759   NA|NA|NA    T   Belongs to the DOCK family

A custom input file must consist of two tab or comma separated columns. The first column should contain a gene/mRNA id, the second an identifier from one of four functional annotation databases: GO, Pfam, InterPro or TIGRFAM.

AT5G23090.4,GO:0046982
AT5G23090.4,IPR009072
AT1G27540.2,PF03478
AT2G18450.1,TIGR01816

Phobius 1.01 ‘short’ (tab separated) functions file:

SEQENCE ID                     TM SP PREDICTION
mRNA-YPR204W                    0  0 o
mRNA-ndhB-2_1                   6  Y n5-16c21/22o37-57i64-83o89-113i134-156o168-189i223-246o

Phobius 1.01 ‘long’ (tab separated) functions file:

ID   mRNA-YPR204W
FT   DOMAIN        1   1032       NON CYTOPLASMIC.
//
ID   mRNA-ndhB-2_1
FT   SIGNAL        1     21
FT   DOMAIN        1      4       N-REGION.
FT   DOMAIN        5     16       H-REGION.
FT   DOMAIN       17     21       C-REGION.
FT   DOMAIN       22     36       NON CYTOPLASMIC.
FT   TRANSMEM     37     57
FT   DOMAIN       58     63       CYTOPLASMIC.
FT   TRANSMEM     64     83
FT   DOMAIN       84     88       NON CYTOPLASMIC.
FT   TRANSMEM     89    113
FT   DOMAIN      114    133       CYTOPLASMIC.
FT   TRANSMEM    134    156
FT   DOMAIN      157    167       NON CYTOPLASMIC.
FT   TRANSMEM    168    189
FT   DOMAIN      190    222       CYTOPLASMIC.
FT   TRANSMEM    223    246
FT   DOMAIN      247    253       NON CYTOPLASMIC.
//

SignalP 4.1 ‘short’ (tab separated) functions file:

# name                     Cmax  pos  Ymax  pos  Smax  pos  Smean   D     ?  Dmaxcut    Networks-used
mRNA-rpl2-3                0.148  20  0.136  20  0.146   3  0.126   0.131 N  0.450      SignalP-noTM
mRNA-cox2                  0.107  25  0.132  12  0.270   4  0.162   0.148 N  0.450      SignalP-noTM
mRNA-cox2_1                0.850  17  0.776  17  0.785   2  0.717   0.753 Y  0.500      SignalP-TM

SignalP 5.0 ‘short’ (tab separated) functions file:

# SignalP-5.0 Organism:   Eukarya     Timestamp: 20211122233246
# ID          Prediction  SP(Sec/SPI) OTHER    CS Position
AT3G26880.1   SP(Sec/SPI) 0.998803    0.001197 CS pos: 21-22. VYG-KK. Pr: 0.9807
mRNA-rpl2-3   OTHER       0.001227    0.998773
Relevant literature

Add antiSMASH

Read antiSMASH output and incorporate Biosynthetic Gene Clusters (BGC) nodes into the pangenome database. A ‘bgc’ node holds the gene cluster product, the cluster address and has a relationship to all gene nodes of the cluster. For this function to work, antiSMASH should be performed with the same FASTA and GFF3 files used for building the pangenome. antiSMASH output will not match the identifiers of the pangenome when no GFF file was included.

As of PanTools v3.3.4 the required antiSMASH version is 6.0.0. Gene cluster information is parsed from the .JSON file that is generated in each run. We try to keep the parser updated with newer versions but please contact us when this is no longer the case.

Version

Version Date

antiSMASH

6.0.0

21-02-2021

Parameters

<databaseDirectory>

Path to the database root directory.

<antiSMASHFile>

A text file with on each line a genome number and the full path to the corresponding antiSMASH output file, separated by a space.

Options

--annotations-file/-A

A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.

Example antiSMASH file

The <antiSMASHFile> requires to be formatted like a regular annotation input file. Each line of the file starts with the genome number followed by the full path to the JSON file.

1 /mnt/scratch/IPO3844/antismash/IPO3844.json
4 /home/user/IPO3845/antismash/IPO3845.json
Example commands
$ pantools add_antismash tomato_DB clusters.txt
$ pantools add_antismash -A annotations.txt tomato_DB clusters.txt

Removing data

The following functionalities allow the removal of large sets of nodes and relationships from the pangenome. These functions will first ask for a confirmation before the nodes are actually removed. Be careful, the data is not backed up and removing nodes or properties means it is permanently gone.


Remove nodes

Remove a selection of nodes and their relationships from the pangenome. For a pangenome database the following nodes should never be removed: nucleotide, pangenome, genome, sequence. When using a panproteome, mRNA nodes cannot be removed.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

Requires one of --nodes|--label, include and exclude only work for --label.

--include/-i

Only remove nodes of the selected genomes.

--exclude/-e

Do not remove nodes of the selected genomes.

--nodes/-n

One or multiple node identifiers, separated by a comma.

--label

A node label, all nodes matching the label are removed.

Example commands
$ pantools remove_nodes --nodes=10348734,10348735,10348736 tomato_DB
$ pantools remove_nodes --label=busco --include=2-6 tomato_DB

Remove phenotypes

Delete phenotype nodes or remove specific phenotype information from the nodes. The specific phenotype property needs to be specified with --phenotype. When this argument is not included, phenotype nodes are removed.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--include/-i

Only remove nodes of the selected genomes.

--exclude/-e

Do not remove nodes of the selected genomes.

--phenotype/-p

Name of the phenotype. All information of the given phenotype is removed from ‘phenotype’ nodes.

Example commands
$ pantools remove_phenotype tomato_DB
$ pantools remove_phenotype --phenotype=color tomato_DB
$ pantools remove_phenotype --phenotype=color --exclude=11,12 tomato_DB

Remove annotations

Remove all the genomic features that belong to annotations, such as gene, mRNA, exon, tRNA, and feature nodes. Functional annotation nodes are not removed with this function but can be removed with remove_functions. Removing annotations can be done in two ways:

  1. Selecting genomes with --include or --exclude, for which all annotation features will be removed.

  2. Remove specific annotations by providing a text file with identifiers via the --annotations-file argument.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

Requires one of --include|--exclude|--annotations-file.

--include/-i

A selection of genomes for which all annotations will be removed.

--exclude/-e

A selection of genomes excluded from the removal of annotations.

--annotations-file/-A

A text file with the identifiers of annotations to be removed, each on a separate line.

Example annotations file

The annotations file should contain identifiers for annotations on each line (genome number, annotation number). The following example will remove the first annotations of genome 1, 2 and 3 and the second annotation of genome 1.

1_1
1_2
2_1
3_1
Example commands
$ pantools --exclude=3,4,5 remove_annotations
$ pantools -A annotations.txt remove_annotations

Remove functions

Remove all the functional annotation features from the graph database. Functional annotations include the GO, pfam, tigrfam and interpro nodes as well as mRNA node properties for COG, phobius and signalp. There are multiple modes available using --mode:

  • ‘all’ removes all functional annotation nodes and properties.

  • ‘nodes’ removes all GO, pfam, tigrfam and interpro nodes.

  • ‘properties’ removes all COG, phobius and signalp properties from mRNA nodes.

  • ‘GO’, ‘pfam’ and ‘tigrfam’ only remove specific properties from mRNA nodes.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--mode/-m

Mode for which annotations to remove (default: all)

Example commands
$ pantools remove_functions
$ pantools --mode=nodes remove_functions

Move or remove grouping

As only one grouping can be active at the time, the currently active grouping needs to be removed or inactivated before group can be run again.

Remove grouping

Delete all ‘homology_group’ nodes and ‘is_similar’ relations between ‘mRNA’ nodes from the database.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--fast

Do not remove the ‘is_similar’ relationships between mRNA nodes. This does not influence the next grouping.

--grouping-version/-v

Select a specific grouping version to be removed. Two additional options: ‘all’ to remove all groupings and ‘all_inactive’ to remove all inactive groupings.

Example commands
$ pantools remove_grouping tomato_DB
$ pantools remove_grouping --version=1 tomato_DB
$ pantools remove_grouping --version=all --fast tomato_DB
$ pantools remove_grouping --version=all_inactive tomato_DB
Move grouping

Relabel ‘homology_group’ nodes to ‘inactive_homology_group’. The moved grouping can be activated again with change_grouping.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--fast

Do not remove the ‘is_similar’ relationships between mRNA nodes. This does not influence the next grouping.

Example commands
$ pantools move_grouping tomato_DB
$ pantools move_grouping --fast tomato_DB

Pangenome characterization

Functionalities for characterization a pangenome based on genes, k-mer sequences and functions. In this manual we use several pangenome related terms with the following definitions:

  • Core, an element is present in all genomes

  • Unique, an element is present in a single genome

  • Accessory, an element is present in some but not all genomes

When phenotype information is used in the analysis, three additional categories can be assigned:

  • Shared, an element present in all genomes of a phenotype

  • Exclusive, an element is only present in a certain phenotype

  • Specific, an element present in all genomes of a phenotype and is also exclusive

_images/classify_table.png

The possible classification categories for genes, k mers and functions. Additional copies of an element are assigned to the same category.


Metrics

Generates relevant metrics of the pangenome and the individual genomes and sequences.

  • On the pangenome level: the number of genomes, sequences, annotations, genes, proteins, homology groups, k-mers, and database nodes and edges.

  • On the genome and sequence level: assembly statistics and metrics about functional elements. The assembly statistics consists of genome size, N25-N95, L25-L95, BUSCO scores and GC content. An overview of the functional elements is created by summarizing the functional annotations per genome (and sequence) and reporting the shortest, longest, average length and density per MB for genome features such as genes, exons and CDS.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--annotations-file/-A

A text file with the identifiers of annotations that should be used. The most recent annotation is selected for genomes without an identifier.

Example commands
$ pantools metrics tomato_DB
$ pantools metrics --exclude=1,2,5 tomato_DB
Output

Output files are written to the metrics directory in the database. Note: the percentage a genome or sequence is covered by a genes, repeats etc., (currently) does not consider overlap between features!

  • metrics.txt, overview of the metrics calculated on the pangenome and genome level.

  • metrics_per_genome.csv, summary of the metrics that are calculated on a genome level. The output is formatted as table.

  • metrics_per_sequence.csv, summary of metrics that are calculated on a sequence (contig/scaffold) level. The output is formatted as table. This file is not created when using a panproteome.


Homology groups

The following functions require the protein sequences to be clustered by group.

Gene classification

Classification of the pangenome’s gene repertoire. Homology groups are utilized to identify shared genes between genomes. The default criteria for defining the category of a gene is shown in Fig. 3.

To identify soft core and cloud genes, the core and unique thresholds (%) can be relaxed by --core-threshold and --unique-threshold, respectively. The --phenotype-threshold argument can be used to lower the threshold for phenotype specific and shared homology groups.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--include/-i

Only include a selection of genomes. This automatically lowers the threshold for core genes.

--exclude/-e

Exclude a selection of genomes. This automatically lowers the threshold for core genes.

--phenotype/-p

A phenotype name, used to find genes specific to the phenotype.

--mlsa

Finds suitable single-copy groups for a MLSA.

--core-threshold

Threshold (%) for (soft) core genes. Default is 100% of genomes.

--unique-threshold

Threshold (%) for unique/cloud genes. Default is a single genome, not a percentage.

--phenotype-threshold

Threshold (%) for phenotype specific/shared genes. Default is 100% of genomes with phenotype.

Example commands
$ pantools gene_classification tomato_DB
$ pantools gene_classification --unique-threshold=5 --core-threshold=95 tomato_DB
$ pantools gene_classification --phenotype=resistance --exclude=2,3 --phenotype-threshold=95 tomato_DB
Output

Output files are written to the gene_classification directory in the database.

  1. gene_classification_overview.txt, statistics of the core, accessory, unique groups of the pangenome and individual genomes.

  2. classified_groups.csv, the classified homology groups formatted as the table in the example table above.

  3. cnv_core_accessory.txt, core and accessory groups with genomes that have additional copies compared to the lowest number (at least 1) in the group.

  4. group_size_occurrence.txt, number of times a group of a certain size occurs in the pangenome. The homology group sizes can be based on the number of proteins or the number of genomes.

  5. gene_distance_tree.R, an R script to cluster genomes based on gene distance (absence/presence). For more information, see the Gene distance tree manual.

  6. shared_unshared_gene_count.csv, six tables with the number of shared and unshared genes between genomes: all genes, distinct genes and informative distinct genes. To get the number of distinct genes, additional copies of a gene within a homology group are ignored. Genes are considered informative when shared by at least two genomes.

Additional files are generated when the --phenotype argument is included.

  1. gene_classification_phenotype_overview.txt, the number of identified phenotype shared and specific groups.

  2. phenotype_disrupted.txt, this file shows which proteins prevented phenotype shared groups to be specific.

  3. phenotype_cnv, homology groups where all members of a phenotype have at least one additional copy of a gene compared to one of the other phenotypes.

  4. phenotype_association.csv, results of performed Fisher exact tests on homology groups with an unequal proportion of phenotype members.

The following files contain homology group node identifiers.

  1. all_homology_groups.csv, the node identifiers of all homology groups.

  2. core_groups.csv, the node identifiers of the core homology groups.

  3. single_copy_orthologs.csv, the node identifiers of single-copy ortholog groups. This is a subset of the core set where each genome is only allowed to have a one copy of a gene.

  4. accessory_groups.csv, the node identifiers of accessory homology groups. The groups are ordered (in descending order) by the group size based on the total number of genomes present.

  5. accessory_combinations.csv, the node identifiers of accessory homology groups, ordered by the combination of genomes by which they are shared.

  6. unique_groups.csv, the node identifiers of unique homology groups ordered by genome.

  7. phenotype_specific_groups.csv, the node identifiers of phenotype specific homology groups.

  8. phenotype_shared_groups.csv, the node identifiers of phenotype shared homology groups.

  9. phenotype_exclusive_groups.csv, the node identifiers of phenotype exclusive homology groups.

When --mlsa is included

  1. mlsa_suggestions.txt, a list of single copy ortholog genes all having the same gene name. This file cannot be created when using a panproteome.


Core unique thresholds

Runs a simplified version of the gene_classification function to test the effect of different --core-threshold and --unique-threshold cut-offs between 1 and 100%.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

--include/-i

Only include a selection of genomes. This automatically lowers the threshold for core genes.

--exclude/-e

Exclude a selection of genomes. This automatically lowers the threshold for core genes.

Example commands
$ pantools core_unique_thresholds tomato_DB
$ pantools core_unique_thresholds --exclude=1,2,5-10 tomato_DB
$ R script tomato_DB/R_scripts/core_unique_thresholds/core_unique_thresholds.R
Output

Output files are written to core_unique_thresholds directory in the database.

  • core_unique_thresholds.csv, the number of (soft) core unique/cloud homology groups for all tested thresholds.

  • core_unique_thresholds.R, the R script plots the number of (soft) core unique/cloud homology groups for all tested thresholds.

_images/core_unique_thresholds.png

Example output of core_unique_thresholds.R on a pangenome of 197 Pectobacterium genomes demonstrates the effect of loosening the thresholds. The number of (soft) core (orange) homology groups slightly increases when the cut-off for this category is lowered from 100% (200 genomes) to 1% (2) in steps of 1%. Unique/cloud (blue) start at 0.00 which represents a single genome. Using a 0.01 cut-off, groups are unique/cloud having 2 genomes or less. The threshold is further increased to 100% (200) in steps of 1%.


Grouping overview

Reports the content of all (active & inactive) homology groups for the different groupings in the pangenome. Include --fast into the command to get a quick overview of the available groupings and the settings that were used.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

--fast

Only show which grouping is active and which groupings can be activated.

Example commands
$ pantools grouping_overview tomato_DB
$ pantools grouping_overview --fast tomato_DB
Output

Output files are written to /database_directory/group/

  • grouping_overview.txt, all homology groups in the pangenome. For each homology group, the total number of members and the number of members per per genome is reported.

  • current_pantools_homology_groups.txt, overview of the active homology groups. Each line represents one homology group. The line starts with the homology group (database) identifier followed by a colon and the rest are mRNA IDs (from gff/genbank) seperated by a space.


Pangenome structure

Iterations of random genome combinations according to the models proposed by Tettelin et al.* in 2005 are used to determine the contribution of new accessions with respect to the increase in core, accessory, and unique. Each iteration starts with three random genomes from which core, accessory and unique homology groups are identified. Subsequently, random genomes are added and group reclassified until the maximum number of genomes is reached. To simulate the overall pangenome-size increase and core-genome decrease, we suggest to use at least 10,000 iterations. Additional copies of a gene are ignored in the simulation.

Heaps’ law (a power law) can be fitted to the number of new genes observed when increasing the pangenome by one random genome. The formula for the power law model is \(n = k * N^{-a}\), where n is the newly discovered genes, N is the total number of genomes, and k and a are the fitting parameters. A pangenome can be considered open when a < 1 and closed if a > 1.

Pangenome size estimation is based on homology groups. This function requires the sequences to be already clustered by group. The same simulation can be performed on k-mer sequences instead of homology groups with --kmer. As the number of k-mers is significantly higher than the number of homology groups, the runtime is much longer and the (default) number of loops is set to only 100.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

--threads/-t

Only include a selection of genomes.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--kmer/-k

Pangenome size estimation based on k-mer sequences (homology-groups is default).

--loops

Number of loops (default is 10.000 for homology groups, 100 for k-mers).

Example commands
$ pantools pangenome_structure tomato_DB
$ pantools pangenome_structure --loops=1000 --exclude=1-3,5  tomato_DB
$ pantools pangenome_structure --kmer tomato_db

$ R script pangenome_growth.R
$ R script gains_losses_median_or_average.R
$ R script gains_losses_median_and_average.R
$ R script heaps_law.R
$ R script core_access_unique.R
Output

Output files for homology-group based estimation are written to /database_directory/pangenome_size/gene/

  • pangenome_size.txt, various statistics on the number core, accessory, and unique homology groups for the different pangenome sizes.

  • gains_losses.txt, the average group gain and loss between different pangenome sizes. First the average number (core, accessory, and unique) groups for each pangenome size is calculated. The average gain and loss of groups is then found by subtracting the averages of a certain size to the averages of one genome larger (e.g. pangenome size of 5 is compared to 6).

  • gains_losses_last_genome.txt, the number of (core, accessory, and unique) groups that are gained or lost when including one of the genomes to a pangenome of the remaining genomes.

  • pangenome_growth.R, an R script to plot the number of core, accessory and unique groups for the different genome combinations. Second option is to only plot a core and accessory curve by including unique groups to the accessory.

  • gains_losses_median_and/or_average.R, R scripts to plot the average and median group gain and loss between pangenome sizes.

  • heaps_law.R, an R script to perform Heaps’ law.

_images/size.png

Example output of pangenome_growth.R (left) and gains_losses_median_and_average.R (right) on a pangenome of 204 bacteria.

Output files for k-mer based estimation are written to /database_directory/pangenome_size/kmer/

  • pangenome_size_kmer.txt, statistics of the number of k-mers with different pangenome sizes.

  • core_access_unique.R, an R script to plot the number core, accessory, unique k-mers for the different genome combinations.

  • core_access.R, an R script to plot the number of core and accessory (including unique) k-mers for the different genome combinations.

Relevant literature

K-mer classification

Calculate the number of core, accessory, unique, (and phenotype specific) k-mer sequences. Because k-mer sequences of non-branching paths of the DBG graph are collapsed into a single node, k-mers are first uncompressed before they are counted. When --compressed is included, sequences are not uncompressed and considered as a single k-mer. Nucleotide nodes with a ‘degenerate’ label contain letters other than the four non-ambiguous ones (A, T, C, G). and are ignored by this function.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

--include/-i

Only include a selection of genomes. This automatically lowers the threshold for core k-mers.

--exclude/-e

Exclude a selection of genomes. This automatically lowers the threshold for core k-mers.

--phenotype/-p

A phenotype name, used to identify phenotype specific k-mers.

--compressed

Do not uncompress collapsed non-branching k-mers for k-mer counting.

--core-threshold

Threshold (%) for (soft) core k-mers. Default is 100% of the genomes.

--unique-threshold

Threshold (%) for unique/cloud k-mers. Default is a single genome, not a percentage.

--phenotype-threshold

Threshold (%) for phenotype specific/shared k-mers. Default is 100% of genomes with phenotype.

Example commands
$ pantools kmer_classification tomato_DB
$ pantools kmer_classification --phenotype=resistant --exclude=2,3,4 tomato_DB
$ pantools kmer_classification --compressed --core-threshold=95 --unique-threshold=5 tomato_DB
Output

Output files are written to /database_directory/kmer_classification/

  • kmer_classification_overview.txt, some general statistics and percentages about the core, accessory unique k-mers per genome.

  • kmer_occurrence.txt, the occurrence of k-mers per genome and total occurrence in the pangenome.

  • kmer_distance_tree.R, an R script to cluster genomes with four different k-mer distances to choose from. For more information, see The k-mers are ordered from high to low by the total number of genomes the k-mer is found.

  • unique_kmers.csv, the node identifiers of unique k-mers ordered by genome.

  • phenotype_specific_kmers.csv, the node identifiers of phenotype specific k-mers.

  • phenotype_shared_kmers.csv, the node identifiers of phenotype shared k-mers.


Functional annotations

The following functions can only be used when any type of functional annotation is added to the database.

Functional classification

Similar to gene and k-mer classification, this function identifies core, accessory, unique functional annotations in the pangenome. Only the following functions are considered for this analysis: biosynthetic gene clusters from antiSMASH, GO, PFAM, InterPro, TIGRFAM.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

--include/-i

Only include a selection of genomes. This automatically lowers the threshold for core genes.

--exclude/-e

Exclude a selection of genomes. This automatically lowers the threshold for core genes.

--annotations-file/-A

A text file with the identifiers of annotations that should be used. The most recent annotation is selected for genomes without an identifier.

--phenotype/-p

A phenotype name, used to find functions specific to a phenotype.

--core-threshold

Threshold (%) For (soft) core functions (default is 100%).

--unique-threshold

Threshold (%) For unique/cloud functions (default is a single genome, not a percentage).

Example commands
$ pantools functional_classification tomato_DB
$ pantools functional_classification -p=flowering_time tomato_DB
Output

Output files are written to /database_directory/function/functional_classification/

  • functional_annotation_overview, number of core, accessory, and unique functions. Holds the number of phenotype shared and specific functions when a phenotype is included.

  • core_functions.txt, functional annotations found in every genome of the pangenome.

  • accessory_functions.txt, functional annotations labeled as accessory.

  • unique_functions.txt, functional annotations unique to a single genome.

When a --phenotype is included

  • phenotype_shared_functions.txt, functional annotations shared by all phenotype members.

  • phenotype_specific_functions.txt, functional annotations specific to certain phenotypes.


Function overview

Creates several summary files for each type of functional annotation present in the database: GO, PFAM, InterPro, TIGRFAM, COG, Phobius, and biosynthetic gene clusters from antiSMASH. In addition to the functions that must be added via add_functional_annotations, this function also requires proteins to be clustered by group.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--annotations-file/-A

A text file with the identifiers of annotations that should be used. The most recent annotation is selected for genomes without an identifier.

Example commands
$ pantools function_overview tomato_DB
$ pantools function_overview --include=2-4 tomato_DB
Output

Output files are written to function directory in the database. The overview CSV files are tables with on each row a function identifier with the frequency of per genome and.

  • functions_per_group_and_mrna.csv, overview of all homology groups and the associated functions.

  • function_counts_per_group.csv,

  • go_overview.csv, overview of the GO terms in the pangenome.

  • pfam_overview.csv, overview of the PFAM domains in the pangenome.

  • tigrfam_overview.csv, overview of the TIGRFAMs in the pangenome.

  • interpro_overview.csv, overview of the InterPro domains in the pangenome.

  • bgc_overview.csv, overview of the added biosynthetic gene clusters from antiSMASH in the pangenome.

  • phobius_signalp_overview.csv, overview of the included Phobius transmembrane topology and signal peptide predictions in the pangenome.

  • cog_overview.csv, overview of the functional COG categories in the pangenome.

  • cog_per_class.R, an R script to plot the distribution of COG categories over the core, accessory, unique homology groups.

_images/COG_abundance.png

Example output of cog_per_class.R. The proportion of COGs functional categories assigned to homology groups.


GO enrichment

For a given set of mRNA’s or homology groups, this function identifies over or underrepresented GO terms by using a hypergeometric distribution.

The p-value is calculated from the hypergeometric distribution

\[P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}\]
  • Parameter N = size of the population (Universe of genes).

  • Parameter n = size of the sample (signature gene set)

  • Parameter K = successes in population (enrichment gene set)

  • Parameter k = successes in sample (intersection of both gene sets)

  • Return the p-value of the Hypergeometric Distribution for P(X=k)

Prepare input for hypergeometric tests

The size and number of successes of the sample (n, k) and background (N, K) is prepared for each genome individually. Per genome, loops over every mRNA and checks for connected GO nodes. Each GO node connected to the mRNA is used to move up in the GO hierarchy via ‘is_a’ relations until the molecular_function, biological_process or cellular_component node is reached. Each GO term is counted only once per mRNA and a mRNA needs at least one GO term to be included in the sample and background sets. mRNA nodes which are part of the input homology groups are included into the sample set.

Multiple testing correction

Critical p-value using Bonferroni

For a GO germ to be significant, the p-value should be below 0.05 divided by number of tests per genome. For example, when 100 tests were performed, each p-value must be below 0.05/100 = 0.0005 to be considered significant.

Critical p-value using Benjamini-Hochberg procedure

  1. Individual p-values are put in ascending order.

  2. Ranks are assigned to the p-values. The lowest value has a rank of 1, the second lowest gets rank 2, etc..

  3. The individual p-values Benjamini-Hochberg critical value is calculated using the formula \((i/m)Q\), where i is the individual p-values rank, m = total number of tests and Q is the false discovery rate.

  4. Compare your original p-values to the critical B-H from Step 3; find the largest p value that is smaller than the critical value.

The critical p-value for the first rank for a total of 100 GO terms (tests) with a 5% false discovery rate is \((1/100) * 0.05 = 0.0005\). For the second and third rank this will be 0.0010 and 0.0015, respectively.

Required software
  • dot. Although this function still works when dot is not (properly) installed, no visualizations of the GO hierarchy can be created.

Parameters

<databaseDirectory>

Path to the pangenome database root directory.

Options

Requires one of --homology-file``|–nodes``.

--homology-file/-H

A text file with homology group node identifiers, seperated by a comma.

--nodes/-n

mRNA node identifiers, seperated by a comma on the command line.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--fdr

The false discovery rate (percentage), default is 5%.

Example commands
$ pantools go_enrichment -H=unique_groups.txt tomato_DB
$ pantools go_enrichment --fdr=1 -i=1-3,5 -H=pheno_specific.txt tomato_DB
Output

Output files are stored in /database_directory/function/go_enrichment/.

  • go_enrichment.csv, overview of all GO terms, p-values and the significance of enrichment. The output is formatted as a table.

  • go_enrichment_overview_per_go.txt, results of the analysis are ordered by GO term.

  • function_overview_per_mrna.txt, all functional annotations connected to the input sequences, ordered per mRNA.

  • function_overview_per_genome.txt, all functional annotations connected to the input sequences, ordered per genome.

Additional files are generated per individual genome and placed in /results_per_genome/.

  • go_enrichment.txt, list of GO terms, p-values and the critical p-values of Benjamin-Hochberg and Bonferroni.

  • revigo.txt, a list of GO terms and p-values that can be visualized on http://revigo.irb.hr

  • bio_process.pdf, dot visualisation of the Biological Process GO hierarchy.

  • cell_comp.pdf, dot visualisation of the Cellular Component GO hierarchy.

  • mol_function.pdf, dot visualisation of the Molecular Function GO hierarchy.

_images/bio_process.png

Visualization of GO hierarchy by dot


Phylogeny

There are six different methods implemented which can create phylogenetic trees. The consensus tree method creates a Maximum per-Locus Quartet-score Species Tree (MLQST) from a set of gene trees. The other five methods use a Neighbour-joining (NJ) or Maximum Likelihood (ML) algorithm to infer the phylogeny.

All functions produce tree files in Newick format that can be visualized with iTOL or any other phylogenetic tree visualization software.


Core phylogeny

Infer a Maximum likelihood (ML) or Neighbour-Joining (NJ) phylogeny from SNPs identified from single copy orthologous genes. This function requires single-copy homology groups which are automatically detected if gene_classification was run before. The homology groups are aligned in two consecutive rounds with msa.

When using --clustering-mode ML, parsimony informative positions are extracted from the trimmed alignments and concatenated into single continuous sequence per genome. IQ-tree infers the ML tree with minimum of 1000 bootstrap iterations.

The --clustering-mode NJ method counts the total and shared number of variable sites between two genomes in the alignment and calculates a Jaccard distance (0-1):

\[D_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}\]
Required software

Please cite the appropriate tool(s) when using the core phylogeny in your research.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--threads/-t

Number of parallel working threads, default is the number of cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--homology-file/-H

A file with homology group node identifiers of single copy groups. Default is single_copy_orthologs.csv, generated in the previous gene_classification run. (Mutually exclusive with --homology-groups.)

--homology-groups/-G

A comma separated list of homology group node identifiers of single copy groups. Default is single_copy_orthologs.csv, generated in the previous gene_classification run. (Mutually exclusive with --homology-file.)

--protein

Use proteins instead of nucleotide sequences.

--phenotype/-p

Include phenotype information in the resulting phylogeny.

--clustering-mode/-m

Maximum likelihood (–mode ML) or Neighbour joining (–mode NJ). Default is ML.

Example commands
$ pantools core_phylogeny -t=24 tomato_DB
$ pantools core_phylogeny -t=24 --clustering-mode=NJ --protein tomato_DB
$ pantools core_phylogeny -t=24 -m=ML -p=resistance tomato_DB
Output

Output files are written to the core_snp_tree directory in the database.

  • sites_per_group.csv, number of parsimony informative and variable sites per homology group.

When --clustering-mode NJ is included

  • core_snp_NJ_tree.R, Rscript to create NJ tree from distances based on shared sites. Two distances can be selected, based on variable sites and parsimony informative sites.

  • shared_informative_positions.csv, table with total number of shared parsimony informative sites between genomes.

  • shared_variable_positions.csv, table with total number of shared variable sites between genomes.

When --clustering-mode ML is included

  • informative.fasta, nucleotides from parsimony informative sites of the alignments, concatenated into a single sequences per genomes.

  • variable.fasta, nucleotides from variable sites of the alignment, concatenated into a single sequences per genomes.

A command is generated which can be used to execute IQ-tree and infer the phylogeny on informative.fasta.

  • informative.fasta.iqtree, IQ-tree log file.

  • informative.fasta.treefile, the ML phylogeny.

  • informative.fasta.splits.nex, the splits graph. With ideal data, this file is a tree, whereas data with conflicting phylogenetic signals will result in a tree-like network. This type of tree/network can be visualized with a tool like SplitsTree


K-mer distance tree

A NJ phylogeny of k-mer distances can be created by executing the Rscript generated by k-mer_classification.

Three types of distances can be selected to infer the phylogeny. The first two distances are Jaccard distances (0-1): one considering only distinct k-mers and the other using all k-mers. The distance from distinct k-mers ignores additional copies of a k-mer.

\[ \begin{align}\begin{aligned}D_J(A,B) &= 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}\\D_J(A,B) &= 1 - J(A,B) = \frac{|A \uplus B| - |A \cap B|}{|A \uplus B|}\end{aligned}\end{align} \]

We observed an exponential increase in the k-mer distance as the evolutionary distance between two genomes increases. So in the case of more distant genomes, the depicted clades are still correct but the extreme long branch lengths make the tree hard to decipher. To normalize the numbers, we implemented the MASH distance. Distance = −1/𝑘 * ln(J), where k is the k-mer length; J is the jaccard index (of distinct k-mers).

$ Rscript genome_kmer_distance_tree.R
Output file

The phylogenetic tree genome_kmer_distance_tree.tree is written to the kmer_classification directory in the database.


Consensus tree

Create a consensus tree by combining gene trees from homology groups using ASTRAL-Pro. Gene trees are created from all sequences in an homology groups, no genomes can be skipped.

Required software

Please cite MAFFT, FastTree and ASTRAL-Pro when using the consensus tree in your research.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--threads/-t

Number of parallel working threads, default is the number of cores or 8, whichever is lower.

--homology-file/-H

A file with homology group node identifiers. Default is all homology groups. (Mutually exclusive with --homology-groups.)

--homology-groups/-G

A comma separated list of homology group node identifiers. Default is all homology groups. (Mutually exclusive with --homology-file.)

--polytomies

Allow polytomies for ASTRAL-PRO.

Example commands
$ pantools consensus_tree -t=24 apple_DB
$ pantools consensus_tree -H=apple_DB/gene_classification/group_identifiers/all_homology_groups.csv apple_DB
$ pantools consensus_tree -H=apple_DB/gene_classification/group_identifiers/core_homology_groups.csv apple_DB
$ pantools consensus_tree -H=apple_DB/gene_classification/group_identifiers/accessory_homology_groups.csv apple_DB
Output

Output files are written to the consensus_tree directory in the database.

  • all_trees.hmgroups.newick, all gene trees of homology groups included in the analysis, combined into a single file.

  • consensus_tree.astral-pro.newick, the output consensus tree from ASTRAL-Pro.

Relevant literature

Gene distance tree

A NJ phylogeny of gene distances is created by executing the Rscript generated by gene_classification.

Shared genes between genomes are identified through homology groups. Two Jaccard distance (0-1) can be used to infer a tree: one considering only distinct genes and the other using all genes. The distance from distinct genes ignores additional gene copies in an homology group.

\[ \begin{align}\begin{aligned}D_J(A,B) &= 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}\\D_J(A,B) &= 1 - J(A,B) = \frac{|A \uplus B| - |A \cap B|}{|A \uplus B|}\end{aligned}\end{align} \]
$ Rscript gene_distance_tree.R
Output file

The phylogenetic tree gene_distance_tree.tree is written to the gene_classification directory in the database.


ANI

Average Nucleotide Identity (ANI) is a measure of nucleotide-level genomic similarity between the coding regions of two prokaryotic genomes. Two very fast ANI estimation tools (fastANI and MASH) are implemented and are able to perform the pairwise comparisons between genomes in the pangenome. To convert the ANI score into a distance (0-1), the scores are transformed by \(1-(ANI/100)\).

Required software

The required software depends on the tool you want to use. Please cite the appropriate tool when using the ANI tree in your research.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--threads/-t

Number of parallel working threads, default is the number of cores or 8, whichever is lower. MASH is always single threaded (and currently not parallelized yet).

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--mode/-m

Software to calculate ANI score (default: MASH).

--phenotype

Include phenotype information in the phylogeny.

Example commands
$ pantools ani pecto_DB
$ pantools ani --phenotype=species_name --mode=fastani pecto_DB
$ pantools ani  --exclude=4,5,6 --mode=mash pecto_DB
Output

Output files are written to the ANI directory in the database.

  • ANI_scores.csv, a table with ANI scores for all genome pairs.

  • ANI_distance_matrix.csv, a table with the ANI distances (1-ANI). This matrix is read by ANI_tree.R.

  • ANI_tree.R, Rscript to generate NJ tree from ANI distances

Find closest typestrain

Compares bacterial strains to the typestrain when this information is available in a pangenome database.

  1. Add the ‘typestrain’ phenotype to the pangenome with add_phenotypes. You only have to include typestrains names, other genomes can be left empty as shown in the example below, five genomes with three different typestrains.

  2. Run the ANI function

  3. The ‘typestrain’ phenotype is recognized, and typestrain_comparison.csv is created. This file contains the highest score of each genome(5) against all the included typestrains and states whether the score is above 95%.

Genome,typestrain
1,Salmonella choleraesuis NCTC 5735
2,Salmonella enteritidisi NCTC 12694
3,
4,Salmonella paratyphi NCTC 5702
5,
Relevant literature

MLSA

Within PanTools you can perform a Multilocus sequence analysis (MLSA) by running three consecutive functions:

  1. mlsa_find_genes

  2. mlsa_concatenate

  3. mlsa

Step 1 Search for genes

Find your genes of interest in the pangenome and extract their nucleotide and protein sequence. A regular search is not case sensitive but the gene names must exactly match the given input name. For example, searching a gene with ‘sonic1’ as query will not be able find ‘sonic’, but is able to find Sonic1, SONIC1 or sOnIc1. Including the --extensive argument allows a more relaxed search and using ‘sonic’ will now also find gene name variations as ‘sonic1’, ‘sonic3’ etc.. For this function it is important that genomes are annotated by a method that follow the rules for genetic nomenclature, so there are no differences in the naming of genes.

To gain insight in which genes are appropriate for this analysis, run gene_classification with the --mlsa argument. This method creates a list of genes that have same gene name, are present in all (selected) genomes and are placed in the single-copy homology group. Using genes from this list guarantees a successful MLSA.

Possible generated warnings during gene search

When a gene is included that is not on the list of suitable genes, it is not necessarily unusable but possibly requires manual . This function generates a log file with the issues and explains the user what to do.

  • Gene is not found in every genome. Consider using --extensive. The gene is not suitable with the current genome selection when this argument was already included.

  • The found genes are placed in different homology groups. A directory named the gene name is created where sequences are stored in a separate file per homology group. When one of the groups is single copy orthologous, it is automatically selected. With multiple correct single-copy groups, the first is selected. If no single-copy groups are found, this gene is probably not a suitable candidate based on the high divergence. If you are determined to use the gene, align and infer a gene tree on all_sequences.fasta to identify appropriate sequences.

  • At least one gene has an additional copy. The extra copies must be removed from the output file if you want to include this gene in the analysis. Find the copies that stand out by aligning and inferring a gene tree of the homology group.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--genes/-g (required)

One or multiple gene names, separated by a comma.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--extensive

Perform a more extensive gene search.

Example commands
$ pantools mlsa_find_genes -g=dnaX,gapA,recA bacteria_DB
$ pantools mlsa_find_genes --extensive --genes=gapA bacteria_DB
Output

Output files are written to the mlsa/input/ directory in the database. For each gene name that was included, a nucleotide and protein and FASTA file is created that holding the sequences found in all genomes.

  • mlsa_find_genes.log, when one or multiple warnings are given they are placed in this log file. File is not created when there aren’t any warnings.


Step 2 Concatenate genes

Concatenate sequences obtained by mlsa_find_genes into a single sequence per genome. The --genes argument is required, but the selection of gene names is allowed to be a sub-selection of the earlier selection.

  1. Proteins are aligned with MAFFT

  2. The longest gap at the start and end of each protein alignment is identified.

  3. Nucleotide sequences are trimmed accordingly

  4. Trimmed nucleotide sequence are concatenated into a single sequence per genome.

Required software
Parameters

<databaseDirectory>

Path to the database root directory.

Options

--genes/-g (required)

One or multiple gene names, separated by a comma.

--threads/-t

Number of threads for MAFFT, default is the number of cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--phenotype

Name of the phenotype.

Example commands
$ pantools mlsa_concatenate --genes=dnaX,gapA bacteria_DB
$ pantools mlsa_concatenate --g=dnaX,gapA,recA --e=1,2,10-25 bacteria_DB
Output

The output file is stored in /database_directory/mlsa/input/

  • concatenated.fasta, file holding one concatenated sequence per genome.


Step 3 Run MLSA

Run MAFFT and IQ-tree on the concatenated nucleotide sequences from mlsa_concatenate to create an unrooted ML tree with 1,000 bootstrappings.

Required software

Please cite the MAFFT and IQ-tree when using the MLSA in your research.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--threads/-t

Number of threads for MAFFT and IQ-tree, default is the number of cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--phenotype/-p

Add phenotype information/values to the phylogeny. Allows the identification of phenotype specific SNPs in the alignment.

Example commands
$ pantools mlsa bacteria_DB
$ pantools mlsa -t=24 -p=species bacteria_DB
Output

Input and output files are written to the mlsa/output/ directory in the database.

  • mlsa.afa, the alignment in CLUSTAL format.

  • mlsa.fasta, the alignment in FASTA format.

  • mlsa.fasta.treefile, the (ML) phylogeny created by IQ-tree in Newick format.

When a --phenotype is included

  • nuc_phenotype_specific_changes.info, the positions of phenotype specific substitutions in the alignment.

The var_inf_positions directory holds files related to the counting variable positions of the alignment.

  • nuc_variable_positions.csv, a table with the counts of A, T, C, G, or gap for every variable position in the alignment

  • informative_nuc_distance.csv, a table with distances calculated from parsimony informative positions in the alignment.

  • informative_nuc_site_counts.csv, a table with number of shared parsimony informative positions between genomes.

  • variable_nuc_distance.csv, a table with distances calculated from variable positions in the alignment.

  • variable_nuc_site_counts.csv, a table with number of shared positions between genomes.


Edit Phylogeny

Rename phylogeny

Update or the terminal nodes (leaves) of a phylogenic tree. This is useful when you already constructed a tree but forgot to include a phenotype or to update the tree with a different phenotype. When no --phenotype is included, the node values are changed to genome numbers.

Parameters

<databaseDirectory>

Path to the database root directory.

<treeFile>

A phylogenetic tree in newick or nexus format. The tree must be generated by PanTools.

Options

--no-numbers

Exclude genome numbers from the terminal nodes (leaves).

--phenotype/-p

The phenotype used to rename the terminal nodes (leaves) of the selected tree.

Example commands
$ pantools rename_phylogeny bacteria_DB core_snp.tree
$ pantools rename_phylogeny --phenotype=species bacteria_DB bacteria_DB/ANI/fastANI/ani.tree
Output file

A new phylogenetic tree is written to the directory of the selected input tree:

  • When the original file is called ‘old_tree.newick’, a new tree is created with filename ‘old_tree_RENAMED.newick’.


Root phylogeny

All phylogenetic trees that come from the PanTools functionalities are unrooted. This function is able to create a new rooted tree simply by selecting one of the external (terminal) nodes via --node. The included number or string should match exactly one node in the phylogeny or the program will not execute.

Required software
Parameters

<databaseDirectory>

Path to the database root directory.

<treeFile>

A phylogenetic tree in newick format. The tree must be generated by PanTools.

Options

--node/-n (required)

The name of the terminal node that will root the tree.

Example commands
$ pantools root_phylogeny --node=1 bacteria_DB core_snp.tree
$ pantools root_phylogeny --node=1_A.thaliana bacteria_DB core_snp.tree
$ pantools root_phylogeny -n=1_1 bacteria_DB kmer.tree
$ Rscript reroot.R
Output file

A new phylogenetic tree is written to the same location as the provided input file

  • When the original tree is called ‘tree.newick’, the new file is named ‘tree_REROOTED.newick’.


Create tree template

Creates ‘ring’ and ‘colored range’ ITOL templates based on phenotypes for the visualization of phylogenies in iTOL. Phenotypes must already be included in the pangenome with the add_phenotypes functionality. How to use the template files in iTOL can be found in one of the tutorials.

If you run this function without a --phenotype argument, templates are created for trees that contain only genome numbers as node labels. When there is a --phenotype included, templates are created where the leaves are named according to the selected phenotype but are coloured by one of the other phenotypes in the pangenome. For example, you originally used the ‘species name’ as a phenotype to construct the phylogeny but want them to be coloured by the ‘pathogenicity’ phenotype.

More information about ITOL templates can be found on their own webpage.

There is a maximum of 20 possible colors that are used in the following order:

Color (> 8 phenotypes)

Hexadecimal color

Color (≤ 8 phenotypes)

Hexadecimal color

1

Pink

#fabebe

Orange

#E69F00

2

Lime

#bfef45

Sky blue

#56B4E9

3

Cyan

#42d4f4

Bluish green

#009E73

4

Apricot

#ffd8b1

Yellow

#F0E442

5

Mint

#aaffc3

Blue

#0072B2

6

Beige

#fffac8

Vermilion

#D55E00

7

Lavender

#e6beff

Reddish purple

#CC79A7

8

Teal

#469990

Grey

#999999

9

Red

#e6194B

10

Orange

#f58231

11

Yellow

#ffe119

12

Green

#3cb44b

13

Blue

#4363d8

14

Purple

#911eb4

15

Grey

#a9a9a9

16

Maroon

#800000

17

Olive

#808000

18

Brown

#9A6324

19

Navy

#000075

20

Magenta

#f032e6

_images/color_templates.png

Figures were copied from:

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--color

Assign a color to a phenotypes with a minimum amount of genomes (default: 2).

--phenotype/-p

Use the names from this phenotype.

Example commands
$ pantools create_tree_template bacteria_DB
$ pantools create_tree_template --phenotype=flowering --color=1 bacteria_DB
$ pantools create_tree_template --phenotype=root_morph --color=3 bacteria_DB
Output

Output files are written to the create_tree_template directory in the database.

  • When no phenotype information is included, a directory ‘genome_numbers’ is created where the templates are stored.

  • When a --phenotype is included, a directory (named after the phenotype) is created where the templates are stored.

The template files are named after the phenotypes, therefore the colors are based on that phenotype as well.


Multiple Sequence Alignments

This page is entirely dedicated to performing Multiple Sequence Alignments (MSA) with PanTools.

MSA

Performs multiple sequence alignments with MAFFT on sets of sequences. These alignments can either be:

  • per homology group

  • multiple homology groups

  • regions

  • with all sequences containing a functional domain

The alignment consists of two rounds: After the initial alignment, protein sequences are trimmed based on the longest start and end gap of the alignment. The number of trimmed amino acids is multiplied by three to trim the correct number of nucleotides. If only nucleotide sequences are aligned, the nucleotide sequence alignment is used for trimming. The trimmed sequences are aligned a second time to identify variable and parsimony informative sites. For each round, a ML phylogeny will be created with FastTree.

Select a method
By default, this function will make a MSA per homology group. (It can still be specified with --method=per-group.) Using another option from the list above requires use of the --method argument. For aligning multiple homology groups, please use --method=multiple-groups with the homology groups specified in a csv file on a single line (can be added with --homology-file=/path/to/hm.csv or --homology-groups=2,3,4,4). For aligning regions, please use --method=regions with a regions file that is added with --regions-file=/path/to/rf.txt. For aligning sequences based on a functional domain, please use --method=functions together with a functional domain that is added with --functions=<domain>.
Other options
In case you are only interested in the alignment of the nucleotide or protein sequences, use --mode=nucleotide or --mode=protein. When the --no-trimming argument is included, the variable and parsimony informative sites are identified from the initial alignment and no trimming is performed (thus, only one round of aligning). The option --no-fasttree can be used to skip running FastTree, which is used for generating a tree from the alignment.
Identify phenotype shared or specific variation
Shared SNPs or amino acid substitutions can be found among the members of a phenotype when --phenotype=<phenotype> is included. As homology groups can highly differ in size, the threshold for a phenotype shared or specific SNP/substitution is based on the number of sequences (from a certain phenotype) of an homology group instead of the number of genomes in the pangenome. For example, the pangenome holds 500 genomes but the homology group consists of only 100 sequences. The threshold can be lowered by including --phenotype-threshold=<phenotypeThreshold>, which lowers the original threshold by multiplying it to a given percentage.
Sequence identity and similarity
- The percentage identity of two sequences is calculated based on the number of exactly matching characters divided by the alignment length minus the positions were both sequences have a gap.
- The similarity (protein only) is calculated from the number of identical matches, increased by the number of similar amino acids (according to the BLOSUM 62 matrix), divided by the alignment length minus the shared gap positions. The calculated percentage of similarity is dependant on the BLOSUM matrix set by --blosum. Choose a larger BLOSUM number BLOSUM less divergent sequences.
Required software
Parameters

<databaseDirectory>

Path to the database root directory.

Options

--threads/-t

Number of threads for MAFFT and IQ-tree, default is the number of cores or 8, whichever is lower.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--homology-file/-H

A text file with homology group node identifiers, separated by a comma. Default is all homology groups. (Mutually exclusive with --homology-groups.)

--homology-groups/-G

A comma separated list of homology group node identifiers. Default is all homology groups. (Mutually exclusive with --homology-file.)

--regions-file/-R

A text file containing genome locations with on each line: a genome number, sequence number, begin and end position, separated by a space.

--method

The kind of alignment to make. Can be either per-group, multiple-groups, regions or functions.

--mode/-m

Choose to only align nucleotide or protein sequences (default is both).

--functions

For specifying one or multiple functional domains (Only used when --method=functions).

--phenotype/-p

A phenotype name, used to identify phenotype specific SNPs/substitutions.

--blosum

A BLOSUM matrix number to control MAFFT’s sensitivity and the similarity calculation. Allowed values are 45, 62 and 80 (default: 62).

--phenotype-threshold

Threshold for phenotype specific SNPs (default: 100%).

--[no-]trimming

Align the sequences only once. Trimming is on by default.

--[no-]fasttree

Run FastTree (default: true).

--variants

Use variation (default: false).

Example regions file

Each line must have a genome number, sequence number, begin and end positions that are separated by a space. Place a minus symbol behind a region to extract the reverse complement sequence.

1 1 1 10000
195 1 477722 478426
71 10 17346 18056 -
138 47 159593 160300 -
Example commands
$ pantools msa tomato_DB
$ pantools msa --mode=protein -H=hmgroups.txt tomato_DB
$ pantools msa --no-trimming --mode=nucleotide -H=hmgroups.txt tomato_DB
$ pantools msa --phenotype=resistance --phenotype-threshold=99 -H=hmgroups.txt tomato_DB
$ pantools msa --method=multiple-groups tomato_DB
$ pantools msa --method=multiple-groups --mode=protein -H=hmgroups.txt tomato_DB
$ pantools msa --method=multiple-groups --phenotype=resistance --phenotype-threshold=95 -H=hmgroups.txt tomato_DB
$ pantools msa --method=regions -R=regions.txt tomato_DB
$ pantools msa --method=functions --functions=PF10137 tomato_DB
Output files

Output files are stored in database_directory/alignments/msa_/grouping_v?/ A separate directory is created for each alignment which holds the input and output files.

The ‘input’ directory contains the input files for the alignments.

  • nuc/prot(_trimmed).fasta, original and trimmed input sequences.

  • trimmed.info, number of trimmed positions per sequence.

  • sequences.info, relevant gene information of sequences in group: gene names, mRNA node id, address, strand orientation.

The alignments and output files are written to the ‘output’ directory.

  • nuc/prot(_trimmed).afa, the initial and second (trimmed) alignment in CLUSTAL format.

  • nuc/prot(_trimmed).fasta, the initial and second (trimmed) alignment in FASTA format.

  • nuc/prot(_trimmed).newick, FastTree ML tree inferred from the initial and second (trimmed) alignment.

  • nuc/prot(_trimmed)_alignment.info, some statistics about the initial and second (trimmed) alignment: alignment length, number of conserved, variable and parsimony informative sites

Sequence identity and similarity output files.

  • nuc/prot(_trimmed)_identity.csv, table with the sequence identity scores.

  • prot(_trimmed)_similarity.csv, table with similarity of the protein sequences.

Variable and parsimony informative sites output files.

  • informative_nuc/prot(_trimmed)_distance.csv, table with distances between sequences based on parsimony informative sites in the alignment.

  • variable_nuc/prot(_trimmed)_distance.csv, table with distances between sequences based on variable sites in the alignment.

  • informative_nuc/prot(_trimmed)_sites.csv, table with the number shared parsimony informative sites between sequences.

  • variable_nuc/prot(_trimmed)_sites.csv, table with the number of shared variable sites between sequences.

When a --phenotype is included.

  • phenotype_specific_changes_nuc/prot_groups.csv, the node identifiers of homology groups with phenotype specific substitutions.

  • phenotype_specific_changes_nuc/prot.txt, the positions of phenotype specific substitutions in the alignments.

  • phenotype_disrupted_nuc/prot.txt, shows how many sequences of different phenotypes prevented a SNP/substitution from becoming phenotype specific.

Explore the pangenome

The functionalities on this page allow to actively explore the pangenome.

  • Retrieve regions from the pangenome

  • Retrieve sequences and functional annotations from homology groups

  • Search for genes using a gene name, functional annotation or database node identifier

  • Align homology groups or genomic regions

  • GO enrichment analysis

Locate Genes

Identify and compare gene clusters of neighbouring genes based on a set of homology groups. First, identifies the genomic position of genes in homology groups, retrieves the order of genes per genome and based on this construct the gene clusters. If homology groups with multiple genomes were selected, the gene cluster composition is compared between genomes. When a --phenotype is included, gene clusters can be found that only consist of groups of a certain phenotype.

For example, 100 groups were predicted as core in a pangenome of 5 genomes. The gene clusters are first identified per genome, after which it compares the gene order of one genome to all the other genomes. The result could be 75 groups with genes that are not only homologous but also share their gene neighbourhood. Another example, when accessory (present 2 in to 4 genomes) groups are given to this function in combination with a --phenotype (assigned to only two genomes), the function can return clusters that can only be found in the phenotype members.

Parameters

<databaseDirectory>

Path to the database root directory.

<homologyFile>

A text file with homology group node identifiers, seperated by a comma.

Options

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--phenotype/-p

A phenotype name, used to identify gene clusters shared by all phenotype members.

--nucleotides

The number of allowed nucleotides between two neighbouring genes (default is 1 MB).

--gap-open

When constructing the clusters, allow a number of genes for each cluster that are not originally part of the input groups (default: 0).

--core-threshold

Lower the threshold (%) for a group to be considered (soft) core (default is the total number of genomes found in the groups, not a percentage).

--ignore-duplications

Duplicated and co-localized genes no longer break up clusters.

Example commands
$ pantools locate_genes tomato_DB phenotype_groups.csv
$ pantools locate_genes --nucleotides=5000 --gap-open=1  tomato_DB unique_groups.csv
$ pantools locate_genes --ignore-duplications --core-threshold=95 tomato_DB accessory_groups.csv
Output files

Output files are stored in database_directory/locate_genes/

  • gene_clusters_by_position.txt, the identified gene clusters ordered by their position in the genome.

  • gene_clusters_by_size.txt, the identified gene clusters ordered from largest to smallest.

  • compare_gene_clusters, the composition of found gene clusters is compared to the other genomes. For each cluster, it shows which parts match other clusters and which parts do not. The file is not created when homology groups only contain proteins of a single genome (unique).

When a --phenotype is included

  • phenotype_clusters, homology group node identifiers from phenotype shared and specific clusters.

  • compare_gene_clusters_PHENOTYPE.txt, the same information as compare_gene_clusters but now the gene cluster comparison is only done between phenotype members.


Find genes

Find genes by name

Find your genes of interest in the pangenome by using the gene name and extract the nucleotide and protein sequence. To be able to find a gene, every letter of the given input must match a gene name. The search is not case sensitive. Performing a search with ‘sonic1’ as query will not be able find ‘sonic’, but is able to find Sonic1, SONIC1 or sOnIc1. Including the --extensive option allows a more relaxed search and using ‘sonic’ will now also find gene name variations as ‘sonic1’, ‘sonic3’ etc..

Be aware, for this function to work it is important that genomes are annotated by a method that follows the rules for genetic nomenclature. Gene naming can be inconsistent when different tools are used for genome annotation, making this functionality ineffective.

This function is the same as mlsa_find_genes but uses a different output directory. Several warnings (shown in the other manual) can be generated during the search. These warning are less relevant for this function as the genes are not required to be single copy-orthologous.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

``–genes``/``-g``

Required. One or multiple gene names, seperated by a comma.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--extensive

Perform a more extensive gene search.

Example commands
$ pantools find_genes_by_name --genes=dnaX,gapA,recA tomato_DB
$ pantools find_genes_by_name --extensive -g=gapA tomato_DB
Output files

Output files are stored in /database_directory/find_genes/by_name/. For each gene name that was included, a nucleotide and protein and .FASTA file is created with sequences found in all genomes.

  • find_genes_by_name.log, relevant information about the extracted genes: node identifier, gene location, homology group etc..


Find genes by annotation

Find genes of interest in the pangenome that share a functional annotation node and extract the nucleotide and protein sequence.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

Requires one of --functions|--nodes.

--functions

One or multiple function identifiers (GO, InterPro, PFAM, TIGRFAM), seperated by a comma.

--nodes/-n

One or multiple identifiers of function nodes (GO, InterPro, PFAM, TIGRFAM), seperated by a comma.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

Example commands
$ pantools find_genes_by_annotation --nodes=14928,25809 tomato_DB
$ pantools find_genes_by_annotation --functions=PF00005,GO:0000160,IPR000683,TIGR02499 tomato_DB
Output files

Output files are stored in /database_directory/find_genes/by_annotation/. For each function (node) that was included, a nucleotide and protein and .FASTA file is created with sequences from the genes that are connected to the node.

  • find_genes_by_annotation.log, relevant information about the extracted genes: node identifier, gene location, homology group etc..


Find genes in region

Find genes of interest in the pangenome that can be (partially) found within a given region (partially). For each found gene, relevant information, the nucleotide sequence and protein sequence is extracted.

Parameters

<databaseDirectory>

Path to the database root directory.

<regionsFile>

A text file containing genome locations with on each line: a genome number, sequence number, begin and end position, separated by a space.

Options

--partial

Also retrieve genes that only partially overlap the input regions.

Example input file

Each line must have a genome number, sequence number, begin and end positions that are separated by a space.

195 1 477722 478426
71 10 17346 18056
138 47 159593 160300
Example commands
$ pantools find_genes_in_region tomato_DB regions.txt
$ pantools find_genes_in_region --partial tomato_DB regions.txt
Output files

Output files are stored in /database_directory/find_genes/in_region/. For each region that was included, a nucleotide and protein and .FASTA file is created with sequences from the genes that are found within the region.

  • find_genes_in_region.log, relevant information about the extracted genes: node identifier, gene location, homology group etc..


Functional annotations

The following functions can only be used when any type of functional annotation is added to the database.

Show GO

For a selection of ‘GO’ nodes, retrieves connected ‘mRNA’ nodes, child and all parent GO terms that are higher in the GO hierarchy. This function follows the ‘is_a’ relationships of GO each node to their parent GO term until the ‘biological process’, ‘molecular function’ or ‘cellular location’ node is reached. This can be is useful in case InterProScan annotations were included, as these only add the most specific GO terms of the hierarchy to a sequence.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

Requires one of --functions|--nodes.

--functions

One or multiple GO term identifiers, seperated by a comma.

--nodes/-n

One or multiple identifiers of ‘GO’ nodes, seperated by a comma.

Example commands
$ pantools show_go --functions=GO:0000001,GO:0000002,GO:0008982 tomato_DB
$ pantools show_go --nodes=15078,15079 tomato_DB
Output file
  • show_go.txt, information of the selected GO node(s): the connected ‘mRNA’ nodes, the GO layer below, and all layers above.


Compare GO

Check if and how similar two given GO terms are. For both nodes, follows the ‘is_a’ relationships up to their parent GO terms until the ‘biological process’, ‘molecular function’ or ‘cellular location’ node is reached. After all parent terms are found, the shared GO terms and their location in the hierarchy is reported.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

Requires one of --functions|--nodes.

--functions

One or multiple GO term identifiers, seperated by a comma.

--nodes/-n

One or multiple identifiers of ‘GO’ nodes, seperated by a comma.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

Example commands
$ pantools compare_go --functions=GO:0032775,GO:0006313 tomato_DB
$ pantools compare_go --nodes=741487,741488 tomato_DB
Output file

Output files are stored in database_directory/function/

  • compare_go.txt, information of the two GO nodes: the connected ‘mRNA’ nodes, the GO layer below, all layers above and the shared GO terms between the two nodes.


Group info

Report all available information of one or multiple homology groups.

Parameters

<databaseDirectory>

Path to the database root directory.

<homologyFile>

A text file with homology group node identifiers, seperated by a comma.

Options

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--functions

Name of function identifiers from GO, PFAM, InterPro or TIGRAM. To find Phobius (P) or SignalP (S) annotations, include: ‘secreted’ (P/S), ‘receptor’ (P/S), or ‘transmembrane’ (P).

--genes/-g

One or multiple gene names, seperated by a comma.

--node

Retrieve the nucleotide nodes belonging to genes in homology groups

Example commands
$ pantools group_info yeast_DB core_groups.txt
$ pantools group_info --functions=GO:0032775,GO:0006313 --genes=budC,estP yeast_DB core_groups.txt
Output files

Output files are stored in database_directory/alignments/grouping_v?/groups/. For each homology group that was included, a nucleotide and protein and .FASTA file is created with sequences found in all genomes.

  • group_info.txt, relevant information for each homology group: number of copies per genome, gene names, mRNA node identifiers, functions, protein sequence lengths, etc..

  • group_functions.txt, full description of the functions found in homology groups

When function identifiers are included via --functions

  • groups_with_function.txt, homology group node identifiers from groups that match one of the input functions.

When gene names are included via --genes

  • groups_with_name.txt, homology group node identifiers from groups that match one of the input gene ames.


Matrix files

Several functions generate tables in a CSV file format. as tables that the following functions can work with. For example, ANI scores, k-mer and gene distance used for constructing the Neighbour Joining phylogenetic trees, and the identity and protein sequence similarity tables created by the alignment functions.

Order matrix

Transforms the CSV table to easy to read file by ordering the values in ascending order from low to high or descending order when --descending is included in the command. If phenotype information is included in the header, a separate file with the range of found values is created for each phenotype. If this information is not present (only genome numbers in the header), use rename_matrix to change the headers.

Parameters

<databaseDirectory>

Path to the database root directory.

<matrixFile>

A CSV formatted matrix file.

Options

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--descending

Order the matrix in descending order.

Example commands
$ pantools order_matrix bacteria_DB bacteria_DB/ANI/fastANI/ANI_distance_matrix.csv
$ pantools order_matrix --descending bacteria_DB bacteria_DB/ANI/fastANI/ANI_distance_matrix.csv
Output file

Output is written to the same directory as the selected input file

  • old file name’ + ‘_ORDERED’, ordered values of the original matrix file.

When phenotype information is present in the header

  • old file name’ + ‘_PHENOTYPE’, range of values per phenotype.


Rename matrix

Rename the headers (first row and leftmost column) of CSV formatted matrix files. If no --phenotype is included, headers are changed to only contain genome numbers.

Parameters

<databaseDirectory>

Path to the database root directory.

<matrixFile>

A matrix file with numerical values.

Options

--include/-i

Only include a selection of genomes in the new matrix file.

--exclude/-e

Exclude a selection of genomes from the new matrix file.

--phenotype/-p

A phenotype name, used to include phenotype information into the headers.

--[no-]numbers

In- or exclude genome numbers from the headers. Numbers are included by default.

Example commands
$ pantools rename_matrix pecto_DB pecto_DB/ANI/fastANI/ANI_distance_matrix.csv
$ pantools rename_matrix --no-numbers --phenotype=species pecto_DB pecto_DB/ANI/fastANI/ANI_distance_matrix.csv
Output file

Output is written to the same directory as the selected input file.

  • old file name’ + ‘_RENAMED’, the original matrix file with changed headers.


Retrieve regions, genomes or features

The two following functions allow users to retrieve genomic regions from the pangenome.

Retrieve regions

Retrieve the full genome sequence or genomic regions from the pangenome.

Parameters

<databaseDirectory>

Path to the database root directory.

<regionsFile>

A text file containing genome locations with on each line: a genome number, sequence number, begin and end positions separated by a space.

Example command
$ pantools retrieve_regions pecto_DB regions.txt
Example input

To extract:

  • Complete genome - Include a genome number

  • An entire sequence - Include a genome number with sequence number

  • A genomic region - Include a genome number, sequence number, begin and end positions that are separated by a space. Place a minus symbol behind the regions to extract the reverse complement sequence of the region.

1
1 1
1 1 1 10000
1 1 1000 1500 -
195 1 477722 478426
71 10 17346 18056 -
138 47 159593 160300 -
Output file

A single FASTA file is created for all given locations and is stored in the database directory.

Retrieve features

To retrieve the sequence of annotated features from the pangenome.

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--feature-type

Required. The feature name; for example ‘gene’, ‘mRNA’, ‘exon’, ‘tRNA’, etc.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

Example commands
$ pantools retrieve_features --feature-type=gene pecto_DB
$ pantools retrieve_features --feature-type=mRNA -i=1-5 pecto_DB
Output files

For each genome a FASTA file containing the retrieved features will be stored in the database directory. For example, genes.1.fasta contains all the genes annotated in genome 1.


Read mapping

Map

Map single or paired-end short reads to one or multiple genomes in the pangenome. One SAM or BAM file is generated for each genome included in the analysis.

Parameters

<databaseDirectory>

Path to the database root directory.

<genomeNumbers>

A text file containing genome numbers to map reads against in each line.

<shortReadFiles>

One or two short-read archives in FASTQ format, which can be gz/bz2 compressed.

Options

--threads/-t

Number of threads for MAFFT and IQ-tree, default is the number of cores or 8, whichever is lower.

--output/-o

Path to the output files (default is the database path).

--best-hits = none|all|random

In case of multiple “best” hits, return none, all best hits or a random best hit (Default: random).

--all-hits

Return all hits rather than only the best.

--competitive

Find the best mapping location in the complete pangenome
(default: find the best location for each genome).

--previous-run

The mapping_summary.txt file from a previous mapping run (random-best competitive mode) for a better estimation of coverage in a metagenomic setting.

--out-format = SAM|BAM|none

Writes the alignment files in BAM or SAM format or don’t write any output files (default: SAM).

--gap-open

Gap open penalty (range: [-50..-1], default: -20).

--gap-extension

Gap extension penalty (range: [-5..-1], default: -3).

--interleaved

Process the fastq file as an interleaved paired-end archive.

--unmapped

Check unmapped genomes.

Options that influence the mapping sensitivity

--sensitivity/-s = very-fast|fast|sensitive|very-sensitive

Four settings that automatically set the parameters controlling the sensitivity, ranging from least to most sensitive.

--min-identity

The minimum acceptable identity of the alignment
(default: 0.5, range: [0,1]).

--alignment-band

The length of bound of banded alignment
(default: 5, range: [1..100]).

--min-hit-length

The minimum acceptable length of alignment after soft-clipping
(default: 13, range: [10..100]).

--max-num-locations

The maximum number of locations of candidate hits to examine
(default: 15, range: [1..100]).

--max-alignment-length

The maximum acceptable length of alignment
(default: 2.000, range: [50..5.000]).

--max-fragment-length

The maximum acceptable length of fragment
(default: 4998, range: [50..5000]).

--num-kmer-samples

The number of kmers sampled from read
(default: 15, range: [1..r-k+1]).

--clipping-stringency

The stringency of soft-clipping (default: 1).
0 : no soft clipping
1 : low
2 : medium
3 : high

Example input files

FASTQ file

@SRR13153715.1 1/1
TGGTCATACAGCAAAGCATAATTGTCACCATTACTATGGCAATCAAGCCAGCTATAAAACCTAGCCAAATGTACCATGGCCATTTTATATACTGCTCATACTTTCCAAGTTCTTGGAGATCGAT
+
EEEEEEEEEEEEEEEAEEEE/EEEEE/AEEEEEEEEEEEEEE/EE/EEE/<EEEEEEE/EEEEEEEEEEEEEAEEEEEAEEEEEAEEAEEEEEEA<AAAEEAEEA<EE/EEEEAEAEA/EEAA/

Genome numbers file

1
2
5
Example commands
$ pantools map arabidopsis_DB genome_numbers.txt ERR031564_1.fastq
$ pantools map --include=1-5 --sensitivity=sensitive arabidopsis_DB genome_numbers.txt ERR031564_1.fastq
$ pantools map --competitive -m=all-bests arabidopsis_DB genome_numbers.txt ERR031564_1.fastq
$ pantools map --interleaved  arabidopsis_DB genome_numbers.txt interleaved_reads.fastq
$ pantools map arabidopsis_DB genome_numbers.txt ERR031564_1.fastq ERR031564_2.fastq
Output files
  • mapping_summary.txt, number of mapped and unmapped reads per genome

  • One SAM or BAM file is generated for each genome included in the analysis.


Querying the pangenome

Cypher is Neo4j’s graph query language that lets you ask specific questions or retrieve data from the graph database. The Cypher query language depicts patterns of nodes and relationships and filters those patterns based on labels and properties. While using node and relationship patterns in databases queries may seem a little daunting, it is easy to pick up! This page contains some example queries to help you get started. Feel free to email us if you have any question regarding Cypher queries.

More information on Neo4j and the Cypher language:

Match and return 100 nucleotide nodes

MATCH (n:nucleotide) RETURN n LIMIT 100

Find all the genome nodes

MATCH (n:genome) RETURN n

Retrieve the pangenome node

MATCH (n:pangenome) RETURN n

Match and return 100 genes

MATCH (g:gene) RETURN g LIMIT 100

Match and return 100 genes and order them by length

MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

The same query as before but results are now returned in a table

MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

Return genes which are between 100 and 250 bp. This can also be applied to other features such as exons introns or CDS.

MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100

Find genes located on first genome

MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

Find genes located on first genome and first sequence

MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

Obtain genes between 100 and 250 nucleotides

MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN *

Return pfam identifiers for genes between 100 and 250 nucleotides long

match (n:mRNA)--(m:pfam) where n.length > 100 and n.length < 150 return m.id

Return all genes for a specific contig and count them

MATCH (n:gene) WHERE n.address[0] = 1 and n.address[1] = 1 RETURN count(n)

Return all genes genes between 1000-1500 nucleotides and order them by length

MATCH (n:gene) WHERE n.length > 1000 and n.length < 1500 RETURN n order by n.length DESC

Returns the homology group matching your gene of interest

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'GENE\_NAME' RETURN *

Returns the genes of genome 1 that don’t have a homolog in a the other genome

MATCH (n:homology_group)--(m:mRNA)--(g:gene) where n.num_members = 1 and g.genome = 1 RETURN g

Retrieve unique GO identifiers for mRNA’s with a signal peptide

MATCH (m:mRNA)--(g:GO) where m.signalp_signal_peptide = true RETURN DISTINCT m.id, g.id

Return all sequence nodes for a specific contig

MATCH (n)-[r]->() WHERE exists (r.'a1\_1') and (n:degenerate or n:node) RETURN id(n), n.sequence , r.'a1\_1'

Return all sequence nodes for a specific contig within the range of position 1000 and 2000

MATCH (n)-[r]->() WHERE exists (r.'a1\_1') and (n:degenerate or n:node) and r.'a1'\_1[0] > 1000 and r.'a1\_1'[0] < 2000 RETURN id(n), n.sequence, r.'a1\_1'

Find SNP bubbles in the graph. For simplification we only use the FF relation

MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50

Differences between pangenome and panproteome

PanTools offers functionalities to build and analyze a pangenome or panproteome.

A pangenome is constructed from genome and annotation files. First, genome sequences are k-merized and compressed into a De Bruijn graph. Genes and other annotation features from annotation files are integrated into the pangenome as ‘gene’, ‘mRNA’ and ‘CDS’ nodes. Gene start and stop positions are annotated in the graph as relationships and connect the annotation layer to the nucleotide layer. The protein sequences can be clustered into homology groups and connect homologous proteins from different genomes.

A panproteome is built from protein sequences only, ignoring the underlying genome structure. Again, the protein sequences are clustered into homology groups which serve as main input for many functionalities.

In addition to the single layer in panproteomes and three layers in pangenomes, a functional layer can be included in both databases. This layer consists of multiple functional annotation databases (e.g. GO, PFAM) and connects proteins with a shared function.

Since there is only a protein layer and functional layer present in panproteomes, not all functions can be utilized. See the table below for which functions can be used for pangenomes and panproteomes.

_images/layers.png

Schematic of genome, annotation, and protein layer of a pangenome database. Figure taken from Efficient inference of homologs in large eukaryotic pan-proteomes

Available functions

Construct a pangenome

Function

Pangenome

Panproteome

Build pangenome

YES

NO

Build panproteome

NO

YES

Add annotations

YES

NO

Add genomes

YES

NO

Group

YES

YES

Optimal grouping

YES

YES

Change grouping

YES

YES

BUSCO protein

YES

YES

Add phenotype

YES

YES

Add functional annotations

YES

YES

Add antiSMASH

YES

NO

Remove nodes

YES

YES

Move or remove grouping

YES

YES

Pangenome characterization

Function

Pangenome

Panproteome

Statistics

YES

YES

Gene classification

YES

YES

Core unique thresholds

YES

YES

Grouping overview

YES

YES

Pangenome structure for homology groups

YES

YES

Pangenome structure for k-mers

YES

NO

K-mer classification

YES

NO

Functional classification

YES

YES

Functional annotation overview

YES

YES

Explore the pangenome

Function

Pangenome

Panproteome

Locate genes

YES

NO

mRNAs connected to function

YES

NO

Find gene

YES

NO

GO enrichment

YES

YES

Show GO

YES

YES

Compare GO

YES

YES

Compare BGC

YES

NO

Alignment of homology group

YES

YES

Alignment of multiple homology groups

YES

YES

Alignment of genomic regions

YES

NO

Order matrix

YES

YES

Rename matrix

YES

YES

Retrieve genomes

YES

NO

Retrieve regions

YES

NO

Retrieve features

YES

NO

Phylogeny

Function

Pangenome

Panproteome

Core SNP tree

YES

YES

K-mer distance tree

YES

NO

Gene distance tree

YES

YES

ANI tree

YES

NO

MLSA

YES

NO

Rename phylogeny

YES

YES

Create tree template

YES

YES

Read mapping

Function

Pangenome

Panproteome

Map

YES

NO

Part 1. Install PanTools

For instructions on how to install PanTools, see Installing and configuring the required software.

Part 2. Build your own pangenome using PanTools

To demonstrate the main functionalities of PanTools we use a small chloroplasts dataset to avoid long construction times.

Genome

Chloroplast genome

Accession

Length

Genes

tRNAs

1

Cucumis sativus (cucumber)

NC_007144.1

155,293 bp

85

37

2

Oryza sativa Indica 93-11 (rice)

NC_008155.1

134,496 bp

100

40

3

Solanum lycopersicum (tomato)

NC_007898.3

155,461 bp

87

45

4

Solanum tuberosum (potato)

NC_008096.2

155,296 bp

84

45

5

Zea mays (maize)

NC_001666.2

140,384 bp

111

38

Download the chloroplast fasta and gff files here or via wget.

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar -xvzf chloroplasts.tar.gz #unpack the archive

We assume a PanTools alias was set during the installation. This allows PanTools to be executed with pantools rather than the full path to the jar file. If you don’t have an alias, either set one or replace the pantools command with the full path to the .jar file in the tutorials.


BUILD, ANNOTATE and GROUP

We start with building a pangenome using four of the five chloroplast genomes. For this you need a text file which directs PanTools to the FASTA files. Call your text file genome_locations.txt and include the following lines:

YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta

Make sure that ‘YOUR_PATH’ is the full path to the input files! Then run PanTools with the build_pangenome function and include the text file

$ pantools build_pangenome chloroplast_DB genome_locations.txt

Did the program run without any error messages? Congratulations, you’ve built your first pangenome! If not? Make sure your Java version is up to date and kmc is executable. The text file should only contain full paths to FASTA files, no additional spaces or empty lines.

Adding additional genomes

PanTools has the ability to add additional genomes to an already existing pangenome. To test the function of PanTools, prepare a text file containing the path to the Maize chloroplast genome. Call your text file fifth_genome_location.txt and include the following line to the file:

YOUR_PATH/Z_mays.fasta

Run PanTools on the new text file and use the add_genomes function

$ pantools add_genomes chloroplast_DB fifth_genome_location.txt

Adding annotations To include gene annotations to the pangenome, prepare a text file containing paths to the GFF files. Call your text file annotation_locations.txt and include the following lines into the file:

1 YOUR_PATH/C_sativus.gff3
2 YOUR_PATH/O_sativa.gff3
3 YOUR_PATH/S_lycopersicum.gff3
4 YOUR_PATH/S_tuberosum.gff3
5 YOUR_PATH/Z_mays.gff3

Run PanTools using the add_annotations function and include the new text file

$ pantools add_annotations --connect chloroplast_DB annotation_locations.txt

PanTools attached the annotations to our nucleotide nodes so now we can cluster them.

Homology grouping

PanTools can infer homology between the protein sequences of a pangenome and cluster them into homology groups. Multiple parameters can be set to influence the sensitivity but for now we use the group functionality with default settings.

$ pantools group chloroplast_DB

Adding phenotypes (requires PanTools v3)

Phenotype values can be Integers, Double, String or Boolean values. Create a text file phenotypes.txt.

Genome,Solanum
1,false
2,false
3,true
4,true
5,false

And use add_phenotypes to add the information to the pangenome.

$ pantools add_phenotypes chloroplast_DB phenotypes.txt

RETRIEVE functions

Now that the construction is complete, lets quickly validate if the construction was successful and the database can be used. To retrieve some genomic regions, prepare a text file containing genomic coordinates. Create the file regions.txt and include the following for each region: genome number, contig number, start and stop position and separate them by a single space

1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000

Now run the retrieve_regions function and include the new text file

$ pantools retrieve_regions chloroplast_DB regions.txt

Take a look at the extracted regions that are written to the chloroplast_DB/retrieval/regions/ directory.

To retrieve entire genomes, prepare a text file genome_numbers.txt and include each genome number on a separate line in the file

1
3
5

Use the retrieve_regions function again but include the new text file

$ pantools retrieve_regions chloroplast_DB genome_numbers.txt

Genome files are written to same directory as before. Take a look at one of the three genomes you have just retrieved.

In part 3 of the tutorial we explore the pangenome you just built using the Neo4j browser and the Cypher language.

Part 3. Explore the pangenome using the Neo4j browser

Did you skip part 2 of the tutorial or were you unable to build the chloroplast pangenome? Download the pre-constructed pangenome here or via wget.

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
$ tar -xvzf chloroplast_DB.tar.gz

Configuring Neo4j

Set the full path to the chloroplast pangenome database by opening neo4j.conf (’neo4j-community-3.5.30/conf/neo4j.conf’) and include the following line in the config file. Please make sure there is always only a single uncommented line with ‘dbms.directories.data’.

#dbms.directories.data=/YOUR_PATH/any_other_database
dbms.directories.data=/YOUR_PATH/chloroplast_DB
Allowing non-local connections
To be able to run Neo4j on a server and have access to it from anywhere, some additional lines in the config file must be changed.
  • Uncomment the four following lines in neo4j-community-3.5.30/conf/neo4j.conf.

  • Replace 7686, 7474, and 7473 by three different numbers that are not in use by other people on your server. In this way, everyone can have their own database running at the same time.

#dbms.connectors.default_listen_address=0.0.0.0
#dbms.connector.bolt.listen_address=:7687
#dbms.connector.http.listen_address=:7474
#dbms.connector.https.listen_address=:7473

Lets start up the Neo4j server!

$ neo4j start

Start Firefox (or a web browser of your own preference) and let it run on the background.

$ firefox &

In case you did not change the config to allow non-local connections, browse to http://localhost:7474. Whenever you did change the config file, go to server_address:7474, where 7474 should be replaced with the number you chose earlier.

If the database startup was successful, a login terminal will appear in the webpage. Use ‘neo4j’ both as username and password. After logging in, you are requested to set a new password.


Exploring nodes and edges in Neo4j

Go through the following steps to become proficient in using the Neo4j browser and the underlying PanTools data structure. If you have any difficulty trouble finding a node, relationship or any type of information, download and use this visual guide.

  1. Click on the database icon on the left. A menu with all node types and relationship types will appear.

  2. Click on the ‘gene’ button in the node label section. This automatically generated a query. Execute the query.

  3. The LIMIT clause prevents large numbers of nodes popping up to avoid your web browser from crashing. Set LIMIT to 10 and execute the query.

  4. Hover over the nodes, click on them and take a look at the values stored in the nodes. All these features (except ID) were extracted from the GFF annotation files. ID is an unique number automatically assigned to nodes and relationships by Neo4j.

  5. Double-click on the matK gene node, all nodes with a connection to this gene node will appear. The nodes have distinct colors as these are different node types, such as mRNA, CDS, nucleotide. Take a look at the node properties to observe that most values and information is specific to a certain node type.

  6. Double-click on the matK mRNA node, a homology_group node should appear. These type of nodes connect homologous genes in the graph. However, you can see this gene did not cluster with any other gene.

  7. Hover over the start relation of the matK gene node. As you can see information is not only stored in nodes, but also in relationships! A relationship always has a certain direction, in this case the relation starts at the gene node and points to a nucleotide node. Offset marks the location within the node.

  8. Double-click on the nucleotide node at the end of the ‘start’ relationship. An in- and outgoing relation appear that connect to other nucleotide nodes. Hover over both the relations and compare them. The relations holds the genomic coordinates and shows this path only occurs in contig/sequence 1 of genome 1.

  9. Follow the outgoing FF-relationship to the next nucleotide node and expand this node by double-clicking. Three nodes will pop up this time. If you hover over the relations you see the coordinates belong to other genomes as well. You may also notice the relationships between nucleotide nodes is always a two letter combination of F (forward) and R (reverse) which state if a sequence is reverse complemented or not. The first letter corresponds to the sequence of the node at the start of the relation where the second letters refers to the sequence of the end node.

  10. Finally, execute the following query to call the database scheme to see how all node types are connected to each other: CALL db.schema(). The schema will be useful when designing your own queries!


Query the pangenome database using CYPHER

Cypher is a declarative, SQL-inspired language and uses ASCII-Art to represent patterns. Nodes are represented by circles and relationships by arrows.

  • The MATCH clause allows you to specify the patterns Neo4j will search for in the database.

  • With WHERE you can add constraints to the patterns described.

  • In the RETURN clause you define which parts of the pattern to display.

Cypher queries

Match and return 100 nucleotide nodes

MATCH (n:nucleotide) RETURN n LIMIT 100

Find all the genome nodes

MATCH (n:genome) RETURN n

Find the pangenome node

MATCH (n:pangenome) RETURN n

Match and return 100 genes

MATCH (g:gene) RETURN g LIMIT 100

Match and return 100 genes and order them by length

MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

The same query as before but results are now returned in a table

MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

Return genes which are longer as 100 but shorter than 250 bp (this can also be applied to other features such as exons introns or CDS)

MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100

Find genes located on first genome

MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

Find genes located on first genome and first sequence

MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

Homology group queries

Return 100 homology groups

MATCH (h:homology_group) RETURN h LIMIT 100

Match homology groups which contain two members

MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h

Match homology groups and ‘walk’ to the genes and corresponding start and end node

MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25

Turn off autocomplete by clicking on the button on the bottom right. The graph falls apart because relations were not assigned to variables.

The same query as before but now the relations do have variables

MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25

When you turn off autocomplete again only the ‘is_similar_to’ relation disappears since we did not call it

Find homology group that belong to the rpoC1 gene

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *

Find genes on genome 1 which don’t show homology

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *

Structural variant detection

Find SNP bubbles (for simplification we only use the FF relation)

MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50

The same query but returning the results in a table

MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50

Functions such as count(), sum() and stDev() can be used in a query.

The same SNP query but count the hits instead of displaying them

MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)

Hopefully you know have some feeling with the Neo4j browser and cypher and you’re inspired to create your own queries!

When you’re done working in the browser, close the database (by using the command line again).

$ neo4j stop
More information on Neo4j and the cypher language:

In part 4 of the tutorial we explore some of the functionalities to analyze the pangenome.

Part 4. Characterization

Part 4 preparation

PanTools v3 is required to follow this part of the tutorial. In addition, MAFFT and R (and a few packages) need to be installed and set to your $PATH. Everything should already be correctly installed if you use the conda environment. Validate if the tools are executable by using the following commands.

$ Rscript --help
$ mafft -h

We assume a PanTools alias was set during the installation. This allows PanTools to be executed with pantools rather than the full path to the jar file. If you don’t have an alias, either set one or replace the pantools command with the full path to the .jar file in the tutorials.

Input data

Genome

Name

Accession

Length

Sequences

Genes

1

  1. odoriferum Q166

GCF_002904195.1

5.09 Mb

66

4510

2

  1. fontis M022

GCF_000803215.1

4.15 Mb

107

3723

3

  1. polaris S4.16.03.2B

GCF_003595035.1

4.86 Mb

65

4442

4

  1. brasiliense S2

GCF_000808375.1

4.84 Mb

37

4367

5

  1. brasiliense Y49

GCF_000808115.1

4.70 Mb

31

4231

6

  1. dadantii 3937

GCF_000147055.1

4.92 Mb

1

4281

To demonstrate how to use the PanTools functionalities we use a small dataset of six bacteria to avoid long runtimes. Download a pre-constructed pangenome or test your new skills and construct a pangenome yourself using the fasta and gff files.

Option 1: Download separate genome and annotation files

$ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_input.tar.gz
$ tar -xvzf pecto_dickeya_input.tar.gz
$ gzip -d pecto_dickeya_input/annotations/*
$ gzip -d pecto_dickeya_input/genomes/*
$ gzip -d pecto_dickeya_input/functions/*

$ pantools build_pangenome pecto_dickeya_DB pecto_dickeya_input/genomes.txt
$ pantools add_annotations --connect pecto_dickeya_DB pecto_dickeya_input/annotations.txt
$ pantools group --relaxation=4 pecto_dickeya_DB

Option 2: Download the pre-constructed pangenome

$ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
$ tar -xvzf pecto_dickeya_DB.tar.gz

Adding phenotype/metadata to the pangenome

Before starting with the analysis, we will add some phenotype data to the pangenome. Phenotypes allow you to find similarities for a group of genomes sharing a phenotype as well as identifying variation between different phenotypes. Below is a textfile with data for three phenotypes. The third phenotype, low_temperature, is in this case a made up example! It states whether the strain is capable of growing on (extreme) low temperatures. The phenotype file can be found inside the database directory or create a new file using the text from the box below. Add the phenotype information to the pangenome using add_phenotypes.

Genome, species, strain_name, low_temperature
1,P. odoriferum,P. odoriferum Q166, false
2,P. fontis, P. fontis M022, true
3,P. polaris,P. polaris S4.16.03.2B, false
4,P. brasiliense, P. brasiliense S2, true
5,P. brasiliense, P. brasiliense Y49, false
6,D. dadantii, D. dadantii 3937,?
$ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_input/phenotypes.txt

Metrics and general statistics

After building or uncompressing the pangenome, run the metrics functionality to produce various statistics that should verify an errorless construction.

$ pantools metrics pecto_dickeya_DB

Open metrics_per_genome.csv with a spreadsheet tool (Excel, Libreoffice, Google sheets) and make sure the columns are split on commas. You may easily notice the many empty columns in this table as these type of annotations or features are not included in the database (yet). Functional annotations are incorporated later in this tutorial. Columns for features like exon and intron will remain empty as bacterial coding sequences are not interrupted.


Gene classification

With the gene_classification functionality you are able to organize the gene repertoire into the core, accessory or unique part of the pangenome.

  • Core, a gene is present in all genomes

  • Unique, a gene is present in a single genome

  • Accessory, a gene is present in some but not all genomes

$ pantools gene_classification pecto_dickeya_DB
_images/classify_table.png

Take a look in gene_classification_overview.txt. Here you can find the number of classified homology groups and genes on a pangenome level but also for individual genomes.

Open additional_copies.csv with a spreadsheet tool. This file can be useful to identify duplicated genes in relation to other genomes.

The default criteria to call a group core is presence in all genomes where unique is only allowed to be present in one genome. These two categories are highly influenced by annotation quality, especially in large pangenomes. Luckily, the threshold for core and unique groups can easily be adjusted. Let’s consider genes to be core when present in only five of the six genomes by setting the --core-threshold argument.

$ pantools gene_classification --core-threshold=85 pecto_dickeya_DB

Look in gene_classification_overview.txt again to observe the increase of core groups/genes at the cost of accessory groups.

For this pangenome, the Dickeya genome is considered an outgroup to the five Pectobacterium genomes. While this outgroup is needed to root and analyze phylogenetic trees (tutorial part 5), it affects the number classified groups for the all other genomes. Use --include or --exclude to exclude the Dickeya genome.

$ pantools gene_classification --include=1-5 pecto_dickeya_DB
$ pantools gene_classification --exclude=6 pecto_dickeya_DB

Take a look in gene_classification_overview.txt one more time to examine the effect of excluding this genome. The total number of groups in the analysis is lower now but the number of core and unique genes have increased for the five remaining genomes.

When phenotype information is used in the analysis, three additional categories can be assigned to a group:

  • Shared, a gene present in all genomes of a phenotype

  • Exclusive, a gene is only present in a certain phenotype

  • Specific, a gene present in all genomes of a phenotype and is also exclusive

Include a --phenotype argument to find genes that are exclusive for a certain species.

$ pantools gene_classification --phenotype=species pecto_dickeya_DB

Open gene_classification_phenotype_overview.txt to see the number of classified groups for the species phenotype.

Open phenotype_disrupted.csv in a spreadsheet tool. This file explains exactly why a homology groups is labeled as phenotype shared and not specific.

Open phenotype_additional_copies.csv in a spreadsheet tool. Similarly to phenotype_additional.csv this file shows groups where all genomes of a certain phenotype have additional gene copies to (at least one of) the other phenotypes.

Each time you run the gene_classification function, multiple files are created that contain node identifiers of a certain homology group category. These files can be given to other PanTools functions for a downstream analysis, for example, sequence alignment, phylogeny, or GO enrichment. We will use one of the files later in this tutorial.


Pangenome structure

With the previous functionality we identified the core, accessory and unique parts of the pangenome. Now we will use the pangenome_size_genes function to observe how these numbers are reached by simulating the growth of the pangenome. Simulating the growth helps explaining if a pangenome should be considered open or closed. An pangenome is called open as long as a significant number of new (unique) genes are added to the total gene repertoire. The openness of a pangenome is usually tested using Heap’s law. Heaps’ law (a power law) can be fitted to the number of new genes observed when increasing the pangenome by one random genome. The formula for the power law model is n = k x N-a, where n is the newly discovered genes, N is the total number of genomes, and k and a are the fitting parameters. A pangenome can be considered open when a < 1 and closed if a > 1.

The outcome of the function can again be controlled through command line arguments. Genomes can be excluded from the analysis with --exclude. You can set the number of iterations with --loops.

$ pantools pangenome_structure pecto_dickeya_DB

The current test set of six bacterial genomes is not representative of a full-sized pangenome. Therefore we prepared the results for the structure simulation on a set of 197 Pectobacterium genomes. The runtime of the analysis using 10.000 loops and 24 threads was 1 minute and 54 seconds. Download the files here, unpack the archive and take a look at the files.

$ wget wget http://bioinformatics.nl/pangenomics/tutorial/pectobacterium_structure.tar.gz
$ tar -xvf pectobacterium_structure.tar.gz

Normally you still have to run the R scripts to create the output figures and determine the openness of the pangenome.

cd pectobacterium_structure
$ Rscript pangenome_growth.R
$ Rscript gains_losses_median_and_average.R
$ Rscript heaps_law.R

Take a look at the plot. In core_accessory_unique_size.png, the number of classified groups are plotted for any of the genome combination that occurred during the simulation. For the core_accessory_size.png plots, the number of unique groups is combined with accessory groups.

The gains_losses.png files display the average and mean group gain and loss between different pangenome sizes. The line of the core starts below zero, meaning for every random genome added, the core genome decreases by a number of X genes.

The alpha value of heaps' law is written to heaps_law_alpha.txt. Click here to reveal the result.


With an alpha of 0.53 the Pectobacterium pangenome has an open structure, which is typical for many bacterial species due their extensive exchange of genetic information. Performing this analysis on a set of closely related animal or plant genomes usually results in a closed pangenome, indicating a limited gene pool.


Functional annotations

PanTools is able to incorporate functional annotations into the pangenome by reading output of various functional annotation tools. In this tutorial we only include annotations from InterProScan. Please see the add_functions manual to check which other tools are available. To include the annotations, create a file functions.txt using text from the box below and add it to the command line argument.

1 YOUR_PATH/GCF_002904195.1.gff3
2 YOUR_PATH/GCF_000803215.1.gff3
3 YOUR_PATH/GCF_003595035.1.gff3
4 YOUR_PATH/GCF_000808375.1.gff3
5 YOUR_PATH/GCF_000808115.1.gff3
6 YOUR_PATH/GCA_000147055.1.gff3
$ pantools add_functions pecto_dickeya_DB functions.txt

PanTools will ask you to download the InterPro database. Follow the steps and execute the program again.

The complete GO, PFAM, Interpro and TIGRFAM, databases are now integrated in the graph database after. Genes with a predicted function have gained a relationship to that function node. Retrieving a set of genes that share a function is now possible through a simple cypher query. If you would run metrics again, statistics for these type functional annotations are calculated. To create a summary table for each type of functional annotation, run function_overview.

$ pantools function_overview pecto_dickeya_DB

In function_overview_per_group.csv you can navigate to a homology group or gene to see the connected functions. You can also search in the opposite direction, use one of the created overview files for a type of functional annotation and quickly navigate to a function of interest to find which genes are connected.

GO enrichment

We go back to the output files from gene classification that only contain node identifiers. We can retrieve group functions by providing one the files to group_info with the --homology-groups argument. However, interpreting groups by assessing each one individually is not very practical. A common approach to discover interesting genes from a large set is GO-enrichment. This statistical method enables the identification of genes sharing a function that are significantly over or under-represented in a given gene set compared to the rest of the genome. Let’s perform a GO enrichment on homology groups of the core genome.

Phenotype: P._brasiliense, 2 genomes, threshold of 2 genomes
1278537,1282642,1283856,1283861,1283862,1283869,1283906,1283921,1283934,1283941,1283945,1283946
$ pantools group_info pecto_dickeya_DB brasiliense_groups.csv
$ pantools go_enrichment pecto_dickeya_DB brasiliense_groups.csv

Open go_enrichment.csv with a spreadsheet tool. This file holds GO terms found in at least one of the genomes, the p-value of the statistical test and whether it is actually enriched after the multiple testing correction. as this is different for each genome a function might enriched in one genome but not in another.

A directory with seperate output files is created for each genome, open go_enrichment.csv for the genome 4 or 5 in a spreedsheet. Also take a look at the PDF files that visualize part of the Gene ontology hierarchy.

Classifying functional annotations

Similarly to classifying gene content, functional annotations can be categorized using functional_classification. This tool provides an easy way to identify functions shared by a group of genomes of a certain phenotype but can also be used to identify core or unique functions. The functionality uses the same set of arguments as gene_classification. You can go through the same steps again to see the effect of changing the arguments.

$ pantools functional_classification pecto_dickeya_DB
$ pantools functional_classification --core-threshold=85 pecto_dickeya_DB
$ pantools functional_classification --exclude=6 pecto_dickeya_DB
$ pantools functional_classification --phenotype=species pecto_dickeya_DB

Sequence alignment

In the final part of this tutorial we will test the alignment function by aligning homology groups. PanTools is able to align genomic regions, genes and proteins to identify SNPs or amino acid changes with msa.

Start with the alignment of protein sequences from the 12 P. brasiliense specific homology groups.

$ pantools msa --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB

Go to the pecto_dickeya_DB/alignments/grouping_v1/groups/ directory and select one of homology groups and check if you can find the following files

  • The alignments are written to prot_trimmed.fasta and prot_trimmed.afa.

  • A gene tree is written to prot_trimmed.newick

  • prot_trimmed_variable_positions.csv located in the var_inf_positions subdirectory. This matrix holds every variable position of the alignment; the rows are the position in the alignment and the columns are the 20 amino acids and gaps.

  • The identity and similarity (for proteins) is calculated between sequences and written to tables in the similarity_identity subdirectory.

Run the function again but include the --no-trimming argument.

$ pantools msa --no-trimming --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB

The output files are generated right after the first alignment without trimming the sequences first. The file names differ from the trimmed alignments by the ’_trimmed’ substring.

Run the function again but exclude the --mode=protein and --no-trimming arguments. When no additional arguments are included to the command, both nucleotide and protein sequences are aligned two consecutive times.

$ pantools msa --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB

Again, the same type of files are generated but the output files from nucleotide sequence can be recognized by the ‘nuc_’ substrings. The matrix in nuc_trimmed_variable_positions.csv now only has columns for the four nucleotides and gaps.

Finally, run the function one more time but include a phenotype. This allows you to identify phenotype specific SNPs or amino acid changes.

$ pantools msa --no-trimming --phenotype=low_temperature -H=brasiliense_groups.csv pecto_dickeya_DB

Open the nuc- or prot_trimmed_phenotype_specific_changes.info file inside one of the homology group output directories.


Besides the functionalities in this tutorial, PanTools has more useful functions that may aid you in retrieving more specific information from the pangenome.

  • Identify shared k-mers between genomes with kmer_classification.

  • Find co-localized genes in a set of homology groups: locate_genes.

  • Mapping short reads against the pangenome with map.

In part 5 of the tutorial we explore some of the phylogenetic methods implemented in PanTools.

Part 5. Phylogeny

Part 5 preparation

Pantools v3 is required to follow this part of the tutorial. In addition, MAFFT, FastTree, IQ-tree, R (and the ape R package) need to be installed and set to your $PATH. Validate if the tools are executable by using the following commands.

pantools --version
Rscript --help
mafft -h
iqtree -h
fasttree -h

If you did not follow part 4 of the tutorial, download the pre-constructed pangenome here.

$ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
$ tar -xvzf pecto_dickeya_DB.tar.gz

Adding phenotype/metadata to the pangenome

Before we construct the trees, we will add some phenotype data to the pangenome. Once the we have a phylogeny, the information can be included or be used to color parts of the tree. Below is a textfile with data for three phenotypes. The third phenotype, low_temperature, is in this case a made up example! It states whether the strain is capable of growing on (extreme) low temperatures. The phenotype file can be found inside the database directory, add the information to the pangenome by using add_phenotypes.

Genome, species, strain_name, low_temperature
1,P. odoriferum,P. odoriferum Q166, false
2,P. fontis, P. fontis M022, true
3,P. polaris,P. polaris S4.16.03.2B, false
4,P. brasiliense, P. brasiliense S2, true
5,P. brasiliense, P. brasiliense Y49, false
6,D. dadantii, D. dadantii 3937,?
$ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt

Constructing a phylogeny

In this tutorial we will construct three phylogenies, each based on a different type of variation: SNPs, genes and k-mers. Take a look at the phylogeny manuals to get an understanding how the three methods work and how they differ from each other.

  1. Core phylogeny

  2. Gene distance tree

  3. K-mer distance tree

Core SNP phylogeny

The core SNP phylogeny will run various Maximum Likelihood models on parsimony informative sites of single-copy orthologous sequences. A site is parsimony-informative when there are at least two types of nucleotides that occur with a minimum frequency of two. The informative sites are automatically identified by aligning the sequences; however, it does not know which sequences are single-copy orthologous. You can identify these conserved sequences by running gene_classification.

$ pantools gene_classification --phenotype=species pecto_dickeya_DB

Open gene_classification_overview.txt and take a look at statistics. As you can see there are 2134 single-copy ortholog groups. Normally, all of these groups are aligned to identify SNPs but for this tutorial we’ll make a selection of only a few groups to accelerate the steps. You can do this in two different ways:

Option 1: Open single_copy_orthologs.csv and remove all node identifiers after the first 20 homology groups and save the file.

$ pantools core_phylogeny --mode=ML pecto_dickeya_DB

Option 2: Open single_copy_orthologs.csv and select the first 20 homology_group node identifiers. Place them in a new file sco_groups.txt and include this file to the function.

$ pantools core_snp_tree --mode=ML -H=sco_groups.txt pecto_dickeya_DB

The sequences of the homology groups are being aligned two consecutive times. After the initial alignment, input sequences are trimmed based on the longest start and end gap of the alignment. The parsimony informative positions are taken from the second alignment and concatenated into a sequence. When opening informative.fasta you can find 6 sequences, the length of the sequences being the number of parsimony-informative sites.

$ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_snp_tree/informative.fasta -redo -bb 1000

IQ-tree generates several files, the tree that we later on in the tutorial will continue with is called informative.fasta.treefile. When examining the informative.fasta.iqtree file you can find the best fit model of the data. This file also shows the number of sites that were used, as sites with gaps (which IQ-tree does not allow) were changed into singleton or constant sites.

Gene distance tree

To create a phylogeny based on gene distances (absence/presence), we can simply execute the Rscript that was created by gene_classification.

$ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R

The resulting tree is called gene_distance.tree.

K-mer distance tree

To obtain a k-mer distance phylogeny, the k-mers must first be counted with the kmer_classification function. Afterwards, the tree can be constructed by executing the Rscript.

$ pantools kmer_classification pecto_dickeya_DB
$ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R

The resulting tree is written to genome_kmer_distance.tree.


Renaming tree nodes

So far, we used three different types of distances (SNPs, genes, k-mers), and two different methods (ML, NJ) to create three phylogenetic trees. First, lets take a look at the text files. The informative.fasta.treefile only contain genome numbers, bootstrap values and branch lengths but is lacking the metadata. Examining gene_distance.tree file also shows this information but the species names as well, because we included this as a phenotype during gene_classification.

Let’s include the strain identifiers to the core snp tree to make the final figure more informative. Use the rename_phylogeny function to rename the tree nodes.

$ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_snp_tree/informative.fasta.treefile

Take a look at informative.fasta_RENAMED.treefile, strain identifiers have been added to the tree.


Visualizing the tree in iTOL

Go to https://itol.embl.de and click on “Upload a tree” under the ANNOTATE box. On this page you can paste the tree directly into the tree text: textbox or can click the button to upload the .newick file.

_images/tutorial_phylogeny1.png

Basic controls ITOL

  • The default way of visualizing a tree is the rectangular view. Depending on the number of genomes, the circular view can be easier to interpret. You can the view by clicking on the “Display Mode” buttons.

  • Increase the font size and branch width to improve readability

  • When visualizing a Maximum likelihood (ML) tree, bootstrap values can be displayed by clicking the “Display” button next to Bootstrap/metadata in the Advanced tab of the Control window. This enables you to visualize the values as text or symbol on the branch. or by coloring the branch or adjusting the width.

_images/tutorial_phylogeny2.png
  • When you have a known out group or one of the genomes is a clear outlier in the tree, you should reroot the tree. Hover over the name, click it so a pop-up menu appears. Click “tree structure” followed by “Reroot the tree here”.

_images/tutorial_phylogeny3.png
  • Clicking on the name of a node in the tree allows you to color the name, branch, or background of that specific node.

  • When you’re happy the way your tree looks, go to the Export tab of the Control window. Select the desired output format, click on the “Full image” button and export the file to a figure.

  • Refresh the webpage to go back to the default view of your tree.

Create iTOL templates

In iTOL it is possible to add colors to the tree by coloring the terminal nodes or adding an outer ring. The PanTools function create_tree_template is able to create templates that allows for easy coloring (with maximum of 20 possible colors). If the function is run without any additional argument, templates are created for trees that only contain genome numbers (e.g. k-mer distance tree). Here we want to color the (renamed) core SNP tree with the ‘low_temperature’ phenotype. Therefore, the --phenotype strain_name must be included to the function.

$ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
$ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB

Copy the two low_temperature.txt files from the label/strain_name/ and ring/strain_name/ directories to your personal computer. Click and move the ring template file into the tree visualization webpage.

_images/tutorial_phylogeny4.png

The resulting tree should look this when: the tree is rooted with the Dickeya genome, bootstrap values are displayed as text and the ring color template was included.

_images/tutorial_phylogeny_color.png

Tree coloring is especially useful for large datasets. An example is shown in the figure below, where members of the same species share a color.

_images/tutorial_phylogeny_tree.png