Installing PanTools from source

For easy installation, please see Installing PanTools in the user guide. On this page, we will describe the full installation process needed for developers.

Install dependencies
Install Neo4j
Compile PanTools
Install Sphinx
Install pre-commit hooks

Install dependencies

Some PanTools functionalities require additional software to be installed. Installing every dependency takes a considerable amount of time, so we highly recommend using Mamba. Mamba efficiently manages conda environments, allowing all required tools to be installed into a separate environment. Instructions for creating the Mamba environment or installing the tools manually are found in the sections below.

Install dependencies using Conda

To install all dependencies into a separate environment, run the following command. Please use the conda.yaml file, which can be found in the release or, when working with the git repository itself, in the PanTools root directory. For smooth dependency resolution with conda, we recommend using strict channel priority and only the bioconda and conda-forge channels.

$ mamba env create -n pantools -f conda.yaml
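
The channel recommendation above can be applied with conda's own configuration command before the environment is created. The sketch below wraps the three calls in a function; the name configure_channels is made up for this example, and it assumes that conda is on your PATH and that changing your user-level conda configuration is acceptable.

```shell
# Sketch: apply the recommended channel setup (function name is illustrative).
configure_channels() {
    # 'conda config --add' puts a channel at the top of the list, so the
    # channel added last (conda-forge) ends up with the highest priority.
    conda config --add channels bioconda
    conda config --add channels conda-forge
    # Always prefer packages from higher-priority channels.
    conda config --set channel_priority strict
}
```

Run configure_channels once, then create the environment with the mamba command shown above.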

Manual installation of dependencies

All tools must be added to your PATH so that PanTools can use them. The instructions below are written for a Linux machine. Please note that some required tools may be missing from the list below; you can install them using conda or mamba.
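
Each tool section below appends an export line to ~/.bashrc with echo. Running such a command twice leaves duplicate lines in the file, so a small helper that only appends a line when it is not yet present can be convenient. This is a minimal sketch; add_to_path is a hypothetical helper name, not part of PanTools.

```shell
# Sketch: add a directory to PATH via a shell startup file, without duplicates.
add_to_path() {
    dir="$1"                          # directory that contains the executables
    rcfile="${2:-$HOME/.bashrc}"      # startup file to edit (default: ~/.bashrc)
    line="export PATH=\"$dir:\$PATH\""
    # Append the export line only if an identical line is not already there.
    if ! grep -qxF -- "$line" "$rcfile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}
```

For example, add_to_path /path/to/KMC followed by source ~/.bashrc has the same effect as the echo command in the KMC section.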

Install KMC

PanTools requires KMC v3.1.0 or higher. The KMC3 binaries can be downloaded from https://github.com/refresh-bio/KMC/releases.

$ tar xvzf KMC* #uncompress the KMC binaries

# Edit your ~/.bashrc to add KMC to your PATH
$ echo "export PATH=/path/to/KMC/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ kmc # test if KMC is executable
$ kmc_tools # test if kmc_tools is executable
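
Because PanTools needs KMC v3.1.0 or higher, it can help to compare the installed version against that minimum. The sketch below assumes a sort implementation with the -V (version sort) option, as provided by GNU coreutils. The helper name version_ge is made up for this example, and the installed version has to be read from the KMC output by hand.

```shell
# Sketch: succeed when version $1 is greater than or equal to version $2.
version_ge() {
    # Sort both versions; if the required one ($2) sorts first, $1 is new enough.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

installed="3.2.1"   # example value; replace with the version of your KMC binaries
if version_ge "$installed" "3.1.0"; then
    echo "KMC $installed is recent enough"
else
    echo "KMC $installed is too old, v3.1.0 or higher is required"
fi
```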

Install MCL

The MCL (Markov clustering) algorithm is required for the homology grouping of PanTools. The software can be found on https://micans.org/mcl under License & software.

$ wget https://micans.org/mcl/src/mcl-14-137.tar.gz
$ tar xvzf mcl-*
$ cd mcl-14-137
$ ./configure --prefix=/path/to/mcl-14-137/shared #replace /path/to with the correct path on your computer
$ make install

# Edit your ~/.bashrc to add MCL to your PATH (the binaries are installed in the bin/ directory of the --prefix chosen above)
$ echo "export PATH=/path/to/mcl-14-137/shared/bin/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ mcl -h # test if MCL is executable

Install BUSCO

BUSCO v3 to v5 can be run against the pangenome to estimate annotation completeness. Each version requires a different Python release and is installed in a different way. We suggest installing BUSCO v5 by following the instructions at https://gitlab.com/ezlab/busco/.

Install FastTree

FastTree is used to infer approximately-maximum-likelihood phylogenetic trees from the alignments of nucleotide or protein sequences which are extracted from the pangenome. An executable can be found on the FastTree website: http://www.microbesonline.org/fasttree/.

$ wget http://www.microbesonline.org/fasttree/FastTree
$ chmod +x FastTree
$ ./FastTree # test if FastTree is executable

# Edit your ~/.bashrc to add FastTree to your PATH
$ echo "export PATH=/path/to:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc

Install R

R and some additional R packages are required to execute the R scripts (files with the .R extension) that create plots and construct Neighbor-Joining phylogenies. In most cases, R is already installed on a server. If this is not the case, install it by following the instructions on the website https://cran.r-project.org/, or compile it using the following steps.

$ mkdir R
$ mkdir R/R_LIBS
$ cd R
$ wget https://cran.r-project.org/src/base/R-4/R-4.0.2.tar.gz #version number might have changed already
$ tar -xvf R-4.0.2.tar.gz
$ cd R-4.0.2/
$ ./configure --prefix=/path/to/R/  #replace /path/to with the correct path on your computer
$ make
$ make install #installs the R binaries into the bin/ directory of the chosen prefix

# Edit your ~/.bashrc to add R to your PATH
$ echo "export PATH=/path/to/R/bin/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ R --help # test if R is executable

When the R_LIBS environment variable is set, R scripts know the location of the libraries and are able to install additional R packages into the selected directory.

$ echo "R_LIBS=/path/to/R/R_LIBS/" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ echo "export R_LIBS" >> ~/.bashrc
$ source ~/.bashrc
$ echo $R_LIBS # validate if the path to the R libraries can be found

Install MAFFT

MAFFT is required for all the alignment functionalities, such as the alignment of homology groups and inferring the core SNP phylogeny. The full manual is available at https://mafft.cbrc.jp/alignment/software/.

$ git clone https://github.com/GSLBiotech/mafft.git
$ cd mafft/core

# Edit the first line of Makefile to change the install location from 'PREFIX = /usr/local' to 'PREFIX = /YOUR_DESIRED_PATH/mafft/'
# Make sure the 'ENABLE_MULTITHREAD = -Denablemultithread' line is uncommented, to enable multithreading
$ make
$ make install

# Edit your ~/.bashrc to add MAFFT to your $PATH
$ echo "export PATH=/path/to/mafft/bin/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ mafft --help # test if MAFFT is executable

Install IQ-tree

IQ-tree is used to infer phylogenetic trees by maximum likelihood. Information about the tool can be found on its webpage, http://www.iqtree.org/.

$ wget https://github.com/Cibiv/IQ-TREE/releases/download/v1.6.12/iqtree-1.6.12-Linux.tar.gz
$ tar -xvf iqtree-1.6.12-Linux.tar.gz

# Edit your ~/.bashrc to add IQ-tree to your $PATH
$ echo "export PATH=/path/to/iqtree-1.6.12-Linux/bin/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ iqtree -h # test if IQ-tree is executable

Install fastANI or MASH

To be able to construct a Neighbor-Joining phylogeny using ANI-scores, either fastANI or MASH is required. The manual for fastANI is available at https://github.com/ParBLiSS/FastANI/. The manual for MASH can be found at https://mash.readthedocs.io/en/latest/.

$ wget https://github.com/marbl/Mash/releases/download/v2.2/mash-Linux64-v2.2.tar
$ tar -xvf mash-Linux64-v2.2.tar
$ mv mash-Linux64-v2.2/mash .

$ wget https://github.com/ParBLiSS/FastANI/releases/download/v1.32/fastANI-Linux64-v1.32.zip
$ unzip fastANI-Linux64-v1.32.zip

# Edit your ~/.bashrc to add MASH and FastANI to your $PATH
$ echo "export PATH=/path/to/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ mash -h # test if MASH is executable
$ fastANI -h # test if FastANI is executable

Install BLAST

BLAST is only required by one function, where the sequences are blasted against a database to obtain their COG category. Information about BLAST can be found at https://www.ncbi.nlm.nih.gov/books/NBK279690/?report=classic.

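$ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.10.1+-x64-linux.tar.gz #LATEST only contains the newest release, so adjust the version number if this file is no longer available
$ tar -xvf ncbi-blast-2.10.1+-x64-linux.tar.gz

# Edit your ~/.bashrc to add BLAST to your $PATH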
$ echo "export PATH=/path/to/ncbi-blast-2.10.1+/bin/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ blastp -help # test if BLAST is executable

Install ASTER

ASTER is required for creating a phylogenetic tree based on both orthologs and paralogs with astral-pro. The manual for ASTER can be found at https://github.com/chaoszhang/ASTER.

$ git clone https://github.com/chaoszhang/ASTER.git
$ cd ASTER
$ git checkout v1.3
$ make

# Edit your ~/.bashrc to add ASTER to your $PATH
$ echo "export PATH=/path/to/ASTER/bin/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ astral-pro -h # test if ASTER is executable

Install InterProScan

InterProScan is not required by any function, but its .GFF3 output can be read to add functional annotations to the database. The installation can be quite tricky, as it uses many third-party binaries that each have their own dependencies. Please check https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload and take a look at the installation requirements as well. Installation of the Panther models is not required.

Phobius via InterProScan

Phobius predictions can be performed during the InterProScan analysis, but they are not part of the standard set of predictions. To enable these predictions, download Phobius from https://phobius.sbc.su.se/, place the entire directory in the InterProScan/bin/ directory and edit the interproscan.properties configuration file. More information about including Phobius in the InterProScan analysis is found at https://interproscan-docs.readthedocs.io/en/latest/ActivatingLicensedAnalyses.html.

Install eggNOGmapper

eggNOG-mapper is not required by any function, but its .annotations output can be read to add functional annotations to the database. Information about this tool can be found at http://eggnog-mapper.embl.de/.

$ git clone https://github.com/eggnogdb/eggnog-mapper.git
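
After a manual installation, it is useful to confirm in one go that the executables named in the sections above can be found on your PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, and the check only tests whether a command exists, not whether its version is suitable.

```shell
# Sketch: report which of the given commands can be found on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return a non-zero status when at least one tool is missing.
    return "$missing"
}
```

For example, using the executable names from the test commands above:

$ check_tools kmc kmc_tools mcl FastTree R mafft iqtree mash fastANI blastp astral-pro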

Install Neo4j

Although Neo4j is not needed for any of the PanTools functionalities, it is required to visualize the graph database and to run Cypher queries. PanTools versions up to 3.2 use the Neo4j 3.5.3 libraries, whereas newer releases use Neo4j 3.5.30. Neo4j version 3.5.30 is compatible with all earlier PanTools versions.

Download the Neo4j 3.5.30 community edition from the Neo4j website or download the binaries directly from our server.

$ wget http://www.bioinformatics.nl/pangenomics/tutorial/neo4j-community-3.5.30-unix.tar.gz
$ tar xvzf neo4j-community-3.5.30-unix.tar.gz

# Edit your ~/.bashrc to add Neo4j to your $PATH
$ echo "export PATH=/path/to/neo4j-community-3.5.30/bin:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
$ source ~/.bashrc
$ neo4j status # test if Neo4j is executable

Official Neo4j 3.5 manual: https://neo4j.com/docs/operations-manual/3.5/

Compile PanTools

PanTools is written in Java and can be compiled using Maven. The instructions for compilation are written in the pom.xml file. The following commands can be used to compile PanTools (in no particular order):

# Compile PanTools
mvn compile

# Run the tests
mvn test

# Create a fat jar file for PanTools (including all dependencies)
mvn package

# Create the fat jar file without running the tests
mvn package -DskipTests

If you have created a fat jar with the mvn package command, you can run PanTools using the following command:

java -jar target/pantools-4.3.5.jar

Please note that the version number in the jar file name is always that of the latest release, so it may differ from the one shown here. To see the exact version of the jar file, you can use the following command:

java -jar target/pantools-4.3.5.jar --version
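
Since the version number in the jar file name changes between releases, the jar can also be located with a glob instead of a hard-coded name. This is a minimal sketch that assumes the jar is named target/pantools-<version>.jar as in the commands above; find_pantools_jar is a hypothetical helper name, and if several matching jars exist, the first match in glob order is used.

```shell
# Sketch: print the path of the PanTools jar in a build directory.
find_pantools_jar() {
    dir="${1:-target}"    # build directory (default: ./target)
    for jar in "$dir"/pantools-*.jar; do
        # An unmatched glob stays literal, so check that the file really exists.
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    echo "no pantools jar found in $dir; run 'mvn package' first" >&2
    return 1
}
```

The jar can then be started from the repository root without typing the version:

$ java -jar "$(find_pantools_jar)" --version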

Finally, for development purposes, it is possible to skip creating a fat jar file and run PanTools directly from the compiled Java classes. This can be done using the following commands:

# locally
mvn compile
mvn dependency:copy-dependencies
rsync -avPhz target/{dependency,classes} user@remote.server.nl:/path/to/dev/pantools/target

# on the remote server
alias pantools-dev='java -cp "/path/to/dev/pantools/target/dependency/*:/path/to/dev/pantools/target/classes" nl.wur.bif.pantools.Pantools'
pantools-dev --version

Install Sphinx

Sphinx is required to build the documentation we host on ReadTheDocs. It is possible to test and build the documentation locally. The documentation is written in reStructuredText and can be found in the docs/ directory.

To test and build the documentation locally, you need to install Sphinx, sphinx-rtd-theme and sphinx-lint, as well as graphviz for the sphinx.ext.graphviz extension (used for creating graphs in the documentation). Please make sure you use Python 3.7, as this is important for the version of sphinx-lint (and pre-commit).

# Install graphviz
mamba install graphviz

# Install sphinx, sphinx-rtd-theme and sphinx-lint
pip install sphinx sphinx-rtd-theme sphinx-lint

The following commands can be used to test and build the documentation:

# Test the documentation
sphinx-lint -e=all --max-line-length=80

# Build the documentation
sphinx-build -W docs/source output

Install pre-commit hooks

It is highly recommended to install the pre-commit hooks to ensure that all code you commit to the repository is properly formatted and passes all checks. This will help to keep the code base clean and consistent.

First install Sphinx as described above. Then install the pre-commit Python package by following its installation instructions, for example with pip:

pip install pre-commit

Then, inside the root directory of the repository, run:

pre-commit install

You only need to run this step once after cloning the repository. The hooks will be installed in your local repository's configuration under .git/hooks/pre-commit.

After installation of the hooks, only relevant files will be checked when committing changes. For example, if you change a Java file, only the Java related hooks will be triggered. Should any of the pre-commit hooks fail, git will not allow you to create the commit. The output of the pre-commit hooks should tell you what failed, allowing you to fix any problems and to re-add the affected files for another commit attempt.

Pre-commit hooks can be run manually as well with:

pre-commit run -a

End-to-end pipeline

We include an end-to-end pipeline that automatically runs online on git.wur.nl. However, it is also possible to run this pipeline locally. The pipeline is written in Snakemake and can be found in the tests/ directory. The pipeline should work on both macOS and Linux.

Install Snakemake

To run the pipeline locally, please first install Snakemake in the previously created conda environment:

mamba install snakemake=9.5.1

Obtain data

For the pipeline, we need to download the yeast pangenome and/or panproteome test data which we deposited on our in-house server as yeast_pangenome.tar.gz and yeast_panproteome.tar.gz. A tarball for the functional annotation databases exists as well to keep it consistent: functional_databases.tar.gz. For access to these files, please contact the PanTools developers. Next, unpack the files:

cd tests
tar xvzf yeast_pangenome.tar.gz  #for yeast pangenome
tar xvzf yeast_panproteome.tar.gz  #for yeast panproteome
tar xzvf functional_databases.tar.gz  #for functional databases

Run the pipeline

Then, to run the pipeline, run the following in the tests/ directory:

snakemake -pc1 --use-conda --configfile config/yeast_pangenome.yaml  #for yeast pangenome
snakemake -pc1 --use-conda --configfile config/yeast_panproteome.yaml  #for yeast panproteome

Pipeline overview

The pipeline runs all basic functionalities of PanTools for the current repository and a reference commit. This reference commit has to be a commit on the develop branch and is specified in tests/config/shared.yaml. Importantly, the dependencies of this reference commit (meaning, a copy of the conda.yaml file of that commit) are needed to run the pipeline: this file has to be stored as tests/envs/reference.yaml.

The pipeline will clean the output of both local and reference subcommands and compare these to each other. Cleaning is needed because node identifiers are allowed to differ between runs, but the output should be the same otherwise. If any output file differs, the end-to-end pipeline will fail.
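
The clean-then-compare idea can be illustrated with a small shell sketch. Everything specific in it is an assumption made for illustration only: the node_id=<number> token format and the function names clean_output and outputs_match are hypothetical, and the actual cleaning rules are the ones implemented by the pipeline in the tests/ directory.

```shell
# Sketch: mask run-specific identifiers, then compare two output files.
clean_output() {
    # Hypothetical format: replace every 'node_id=<digits>' with a fixed token.
    sed -E 's/node_id=[0-9]+/node_id=N/g' "$1"
}

outputs_match() {
    # Succeeds only when both files are identical after cleaning.
    local_clean="$(clean_output "$1")"
    reference_clean="$(clean_output "$2")"
    [ "$local_clean" = "$reference_clean" ]
}
```

Two files that differ only in their node identifiers pass this comparison, while any other difference makes outputs_match return a non-zero status.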

The following subcommands are run in the pipeline:

build_pangenome
export_pangenome
build_panproteome
add_annotations
group
grouping_overview
add_phenotypes
add_pav
add_variants
variation_overview
gene_classification
kmer_classification
pangenome_structure (gene)
pangenome_structure (kmer)
add_functions
functional_classification
function_overview
go_enrichment
map
msa (var)
msa (prot)
ani
core_phylogeny