DIY Pangenomics

You will likely want to do pangenomics using your own data, on your own system. To support this, we offer several tools:

PanTools: core application to construct and analyze pangenomes
PanUtils: utilities to support pangenomics (e.g. QC)
PanBrowse: applications for pangenome visualization (e.g. PanVA)

When working with these software tools, consider the following tips:

Separate issues in the (technical) setup from issues in the data, so try the tool on known (test) data first.
Garbage in, garbage out! Therefore, perform rigorous quality control on the input data, e.g. using https://github.com/PanUtils/pantools-qc-pipeline.
Start small, e.g. 10-50 bacterial genomes or 3-10 eukaryotic genomes, before scaling up.

PanTools

PanTools (https://git.wur.nl/bioinformatics/pantools) is a software package for pangenome construction and analysis. It works across the tree of life, from small viruses to large plant, animal, or human genomes. Because it is not alignment-based but employs a compacted de Bruijn graph (cDBG), it works over large(r) evolutionary distances, e.g. at species, genus or family level.
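To give an intuition for what "compacted" means here, the following is a minimal toy sketch of a de Bruijn graph whose non-branching paths are merged into single nodes (unitigs). It is an illustration only and not PanTools' implementation: the function names build_dbg and compact are made up for this example, and the sketch ignores reverse complements, isolated cycles and anything else a real genome-scale implementation has to handle.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Collect all k-mers and the k-mer -> k-mer adjacencies seen in the input."""
    nodes = set()
    succ = defaultdict(set)  # k-mer -> k-mers that directly follow it
    pred = defaultdict(set)  # k-mer -> k-mers that directly precede it
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def compact(nodes, succ, pred):
    """Merge every non-branching path of k-mers into one unitig string."""

    def starts_unitig(node):
        # A node starts a unitig unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        if len(pred[node]) != 1:
            return True
        (parent,) = pred[node]
        return len(succ[parent]) != 1

    unitigs = []
    visited = set()
    for node in sorted(nodes):
        if not starts_unitig(node):
            continue
        path, current = node, node
        visited.add(current)
        # Extend while the path neither branches nor merges.
        while len(succ[current]) == 1:
            (nxt,) = succ[current]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path += nxt[-1]
            current = nxt
            visited.add(current)
        unitigs.append(path)
    return unitigs  # isolated cycles are skipped in this toy version


# Two toy "genomes" that share a prefix and then diverge.
genomes = ["ATGCGTAC", "ATGCGTTC"]
unitigs = compact(*build_dbg(genomes, k=3))
print(unitigs)
```

The shared prefix ends up as a single node and each diverging suffix as its own node, which is why such a graph can represent many related genomes without a pairwise alignment step.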

PanUtils

We have been working on several pangenome utilities, collected in the GitHub organization https://github.com/PanUtils. It currently hosts three pipelines: one for quality control, one for pangenome construction and one for analysis and preparing a PanVA instance.

Quality control pipeline

This Snakemake pipeline runs quality control on the input data. We highly recommend using it to check your files, as annotation files can be very tricky to work with (and therefore sometimes cause issues for tools like PanTools). The pipeline can be found at https://github.com/PanUtils/pantools-qc-pipeline, which includes a README on how to install and run it. In summary, create a conda environment with the most recent version of Snakemake (8.25.5 at the time of writing), then follow the README's instructions on which configuration to set and where to place all files.

The pipeline has four different workflows:

raw_statistics: creates an overview of the statistics of the input data
filter: can be used for filtering genome and annotation files
proteins: creates protein files including statistics
functions: creates functional annotations for all proteins in the protein files
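As a rough idea of what an overview of input statistics can look like, here is a small standalone sketch that computes a few common assembly metrics from FASTA text. It does not reproduce the raw_statistics workflow; the choice of metrics (sequence count, total length, N50, GC fraction) and the function names are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def genome_statistics(lines):
    """Summarize one genome given as FASTA lines."""
    seqs = [seq.upper() for _, seq in read_fasta(lines)]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return {
        "sequences": len(seqs),
        "total_length": total,
        "n50": n50(lengths),
        "gc_fraction": gc / total if total else 0.0,
    }


# Toy input; for a real file, pass an open file handle instead of a list.
example = [">chr1", "ACGTACGTAC", ">chr2", "GGCC", ">chr3", "AT"]
stats = genome_statistics(example)
print(stats)
```

Comparing such numbers across all input genomes is a quick way to spot outliers (e.g. a fragmented assembly) before building a pangenome.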

PanTools v4 pipeline

The PanTools Snakemake pipeline runs through all major PanTools functionalities. You can use this pipeline for inspiration on what can be done with PanTools or to actually run PanTools. Please see https://github.com/PanUtils/pantools-pipeline-v4 for the code, including a README that summarizes how to install, configure and run it.

Warning

It is not recommended to use this pipeline with new experimental data; adjustments to data and PanTools command settings might have to be made during the process, for which this pipeline lacks the flexibility.

The pipeline has three different workflows:

all_panproteome: runs through all major PanTools functionalities available for a panproteome
all_pangenome: runs through all major PanTools functionalities available for a pangenome
panva: runs all PanTools functions necessary to set up a PanVA instance

PanVA conversion

If you have already created a pangenome database with PanTools and need to convert it to the data structure required for setting up a PanVA instance, please use this repository: https://github.com/PanUtils/export-to-panva. It contains a README on how to install and run it. In summary, create an environment using mamba from a predefined YAML file, configure the settings in a config.ini file, and then run the Python script, which takes the config.ini as input.
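For readers unfamiliar with the config.ini pattern, the sketch below shows how a Python script typically reads such a file with the standard-library configparser module. The section name and keys (general, pangenome_db, output_dir) are hypothetical placeholders and not the actual settings of export-to-panva; consult that repository's README for the real ones.

```python
import configparser

# Hypothetical settings, for illustration only.
EXAMPLE_INI = """
[general]
pangenome_db = /path/to/pangenome_db
output_dir = /path/to/panva_output
"""

REQUIRED_KEYS = ("pangenome_db", "output_dir")  # hypothetical key names


def load_settings(text):
    """Parse INI text and return the [general] section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if "general" not in parser:
        raise KeyError("missing [general] section")
    section = parser["general"]
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return dict(section)


settings = load_settings(EXAMPLE_INI)
print(settings["pangenome_db"], "->", settings["output_dir"])
```

Validating required settings up front, as done here, makes a misconfigured run fail immediately rather than partway through a long conversion.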

PanBrowse

We work on methods for (interactive) visualization of complex pangenomes. PanBrowse (https://github.com/PanBrowse) is where we host the visualization applications.

PanVA

PanVA is a web application that allows users to visually and interactively explore sequence variants in pangenomes (generated by PanTools). It provides context for these variants by displaying their corresponding annotations and phylogenetic and phenotypic information.