DIY Pangenomics

You will likely want to do pangenomics with your own data, on your own system. To support this, we offer several tools:

  • PanTools: core application to construct and analyze pangenomes

  • PanUtils: utilities to support pangenomics (e.g. QC)

  • PanBrowse: applications for pangenome visualization (e.g. PanVA)

When working with these software tools, consider the following tips:

  • Separate problems with the (technical) setup from problems with the data: try each tool on known (test) data first.

  • Garbage in, garbage out! Therefore, perform rigorous quality control on the input data, e.g. using https://github.com/PanUtils/pantools-qc-pipeline.

  • Start small, e.g. 10-50 bacterial genomes or 3-10 eukaryotic genomes, before scaling up.

PanTools

PanTools (https://git.wur.nl/bioinformatics/pantools) is a software package for pangenome construction and analysis. It works across the tree of life, from small viruses to large plant, animal, or human genomes. Because it is not alignment-based but instead employs a compacted de Bruijn graph (cDBG), it works over larger evolutionary distances, e.g. at the species, genus or family level.

PanUtils

We have been working on several pangenome utilities, collected at https://github.com/PanUtils. The collection currently contains three pipelines: one for quality control, one for pangenome construction, and one for analysis and preparing a PanVA instance.

Quality control pipeline

This Snakemake pipeline runs quality control on the input data. We highly recommend using it to check your files, as annotation files can be very tricky to work with and can therefore cause issues for tools like PanTools. The pipeline is available at https://github.com/PanUtils/pantools-qc-pipeline, which includes a README on how to install and run it. In summary, create a conda environment with the most recent version of Snakemake (8.25.5 at the time of writing), then follow the README's instructions on which configuration to set and where all files should be placed.

The pipeline has four different workflows:

  • raw_statistics: creates an overview of the statistics of the input data

  • filter: can be used for filtering genome and annotation files

  • proteins: creates protein files including statistics

  • functions: creates functional annotations for all proteins in the protein files
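As an illustration of the kind of overview a workflow such as raw_statistics gives, the following Python sketch computes a few common genome statistics from FASTA input. It is not the pipeline's code: the choice of statistics (sequence count, total length, N50, GC fraction) and the names read_fasta and summarize are assumptions made for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records):
    """Return basic statistics for a collection of (header, sequence) pairs."""
    sequences = [seq.upper() for _, seq in records]
    lengths = sorted((len(seq) for seq in sequences), reverse=True)
    total = sum(lengths)

    # N50: length of the sequence at which the running total, taken from the
    # longest sequence downwards, first reaches half of the total length.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc_count = sum(seq.count("G") + seq.count("C") for seq in sequences)
    return {
        "sequences": len(sequences),
        "total_length": total,
        "n50": n50,
        "gc_fraction": gc_count / total if total else 0.0,
    }


# Tiny in-memory example; for a real file, pass an open file handle instead.
example_fasta = ">chr1\nACGTACGT\n>chr2\nGGCC\n>chr3\nAT\n"
stats = summarize(read_fasta(example_fasta.splitlines()))
print(stats)
```

Running a quick check like this on a handful of genomes before starting the full pipeline helps to spot truncated or unexpectedly fragmented assemblies early.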

PanTools v4 pipeline

The PanTools Snakemake pipeline runs through all major PanTools functionalities. You can use this pipeline for inspiration on what can be done with PanTools, or to actually run PanTools. Please see https://github.com/PanUtils/pantools-pipeline-v4 for the code, including a README that summarizes how to install, configure and run it.

Warning

We do not recommend using this pipeline with new experimental data: the data and the PanTools command settings may need to be adjusted along the way, and this pipeline lacks the flexibility for that.

The pipeline has three different workflows:

  • all_panproteome: runs through all major PanTools functionalities available for a panproteome

  • all_pangenome: runs through all major PanTools functionalities available for a pangenome

  • panva: runs all PanTools functions necessary to set up a PanVA instance

PanVA conversion

If you have already created a pangenome database with PanTools and need to convert it to the data structure required for setting up a PanVA instance, please use this repository: https://github.com/PanUtils/export-to-panva. It contains a README on how to install and run it. In summary, create an environment with mamba from the predefined YAML file, configure the settings in a config.ini file, and then run the Python script, which takes the config.ini as input.
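For readers unfamiliar with INI files, the sketch below shows how such a file can be written and read with Python's standard configparser module. The section and key names (general, pangenome_db, output_dir) and the paths are hypothetical placeholders; the actual settings expected by export-to-panva are documented in its README.

```python
import configparser
import tempfile
from pathlib import Path


def write_config(path, settings):
    """Write a nested dict of {section: {key: value}} to an INI file."""
    parser = configparser.ConfigParser()
    parser.read_dict(settings)
    with open(path, "w") as handle:
        parser.write(handle)


def load_config(path):
    """Read an INI file and fail loudly if it does not exist."""
    parser = configparser.ConfigParser()
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read config file: {path}")
    return parser


# Hypothetical settings: these names are NOT taken from export-to-panva.
example_settings = {
    "general": {
        "pangenome_db": "/data/my_pangenome_db",
        "output_dir": "/data/panva_input",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.ini"
    write_config(config_path, example_settings)
    config = load_config(config_path)
    db_path = config["general"]["pangenome_db"]
    print(db_path)
```

Keeping all paths in one config file, rather than on the command line, makes it easier to rerun the conversion later with exactly the same settings.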

PanBrowse

We develop methods for the (interactive) visualization of complex pangenomes. PanBrowse (https://github.com/PanBrowse) is where we host the visualization applications.

PanVA

PanVA is a web application that allows users to visually and interactively explore sequence variants in pangenomes (generated by PanTools). It provides context for these variants by displaying their corresponding annotations, as well as phylogenetic and phenotypic information.