Part 4 - Visualization using PanVA

This part guides you through the process of going from raw data to a PanVA instance.

Data Packages

Currently one small data package is available, which contains all the information needed to build a pangenome with PanTools and create a PanVA instance from it, all the way from raw data to pangenome browser. This data package is different from the one used in the performance test, so don’t reuse that one here!

data type   link           contents     database size
Fungi       Yeast [4.3K]   10 genomes   4G

Steps to generate PanVA input

1: Download the data package

$ wget https://www.bioinformatics.nl/pangenomics/data/<target-dataset>.tar.gz
$ tar xvzf <target-dataset>.tar.gz

Every package contains a README with the exact commands, so make sure to check it if you’re stuck; for all packages, it can be found at the root of the decompressed tar file. The Snakemake pipelines used in this workflow create conda environments inside the data package directory. If you want to reuse these pipelines for different data, use Snakemake’s --conda-prefix option to set a directory where the conda environments will be stored. All commands should be run from the root directory of the package. Run the whole package from a RAM disk or SSD, or set the results path to a RAM disk/SSD in the configs.
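For example, to store the conda environments in a shared directory outside the data package, the QC invocation used later in this guide could be extended as follows (the ~/snakemake-envs path is only an example; pick any location you like):

$ snakemake --use-conda --conda-prefix ~/snakemake-envs --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>

The environments are then created once under ~/snakemake-envs and reused across runs on different data packages.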

2: Downloading data for the pangenome

Goal:
  • Acquire genome and structural annotation data

To download the corresponding raw data for each of the packages linked above, follow the first steps of the instructions laid out in the README file. The command looks like this:

$ ./<target-dataset>_download.sh

Make sure the gunzip command is available on your system.
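You can quickly check whether gunzip is available before running the download script (this prints the path of the binary, or a hint if it is missing):

```shell
# Print the location of gunzip, or a hint if it is not installed.
command -v gunzip || echo "gunzip not found - install gzip first"
```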

3: Preprocessing the data for PanTools

Goal:
  • Filter genome sequences in the FASTA file by minimum sequence size

  • Filter CDS features in the annotation by minimum ORF size

  • Extract protein sequences by matching CDS features to genomic sequences

  • Create functional annotations for the extracted protein sequences (optional)

  • Generate statistics for raw and filtered data

This requires the following three steps:

Step 1: Clone the PanUtils data filtering pipeline

The data filtering pipeline filters out small sequences, matches FASTA with GFF contents, and removes CDS features with an ORF length below the cutoff value. It also extracts protein sequences and creates functional annotations for them.

$ git clone https://github.com/PanUtils/pantools-qc-pipeline.git
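To make the minimum-sequence-size filter concrete, here is a toy illustration with awk on a two-contig FASTA file. This is not the pipeline’s actual implementation; the real pipeline reads its cutoff from the QC config file:

```shell
# Create a tiny example FASTA file with one long and one short contig.
cat > /tmp/example.fasta <<'EOF'
>contig1
ACGTACGTACGT
>contig2
ACG
EOF
# Keep only sequences of at least 10 bp (contig2 is dropped).
awk 'BEGIN { RS = ">"; ORS = "" }
     NR > 1 {
       header = $1; seq = ""
       for (i = 2; i <= NF; i++) seq = seq $i
       if (length(seq) >= 10) print ">" header "\n" seq "\n"
     }' /tmp/example.fasta > /tmp/filtered.fasta
cat /tmp/filtered.fasta
```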

Step 2: Activate or create a Snakemake environment

Activate or create a Snakemake environment (works with Python versions <= 3.11).

$ mamba create -c conda-forge -c bioconda -n snakemake snakemake=8

Note

If you are using an ARM-based machine (such as an M4-based Mac), make sure the new environment is compatible with Intel-based packages, since many conda dependencies are not yet available for ARM systems. Consider, for example, installing Rosetta 2.

$ softwareupdate --install-rosetta

Please use this command to set up your environment:

$ CONDA_SUBDIR=osx-64 mamba create -c conda-forge -c bioconda -n snakemake snakemake=8

This command ensures that packages are downloaded for an Intel architecture. Afterwards, restart your shell with the “Open using Rosetta” setting enabled. To do this, go to Applications/Utilities/Terminal, click “Get Info”, and select the option to open the terminal using Rosetta.
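You can verify which architecture your shell currently reports; in a terminal running under Rosetta 2 this should print x86_64 instead of arm64:

```shell
# Print the CPU architecture the current shell reports.
uname -m
```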

Step 3: Filter raw data and create functional annotations for protein sequences

Filter the raw data and create protein sequences:

$ snakemake --use-conda --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>
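Before running the full pipeline, it can help to preview which jobs Snakemake would execute. Snakemake’s standard --dry-run (-n) option prints the planned jobs without running anything:

$ snakemake --dry-run --use-conda --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>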

4: Running all pangenome analyses using PanTools

Goal:
  • Build the pangenome

  • Add structural annotations, functional annotations and phenotypes

  • Add VCF information or phasing information if available

  • Create homology groups

  • Run the necessary analysis steps (gene classification, k-mer classification, multiple sequence alignment, group info) in order to create a PanVA instance

Step 1: Clone the PanTools pipeline v4

The PanTools pipeline contains all PanTools functions. We will use it here to create a pangenome and run all steps discussed above.

$ git clone https://github.com/PanUtils/pantools-pipeline-v4.git

Step 2: Run PanTools for PanVA-specific analyses

The Snakemake rule panva takes care of creating a pangenome and running all functions necessary to create a PanVA instance. All of these are started with a single command, outlined below.

$ snakemake panva --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/<target-dataset>_pantools.yaml --cores <threads>

This will create a pangenome database from which PanVA files can be generated.
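As a sanity check, you can compare the size of the resulting database against the table above (roughly 4G for the yeast package). The path below is a placeholder; use the database location set in your config:

$ du -sh <path-to-pangenome-database>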

5: Generate input for PanVA instance

Goal:
  • Preprocessing data for PanVA

Step 1: Clone the export-to-panva python script

$ git clone https://github.com/PanUtils/export-to-panva.git

Step 2: Create a conda environment for the script

$ mamba env create -n export-to-panva -f export-to-panva/envs/pantova.yaml
$ conda activate export-to-panva

Note

Make sure to create an environment that can handle Intel-based dependencies if you are on an Apple silicon Mac.

Step 3: Run the export script

The export script reads data from the pangenome database and converts it to the proper format for PanVA. Run the following command from the root of the data package to create the inputs for PanVA:

$ python3 export-to-panva/scripts/pan_to_va.py config/<target-dataset>_panva.ini

6: Create a PanVA instance

Goal:
  • Set up the PanVA instance

With the output of the export script, you should be able to create a PanVA instance for your dataset using the instructions from PanVA’s Technical Test.