Part 4 - Visualization using PanVA
This part guides you through the process of getting from raw data to a PanVA instance.
Data Packages
Currently one small data package is available, which contains all the information for creating a pangenome using PanTools and building a PanVA instance from it, all the way from raw data to pangenome browser. This data package is different from the one in the performance test, so don’t reuse that one here!
| data type | link | contents | database size |
| --- | --- | --- | --- |
| Fungi | Yeast [4.3K] | 10 genomes | 4G |
Steps to generate PanVA input
1: Download the data package
$ wget https://www.bioinformatics.nl/pangenomics/data/<target-dataset>.tar.gz
$ tar xvzf <target-dataset>.tar.gz
Every package contains a README with all the exact commands, so make sure to check it if you get stuck. For all packages, the README can be found at the root of the decompressed tar file.
The Snakemake pipelines used in this workflow create conda environments in the data package directory. If you want to re-use these pipelines for different data, use Snakemake’s --conda-prefix option to set a directory where the conda environments will be stored.
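For example, any of the snakemake commands below can be extended with this option (the prefix path here is just an example; any writable directory works):
$ snakemake --use-conda --conda-prefix ~/snakemake-conda-envs --snakefile <pipeline>/workflow/Snakefile --configfile <config> --cores <threads>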
All commands should be run from the root directory of the package.
Run the whole package from a RAM disk or SSD, or set the path of the results in the configs to a RAM disk or SSD.
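For example, on Linux systems a RAM disk is typically mounted at /dev/shm, so one option is to extract the package there and work from that directory (the paths below are illustrative):
$ mkdir -p /dev/shm/panva
$ tar xvzf <target-dataset>.tar.gz -C /dev/shm/panva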
2: Downloading data for the pangenome
- Goal:
Acquire genome and structural annotation data
To download the corresponding raw data for each of the above-linked packages, follow the first steps of the instructions laid out in the README file. The command looks like this:
$ ./<target-dataset>_download.sh
Make sure the gunzip
command is available on your system.
3: Preprocessing the data for PanTools
- Goal:
Filter genomes in the FASTA file by minimum sequence size
Filter CDS features in the annotation by minimum ORF size
Extract protein sequences by matching CDS features to genomic sequences
Create functional annotations for the extracted protein sequences (optional)
Generate statistics for raw and filtered data
This requires the following three steps:
Step 1: Clone the PanUtils data filtering pipeline
The data filtering pipeline filters out small sequences, matches FASTA with GFF contents, and removes CDS features with an ORF length below the cutoff value. It also extracts protein sequences and creates functional annotations for them.
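The pipeline performs all of these filters for you, but to illustrate what the minimum-sequence-size step does conceptually, a FASTA length filter can be sketched in a few lines of awk. This standalone sketch is not the pipeline's actual implementation, and the cutoff values are arbitrary examples:

```shell
# Keep only FASTA records whose sequence is at least `min` bases long.
# Illustrative sketch only; the qc pipeline applies its own filtering rules.
min_len_filter() {
  awk -v min="$1" '
    /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
           hdr = $0; seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
  '
}

# Example: drop records shorter than 5 bp.
printf '>short\nACGT\n>long\nAAAAAA\n' | min_len_filter 5
```

In practice such a filter would read a genome file, e.g. `min_len_filter 1000 < genome.fasta > genome.filtered.fasta`.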
$ git clone https://github.com/PanUtils/pantools-qc-pipeline.git
Step 2: Activate or create a Snakemake environment
Activate or create a Snakemake environment (works with Python versions <= 3.11).
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake=8
Note
If you are using an ARM-based machine (such as an M4-based Mac), make sure the new environment is compatible with Intel-based packages; many conda dependencies are not yet available for ARM systems. Consider, for example, installing Rosetta 2:
$ softwareupdate --install-rosetta
Please use this command to set up your environment:
$ CONDA_SUBDIR=osx-64 mamba create -c conda-forge -c bioconda -n snakemake snakemake=8
This command ensures that packages are downloaded for the Intel architecture. Afterwards, restart your shell with the “Open using Rosetta” setting enabled: go to Applications/Utilities, select Terminal, click “Get Info”, and tick the option to open the terminal using Rosetta.
Step 3: Filter raw data and create functional annotations for protein sequences
Filter the raw data and create protein sequences:
$ snakemake --use-conda --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>
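If you first want to see which jobs this will run, Snakemake’s dry-run flag (-n) prints the planned jobs without executing anything:
$ snakemake -n --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml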
4: Running all pangenome analyses using PanTools
- Goal:
Build the pangenome
Add structural annotations, functional annotations and phenotypes
Add VCF information or phasing information if available
Create homology groups
Run the necessary analysis steps (gene classification, k-mer classification, multiple sequence alignment, group info) in order to create a PanVA instance
Step 1: Clone the PanTools pipeline v4
The PanTools pipeline contains all PanTools functions. We will use it here to create a pangenome and run all steps discussed above.
$ git clone https://github.com/PanUtils/pantools-pipeline-v4.git
Step 2: Run PanTools for PanVA-specific analyses
The Snakemake rule panva takes care of creating a pangenome and running all functions necessary to create a PanVA instance. All of this is started with a single command, outlined below.
$ snakemake panva --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/<target-dataset>_pantools.yaml --cores <threads>
This will create a pangenome database from which PanVA files can be generated.
5: Generate input for PanVA instance
- Goal:
Preprocessing data for PanVA
Step 1: Clone the export-to-panva python script
$ git clone https://github.com/PanUtils/export-to-panva.git
Step 2: Create a conda environment for the script
$ mamba env create -n export-to-panva -f export-to-panva/envs/pantova.yaml
$ conda activate export-to-panva
Note
Make sure to create an environment that can deal with Intel-based dependencies if you are on an Apple-silicon Mac (see the note in the previous section).
Step 3: Run the export script
The export script reads data from the pangenome database and converts it to the proper format for PanVA. Run the following command from the root of the data package to create the inputs for PanVA:
$ python3 export-to-panva/scripts/pan_to_va.py config/<target-dataset>_panva.ini
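The .ini file referenced here ships with the data package and already points at the pangenome database. For orientation only, it is a standard INI file along these lines; the section and key names below are hypothetical placeholders, not the real export-to-panva schema, so always use the config file from the package:
; hypothetical sketch, not the real export-to-panva schema
[database]
path = <path-to-pantools-database>
[output]
path = <path-to-panva-input-directory>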
6: Create a PanVA instance
- Goal:
Set up the PanVA instance
With the output of the export script, you should be able to create a PanVA instance for your dataset using the instructions from PanVA’s Technical Test.