Part 4. Characterization ======================== Part 4 preparation ------------------ PanTools v3 is required to follow this part of the tutorial. In addition, MAFFT and R (and a few packages) need to be installed and set to your $PATH. Everything should already be correctly installed if you use the conda environment. Validate if the tools are executable by using the following commands. .. code:: bash $ Rscript --help $ mafft -h We assume a PanTools alias was set during the :ref:`installation `. This allows PanTools to be executed with ``pantools`` rather than the full path to the jar file. If you don’t have an alias, either set one or replace the pantools command with the full path to the .jar file in the tutorials. Input data ---------- .. csv-table:: :file: /tables/bacteria_datasets.csv :header-rows: 1 :delim: ; To demonstrate how to use the PanTools functionalities we use a small dataset of six bacteria to avoid long runtimes. Download a pre-constructed pangenome or test your new skills and construct a pangenome yourself using the fasta and gff files. Option 1: Download separate genome and annotation files .. code:: bash $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_input.tar.gz $ tar -xvzf pecto_dickeya_input.tar.gz $ gzip -d pecto_dickeya_input/annotations/* $ gzip -d pecto_dickeya_input/genomes/* $ gzip -d pecto_dickeya_input/functions/* $ pantools build_pangenome pecto_dickeya_DB pecto_dickeya_input/genomes.txt $ pantools add_annotations --connect pecto_dickeya_DB pecto_dickeya_input/annotations.txt $ pantools group --relaxation=4 pecto_dickeya_DB Option 2: Download the pre-constructed pangenome .. code:: bash $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz $ tar -xvzf pecto_dickeya_DB.tar.gz -------------- Adding phenotype/metadata to the pangenome ------------------------------------------ Before starting with the analysis, we will add some phenotype data to the pangenome. Phenotypes allow you to find similarities for a group of genomes sharing a phenotype as well as identifying variation between different phenotypes. Below is a textfile with data for three phenotypes. The third phenotype, *low_temperature*, is in this case a made up example! It states whether the strain is capable of growing on (extreme) low temperatures. The phenotype file can be found inside the database directory or create a new file using the text from the box below. Add the phenotype information to the pangenome using :ref:`add_phenotypes `. .. code:: text Genome, species, strain_name, low_temperature 1,P. odoriferum,P. odoriferum Q166, false 2,P. fontis, P. fontis M022, true 3,P. polaris,P. polaris S4.16.03.2B, false 4,P. brasiliense, P. brasiliense S2, true 5,P. brasiliense, P. brasiliense Y49, false 6,D. dadantii, D. dadantii 3937,? .. code:: bash $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_input/phenotypes.txt -------------- Metrics and general statistics ------------------------------ After building or uncompressing the pangenome, run the :ref:`metrics ` functionality to produce various statistics that should verify an errorless construction. .. code:: bash $ pantools metrics pecto_dickeya_DB Open **metrics_per_genome.csv** with a spreadsheet tool (Excel, Libreoffice, Google sheets) and make sure the columns are split on commas. You may easily notice the many empty columns in this table as these type of annotations or features are not included in the database (yet). Functional annotations are incorporated later in this tutorial. Columns for features like exon and intron will remain empty as bacterial coding sequences are not interrupted. -------------- Gene classification ------------------- With the :ref:`gene_classification ` functionality you are able to organize the gene repertoire into the core, accessory or unique part of the pangenome. - **Core**, a gene is present in all genomes - **Unique**, a gene is present in a single genome - **Accessory**, a gene is present in some but not all genomes .. code:: bash $ pantools gene_classification pecto_dickeya_DB .. figure:: /figures/classify_table.png :width: 600 :align: center Take a look in **gene_classification_overview.txt**. Here you can find the number of classified homology groups and genes on a pangenome level but also for individual genomes. Open **additional_copies.csv** with a spreadsheet tool. This file can be useful to identify duplicated genes in relation to other genomes. The default criteria to call a group core is presence in all genomes where unique is only allowed to be present in one genome. These two categories are highly influenced by annotation quality, especially in large pangenomes. Luckily, the threshold for core and unique groups can easily be adjusted. Let’s consider genes to be core when present in only five of the six genomes by setting the ``--core-threshold`` argument. .. code:: bash $ pantools gene_classification --core-threshold=85 pecto_dickeya_DB Look in **gene_classification_overview.txt** again to observe the increase of core groups/genes at the cost of accessory groups. For this pangenome, the *Dickeya* genome is considered an outgroup to the five *Pectobacterium* genomes. While this outgroup is needed to root and analyze phylogenetic trees (tutorial part 5), it affects the number classified groups for the all other genomes. Use ``--include`` or ``--exclude`` to exclude the *Dickeya* genome. .. code:: bash $ pantools gene_classification --include=1-5 pecto_dickeya_DB $ pantools gene_classification --exclude=6 pecto_dickeya_DB Take a look in **gene_classification_overview.txt** one more time to examine the effect of excluding this genome. The total number of groups in the analysis is lower now but the number of core and unique genes have increased for the five remaining genomes. When phenotype information is used in the analysis, three additional categories can be assigned to a group: - **Shared**, a gene present in all genomes of a phenotype - **Exclusive**, a gene is only present in a certain phenotype - **Specific**, a gene present in all genomes of a phenotype and is also exclusive Include a ``--phenotype`` argument to find genes that are exclusive for a certain species. .. code:: bash $ pantools gene_classification --phenotype=species pecto_dickeya_DB Open **gene_classification_phenotype_overview.txt** to see the number of classified groups for the species phenotype. Open **phenotype_disrupted.csv** in a spreadsheet tool. This file explains exactly why a homology groups is labeled as phenotype shared and not specific. Open **phenotype_additional_copies.csv** in a spreadsheet tool. Similarly to *phenotype_additional.csv* this file shows groups where all genomes of a certain phenotype have additional gene copies to (at least one of) the other phenotypes. Each time you run the :ref:`gene_classification ` function, multiple files are created that contain node identifiers of a certain homology group category. These files can be given to other PanTools functions for a downstream analysis, for example, sequence alignment, phylogeny, or GO enrichment. We will use one of the files later in this tutorial. -------------- Pangenome structure ------------------- With the previous functionality we identified the core, accessory and unique parts of the pangenome. Now we will use the :ref:`pangenome_size_genes ` function to observe how these numbers are reached by simulating the growth of the pangenome. Simulating the growth helps explaining if a pangenome should be considered open or closed. An pangenome is called open as long as a significant number of new (unique) genes are added to the total gene repertoire. The openness of a pangenome is usually tested using Heap’s law. Heaps’ law (a power law) can be fitted to the number of new genes observed when increasing the pangenome by one random genome. The formula for the power law model is n = k x N-a, where n is the newly discovered genes, N is the total number of genomes, and k and a are the fitting parameters. A pangenome can be considered open when a < 1 and closed if a > 1. The outcome of the function can again be controlled through command line arguments. Genomes can be excluded from the analysis with ``--exclude``. You can set the number of iterations with ``--loops``. .. code:: bash $ pantools pangenome_structure pecto_dickeya_DB The current test set of six bacterial genomes is not representative of a full-sized pangenome. Therefore we prepared the results for the structure simulation on a set of 197 *Pectobacterium* genomes. The runtime of the analysis using 10.000 loops and 24 threads was 1 minute and 54 seconds. Download the files here, unpack the archive and take a look at the files. .. code:: bash $ wget wget http://bioinformatics.nl/pangenomics/tutorial/pectobacterium_structure.tar.gz $ tar -xvf pectobacterium_structure.tar.gz Normally you still have to run the R scripts to create the output figures and determine the openness of the pangenome. .. code:: bash cd pectobacterium_structure $ Rscript pangenome_growth.R $ Rscript gains_losses_median_and_average.R $ Rscript heaps_law.R Take a look at the plot. In **core_accessory_unique_size.png**, the number of classified groups are plotted for any of the genome combination that occurred during the simulation. For the **core_accessory_size.png** plots, the number of unique groups is combined with accessory groups. The **gains_losses.png** files display the average and mean group gain and loss between different pangenome sizes. The line of the core starts below zero, meaning for every random genome added, the core genome decreases by a number of *X* genes. .. raw:: html
The alpha value of heaps' law is written to heaps_law_alpha.txt. Click here to reveal the result.


With an alpha of 0.53 the Pectobacterium pangenome has an open structure, which is typical for many bacterial species due their extensive exchange of genetic information. Performing this analysis on a set of closely related animal or plant genomes usually results in a closed pangenome, indicating a limited gene pool.

-------------- Functional annotations ---------------------- PanTools is able to incorporate functional annotations into the pangenome by reading output of various functional annotation tools. In this tutorial we only include annotations from InterProScan. Please see the :ref:`add_functions ` manual to check which other tools are available. To include the annotations, create a file **functions.txt** using text from the box below and add it to the command line argument. .. code:: text 1 YOUR_PATH/GCF_002904195.1.gff3 2 YOUR_PATH/GCF_000803215.1.gff3 3 YOUR_PATH/GCF_003595035.1.gff3 4 YOUR_PATH/GCF_000808375.1.gff3 5 YOUR_PATH/GCF_000808115.1.gff3 6 YOUR_PATH/GCA_000147055.1.gff3 .. code:: bash $ pantools add_functions pecto_dickeya_DB functions.txt PanTools will ask you to download the InterPro database. Follow the steps and execute the program again. The complete GO, PFAM, Interpro and TIGRFAM, databases are now integrated in the graph database after. Genes with a predicted function have gained a relationship to that function node. Retrieving a set of genes that share a function is now possible through a simple cypher query. If you would run **metrics** again, statistics for these type functional annotations are calculated. To create a summary table for each type of functional annotation, run :ref:`function_overview `. .. code:: bash $ pantools function_overview pecto_dickeya_DB In **function_overview_per_group.csv** you can navigate to a homology group or gene to see the connected functions. You can also search in the opposite direction, use one of the created overview files for a type of functional annotation and quickly navigate to a function of interest to find which genes are connected. GO enrichment ~~~~~~~~~~~~~ We go back to the output files from gene classification that only contain node identifiers. We can retrieve group functions by providing one the files to :ref:`group_info ` with the ``--homology-groups`` argument. However, interpreting groups by assessing each one individually is not very practical. A common approach to discover interesting genes from a large set is GO-enrichment. This statistical method enables the identification of genes sharing a function that are significantly over or under-represented in a given gene set compared to the rest of the genome. Let’s perform a :ref:`analysis/go_enrichment:go enrichment` on homology groups of the core genome. .. code:: text Phenotype: P._brasiliense, 2 genomes, threshold of 2 genomes 1278537,1282642,1283856,1283861,1283862,1283869,1283906,1283921,1283934,1283941,1283945,1283946 .. code:: bash $ pantools group_info -H brasiliense_groups.csv pecto_dickeya_DB $ pantools go_enrichment -H brasiliense_groups.csv pecto_dickeya_DB Open **go_enrichment.csv** with a spreadsheet tool. This file holds GO terms found in at least one of the genomes, the p-value of the statistical test and whether it is actually enriched after the multiple testing correction. as this is different for each genome a function might enriched in one genome but not in another. A directory with seperate output files is created for each genome, open **go_enrichment.csv** for the genome 4 or 5 in a spreedsheet. Also take a look at the PDF files that visualize part of the Gene ontology hierarchy. Classifying functional annotations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Similarly to classifying gene content, functional annotations can be categorized using :ref:`functional_classification `. This tool provides an easy way to identify functions shared by a group of genomes of a certain phenotype but can also be used to identify core or unique functions. The functionality uses the same set of arguments as **gene_classification**. You can go through the same steps again to see the effect of changing the arguments. .. code:: bash $ pantools functional_classification pecto_dickeya_DB $ pantools functional_classification --core-threshold=85 pecto_dickeya_DB $ pantools functional_classification --exclude=6 pecto_dickeya_DB $ pantools functional_classification --phenotype=species pecto_dickeya_DB -------------- Sequence alignment ------------------ In the final part of this tutorial we will test the alignment function by aligning homology groups. PanTools is able to align genomic regions, genes and proteins to identify SNPs or amino acid changes with :doc:`msa `. Start with the alignment of protein sequences from the 12 *P. brasiliense* specific homology groups. .. code:: bash $ pantools msa --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB Go to the **pecto_dickeya_DB/alignments/grouping_v1/groups/** directory and select one of homology groups and check if you can find the following files - The alignments are written to **prot_trimmed.fasta** and **prot_trimmed.afa**. - A gene tree is written to **prot_trimmed.newick** - **prot_trimmed_variable_positions.csv** located in the **var_inf_positions** subdirectory. This matrix holds every variable position of the alignment; the rows are the position in the alignment and the columns are the 20 amino acids and gaps. - The identity and similarity (for proteins) is calculated between sequences and written to tables in the **similarity_identity** subdirectory. Run the function again but include the ``--no-trimming`` argument. .. code:: bash $ pantools msa --no-trimming --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB The output files are generated right after the first alignment without trimming the sequences first. The file names differ from the trimmed alignments by the ’*_trimmed*’ substring. Run the function again but exclude the ``--mode=protein`` and ``--no-trimming`` arguments. When no additional arguments are included to the command, both nucleotide and protein sequences are aligned two consecutive times. .. code:: bash $ pantools msa --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB Again, the same type of files are generated but the output files from nucleotide sequence can be recognized by the ‘*nuc\_*’ substrings. The matrix in **nuc_trimmed_variable_positions.csv** now only has columns for the four nucleotides and gaps. Finally, run the function one more time but include a phenotype. This allows you to identify phenotype specific SNPs or amino acid changes. .. code:: bash $ pantools msa --no-trimming --phenotype=low_temperature -H=brasiliense_groups.csv pecto_dickeya_DB Open the **nuc**- or **prot_trimmed_phenotype_specific_changes.info** file inside one of the homology group output directories. -------------- Besides the functionalities in this tutorial, PanTools has more useful functions that may aid you in retrieving more specific information from the pangenome. - Identify shared k-mers between genomes with :ref:`kmer_classification `. - Find co-localized genes in a set of homology groups: :ref:`locate_genes `. - Mapping short reads against the pangenome with :ref:`map `. In :doc:`part 5 ` of the tutorial we explore some of the phylogenetic methods implemented in PanTools.