Explore the pangenome
The functionalities on this page let you actively explore the pangenome.
Retrieve regions from the pangenome
Retrieve sequences and functional annotations from homology groups
Search for genes using a gene name, functional annotation or database node identifier
Align homology groups or genomic regions
GO enrichment analysis
Locate Genes
Identify and compare gene clusters of neighbouring genes based on a set
of homology groups. The function first identifies the genomic position
of the genes in the homology groups, then retrieves the order of genes
per genome and uses this order to construct the gene clusters. If
homology groups with multiple genomes were selected, the gene cluster
composition is compared between genomes. When a --phenotype is included,
gene clusters can be found that only consist of groups of a certain
phenotype.
For example, 100 groups were predicted as core in a pangenome of 5
genomes. The gene clusters are first identified per genome, after which
the gene order of one genome is compared to all the other genomes. The
result could be 75 groups with genes that are not only homologous but
also share their gene neighbourhood. As another example, when accessory
groups (present in 2 to 4 genomes) are given to this function in
combination with a --phenotype (assigned to only two genomes), the
function can return clusters that are only found in the phenotype
members.
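The cluster construction described above can be illustrated with a small Python sketch. It is not the PanTools implementation: the function name `build_clusters`, the tuple layout of a gene, and the way `max_nucleotides` and `max_gaps` stand in for --nucleotides and --gap-open are assumptions made for this example only.

```python
from typing import List, Tuple

# Hypothetical gene record: (name, start, end, belongs_to_input_groups)
Gene = Tuple[str, int, int, bool]


def build_clusters(genes: List[Gene], max_nucleotides: int = 1_000_000,
                   max_gaps: int = 0) -> List[List[str]]:
    """Group neighbouring genes of ONE sequence into clusters.

    max_nucleotides: largest distance allowed between two neighbouring
        cluster genes (stands in for --nucleotides).
    max_gaps: number of genes outside the input groups tolerated inside
        one cluster (stands in for --gap-open).
    """
    clusters: List[List[str]] = []
    current: List[str] = []
    pending: List[str] = []  # non-input genes seen since the last input gene
    gaps_used = 0
    last_end = 0

    for name, start, end, selected in sorted(genes, key=lambda g: g[1]):
        if not selected:
            if current:  # gaps only count inside an open cluster
                pending.append(name)
            continue
        if current:
            too_far = start - last_end > max_nucleotides
            too_many_gaps = gaps_used + len(pending) > max_gaps
            if too_far or too_many_gaps:
                clusters.append(current)  # close the cluster
                current, gaps_used = [], 0
            else:
                current.extend(pending)  # absorb the tolerated gap genes
                gaps_used += len(pending)
        pending = []
        current.append(name)
        last_end = end
    if current:
        clusters.append(current)
    return clusters


genes = [("a", 0, 100, True), ("b", 150, 250, True), ("x", 300, 400, False),
         ("c", 450, 550, True), ("d", 5_000_000, 5_000_100, True)]
print(build_clusters(genes))              # gene x splits the first cluster
print(build_clusters(genes, max_gaps=1))  # gene x is tolerated as a gap
```

In this toy model, comparing clusters between genomes would amount to running the function per genome and comparing the resulting lists of homology groups.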
- Parameters
<databaseDirectory>
Path to the database root directory.
<homologyFile>
A text file with homology group node identifiers, separated by a comma.
- Options
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--phenotype
/-p
A phenotype name, used to identify gene clusters shared by all phenotype members.
--nucleotides
The number of allowed nucleotides between two neighbouring genes (default is 1 MB).
--gap-open
When constructing the clusters, allow a number of genes for each cluster that are not originally part of the input groups (default: 0).
--core-threshold
Lower the threshold (%) for a group to be considered (soft) core (default is the total number of genomes found in the groups, not a percentage).
--ignore-duplications
Duplicated and co-localized genes no longer break up clusters.
- Example commands
$ pantools locate_genes tomato_DB phenotype_groups.csv
$ pantools locate_genes --nucleotides=5000 --gap-open=1 tomato_DB unique_groups.csv
$ pantools locate_genes --ignore-duplications --core-threshold=95 tomato_DB accessory_groups.csv
- Output
Output files are stored in database_directory/locate_genes/
gene_clusters_by_position.txt, the identified gene clusters ordered by their position in the genome.
gene_clusters_by_size.txt, the identified gene clusters ordered from largest to smallest.
compare_gene_clusters, the composition of found gene clusters is compared to the other genomes. For each cluster, it shows which parts match other clusters and which parts do not. The file is not created when homology groups only contain proteins of a single genome (unique).
When a --phenotype is included:
phenotype_clusters, homology group node identifiers from phenotype shared and specific clusters.
compare_gene_clusters_PHENOTYPE.txt, the same information as compare_gene_clusters but now the gene cluster comparison is only done between phenotype members.
Find genes by name
Find your genes of interest in the pangenome by using the gene name and
extract the nucleotide and protein sequences. To be able to find a gene,
every letter of the given input must match a gene name. The search is
not case sensitive. Performing a search with ‘sonic1’ as query will not
be able to find ‘sonic’, but is able to find Sonic1, SONIC1 or sOnIc1.
Including the --extensive option allows a more relaxed search, and
using ‘sonic’ will now also find gene name variations such as ‘sonic1’,
‘sonic3’, etc.
Be aware, for this function to work it is important that genomes are annotated by a method that follows the rules for genetic nomenclature. Gene naming can be inconsistent when different tools are used for genome annotation, making this functionality ineffective.
This function is the same as mlsa_find_genes but uses a different output directory. Several warnings (shown in the other manual) can be generated during the search. These warnings are less relevant for this function as the genes are not required to be single-copy orthologous.
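The matching rules described above can be mimicked with a few lines of Python. This is only a sketch of the behaviour in the ‘sonic’ example, not the code PanTools runs: the function name `match_gene_names` is hypothetical, and treating --extensive as a case-insensitive prefix match is an assumption based on that example.

```python
from typing import Iterable, List


def match_gene_names(query: str, gene_names: Iterable[str],
                     extensive: bool = False) -> List[str]:
    """Return the gene names that match `query`, ignoring case.

    Default: the complete name must equal the query.
    extensive=True: ASSUMPTION, the name only has to start with the query.
    """
    q = query.lower()
    if extensive:
        return [n for n in gene_names if n.lower().startswith(q)]
    return [n for n in gene_names if n.lower() == q]


names = ["sonic", "Sonic1", "SONIC1", "sOnIc1", "sonic3", "tails"]
print(match_gene_names("sonic1", names))
print(match_gene_names("sonic", names, extensive=True))
```

The first call returns only the three spellings of ‘sonic1’; the second also returns ‘sonic’ and ‘sonic3’.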
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--genes
/-g
Required. One or multiple gene names, separated by a comma.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--extensive
Perform a more extensive gene search.
- Example commands
$ pantools find_genes_by_name --genes=dnaX,gapA,recA tomato_DB
$ pantools find_genes_by_name --extensive -g=gapA tomato_DB
- Output
Output files are stored in /database_directory/find_genes/by_name/. For each gene name that was included, a nucleotide and a protein FASTA file are created with the sequences found in all genomes.
find_genes_by_name.log, relevant information about the extracted genes: node identifier, gene location, homology group, etc.
Find genes by annotation
Find genes of interest in the pangenome that share a functional annotation node and extract the nucleotide and protein sequence.
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
Requires one of --functions|--nodes.
--functions
One or multiple function identifiers (GO, InterPro, PFAM, TIGRFAM), separated by a comma.
--nodes
/-n
One or multiple identifiers of function nodes (GO, InterPro, PFAM, TIGRFAM), separated by a comma.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
- Example commands
$ pantools find_genes_by_annotation --nodes=14928,25809 tomato_DB
$ pantools find_genes_by_annotation --functions=PF00005,GO:0000160,IPR000683,TIGR02499 tomato_DB
- Output
Output files are stored in /database_directory/find_genes/by_annotation/. For each function (node) that was included, a nucleotide and a protein FASTA file are created with the sequences of the genes that are connected to the node.
find_genes_by_annotation.log, relevant information about the extracted genes: node identifier, gene location, homology group, etc.
Find genes in region
Find genes of interest in the pangenome that lie within a given region or, with --partial, only partially overlap it. For each gene found, relevant information, the nucleotide sequence and the protein sequence are extracted.
- Parameters
<databaseDirectory>
Path to the database root directory.
<regionsFile>
A text file containing genome locations with on each line: a genome number, sequence number, begin and end position, separated by a space.
- Options
--partial
Also retrieve genes that only partially overlap the input regions.
- Example input file
Each line must have a genome number, a sequence number, and a begin and end position, separated by a space.
195 1 477722 478426
71 10 17346 18056
138 47 159593 160300
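A regions file in the layout described above can also be generated from a script. The Python sketch below is one way to do that. The helper name `format_regions` is hypothetical, and the check that the begin position does not exceed the end position is an assumption of this example rather than a documented PanTools requirement.

```python
from typing import Iterable, Tuple

Region = Tuple[int, int, int, int]  # genome, sequence, begin, end


def format_regions(regions: Iterable[Region]) -> str:
    """Return the regions as text, one 'genome sequence begin end' per line."""
    lines = []
    for genome, sequence, begin, end in regions:
        # Assumption of this sketch: begin must not exceed end.
        if begin > end:
            raise ValueError(f"begin {begin} is larger than end {end}")
        lines.append(f"{genome} {sequence} {begin} {end}")
    return "\n".join(lines) + "\n"


regions = [(195, 1, 477722, 478426),
           (71, 10, 17346, 18056),
           (138, 47, 159593, 160300)]
print(format_regions(regions), end="")
```

Writing the returned text to a file such as regions.txt gives an input that has the same layout as the example above.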
- Example commands
$ pantools find_genes_in_region tomato_DB regions.txt
$ pantools find_genes_in_region --partial tomato_DB regions.txt
- Output
Output files are stored in /database_directory/find_genes/in_region/. For each region that was included, a nucleotide and a protein FASTA file are created with the sequences of the genes that are found within the region.
find_genes_in_region.log, relevant information about the extracted genes: node identifier, gene location, homology group, etc.
Show GO
For a selection of ‘GO’ nodes, retrieves the connected ‘mRNA’ nodes, the child GO terms and all parent GO terms that are higher in the GO hierarchy. This function follows the ‘is_a’ relationships of each GO node to its parent GO terms until the ‘biological process’, ‘molecular function’ or ‘cellular location’ node is reached. This is useful when InterProScan annotations were included, as these only add the most specific GO terms of the hierarchy to a sequence.
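Following ‘is_a’ relationships up to a root term is a plain graph traversal, which the Python sketch below shows on a toy hierarchy. The term labels, the `IS_A` dictionary and the function name `ancestors_by_layer` are made up for illustration; PanTools itself reads these relationships from the pangenome database.

```python
from typing import Dict, List, Set

# Toy 'is_a' graph: child term -> parent terms (labels are made up).
IS_A: Dict[str, List[str]] = {
    "C": ["B"],
    "D": ["B", "E"],
    "B": ["A"],
    "E": ["A"],
    "A": [],  # root, comparable to 'biological process'
}


def ancestors_by_layer(term: str, is_a: Dict[str, List[str]]) -> List[Set[str]]:
    """Return the parents of `term` layer by layer, up to the root."""
    layers: List[Set[str]] = []
    seen: Set[str] = {term}
    frontier: Set[str] = {term}
    while frontier:
        # Collect all not yet visited parents of the current layer.
        parents = {p for t in frontier for p in is_a.get(t, []) if p not in seen}
        if not parents:
            break
        layers.append(parents)
        seen |= parents
        frontier = parents
    return layers


print(ancestors_by_layer("D", IS_A))
```

For term D the first layer holds B and E and the second layer holds the root A, which corresponds to the "all layers above" part of the report.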
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
Requires one of --functions|--nodes.
--functions
One or multiple GO term identifiers, separated by a comma.
--nodes
/-n
One or multiple identifiers of ‘GO’ nodes, separated by a comma.
- Example commands
$ pantools show_go --functions=GO:0000001,GO:0000002,GO:0008982 tomato_DB
$ pantools show_go --nodes=15078,15079 tomato_DB
- Output
show_go.txt, information of the selected GO node(s): the connected ‘mRNA’ nodes, the GO layer below, and all layers above.
Compare GO
Check whether and how two given GO terms are related. For both nodes, the function follows the ‘is_a’ relationships up to their parent GO terms until the ‘biological process’, ‘molecular function’ or ‘cellular location’ node is reached. After all parent terms are found, the shared GO terms and their locations in the hierarchy are reported.
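The comparison described above boils down to intersecting the ancestor sets of the two terms. The Python sketch below does this on a toy hierarchy and records how many ‘is_a’ steps separate each shared term from both inputs. The labels and the function names `steps_to_ancestors` and `shared_terms` are hypothetical, and the real report may describe the location in the hierarchy differently.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy 'is_a' graph: child term -> parent terms (labels are made up).
IS_A: Dict[str, List[str]] = {
    "C": ["B"], "D": ["B", "E"], "B": ["A"], "E": ["A"], "A": [],
}


def steps_to_ancestors(term: str, is_a: Dict[str, List[str]]) -> Dict[str, int]:
    """Map `term` and every ancestor to the fewest 'is_a' steps from `term`."""
    steps = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in is_a.get(current, []):
            if parent not in steps:
                steps[parent] = steps[current] + 1
                queue.append(parent)
    return steps


def shared_terms(first: str, second: str,
                 is_a: Dict[str, List[str]]) -> Dict[str, Tuple[int, int]]:
    """Return the shared terms with their distance from both input terms."""
    a = steps_to_ancestors(first, is_a)
    b = steps_to_ancestors(second, is_a)
    return {term: (a[term], b[term]) for term in a.keys() & b.keys()}


print(shared_terms("C", "D", IS_A))
```

In this toy graph, C and D share B at one step and the root A at two steps from both terms.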
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
Requires one of --functions|--nodes.
--functions
One or multiple GO term identifiers, separated by a comma.
--nodes
/-n
One or multiple identifiers of ‘GO’ nodes, separated by a comma.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
- Example commands
$ pantools compare_go --functions=GO:0032775,GO:0006313 tomato_DB
$ pantools compare_go --nodes=741487,741488 tomato_DB
- Output
Output files are stored in database_directory/function/
compare_go.txt, information of the two GO nodes: the connected ‘mRNA’ nodes, the GO layer below, all layers above and the shared GO terms between the two nodes.
Group info
Report all available information of one or multiple homology groups.
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--homology-file
/-H
A text file with homology group node identifiers, separated by a comma. Default is all homology groups. (Mutually exclusive with --homology-groups.)
--homology-groups
/-G
A comma separated list of homology group node identifiers. Default is all homology groups. (Mutually exclusive with --homology-file.)
--functions
Function identifiers from GO, PFAM, InterPro or TIGRFAM. To find Phobius (P) or SignalP (S) annotations, include: ‘secreted’ (P/S), ‘receptor’ (P/S), or ‘transmembrane’ (P).
--genes
/-g
One or multiple gene names, separated by a comma.
--node
Retrieve the nucleotide nodes belonging to genes in homology groups.
- Example commands
$ pantools group_info -H core_groups.txt yeast_DB
$ pantools group_info --functions=GO:0032775,GO:0006313 --genes=budC,estP -H core_groups.txt yeast_DB
- Output
Output files are stored in database_directory/alignments/grouping_v?/groups/. For each homology group that was included, a nucleotide and a protein FASTA file are created with the sequences found in all genomes.
group_info.txt, relevant information for each homology group: number of copies per genome, gene names, mRNA node identifiers, functions, protein sequence lengths, etc.
group_functions.txt, full description of the functions found in homology groups
When function identifiers are included via --functions:
groups_with_function.txt, homology group node identifiers from groups that match one of the input functions.
When gene names are included via --genes:
groups_with_name.txt, homology group node identifiers from groups that match one of the input gene names.
Retrieve regions
Retrieve the full genome sequence or genomic regions from the pangenome.
- Parameters
<databaseDirectory>
Path to the database root directory.
<regionsFile>
A text file containing genome locations with on each line: a genome number, sequence number, begin and end positions separated by a space.
- Example commands
$ pantools retrieve_regions pecto_DB regions.txt
- Example input
To extract:
Complete genome - Include a genome number
An entire sequence - Include a genome number with sequence number
A genomic region - Include a genome number, a sequence number, and a begin and end position, separated by a space. Place a minus symbol after the region to extract the reverse complement sequence of the region.
1
1 1
1 1 1 10000
1 1 1000 1500 -
195 1 477722 478426
71 10 17346 18056 -
138 47 159593 160300 -
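The three kinds of input lines listed above, plus the optional minus symbol, can be told apart by the number of fields. The Python sketch below parses such lines and shows what a reverse complement is. The names `Region`, `parse_region` and `reverse_complement` are hypothetical and only illustrate the file layout; they are not part of PanTools.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    genome: int
    sequence: Optional[int] = None  # None: the complete genome
    begin: Optional[int] = None     # None: the entire sequence
    end: Optional[int] = None
    reverse: bool = False           # True when the line ends with '-'


def parse_region(line: str) -> Region:
    """Parse one line of a regions file into a Region."""
    fields = line.split()
    reverse = bool(fields) and fields[-1] == "-"
    if reverse:
        fields = fields[:-1]
    numbers = [int(f) for f in fields]
    if len(numbers) not in (1, 2, 4):
        raise ValueError(f"cannot interpret region line: {line!r}")
    return Region(*numbers, reverse=reverse)


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


for line in ["1", "1 1", "1 1 1 10000", "1 1 1000 1500 -"]:
    print(parse_region(line))
print(reverse_complement("AACGT"))
```

Keeping the strand as a separate flag means the coordinates are always read in the forward direction and the sequence is reversed only after extraction.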
- Output
A single FASTA file is created for all given locations and is stored in the database directory.
Retrieve features
Retrieve the sequences of annotated features from the pangenome.
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--feature-type
Required. The feature name; for example ‘gene’, ‘mRNA’, ‘exon’, ‘tRNA’, etc.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
- Example commands
$ pantools retrieve_features --feature-type=gene pecto_DB
$ pantools retrieve_features --feature-type=mRNA -i=1-5 pecto_DB
- Output
For each genome a FASTA file containing the retrieved features will be stored in the database directory. For example, genes.1.fasta contains all the genes annotated in genome 1.