Grouping proteins
Group
Generate homology groups based on similarity of protein sequences. The
resulting homology groups connect similar sequences in the pangenome
database. Homology groups contain not only orthologous pairs, but also
pairs of homologs duplicated after the speciation of the two species,
so-called in-paralogs. The sizes of the groups are controlled by the
--relaxation
parameter that can be set very strict or more lenient,
depending on the evolutionary distance of the genomes. When you are
unsure which relaxation setting is most suitable for your dataset,
running the optimal_grouping
functionality is recommended.
Be aware that not every sequence within a homology group has to be similar to the other sequences. For example, two non-similar protein sequences each have a high-similarity hit with the same protein sequence but align to a different region, one at the start and one near the end of the sequence.
When you want to run group another time but with different parameters, the currently active grouping must first either be moved or removed. This can be achieved with the remove or deactivate grouping functions.
--relaxation
), the proteins are
connected with each other in the similarity graph. Every similarity
component is then passed to the MCL (Markov clustering) algorithm to
be possibly broken into several homology groups.relaxation
parameter is a combination of four sub-parameters:
intersection rate
, similarity threshold
, mcl inflation
and contrast
. The values for these parameters for each relaxation
setting can be seen in the table below. We strongly recommend using the
--relaxation
option to control the grouping, but advanced users still
have the option to control the individual sub-parameters.
Relaxation
1
2
3
4
5
6
7
8
Intersection rate
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
Similarity threshold
95
85
75
65
55
45
35
25
Mcl inflation
10.8
9.6
8.4
7.2
6.0
4.8
3.6
2.4
Contrast
8
7
6
5
4
3
2
1
- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--threads
/-t
Number of parallel working threads, default is the number of available cores or 8, whichever is lower.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--annotations-file
/-A
A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.
--longest
Only cluster protein sequences of the longest transcript per gene.
--scoring-matrix
The scoring matrix used, default is BLOSUM62.
--relaxation
The relaxation in homology calls. Should be in range [1-8], from strict to relaxed. This argument automatically sets the four remaining arguments stated below.
--intersection-rate
The fraction of k-mers that needs to be shared by two intersecting proteins. Should be in range [0.001,0.1].
--similarity-threshold
The minimum normalized similarity score of two proteins. Should be in range [1..99].
--mcl-inflation
The MCL inflation. Should be in range [1,19].
--contrast
The contrast factor. Should be in range [0,10].
- Example commands
$ pantools group -t=12 -r=4 tomato_DB $ pantools group --intersection-rate=0.05 --similarity-threshold=65 --mcl-inflation=7.2 --contrast=5 tomato_DB
- Output
pantools_homology_groups.txt, overview of the created homology groups. Each line represents one homology group, starting with the homology group (database) identifier followed by a colon (:) and mRNA identifiers (from GFF) that are separated by a space. To ensure all identifiers are unique in this file, the mRNA ids are extended by a hash symbol (#) and a genome number. The following line is example output of an homology group with two genes from genome 1 and 146:
14001754: DLACAPHP_00001_mRNA#1 OPJEMMMF_03822_mRNA#146
- Relevant literature
Efficient inference of homologs in large eukaryotic pan-proteomes
BUSCO protein
BUSCO is used in PanTools to estimate the optimal settings for grouping relaxation, in conjunction with the optimal_grouping function.
BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly. Proteins are placed into categories of Complete and single-copy (S), Complete and duplicated (D), fragmented (F), or missing (M). This function is able to run BUSCO v3, v4 or v5 against protein sequences of the pangenome.
The number of reported duplicated genes in eukaryotes is often to high
as different protein isoforms are counted multiple times. To adjust the
imprecise duplication score, include the --longest-transcripts
argument to the command.
When using BUSCO v3, go to https://busco.ezlab.org, download a odb9 set, and untar it with
tar -xvzf
. Include the entire directory in the command using the--input-file
argument.For BUSCO v4 or v5, you only have to provide the odb10 database name with the
--input-file
argument, the database is downloaded automatically. To get a full list of the available datasets, runbusco --list-datasets
.
- Required software
BUSCO must be set to your $PATH. For v3, test if the
which run_BUSCO.py
command displays the full path so it can accessed anywhere. For v4 and v5, test ifbusco
is executable.- Parameters
<databaseDirectory>
Path to the database root directory.
Options
--threads
/-t
Number of parallel working threads, default is the number of available cores or 8, whichever is lower.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--annotations-file
/-A
A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.
--busco-version
/-v
The BUSCO version. Select either 4 or 5 (default). (default).
--odb10
An odb10 benchmark dataset name.
--longest
Only search against the longest protein-coding transcript of genes.
skip-busco
A list of questionable BUSCOs. The completeness score is recalculated by skipping these genes.
- Example commands
$ pantools busco_protein --odb10=bacteria_odb10 bacteria_DB $ pantools busco_protein --skip-busco=POG093P01OY,POG093P0009,POG093P022K,POG093P027M,POG093P00Z2,POG093P013J bacteria_DB
- Output
The BUSCO scores are stored inside BUSCO nodes of the pangenome graph. Output files are written to the busco directory inside the database.
busco_scores.txt, overview of the BUSCO scores per genome. Average and median statistics are calculated per category.
busco_overview.csv, a table which combines the completeness scores per genome together with the duplicated, fragmented and missing BUSCO genes.
hmm_overview.txt, a list of BUSCO genes showing the assigned categories per genome.
Optimal grouping
Finding the most suitable settings for group
can be difficult and is always dependent on evolutionary distance of the
genomes in the pangenome. This functionality runs group on all eight
--relaxation
settings, from strictest (d1) to the most relaxed (d8).
To find the optimal setting, complete and non-duplicated BUSCO genes
that are present in all genomes are used to validate each setting.
\[ \begin{align}\begin{aligned}Recall &= \frac{TP}{TP + FN}\\Precision &= \frac{TP}{TP + FP}\\F-score &= 2 \frac{Recall * Precision}{Recall + Precision}\end{aligned}\end{align} \]
- Required software
- Parameters
<databaseDirectory>
Path to the database root directory.
<buscoDirectory>
The output directory created by the busco_protein function. This directory is found inside the pangenome database, in the busco directory.
- Options
--threads
/-t
Number of parallel working threads, default is the number of available cores or 8, whichever is lower.
--include
/-i
Only include a selection of genomes.
--exclude
/-e
Exclude a selection of genomes.
--annotations-file
/-A
A text file with the identifiers of annotations to be included, each on a separate line. The most recent annotation is selected for genomes without an identifier.
--fast
Assume the optimal grouping is found when the F1-score drops compared to the previous clustering round.
--longest
Only cluster protein sequences of the longest transcript per gene.
--scoring-matrix
The scoring matrix used, default is BLOSUM62.
--relaxation
Only consider a selection of relaxation settings (1-8 allowed).
- Example commands
$ pantools optimal_grouping bacteria_DB bacteria_DB/busco/bacteria_odb9 $ pantools optimal_grouping -t=12 --fast bacteria_DB bacteria_DB/busco/bacteria_odb9 $ pantools optimal_grouping -tn=12 --relaxation=1,2,3 bacteria_DB bacteria_DB/busco/bacteria_odb9 $ Rscript optimal_grouping.R
- Output
After each clustering round, homology groups are incorporated in the graph. A text file with homology group and gene identifiers is stored in the group directory in the pangenome database. This file is named after the used sequence similarity threshold (25-95). Each line represents one homology group, starting with the homology group (database) identifier followed by a colon (:) and mRNA identifiers (from GFF) that are separated by a space. The mRNA identifiers are extended by a hash (#) and their genome number. The following line is example output of an homology group with two genes from genome 1 and 146:
14001754: DLACAPHP_00001_mRNA#1 OPJEMMMF_03822_mRNA#146
Output files are written to optimal_grouping directory inside the database.
grouping_overview.csv, a summary of the benchmark statistics. Use this file to find the most suitable grouping for your pangenome.
optimal_grouping.R, Rscript to plot FN and FP values per grouping.
counts_per_busco.info, a log file of the scoring. Shows in which homology groups the BUSCO genes were placed for the different groupings.
Change grouping
Only a single homology grouping can be active in the pangenome. Use this function to change the active grouping version. Information of the available groupings and used settings is stored in the ‘pangenome’ node (inside the database) and can be created by running grouping_overview.
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--grouping-version
/-v
Required. The version of homology grouping to become active.
- Example commands
$ pantools change_grouping -v=5 tomato_DB
Remove grouping
Delete all ‘homology_group’ nodes and ‘is_similar’ relations between ‘mRNA’ nodes from the database.
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--fast
Do not remove the ‘is_similar’ relationships between mRNA nodes. This does not influence the next grouping.
--grouping-version
/-v
Select a specific grouping version to be removed. Should be either a grouping number, ‘all’ for all groupings or ‘all_inactive’ for all inactive groupings.
- Example commands
$ pantools remove_grouping --version=1 tomato_DB $ pantools remove_grouping --version=all --fast tomato_DB $ pantools remove_grouping --version=all_inactive tomato_DB
Deactivate grouping
Relabel ‘homology_group’ nodes to ‘inactive_homology_group’. The moved grouping can be activated again with change_grouping.
- Parameters
<databaseDirectory>
Path to the database root directory.
- Options
--fast
Do not remove the ‘is_similar’ relationships between mRNA nodes. This does not influence the next grouping.
- Example commands
$ pantools deactivate_grouping tomato_DB $ pantools deactivate_grouping --fast tomato_DB
Grouping overview
Reports the content of all (active & inactive) homology groups for the
different groupings in the pangenome. Include --fast
into the
command to get a quick overview of the available groupings and the
settings that were used.
- Parameters
<databaseDirectory>
Path to the pangenome database root directory.
- Options
--fast
Only show which grouping is active and which groupings can be activated.
- Example commands
$ pantools grouping_overview tomato_DB $ pantools grouping_overview --fast tomato_DB
- Output
Output files are written to /database_directory/group/
grouping_overview.txt, all homology groups in the pangenome. For each homology group, the total number of members and the number of members per per genome is reported.
current_pantools_homology_groups.txt, overview of the active homology groups. Each line represents one homology group. The line starts with the homology group (database) identifier followed by a colon and the rest are mRNA IDs (from gff/genbank) seperated by a space.