Synteny

Calculate synteny

Warning

This is a novel function and has not yet undergone testing by external users. Please report any bugs or issues to the PanTools team so we can improve it.

Estimate synteny between sequences of the pangenome using MCScanX. PanTools generates the GFF and .homology input files that MCScanX requires for every pairwise comparison. The GFF holds the gene identifiers and genomic coordinates. The .homology file is a two-column, tab-separated file that lists which genes are homologous to each other. Two genes are considered homologous when they belong to the same homology group and their protein similarity exceeds the threshold set when running group.
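
The two-column layout of the .homology file makes it easy to inspect outside PanTools. The following is a minimal Python sketch, not part of PanTools, that collects the homologous partners of every gene. The gene names in the example are hypothetical, and the sketch assumes nothing about the file beyond the two tab-separated columns described above.

```python
import csv
import io
from collections import defaultdict


def read_homology(handle):
    """Collect homologous partners per gene from a two-column, tab-separated file."""
    partners = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        gene_a, gene_b = row
        # Store the relationship in both directions.
        partners[gene_a].add(gene_b)
        partners[gene_b].add(gene_a)
    return partners


# Hypothetical file content; real files contain the PanTools gene identifiers.
example = "geneA\tgeneB\ngeneA\tgeneC\n"
partners = read_homology(io.StringIO(example))
print(sorted(partners["geneA"]))  # ['geneB', 'geneC']
```

To read a real file, pass an open file handle instead of the in-memory example, for instance open("mcscanx.homology", newline="").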

Executing MCScanX

PanTools executes MCScanX_h with default settings when the --run
argument is included. Pairwise comparisons are divided over the number of
--threads provided by the user. Because every sequence is compared to
the others, consider the number of sequences in your analysis and create
a subset with --selection-file if needed. The output of each comparison
is written to a separate folder. Once all threads have finished, every
output (.collinearity) file is collected and combined into a single file.
To visualize the synteny blocks, we recommend using SynVisio or AccuSyn.
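
To get a feeling for how quickly the workload grows, the sketch below counts pairwise comparisons and spreads them over a number of workers. It is an illustration only: it assumes one comparison per unordered pair without self-comparisons and a simple round-robin split, neither of which is confirmed as PanTools' actual scheduling. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_comparisons(identifiers):
    """Return every unordered pair of identifiers (assumption: no self-comparisons)."""
    return list(combinations(identifiers, 2))


def split_over_threads(pairs, threads):
    """Distribute the pairs round-robin over the given number of workers."""
    return [pairs[i::threads] for i in range(threads)]


sequence_ids = ["1_1", "1_2", "2_1", "2_2"]
pairs = pairwise_comparisons(sequence_ids)
chunks = split_over_threads(pairs, 3)

print(len(pairs))                     # 6 comparisons for 4 sequences
print([len(c) for c in chunks])       # [2, 2, 2]
print(len(pairwise_comparisons(range(100))))  # 4950 comparisons for 100 sequences
```

Under these assumptions the number of comparisons is n * (n - 1) / 2, which is why restricting the analysis with --selection-file can save considerable time.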

Sequence identifiers change

Both MCScanX and the online visualization tools have issues with the regular identifiers (1_1, 1_2, 2_1, etc.). To work with all tools, PanTools changes the identifiers into a two-letter combination followed by a number. The current limit is 676 genomes with --genome or 676 sequences with --sequence. PanTools tries to give each genome a unique first letter, but this is only possible with 26 or fewer genomes and 26 or fewer sequences per genome.
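
The limit of 676 follows from the number of two-letter combinations (26 x 26). The short Python sketch below enumerates them to show where the numbers come from. It is only an illustration of the arithmetic: the order in which PanTools actually assigns letters is not documented here, and the function name is hypothetical.

```python
from itertools import product
from string import ascii_lowercase


def two_letter_codes():
    """Enumerate all two-letter lowercase combinations: aa, ab, ..., zz."""
    return ["".join(letters) for letters in product(ascii_lowercase, repeat=2)]


codes = two_letter_codes()
print(len(codes))                      # 676
print(codes[0], codes[26], codes[-1])  # aa ba zz

# Only 26 distinct first letters exist, so a unique first letter per genome
# is possible for at most 26 genomes.
print(len({code[0] for code in codes}))  # 26
```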

Required software

MCScanX must be installed manually and added to your $PATH, since it is not available on conda.

Parameters

<databaseDirectory>	Path to the database root directory.

Options

`--include`/`-i`	Only include a selection of genomes. This automatically lowers the threshold for core genes.
`--exclude`/`-e`	Exclude a selection of genomes. This automatically lowers the threshold for core genes.
`--selection-file`	Text file with rules to use a specific set of genomes and sequences. This automatically lowers the threshold for core genes.
`--run`	Run MCScanX_h (default: false)
`--threads`/`-t`	Number of parallel threads to be used.
`--sequence`	Calculate synteny between sequences (from the same genome) instead of genomes.

Example commands

$ pantools calculate_synteny tomato_DB
$ pantools calculate_synteny tomato_DB --sequence
$ pantools calculate_synteny tomato_DB --sequence --run -t=24

Output

Output files are written to the synteny directory in the database.

mcscanx.gff, the genomic coordinates of all genes included in the analysis.
mcscanx.homology, all pairwise homology relationships in the analysis.
synteny_identifiers.csv, table with the original and synteny identifiers.

When --run is included:

mcscanx.collinearity, main output file of MCScanX_h. Contains synteny blocks that consist of pairwise collinear gene pairs. Usable by SynVisio and AccuSyn in combination with mcscanx.gff. Can be added to the pangenome with add_synteny.
A .gff, .homology, and .collinearity file for every sequence combination. Files are written to a folder named after the combination of the two sequence identifiers. This folder also holds two .html files that visualize collinear blocks and duplication depth between the two sequences.

Add synteny

Warning

This is a novel function and has not yet undergone testing by external users. Please report any bugs or issues to the PanTools team so we can improve it.

Add synteny information to the pangenome.

Parameters

<databaseDirectory>	Path to the database root directory.
<collinearityFile>	An MCScanX .collinearity file.

Example commands

$ pantools add_synteny tomato_DB tomato_DB/synteny/mcscanx.collinearity

Example input

Required input is a .collinearity file generated by MCScanX. The example below shows the first three syntenic blocks calculated between sequences 2_3 and 2_28 of an apple genome.

############### Parameters ###############
# MATCH_SCORE: 50
# MATCH_SIZE: 5
# GAP_PENALTY: -1
# OVERLAP_WINDOW: 5
# E_VALUE: 1e-05
# MAX GAPS: 25
############### Statistics ###############
# Number of collinear genes: 217123, Percentage: 80.16
# Number of all genes: 270847
##########################################
## Alignment 0: score=251.0 e_value=1.1e-10 N=6 bk1&cj1 plus
  0-  0:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006370_mRNA1          0
  0-  1:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006500_mRNA1          0
  0-  2:    2_3#Mdg_11A016230_mRNA1 2_28#Mdg_06B006510_mRNA1          0
  0-  3:    2_3#Mdg_11A016330_mRNA1 2_28#Mdg_06B006570_mRNA1          0
  0-  4:    2_3#Mdg_11A016390_mRNA1 2_28#Mdg_06B006590_mRNA1          0
  0-  5:    2_3#Mdg_11A016400_mRNA1 2_28#Mdg_06B006660_mRNA1          0
## Alignment 1: score=258.0 e_value=4.1e-11 N=6 bk1&cj1 minus
  1-  0:    2_3#Mdg_11A015930_mRNA1 2_28#Mdg_06B006850_mRNA1          0
  1-  1:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006770_mRNA1          0
  1-  2:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006760_mRNA1          0
  1-  3:    2_3#Mdg_11A016230_mRNA1 2_28#Mdg_06B006570_mRNA1          0
  1-  4:    2_3#Mdg_11A016330_mRNA1 2_28#Mdg_06B006510_mRNA1          0
  1-  5:    2_3#Mdg_11A016390_mRNA1 2_28#Mdg_06B006500_mRNA1          0
## Alignment 2: score=252.0 e_value=3.1e-12 N=6 bk1&cj1 minus
  2-  0:    2_3#Mdg_11A015930_mRNA1 2_28#Mdg_06B006880_mRNA1          0
  2-  1:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006870_mRNA1          0
  2-  2:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006860_mRNA1          0
  2-  3:    2_3#Mdg_11A016230_mRNA1 2_28#Mdg_06B006770_mRNA1          0
  2-  4:    2_3#Mdg_11A016390_mRNA1 2_28#Mdg_06B006660_mRNA1          0
  2-  5:    2_3#Mdg_11A016400_mRNA1 2_28#Mdg_06B006590_mRNA1          0
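
For readers who want to inspect a .collinearity file themselves, the layout shown above can be read with a few lines of Python. The sketch below is not part of PanTools or MCScanX. It assumes the header and gene-pair layout seen in the example, ignores the final column of each gene-pair line, and uses hypothetical names throughout. The sample is a shortened excerpt of the example, so the number of parsed pairs is smaller than the N value in the headers.

```python
import re

# Block header, e.g. "## Alignment 0: score=251.0 e_value=1.1e-10 N=6 bk1&cj1 plus"
HEADER = re.compile(
    r"^## Alignment (\d+): score=(\S+) e_value=(\S+) N=(\d+) (\S+)&(\S+) (plus|minus)"
)
# Gene-pair line, e.g. "  0-  0:    geneA geneB          0"
PAIR = re.compile(r"^\s*(\d+)-\s*(\d+):\s+(\S+)\s+(\S+)")


def parse_collinearity(lines):
    """Return a list of synteny blocks, each with its header fields and gene pairs."""
    blocks = []
    for line in lines:
        header = HEADER.match(line)
        if header:
            blocks.append({
                "id": int(header.group(1)),
                "score": float(header.group(2)),
                "e_value": float(header.group(3)),
                "size": int(header.group(4)),
                "sequences": (header.group(5), header.group(6)),
                "orientation": header.group(7),
                "pairs": [],
            })
            continue
        pair = PAIR.match(line)
        if pair and blocks:
            # Attach the gene pair to the most recently opened block.
            blocks[-1]["pairs"].append((pair.group(3), pair.group(4)))
    return blocks


sample = """\
# Number of all genes: 270847
## Alignment 0: score=251.0 e_value=1.1e-10 N=6 bk1&cj1 plus
  0-  0:    2_3#Mdg_11A016020_mRNA1 2_28#Mdg_06B006370_mRNA1          0
  0-  1:    2_3#Mdg_11A016050_mRNA1 2_28#Mdg_06B006500_mRNA1          0
## Alignment 1: score=258.0 e_value=4.1e-11 N=6 bk1&cj1 minus
  1-  0:    2_3#Mdg_11A015930_mRNA1 2_28#Mdg_06B006850_mRNA1          0
"""

blocks = parse_collinearity(sample.splitlines())
for block in blocks:
    print(block["id"], block["sequences"], block["orientation"], len(block["pairs"]))
```

Parameter and statistics lines that start with a single # match neither pattern and are skipped.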

Synteny overview

Warning

This is a novel function and has not yet undergone testing by external users. Please report any bugs or issues to the PanTools team so we can improve it.

Parameters

<databaseDirectory>	Path to the database root directory.

Example commands

$ pantools synteny_overview tomato_DB

Output

synteny_blocks_statistics.csv, statistics about homology relationships and synteny blocks between sequences.
blocks_overview.csv, overview of all synteny blocks.

In the synteny directory:

sequence_overlap_per_sequence.csv, statistics of overlap in genes between multiple synteny blocks with other sequences.
sequence_overlap_per_block.csv, statistics of overlap in genes between multiple synteny blocks.

In the synteny/statistics directory:

gene_frequency.csv, overview of which genes belong to which synteny blocks.
genome_overlap.csv, overview of the synteny blocks in each genome.