Gene retention

Warning

This is a novel function and has not yet undergone testing by external users. Please report any bugs or issues to the PanTools team so we can improve it.

Visualize the gene retention of sequences relative to a target (reference) sequence. Retention can be based on homology or synteny. The homology-based method requires the protein sequences to be clustered with group. The synteny-based method requires synteny information to be added to the pangenome with add_synteny.

Method
- Go over the mRNA nodes of a selected reference sequence using a sliding window (set by --window-length) that moves in steps of 10 gene positions. Only the longest transcript of each gene is used.
- At each window, the percentage of retention between the reference and a query sequence is calculated as follows: (shared homologs or syntenic genes / window length) * 100. Gene duplications are ignored, so the percentage of retention cannot exceed 100.
- A gene is considered syntenic when it forms a syntenic pair in a synteny block. For homologs, only the presence of a gene is checked; its location does not matter.
- Only full-length windows are visualized; the sliding stops when the window can no longer move 10 mRNAs.
- These steps are repeated for every included reference sequence (determined by --selection-file).
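The windowing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PanTools implementation: the function name retention_profile, the representation of genes as plain labels (for example homology group identifiers), and the small window and step sizes in the usage lines are all choices made for the example.

```python
def retention_profile(ref_genes, query_genes, window_length=100, step=10):
    """Return (window start, retention percentage) for each full-length window.

    ref_genes   -- ordered labels of the reference genes (hypothetical input,
                   e.g. one homology group identifier per gene position)
    query_genes -- set of labels found on the query sequence; using a set
                   means duplicated query genes are counted only once
    """
    profile = []
    start = 0
    # Only full-length windows are reported, so stop once the window
    # would run past the end of the reference sequence.
    while start + window_length <= len(ref_genes):
        window = ref_genes[start:start + window_length]
        shared = sum(1 for gene in window if gene in query_genes)
        profile.append((start, 100.0 * shared / window_length))
        start += step
    return profile


# Small worked example: window of 4 genes, moved 2 positions at a time.
reference = ["g1", "g2", "g3", "g4", "g5", "g6"]
query = {"g1", "g3", "g4", "g9"}
print(retention_profile(reference, query, window_length=4, step=2))
# prints [(0, 75.0), (2, 50.0)]
```

Because each reference gene contributes at most one hit, the percentage stays between 0 and 100 regardless of how many copies the query carries.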
Coloring
There are two options for coloring the line graphs.
- With --coloring=phasing, sequences belonging to the same chromosome get the same color and the phasing (letter) determines the shade. Colors are shared between the different figures of a combination plot for multiple genomes (see example below). This coloring currently works for up to 6 chromosomes (green, red, blue, purple, orange, yellow).
- --coloring=distinct uses up to 21 distinct colors to color line graphs. Colors are not shared between figures of the combination plots with multiple genomes.
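To illustrate the idea behind the phasing scheme (one base color per chromosome, one shade per phasing letter), here is a minimal Python sketch. Everything beyond the six color names is an assumption made for the example: the function name phasing_color, the hue values, and the way lightness grows with the phasing letter are hypothetical and may differ from what PanTools actually draws.

```python
import colorsys

# Base colors in the order listed above; the hue angles (in degrees)
# are illustrative assumptions, not values taken from PanTools.
BASE_HUES = [("green", 120), ("red", 0), ("blue", 240),
             ("purple", 280), ("orange", 30), ("yellow", 60)]


def phasing_color(chromosome, phase):
    """Return a hex color for a 1-based chromosome number and a phasing letter."""
    if not 1 <= chromosome <= len(BASE_HUES):
        raise ValueError("phasing colors are only defined for 6 chromosomes")
    _, hue = BASE_HUES[chromosome - 1]
    # Later phasing letters get a lighter shade of the same hue (capped).
    rank = ord(phase.upper()) - ord("A")
    lightness = min(0.35 + 0.12 * rank, 0.8)
    red, green, blue = colorsys.hls_to_rgb(hue / 360, lightness, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255))


# Two haplotypes of chromosome 1 share a hue but differ in shade.
print(phasing_color(1, "A"), phasing_color(1, "B"))
```

Keeping the hue fixed per chromosome is what lets the same chromosome be recognized across the figures of a combination plot.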
Parameters

<databaseDirectory>

Path to the database root directory.

Options

--window-length

Set the sliding window length. Default is 100.

--sequences-plot

Set the maximum number of sequences per (combination) plot. Default is 20.

--sequences-genome

Set the maximum number of sequences per genome plot. Default is 20.

--selection-file

Text file with sequences (identifiers) that should be used as reference. Default is all sequences.

--include/-i

Only include a selection of genomes.

--exclude/-e

Exclude a selection of genomes.

--selection-file

Text file with rules to use a specific set of genomes and sequences.

--coloring

Coloring mode for the line graphs (“phasing” or “distinct”, see above). Limits the maximum number of sequences per plot to 21.

Example commands
$ pantools gene_retention apple_DB --coloring=phasing
$ pantools gene_retention apple_DB --coloring=phasing --window-length 50
$ pantools gene_retention apple_DB --coloring=distinct
$ pantools gene_retention apple_DB --coloring=distinct --selection-file ref_sequences.txt
Example input

The --selection-file should hold one sequence identifier per line. In the following example, four sequences are selected to be used as reference. All other sequences are still considered query sequences. Use --selection-file, --include or --exclude to adjust the query sequences.

1_1
1_2
2_1
3_1
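Before starting a long run, it can help to check that the file really contains one identifier per line. The Python sketch below does that; it assumes identifiers have the form genome number, underscore, sequence number, as in the example above. The function name parse_selection is hypothetical and is not part of PanTools.

```python
import re

# Assumed identifier layout: <genome number>_<sequence number>, e.g. 1_2.
ID_PATTERN = re.compile(r"^(\d+)_(\d+)$")


def parse_selection(text):
    """Parse selection-file content into (genome, sequence) number pairs."""
    selected = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        identifier = line.strip()
        if not identifier:
            continue  # tolerate blank lines
        match = ID_PATTERN.match(identifier)
        if match is None:
            raise ValueError(
                f"line {line_number}: unexpected identifier {identifier!r}")
        selected.append((int(match.group(1)), int(match.group(2))))
    return selected


print(parse_selection("1_1\n1_2\n2_1\n3_1\n"))
# prints [(1, 1), (1, 2), (2, 1), (3, 1)]
```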
Output

Output files are written to the retention directory in the database. For each selected reference sequence, up to two Rscripts (homologs and syntelogs) are created and placed in a subfolder named after the reference sequence identifier.

  • retention_rscripts.sh, a shell script to execute all Rscripts.

Example output
../_images/apple_retention_chr10.png

Fig. 16 Gene retention of Malus domestica cv Gala chr 10 to four other apple genomes. Genomes 1-4 are chromosome-level assemblies, of which 1-3 are fully haplotype-phased. Genome 4 lacks the red line because that sequence was the selected reference.