Sequence visualization

Warning

This is a novel function and has not yet undergone testing by external users. Please report any bugs or issues to the PanTools team so we can improve it.

Generate a visualization of multiple sequences with annotation bars.

Annotation bar types

  1. Gene and repeat coverage. Percentage of repeat and gene coverage calculated within a sliding window on the sequence. A coverage of 100% means every nucleotide of the window is covered.

  2. Found in other chromosome. The gene region is only visible if the gene is also found on another chromosome.

  3. Gene classification. Genes are coloured according to their category: core, accessory or unique.

  4. Haplotype presence. Genes are coloured by the number of haplotypes (phases) they are found in.

  5. Synteny. Syntenic blocks are drawn between two sequences.

../_images/seq_vis_two_haplotypes_numbers.png

Fig. 18 Annotation plot for two sequences with all possible annotation bars.

Input requirements

Each bar type has specific requirements before it can be included. Input requirements are checked upon execution; bar types that fail to meet the conditions are excluded.

  1. Gene and repeat coverage. Both are calculated by this function. To include repeat coverage, repeat annotations must first be added via add_repeats. The size of the window over which coverage is calculated is controlled with --window-size.

  2. Found in another chromosome. Requires phasing information to be added via add_phasing.

  3. Gene classification. Uses the output from the previous gene_classification run.

  4. Haplotype presence. Uses the output of gene_classification run with the --phasing argument.

  5. Synteny. Synteny information must be incorporated in the database. Use calculate_synteny followed by add_synteny. Do not forget to include the --sequence argument for the synteny calculation, because by default synteny is only calculated between genomes.
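
The coverage percentage described for the gene and repeat coverage bar can be pictured with a small sketch. This is not PanTools code: the function name window_coverage is hypothetical, and the sketch assumes consecutive, non-overlapping windows and 0-based, half-open annotation intervals, which may differ from how PanTools places its windows. It only shows how a per-window percentage arises and that overlapping annotations are counted once.

```python
def window_coverage(sequence_length, intervals, window_size):
    """Return the percentage of covered positions for each consecutive window.

    intervals are (start, end) pairs, 0-based and half-open (an assumption
    made for this sketch, not a statement about PanTools internals).
    """
    # One flag per nucleotide; a position covered by several annotations
    # is still counted only once.
    covered = bytearray(sequence_length)
    for start, end in intervals:
        start = max(0, start)
        end = min(sequence_length, end)
        if end > start:
            covered[start:end] = b"\x01" * (end - start)

    percentages = []
    for window_start in range(0, sequence_length, window_size):
        window = covered[window_start:window_start + window_size]
        # The last window may be shorter than window_size.
        percentages.append(100.0 * sum(window) / len(window))
    return percentages


# Two overlapping annotations on a 10 bp sequence, window size 5.
print(window_coverage(10, [(0, 5), (3, 7)], 5))  # [100.0, 40.0]
```

In this toy case the first window is fully covered (100%), while only two of the five positions in the second window are annotated (40%).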

Sequence selection

The visualization can be created with or without synteny information. Plots that include synteny are limited to eight sequences, whereas plots without synteny can show all sequences of a single genome.

Up to eight sequences

When phasing information was included through add_phasing, sequences belonging to the same chromosome (number) are combined in one plot. The sequences are placed in alphabetical order. To set a custom order, make an explicit sequence selection with the sequence rule in the --rules file.

Without phasing information, a sequence selection is required.

Whole genome visualization

A visualization of all sequences of a genome. Available bar types are gene classification and haplotype presence (Fig. 21, Fig. 22).

The R scripts for these visualizations are only generated when the pangenome contains phasing information (add_phasing).

Parameters

<databaseDirectory>

Path to the database root directory.

Options

--include/-i

Only include a selection of genomes. This automatically lowers the threshold for core genes.

--exclude/-e

Exclude a selection of genomes. This automatically lowers the threshold for core genes.

--selection-file

Text file with rules to use a specific set of genomes and sequences. This automatically lowers the threshold for core genes.

--rules

Text file with a set of rules that determine which bar types and which sequences (in which specific order) should be visualized.
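
The remark that selecting genomes lowers the threshold for core genes can be illustrated with a short sketch. It is not the PanTools implementation: the function name classify_gene_group and the "absent" label are hypothetical, and the sketch assumes the common pangenome convention that a core gene is present in every selected genome, a unique gene in exactly one, and an accessory gene in any number in between.

```python
def classify_gene_group(genomes_with_gene, selected_genomes):
    """Classify a gene as core, accessory or unique within a genome selection.

    Assumed convention (not taken from PanTools source code):
    core = present in all selected genomes, unique = present in exactly one,
    accessory = everything in between.
    """
    selected = set(selected_genomes)
    present = set(genomes_with_gene) & selected
    if not present:
        return "absent"  # hypothetical label for genes outside the selection
    if len(present) == len(selected):
        return "core"
    if len(present) == 1:
        return "unique"
    return "accessory"


gene_found_in = {1, 2, 3}

# With all five genomes the gene misses two of them.
print(classify_gene_group(gene_found_in, {1, 2, 3, 4, 5}))  # accessory

# Restricting the analysis to genomes 1-3 (as --include would) makes it core.
print(classify_gene_group(gene_found_in, {1, 2, 3}))  # core
```

The same gene changes category because it now only needs to be present in three genomes instead of five to count as core.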

Example commands

$ pantools sequence_visualization tomato_DB
$ pantools sequence_visualization tomato_DB --rules rules.txt

Example input

The rules in the --rules file determine which bar types are visualized, in which order, and for which sequences.

Include all possible visualizations for all sequences (with phasing information).

gene_classification
haplotype_presence
other_chromosomes
repeat_coverage
gene_coverage

Visualize the haplotype counts for all sequences with phasing information.

haplotype_presence

Create one plot that visualizes the haplotype counts for four selected sequences (with phasing information).

sequence 1_2,1_3,1_1,1_2
haplotype_presence
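
Before running the tool, a rules file like the ones above can be checked with a small script. The sketch below is not part of PanTools: the names parse_rules, duplicated and KNOWN_BAR_TYPES are hypothetical, and the set of bar type names contains only those that appear in the examples on this page, so a valid rule that is not shown here would end up in the "unrecognised" list.

```python
# Bar type names taken from the examples on this page; the real list
# accepted by PanTools may be longer (assumption).
KNOWN_BAR_TYPES = {
    "gene_classification",
    "haplotype_presence",
    "other_chromosomes",
    "repeat_coverage",
    "gene_coverage",
}


def parse_rules(text):
    """Split rules text into bar types, selected sequences and unknown lines."""
    rules = {"bar_types": [], "sequences": [], "unrecognised": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if parts[0] == "sequence" and len(parts) == 2:
            # Comma-separated identifiers, kept in the order they are written.
            rules["sequences"].extend(
                item.strip() for item in parts[1].split(",") if item.strip()
            )
        elif line in KNOWN_BAR_TYPES:
            rules["bar_types"].append(line)
        else:
            rules["unrecognised"].append(line)
    return rules


def duplicated(items):
    """Return the items that occur more than once, in first-repeat order."""
    seen, repeats = set(), []
    for item in items:
        if item in seen and item not in repeats:
            repeats.append(item)
        seen.add(item)
    return repeats


rules = parse_rules("sequence 1_2,1_3,1_1,1_2\nhaplotype_presence\n")
print(rules["bar_types"])              # ['haplotype_presence']
print(rules["sequences"])              # ['1_2', '1_3', '1_1', '1_2']
print(duplicated(rules["sequences"]))  # ['1_2']
```

Applied to the last example above, the sketch reports one bar type and four identifiers, of which 1_2 occurs twice; whether a repeated identifier is intended is not stated on this page.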

Output

Output files are written to the sequence_visualization directory in the database.

  • plot_sequences.R, Rscript to visualize the annotation bars. This script is created when a sequence selection was made using the sequence rule.

  • run_visualization_scripts.sh, shell script to execute all Rscripts. This script is only generated when there is phasing information (add_phasing) and no sequence selection was made.

../_images/seq_vis_two_haplotypes.png

Fig. 19 Sequence plot for two sequences with all possible annotation bars.


../_images/seq_vis_four_haplotypes.png

Fig. 20 Sequence plot of four sequences with haplotype copies bars.


../_images/seq_vis_genome_CAU.png

Fig. 21 Genome plot with core, accessory and unique bars. Diploid apple genome.


../_images/seq_vis_genome_haplotype.png

Fig. 22 Genome plot with haplotype copies bars. Tetraploid potato genome.