Gene retention
^^^^^^^^^^^^^^

.. warning::
   This is a novel function and has not yet undergone testing by external
   users. Please report any bugs or issues to the PanTools team so we can
   improve it.

Visualize gene retention of sequences in reference to a target sequence. The
retention can be based on **homology** or **synteny**. For the homology-based
method, protein sequences must first be clustered with ``group``. For the
synteny-based method, synteny information must be added to the pangenome with
:ref:`add_synteny ` to allow visualization based on the retention of syntenic
genes.

| **Method**
| - Go over the mRNA nodes of a selected reference sequence using a sliding window (set by ``--window-length``) in steps of 10 gene positions. Only the longest transcript of each gene is used.
| - At each window, the percentage of retention between the reference and query sequence is calculated as follows: (shared homologs or syntenic genes / window length) \* 100. The percentage of retention cannot exceed 100 because gene duplications are ignored.
| - A gene is considered syntenic when it forms a syntenic pair in a synteny block. For homologs, only the presence of a gene is checked; its location does not matter.
| - Only full-length windows are visualized. The sliding stops when the window can no longer move 10 mRNAs.
| - These steps are repeated for every included reference sequence (determined by ``--selection-file``).

| **Coloring**
| There are two options for coloring the line graphs.
| - With ``--coloring=phasing``, sequences belonging to the same chromosome get the same color and the phasing (letter) determines the shade. Colors are shared between the different figures of a combination plot for multiple genomes (see example below). This coloring currently works for up to 6 chromosomes (green, red, blue, purple, orange, yellow).
| - ``--coloring=distinct`` uses up to 21 distinct colors to color the line graphs. Colors are not shared between figures of the combination plots with multiple genomes.

**Parameters**

.. list-table::
   :widths: 30 70

   * -
     - Path to the database root directory.

**Options**

.. list-table::
   :widths: 30 70

   * - ``--window-length``
     - Set the sliding window length. Default is 100.
   * - ``--sequences-plot``
     - Set the maximum number of sequences per (combination) plot. Default
       is 20.
   * - ``--sequences-genome``
     - Set the maximum number of sequences per genome plot. Default is 20.
   * - ``--selection-file``
     - Text file with sequences (identifiers) that should be used as
       reference. Default is all sequences.
   * - ``--include``/``-i``
     - Only include a selection of genomes.
   * - ``--exclude``/``-e``
     - Exclude a selection of genomes.
   * - ``--selection-file``
     - Text file with rules to use a specific set of genomes and sequences.
   * - ``--coloring``
     - For coloring the line graphs ("phasing" or "distinct", see above).
       Reduces the maximum number of sequences per plot to 21.

**Example commands**

.. code:: bash

   $ pantools gene_retention apple_DB --coloring=phasing
   $ pantools gene_retention apple_DB --coloring=phasing --window-length 50
   $ pantools gene_retention apple_DB --coloring=distinct
   $ pantools gene_retention apple_DB --coloring=distinct --selection-file ref_sequences.txt

**Example input**

The ``--selection-file`` should hold one sequence identifier per line. In the
following example, four sequences are selected to be used as reference. All
other sequences are still considered as query sequences. Use
``--selection-file``, ``--include`` or ``--exclude`` to adjust the query
sequences.

.. code:: text

   1_1
   1_2
   2_1
   3_1

**Output**

Output files are written to the **retention** directory in the database. Up to
two Rscripts (homologs and syntelogs) are created for each selected reference
sequence and placed in a subfolder named after the reference sequence
identifier.

- **retention_rscripts.sh**, a shell script to execute all Rscripts.

**Example output**

.. _gene retention:

.. figure:: ../figures/apple_retention_chr10.png
   :width: 600
   :align: center

   *Gene retention of Malus domestica cv Gala chr 10 to four other apple
   genomes. Genomes 1-4 are chromosome-level assemblies, 1-3 are fully
   haplotype-phased. Genome 4 lacks the red line because this was the
   selected reference sequence.*
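
The sliding-window calculation described under **Method** can be illustrated
with a short Python sketch. This is not the PanTools implementation: the
function ``retention_profile`` and its arguments ``reference_genes`` and
``retained`` are hypothetical names chosen for this example, and the stopping
rule is an assumption based on the statement that only full-length windows
are visualized. Each reference gene is counted at most once per window, which
is why the percentage cannot exceed 100.

.. code:: python

   def retention_profile(reference_genes, retained, window_length=100, step=10):
       """Return (window start, retention percentage) for each full window.

       reference_genes: ordered gene identifiers along the reference
                        sequence (hypothetical input, one entry per gene).
       retained: set of reference genes that have a homolog or syntenic
                 partner on the query sequence (hypothetical input).
       """
       profile = []
       start = 0
       # Assumption: stop as soon as a full-length window no longer fits.
       while start + window_length <= len(reference_genes):
           window = reference_genes[start:start + window_length]
           # Count each reference gene once, so duplications are ignored.
           shared = sum(1 for gene in window if gene in retained)
           profile.append((start, 100.0 * shared / window_length))
           start += step
       return profile


   if __name__ == "__main__":
       genes = [f"g{i}" for i in range(30)]
       kept = {f"g{i}" for i in range(15)}  # first 15 genes are retained
       print(retention_profile(genes, kept, window_length=20, step=10))
       # prints [(0, 75.0), (10, 25.0)]

In this toy example the first window (genes 0-19) contains 15 retained genes
and the second window (genes 10-29) contains 5, giving 75% and 25% retention.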