Genome and sequence selection
=============================

Some PanTools functions offer the ``--selection-file`` option, which provides
finer control over sequence selection. This option is not needed when all
genomes and sequences are included, or when a simple genome selection with
``--include`` / ``--exclude`` is sufficient.

It is possible to compare individual sequences in addition to genomes. If the
genomes in the pangenome are highly fragmented, we suggest making a
sub-selection of sequences for the analysis. High fragmentation greatly
increases runtimes, even with only a few genomes. For example, estimating the
k-mer distance between the sequences of a genome with 10,000 sequences and
those of a genome with a similar number of sequences requires billions of
pairwise comparisons.

In functions where it is available, pass ``--selection-file`` together with a
text file (see the examples below) that sets one or more rules to define the
desired genome and sequence selection.

Explanation of rules
--------------------

| **SELECT_GENOME**
| Only include this selection of genomes.

| **SKIP_GENOME**
| Exclude this selection of genomes.

| **SELECT_SEQUENCE**
| Only include this selection of sequences. The number of genomes in the
  analysis is adjusted automatically.

| **SKIP_SEQUENCE**
| Exclude this selection of sequences. The number of genomes in the
  analysis is adjusted automatically.

| **GENOME_MINIMUM_NUMBER_GENES**
| The number of genes for a genome must be equal to or higher than the given
  number.

| **SEQUENCE_MINIMUM_NUMBER_GENES**
| The number of genes for a sequence must be equal to or higher than the given
  number.

| **SEQUENCE_WITH_SYNTENY**
| A sequence must share at least one syntenic block with another sequence.

| **GENOME_WITH_SYNTENY**
| At least one sequence of a genome must share a syntenic block with another
  sequence.

| **SEQUENCE_WITH_PHASING**
| Only include sequences where phasing information was added.

| **SELECT_CHROMOSOME**
| Only include sequences with a specific chromosome number.

| **GENOME_WITH_PHENOTYPE**
| A genome must have a specific phenotype.

| **SELECT_ANNOTATION**
| Use a specific genome annotation. By default, the most recent annotation is
  selected.

**Example input files**

**1**. Every genome and sequence of the pangenome is included as no rules are
set.

.. code:: text

   SELECT_GENOME =
   SKIP_GENOME =
   SELECT_SEQUENCES =
   SKIP_SEQUENCES =
   SEQUENCE_MINIMUM_NUMBER_GENES =
   GENOME_WITH_SYNTENY =
   SEQUENCE_WITH_SYNTENY =
   FIRST_SEQUENCES =
   SELECT_SEQUENCES =
   GENOME_WITH_PHENOTYPE =
   SEQUENCE_WITH_PHASING =

**2**. Select genomes 1, 2, and 3 and only the sequences with at least 100
genes.

.. code:: text

   SELECT_GENOME = 1,2,3
   SEQUENCE_MINIMUM_NUMBER_GENES = 100

**3**. Exclude genome 2 and sequences without synteny information.

.. code:: text

   SKIP_GENOME = 2
   SEQUENCES_WITH_SYNTENY = true

**4**. Exclude genomes without the 'resistance' phenotype and exclude sequences
without phasing information.

.. code:: text

   GENOME_WITH_PHENOTYPE = resistance
   SEQUENCES_WITH_PHASING = true

**5**. Select genomes 2 and 3 but only include the first 100 sequences per
genome.

.. code:: text

   SELECT_GENOME = 2,3
   FIRST_SEQUENCES = 100

**6**. Only include the 50 largest sequences of every genome.

.. code:: text

   LARGEST_SEQUENCES = 50
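To check which rules a selection file actually sets before starting a long
run, the file can be read with a few lines of Python. The sketch below is not
part of PanTools and does not reproduce its parser. It assumes one
``RULE = value`` pair per line and treats an empty value as "rule not set".
The helper names ``parse_selection_file`` and ``genome_list`` are hypothetical.

.. code:: python

   from typing import Dict, List


   def parse_selection_file(text: str) -> Dict[str, str]:
       """Collect 'RULE = value' pairs, skipping rules left empty.

       Assumption: one rule per line; this is an illustration only,
       not the parser PanTools uses internally.
       """
       rules: Dict[str, str] = {}
       for raw_line in text.splitlines():
           line = raw_line.strip()
           if not line or "=" not in line:
               continue  # ignore blank lines and lines without '='
           key, _, value = line.partition("=")
           key, value = key.strip(), value.strip()
           if value:  # an empty value is treated as "rule not set"
               rules[key] = value
       return rules


   def genome_list(value: str) -> List[int]:
       """Split a comma-separated genome list such as '1,2,3' into integers."""
       return [int(item) for item in value.split(",") if item.strip()]


   example = """\
   SELECT_GENOME = 1,2,3
   SKIP_GENOME =
   SEQUENCE_MINIMUM_NUMBER_GENES = 100
   """

   rules = parse_selection_file(example)
   print(rules)
   print(genome_list(rules["SELECT_GENOME"]))

Printing the parsed rules makes it easy to spot a misspelled rule name or a
rule that was left empty by accident.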