Genome and sequence selection
=============================

Some PanTools functions provide the ``--selection-file`` option, which
gives finer control over sequence selection.
This option is not needed when all genomes and sequences are included,
or when a simple selection of genomes (with ``--include`` / ``--exclude``)
is sufficient.

It is possible to compare individual sequences in addition
to genomes. If the genomes in the pangenome are highly fragmented, we suggest
making a sub-selection of sequences for the analysis. High fragmentation
greatly increases runtimes, even when only a few genomes are used.

For example, estimating the k-mer distance between the sequences of two
genomes with 10,000 sequences each requires 100 million pairwise comparisons,
and the total grows to billions once several such genomes are compared.
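
The growth in comparisons can be checked with a few lines of arithmetic. The
sketch below is an illustration only: it assumes every genome has the same
number of sequences and that every pair of genomes is compared all-vs-all,
without comparisons inside a genome. The function name ``sequence_pairs`` is
hypothetical and not part of PanTools.

```python
from math import comb


def sequence_pairs(seqs_per_genome: int, n_genomes: int) -> int:
    """Count cross-genome sequence pairs.

    Assumption: each of the comb(n_genomes, 2) genome pairs is compared
    all-vs-all, and every genome holds `seqs_per_genome` sequences.
    """
    return comb(n_genomes, 2) * seqs_per_genome ** 2


# Two fragmented genomes already give 100 million pairs;
# five such genomes reach one billion.
for n_genomes in (2, 5, 10):
    print(n_genomes, sequence_pairs(10_000, n_genomes))
```

Reducing the number of sequences per genome has a quadratic effect on this
count, which is why a sequence sub-selection helps so much.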

In functions where it is available, pass ``--selection-file`` together with a
text file (see the examples below) that sets one or more rules to create the
desired genome and sequence selection.

Explanation of rules
--------------------

| **SELECT_GENOME**
|     Only include this selection of genomes.

| **SKIP_GENOME**
|     Exclude this selection of genomes.

| **SELECT_SEQUENCE**
|     Only include this selection of sequences. The number of genomes in the
      analysis is adjusted automatically.

| **SKIP_SEQUENCE**
|     Exclude this selection of sequences. The number of genomes in the
      analysis is adjusted automatically.

| **GENOME_MINIMUM_NUMBER_GENES**
|     The number of genes for a genome must be equal to or higher than the
      given number.

| **SEQUENCE_MINIMUM_NUMBER_GENES**
|     The number of genes for a sequence must be equal to or higher than the
      given number.

| **SEQUENCE_WITH_SYNTENY**
|     A sequence should have at least one shared syntenic block with another
      sequence.

| **GENOME_WITH_SYNTENY**
|     At least one sequence of a genome should share a syntenic block with
      another sequence.

| **SEQUENCE_WITH_PHASING**
|     Only include sequences where phasing information was added.

| **SELECT_CHROMOSOME**
|     Only include sequences with a specific chromosome number.

| **GENOME_WITH_PHENOTYPE**
|     A genome must have a specific phenotype.

| **SELECT_ANNOTATION**
|     Use a specific genome annotation. By default, the most recent annotation
      is selected.
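
To make the effect of combining rules concrete, here is a minimal sketch on
toy data. It is not PanTools code: the ``Sequence`` class and ``apply_rules``
function are hypothetical, and the sketch assumes that all rules must hold at
the same time and that a genome drops out of the analysis once none of its
sequences remain. Only three of the rules above are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sequence:
    genome: int   # genome number in the pangenome
    number: int   # sequence number within that genome
    genes: int    # number of genes annotated on the sequence


def apply_rules(sequences, select_genome=None, skip_genome=(),
                sequence_minimum_number_genes=0):
    """Keep only the sequences that satisfy every given rule (assumed AND)."""
    kept = []
    for seq in sequences:
        if select_genome is not None and seq.genome not in select_genome:
            continue  # SELECT_GENOME: genome is not in the selection
        if seq.genome in skip_genome:
            continue  # SKIP_GENOME: genome is excluded
        if seq.genes < sequence_minimum_number_genes:
            continue  # SEQUENCE_MINIMUM_NUMBER_GENES: too few genes
        kept.append(seq)
    return kept


toy = [
    Sequence(1, 1, 250), Sequence(1, 2, 40),
    Sequence(2, 1, 100), Sequence(3, 1, 12), Sequence(4, 1, 900),
]
kept = apply_rules(toy, select_genome={1, 2, 3},
                   sequence_minimum_number_genes=100)
genomes_left = sorted({seq.genome for seq in kept})

print([(seq.genome, seq.number) for seq in kept])
print(genomes_left)
```

In this toy case genome 4 is removed by the genome selection, while genome 3
disappears only because its single sequence has fewer than 100 genes.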

**Example input files**
  **1**. Every genome and sequence of the pangenome is included because no
  rules are set.

  .. code:: text

      SELECT_GENOME =
      SKIP_GENOME =
      SELECT_SEQUENCES =
      SKIP_SEQUENCES =
      SEQUENCE_MINIMUM_NUMBER_GENES =
      GENOME_WITH_SYNTENY =
      SEQUENCE_WITH_SYNTENY =
      FIRST_SEQUENCES =
      LARGEST_SEQUENCES =
      GENOME_WITH_PHENOTYPE =
      SEQUENCE_WITH_PHASING =

  **2**. Select genomes 1, 2, and 3 and only the sequences with at least 100
  genes.

  .. code:: text

      SELECT_GENOME = 1,2,3
      SEQUENCE_MINIMUM_NUMBER_GENES = 100

  **3**. Exclude genome 2 and sequences without synteny information.

  .. code:: text

      SKIP_GENOME = 2
      SEQUENCE_WITH_SYNTENY = true

  **4**. Exclude genomes without the 'resistance' phenotype and
  sequences without phasing information.

  .. code:: text

      GENOME_WITH_PHENOTYPE = resistance
      SEQUENCE_WITH_PHASING = true

  **5**. Select genomes 2 and 3 but only include the first 100 sequences per
  genome.

  .. code:: text

      SELECT_GENOME = 2,3
      FIRST_SEQUENCES = 100

  **6**. Only include the 50 largest sequences of every genome.

  .. code:: text

      LARGEST_SEQUENCES = 50
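
When selection files are produced by a script, for instance one per
experiment, the ``KEY = value`` layout shown in the examples is easy to
generate. The helper below is a minimal sketch under that assumption; the
function name ``render_rules`` and the file name ``selection.txt`` are
hypothetical, and the rule names are copied from example 5.

```python
import tempfile
from pathlib import Path


def render_rules(rules):
    """Return the rules as 'KEY = value' lines, one rule per line.

    Lists and tuples are written as comma-separated values, as in
    'SELECT_GENOME = 2,3'; every other value is written with str().
    """
    lines = []
    for key, value in rules.items():
        if isinstance(value, (list, tuple)):
            text = ",".join(str(item) for item in value)
        else:
            text = str(value)
        lines.append(f"{key} = {text}")
    return "\n".join(lines) + "\n"


# Example 5: genomes 2 and 3, first 100 sequences per genome.
content = render_rules({"SELECT_GENOME": [2, 3], "FIRST_SEQUENCES": 100})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "selection.txt"
    path.write_text(content)
    print(path.read_text(), end="")
```

The written file can then be supplied through ``--selection-file`` to any
function that supports the option.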