Genome and sequence selection

Some PanTools functions provide the --selection-file option, which gives finer control over sequence selection. This option is not needed when all genomes and sequences are included, or when a simple selection of genomes (with --include / --exclude) is sufficient.

It is possible to compare individual sequences in addition to genomes. If the genomes in the pangenome are highly fragmented, we suggest making a sub-selection of sequences for the analysis. High fragmentation greatly increases runtimes, even when only a few genomes are used.

For example, estimating the k-mer distance between the sequences of a genome with 10,000 sequences and those of a genome with a similar number of sequences requires billions of pairwise comparisons.

In functions where it is available, pass --selection-file together with a text file (see the examples below) to set multiple rules that together define the desired genome and sequence selection.

Explanation of rules

SELECT_GENOME
Only include this selection of genomes.
SKIP_GENOME
Exclude this selection of genomes.
SELECT_SEQUENCE
Only include this selection of sequences. The number of genomes in the analysis is adjusted automatically.
SKIP_SEQUENCE
Exclude this selection of sequences. The number of genomes in the analysis is adjusted automatically.
GENOME_MINIMUM_NUMBER_GENES
The number of genes in a genome must be equal to or higher than the given number.
SEQUENCE_MINIMUM_NUMBER_GENES
The number of genes on a sequence must be equal to or higher than the given number.
SEQUENCE_WITH_SYNTENY
A sequence should share at least one syntenic block with another sequence.
GENOME_WITH_SYNTENY
At least one sequence of a genome should share a syntenic block with another sequence.
SEQUENCE_WITH_PHASING
Only include sequences where phasing information was added.
SELECT_CHROMOSOME
Only include sequences with a specific chromosome number.
GENOME_WITH_PHENOTYPE
A genome must have a specific phenotype.
SELECT_ANNOTATION
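Use a specific genome annotation. By default, the most recent annotation is selected.
FIRST_SEQUENCES
Only include the given number of first sequences per genome.
LARGEST_SEQUENCES
Only include the given number of largest sequences of every genome.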
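
Before running an analysis it can help to check which rules a selection file actually sets. The following is a minimal Python sketch, not part of PanTools: the function name parse_selection_rules is hypothetical, and it assumes only what the examples on this page show, namely one "RULE = value" pair per line, with an empty value meaning the rule is not set. PanTools' own parser may behave differently.

```python
def parse_selection_rules(text):
    """Parse 'RULE = value' lines into a dict, skipping rules left empty."""
    rules = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no rule
        if "=" not in line:
            raise ValueError(
                f"line {line_number}: expected 'RULE = value', got {line!r}"
            )
        name, value = (part.strip() for part in line.split("=", 1))
        if not value:
            continue  # assumption: an empty value means the rule is not set
        if name in rules:
            raise ValueError(
                f"line {line_number}: rule {name} is set more than once"
            )
        rules[name] = value
    return rules


example = """\
SELECT_GENOME = 1,2,3
SKIP_GENOME =
SEQUENCE_MINIMUM_NUMBER_GENES = 100
"""
rules = parse_selection_rules(example)
print(rules)
```

Rejecting a rule that is set twice is a design choice of this sketch; it surfaces copy-paste mistakes early instead of silently keeping one of the two values.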

Example input files

1. Every genome and sequence of the pangenome is included as no rules are set.

SELECT_GENOME =
SKIP_GENOME =
SELECT_SEQUENCES =
SKIP_SEQUENCES =
SEQUENCE_MINIMUM_NUMBER_GENES =
GENOME_WITH_SYNTENY =
SEQUENCE_WITH_SYNTENY =
FIRST_SEQUENCES =
LARGEST_SEQUENCES =
GENOME_WITH_PHENOTYPE =
SEQUENCE_WITH_PHASING =

2. Select genomes 1, 2, and 3 and only the sequences with at least 100 genes.

SELECT_GENOME = 1,2,3
SEQUENCE_MINIMUM_NUMBER_GENES = 100

3. Exclude genome 2 and sequences without synteny information.

SKIP_GENOME = 2
SEQUENCE_WITH_SYNTENY = true

4. Exclude genomes without the ‘resistance’ phenotype and sequences without phasing information.

GENOME_WITH_PHENOTYPE = resistance
SEQUENCE_WITH_PHASING = true

5. Select genomes 2 and 3, but only include the first 100 sequences per genome.

SELECT_GENOME = 2,3
FIRST_SEQUENCES = 100

6. Only include the 50 largest sequences of every genome.

LARGEST_SEQUENCES = 50
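
To make the effect of combined rules concrete, the sketch below applies the rules from examples 2 and 6 to a made-up pangenome in Python. It is an illustration only and not how PanTools is implemented: the Sequence class, the apply_rules function, the toy lengths and gene counts, and the order in which the filters are applied are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sequence:
    genome: int      # genome number in the pangenome
    number: int      # sequence number within that genome
    length: int      # sequence length in bp (toy value)
    gene_count: int  # number of annotated genes (toy value)


def apply_rules(sequences, select_genome=None, skip_genome=(),
                min_genes=0, largest=None):
    """Return the sequences that survive the given (toy) rules."""
    kept = [
        seq for seq in sequences
        if (select_genome is None or seq.genome in select_genome)
        and seq.genome not in skip_genome
        and seq.gene_count >= min_genes  # "equal to or higher than"
    ]
    if largest is not None:
        # Assumed order: the size ranking happens after the other filters.
        per_genome = {}
        for seq in kept:
            per_genome.setdefault(seq.genome, []).append(seq)
        kept = [
            seq
            for group in per_genome.values()
            for seq in sorted(group, key=lambda s: s.length, reverse=True)[:largest]
        ]
    return kept


toy_pangenome = [
    Sequence(genome=1, number=1, length=5_000_000, gene_count=450),
    Sequence(genome=1, number=2, length=80_000, gene_count=12),
    Sequence(genome=2, number=1, length=4_800_000, gene_count=430),
    Sequence(genome=2, number=2, length=60_000, gene_count=100),
    Sequence(genome=3, number=1, length=30_000, gene_count=5),
    Sequence(genome=4, number=1, length=5_100_000, gene_count=470),
]

# Example 2: SELECT_GENOME = 1,2,3 and SEQUENCE_MINIMUM_NUMBER_GENES = 100
example_2 = apply_rules(toy_pangenome, select_genome={1, 2, 3}, min_genes=100)
print([(s.genome, s.number) for s in example_2])
print(sorted({s.genome for s in example_2}))

# Example 6, using 1 instead of 50 to suit the toy data: LARGEST_SEQUENCES = 1
example_6 = apply_rules(toy_pangenome, largest=1)
print([(s.genome, s.number) for s in example_6])
```

In this toy model, genome 3 is selected by number but has no sequence with at least 100 genes, so no sequence of it remains after example 2. The sequence with exactly 100 genes is kept, because the threshold is inclusive.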