Genome and sequence selection
Some PanTools functions provide the --selection-file option, which gives finer control over sequence selection.
This option is not needed when all genomes and sequences are included, or when a simple selection of genomes (with --include / --exclude) is sufficient.
It is possible to compare individual sequences in addition to genomes. If the genomes in the pangenome are highly fragmented, we suggest making a sub-selection of sequences for the analysis. High fragmentation greatly increases runtimes, even when only a few genomes are used.
For example, estimating the k-mer distance between the sequences of a genome with 10,000 sequences and those of a genome with a similar number of sequences requires billions of pairwise comparisons.
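To get a feel for how quickly the work grows with fragmentation, the sketch below counts sequence pairs under a simple assumption: every sequence of one genome is paired once with every sequence of another genome. The function names are hypothetical, and how PanTools counts individual comparisons internally is not specified here, so the numbers are only an order-of-magnitude illustration.

```python
def cross_genome_pairs(n_seqs_a: int, n_seqs_b: int) -> int:
    """Sequence pairs when each sequence of genome A meets each sequence of genome B."""
    return n_seqs_a * n_seqs_b


def all_cross_genome_pairs(seqs_per_genome: list[int]) -> int:
    """Sum of cross-genome sequence pairs over every unordered pair of genomes."""
    total = 0
    for i in range(len(seqs_per_genome)):
        for j in range(i + 1, len(seqs_per_genome)):
            total += cross_genome_pairs(seqs_per_genome[i], seqs_per_genome[j])
    return total


# Two fragmented genomes of 10,000 sequences each.
print(f"{cross_genome_pairs(10_000, 10_000):,}")      # 100,000,000

# Five such genomes, compared genome against genome.
print(f"{all_cross_genome_pairs([10_000] * 5):,}")    # 1,000,000,000
```

Under this count, halving the number of sequences per genome cuts the number of pairs to a quarter, which is why a sub-selection of sequences can reduce runtime so strongly.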
In functions where it is available, you can pass --selection-file together with a text file (see the example input files below) that sets multiple rules to create the desired genome and sequence selection.
Explanation of rules
Example input files
1. Every genome and sequence of the pangenome is included as no rules are set.
    SELECT_GENOME =
    SKIP_GENOME =
    SELECT_SEQUENCES =
    SKIP_SEQUENCES =
    SEQUENCE_MINIMUM_NUMBER_GENES =
    GENOME_WITH_SYNTENY =
    SEQUENCE_WITH_SYNTENY =
    FIRST_SEQUENCES =
    SELECT_SEQUENCES =
    GENOME_WITH_PHENOTYPE =
    SEQUENCE_WITH_PHASING =
2. Select genomes 1, 2, and 3 and only the sequences with more than 100 genes.
    SELECT_GENOME = 1,2,3
    SEQUENCE_MINIMUM_NUMBER_GENES = 100
3. Exclude genome 2 and sequences without synteny information.
    SKIP_GENOME = 2
    SEQUENCES_WITH_SYNTENY = true
4. Exclude genomes without the ‘resistance’ phenotype and sequences without phasing information.
    GENOME_WITH_PHENOTYPE = resistance
    SEQUENCES_WITH_PHASING = true
5. Select genomes 2 and 3, but only include the first 100 sequences per genome.
    SELECT_GENOME = 2,3
    FIRST_SEQUENCES = 100
6. Only include the 50 largest sequences of every genome.
LARGEST_SEQUENCES = 50
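
The example files above all follow a plain KEY = value layout, one rule per line. As an illustration only, the sketch below reads such a file into a dictionary and applies three of the rules (SELECT_GENOME, SKIP_GENOME and FIRST_SEQUENCES) to a toy pangenome. This is not PanTools' own parser: the function names, the toy sequence identifiers, and the choice to treat a key with an empty value as unset are all assumptions made for the example.

```python
def parse_selection_rules(text: str) -> dict[str, str]:
    """Parse 'KEY = value' lines; keys with an empty value are treated as unset."""
    rules: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not a rule
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value:
            rules[key] = value
    return rules


def select_genomes(all_genomes: list[int], rules: dict[str, str]) -> list[int]:
    """Apply SELECT_GENOME first, then remove anything listed in SKIP_GENOME."""
    selected = set(all_genomes)
    if "SELECT_GENOME" in rules:
        selected &= {int(g) for g in rules["SELECT_GENOME"].split(",")}
    if "SKIP_GENOME" in rules:
        selected -= {int(g) for g in rules["SKIP_GENOME"].split(",")}
    return sorted(selected)


def first_sequences(seqs_by_genome: dict[int, list[str]],
                    genomes: list[int],
                    rules: dict[str, str]) -> dict[int, list[str]]:
    """Keep only the first N sequences per selected genome if FIRST_SEQUENCES is set."""
    limit = int(rules["FIRST_SEQUENCES"]) if "FIRST_SEQUENCES" in rules else None
    return {g: seqs_by_genome[g][:limit] for g in genomes}


# Toy pangenome with hypothetical sequence identifiers.
toy_pangenome = {
    1: ["1_1", "1_2", "1_3"],
    2: ["2_1", "2_2", "2_3"],
    3: ["3_1", "3_2", "3_3"],
}

# Same shape as example 5, with a smaller limit to suit the toy data.
rule_text = """
SELECT_GENOME = 2,3
SKIP_GENOME =
FIRST_SEQUENCES = 2
"""

rules = parse_selection_rules(rule_text)
genomes = select_genomes(list(toy_pangenome), rules)
print(first_sequences(toy_pangenome, genomes, rules))
# {2: ['2_1', '2_2'], 3: ['3_1', '3_2']}
```

A file in which every key is left empty, as in example 1, parses to an empty dictionary here, so every genome and sequence is kept.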