Build a pangenome out of a set of genomes. The construction consists of two
steps: laying out the structure of the De Bruijn graph, and adding localization
information to the graph.
Optimized localization
The localization step of build_pangenome has been parallelized to increase
performance. The level of parallelism is controlled by the --threads
option (see below). Sequence nodes are localized in parallel, and updates to
the localization database are cached to disk.
Localization updates are then sorted into a number of different files, called
buckets, whose contents are written to Neo4j by a number of database writer
threads in parallel (see the --num-db-writer-threads option below).
Because each database writer thread reads the contents of only a single
bucket into memory at a time, memory usage is reduced.
To cache localization updates on disk, PanTools needs a scratch directory for
temporary storage. This directory will be created by PanTools
automatically, or can be set to a specific directory using the
--scratch-directory option.
Lastly, an in-memory cache has been introduced to store frequently accessed
properties of nucleotide (sequence) nodes. The cache will automatically retain
the most frequently used properties and evict the least frequently used items.
This significantly increases performance by reducing Neo4j IO. The size of
the cache can be controlled with the --cache-size option. To calculate
the heap space the cache will occupy, multiply the maximum size of the
cache by 128 bytes, e.g. for the default cache size of 10,000,000 PanTools
will need an additional 10,000,000 * 128 B = 1.28 GB of heap space.
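The heap estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the text; the function names are illustrative and not part of PanTools.

```python
BYTES_PER_CACHE_ITEM = 128  # per-item heap cost stated in the text above


def cache_heap_bytes(cache_size: int) -> int:
    """Approximate heap space (in bytes) needed for a given --cache-size."""
    if cache_size < 0:
        raise ValueError("cache size must be non-negative")
    return cache_size * BYTES_PER_CACHE_ITEM


def to_gigabytes(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


default_heap = cache_heap_bytes(10_000_000)
print(f"{to_gigabytes(default_heap):.2f} GB")  # 1.28 GB
```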
A text file containing paths to FASTA files of genomes to be added
to the pangenome; each on a separate line.
Options
--kmer-size
Size of k-mers. Should be in range [6..255]. If this argument is not
given, the optimal k-mer size is calculated automatically.
--threads/-t
Number of parallel working threads, default is the number of available
cores or 8, whichever is lower.
--scratch-directory
Temporary directory for storing localization update files. If not set,
a temporary directory will be created inside the default temporary-file
directory. On most Linux distributions this default temporary-file
directory will be /tmp/, on macOS typically /var/folders/.
If a scratch directory is set, it will be created if it does not exist.
If it does exist, PanTools will verify the directory is empty and, if
not, raise an exception.
--num-buckets
Number of buckets for sorting, default is 200. During the localization
phase, updates are cached to disk and sorted into a number of files
called buckets. This reduces the memory needed for storing
localization updates: instead of keeping them all in memory, buckets
are read with a given level of parallelism (see the
--num-db-writer-threads option), and Neo4j is updated with each
bucket’s contents.
The more buckets are available, the lower the memory usage. However,
please make sure PanTools can keep a file open for each bucket
during the localization by setting the file descriptor limit to an
appropriate value. For the default of 200 buckets, we advise
setting the limit to 1024, like so: ulimit -n 1024. For a larger
number of buckets, set the limit to around 1,000 plus the number of
buckets.
--transaction-size
Number of localization updates to pack into a single Neo4j transaction,
default is 10,000. To increase throughput to Neo4j, multiple localization
updates are packed into a single transaction. The greater the number of
updates per transaction, the higher the throughput (up to a point), but
also the higher the memory usage.
In our experiments we have found 10,000 to provide a good balance
between memory usage and performance.
--num-db-writer-threads
Number of threads to use for writing to Neo4j, default is 2. After
sorting localization updates into buckets (see the --num-buckets
option), buckets are read in parallel by the specified number of
Neo4j database writer threads. With the default of two threads, the
contents of two buckets will be kept in memory at the same time, and
written to Neo4j with a given transaction size (see the
--transaction-size option).
In our experiments on SSD and network-backed storage we saw little
additional increase in performance by using more than two threads.
--cache-size
Maximum number of items in the node properties cache, default is
10,000,000. During localization, several properties of nucleotide
(sequence) nodes are accessed frequently. To prevent loading these from
Neo4j every time, the specified number of most frequently used items are
cached. The cache can be disabled entirely by setting the cache size to zero.
--keep-intermediate-files
Do not delete intermediate localization files after the command
finishes. Disabled by default, i.e., files are deleted automatically
after the command finishes.
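The file descriptor advice given for --num-buckets can be written as a small helper. This is a sketch under assumptions: the function names are hypothetical, returning 1024 for fewer than 200 buckets is our own choice (the text only covers the default and larger values), and the `resource` module is only available on Unix-like systems.

```python
import resource  # Unix-only standard library module


def advised_fd_limit(num_buckets: int) -> int:
    """File descriptor limit following the advice for --num-buckets."""
    if num_buckets <= 200:
        return 1024            # advised value for the default of 200 buckets
    return 1000 + num_buckets  # "around 1,000 plus the number of buckets"


def current_limit_is_sufficient(num_buckets: int) -> bool:
    """Compare the soft limit of this process with the advised value."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= advised_fd_limit(num_buckets)


print(advised_fd_limit(200))   # 1024
print(advised_fd_limit(1000))  # 2000
```

If the check fails, the limit can be raised in the shell with ulimit -n before starting PanTools.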
Construct or expand the annotation layer of an existing pangenome. The
layer consists of genomic features like genes, mRNAs, proteins, tRNAs
etc. PanTools is only able to read General Feature Format (GFF)
files.
Multiple annotations can be assigned to a single genome; however, only
one annotation at a time can be included in an analysis. The most recently
included annotation of a genome is used by default, unless a
different annotation is specified via --annotations-file. This annotation
file contains only annotation identifiers, each on a separate line. The most
recent annotation is used for genomes where no annotation number is specified
in the file. Below is an example where the third annotation of genome 1 is
selected and the second annotation of genomes 2 and 3.
1_3
2_2
3_2
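To make the identifier format concrete, the following sketch splits each identifier into a genome number and an annotation number. It assumes the `<genome>_<annotation>` layout implied by the example above; the function name is illustrative and not part of PanTools.

```python
from typing import Dict, Iterable


def parse_annotation_ids(lines: Iterable[str]) -> Dict[int, int]:
    """Map each genome number to the annotation number selected for it."""
    selected: Dict[int, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        genome, annotation = line.split("_")
        selected[int(genome)] = int(annotation)
    return selected


print(parse_annotation_ids(["1_3", "2_2", "3_2"]))  # {1: 3, 2: 2, 3: 2}
```

Genomes that do not appear in the resulting mapping would fall back to their most recent annotation, as described above.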
Note on GFF files
GFF files are notoriously difficult to parse. PanTools uses
htsjdk, a Java library, to parse GFF files. Because the parsed
annotation must then be stored in the graph database, features may
not always be added correctly. This is especially true for non-standard GFF
files and annotated organellar genomes. If you encounter problems with
a GFF file, please check whether it is valid according to the
GFF3 specification.
Our code should be able to handle all valid GFF3 files, with one
exception: if the GFF3 file contains a trans-spliced gene that has
alternative splicing, PanTools will only annotate one mRNA for that
gene.
Parameters
<databaseDirectory>
Path to the database root directory.
<annotationsFile>
A text file with on each line a genome number and the full path
to the corresponding annotation file, separated by a space.
Options
--connect
Connect the annotated genomic features to nucleotide nodes in the
DBG.
--ignore-invalid-features
Ignore GFF3 features that do not match the FASTA.
--assume-one-mrna-per-cds
Only relevant for features in GFF files that lack an mRNA between CDS
and gene. By default, PanTools will assume that all CDS features belong
to the same mRNA. If this option is set, PanTools will assume that each
CDS feature belongs to a separate mRNA. For most GFF files this option
should not be set.
The annotated features are incorporated in the graph. Output files are
written to the database directory.
annotation_overview.txt, a summary of the GFF files incorporated
in the pangenome
annotation.log, a list of misannotated feature identifiers.
Example input file
Each line of the file starts with the genome number followed by the full
path to the annotation file. The genome numbers match the line number of
the file that you used to construct the pangenome.
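The layout of this input file (a genome number, a space, then a full path) can be checked before running the command. The sketch below is not part of PanTools: the function name and the sample paths are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_locations(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (genome number, path) pairs from '<genome number> <path>' lines."""
    entries: List[Tuple[int, str]] = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2 or not parts[0].isdigit():
            raise ValueError(
                f"line {line_number}: expected '<genome number> <path>'"
            )
        entries.append((int(parts[0]), parts[1]))
    return entries


# Hypothetical paths, for illustration only.
sample = [
    "1 /path/to/genome_1.gff3",
    "2 /path/to/genome_2.gff3",
]
print(parse_locations(sample))
```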
The GFF format consists of one line per feature, each containing 9
tab-separated columns of data, plus optional track definition lines.
Please use the proper hierarchy for the features:
gene -> mRNA -> CDS, where gene is the parent of mRNA
and mRNA is the parent of the CDS feature. The following example
from Saccharomyces cerevisiae YJM320 (GCA_000975885) displays a
correctly formatted gene entry:
Generate homology groups based on similarity of protein sequences. The
resulting homology groups connect similar sequences in the pangenome
database. Homology groups contain not only orthologous pairs, but also
pairs of homologs duplicated after the speciation of the two species,
so-called in-paralogs. The sizes of the groups are controlled by the
--relaxation parameter that can be set very strict or more lenient,
depending on the evolutionary distance of the genomes. When you are
unsure which relaxation setting is most suitable for your dataset,
running the optimal_grouping
functionality is recommended.
Be aware that not every sequence within a homology group has to be
similar to the other sequences. For example, two non-similar protein
sequences each have a high-similarity hit with the same protein sequence
but align to a different region, one at the start and one near the end
of the sequence.
When you want to run group another time but with different
parameters, the currently active grouping must first either be moved or
removed. This can be achieved with the
move or remove grouping
functions.
Method
Here, we explain a simplified version of the original algorithm;
please take a look at our publication for an extensive explanation.
First, potential similar sequences are identified by counting shared
k-mer (protein) sequences. Similarity between the selected protein
sequences is calculated through (local) Smith-Waterman alignments.
When the (normalized) similarity score of two sequences is above a
given threshold (controlled by --relaxation), the proteins are
connected with each other in the similarity graph. Every similarity
component is then passed to the MCL (Markov clustering) algorithm to
be possibly broken into several homology groups.
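The step that connects proteins into similarity components can be illustrated with a small union-find sketch. This is a simplified illustration, not the PanTools implementation: the scores are made up, the function name is hypothetical, and the subsequent MCL step is not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple


def similarity_components(
    proteins: Iterable[str],
    scores: Iterable[Tuple[str, str, float]],
    threshold: float,
) -> List[Set[str]]:
    """Group proteins that are linked by scores above the threshold."""
    parent: Dict[str, str] = {p: p for p in proteins}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scores:
        if score > threshold:
            parent[find(a)] = find(b)  # connect the two proteins

    components: Dict[str, Set[str]] = {}
    for p in parent:
        components.setdefault(find(p), set()).add(p)
    return list(components.values())


# Made-up normalized similarity scores between four proteins.
scores = [("A", "B", 80.0), ("B", "C", 75.0), ("A", "C", 10.0), ("C", "D", 20.0)]
components = similarity_components(["A", "B", "C", "D"], scores, threshold=50.0)
print(sorted(sorted(c) for c in components))  # [['A', 'B', 'C'], ['D']]
```

Note that A and C end up in the same component even though their own score is low, because both are similar to B. This is the situation described above in which not every pair of sequences within a group is similar.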
Relaxation
The relaxation parameter is a combination of four sub-parameters:
intersection rate, similarity threshold, MCL inflation
and contrast. The values for these parameters for each relaxation
setting can be seen in the table below. We strongly recommend using the
--relaxation option to control the grouping, but advanced users still
have the option to control the individual sub-parameters.
Number of parallel working threads, default is the number of
available cores or 8, whichever is lower.
--include/-i
Only include a selection of genomes.
--exclude/-e
Exclude a selection of genomes.
--annotations-file/-A
A text file with the identifiers of annotations to be included,
each on a separate line. The most recent annotation is selected
for genomes without an identifier.
--longest
Only cluster protein sequences of the longest transcript per gene.
--scoring-matrix
The scoring matrix used, default is BLOSUM62.
--relaxation
The relaxation in homology calls. Should be in range [1-8], from
strict to relaxed. This argument automatically sets
the four remaining arguments stated below.
--intersection-rate
The fraction of k-mers that needs to be shared by two intersecting
proteins. Should be in range [0.001,0.1].
--similarity-threshold
The minimum normalized similarity score of two proteins. Should be in
range [1..99].
pantools_homology_groups.txt, overview of the created homology
groups. Each line represents one homology group, starting with the
homology group (database) identifier followed by a colon (:) and mRNA
identifiers (from GFF) that are separated by a space. To ensure all
identifiers are unique in this file, the mRNA ids are extended by a
hash symbol (#) and a genome number. The following line is example
output of a homology group with two genes from genomes 1 and 146:
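A line in this format can be split into its parts with a few lines of Python. The sketch below is illustrative only: the sample line, its identifiers and the function name are invented for the example and are not actual PanTools output.

```python
from typing import List, Tuple


def parse_group_line(line: str) -> Tuple[str, List[Tuple[str, int]]]:
    """Split a homology group line into its group id and (mRNA id, genome) pairs."""
    group_id, _, members = line.partition(":")
    parsed: List[Tuple[str, int]] = []
    for token in members.split():
        # The genome number follows the last '#' of each identifier.
        mrna_id, _, genome = token.rpartition("#")
        parsed.append((mrna_id, int(genome)))
    return group_id.strip(), parsed


example = "12345: mRNA_a#1 mRNA_b#146"  # invented line, for illustration only
print(parse_group_line(example))  # ('12345', [('mRNA_a', 1), ('mRNA_b', 146)])
```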
Finding the most suitable settings for group
can be difficult and is always dependent on evolutionary distance of the
genomes in the pangenome. This functionality runs group on all eight
--relaxation settings, from strictest (d1) to the most relaxed (d8).
To find the optimal setting, complete and non-duplicated BUSCO genes
that are present in all genomes are used to validate each setting.
Method
A perfect clustering of the sequences would place each BUSCO in a
separate homology group with one representative protein per genome.
When BUSCO is run against the pangenome, the proteins corresponding to
the BUSCO HMMs are identified. For each BUSCO, we check whether its
representative proteins are clustered into a single group or into
multiple groups. These groups are searched to identify
sequences other than the current BUSCO. The group containing the
highest number of proteins of the current BUSCO is the best group, and
those proteins are the true positives (tp). Any other gene clustered
inside this group is considered a false positive (fp). The remaining
BUSCO genes outside this best group are counted as false negatives
(fn). The sums of the tps, fps and fns over all BUSCOs are defined as
TP, FP and FN, respectively. From these scores recall, precision and
F-score measures are calculated as follows:
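As a worked illustration, the sketch below computes the three measures from TP, FP and FN, assuming the standard definitions: recall = TP / (TP + FN), precision = TP / (TP + FP), and the F-score as the harmonic mean of the two. The counts used are made up.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of BUSCO proteins that ended up in their best group."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of proteins in the best groups that belong there."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts, for illustration only.
tp, fp, fn = 90, 10, 30
print(f"recall={recall(tp, fn):.3f}")        # recall=0.750
print(f"precision={precision(tp, fp):.3f}")  # precision=0.900
print(f"F-score={f_score(tp, fp, fn):.3f}")  # F-score=0.818
```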
Fig. 1 Proteins of three distinct homology groups are represented as
triangles, circles and squares. Green shapes are true positives (tp)
which have been assigned to the true group; red shapes are false
positives (fp) for the group they have been incorrectly assigned to, and
false negatives (fn) for their true group
Choosing the optimal setting
Choosing the correct setting is usually a trade-off between FPs and
FNs. The strictest grouping results in a significantly higher number
of clusters than the more relaxed settings. With stringent settings,
related proteins could get separated; however, a high number of false
positives is (usually) prevented (FN > FP). With a more relaxed
setting, related proteins are likely to be part of the
same group, but other sequences could be included as well (FN < FP).
Note on active grouping
No grouping is active after running this function. Use the generated
output files to identify a suitable grouping. Activate this grouping
using change_grouping. An
overview of the available groupings and used settings is stored in the
‘pangenome’ node (inside the database), or can be created by running
grouping_overview.
The output directory created by the
busco_protein function.
This directory is found inside the pangenome database, in the
busco directory.
Options
--threads/-t
Number of parallel working threads, default is the number of
available cores or 8, whichever is lower.
--include/-i
Only include a selection of genomes.
--exclude/-e
Exclude a selection of genomes.
--annotations-file/-A
A text file with the identifiers of annotations to be included,
each on a separate line. The most recent annotation is selected
for genomes without an identifier.
--fast
Assume the optimal grouping is found when the F1-score drops
compared to the previous clustering round.
--longest
Only cluster protein sequences of the longest transcript per gene.
--scoring-matrix
The scoring matrix used, default is BLOSUM62.
--relaxation
Only consider a selection of relaxation settings (1-8 allowed).
After each clustering round, homology groups are incorporated in the
graph. A text file with homology group and gene identifiers is stored in
the group directory in the pangenome database. This file is named
after the used sequence similarity threshold (25-95). Each line
represents one homology group, starting with the homology group
(database) identifier followed by a colon (:) and mRNA identifiers (from
GFF) that are separated by a space. The mRNA identifiers are extended by
a hash (#) and their genome number. The following line is example output
of a homology group with two genes from genomes 1 and 146:
Only a single homology grouping can be active in the pangenome. Use this
function to change the active grouping version. An overview of the
available groupings and used settings is stored in the ‘pangenome’ node
(inside the database), or can be created by running
grouping_overview.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
--grouping-version/-v
Required. The version of homology grouping to become active.
Build a panproteome out of a set of proteins. By only including protein
sequences, the usable functionalities are limited to a protein-based
analysis; please see differences pangenome and panproteome. No additional proteins can be added to the
panproteome; to add proteins, it needs to be rebuilt completely.
Parameters
<databaseDirectory>
Path to the database root directory.
<proteomesFile>
A text file containing paths to FASTA files of proteins to be
added to the panproteome; each on a separate line.
Include phenotype data in the pangenome, which allows the
identification of phenotype-specific genes, SNPs, functions, etc.
Altering the data is done by rerunning the command with an updated CSV
file.
Data types
Each phenotype node contains a genome number and can hold the
following data types: String, Integer, Float or
Boolean.
Values recognized as round numbers are converted to an Integer;
values with one or more decimals are converted to a Double.
Boolean types are identified by checking if the value matches
‘true’ or ‘false’, ignoring capitalization of letters.
String values remain completely unaltered except for spaces and
quote characters. Spaces are changed into an underscore (’_’)
character and quotes are completely removed.
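The conversion rules above can be summarized in a short sketch. This is an approximation written for illustration, not the PanTools code: the function name is hypothetical, and the exact number patterns (for example, whether a leading sign is accepted) and the set of quote characters removed are assumptions.

```python
import re
from typing import Union

_INTEGER = re.compile(r"[+-]?\d+")       # round numbers (sign handling assumed)
_DECIMAL = re.compile(r"[+-]?\d+\.\d+")  # one or more decimals


def convert_phenotype_value(value: str) -> Union[int, float, bool, str]:
    """Convert a raw CSV field following the rules described above."""
    if _INTEGER.fullmatch(value):
        return int(value)
    if _DECIMAL.fullmatch(value):
        return float(value)
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    # Strings: spaces become underscores, quote characters are removed.
    return value.replace(" ", "_").replace('"', "").replace("'", "")


print(convert_phenotype_value("12"))                # 12
print(convert_phenotype_value("3.5"))               # 3.5
print(convert_phenotype_value("TRUE"))              # True
print(convert_phenotype_value('very "high" risk'))  # very_high_risk
```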
Bin numerical values
When using numerical values, two genomes are only considered to share
a phenotype if the value is identical. PanTools creates an
alternative version for these phenotypes by binning the values. Taking
‘Pathogenicity’ from the example below, the values are integers between 3
and 15. Using these two extreme values, three bins are created for a
new phenotype ‘Pathogenicity_binned’: 3-6.33, 6.34-11.66 and 11.67-15.
The number of bins is controlled through --bins. For skewed data,
consider making the bins manually and including them as a string
phenotype.
Parameters
<databaseDirectory>
Path to the database root directory.
<phenotypesFile>
A CSV file containing the phenotype information.
Options
--scratch-directory
Temporary directory for storing localization update files. If not set,
a temporary directory will be created inside the default temporary-file
directory. On most Linux distributions this default temporary-file
directory will be /tmp, on macOS typically /var/folders/.
If a scratch directory is set, it will be created if it does not exist.
If it does exist, PanTools will verify the directory is empty and, if
not, raise an exception.
--append
Do not remove existing phenotype nodes but only add new
properties to them. If a property already exists, values from
the new file will overwrite the old.
--bins
Number of bins used to group numerical values of a phenotype
(default: 3).
Example phenotypes file
The input file needs to be in CSV format: a plain text file where each
value is separated by a comma. The first row should start with
‘Genome,’ followed by the phenotype names and/or identifiers. The first
column must contain the genome numbers corresponding to those in
your pangenome. Phenotypes and metadata must be placed on the same line
as their genome number. A field can remain empty when the phenotype for
a genome is missing or unknown. Below is an example with five genomes
and six phenotypes:
BUSCO attempts to provide a quantitative assessment of the completeness
in terms of expected gene content of a genome assembly. Proteins are
placed into categories of Complete and single-copy (S), Complete and
duplicated (D), fragmented (F), or missing (M). This
function is able to run BUSCO v3, v4 or v5 against protein
sequences of the pangenome.
The number of reported duplicated genes in eukaryotes is often too high,
as different protein isoforms are counted multiple times. To adjust the
imprecise duplication score, include the --longest
argument in the command.
What BUSCO benchmark set to use
When using BUSCO v3, go to https://busco.ezlab.org, download an odb9
set, and untar it with tar -xvzf. Include the entire directory in
the command using the --busco9 argument.
For BUSCO v4 or v5, you only have to provide the odb10 database
name with the --busco10 argument; the database is downloaded
automatically. To get a full list of the available datasets, run
busco --list-datasets.
Required software
BUSCO must be added to your $PATH. For v3, test if the
which run_BUSCO.py command displays the full path, so that it can be
accessed from anywhere. For v4 and v5, test if busco is executable.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
Requires one of --busco9|--busco10.
--threads/-t
Number of parallel working threads, default is the number of
available cores or 8, whichever is lower.
--include/-i
Only include a selection of genomes.
--exclude/-e
Exclude a selection of genomes.
--annotations-file/-A
A text file with the identifiers of annotations to be included,
each on a separate line. The most recent annotation is selected
for genomes without an identifier.
--busco-version/-v
The BUSCO version. Select either ‘busco3’, ‘busco4’ or ‘busco5’
(default).
--busco9
An odb9 benchmark dataset file.
--busco10
An odb10 benchmark dataset name.
--longest
Only search against the longest protein-coding transcript of
genes.
--skip-busco
A list of questionable BUSCOs. The completeness score is
recalculated by skipping these genes.
This function can integrate different functional annotations from a
variety of annotation files. Currently available functional annotations:
Gene Ontology, Pfam, InterPro, TIGRFAM, Phobius,
SignalP and COG. The first time this function is executed, the
Pfam, TIGRFAM, GO, and InterPro databases are integrated into the
pangenome. Phobius, SignalP and COG annotations do not have separate
nodes and are directly annotated on ‘mRNA’ nodes in the pangenome.
Gene names (or identifiers) from the input file are used to identify
gene nodes in the pangenome. Only genes with an exactly matching
name/identifier can be connected to functional annotation nodes! Use the
same FASTA and GFF3 files that were used to construct the pangenome database.
(It is best to use the protein FASTA files in the proteins directory of the
database.)
Functional databases
If the needed databases are not available, they are downloaded by PanTools and
extracted (Pfam, TIGRFAM, GO and InterPro are downloaded from the web). Prior
to v4.2.0, PanTools came with these databases pre-downloaded. This is no
longer the case, as this limited the distribution of PanTools as a single
binary file. We strongly suggest setting the -F option to prevent
unnecessary downloads from the internet, preferably to a location that is
easily accessible.
PanTools has been tested with the following versions of the databases:
A text file with on each line a genome number and the full path
to the corresponding annotation file, separated by a space.
Options
--annotations-file/-A
A text file with the identifiers of annotations to be included,
each on a separate line. The most recent annotation is selected for
genomes without an identifier.
--functional-databases-directory/-F
Path to the directory containing the functional databases. If the
databases are not present, they are downloaded automatically. (Default
location is “functional_databases” in the database directory.)
Functional annotations are incorporated in the graph. A log file is
written to the log directory.
add_functional_annotations.log, a log file with the number of
added functions per type and the identifiers of functions that could
not be included.
Example function files
The <functionsFile> must be formatted like an annotation input
file. Each line of the file starts with the genome number followed by
the full path to an annotation file.
PanTools can recognize functional annotations in different output
formats.
Phobius and SignalP are not standard analyses of the InterProScan
pipeline and require some additional steps during the InterProScan
installation. Please take a look at
our InterProScan install instruction
to verify if the tools are part of the prediction pipeline. Phobius 1.01
##gff-version 3
##interproscan-version 5.52-86.0
AT4G21230.1 ProSiteProfiles protein_match 333 620 39.000664 + . date=06-10-2021;Target=mRNA.AT4G21230.1 333 620;Ontology_term="GO:0004672","GO:0005524","GO:0006468";ID=match$42_333_620;signature_desc=Protein kinase domain profile.;Name=PS50011;status=T;Dbxref="InterPro:IPR000719"
AT3G08980.5 TIGRFAM protein_match 25 101 3.7E-14 + . date=06-10-2021;Target=mRNA.AT3G08980.5 25 101;Ontology_term="GO:0006508","GO:0008236","GO:0016020";ID=match$66_25_101;signature_desc=sigpep_I_bact: signal peptidase I;Name=TIGR02227;status=T;Dbxref="InterPro:IPR000223"
AT2G17780.2 Phobius protein_match 338 354 . + . date=06-10-2021;Target=AT2G17780.2 338 354;ID=match$141_338_354;signature_desc=Region of a membrane-bound protein predicted to be embedded in the membrane.;Name=TRANSMEMBRANE;status=T
AT2G17780.2 Phobius protein_match 1 337 . + . date=06-10-2021;Target=AT2G17780.2 1 337;ID=match$142_1_337;signature_desc=Region of a membrane-bound protein predicted to be outside the membrane, in the extracellular region.;Name=NON_CYTOPLASMIC_DOMAIN;status=T
AT3G11780.2 SignalP_EUK protein_match 1 24 . + . date=06-10-2021;Target=mRNA.AT3G11780.2 1 24;ID=match$230_1_24;Name=SignalP-noTM;status=T
AT1G04300.2 CDD protein_match 40 114 1.54717E-13 + . date=06-10-2021;Target=mRNA.AT1G04300.2 40 114;Ontology_term="GO:0005515";ID=match$212_40_114;signature_desc=MATH;Name=cd00121;status=T;Dbxref="InterPro:IPR002083"
eggNOG-mapper (tab separated) file:
#query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score best_tax_level Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction taxonomic scope eggNOG OGs best eggNOG OG COG Functional cat. eggNOG free text desc.
ATKYO-2G54530.1 3702.AT2G35130.2 1.9e-179 636.0 Brassicales GO:0003674,GO:0003676,GO:0003723,GO:0003824,GO:0004518,GO:0004519,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0006139,GO:0006725,GO:0006807,GO:0008150,GO:0008152,GO:0009451,GO:0009987,GO:0016070,GO:0016787,GO:0016788,GO:0034641,GO:0043170,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0043412,GO:0044237,GO:0044238,GO:0044424,GO:0044464,GO:0046483,GO:0071704,GO:0090304,GO:0090305,GO:0097159,GO:1901360,GO:1901363 Viridiplantae 37R67@33090,3GAUT@35493,3HNDD@3699,KOG4197@1,KOG4197@2759 NA|NA|NA E Pentacotripeptide-repeat region of PRORP
ATKYO-UG22500.1 3712.Bo02269s010.1 7.5e-35 153.7 Brassicales Viridiplantae 29I9W@1,2RRH4@2759,383W6@33090,3GWQZ@35493,3I1A9@3699 NA|NA|NA
ATKYO-1G60060.1 3702.AT1G48090.1 0.0 6241.0 Brassicales ko:K19525 ko00000 Viridiplantae 37IJB@33090,3GAN0@35493,3HQ90@3699,COG5043@1,KOG1809@2759 NA|NA|NA U Vacuolar protein sorting-associated protein
ATKYO-3G74720.1 3702.AT3G52120.1 7.2e-245 852.8 Brassicales ko:K13096 ko00000,ko03041 Viridiplantae 37QYY@33090,3G9VU@35493,3HRDK@3699,KOG0965@1,KOG0965@2759 NA|NA|NA L SWAP (Suppressor-of-White-APricot) surp domain-containing protein D111 G-patch domain-containing protein
ATKYO-4G41660.1 3702.AT4G16340.1 0.0 3392.1 Brassicales GO:0003674,GO:0005085,GO:0005088,GO:0005089,GO:0005488,GO:0005515,GO:0005575,GO:0005622,GO:0005623,GO:0005634,GO:0005737,GO:0005783,GO:0005829,GO:0005886,GO:0006810,GO:0008064,GO:0008150,GO:0008360,GO:0009605,GO:0009606,GO:0009628,GO:0009629,GO:0009630,GO:0009958,GO:0009966,GO:0009987,GO:0010646,GO:0010928,GO:0012505,GO:0016020,GO:0016043,GO:0016192,GO:0017016,GO:0017048,GO:0019898,GO:0019899,GO:0022603,GO:0022604,GO:0023051,GO:0030832,GO:0031267,GO:0032535,GO:0032956,GO:0032970,GO:0033043,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0044422,GO:0044424,GO:0044425,GO:0044432,GO:0044444,GO:0044446,GO:0044464,GO:0048583,GO:0050789,GO:0050793,GO:0050794,GO:0050896,GO:0051020,GO:0051128,GO:0051179,GO:0051234,GO:0051493,GO:0065007,GO:0065008,GO:0065009,GO:0070971,GO:0071840,GO:0071944,GO:0090066,GO:0098772,GO:0110053,GO:1902903 ko:K21852 ko00000,ko04131 Viridiplantae 37QIM@33090,3G8RK@35493,3HSFN@3699,KOG1997@1,KOG1997@2759 NA|NA|NA T Belongs to the DOCK family
A custom input file must consist of two tab or comma separated columns.
The first column should contain a gene/mRNA id, the second an identifier
from one of four functional annotation databases: GO, Pfam, InterPro or
TIGRFAM.
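A custom file in this two-column layout can be read as follows. This is a sketch: the function name is hypothetical, and the sample rows only reuse gene and function identifiers that appear in the InterProScan example above.

```python
import re
from typing import Iterable, List, Tuple


def parse_custom_annotations(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (gene/mRNA id, function id) pairs from tab or comma separated lines."""
    pairs: List[Tuple[str, str]] = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        columns = re.split(r"[\t,]", line)
        if len(columns) != 2:
            raise ValueError(f"line {line_number}: expected exactly two columns")
        gene_id, function_id = (column.strip() for column in columns)
        pairs.append((gene_id, function_id))
    return pairs


sample = [
    "AT4G21230.1\tGO:0004672",  # tab separated
    "AT4G21230.1,IPR000719",    # comma separated
    "AT3G08980.5\tTIGR02227",
]
for gene_id, function_id in parse_custom_annotations(sample):
    print(gene_id, function_id)
```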
ID mRNA-YPR204W
FT DOMAIN 1 1032 NON CYTOPLASMIC.
//
ID mRNA-ndhB-2_1
FT SIGNAL 1 21
FT DOMAIN 1 4 N-REGION.
FT DOMAIN 5 16 H-REGION.
FT DOMAIN 17 21 C-REGION.
FT DOMAIN 22 36 NON CYTOPLASMIC.
FT TRANSMEM 37 57
FT DOMAIN 58 63 CYTOPLASMIC.
FT TRANSMEM 64 83
FT DOMAIN 84 88 NON CYTOPLASMIC.
FT TRANSMEM 89 113
FT DOMAIN 114 133 CYTOPLASMIC.
FT TRANSMEM 134 156
FT DOMAIN 157 167 NON CYTOPLASMIC.
FT TRANSMEM 168 189
FT DOMAIN 190 222 CYTOPLASMIC.
FT TRANSMEM 223 246
FT DOMAIN 247 253 NON CYTOPLASMIC.
//
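The record structure of the output above (an ID line, FT feature lines, and // as record terminator) can be summarized per protein with a short parser. This is a minimal sketch, not the parser used by PanTools; the function name is hypothetical and the sample is an abbreviated version of the records shown above.

```python
from typing import Dict, Iterable


def summarize_phobius(lines: Iterable[str]) -> Dict[str, Dict[str, object]]:
    """Count TRANSMEM features and flag SIGNAL features per ID record."""
    summary: Dict[str, Dict[str, object]] = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields or fields[0] == "//":
            current = None  # '//' closes the current record
            continue
        if fields[0] == "ID" and len(fields) > 1:
            current = fields[1]
            summary[current] = {"signal_peptide": False, "transmembrane": 0}
        elif fields[0] == "FT" and len(fields) > 1 and current is not None:
            if fields[1] == "SIGNAL":
                summary[current]["signal_peptide"] = True
            elif fields[1] == "TRANSMEM":
                summary[current]["transmembrane"] += 1
    return summary


# Abbreviated version of the records shown above.
sample = """\
ID mRNA-YPR204W
FT DOMAIN 1 1032 NON CYTOPLASMIC.
//
ID mRNA-ndhB-2_1
FT SIGNAL 1 21
FT DOMAIN 22 36 NON CYTOPLASMIC.
FT TRANSMEM 37 57
FT DOMAIN 58 63 CYTOPLASMIC.
FT TRANSMEM 64 83
//
"""
for protein, info in summarize_phobius(sample.splitlines()).items():
    print(protein, info)
```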
Read antiSMASH output and incorporate Biosynthetic Gene Clusters
(BGC) nodes into the pangenome database. A ‘bgc’ node holds the gene
cluster product, the cluster address and has a relationship to all gene
nodes of the cluster. For this function to work, antiSMASH should be
performed with the same FASTA and GFF3 files used for building the
pangenome. antiSMASH output will not match the identifiers of the
pangenome when no GFF file was included.
As of PanTools v3.3.4 the required antiSMASH version is 6.0.0. Gene
cluster information is parsed from the .JSON file that is generated in
each run. We try to keep the parser updated with newer versions but
please contact us when this is no longer the case.
Software    Version    Version date
antiSMASH   6.0.0      21-02-2021
Parameters
<databaseDirectory>
Path to the database root directory.
<antiSMASHFile>
A text file with on each line a genome number and the full path
to the corresponding antiSMASH output file, separated by a space.
Options
--annotations-file/-A
A text file with the identifiers of annotations to be included,
each on a separate line. The most recent annotation is selected for
genomes without an identifier.
Example antiSMASH file
The <antiSMASHFile> must be formatted like a regular annotation
input file. Each line of the file starts with the genome number followed
by the full path to the JSON file.
Add genomic variation to the pangenome database. These functions can
handle SNP (single nucleotide polymorphism)/InDel (insertion/deletion) and PAV
(presence/absence variation) information but will only consider genic variation
when adding the information to the database. For SNP/InDel information, VCF
(variant call format) files are required. For PAV information, a tab-separated
file with 1s and 0s describing presence and absence, respectively, is required.
Add variants to the pangenome database. The function will only consider
genomic variation that is present in the mRNA features of the pangenome.
The SNP/InDel information will be used to create a consensus sequence for each
mRNA feature. For each accession and mRNA feature, a new variant node will be
created to hold this consensus sequence.
Several temporary files will be created during the process: a FASTA file
containing the original mRNA sequences and FASTA files containing the consensus
mRNA sequences for each sample. These files will be deleted after the process
is finished unless the --keep-intermediate-files option is used.
By default, the location of these files will be at /tmp for Linux and
/var/folders for macOS. The location can be changed with the
--scratch-directory option.
NB: VCF files that are not indexed with tabix will be indexed automatically in
their original location!
A text file with on each line a genome number and the full path
to a corresponding VCF file, separated by a space.
Options
--threads/-t
Number of threads to use. Default: total number of cores
available or 8, whichever is lower.
--scratch-directory
Temporary directory for storing intermediate files. If not set, a
temporary directory will be created inside the default temporary-file
directory. On most Linux distributions this default temporary-file
directory will be /tmp/, on macOS typically /var/folders/.
If a scratch directory is set, it will be created if it does not exist.
If it does exist, PanTools will verify the directory is empty and, if
not, raise an exception.
--keep-intermediate-files
Keep intermediate consensus FASTA and corresponding log files.
Remove variants from the pangenome database. This function will remove all
VCF information from the database. All variant nodes created by the
add_variants function will be removed. The VCF information will be
removed from the accession nodes. If there is no variant information
left for an accession node, the node will be removed.
Add PAVs to the pangenome database. PAV information can only be added
for mRNA features. For each accession and mRNA feature, PAV information can
be stored in the database. Only values of 1 and 0 are allowed in the
PAV file. A value of 1 indicates that the gene is present in the sample
and a value of 0 indicates that the gene is absent in the sample.
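A minimal Python sketch of the value rule above: it maps a single PAV field to
present or absent and rejects anything else. The function name
parse_pav_value is hypothetical, and the sketch makes no assumption about the
column layout of the PAV file itself.

```python
def parse_pav_value(raw):
    """Map a PAV field to True (present) or False (absent)."""
    value = raw.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    raise ValueError(f"invalid PAV value {raw!r}: only 1 and 0 are allowed")


print([parse_pav_value(field) for field in ["1", "0", "1"]])
```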
Parameters
<databaseDirectory>
Path to the database root directory.
<pavsFile>
A text file in which each line contains a genome number and the full path
to the corresponding PAV file, separated by a space.
Remove PAVs from the pangenome database. This function will remove all
PAV information from the database. All variant nodes created by the
add_pavs function will be removed. The PAV information will be
removed from the accession nodes. If there is no variant information
left for an accession node, the node will be removed.
The following functionalities allow the removal of large sets of nodes
and relationships from the pangenome. These functions will first ask for
a confirmation before the nodes are actually removed. Be careful: the
data is not backed up, so removed nodes and properties are
permanently gone.
Remove a selection of nodes and their relationships from the pangenome.
For a pangenome database the following nodes should never be removed:
nucleotide, pangenome, genome, sequence. When using a
panproteome, mRNA nodes cannot be removed.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
Requires one of --nodes|--label. --include and --exclude
only work with --label.
--include/-i
Only remove nodes of the selected genomes.
--exclude/-e
Do not remove nodes of the selected genomes.
--nodes/-n
One or multiple node identifiers, separated by a comma.
--label
A node label; all nodes matching the label are removed.
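The option constraints above can be illustrated with a small Python argparse
sketch. It mirrors the documented option names but is not the PanTools
command-line parser. Treating --nodes and --label as mutually exclusive is an
assumption, and the program name and database path are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="remove_nodes_sketch")
    parser.add_argument("database_directory")
    # Exactly one of --nodes / --label must be given (assumed exclusive).
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--nodes", "-n")
    target.add_argument("--label")
    parser.add_argument("--include", "-i")
    parser.add_argument("--exclude", "-e")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Genome selections are only meaningful together with --label.
    if args.nodes is not None and (args.include or args.exclude):
        parser.error("--include and --exclude only work with --label")
    return args


args = parse_args(["/path/to/db", "--label", "gene", "--exclude", "3"])
print(args.label, args.exclude)
```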
Delete phenotype nodes or remove specific phenotype information from
the nodes. The specific phenotype property needs to be specified with
--phenotype. When this argument is not included, phenotype nodes
are removed.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
--include/-i
Only remove nodes of the selected genomes.
--exclude/-e
Do not remove nodes of the selected genomes.
--phenotype/-p
Name of the phenotype. All information of the given phenotype is
removed from ‘phenotype’ nodes.
Remove all the genomic features that belong to annotations, such as
gene, mRNA, exon, tRNA, and feature nodes. Functional
annotation nodes are not removed with this function but can be removed
with remove_functions. Removing
annotations can be done in two ways:
Selecting genomes with --include or --exclude, for which all
annotation features will be removed.
Remove specific annotations by providing a text file with identifiers
via the --annotations-file argument.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
Requires one of --include|--exclude|--annotations-file.
--include/-i
A selection of genomes for which all annotations will be removed.
--exclude/-e
A selection of genomes excluded from the removal of annotations.
--annotations-file/-A
A text file with the identifiers of annotations to be removed,
each on a separate line.
Example annotations file
The annotations file should contain one annotation identifier per
line (genome number, annotation number). The following example will
remove the first annotation of genomes 1, 2 and 3 and the second
annotation of genome 1.
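The Python sketch below builds the lines of such a file for the selection
described above. The identifier layout is an assumption: the text only states
that an identifier combines a genome number and an annotation number, so the
underscore separator and the function name annotation_identifiers are
hypothetical and should be checked against the identifiers in the database.

```python
def annotation_identifiers(pairs, separator="_"):
    """Build one identifier per (genome number, annotation number) pair.

    The separator is an assumed convention, not confirmed by the text.
    """
    return [f"{genome}{separator}{annotation}" for genome, annotation in pairs]


# First annotation of genomes 1, 2 and 3; second annotation of genome 1.
selection = [(1, 1), (2, 1), (3, 1), (1, 2)]
print("\n".join(annotation_identifiers(selection)))
```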
Remove all the functional annotation features from the graph database.
Functional annotations include the GO, pfam, tigrfam and
interpro nodes as well as mRNA node properties for COG, phobius
and signalp. There are multiple modes available using --mode:
‘all’ removes all functional annotation nodes and properties.
‘nodes’ removes all GO, pfam, tigrfam and interpro nodes.
‘properties’ removes all COG, phobius and signalp properties
from mRNA nodes.
‘GO’, ‘pfam’ and ‘tigrfam’ only remove specific properties from mRNA
nodes.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
--mode/-m
Mode that determines which annotations to remove (default: all).
Delete all ‘homology_group’ nodes and ‘is_similar’ relations between
‘mRNA’ nodes from the database.
Parameters
<databaseDirectory>
Path to the database root directory.
Options
--fast
Do not remove the ‘is_similar’ relationships between mRNA nodes.
This does not influence the next grouping.
--grouping-version/-v
Select a specific grouping version to be removed. Should be either a
grouping number, ‘all’ for all groupings or ‘all_inactive’ for
all inactive groupings.
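The accepted values for --grouping-version can be checked with a short Python
sketch such as the one below. The function name parse_grouping_version is
hypothetical, and treating grouping numbers as non-negative integers is an
assumption.

```python
def parse_grouping_version(value):
    """Return 'all', 'all_inactive', or a grouping number as an int."""
    text = value.strip()
    if text in ("all", "all_inactive"):
        return text
    try:
        number = int(text)
    except ValueError:
        raise ValueError(
            f"invalid grouping version {value!r}: expected a grouping "
            "number, 'all' or 'all_inactive'"
        ) from None
    if number < 0:
        # Assumption: grouping numbers are never negative.
        raise ValueError(f"grouping number cannot be negative: {number}")
    return number


print(parse_grouping_version("2"), parse_grouping_version("all_inactive"))
```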