Part 3. Explore the pangenome using the Neo4j browser
$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
$ tar -xvzf chloroplast_DB.tar.gz
Set the full path to the chloroplast pangenome database by opening neo4j.conf (‘neo4j-community-3.5.30/conf/neo4j.conf’) and adding a line that sets ‘dbms.directories.data’ to the full path of the extracted chloroplast_DB directory. Please make sure there is only ever a single uncommented line with ‘dbms.directories.data’.
Uncomment the following four lines in neo4j-community-3.5.30/conf/neo4j.conf.
Replace 7687, 7474, and 7473 with three different numbers that are not in use by other people on your server. This way, everyone can have their own database running at the same time.
#dbms.connectors.default_listen_address=0.0.0.0
#dbms.connector.bolt.listen_address=:7687
#dbms.connector.http.listen_address=:7474
#dbms.connector.https.listen_address=:7473
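If you prefer to script this edit rather than make it by hand, the following is a minimal Python sketch of the idea. It works on the text of the config file and is not part of PanTools or Neo4j. The function name `enable_connectors` and the port numbers in `NEW_PORTS` are hypothetical; choose ports that are actually free on your server.

```python
import re

# Hypothetical replacement ports; pick numbers that are free on your server.
NEW_PORTS = {"bolt": 7697, "http": 7484, "https": 7483}

def enable_connectors(conf_text, ports):
    """Uncomment the four connector lines and swap in the chosen ports."""
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip("#").strip()
        # The listen address line only needs to be uncommented.
        if stripped.startswith("dbms.connectors.default_listen_address="):
            out.append(stripped)
            continue
        # The three connector lines are uncommented and get a new port.
        m = re.match(
            r"dbms\.connector\.(bolt|http|https)\.listen_address=:\d+$", stripped
        )
        if m:
            name = m.group(1)
            out.append(f"dbms.connector.{name}.listen_address=:{ports[name]}")
            continue
        # Every other line is left exactly as it was.
        out.append(line)
    return "\n".join(out)

# The four lines as they appear (commented out) in the default config.
SAMPLE = """\
#dbms.connectors.default_listen_address=0.0.0.0
#dbms.connector.bolt.listen_address=:7687
#dbms.connector.http.listen_address=:7474
#dbms.connector.https.listen_address=:7473
# some other comment
"""

print(enable_connectors(SAMPLE, NEW_PORTS))
```

To apply it to a real file, you would read neo4j.conf into a string, pass it through the function, and write the result back after checking it by eye.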
Let's start up the Neo4j server!
$ neo4j start
Start Firefox (or a web browser of your own preference) and let it run in the background.
$ firefox &
If you did not change the config to allow non-local connections, browse to http://localhost:7474. If you did change the config file, go to server_address:7474, where 7474 should be replaced with the number you chose earlier.
If the database startup was successful, a login prompt will appear on the web page. Use ‘neo4j’ as both username and password. After logging in, you are asked to set a new password.
Exploring nodes and edges in Neo4j
Go through the following steps to become proficient in using the Neo4j browser and the underlying PanTools data structure. If you have any trouble finding a node, relationship or any type of information, download and use this visual guide.
Click on the database icon on the left. A menu with all node types and relationship types will appear.
Click on the ‘gene’ button in the node label section. This automatically generates a query. Execute the query.
The LIMIT clause caps the number of nodes returned, which keeps a large result from crashing your web browser. Set LIMIT to 10 and execute the query.
Hover over the nodes, click on them and take a look at the values stored in the nodes. All these features (except ID) were extracted from the GFF annotation files. ID is a unique number automatically assigned to nodes and relationships by Neo4j.
Double-click on the matK gene node; all nodes connected to this gene node will appear. The nodes have distinct colors because they are different node types, such as mRNA, CDS and nucleotide. Take a look at the node properties to observe that most values and information are specific to a certain node type.
Double-click on the matK mRNA node; a homology_group node should appear. This type of node connects homologous genes in the graph. However, you can see that this gene did not cluster with any other gene.
Hover over the ‘start’ relation of the matK gene node. As you can see, information is stored not only in nodes but also in relationships! A relationship always has a direction; in this case the relation starts at the gene node and points to a nucleotide node. Offset marks the location within the node.
Double-click on the nucleotide node at the end of the ‘start’ relationship. An incoming and an outgoing relation appear that connect to other nucleotide nodes. Hover over both relations and compare them. The relations hold the genomic coordinates and show that this path only occurs in contig/sequence 1 of genome 1.
Follow the outgoing FF relationship to the next nucleotide node and expand this node by double-clicking. Three nodes will pop up this time. If you hover over the relations, you will see that the coordinates belong to other genomes as well. You may also notice that the relationship type between nucleotide nodes is always a two-letter combination of F (forward) and R (reverse), which states whether a sequence is reverse complemented or not. The first letter corresponds to the sequence of the node at the start of the relation, and the second letter refers to the sequence of the end node.
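The two-letter relationship types can be made concrete with a small Python sketch. It is only an illustration of the F/R convention described above, not PanTools code. The function names and the example sequences are made up, and the sketch ignores any overlap between the sequences of adjacent nodes.

```python
# Complement table for unambiguous DNA bases only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def orient(seq, letter):
    """Return seq as stored for 'F', or reverse complemented for 'R'."""
    if letter == "F":
        return seq
    if letter == "R":
        return reverse_complement(seq)
    raise ValueError(f"unexpected orientation letter: {letter!r}")

def read_relation(start_seq, end_seq, rel_type):
    """Apply the first letter to the start node and the second to the end node."""
    first, second = rel_type  # e.g. "FR" -> "F", "R"
    return orient(start_seq, first), orient(end_seq, second)

# Made-up node sequences, read across an FF and an FR relationship.
print(read_relation("AAC", "GGT", "FF"))  # ('AAC', 'GGT')
print(read_relation("AAC", "GGT", "FR"))  # ('AAC', 'ACC')
```

With FF both node sequences are read as stored, whereas with FR the end node is read as its reverse complement.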
Finally, execute the following query to call the database schema and see how all node types are connected to each other: CALL db.schema(). The schema will be useful when designing your own queries!
Query the pangenome database using Cypher
Cypher is a declarative, SQL-inspired language that uses ASCII art to represent patterns. Nodes are written in parentheses, which resemble circles, and relationships are written as arrows.
The MATCH clause allows you to specify the patterns Neo4j will search for in the database.
With WHERE you can add constraints to the patterns described.
In the RETURN clause you define which parts of the pattern to display.
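As a rough analogy for how the three clauses divide the work, here is a minimal Python sketch over a handful of made-up records. It is not how Neo4j evaluates queries. The `run_query` helper and the records are hypothetical and only mirror the roles of MATCH, WHERE, RETURN and LIMIT.

```python
# Made-up records standing in for nodes; names and lengths are illustrative only.
nodes = [
    {"label": "gene", "name": "geneA", "length": 120},
    {"label": "gene", "name": "geneB", "length": 1500},
    {"label": "mRNA", "name": "geneA_mRNA", "length": 120},
    {"label": "gene", "name": "geneC", "length": 240},
]

def run_query(nodes, label, where, returns, limit):
    """Mimic MATCH (x:label) WHERE ... RETURN x.a, x.b LIMIT n on a list of dicts."""
    matched = (n for n in nodes if n["label"] == label)   # MATCH: pick the pattern
    kept = (n for n in matched if where(n))               # WHERE: add constraints
    rows = [tuple(n[k] for k in returns) for n in kept]   # RETURN: choose what to show
    return rows[:limit]                                   # LIMIT: cap the result size

rows = run_query(
    nodes,
    label="gene",
    where=lambda g: 100 < g["length"] < 250,
    returns=("name", "length"),
    limit=100,
)
print(rows)  # [('geneA', 120), ('geneC', 240)]
```

The point of the analogy is that you describe what to find and what to show; the order in which the database actually does the work is left to Neo4j.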
Match and return 100 nucleotide nodes
MATCH (n:nucleotide) RETURN n LIMIT 100
Find all the genome nodes
MATCH (n:genome) RETURN n
Find the pangenome node
MATCH (n:pangenome) RETURN n
Match and return 100 genes
MATCH (g:gene) RETURN g LIMIT 100
Match and return 100 genes and order them by length
MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
The same query as before but results are now returned in a table
MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
Return genes that are longer than 100 bp but shorter than 250 bp (this can also be applied to other features such as exons, introns or CDS)
MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100
Find genes located on the first genome
MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
Find genes located on the first genome and first sequence
MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
Homology group queries
Return 100 homology groups
MATCH (h:homology_group) RETURN h LIMIT 100
Match homology groups which contain two members
MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
Match homology groups and ‘walk’ to the genes and corresponding start and end node
MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
Turn off autocomplete by clicking on the button on the bottom right. The graph falls apart because relations were not assigned to variables.
The same query as before but now the relations do have variables
MATCH (h:homology_group)-[r1]->(f:feature)<-[r2]-(g:gene)-[r3]->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
When you turn off autocomplete again, only the ‘is_similar_to’ relation disappears, since we did not include it in the query.
Find the homology group that the rpoC1 gene belongs to
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
Find genes on genome 1 which don’t show homology
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 AND g.genome = 1 RETURN *
Structural variant detection
Find SNP bubbles (for simplicity we only use the FF relation)
MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN * LIMIT 50
The same query but returning the results in a table
MATCH (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN a1.length, b1.length, a1.sequence, b1.sequence LIMIT 50
Functions such as count(), sum() and stDev() can be used in a query.
The same SNP query but count the hits instead of displaying them
MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN count(p)
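To see what shape the bubble pattern is looking for, here is a minimal Python sketch that searches a tiny made-up graph for the same structure: a node n with two different FF successors that both lead to the same node m. The edge list and the `find_bubbles` function are hypothetical, and the sketch only approximates the Cypher semantics (parallel edges are collapsed, for instance).

```python
from collections import defaultdict

# Toy graph: each pair is an FF edge (start node, end node). Node names are made up.
ff_edges = [("n", "a1"), ("n", "b1"), ("a1", "m"), ("b1", "m"), ("m", "x")]

def find_bubbles(edges):
    """Return (n, a, b, m) for every n -> a -> m <- b <- n with a != b."""
    succ = defaultdict(set)
    for start, end in edges:
        succ[start].add(end)
    hits = []
    for n in list(succ):
        for a in succ[n]:
            for b in succ[n]:
                if a == b:
                    continue  # the two branches must be different
                for m in succ.get(a, ()):
                    if m in succ.get(b, ()):
                        hits.append((n, a, b, m))
    return sorted(hits)

hits = find_bubbles(ff_edges)
print(hits)       # [('n', 'a1', 'b1', 'm'), ('n', 'b1', 'a1', 'm')]
print(len(hits))  # 2

# Collapse the mirrored hits to count each bubble once.
unique = {(n, frozenset((a, b)), m) for n, a, b, m in hits}
print(len(unique))  # 1
```

Note that the single bubble in the toy graph is reported twice, once with the two branches swapped. The Cypher pattern is symmetric in a1 and b1 in the same way, so it is worth checking whether count(p) reports each bubble twice in your database before interpreting the number.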
Hopefully you now have a feel for the Neo4j browser and Cypher, and you’re inspired to create your own queries!
When you’re done working in the browser, shut down the database from the command line.
$ neo4j stop
In part 4 of the tutorial we explore some of the functionalities to analyze the pangenome.