Part 3. Explore the pangenome using the Neo4j browser

Did you skip part 2 of the tutorial or were you unable to build the chloroplast pangenome? Download the pre-constructed pangenome here or via wget.

$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
$ tar -xvzf chloroplast_DB.tar.gz

Configuring Neo4j

Set the full path to the chloroplast pangenome database by opening neo4j.conf (‘neo4j-community-3.5.30/conf/neo4j.conf’) and adding the following line to the config file. Make sure there is only ever a single uncommented line containing ‘dbms.directories.data’.

#dbms.directories.data=/YOUR_PATH/any_other_database
dbms.directories.data=/YOUR_PATH/chloroplast_DB
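
One way to check the single-line requirement is to count the uncommented entries with grep. This is a minimal sketch: the function name count_active_data_dirs is made up for illustration, and the demonstration runs on a throwaway sample file rather than on your real neo4j.conf.

```shell
# Count uncommented dbms.directories.data lines in a config file.
# count_active_data_dirs is a hypothetical helper, not part of Neo4j or PanTools.
count_active_data_dirs() {
    grep -c '^[[:space:]]*dbms\.directories\.data=' "$1"
}

# Demonstration on a temporary sample file (assumed content, mirroring the two lines above).
data_dir_sample=$(mktemp)
printf '%s\n' \
    '#dbms.directories.data=/YOUR_PATH/any_other_database' \
    'dbms.directories.data=/YOUR_PATH/chloroplast_DB' > "$data_dir_sample"

count_active_data_dirs "$data_dir_sample"   # prints 1, which is what a valid config should give
```

On your own installation you would pass neo4j-community-3.5.30/conf/neo4j.conf instead of the sample file. Any result other than 1 means the config needs another look.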

Allowing non-local connections

To run Neo4j on a server and access it from another machine, some additional lines in the config file must be changed.

Uncomment the following four lines in neo4j-community-3.5.30/conf/neo4j.conf.
Replace 7687, 7474, and 7473 with three different port numbers that are not in use by other people on your server. This way, everyone can have their own database running at the same time.

#dbms.connectors.default_listen_address=0.0.0.0
#dbms.connector.bolt.listen_address=:7687
#dbms.connector.http.listen_address=:7474
#dbms.connector.https.listen_address=:7473
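
If you prefer to make these changes from the command line, the edit can be sketched with sed. The port values 7697, 7484, and 7483 below are arbitrary examples, and enable_remote_access is a hypothetical helper name; the sketch assumes the four lines appear in your config exactly as shown above and writes the result to standard output instead of modifying the file.

```shell
# Example port choices (assumptions); pick numbers that are free on your server.
BOLT_PORT=7697
HTTP_PORT=7484
HTTPS_PORT=7483

# enable_remote_access is a hypothetical helper: it reads a config file,
# uncomments the four connector lines, swaps in the ports above,
# and prints the edited config to standard output.
enable_remote_access() {
    sed \
        -e 's|^#\(dbms\.connectors\.default_listen_address=0\.0\.0\.0\)|\1|' \
        -e "s|^#dbms\.connector\.bolt\.listen_address=:7687|dbms.connector.bolt.listen_address=:${BOLT_PORT}|" \
        -e "s|^#dbms\.connector\.http\.listen_address=:7474|dbms.connector.http.listen_address=:${HTTP_PORT}|" \
        -e "s|^#dbms\.connector\.https\.listen_address=:7473|dbms.connector.https.listen_address=:${HTTPS_PORT}|" \
        "$1"
}

# Demonstration on a temporary file holding only the four lines shown above.
remote_sample=$(mktemp)
printf '%s\n' \
    '#dbms.connectors.default_listen_address=0.0.0.0' \
    '#dbms.connector.bolt.listen_address=:7687' \
    '#dbms.connector.http.listen_address=:7474' \
    '#dbms.connector.https.listen_address=:7473' > "$remote_sample"

enable_remote_access "$remote_sample"
```

Because the helper only prints the result, you can redirect it to a new file, review it, and keep a backup of the original neo4j.conf before replacing it.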

Let's start up the Neo4j server!

$ neo4j start

Start Firefox (or a web browser of your choice) and let it run in the background.

$ firefox &

If you did not change the config to allow non-local connections, browse to http://localhost:7474. If you did change the config file, go to server_address:7474, where 7474 should be replaced with the HTTP port number you chose earlier.

If the database startup was successful, a login screen will appear on the web page. Use ‘neo4j’ as both username and password. After logging in, you are asked to set a new password.

Exploring nodes and edges in Neo4j

Go through the following steps to become proficient in using the Neo4j browser and the underlying PanTools data structure. If you have trouble finding a node, relationship or any other type of information, download and use this visual guide.

Click on the database icon on the left. A menu with all node types and relationship types will appear.
Click on the ‘gene’ button in the node label section. This automatically generates a query. Execute the query.
The LIMIT clause caps the number of nodes returned, which prevents your web browser from crashing. Set LIMIT to 10 and execute the query.
Hover over the nodes, click on them and take a look at the values stored in the nodes. All these features (except ID) were extracted from the GFF annotation files. ID is a unique number automatically assigned to nodes and relationships by Neo4j.
Double-click on the matK gene node; all nodes with a connection to this gene node will appear. The nodes have distinct colors because they are different node types, such as mRNA, CDS, and nucleotide. Take a look at the node properties to observe that most values are specific to a certain node type.
Double-click on the matK mRNA node; a homology_group node should appear. This type of node connects homologous genes in the graph. However, you can see this gene did not cluster with any other gene.
Hover over the ‘start’ relationship of the matK gene node. As you can see, information is stored not only in nodes but also in relationships! A relationship always has a direction; in this case the relationship starts at the gene node and points to a nucleotide node. Offset marks the location within the node.
Double-click on the nucleotide node at the end of the ‘start’ relationship. An incoming and an outgoing relationship appear that connect to other nucleotide nodes. Hover over both relationships and compare them. The relationships hold the genomic coordinates and show that this path only occurs in contig/sequence 1 of genome 1.
Follow the outgoing FF-relationship to the next nucleotide node and expand this node by double-clicking. Three nodes will pop up this time. If you hover over the relationships you see that the coordinates belong to other genomes as well. You may also notice that the relationship type between nucleotide nodes is always a two-letter combination of F (forward) and R (reverse), which states whether a sequence is reverse complemented or not. The first letter corresponds to the sequence of the node at the start of the relationship, whereas the second letter refers to the sequence of the end node.
Finally, execute the following query to call the database schema and see how all node types are connected to each other: CALL db.schema(). The schema will be useful when designing your own queries!
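
The F and R letters described in the steps above can be made concrete on the command line. The following is a minimal sketch, assuming F means a node's sequence is read as stored and R means it is read as its reverse complement; revcomp and oriented_sequence are hypothetical helper names, and the sequence AAGC is an invented example rather than data from the chloroplast pangenome.

```shell
# Reverse complement a DNA string: reverse it, then swap A<->T and C<->G.
# revcomp is a hypothetical helper; it handles upper-case A, C, G and T only.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# oriented_sequence is also hypothetical: with F it returns the sequence
# as given, with R it returns the reverse complement.
oriented_sequence() {
    case "$2" in
        F) printf '%s\n' "$1" ;;
        R) revcomp "$1" ;;
        *) echo "orientation must be F or R" >&2; return 1 ;;
    esac
}

oriented_sequence AAGC F   # prints AAGC
oriented_sequence AAGC R   # prints GCTT
```

Under this reading, an FR relationship would pair the stored sequence of the start node with the reverse complement of the end node.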

Query the pangenome database using Cypher

Cypher is a declarative, SQL-inspired language that uses ASCII art to represent patterns. Nodes are written in parentheses, which resemble circles, and relationships are written as arrows.

The MATCH clause allows you to specify the patterns Neo4j will search for in the database.
With WHERE you can add constraints to the patterns described.
In the RETURN clause you define which parts of the pattern to display.

Cypher queries

Match and return 100 nucleotide nodes

MATCH (n:nucleotide) RETURN n LIMIT 100

Find all the genome nodes

MATCH (n:genome) RETURN n

Find the pangenome node

MATCH (n:pangenome) RETURN n

Match and return 100 genes

MATCH (g:gene) RETURN g LIMIT 100

Match and return 100 genes and order them by length

MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

The same query as before but results are now returned in a table

MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

Return genes which are longer than 100 bp but shorter than 250 bp (this can also be applied to other features such as exons, introns or CDS)

MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100

Find genes located on first genome

MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

Find genes located on first genome and first sequence

MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

Homology group queries

Return 100 homology groups

MATCH (h:homology_group) RETURN h LIMIT 100

Match homology groups which contain two members

MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h

Match homology groups and ‘walk’ to the genes and their corresponding start and end nodes

MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25

Turn off autocomplete by clicking on the button on the bottom right. The graph falls apart because the relationships were not assigned to variables and are therefore not returned.

The same query as before but now the relations do have variables

MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25

When you turn off autocomplete again, only the ‘is_similar_to’ relationship disappears, since it was not part of the query.

Find the homology group that the rpoC1 gene belongs to

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *

Find genes on genome 1 which don’t show homology

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 AND g.genome = 1 RETURN *

Structural variant detection

Find SNP bubbles (for simplification we only use the FF relation)

MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN * LIMIT 50

The same query but returning the results in a table

MATCH (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN a1.length, b1.length, a1.sequence, b1.sequence LIMIT 50

Functions such as count(), sum() and stDev() can be used in a query.

The same SNP query but count the hits instead of displaying them

MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN count(p)

Hopefully you now have a feel for the Neo4j browser and Cypher and are inspired to create your own queries!

When you’re done working in the browser, shut down the database from the command line.

$ neo4j stop

More information on Neo4j and the cypher language:
Neo4j Cypher Manual v3.5
Neo4j Cypher Refcard
Neo4j API

In part 4 of the tutorial we explore some of the functionalities to analyze the pangenome.