Part 3. Explore the pangenome using the Neo4j browser
=====================================================

Did you skip :doc:`part 2 ` of the tutorial or were you unable to build the
chloroplast pangenome? Download the pre-constructed pangenome `here `_ or via
wget.

.. code:: bash

   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
   $ tar -xvzf chloroplast_DB.tar.gz

--------------

Configuring Neo4j
-----------------

Point Neo4j to the chloroplast pangenome database: open neo4j.conf
('*neo4j-community-3.5.30/conf/neo4j.conf*') and add the following line,
using the full path to the database. Make sure the file always contains only
a single uncommented line with 'dbms.directories.data'.

.. code:: text

   #dbms.directories.data=/YOUR_PATH/any_other_database
   dbms.directories.data=/YOUR_PATH/chloroplast_DB

| **Allowing non-local connections**
| To be able to run Neo4j on a server and have access to it from anywhere, some additional lines in the config file must be changed.

-  **Uncomment** the four following lines in
   neo4j-community-3.5.30/conf/neo4j.conf.
-  Replace 7687, 7474, and 7473 by three different numbers that are not
   in use by other people on your server. In this way, everyone can have
   their own database running at the same time.

.. code:: text

   #dbms.connectors.default_listen_address=0.0.0.0
   #dbms.connector.bolt.listen_address=:7687
   #dbms.connector.http.listen_address=:7474
   #dbms.connector.https.listen_address=:7473

Let's start up the Neo4j server!

.. code:: bash

   $ neo4j start

Start Firefox (or a web browser of your own preference) and let it run
in the background.

.. code:: bash

   $ firefox &

If you did not change the config to allow non-local connections, browse to
*http://localhost:7474*. If you did change the config file, go to
*server_address:7474*, where 7474 should be replaced with the number you
chose for the http connector. If the database startup was successful, a
login prompt will appear on the web page. Use '*neo4j*' both as username and
password. After logging in, you are asked to set a new password.

--------------

Exploring nodes and edges in Neo4j
----------------------------------

Go through the following steps to become proficient in using the Neo4j
browser and the underlying PanTools data structure. If you have any trouble
finding a node, relationship or any type of information, download and use
`this visual guide `_.

1.  Click on the database icon on the left. A menu with all node types
    and relationship types will appear.
2.  Click on the '*gene*' button in the node label section. This
    automatically generates a query. Execute the query.
3.  The **LIMIT** clause caps the number of nodes that are returned, which
    prevents your web browser from crashing. Set LIMIT to 10 and execute
    the query.
4.  Hover over the nodes, click on them and take a look at the values
    stored in the nodes. All these features (except ID) were extracted
    from the GFF annotation files. ID is a unique number automatically
    assigned to nodes and relationships by Neo4j.
5.  Double-click on the **matK** gene node; all nodes with a connection
    to this gene node will appear. The nodes have distinct colors as
    these are different node types, such as **mRNA**, **CDS**,
    **nucleotide**. Take a look at the node properties to observe that
    most values and information are specific to a certain node type.
6.  Double-click on the *matK* mRNA node; a **homology_group** node
    should appear. These types of nodes connect homologous genes in the
    graph. However, you can see this gene did not cluster with any other
    gene.
7.  Hover over the **start** relation of the *matK* gene node. As you
    can see, information is not only stored in nodes, but also in
    relationships! A relationship always has a certain direction; in
    this case the relation starts at the gene node and points to a
    nucleotide node. Offset marks the location within the node.
8.  Double-click on the **nucleotide** node at the end of the 'start'
    relationship. An incoming and an outgoing relation appear that connect
    to other nucleotide nodes. Hover over both relations and compare
    them. The relations hold the genomic coordinates and show that this
    path only occurs in contig/sequence 1 of genome 1.
9.  Follow the outgoing **FF**-relationship to the next nucleotide node
    and expand this node by double-clicking. Three nodes will pop up
    this time. If you hover over the relations you see the coordinates
    belong to other genomes as well. You may also notice that the
    relationships between nucleotide nodes are always a two-letter
    combination of F (forward) and R (reverse), which states whether a
    sequence is reverse complemented or not. The first letter
    corresponds to the sequence of the node at the start of the
    relation, whereas the second letter refers to the sequence of the
    end node.
10. Finally, execute the following query to call the database schema and
    see how all node types are connected to each other: *CALL
    db.schema()*. The schema will be useful when designing your own
    queries!

--------------

Query the pangenome database using CYPHER
-----------------------------------------

Cypher is a declarative, SQL-inspired language that uses ASCII art to
represent patterns. Nodes are written between parentheses, which resemble
circles, and relationships are written as arrows.

-  The **MATCH** clause allows you to specify the patterns Neo4j will
   search for in the database.
-  With **WHERE** you can add constraints to the patterns described.
-  In the **RETURN** clause you define which parts of the pattern to
   display.

Cypher queries
~~~~~~~~~~~~~~

**Match and return 100 nucleotide nodes**

.. code:: text

   MATCH (n:nucleotide) RETURN n LIMIT 100

**Find all the genome nodes**

.. code:: text

   MATCH (n:genome) RETURN n

**Find the pangenome node**

.. code:: text

   MATCH (n:pangenome) RETURN n

**Match and return 100 genes**

.. code:: text

   MATCH (g:gene) RETURN g LIMIT 100

**Match and return 100 genes and order them by length**

.. code:: text

   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

**The same query as before but results are now returned in a table**

.. code:: text

   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

**Return genes which are longer than 100 but shorter than 250 bp** (this
can also be applied to other features such as exons, introns or CDS)

.. code:: text

   MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100

**Find genes located on the first genome**

.. code:: text

   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

**Find genes located on the first genome and first sequence**

.. code:: text

   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

--------------

Homology group queries
~~~~~~~~~~~~~~~~~~~~~~

**Return 100 homology groups**

.. code:: text

   MATCH (h:homology_group) RETURN h LIMIT 100

**Match homology groups which contain two members**

.. code:: text

   MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h

**Match homology groups and 'walk' to the genes and corresponding start
and end node**

.. code:: text

   MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide)
   WHERE h.num_members = 2
   RETURN * LIMIT 25

Turn off autocomplete by clicking on the button on the bottom right. The
graph falls apart because relations were not assigned to variables.

**The same query as before but now the relations do have variables**

.. code:: text

   MATCH (h:homology_group)-[r1]->(f:feature)<-[r2]-(g:gene)-[r3]->(n:nucleotide)
   WHERE h.num_members = 2
   RETURN * LIMIT 25

When you turn off autocomplete again, only the '*is_similar_to*' relation
disappears, since it is not part of the query.

**Find the homology group that the rpoC1 gene belongs to**

.. code:: text

   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *

**Find genes on genome 1 which don't show homology**

.. code:: text

   MATCH (n:homology_group)--(m:mRNA)--(g:gene)
   WHERE n.num_members = 1 AND g.genome = 1
   RETURN *

--------------

Structural variant detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Find SNP bubbles (for simplification we only use the FF relation)**

.. code:: text

   MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n)
   RETURN * LIMIT 50

**The same query but returning the results in a table**

.. code:: text

   MATCH (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n)
   RETURN a1.length, b1.length, a1.sequence, b1.sequence LIMIT 50

Functions such as **count()**, **sum()** and **stDev()** can be used in
a query.

**The same SNP query but count the hits instead of displaying them**

.. code:: text

   MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n)
   RETURN count(p)

--------------

Hopefully you now have some feeling for the Neo4j browser and Cypher and
you're inspired to create your own queries! When you're done working in
the browser, close the database (by using the command line again).

.. code:: bash

   $ neo4j stop

| More information on Neo4j and the Cypher language:
| `Neo4j Cypher Manual v3.5 `_
| `Neo4j Cypher Refcard `_
| `Neo4j API `_

--------------

In :doc:`part 4 ` of the tutorial we explore some of the functionalities
to analyze the pangenome.
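
The neo4j.conf edits described under 'Allowing non-local connections' can
also be scripted. The sketch below is one possible way to do this with
``sed``; it is not part of Neo4j or PanTools. The function name
``enable_remote_access`` and its argument order are hypothetical, and the
sketch assumes the four connector lines appear in the config file in the
form shown earlier in this part.

.. code:: bash

   # Sketch: uncomment the four connector lines in a neo4j.conf file and
   # set custom ports. Function name and arguments are illustrative only.
   enable_remote_access() {
       local conf="$1" bolt="$2" http="$3" https="$4"

       # Keep a backup so the original settings can be restored.
       cp "$conf" "$conf.bak" || return 1

       # Read from the backup and write the edited result over the config.
       # '^#*' also matches lines that were already uncommented.
       sed \
           -e 's|^#*dbms\.connectors\.default_listen_address=.*|dbms.connectors.default_listen_address=0.0.0.0|' \
           -e "s|^#*dbms\.connector\.bolt\.listen_address=.*|dbms.connector.bolt.listen_address=:${bolt}|" \
           -e "s|^#*dbms\.connector\.http\.listen_address=.*|dbms.connector.http.listen_address=:${http}|" \
           -e "s|^#*dbms\.connector\.https\.listen_address=.*|dbms.connector.https.listen_address=:${https}|" \
           "$conf.bak" > "$conf"
   }

For example, ``enable_remote_access neo4j-community-3.5.30/conf/neo4j.conf 7688 7475 7476``
would switch the bolt, http and https connectors to those three ports
(arbitrary example numbers) and leave all other lines, including
'dbms.directories.data', untouched. The browser address then becomes
*server_address:7475*.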