Part 3. Explore the pangenome using the Neo4j browser
=====================================================

Did you skip :doc:`part 2 <tutorial_part2>` of the tutorial or were you
unable to build the chloroplast pangenome? Download the pre-constructed
pangenome
`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
or via wget.

.. code:: bash

   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
   $ tar -xvzf chloroplast_DB.tar.gz

--------------

Configuring Neo4j
-----------------

Set the full path to the chloroplast pangenome database by opening
neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and including the
following line in the config file. Make sure there is only ever a single
uncommented line with 'dbms.directories.data'.

.. code:: text

   #dbms.directories.data=/YOUR_PATH/any_other_database
   dbms.directories.data=/YOUR_PATH/chloroplast_DB
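
One way to check that only one 'dbms.directories.data' line is active is
to count the uncommented occurrences in the config file. The snippet below
is a minimal sketch and not part of PanTools or Neo4j: the function name
*count_active_data_dirs* is hypothetical, and the path assumes that you run
it from the directory that contains *neo4j-community-3.5.30*.

.. code:: bash

   # Hypothetical helper: count the uncommented 'dbms.directories.data'
   # lines in a Neo4j config file. The count should be exactly 1.
   count_active_data_dirs() {
       # grep -c prints the number of matching lines; '|| true' keeps a
       # count of 0 from being treated as a failure.
       grep -c '^[[:space:]]*dbms\.directories\.data=' "$1" || true
   }

   # Assumed location of the config file; adjust it to your installation.
   conf=neo4j-community-3.5.30/conf/neo4j.conf
   if [ -f "$conf" ]; then
       count_active_data_dirs "$conf"
   fi

If the printed number is not 1, comment out or add lines in the config
file until exactly one remains active.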

| **Allowing non-local connections**
| To be able to run Neo4j on a server and have access to it from
  anywhere, some additional lines in the config file must be changed.

-  **Uncomment** the four following lines in
   neo4j-community-3.5.30/conf/neo4j.conf.
-  Replace 7687, 7474, and 7473 with three different numbers that are not
   in use by other people on your server. This way, everyone can have
   their own database running at the same time.

.. code:: text

   #dbms.connectors.default_listen_address=0.0.0.0
   #dbms.connector.bolt.listen_address=:7687
   #dbms.connector.http.listen_address=:7474
   #dbms.connector.https.listen_address=:7473
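
Instead of editing the file by hand, the same change can be scripted with
*sed*. This is only a sketch under assumptions: the function name
*set_neo4j_ports* is hypothetical, the ports in the example call are
arbitrary, and the script expects the four lines to look exactly like the
defaults shown above.

.. code:: bash

   # Hypothetical helper: uncomment the four connector lines and set
   # custom ports. Arguments: config file, bolt port, http port, https port.
   set_neo4j_ports() {
       conf="$1"; bolt="$2"; http="$3"; https="$4"
       # -i.bak edits the file in place and keeps a backup as <file>.bak
       sed -i.bak \
           -e 's|^#dbms\.connectors\.default_listen_address=|dbms.connectors.default_listen_address=|' \
           -e "s|^#dbms\.connector\.bolt\.listen_address=:7687|dbms.connector.bolt.listen_address=:${bolt}|" \
           -e "s|^#dbms\.connector\.http\.listen_address=:7474|dbms.connector.http.listen_address=:${http}|" \
           -e "s|^#dbms\.connector\.https\.listen_address=:7473|dbms.connector.https.listen_address=:${https}|" \
           "$conf"
   }

   # Example call with arbitrary example ports:
   # set_neo4j_ports neo4j-community-3.5.30/conf/neo4j.conf 7697 7484 7483

The backup file makes it easy to restore the original config if the
result is not what you expected.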

Let's start up the Neo4j server!

.. code:: bash

   $ neo4j start

Start Firefox (or a web browser of your own preference) and let it run
in the background.

.. code:: bash

   $ firefox &

If you did not change the config to allow non-local connections,
browse to *http://localhost:7474*. If you did change the config
file, go to *server_address:7474*, where 7474 should be replaced with the
HTTP port number you chose earlier.

If the database startup was successful, a login form will appear in
the webpage. Use '*neo4j*' as both username and password. After logging
in, you are asked to set a new password.

--------------

Exploring nodes and edges in Neo4j
----------------------------------

Go through the following steps to become proficient in using the Neo4j
browser and the underlying PanTools data structure. If you have any
trouble finding a node, relationship or any other type of information,
download and use `this visual guide
<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.

1.  Click on the database icon on the left. A menu with all node types
    and relationship types will appear.
2.  Click on the '*gene*' button in the node label section. This
    automatically generates a query. Execute the query.
3.  The **LIMIT** clause caps the number of nodes that are returned,
    which keeps your web browser from crashing on large result sets. Set
    LIMIT to 10 and execute the query.
4.  Hover over the nodes, click on them and take a look at the values
    stored in the nodes. All these features (except ID) were extracted
    from the GFF annotation files. ID is a unique number automatically
    assigned to nodes and relationships by Neo4j.
5.  Double-click on the **matK** gene node. All nodes with a connection
    to this gene node will appear. The nodes have distinct colors because
    they are different node types, such as **mRNA**, **CDS** and
    **nucleotide**. Take a look at the node properties to observe that
    most values and information are specific to a certain node type.
6.  Double-click on the *matK* mRNA node. A **homology_group** node
    should appear. This type of node connects homologous genes in the
    graph. However, you can see that this gene did not cluster with any
    other gene.
7.  Hover over the **start** relation of the *matK* gene node. As you
    can see, information is not only stored in nodes, but also in
    relationships! A relationship always has a certain direction. In
    this case, the relation starts at the gene node and points to a
    nucleotide node. Offset marks the location within the node.
8.  Double-click on the **nucleotide** node at the end of the 'start'
    relationship. An incoming and an outgoing relation appear that
    connect to other nucleotide nodes. Hover over both relations and
    compare them. The relations hold the genomic coordinates and show
    that this path only occurs in contig/sequence 1 of genome 1.
9.  Follow the outgoing **FF**-relationship to the next nucleotide node
    and expand this node by double-clicking. Three nodes will pop up
    this time. If you hover over the relations, you see that the
    coordinates belong to other genomes as well. You may also notice
    that the relationship between nucleotide nodes is always a
    two-letter combination of F (forward) and R (reverse), which states
    whether a sequence is reverse complemented or not. The first letter
    corresponds to the sequence of the node at the start of the
    relation, whereas the second letter refers to the sequence of the
    end node.
10. Finally, execute the following query to call the database schema and
    see how all node types are connected to each other: *CALL
    db.schema()*. The schema will be useful when designing your own
    queries!
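
To make the F and R labels from step 9 concrete: the reverse complement
of a sequence is obtained by reversing it and swapping A with T and C
with G. The snippet below is a small illustration only. The function name
*revcomp* and the example sequence are made up, and it assumes the *rev*
and *tr* tools are installed.

.. code:: bash

   # Hypothetical helper: print the reverse complement of a DNA sequence.
   revcomp() {
       # 'rev' reverses the string, 'tr' swaps each base for its complement.
       printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
   }

   revcomp AAGC   # prints GCTT

You can try it on the *sequence* property of a nucleotide node to see
what the node looks like when it is traversed in the reverse direction.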

--------------

Query the pangenome database using CYPHER
-----------------------------------------

Cypher is a declarative, SQL-inspired language that uses ASCII art to
represent patterns. Nodes are written between parentheses, which resemble
circles, and relationships are written as arrows.

-  The **MATCH** clause allows you to specify the patterns Neo4j will
   search for in the database.
-  With **WHERE** you can add constraints to the patterns described.
-  In the **RETURN** clause you define which parts of the pattern to
   display.
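
The three clauses always appear in this order. As a rough sketch of how
they fit together, the hypothetical shell function *build_query* below
assembles a query string from a pattern, an optional constraint and a
return expression. The commented last line assumes that the *cypher-shell*
tool from the Neo4j *bin* directory is on your PATH and that
YOUR_PASSWORD is the password you set earlier.

.. code:: bash

   # Hypothetical helper: assemble a Cypher query from its three clauses.
   # Arguments: pattern, constraint (may be empty), return expression.
   build_query() {
       if [ -n "$2" ]; then
           printf 'MATCH %s WHERE %s RETURN %s\n' "$1" "$2" "$3"
       else
           printf 'MATCH %s RETURN %s\n' "$1" "$3"
       fi
   }

   query=$(build_query '(g:gene)' 'g.length > 100' 'g.name, g.length')
   echo "$query"
   # To run the query outside the browser (assumes cypher-shell is on your PATH):
   # cypher-shell -u neo4j -p YOUR_PASSWORD "$query"

The printed string can also be pasted directly into the Neo4j browser.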

Cypher queries
~~~~~~~~~~~~~~

**Match and return 100 nucleotide nodes**

.. code:: text

   MATCH (n:nucleotide) RETURN n LIMIT 100

**Find all the genome nodes**

.. code:: text

   MATCH (n:genome) RETURN n

**Find the pangenome node**

.. code:: text

   MATCH (n:pangenome) RETURN n

**Match and return 100 genes**

.. code:: text

   MATCH (g:gene) RETURN g LIMIT 100

**Match and return 100 genes and order them by length**

.. code:: text

   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

**The same query as before but results are now returned in a table**

.. code:: text

   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

**Return genes which are longer than 100 but shorter than 250 bp** (this
can also be applied to other features such as exons, introns or CDS)

.. code:: text

   MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100

**Find genes located on the first genome**

.. code:: text

   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

**Find genes located on the first genome and first sequence**

.. code:: text

   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

--------------

Homology group queries
~~~~~~~~~~~~~~~~~~~~~~

**Return 100 homology groups**

.. code:: text

   MATCH (h:homology_group) RETURN h LIMIT 100

**Match homology groups which contain two members**

.. code:: text

   MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h

**Match homology groups and 'walk' to the genes and corresponding start
and end node**

.. code:: text

   MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25

Turn off autocomplete by clicking on the button at the bottom right. The
graph falls apart because the relations were not assigned to variables
and are therefore not returned.

**The same query as before but now the relations do have variables**

.. code:: text

   MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25

When you turn off autocomplete again, only the '*is_similar_to*' relation
disappears, since we did not include it in the query.

**Find the homology group that belongs to the rpoC1 gene**

.. code:: text

   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *

**Find genes on genome 1 which don't show homology**

.. code:: text

   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 AND g.genome = 1 RETURN *

--------------

Structural variant detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Find SNP bubbles (for simplification we only use the FF relation)**

.. code:: text

   MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN * LIMIT 50

**The same query but returning the results in a table**

.. code:: text

   MATCH (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN a1.length, b1.length, a1.sequence, b1.sequence LIMIT 50

Functions such as **count()**, **sum()** and **stDev()** can be used in
a query.

**The same SNP query but count the hits instead of displaying them**

.. code:: text

   MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN count(p)

--------------

Hopefully you now have some feeling for the Neo4j browser and Cypher
and you're inspired to create your own queries!

When you're done working in the browser, stop the database from the
command line.

.. code:: bash

   $ neo4j stop

| More information on Neo4j and the Cypher language:
| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
| `Neo4j API <https://neo4j.com/developer/>`_

--------------

In :doc:`part 4 <tutorial_part4>` of the tutorial we explore some of the
functionalities to analyze the pangenome.