Querying the pangenome
======================

Cypher is Neo4j’s graph query language. It lets you ask specific
questions of the graph database or retrieve data from it. A Cypher query
describes patterns of nodes and relationships and filters those
patterns on labels and properties. Using node and relationship patterns
in database queries may seem a little daunting at first, but it is easy
to pick up. This page contains some example queries to help you
get started. Feel free to email us if you have any questions about
Cypher queries.

More information on Neo4j and the Cypher language:

| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
| `Neo4j API <https://neo4j.com/developer/>`_
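
As a rough sketch of how such a pattern is put together, the annotated
query below breaks a simple example into its parts. It only reuses the
``gene`` label and the ``name`` and ``length`` properties that appear in
the examples on this page; the threshold of 500 is arbitrary and chosen
for illustration.

.. code:: text

   // (g:gene)  a node bound to the variable g, restricted to the label gene
   // WHERE     keeps only the matched nodes whose property passes the test
   // RETURN    selects what is reported; LIMIT caps the number of rows
   MATCH (g:gene)
   WHERE g.length > 500
   RETURN g.name, g.length
   LIMIT 10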

**Match and return 100 nucleotide nodes**

.. code:: text

   MATCH (n:nucleotide) RETURN n LIMIT 100

**Find all the genome nodes**

.. code:: text

   MATCH (n:genome) RETURN n

**Retrieve the pangenome node**

.. code:: text

   MATCH (n:pangenome) RETURN n

**Match and return 100 genes**

.. code:: text

   MATCH (g:gene) RETURN g LIMIT 100

**Match and return the 100 longest genes, ordered by length**

.. code:: text

   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

**The same query as before but results are now returned in a table**

.. code:: text

   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

**Return genes that are between 100 and 250 bp long. This can also be
applied to other features such as exons, introns or CDS.**

.. code:: text

   MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100

**Find genes located on first genome**

.. code:: text

   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

**Find genes located on first genome and first sequence**

.. code:: text

   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

**Obtain all genes between 100 and 250 nucleotides (no row limit)**

.. code:: text

   MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN *

**Return pfam identifiers for mRNAs between 100 and 150 nucleotides
long**

.. code:: text

   MATCH (n:mRNA)--(m:pfam) WHERE n.length > 100 AND n.length < 150 RETURN m.id

**Count all genes on a specific contig (here genome 1, sequence 1)**

.. code:: text

   MATCH (n:gene) WHERE n.address[0] = 1 AND n.address[1] = 1 RETURN count(n)
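
The same idea can be extended to count genes per genome in one query.
The sketch below assumes, as in the examples above, that the first
element of ``address`` is the genome number. Cypher groups the rows by
every non-aggregated expression in ``RETURN``, so no explicit grouping
clause is needed. The aliases ``genome`` and ``gene_count`` are
illustrative names only.

.. code:: text

   MATCH (g:gene)
   RETURN g.address[0] AS genome, count(g) AS gene_count
   ORDER BY genome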

**Return all genes between 1000 and 1500 nucleotides long and order them
by length**

.. code:: text

   MATCH (n:gene) WHERE n.length > 1000 AND n.length < 1500 RETURN n ORDER BY n.length DESC

**Return the homology group that contains your gene of interest**

.. code:: text

   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'GENE_NAME' RETURN *
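
If the same query is run for several genes, the name can be passed as a
query parameter instead of being edited into the query text. The sketch
below assumes the queries are run in the Neo4j Browser, where ``:param``
sets a parameter; the parameter name ``gene_name`` is an arbitrary
choice and ``GENE_NAME`` remains a placeholder. Depending on the Browser
settings, the two commands may need to be run one after the other rather
than together.

.. code:: text

   :param gene_name => 'GENE_NAME'

   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = $gene_name RETURN *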

**Return the genes of genome 1 that don’t have a homolog in the other
genome**

.. code:: text

   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 AND g.genome = 1 RETURN g

**Retrieve unique GO identifiers for mRNAs with a signal peptide**

.. code:: text

   MATCH (m:mRNA)--(g:GO) WHERE m.signalp_signal_peptide = true RETURN DISTINCT m.id, g.id

**Return all sequence nodes for a specific contig**

.. code:: text

   MATCH (n)-[r]->() WHERE exists(r.`a1_1`) AND (n:degenerate OR n:node) RETURN id(n), n.sequence, r.`a1_1`

**Return all sequence nodes for a specific contig within the range of
position 1000 and 2000**

.. code:: text

   MATCH (n)-[r]->() WHERE exists(r.`a1_1`) AND (n:degenerate OR n:node) AND r.`a1_1`[0] > 1000 AND r.`a1_1`[0] < 2000 RETURN id(n), n.sequence, r.`a1_1`

**Find SNP bubbles in the graph. For simplification we only use the FF
relation**

.. code:: text

   MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN * LIMIT 50
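
Because the two branches of a bubble are interchangeable, the pattern
above matches each bubble twice, once for each assignment of ``a1`` and
``b1``. One possible refinement, sketched here, keeps a single match per
bubble by comparing the internal node ids of the two branch nodes.

.. code:: text

   MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n)
   WHERE id(a1) < id(b1)
   RETURN * LIMIT 50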