Querying the pangenome

Cypher is Neo4j’s graph query language, which lets you ask specific questions of the graph database or retrieve data from it. A Cypher query describes patterns of nodes and relationships and filters those patterns based on labels and properties. While using node and relationship patterns in database queries may seem a little daunting, it is easy to pick up! This page contains some example queries to help you get started. Feel free to email us if you have any questions regarding Cypher queries.
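As a rough illustration of how such a pattern is put together, the annotated query below combines a node-relationship pattern, a property filter, and a return clause. It only reuses labels and properties that appear in the examples on this page (mRNA, gene, length, name, id); whether all of them are populated depends on how your pangenome was built and annotated.

```cypher
// (variable:label) describes a node; -- is a relationship in either direction.
MATCH (m:mRNA)--(g:gene)   // pattern: an mRNA node connected to a gene node
WHERE g.length > 100       // filter the matches on a node property
RETURN g.name, m.id        // return selected properties as table columns
LIMIT 10                   // cap the number of rows
```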

More information on Neo4j and the Cypher language is available in the official Neo4j documentation.

Match and return 100 nucleotide nodes

MATCH (n:nucleotide) RETURN n LIMIT 100

Find all the genome nodes

MATCH (n:genome) RETURN n

Retrieve the pangenome node

MATCH (n:pangenome) RETURN n

Match and return 100 genes

MATCH (g:gene) RETURN g LIMIT 100

Match and return 100 genes and order them by length

MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100

The same query as before, but the results are now returned as a table

MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100

Return genes that are between 100 and 250 bp long. This can also be applied to other features such as exons, introns, or CDS.

MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100
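For example, the same length filter could be applied to exons. This is only a sketch: it assumes that exon nodes carry the label exon and a length property, in the same way gene nodes do.

```cypher
// Assumption: exon nodes are labelled "exon" and have a "length" property.
MATCH (e:exon) WHERE e.length > 100 AND e.length < 250 RETURN * LIMIT 100
```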

Find genes located on the first genome

MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100

Find genes located on the first genome and its first sequence

MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100

Obtain all genes between 100 and 250 nucleotides long, without limiting the number of results

MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN *

Return pfam identifiers for mRNAs between 100 and 150 nucleotides long

MATCH (n:mRNA)--(m:pfam) WHERE n.length > 100 AND n.length < 150 RETURN m.id

Count all genes on a specific contig (here the first sequence of the first genome)

MATCH (n:gene) WHERE n.address[0] = 1 AND n.address[1] = 1 RETURN count(n)
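A possible variation, shown here as a sketch, counts the genes of every sequence in the first genome at once instead of one contig at a time. It assumes, as in the queries above, that address[0] holds the genome number and address[1] the sequence number.

```cypher
// Non-aggregated return columns act as grouping keys in Cypher,
// so this yields one row per sequence of genome 1.
MATCH (n:gene)
WHERE n.address[0] = 1
RETURN n.address[1] AS sequence, count(n) AS genes
ORDER BY sequence
```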

Return all genes between 1000 and 1500 nucleotides long and order them by length

MATCH (n:gene) WHERE n.length > 1000 AND n.length < 1500 RETURN n ORDER BY n.length DESC

Return the homology group matching your gene of interest

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'GENE_NAME' RETURN *

Return the genes of genome 1 that don’t have a homolog in any other genome

MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 AND g.genome = 1 RETURN g

Retrieve unique GO identifiers for mRNAs with a signal peptide

MATCH (m:mRNA)--(g:GO) WHERE m.signalp_signal_peptide = true RETURN DISTINCT m.id, g.id

Return all sequence nodes for a specific contig

MATCH (n)-[r]->() WHERE exists(r.`a1_1`) AND (n:degenerate OR n:node) RETURN id(n), n.sequence, r.`a1_1`

Return all sequence nodes for a specific contig between positions 1000 and 2000

MATCH (n)-[r]->() WHERE exists(r.`a1_1`) AND (n:degenerate OR n:node) AND r.`a1_1`[0] > 1000 AND r.`a1_1`[0] < 2000 RETURN id(n), n.sequence, r.`a1_1`

Find SNP bubbles in the graph. For simplicity, only the FF relationship is used.

MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN * LIMIT 50
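Because the bubble pattern is symmetric, the same bubble can be matched twice, once with a1 and b1 swapped. One possible refinement, given here as a sketch rather than as part of the original example, is to order the two branches by their internal node id so that each bubble is reported once and a1 and b1 are guaranteed to be different nodes.

```cypher
// id(a1) < id(b1) keeps only one of the two mirrored matches
// and excludes the case where both branches bind to the same node.
MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n)
WHERE id(a1) < id(b1)
RETURN * LIMIT 50
```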