Querying the pangenome
Cypher is Neo4j’s graph query language. It lets you ask specific questions of the graph database or retrieve data from it. A Cypher query describes patterns of nodes and relationships and filters those patterns by labels and properties. Using node and relationship patterns in database queries may seem a little daunting at first, but it is easy to pick up. This page contains some example queries to help you get started. Feel free to email us if you have any questions about Cypher queries.
More information on Neo4j and the Cypher language:
Match and return 100 nucleotide nodes
MATCH (n:nucleotide) RETURN n LIMIT 100
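Queries like the one above can also be run from a script. The following is a minimal sketch, assuming the official `neo4j` Python driver is installed; the helper name `fetch_rows`, the connection address and the credentials are illustrative assumptions, not part of the pangenome tooling.

```python
def fetch_rows(driver, query, **params):
    """Run a Cypher query and return every result row as a plain dict.

    `driver` is expected to be a neo4j.Driver (or any object exposing the
    same session()/run() interface). Keyword arguments are passed to the
    query as Cypher parameters.
    """
    with driver.session() as session:
        result = session.run(query, params)
        # Consume the result while the session is still open.
        return [record.data() for record in result]


# Example usage (hypothetical address and credentials, adjust to your setup):
#
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "password"))
#   rows = fetch_rows(driver, "MATCH (n:nucleotide) RETURN n LIMIT 100")
#   driver.close()
```

Passing the driver in as an argument keeps the helper independent of any particular connection setup.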
Find all the genome nodes
MATCH (n:genome) RETURN n
Retrieve the pangenome node
MATCH (n:pangenome) RETURN n
Match and return 100 genes
MATCH (g:gene) RETURN g LIMIT 100
Match genes, order them by length (longest first) and return the first 100
MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
The same query as before, but the results are now returned as a table
MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
Return genes that are between 100 and 250 bp long. The same filter can be applied to other features such as exons, introns or CDS.
MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN * LIMIT 100
Find genes located on the first genome
MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
Find genes located on the first sequence of the first genome
MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
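The genome and sequence numbers in the two queries above are hard-coded. As a sketch of how they could be supplied as Cypher parameters (`$genome`, `$sequence`, `$limit`) instead, the hypothetical helper below builds the query text and a matching parameter dictionary. It relies only on what the queries above show: `address[0]` holds the genome number and `address[1]` the sequence number.

```python
def genes_on(genome, sequence=None, limit=100):
    """Build a parameterised gene query for one genome, optionally one sequence.

    Returns a (query, params) pair that can be handed to a Neo4j client.
    """
    conditions = ["g.address[0] = $genome"]
    params = {"genome": int(genome), "limit": int(limit)}
    if sequence is not None:
        # Restrict the match to a single sequence (contig) of that genome.
        conditions.append("g.address[1] = $sequence")
        params["sequence"] = int(sequence)
    query = (
        "MATCH (g:gene) WHERE "
        + " AND ".join(conditions)
        + " RETURN g LIMIT $limit"
    )
    return query, params


query, params = genes_on(genome=1, sequence=1)
print(query)
print(params)
```

Using parameters rather than editing the numbers in the query string avoids quoting mistakes and lets the database reuse the query plan.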
Obtain all genes between 100 and 250 nucleotides long, without limiting the number of results
MATCH (g:gene) WHERE g.length > 100 AND g.length < 250 RETURN *
Return pfam identifiers for mRNAs between 100 and 150 nucleotides long
MATCH (n:mRNA)--(m:pfam) WHERE n.length > 100 AND n.length < 150 RETURN m.id
Count the genes on a specific contig (here the first sequence of the first genome)
MATCH (n:gene) WHERE n.address[0] = 1 AND n.address[1] = 1 RETURN count(n)
Return all genes between 1000 and 1500 nucleotides long and order them by length
MATCH (n:gene) WHERE n.length > 1000 AND n.length < 1500 RETURN n ORDER BY n.length DESC
Return the homology group that contains your gene of interest
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'GENE_NAME' RETURN *
Return the genes of genome 1 that don’t have a homolog in the other genome
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 AND g.genome = 1 RETURN g
Retrieve the unique GO identifiers for mRNAs with a signal peptide
MATCH (m:mRNA)--(g:GO) WHERE m.signalp_signal_peptide = true RETURN DISTINCT m.id, g.id
Return all sequence nodes for a specific contig
MATCH (n)-[r]->() WHERE exists(r.`a1_1`) AND (n:degenerate OR n:node) RETURN id(n), n.sequence, r.`a1_1`
Return all sequence nodes for a specific contig between positions 1000 and 2000
MATCH (n)-[r]->() WHERE exists(r.`a1_1`) AND (n:degenerate OR n:node) AND r.`a1_1`[0] > 1000 AND r.`a1_1`[0] < 2000 RETURN id(n), n.sequence, r.`a1_1`
Find SNP bubbles in the graph. For simplicity, only the FF relationship type is used.
MATCH p = (n:nucleotide)-[:FF]->(a1)-[:FF]->(m:nucleotide)<-[:FF]-(b1)<-[:FF]-(n) RETURN * LIMIT 50
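To illustrate what the bubble pattern matches, here is a small sketch that looks for the same shape in an in-memory edge list rather than in the database. The function name and the toy edges are made up for illustration, and the sketch ignores node labels. A bubble is a start node with two different successors that both lead to the same end node in one further step.

```python
from itertools import combinations


def find_snp_bubbles(ff_edges):
    """Return (start, branch_a, branch_b, end) tuples for two-step bubbles.

    `ff_edges` is an iterable of (source, target) pairs, standing in for
    the FF relationships of the graph.
    """
    successors = {}
    for src, dst in ff_edges:
        successors.setdefault(src, set()).add(dst)

    bubbles = []
    for start in sorted(successors):
        # Every unordered pair of distinct branches leaving `start`.
        for a, b in combinations(sorted(successors[start]), 2):
            # Nodes reachable from both branches in a single step.
            shared = successors.get(a, set()) & successors.get(b, set())
            for end in sorted(shared):
                bubbles.append((start, a, b, end))
    return bubbles


# Toy graph: n splits into A and C, which rejoin at m (a single SNP).
edges = [("n", "A"), ("n", "C"), ("A", "m"), ("C", "m"), ("m", "x")]
print(find_snp_bubbles(edges))
```

Unlike the Cypher query, which should return each bubble twice (once for each assignment of the two branches to a1 and b1), this sketch reports every bubble once.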