Case Study · On-Prem / Edge · Tool Calling
Helping Rocketgraph's customers with an OpenCypher-specialized small language model


Rocketgraph was founded in 2014 by Cray executives to commercialize software developed by Pacific Northwest National Laboratory (PNNL) with funding from the DOD. The company holds four patents on algorithms for finding connections in datasets that were previously impossible to discover.

Solving some of the world’s most challenging problems—building fine-tuned forecasts, detecting fraudulent and illegal activity, and keeping IT infrastructure secure—requires analysts to uncover patterns and discover connections across massive quantities of data. The Rocketgraph platform was made for this.

Their mission is to give enterprises and government agencies the power to discover the hardest-to-find insights without hiring a command center full of rocket scientists.

WHY?

Many of Rocketgraph's customers have deployed the graph analytics platform on IBM Power hardware. Some of Rocketgraph's AI features currently rely on hosted LLMs such as ChatGPT or Claude. Given many customers' privacy concerns about sending data to these models, another solution is required: one that lets customers benefit from LLM-like performance without compromising on privacy or quality.

WHAT?

Rocketgraph partnered with distil labs to finetune a small language model (SLM) specialized in translating user questions into Rocketgraph-compliant Cypher queries. distil labs' expertise in finetuning SLMs using knowledge distillation allows Rocketgraph to benefit from LLM-level performance using models orders of magnitude smaller.

HOW?

The challenge

A key challenge in this project is that the Rocketgraph platform uses an OpenCypher variant as its query language. Language models trained on public data have a bias towards more popular languages such as Python and JavaScript and, in the case of graphs, standard Cypher. As such, general-purpose LLMs tend to miss Rocketgraph-specific functions (e.g. outdegree and indegree) and to include syntax not supported by Rocketgraph (e.g. the NONE, ALL, or ANY predicate functions).

To address this, we leveraged the Rocketgraph documentation along with a small dataset of example Rocketgraph queries to generate synthetic text-to-query pairs. Once validated against the Rocketgraph platform, these yield a large dataset of syntactically correct Rocketgraph queries, which can be used to finetune an SLM, thereby adapting it to the relevant domain.
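The generate-and-validate loop can be sketched as follows. This is a minimal illustration, not the actual pipeline: real validation means executing each candidate query against a Rocketgraph instance, which is mocked here with a simple check for the unsupported standard-Cypher predicates mentioned above.

```python
# Sketch of the generate-and-validate step. The validation call is a
# hypothetical stand-in for executing the query on Rocketgraph; here it
# only rejects queries using unsupported predicate functions.

UNSUPPORTED = ("NONE(", "ALL(", "ANY(")  # predicates Rocketgraph rejects

def validate_query(query: str) -> bool:
    """Stand-in for running the query against a Rocketgraph instance."""
    upper = query.upper()
    return not any(tok in upper for tok in UNSUPPORTED)

def filter_valid(pairs):
    """Keep only (question, query) pairs whose query validates."""
    return [(question, query) for question, query in pairs
            if validate_query(query)]

# Illustrative synthetic candidates (schema and questions are made up).
candidates = [
    ("How many orders did each customer place?",
     "MATCH (c) RETURN c, outdegree(c, PLACED) AS count"),
    ("Which customers placed only large orders?",
     "MATCH (c)-[:PLACED]->(o) WITH c, collect(o) AS os "
     "WHERE ALL(x IN os WHERE x.total > 100) RETURN c"),
]

dataset = filter_valid(candidates)  # the second pair is rejected
```

Only validated pairs enter the training set, so syntax errors from the generating model never reach the finetuning stage.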

To demonstrate this with an example, imagine you want to find the number of outgoing edges for all nodes in a graph. In standard Cypher, this can be done using the following query:

MATCH (d)-[r:EdgeType]->() RETURN d, count(r) AS count

In Rocketgraph, the idiomatic way to write this query is:

MATCH (d) RETURN d, outdegree(d, EdgeType) AS count

The challenge therefore is to make language models generate queries that adhere to Rocketgraph’s OpenCypher variant instead of standard Cypher.

The solution

The pace of development in AI has given us not only larger language models but also more powerful SLMs that are capable of achieving LLM-level performance on specific tasks once they have been specialized. Whilst LLMs can do thousands of things very well, SLMs are more than capable of solving one specific task just as well. As teams begin to scale their AI solutions, issues such as cost, efficiency and privacy come to the fore, and this is where SLMs shine. The text-to-query system needed by Rocketgraph is a task-specific use-case and therefore a perfect candidate for an SLM.

One challenge teams will encounter with both LLMs and SLMs is that the results often miss the mark because those models have not been trained on domain-specific data. Privacy constraints mean that teams cannot simply upload their data to hosted LLMs like ChatGPT. While open LLMs can provide great results as well, they are challenging to host on in-house infrastructure. Finetuning a smaller language model is one approach that solves this problem.

Finetuning language models is typically very challenging. The data requirements can be prohibitive when teams do not have high volumes (tens of thousands of examples) of high-quality, expertly labelled data, and setting up scalable, robust training pipelines requires specialist ML engineers, something not all teams have access to. The distil labs platform solves this exact problem: it makes it possible to finetune an SLM using just a prompt and a small seed dataset, dramatically reducing the effort needed to finetune and deploy an expert model.

The first problem we solved was the issue of limited data. We had a small (<100) dataset of text-to-query pairs covering only three database schemas. To build a dataset more representative of the variety of schemas Rocketgraph customers may have, we needed far more example database schemas. To achieve this, we used an open-source Text2Cypher dataset from Neo4j that contains over 900 database schemas covering a wide range of use-cases and industries, e.g. e-commerce, finance, chemistry.

We translated these schemas into Rocketgraph-compliant schemas and selected only those of sufficient complexity (those defining both nodes and edges). This resulted in 160 validated schemas covering a wide range of sectors. We now had the necessary ingredients to generate high-quality synthetic text-to-query examples. To summarise, our inputs were:

  • Example queries/schemas provided by Rocketgraph
  • Rocketgraph documentation on the fundamentals of writing queries
  • 160 example schemas (translated from a public Text2Cypher dataset)
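The schema-filtering step above can be sketched as follows. The dictionary shape is an assumption for illustration, not the actual format of the Neo4j Text2Cypher dataset.

```python
# Minimal sketch of the complexity filter: keep only translated schemas
# that define both node labels and edge types. Schema layout is assumed.

def is_sufficiently_complex(schema: dict) -> bool:
    """A schema qualifies only if it has at least one node label
    and at least one edge type."""
    return bool(schema.get("nodes")) and bool(schema.get("edges"))

# Illustrative translated schemas (names and labels are made up).
schemas = [
    {"name": "ecommerce", "nodes": ["Customer", "Order"], "edges": ["PLACED"]},
    {"name": "flat",      "nodes": ["Record"],            "edges": []},
]

usable = [s["name"] for s in schemas if is_sufficiently_complex(s)]
```

Schemas without edges are dropped because a graph query model gains little from examples with no relationships to traverse.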

Using these components, we generated a dataset of 15,000 text-to-query examples, covering a wide range of sample database schemas and validated against the Rocketgraph platform. This provided us with a high-quality dataset of syntactically correct queries. To enrich the dataset, we also rephrased some of the generated examples to align more closely with how users ask questions. For example, we should expect the same output query when asking any of the following questions:

  • List all transactions
  • Find all transactions
  • Show me all transactions
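The rephrasing augmentation amounts to pairing several natural phrasings with the same target query, as sketched below. The phrasings and the target query are illustrative.

```python
# Sketch of rephrasing augmentation: multiple phrasings of the same
# question share one target query, enlarging the dataset without
# changing the label. Query and phrasings are illustrative only.

TARGET = "MATCH (t:Transaction) RETURN t"

phrasings = [
    "List all transactions",
    "Find all transactions",
    "Show me all transactions",
]

augmented = [{"question": p, "query": TARGET} for p in phrasings]
```

Training on many surface forms of the same intent makes the finetuned model less sensitive to exactly how a user words a request.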

We then fine-tuned the IBM Granite 3.3 8b model on this dataset, yielding a model that achieves roughly 85% of the performance of far larger LLMs such as Llama 405B and Claude. This was made possible by high-quality synthetic data generation and a simple SLM finetuning workflow, both provided by the distil labs platform.
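Before finetuning, each validated (question, query) pair is typically formatted as an instruction-tuning record. The chat layout below mirrors common supervised-finetuning formats; it is an assumption for illustration, not the exact format used by the distil labs platform.

```python
# Sketch of turning a validated (question, query) pair into a
# chat-style supervised-finetuning record. The system prompt and
# message layout are assumptions, not the platform's actual format.

SYSTEM = "Translate the user's question into a Rocketgraph OpenCypher query."

def to_sft_record(question: str, query: str) -> dict:
    """Wrap one text-to-query pair as a system/user/assistant exchange."""
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
        {"role": "assistant", "content": query},
    ]}

record = to_sft_record(
    "Which node has the most outgoing FOLLOWS edges?",
    "MATCH (n) RETURN n, outdegree(n, FOLLOWS) AS c ORDER BY c DESC LIMIT 1",
)
```

The assistant turn holds the validated Rocketgraph query, so the model learns to emit the platform's OpenCypher variant, including functions like outdegree, rather than standard Cypher.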