GraphRAG Is the Logical Step From RAG — So Why the Sudden Hype?
Author(s): Daniel Voyce
Originally published on Towards AI.
It seems like everyone is currently talking about GraphRAG as the successor to RAG (Retrieval-Augmented Generation) in the Generative AI / LLM world right now.
But is it actually that much of a surprise? I posit that it's not, and that it's actually a very logical progression when you look at the strengths and weaknesses of LLMs. Put simply:
Context is everything!
What is RAG?
I'm not going to go into huge detail on this, as I assume you follow the AI / LLM space if you are reading this. In a nutshell, RAG is the process whereby you feed external data into an LLM alongside your prompts, to ensure it has all of the information it needs to make decisions.
What is GraphRAG?
Microsoft recently put out a blog post discussing GraphRAG and provided an accelerator; since then, the name GraphRAG seems to have snowballed, with everyone writing a "same" blog post.
One post that caught my attention was from Philip (Neo4j's CTO), saying that they have basically done the same thing, but in a different way. It resonated with me because it's the same approach I have been working on for the past few weeks on a little LLM side project I am exploring, so this is my "same" blog post.
Why use Graphs and what are they?
Back during my time as CTO of Locally, I was introduced to GraphDB as a mechanism for defining and discovering relationships between data. Even when used as a simple definition store, it allows for depth- and breadth-first searches that help discover relationships that might not have been explicitly defined.
For example, an RDBMS, by definition, has relationships. However, how those relationships are defined doesn't always follow a particular set of rules. This is a major limitation of relational database systems: relationships need to be defined explicitly, and they cannot really be discovered.
Some databases might use foreign key constraints, some might use field naming conventions, and some might use neither; it's the wild west out there.
So, how do we define and discover relationships in data that has a haphazard way of defining them? We use a graph database that is designed for it.
There are several papers that go into this method (one is here: https://arxiv.org/pdf/2310.01080) but essentially, it is about parsing a relational database structure into a graph structure.
I won't go too deep into the secret sauce that I have built to determine the relationships between tables in an RDBMS, but I will say that it doesn't use an LLM (yet) to do this; it's all old-school coding. Why? In short, when I tested this, the hallucinations were pretty crazy, and those then get built into the application at the ground level, providing false information to every prompt that follows. Not ideal for something that needs to return truthful answers.
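As an illustration of what such old-school discovery can look like, here is a minimal sketch (my own assumption, not the author's actual code) that reads declared foreign keys out of a SQLite database using the PRAGMA foreign_key_list interface. A naming-convention pass could be layered on top for schemas without declared constraints.

```python
import sqlite3

def discover_relationships(conn):
    """Collect parent->child links from declared foreign keys.

    Illustrative only: real schemas often need heuristics (e.g. matching
    `<Table>Id` column names) on top of declared constraints.
    """
    cur = conn.cursor()
    # Materialise the table list first so we can reuse the cursor below.
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    relations = []
    for table in tables:
        # Row layout: (id, seq, referenced_table, from_col, to_col, ...)
        for fk in cur.execute(f"PRAGMA foreign_key_list('{table}')"):
            relations.append({
                "parent_table": fk[2], "parent_column": fk[4],
                "child_table": table, "child_column": fk[3],
                "relationship_type": "HAS_MANY",
            })
    return relations

# Tiny demo schema in the style of the Chinook example used later.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY);
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY,
                    ArtistId INTEGER REFERENCES Artist(ArtistId));
""")
rels = discover_relationships(conn)
```

The same idea ports to any RDBMS with an information schema; only the introspection query changes.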
The different approaches taken by Microsoft and Neo4j
Microsoft and Neo4j describe two different approaches.
Microsoft uses the LLM itself to create the graph. This is my main issue with the Microsoft approach (and I am by no means an expert in it): it puts knowledge generation in the hands of the LLM by having the LLM create the graph. I am sure that with the appropriate guardrails in place this approach will work, but with LLMs being so prone to hallucination, I would be interested in understanding more about how those guardrails protect against building hallucinations into the core of knowledge generation. Microsoft's approach is also much more focused on deep information retrieval than on specific engineering tasks.
Neo4j describes an approach whereby data is parsed into a graph database, which is then queried and provided to the LLM as additional context. This resonated with me, as it is very similar to how I have built my GraphRAG system.
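The ingestion side of such an approach can be sketched as rendering discovered relationships into Cypher MERGE statements for a graph database. This is a hedged illustration: the node label, edge properties, and overall schema here are my own assumptions, not Neo4j's published design.

```python
def to_cypher(rel):
    """Render one discovered relationship as an idempotent Cypher MERGE.

    Labels and property names are illustrative assumptions. MERGE (rather
    than CREATE) keeps repeated ingestion runs from duplicating nodes.
    """
    return (
        f"MERGE (p:Table {{name: '{rel['parent_table']}'}}) "
        f"MERGE (c:Table {{name: '{rel['child_table']}'}}) "
        f"MERGE (p)-[:{rel['relationship_type']} "
        f"{{parent_column: '{rel['parent_column']}', "
        f"child_column: '{rel['child_column']}'}}]->(c)"
    )

stmt = to_cypher({
    "parent_table": "Artist", "parent_column": "ArtistId",
    "child_table": "Album", "child_column": "ArtistId",
    "relationship_type": "HAS_MANY",
})
```

In a real pipeline the generated statements would be executed through a graph-database driver; generating them as plain strings keeps the sketch self-contained.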
My approach is more aligned with Neo4j's approach to GraphRAG, where the graph database is part of the system that provides context to the LLM.
My implementation of GraphRAG in an application
By having an external source of truth that is directly linked to the Data Model, the context provided to the LLM is as ‘truthy’ as it can be, prone only to good ol’ coding errors instead of potential LLM hallucination.
The approach is a bit more rooted in traditional methods: I parse the data model (an SQL-based relational system) into nodes and relationships in a graph database, and then provide an endpoint where those relationships can be queried as a source of truth.
This information is then collated alongside the other information points that we collect; everything is run through an embedding model and stored in a vector database.
relations = discover_relationships(
    db_url=f"sqlite:///{db_file}",
    clear_graph=True,
)

for r in relations:
    rv.train(
        documentation=f"Table {r['parent_table']} "
                      f"{r['relationship_type']} "
                      f"{r['child_table']} "
                      f"ON {r['parent_table']}.{r['parent_column']} "
                      f"= {r['child_table']}.{r['child_column']}"
    )
Or it can be queried on the fly for a particular table to return either first-level or multi-level relationships to that table.
query_relationships = discoverer.query_relationships(
    table_name="Album",
    depth=2,
)
This returns a list of all of the nodes and relationships between them that can then be provided as additional context to the LLM as required.
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Album', 'r.parent_column': 'ArtistId', 'r.child_table': 'Artist', 'r.child_column': 'ArtistId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Artist', 'r.parent_column': 'ArtistId', 'r.child_table': 'Album', 'r.child_column': 'ArtistId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Album', 'r.parent_column': 'AlbumId', 'r.child_table': 'Track', 'r.child_column': 'AlbumId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Track', 'r.parent_column': 'AlbumId', 'r.child_table': 'Album', 'r.child_column': 'AlbumId'}
{'type(r)': 'HAS_ONE', 'r.parent_table': 'Track', 'r.parent_column': 'GenreId', 'r.child_table': 'Genre', 'r.child_column': 'GenreId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Track', 'r.parent_column': 'TrackId', 'r.child_table': 'InvoiceLine', 'r.child_column': 'TrackId'}
{'type(r)': 'HAS_ONE', 'r.parent_table': 'Track', 'r.parent_column': 'MediaTypeId', 'r.child_table': 'MediaType', 'r.child_column': 'MediaTypeId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Track', 'r.parent_column': 'TrackId', 'r.child_table': 'PlaylistTrack', 'r.child_column': 'TrackId'}
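Rows like the ones above can be flattened into plain-text context for the prompt. A minimal sketch of that glue code (my own, hypothetical; only the field names mirror the query output shown):

```python
def relationships_to_context(rows):
    """Flatten graph query rows into prompt-ready text, one join per line.

    Hypothetical helper: field names match the query output above, but
    the exact wording of the context is an assumption.
    """
    lines = []
    for r in rows:
        lines.append(
            f"{r['r.parent_table']} {r['type(r)']} {r['r.child_table']} "
            f"ON {r['r.parent_table']}.{r['r.parent_column']} = "
            f"{r['r.child_table']}.{r['r.child_column']}"
        )
    return "\n".join(lines)

rows = [{'type(r)': 'HAS_MANY', 'r.parent_table': 'Album',
         'r.parent_column': 'ArtistId', 'r.child_table': 'Artist',
         'r.child_column': 'ArtistId'}]
context = relationships_to_context(rows)
```

The resulting text can be appended to the system prompt or the retrieval context alongside the vector-search results.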
Advanced features
I have also coded endpoints that allow for deep path traversal, to help establish whether a path between two nodes exists. This is great to use as a validation checkpoint for SQL returned from the LLM: I can check whether the queries it generates are actually possible, and even provide deep relationship paths for very complex databases.
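The validation step can be sketched as a breadth-first search over the discovered relationships. In the real system a graph query (e.g. Cypher's shortestPath) would do this server-side; this local, in-memory version is an illustrative assumption, not the author's implementation.

```python
from collections import deque

def join_path_exists(relations, start, goal, max_depth=4):
    """BFS check that two tables are connected by some chain of joins.

    Sketch only: treats relationships as undirected since SQL joins can
    traverse a foreign key in either direction.
    """
    adjacency = {}
    for r in relations:
        adjacency.setdefault(r["parent_table"], set()).add(r["child_table"])
        adjacency.setdefault(r["child_table"], set()).add(r["parent_table"])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        table, depth = queue.popleft()
        if table == goal:
            return True
        if depth < max_depth:
            for nxt in adjacency.get(table, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return False

rels = [
    {"parent_table": "Artist", "child_table": "Album"},
    {"parent_table": "Album", "child_table": "Track"},
]
```

If an LLM-generated query joins Artist to Track, the check passes; a join to a table with no path can be rejected before the SQL is ever executed.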
Why not use an LLM for this?
My goal for this is to have a source of truth for the relationships in the database. If you use an LLM to produce this without some form of third-party checking, you run the risk of introducing errors at a very low level.
In the future, I will bring in an LLM to help decode some of the more esoteric database structures and ensure that I can verify the output that it creates through a non-LLM process.
Quick Example of a GraphRAG system
So, what does this look like in action? Hopefully, much the same as pretty much every other system that implements RAG; it's what happens in the background that's important. This is a quick Streamlit application I threw together to test it out.
It uses Ollama locally with Llama-3-8b as the model. It's a bit clunky, but overall it demonstrates a level of accuracy that is unobtainable without programming the database structure into the LLM and defining the relationships in advance.
Conclusion
GraphRAG is another way to provide additional context to an LLM, helping it not only link disparate data but also discover new linkages that might not be obvious.
I don't think GraphRAG will be the last step in the LLM augmentation space. It was a logical progression, in my opinion, so much so that I don't really understand the hype around it. Microsoft's implementation is very different from (and much more complex than) what both Neo4j and I have written about, and I expect even more impressive systems to surface over the coming months.
Anything that can add the required context to an LLM and help it generate more accurate output will be brought into the ecosystem.
The supporting facets of Generative AI platforms are becoming just as important as the LLMs themselves. Context is everything, and providing that context efficiently will only become more important as we ask more of these models.
About the author
Dan is a Principal of Data Engineering at Slalom, focusing on modernizing customer data landscapes, machine learning, and AI.
He is a start-up veteran of over 20 years, delivering solutions for some of the largest names in Australia and the UK.