GraphRAG Is the Logical Step From RAG — So Why the Sudden Hype?
Author(s): Daniel Voyce
Originally published on Towards AI.
It seems like everyone is currently talking about GraphRAG as the successor to RAG (Retrieval-Augmented Generation) in the Generative AI / LLM world right now.
But is it actually that much of a surprise? I posit that it's not, and that it's actually a very logical progression when you look at the strengths and weaknesses of LLMs. Put simply:
Context is everything!
What is RAG?
I'm not going to go into huge detail on this, as I assume you follow the AI / LLM space if you are reading this. In a nutshell, RAG is the process whereby you feed external data into an LLM alongside your prompts, to ensure it has all of the information it needs to make decisions.
What is GraphRAG?
Microsoft recently put out a blog post discussing GraphRAG and provided an accelerator; since then, the name GraphRAG seems to have snowballed, with everyone writing a "same" blog post.
One post that caught my attention was from Philip (Neo4j's CTO), saying that they have basically done the same thing, but in a different way. It resonated with me because it's the same approach I have been working on for the past few weeks on a little LLM side project I am exploring, so this is my "same" blog post.
Why use Graphs and what are they?
Back during my time as CTO of Locally, I was introduced to GraphDB as a mechanism for defining and discovering relationships between data. Even when used as a simple definition store, it allows for depth- and breadth-first searches that help discover relationships that might not have been explicitly defined.
For example, an RDBMS, by definition, has relationships. However, how those relationships are defined doesn't always follow a particular set of rules. This is a major limitation of relational database systems: relationships need to be defined explicitly, and they cannot really be discovered.
Some databases might use foreign key constraints, some might use field naming conventions, and some might use neither; it's the wild west out there.
So, how do we define and discover relationships in data that has a haphazard way of defining them? We use a graph database that is designed for it.
There are several papers that go into this method (one is here: https://arxiv.org/pdf/2310.01080) but essentially, it is about parsing a relational database structure into a graph structure.
I won't go too deep into the secret sauce that I have built to determine the relationships between tables in an RDBMS, but I will say that it doesn't use an LLM (yet) to do this; it's all old-school coding. Why? In short, when I tested this, the hallucinations were pretty crazy, and those then get built into the application at the ground level, providing false information to every prompt that follows. Not ideal for something that needs to return truthful answers.
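As an illustration of what such old-school discovery can look like, here is a minimal sketch (my own assumption, not the author's actual code) that reads declared foreign keys out of a SQLite database using the PRAGMA foreign_key_list interface. A naming-convention pass could be layered on top for schemas without declared constraints.

```python
import sqlite3

def discover_relationships(conn):
    """Collect parent->child links from declared foreign keys.

    Illustrative only: real schemas often need heuristics (e.g. matching
    `<Table>Id` column names) on top of declared constraints.
    """
    cur = conn.cursor()
    # Materialise the table list first so we can reuse the cursor below.
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    relations = []
    for table in tables:
        # Row layout: (id, seq, referenced_table, from_col, to_col, ...)
        for fk in cur.execute(f"PRAGMA foreign_key_list('{table}')"):
            relations.append({
                "parent_table": fk[2], "parent_column": fk[4],
                "child_table": table, "child_column": fk[3],
                "relationship_type": "HAS_MANY",
            })
    return relations

# Tiny demo schema in the style of the Chinook example used later.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY);
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY,
                    ArtistId INTEGER REFERENCES Artist(ArtistId));
""")
rels = discover_relationships(conn)
```

The same idea ports to any RDBMS with an information schema; only the introspection query changes.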
The different approaches taken by Microsoft and Neo4j
Microsoft and Neo4j describe two different approaches.
Microsoft uses the LLM itself to create the graph. This is my main issue with the Microsoft approach (and I am by no means an expert in it): it puts knowledge generation in the hands of the LLM by having the LLM create the graph. I am sure that with the appropriate guardrails in place this approach will work, but with LLMs being so prone to hallucination, I would be interested in understanding more about how those guardrails protect against building hallucinations into the core of knowledge generation. Microsoft's approach is also much more focused on deep information retrieval than on specific engineering tasks.
Neo4j describes an approach whereby data is parsed into a graph database, which is then queried and provided to the LLM as additional context. This resonated with me, as it is very similar to how I have built my GraphRAG system.
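The ingestion side of such an approach can be sketched as rendering discovered relationships into Cypher MERGE statements for a graph database. This is a hedged illustration: the node label, edge properties, and overall schema here are my own assumptions, not Neo4j's published design.

```python
def to_cypher(rel):
    """Render one discovered relationship as an idempotent Cypher MERGE.

    Labels and property names are illustrative assumptions. MERGE (rather
    than CREATE) keeps repeated ingestion runs from duplicating nodes.
    """
    return (
        f"MERGE (p:Table {{name: '{rel['parent_table']}'}}) "
        f"MERGE (c:Table {{name: '{rel['child_table']}'}}) "
        f"MERGE (p)-[:{rel['relationship_type']} "
        f"{{parent_column: '{rel['parent_column']}', "
        f"child_column: '{rel['child_column']}'}}]->(c)"
    )

stmt = to_cypher({
    "parent_table": "Artist", "parent_column": "ArtistId",
    "child_table": "Album", "child_column": "ArtistId",
    "relationship_type": "HAS_MANY",
})
```

In a real pipeline the generated statements would be executed through a graph-database driver; generating them as plain strings keeps the sketch self-contained.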
My approach is more aligned with Neo4j's approach to GraphRAG, where the graph database is part of the system that provides context to the LLM.
My implementation of GraphRAG in an application
By having an external source of truth that is directly linked to the Data Model, the context provided to the LLM is as ‘truthy’ as it can be, prone only to good ol’ coding errors instead of potential LLM hallucination.
The approach is a bit more rooted in traditional methods: I parse the data model (an SQL-based relational system) into nodes and relationships in a graph database, and then provide an endpoint where those relationships can be queried as a source of truth.
This information is then collated alongside the other information points that we collect; everything is run through an embedding model and stored in a vector database.
relations = discover_relationships(
    db_url=f"sqlite:///{db_file}",
    clear_graph=True,
)

for r in relations:
    rv.train(
        documentation=f"Table {r['parent_table']} "
                      f"{r['relationship_type']} "
                      f"{r['child_table']} "
                      f"ON {r['parent_table']}.{r['parent_column']} "
                      f"= {r['child_table']}.{r['child_column']}"
    )
Or it can be queried on the fly for a particular table to return either first-level or multi-level relationships to that table.
query_relationships = discoverer.query_relationships(
    table_name="Album",
    depth=2,
)
This returns a list of all of the nodes and relationships between them that can then be provided as additional context to the LLM as required.
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Album', 'r.parent_column': 'ArtistId', 'r.child_table': 'Artist', 'r.child_column': 'ArtistId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Artist', 'r.parent_column': 'ArtistId', 'r.child_table': 'Album', 'r.child_column': 'ArtistId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Album', 'r.parent_column': 'AlbumId', 'r.child_table': 'Track', 'r.child_column': 'AlbumId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Track', 'r.parent_column': 'AlbumId', 'r.child_table': 'Album', 'r.child_column': 'AlbumId'}
{'type(r)': 'HAS_ONE', 'r.parent_table': 'Track', 'r.parent_column': 'GenreId', 'r.child_table': 'Genre', 'r.child_column': 'GenreId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Track', 'r.parent_column': 'TrackId', 'r.child_table': 'InvoiceLine', 'r.child_column': 'TrackId'}
{'type(r)': 'HAS_ONE', 'r.parent_table': 'Track', 'r.parent_column': 'MediaTypeId', 'r.child_table': 'MediaType', 'r.child_column': 'MediaTypeId'}
{'type(r)': 'HAS_MANY', 'r.parent_table': 'Track', 'r.parent_column': 'TrackId', 'r.child_table': 'PlaylistTrack', 'r.child_column': 'TrackId'}
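Rows like the ones above can be flattened into plain-text context for the prompt. A minimal sketch of that glue code (my own, hypothetical; only the field names mirror the query output shown):

```python
def relationships_to_context(rows):
    """Flatten graph query rows into prompt-ready text, one join per line.

    Hypothetical helper: field names match the query output above, but
    the exact wording of the context is an assumption.
    """
    lines = []
    for r in rows:
        lines.append(
            f"{r['r.parent_table']} {r['type(r)']} {r['r.child_table']} "
            f"ON {r['r.parent_table']}.{r['r.parent_column']} = "
            f"{r['r.child_table']}.{r['r.child_column']}"
        )
    return "\n".join(lines)

rows = [{'type(r)': 'HAS_MANY', 'r.parent_table': 'Album',
         'r.parent_column': 'ArtistId', 'r.child_table': 'Artist',
         'r.child_column': 'ArtistId'}]
context = relationships_to_context(rows)
```

The resulting text can be appended to the system prompt or the retrieval context alongside the vector-search results.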
Advanced features
I have also coded endpoints that allow for deep path traversal, to help establish whether a path between two nodes exists. This is great to use as a validation checkpoint for SQL returned from the LLM: I can check whether the queries it generates are actually possible, and even provide deep relationship paths for very complex databases.
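The validation step can be sketched as a breadth-first search over the discovered relationships. In the real system a graph query (e.g. Cypher's shortestPath) would do this server-side; this local, in-memory version is an illustrative assumption, not the author's implementation.

```python
from collections import deque

def join_path_exists(relations, start, goal, max_depth=4):
    """BFS check that two tables are connected by some chain of joins.

    Sketch only: treats relationships as undirected since SQL joins can
    traverse a foreign key in either direction.
    """
    adjacency = {}
    for r in relations:
        adjacency.setdefault(r["parent_table"], set()).add(r["child_table"])
        adjacency.setdefault(r["child_table"], set()).add(r["parent_table"])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        table, depth = queue.popleft()
        if table == goal:
            return True
        if depth < max_depth:
            for nxt in adjacency.get(table, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return False

rels = [
    {"parent_table": "Artist", "child_table": "Album"},
    {"parent_table": "Album", "child_table": "Track"},
]
```

If an LLM-generated query joins Artist to Track, the check passes; a join to a table with no path can be rejected before the SQL is ever executed.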
Why not use an LLM for this?
My goal for this is to have a source of truth for the relationships in the database. If you use an LLM to produce this without some form of third-party checking, you run the risk of introducing errors at a very low level.
In the future, I will bring in an LLM to help decode some of the more esoteric database structures and ensure that I can verify the output that it creates through a non-LLM process.
Quick Example of a GraphRAG system
So, what does this look like in action? Hopefully, much the same as pretty much every other system that implements RAG; it's what happens in the background that's important. This is a quick Streamlit application I threw together to test it out.
It uses Ollama locally with Llama-3-8b as the model. It's a bit clunky, but overall it demonstrates a level of accuracy that is unobtainable without programming the database structure into the LLM and defining the relationships in advance.
Conclusion
GraphRAG is another way to provide additional context to an LLM, helping it not only link disparate data but also discover new linkages that might not be obvious.
I don't think GraphRAG will be the last step in the LLM augmentation space. It was a logical progression, in my opinion, so much so that I don't really understand the hype around it. Microsoft's implementation is very different from (and much more complex than) what both Neo4j and I have written about, and I expect even more impressive systems to surface over the coming months.
Anything that can add the required context to an LLM and help it generate more accurate output will be brought into the ecosystem.
The supporting facets of Generative AI platforms are becoming just as important as the LLMs themselves. Context is everything, and providing that context efficiently will only become more important as we ask more of these models.
About the author
Dan is a Principal of Data Engineering at Slalom, focusing on modernizing customer data landscapes, machine learning, and AI.
He is a start-up veteran of over 20 years, delivering solutions for some of the largest names in Australia and the UK.