

Exploring GraphRAG: Smarter AI Knowledge Retrieval with Neo4j & LLMs

Last Updated on March 6, 2025 by Editorial Team

Author(s): Sridhar Sampath

Originally published on Towards AI.


Traditional AI models struggle to retrieve structured knowledge efficiently. Large Language Models (LLMs) rely on text-based data, which often leads to hallucinations, fragmented context, and limited reasoning.

GraphRAG (Graph + Retrieval-Augmented Generation) is a technique that addresses these problems by integrating Neo4j knowledge graphs with LLMs such as OpenAI's models.

In this guide, I explore GraphRAG’s potential using a Football Knowledge Graph chatbot, but this technique applies to finance, healthcare, legal AI, and enterprise knowledge management.

📌 Why GraphRAG? (Graph + RAG)

GraphRAG (Graph + Retrieval-Augmented Generation) is a technique pioneered by Microsoft, designed to improve the accuracy and reasoning of LLM (Large Language Model) responses using knowledge graphs. Traditional LLMs often struggle with hallucinations, fragmented context, and limited reasoning; GraphRAG closes these gaps by introducing structured, graph-based retrieval before the model generates a response.

🔹 Key Advantages of GraphRAG

Improved Contextual Understanding:
In this demo, the LLM retrieves knowledge from the Neo4j graph database, allowing it to understand the relationships between players, clubs, leagues, and historical data and to produce more contextually accurate answers.

Higher Accuracy & Reduced Hallucinations:
By grounding responses in a structured knowledge graph, GraphRAG ensures fact-based retrieval from trusted sources rather than relying on the LLM’s pretrained memory.

Multi-Hop Reasoning & Deep Insights:
Unlike simple retrieval, GraphRAG supports multi-hop queries, enabling complex questions like:
"Which player has the highest goal-scoring record in La Liga?" This requires chaining multiple relationships:
(Players → Clubs → Leagues → Goals)

Source: Image by the author.

The final LLM output for the Cypher query above:

Source: Image by the author.

Increased Transparency & Source Traceability:
Every response is backed by structured Cypher queries to Neo4j, ensuring that answers can be traced back to the graph for verification.

🔹 The GraphRAG Process: From Query to Answer

GraphRAG combines Knowledge Graphs, Graph Retrieval, and LLM Summarization in a single AI pipeline:

1️⃣ Neo4j Knowledge Graph Construction → Extract structured nodes and relationships (Players, Clubs, Leagues).
2️⃣ Graph-Based Retrieval → Convert user queries into Cypher queries for structured retrieval.
3️⃣ LLM Response Generation → Use GPT to format the retrieved knowledge into human-readable responses. The model I used is gpt-3.5-turbo.

🚀 Why This Matters
Traditional LLMs rely solely on embeddings, but GraphRAG + Neo4j adds graph-driven reasoning, making AI more explainable, accurate, and scalable.
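As a rough illustration, the three stages above can be sketched as a single pipeline. The `llm` and `run_cypher` helpers below are hypothetical stubs standing in for the OpenAI call and the Neo4j driver session; they show only the data flow, not the demo's actual code:

```python
# Minimal sketch of the GraphRAG flow: question -> Cypher -> graph rows -> answer.
# llm() and run_cypher() are stubs; in the real app they would call the OpenAI
# chat completions API (gpt-3.5-turbo) and session.run() on the Neo4j driver.

def llm(prompt: str) -> str:
    # Stub: pretend the model translated the question into Cypher.
    return "MATCH (p:Player) WHERE p.goals > 30 RETURN p.name, p.goals"

def run_cypher(cypher: str) -> list[dict]:
    # Stub: pretend Neo4j returned structured rows for that query.
    return [{"p.name": "Erling Haaland", "p.goals": 36}]

def graphrag_answer(question: str) -> str:
    cypher = llm(f"Translate into Cypher: {question}")     # graph-based retrieval: question -> Cypher
    rows = run_cypher(cypher)                              # structured lookup in the knowledge graph
    return llm(f"Answer '{question}' using only: {rows}")  # LLM response generation over the rows
```

Grounding the final generation step in the retrieved rows, rather than the model's pretrained memory, is what keeps answers traceable to the graph.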

The diagram below shows the graph-based retrieval and LLM response generation steps.

Source: Image by the author.

GraphRAG in Action: Building a Football Knowledge Graph Chatbot

To illustrate GraphRAG's capabilities, let's build a Football Knowledge Graph chatbot using Neo4j, OpenAI, and Streamlit.

⚽ Why Football Data?

Football is just an example use case to show how GraphRAG enhances AI-powered retrieval. The same approach applies to healthcare, finance, legal, or enterprise AI applications.

I used the Kaggle dataset: Top Football Leagues Scorers

This dataset includes:

Top goal scorers from major leagues

Players, clubs, and league affiliations

Performance stats (goals, xG, shots, matches played, etc.)

This is a screenshot of the CSV data:

Source: Image by the author.

Final Demo Screenshot

Source: Image by the author.

Neo4j Aura console output for the above query in our demo:

Source: Image by the author.

📢 Note:

This dataset is not exhaustive; I used it as a structured football data sample for demo purposes. The results are based only on the CSV data loaded into Neo4j and do not reflect real-world live stats.

In this guide, let's walk through how to build a Football Knowledge Graph chatbot that combines Neo4j, OpenAI, and Streamlit to answer complex football queries.

📂 Step 1: Creating a Neo4j Knowledge Graph

We set up a Neo4j AuraDB instance and connect it using the Neo4j Python driver. The Football Knowledge Graph structures relationships between Players, Clubs, and Leagues, enabling efficient retrieval of football insights.

Create your free Neo4j AuraDB instance here: Neo4j AuraDB Setup Guide

Graph Structure:

  • Players → (:Player)-[:PLAYS_FOR]->(:Club)
  • Clubs → (:Club)-[:PART_OF]->(:League)
  • Leagues → (:League)-[:IN_COUNTRY]->(:Country)

Once the data is loaded, we can visualize our graph structure in the Neo4j Browser. Below is a sample screenshot of the Football Knowledge Graph displaying the connected entities:

Screenshots from Neo4j Aura

Source: Image by the author.
Source: Image by the author.

✅ This enables powerful knowledge retrieval, such as:

  • "What are the stats for Erling Haaland?"
  • "Who has played the most matches in the Bundesliga?"
  • "Which players have similar goal-scoring stats to Mohamed Salah?"

Loading Data into Neo4j (File: football_kg_loader.py)

1️⃣ Prepare a CSV file with player, club, and league data.

2️⃣ Use Python & Neo4j Driver to insert data into the graph.

3️⃣ Run Cypher queries to create nodes and define relationships, for example:

MERGE (p:Player {name: "Lionel Messi", year: 2023, goals: 30, matches: 38}) 
RETURN p
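One possible shape for the loader logic: build a parameterized Cypher statement per CSV row and hand it to the driver. The column names and the `merge_statement` helper are illustrative assumptions, not the exact contents of football_kg_loader.py:

```python
# Build a parameterized Cypher MERGE for one CSV row. Parameters keep the
# query safe from injection and let Neo4j cache the query plan; with the
# neo4j driver this would run as session.run(query, **params).

def merge_statement(row: dict) -> tuple[str, dict]:
    query = (
        "MERGE (p:Player {name: $player}) "
        "SET p.year = $year, p.goals = $goals, p.matches = $matches "
        "MERGE (c:Club {name: $club}) "
        "MERGE (l:League {name: $league}) "
        "MERGE (p)-[:PLAYS_FOR]->(c) "
        "MERGE (c)-[:PART_OF]->(l)"
    )
    params = {k: row[k] for k in ("player", "year", "goals", "matches", "club", "league")}
    return query, params

# Example row shaped like the Kaggle CSV fields described above (values illustrative).
row = {"player": "Lionel Messi", "year": 2023, "goals": 30,
       "matches": 38, "club": "PSG", "league": "Ligue 1"}
query, params = merge_statement(row)
```

Using MERGE (rather than CREATE) keeps the load idempotent: re-running the loader matches existing nodes instead of duplicating them.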

🔍 Step 2: Enhancing Retrieval with OpenAI Embeddings

Why Use AI Embeddings?

🔹 Find players with similar playing styles
🔹 Compare performance stats across leagues
🔹 Improve AI-powered player recommendations

How It Works?

Enhance our GraphRAG pipeline by:
1️⃣ Extracting player statistics (goals, xG, matches, shots, etc.).
2️⃣ Generating OpenAI embeddings for each player.
3️⃣ Storing embeddings in Neo4j for efficient retrieval.
4️⃣ Performing similarity search using vector queries.

Creating a Vector Index for Player Embeddings (File: football_kg_embeddings.py)

Before storing and querying embeddings, we need to create a vector index in Neo4j.

Why Create a Vector Index?

🔹 Fast retrieval: Helps in performing quick similarity searches over thousands of players.
🔹 Optimized search: Uses cosine similarity to efficiently compare player embeddings.
🔹 Structured AI-powered queries: Allows Neo4j to store & query OpenAI-generated embeddings directly.

Cypher Query to Create Vector Index:

CREATE VECTOR INDEX football_players_embeddings IF NOT EXISTS
FOR (p:Player) ON (p.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }
}

👉 This ensures that we can efficiently retrieve similar players based on statistical embeddings stored in the graph.

Storing Embeddings in Neo4j

Once the index is created, we generate vector embeddings using OpenAI’s Embedding API and store them in Neo4j.

CALL db.create.setNodeVectorProperty(p, "embedding", vector)

This process converts numerical player stats into AI-readable embeddings, allowing similarity-based retrieval.
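One plausible way to produce those vectors: serialize each player's stats into a short text and embed it. The field names and the embedding model shown in the comments are assumptions for illustration; football_kg_embeddings.py may differ:

```python
# Serialize a player's stats into the text that gets embedded.

def stats_to_text(player: dict) -> str:
    return (f"{player['name']}: {player['goals']} goals, {player['xg']} xG, "
            f"{player['shots']} shots, {player['matches']} matches")

# With the OpenAI client, this text would then be embedded (1536 dimensions,
# matching the vector index above) and written back onto the Player node, e.g.:
#   vector = client.embeddings.create(model="text-embedding-ada-002",
#                                     input=text).data[0].embedding
#   session.run("MATCH (p:Player {name: $n}) "
#               "CALL db.create.setNodeVectorProperty(p, 'embedding', $v)",
#               n=player["name"], v=vector)

text = stats_to_text({"name": "Mohamed Salah", "goals": 19,
                      "xg": 16.7, "shots": 90, "matches": 34})
```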

Performing Similarity Search

With embeddings stored in the knowledge graph, we can now retrieve players similar to a given player using a vector search query:

CALL db.index.vector.queryNodes(
  'football_players_embeddings', 5,
  genai.vector.encode("Find players similar to Lionel Messi")
) YIELD node AS player, score
RETURN player.name, score

This query returns the top 5 players whose stats closely match Messi’s.
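Under the hood, the index ranks players by cosine similarity, the function we configured when creating it. A minimal illustration of that metric on toy stat vectors (the real embeddings are 1536-dimensional OpenAI vectors, not raw stats):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the similarity function the Neo4j vector index uses."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "stat vectors" (goals, xG, shots); values are illustrative.
salah     = [19.0, 16.7, 90.0]
similar   = [18.0, 15.9, 88.0]
different = [40.0, 2.0, 20.0]

# A near-identical stat profile scores higher than a dissimilar one.
assert cosine_similarity(salah, similar) > cosine_similarity(salah, different)
```

Because cosine similarity measures the angle between vectors rather than their magnitude, players with proportionally similar stat profiles rank close together even if their absolute numbers differ.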

🤖 Step 3: AI Chatbot with Streamlit + OpenAI

Building the Chatbot (File: football_kg_chatbot.py)

How It Works:

1️⃣ User asks a football question in Streamlit.

2️⃣ LLM converts the query into Cypher.

3️⃣ Neo4j fetches structured football data.

4️⃣ LLM (OpenAI) formats the retrieved data into a natural language response and enriches it.
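Step 2️⃣, turning a question into Cypher, typically works by putting the graph schema into the model's prompt. The schema text and wording below are an illustrative guess at what football_kg_chatbot.py might send to gpt-3.5-turbo, not its exact prompt:

```python
# Sketch of a question-to-Cypher prompt builder for the chatbot.

SCHEMA = """(:Player)-[:PLAYS_FOR]->(:Club)
(:Club)-[:PART_OF]->(:League)
(:League)-[:IN_COUNTRY]->(:Country)
Player properties: name, year, goals, matches"""

def cypher_prompt(question: str) -> str:
    # Giving the model the schema constrains it to labels and relationship
    # types that actually exist in the graph, reducing invalid queries.
    return (
        "You translate football questions into Cypher.\n"
        f"Graph schema:\n{SCHEMA}\n"
        "Return only a Cypher query, with no explanation.\n"
        f"Question: {question}"
    )

prompt = cypher_prompt("Which players scored more than 30 goals in a season?")
```

The returned string would be sent as the chat message to the model, and the model's reply executed against Neo4j in step 3️⃣.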

Example Queries & Responses

1️⃣ "Which players scored more than 30 goals in a season?"

Source: Image by the author.

🔍 This is the VS Code terminal output of the generated Cypher query:

Source: Image by the author.

2️⃣ "What are the stats for Erling Haaland?"

Source: Image by the author.

🔍 This is the VS Code terminal output of the generated Cypher query:

Source: Image by the author.

3️⃣ "Which players have similar goal-scoring stats to Mohamed Salah?"

Source: Image by the author.

🔍 This is the VS Code terminal output of the generated Cypher query:

Source: Image by the author.

4️⃣ "Which clubs are in Spain?"

Source: Image by the author.

🔍 This is the VS Code terminal output of the generated Cypher query:

Source: Image by the author.

5️⃣ "Which club does Lionel Messi play for?"

Source: Image by the author.

🔍 This is the VS Code terminal output of the generated Cypher query:

Source: Image by the author.

🔍 Explanation of What’s Happening in the Query & LLM Response

This process demonstrates how GraphRAG (Graph + RAG) combines Neo4j’s structured retrieval with LLM summarization to generate meaningful insights. Here’s a breakdown:

Example Query: "Which players have similar goal-scoring stats to Mohamed Salah?"

1️⃣ Cypher Query Execution in Neo4j

When a user asks, "Which players have similar goal-scoring stats to Mohamed Salah?", the LLM generates the following Cypher query:

MATCH (p:Player {name: "Mohamed Salah"})-[:PLAYS_FOR]->(c:Club)-[:PART_OF]->(l:League)-[:IN_COUNTRY]->(co:Country)
WITH p, co
MATCH (player:Player)-[:PLAYS_FOR]->(:Club)-[:PART_OF]->(l)-[:IN_COUNTRY]->(co)
WHERE player.goals >= p.goals - 5 AND player.goals <= p.goals + 5 AND player.name <> "Mohamed Salah"
RETURN player.name, player.goals

How This Works:

  • Identifies Mohamed Salah in the graph.
  • Finds players in the same country and league.
  • Filters players with a similar goal tally (within ±5 goals).
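For intuition, the ±5-goal window in the WHERE clause corresponds to this plain-Python filter (toy data, not the dataset's real numbers):

```python
# Plain-Python equivalent of the Cypher filter:
#   player.goals >= p.goals - 5 AND player.goals <= p.goals + 5
#   AND player.name <> "Mohamed Salah"

def similar_scorers(players: list[dict], target: str, window: int = 5) -> list[dict]:
    ref = next(p for p in players if p["name"] == target)
    return [p for p in players
            if p["name"] != target
            and abs(p["goals"] - ref["goals"]) <= window]

players = [
    {"name": "Mohamed Salah", "goals": 19},
    {"name": "Antoine Griezmann", "goals": 16},   # within 5 goals -> kept
    {"name": "Neymar", "goals": 13},              # 6 goals away -> dropped
    {"name": "Erling Haaland", "goals": 36},      # far outside -> dropped
]
matches = similar_scorers(players, "Mohamed Salah")
```

The graph version does the same comparison but scoped through the PLAYS_FOR / PART_OF / IN_COUNTRY relationships, which is what restricts results to the same league and country.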

2️⃣ Neo4j Query Result (Structured Graph Retrieval)

Neo4j executes the query and returns structured football data:

[
  {"player.name": "Antoine Griezmann", "player.goals": 16},
  {"player.name": "Philippe Coutinho", "player.goals": 13},
  {"player.name": "Antoine Griezmann", "player.goals": 19},
  {"player.name": "Mirco Antenucci", "player.goals": 11},
  {"player.name": "Antoine Griezmann", "player.goals": 15},
  {"player.name": "Morata", "player.goals": 15},
  {"player.name": "Morata", "player.goals": 12},
  {"player.name": "Neymar", "player.goals": 13},
  {"player.name": "Pablo Sarabia", "player.goals": 13},
  {"player.name": "Mauro Icardi", "player.goals": 11}
]

📌 Key Takeaway:

  • The system retrieves structured player data without hallucinations.
  • Graph-based filtering ensures relevant, league-specific player comparisons.

(Neo4j console Query Execution Screenshot)

Source: Image by the author.

3️⃣ LLM Summarization (Neo4j + OpenAI Response Formatting)

Once Neo4j retrieves structured football data, it’s sent to OpenAI’s LLM for natural language formatting.

📷 Streamlit output screenshot (LLM answer formatting):

Source: Image by the author.

How the LLM Enhances the Response:

🔹 Structured answer: lists players with similar stats in an easy-to-read format.
🔹 Adds context: explains the logic behind the similarity matching.
🔹 Human-like reasoning: groups players logically rather than just listing data.

🔥 Why GraphRAG Is a Better Option Than Plain RAG

🚀 Compared to traditional RAG:
Fewer hallucinations: retrieval is grounded in facts from the graph.
Structured reasoning: graph-based multi-hop analytics.
Scalability: handles large knowledge bases effortlessly.
Domain-agnostic: applicable to healthcare, finance, legal, and enterprise AI.

🔗 Let’s Connect

📂 GitHub Repo: Football Knowledge Graph Chatbot
🔗 Dataset: Top Football Scorers on Kaggle (also included in the GitHub repo, under the data folder)

🏅 Conclusion

GraphRAG enhances AI knowledge retrieval by combining Neo4j’s structured search with LLMs’ natural language understanding. This approach ensures factual, multi-hop reasoning and improved accuracy, reducing hallucinations common in traditional AI models.

Note on API Costs

I used the OpenAI API for both the embeddings and the LLM responses (gpt-3.5-turbo), which incurs a very small cost. Below is a screenshot of the OpenAI API usage for our Football Knowledge Graph chatbot.

In my case, it came to around $0.06 for this demo project:

Source: Image by the author.

While the costs are minimal for small-scale experiments, it’s important to monitor API usage and optimize queries to keep expenses under control.

Originally published at https://sridhartech.hashnode.dev on March 2, 2025.


Published via Towards AI
