
GraphRAG Analysis, Part 1: How Indexing Elevates Knowledge Graph Performance in RAG

Last Updated on July 13, 2024 by Editorial Team

Author(s): Jonathan Bennion

Originally published on Towards AI.

TLDR:

  • Knowledge graphs may not significantly impact context retrieval: all knowledge graph RAG methods I examined showed context relevancy scores similar to FAISS (~0.74).
  • Neo4j withOUT its own index achieves the highest answer relevancy score (0.93), compared to Neo4j WITH its index (0.74) and FAISS (0.87), but an 8% lift over FAISS may not be worth the ROI constraints. This suggests potential benefits for applications requiring high-precision answers, in high-value use cases that do not require finetuning.
  • The faithfulness score improved significantly when using Neo4j's index (0.52) compared to not using it (0.21) or using FAISS (0.20). This reduces fabricated information, which is a clear benefit, but it still leaves developers asking whether GraphRAG is worth the ROI constraints (versus finetuning, which could cost slightly more but lead to much higher scores).
Image created by the author

Original question that led to my analysis:

If "GraphRAG" methods are as profound as the hype suggests, when and why would I use a knowledge graph in my RAG application?

I've been seeking to understand the practical applications of this technology beyond the currently hyped discussions, so I examined the original Microsoft research paper to gain a deeper understanding of their methodology and findings.

The two metrics the Microsoft paper claims GraphRAG lifts:

Metric #1 – "Comprehensiveness":

"How much detail does the answer provide to cover all aspects and details of the question?"

Recognizing that the level of detail in a response can be influenced by many factors beyond knowledge graph implementation, the paper's inclusion of a "Directness" metric offers an interesting way to control for response length. Still, I was surprised this was only one of the two metrics cited for lift, and I was curious about other measures.

Metric #2 – "Diversity":

"How varied and rich is the answer in providing different perspectives and insights on the question?"

Diversity of responses is a complex metric that may be influenced by various factors, including audience expectations and prompt design. It is an interesting approach to evaluation, though for directly measuring the effect of knowledge graphs in RAG it may benefit from further refinement.

I was even more curious why the magnitude of the lift is vague:

The paper's official statement on the reported lift for the two metrics above:

"substantial improvements over the naive RAG baseline"

The paper reports that GraphRAG, a newly open-sourced RAG pipeline, showed "substantial improvements" over a "baseline". These vague terms sparked my interest in quantifying the lift with more precision (taking into account all known biases of the measurement).

After noting the lack of specifics in the paper, I was inspired to conduct additional research into knowledge graphs in RAG more broadly, which allowed me to examine additional metrics that might provide further insight into RAG performance.

Note: Microsoft's GraphRAG paper is downloadable here; consider the following analysis a complementary perspective that adds detail relevant to the paper's findings.

Analysis methodology overview:

  1. I split a PDF document into the same chunks for all variants of this analysis (The June 2024 US Presidential Debate transcript, an appropriate RAG opportunity for models created before that debate).
  2. Loaded the document into Neo4j using a graph representation of the semantic values it finds, and created a Neo4j index.
  3. Created 3 retrievers to use as variants to test:
  • One using Neo4j knowledge graph AND the Neo4j index
  • Another using Neo4j knowledge graph WITHOUT the Neo4j index
  • A FAISS retriever baseline that loads the same document without ANY reference to Neo4j.
  4. Developed ground truth Q&A datasets to investigate potential scale-dependent effects on performance metrics.
  5. Used RAGAS to evaluate results (precision and recall) of both the retrieval quality and the answer quality, which offer a complementary perspective to the metrics used in the Microsoft study.
  6. Plotted the results below, with caveats for biases.

Analysis:

A quick run-through of the code below: I used LangChain, OpenAI for embeddings (and for eval as well as retrieval), Neo4j, and RAGAS.

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

# Import packages
import os
import asyncio
import nest_asyncio
nest_asyncio.apply()
import pandas as pd
from dotenv import load_dotenv
from typing import List, Dict, Union
from scipy import stats
from collections import OrderedDict
import openai
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.text_splitter import TokenTextSplitter
from langchain_community.vectorstores import Neo4jVector, FAISS
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document
from neo4j import GraphDatabase
import numpy as np
import matplotlib.pyplot as plt
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_relevancy,
context_recall,
)
from datasets import Dataset
import random

Added the OpenAI API key and the Neo4j authentication credentials from environment variables

# Set up API keys 
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
neo4j_url = os.getenv("NEO4J_URL")
neo4j_user = os.getenv("NEO4J_USER")
neo4j_password = os.getenv("NEO4J_PASSWORD")
openai_api_key = os.getenv("OPENAI_API_KEY") # reused below for the LLM and embeddings setup

# Load and process the PDF
pdf_path = "debate_transcript.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) # Comparable to Neo4j
texts = text_splitter.split_documents(documents)

# Set up Neo4j connection
driver = GraphDatabase.driver(neo4j_url, auth=(neo4j_user, neo4j_password))

Used Cypher to load Neo4j with its own graph representation of the document and created a Neo4j index

# Create function for vector index in Neo4j after the graph representation is complete below
def create_vector_index(tx):
    query = """
    CREATE VECTOR INDEX pdf_content_index IF NOT EXISTS
    FOR (c:Content)
    ON (c.embedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }}
    """
    tx.run(query)

# Function for Neo4j graph creation
def create_document_graph(tx, texts, pdf_name):
    query = """
    MERGE (d:Document {name: $pdf_name})
    WITH d
    UNWIND $texts AS text
    CREATE (c:Content {text: text.page_content, page: text.metadata.page})
    CREATE (d)-[:HAS_CONTENT]->(c)
    WITH c, text.page_content AS content
    UNWIND split(content, ' ') AS word
    MERGE (w:Word {value: toLower(word)})
    MERGE (c)-[:CONTAINS]->(w)
    """
    tx.run(query, pdf_name=pdf_name, texts=[
        {"page_content": t.page_content, "metadata": t.metadata}
        for t in texts
    ])

# Create graph index and structure
with driver.session() as session:
    session.execute_write(create_vector_index)
    session.execute_write(create_document_graph, texts, pdf_path)

# Close driver
driver.close()
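
A quick sanity check I find useful before wiring up the retrievers (not part of the original notebook; it assumes a Neo4j version that supports the SHOW INDEXES Cypher command, i.e., 4.3+): confirm that the vector index was actually created.

# Optional sanity check (assumption: Neo4j 4.3+ with SHOW INDEXES support)
check_driver = GraphDatabase.driver(neo4j_url, auth=(neo4j_user, neo4j_password))
with check_driver.session() as session:
    index_names = [record["name"] for record in session.run("SHOW INDEXES")]
    print("Indexes found:", index_names)
    assert "pdf_content_index" in index_names, "Vector index missing - rerun create_vector_index"
check_driver.close()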

Set up OpenAI for retrieval as well as embeddings

# Define model for retrieval 
llm = ChatOpenAI(model_name="gpt-3.5-turbo", openai_api_key=openai_api_key)

# Setup embeddings model w default OAI embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

Set up three retrievers to test:

  • Neo4j with reference to its own index
  • Neo4j without reference to its index, so embeddings are created from the Neo4j store as it was loaded
  • FAISS, to set up a non-Neo4j vector database on the same chunked document as a baseline
# Neo4j retriever setup using Neo4j, OAI embeddings model using Neo4j index 
neo4j_vector_store = Neo4jVector.from_existing_index(
    embeddings,
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password,
    index_name="pdf_content_index",
    node_label="Content",
    text_node_property="text",
    embedding_node_property="embedding"
)
neo4j_retriever = neo4j_vector_store.as_retriever(search_kwargs={"k": 2})

# OpenAI retriever setup using Neo4j, OAI embeddings model NOT using Neo4j index
openai_vector_store = Neo4jVector.from_documents(
    texts,
    embeddings,
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)
openai_retriever = openai_vector_store.as_retriever(search_kwargs={"k": 2})

# FAISS retriever setup - OAI embeddings model baseline for non Neo4j vector store touchpoint
faiss_vector_store = FAISS.from_documents(texts, embeddings)
faiss_retriever = faiss_vector_store.as_retriever(search_kwargs={"k": 2})

Created ground truth from PDF for RAGAS eval (N = 100).

I used an OpenAI model to generate the ground truth, but OpenAI models are also the default for retrieval in all variants, so no real bias is introduced by creating the ground truth this way (outside of OpenAI training data!).

# Move to N = 100 for more Q&A ground truth
def create_ground_truth2(texts: List[Union[str, Document]], num_questions: int = 100) -> List[Dict]:
    llm_ground_truth = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

    # Function to extract text from str or Document
    def get_text(item):
        if isinstance(item, Document):
            return item.page_content
        return item

    # Split long texts into smaller chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_text(' '.join(get_text(doc) for doc in texts))

    ground_truth2 = []

    question_prompt = ChatPromptTemplate.from_template(
        "Given the following text, generate {num_questions} diverse and specific questions that can be answered based on the information in the text. "
        "Provide the questions as a numbered list.\n\nText: {text}\n\nQuestions:"
    )

    all_questions = []
    for split in all_splits:
        response = llm_ground_truth(question_prompt.format_messages(num_questions=3, text=split))
        questions = response.content.strip().split('\n')
        all_questions.extend([q.split('. ', 1)[1] if '. ' in q else q for q in questions])

    random.shuffle(all_questions)
    selected_questions = all_questions[:num_questions]

    llm = ChatOpenAI(temperature=0)

    for question in selected_questions:
        answer_prompt = ChatPromptTemplate.from_template(
            "Given the following question, provide a concise and accurate answer based on the information available. "
            "If the answer is not directly available, respond with 'Information not available in the given context.'\n\nQuestion: {question}\n\nAnswer:"
        )
        answer_response = llm(answer_prompt.format_messages(question=question))
        answer = answer_response.content.strip()

        context_prompt = ChatPromptTemplate.from_template(
            "Given the following question and answer, provide a brief, relevant context that supports this answer. "
            "If no relevant context is available, respond with 'No relevant context available.'\n\n"
            "Question: {question}\nAnswer: {answer}\n\nRelevant context:"
        )
        context_response = llm(context_prompt.format_messages(question=question, answer=answer))
        context = context_response.content.strip()

        ground_truth2.append({
            "question": question,
            "answer": answer,
            "context": context,
        })

    return ground_truth2

ground_truth2 = create_ground_truth2(texts)

Created a RAG chain for each retrieval method.

# RAG chain works for each retrieval method
def create_rag_chain(retriever):
    template = """Answer the question based on the following context:
{context}

Question: {question}
Answer:"""

    prompt = PromptTemplate.from_template(template)

    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

# Calling the function for each method
neo4j_rag_chain = create_rag_chain(neo4j_retriever)
faiss_rag_chain = create_rag_chain(faiss_retriever)
openai_rag_chain = create_rag_chain(openai_retriever)

Then I ran the evaluation on each RAG chain using all 4 metrics from RAGAS (the context relevancy and context recall metrics evaluate the RAG retrieval, while the answer relevancy and faithfulness metrics evaluate the full prompt response against ground truth).

# Eval function for RAGAS at N = 100
async def evaluate_rag_async2(rag_chain, ground_truth2, name):
    splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)

    generated_answers = []
    for item in ground_truth2:
        question = splitter.split_text(item["question"])[0]

        try:
            answer = await rag_chain.ainvoke(question)
        except AttributeError:
            answer = rag_chain.invoke(question)

        truncated_answer = splitter.split_text(str(answer))[0]
        truncated_context = splitter.split_text(item["context"])[0]
        truncated_ground_truth = splitter.split_text(item["answer"])[0]

        generated_answers.append({
            "question": question,
            "answer": truncated_answer,
            "contexts": [truncated_context],
            "ground_truth": truncated_ground_truth
        })

    dataset = Dataset.from_pandas(pd.DataFrame(generated_answers))

    result = evaluate(
        dataset,
        metrics=[
            context_relevancy,
            faithfulness,
            answer_relevancy,
            context_recall,
        ]
    )

    return {name: result}

async def run_evaluations(rag_chains, ground_truth2):
    results = {}
    for name, chain in rag_chains.items():
        result = await evaluate_rag_async2(chain, ground_truth2, name)
        results.update(result)
    return results

def main(ground_truth2, rag_chains):
    # Get event loop
    loop = asyncio.get_event_loop()

    # Run evaluations
    results = loop.run_until_complete(run_evaluations(rag_chains, ground_truth2))

    return results

# Run main function for N = 100
if __name__ == "__main__":

    rag_chains = {
        "Neo4j": neo4j_rag_chain,
        "FAISS": faiss_rag_chain,
        "OpenAI": openai_rag_chain
    }

    results = main(ground_truth2, rag_chains)

    for name, result in results.items():
        print(f"Results for {name}:")
        print(result)
        print()

Developed a function to calculate confidence intervals at 95%, providing a measure of uncertainty for the similarity between LLM retrievals and ground truth. However, since the results were already single values, I did not use the function; instead, I confirmed the directional differences by rerunning multiple times and observing the same delta magnitudes and pattern.

# Plot CI - low sample size due to Q&A constraint at 100
def bootstrap_ci(data, num_bootstraps=1000, ci=0.95):
    bootstrapped_means = [np.mean(np.random.choice(data, size=len(data), replace=True)) for _ in range(num_bootstraps)]
    return np.percentile(bootstrapped_means, [(1-ci)/2 * 100, (1+ci)/2 * 100])
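
Since the bootstrap function ended up unused, here is an illustration of how it could be applied if per-question scores were exported from the evaluation (the score list below is hypothetical, not from this analysis):

# Hypothetical per-question faithfulness scores for one retriever (illustrative only)
per_question_faithfulness = [0.48, 0.55, 0.51, 0.62, 0.44]
lower, upper = bootstrap_ci(per_question_faithfulness)
print(f"95% bootstrap CI for mean faithfulness: [{lower:.2f}, {upper:.2f}]")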

Created a function to plot bar plots, initially with estimated error.

# Function to plot
def plot_results(results):
    name_mapping = {
        'Neo4j': 'Neo4j with its own index',
        'OpenAI': 'Neo4j without using Neo4j index',
        'FAISS': 'FAISS vector db (not knowledge graph)'
    }

    # Create a new OrderedDict
    ordered_results = OrderedDict()
    ordered_results['Neo4j with its own index'] = results['Neo4j']
    ordered_results['Neo4j without using Neo4j index'] = results['OpenAI']
    ordered_results['Non-Neo4j FAISS vector db'] = results['FAISS']

    metrics = list(next(iter(ordered_results.values())).keys())
    chains = list(ordered_results.keys())

    fig, ax = plt.subplots(figsize=(18, 10))

    bar_width = 0.25
    opacity = 0.8
    index = np.arange(len(metrics))

    for i, chain in enumerate(chains):
        means = [ordered_results[chain][metric] for metric in metrics]

        all_values = list(ordered_results[chain].values())
        error = (max(all_values) - min(all_values)) / 2
        yerr = [error] * len(means)

        bars = ax.bar(index + i*bar_width, means, bar_width,
                      alpha=opacity,
                      color=plt.cm.Set3(i / len(chains)),
                      label=chain,
                      yerr=yerr,
                      capsize=5)

        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                    f'{height:.2f}',  # 2 decimal places
                    ha='center', va='bottom', rotation=0, fontsize=18, fontweight='bold')

    ax.set_xlabel('RAGAS Metrics', fontsize=16)
    ax.set_ylabel('Scores', fontsize=16)
    ax.set_title('RAGAS Evaluation Results with Error Estimates', fontsize=26, fontweight='bold')
    ax.set_xticks(index + bar_width * (len(chains) - 1) / 2)
    ax.set_xticklabels(metrics, rotation=45, ha='right', fontsize=14, fontweight='bold')

    ax.legend(loc='upper right', fontsize=14, bbox_to_anchor=(1, 1), ncol=1)

    plt.ylim(0, 1)
    plt.tight_layout()
    plt.show()

Finally, plotted these metrics.

To facilitate a focused comparison, key parameters such as document chunking, embedding model, and retrieval model were held constant across experiments. Confidence intervals are not plotted; while I normally would plot them, I am comfortable with the pattern after seeing it hold across multiple reruns (this presumes a level of uniformity in the data). The caveat is that the results are still pending that statistical window of difference.

When rerunning, the relative scores consistently showed negligible variability (surprisingly); after running this analysis a few more times by accident due to resource time-outs, the pattern stayed consistent, and I am generally comfortable with this result.
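
As a rough sketch of how that rerun stability could be quantified rather than eyeballed, the per-run scores could be collected and passed through bootstrap_ci per metric (run_results below is a hypothetical list holding one results dict per rerun; it is not something the notebook actually builds):

# Sketch only: summarize a metric across hypothetical repeated runs
def summarize_reruns(run_results, method="Neo4j", metric="faithfulness"):
    scores = [run[method][metric] for run in run_results]
    lower, upper = bootstrap_ci(scores)
    return np.mean(scores), (lower, upper)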

# Plot
plot_results(results)
Image created by the author

Summary of key observations and implications:

All methods showed similar context relevancy, implying that knowledge graphs in RAG do not benefit context retrieval, but Neo4j with its own index significantly improved faithfulness. Note this is pending confidence intervals and balancing for bias.

Follow me for more insights on AI tools and otherwise.


Published via Towards AI
