Graph RAG with Property Graphs: A Quick Foray
Last Updated on December 14, 2024 by Editorial Team
Author(s): Kundan Joshi
Originally published on Towards AI.
As Retrieval-Augmented Generation (RAG) frameworks evolve rapidly, graph RAG is keeping pace and has emerged as a rejoinder to questions raised about retrieval accuracy. Traditional vector-embedding-based approaches to RAG work very well at extracting similarity-based unstructured content but cannot overcome the inherent limitations of "unstructuredness". With the advent of knowledge graphs, query support for AI is enabled by an external supplement that is not based merely on similarity but is fundamentally constructed on logical relationships, stitched together into an organized representation of the interconnections between entities.
By virtue of its inherent design, graph RAG is better able to address analytical and relatively complex questions that need reasoning, by empowering the LLM to understand the broader context and giving it a more insightful approach to question-solving.
The paradigm has brought into play the simple but effective concept of triples, which form the basis of a knowledge graph, modeling each relation as a subject-predicate-object triplet. Yet, while this may be great for standard Q&A use cases, it can fall short in complex settings or real-world knowledge bases where each informational node may carry various attributes, at times requiring complex ontologies to make the graph useful.
To address this, property graphs, which are premised on nodes and relationships, further extend the utility quotient. Each node can be tagged with labels and can store attributes as key-value pairs. Relationships between nodes can also carry properties and enable navigation. Further, metadata population enhances property graphs by giving overall inference a holistic nudge.
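To make that concrete, here is a minimal sketch in Python, assuming the EntityNode and Relation helper types from llama_index.core.graph_stores.types; the entity names, labels and property values are purely illustrative, drawn from the speech we use later.

from llama_index.core.graph_stores.types import EntityNode, Relation

# Two labelled nodes, each carrying its own key-value attributes
speaker = EntityNode(name="Lisa Cook", label="PERSON", properties={"role": "Governor"})
topic = EntityNode(name="Economic Outlook", label="TOPIC", properties={"venue": "University of Virginia"})

# A typed relationship that can itself hold properties
spoke_on = Relation(
    label="SPOKE_ON",
    source_id=speaker.id,
    target_id=topic.id,
    properties={"date": "2024-11-20"},
)

The extractors we use later emit node and relation objects of exactly this kind and hand them to the graph store.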
So with that much prelude, let's commence the ceremony –
We shall use the LlamaIndex framework throughout this example to demonstrate graph DB integration along with other open-source components. To begin with, the customary virtual environment:
python3 -m venv grenv
source grenv/bin/activate
followed by
pip install -r requirements.txt
with requirements.txt expanding to:
llama-index
llama-index-embeddings-huggingface
llama-index-graph-stores-neo4j
llama-index-readers-file
pydantic-settings
As a case study, we take one of the Fed speeches that addresses the economy and provides some figures as a measure of economic indicators.
Speech by Governor Cook on the economic outlook (www.federalreserve.gov)
Prepare the ground for ingesting the data and doing some basic cleanup.
Besides sanitizing, we need to chunk the data. Note that chunk size involves a trade-off: larger text pieces capture more complete relationships, while smaller chunks speed up graph node analysis. Alternatively, you can use a semantic chunking strategy that relies on an embedding model to split chunks based on similarity, as sketched below.
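For reference, the semantic variant could look roughly like the following sketch; the buffer size and breakpoint threshold are illustrative values you would tune.

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Splits wherever the embedding similarity between adjacent sentence groups drops sharply
semantic_splitter = SemanticSplitterNodeParser(
    embed_model=HuggingFaceEmbedding(),
    buffer_size=1,                        # sentences per comparison window
    breakpoint_percentile_threshold=95,   # higher => fewer, larger chunks
)

For this walkthrough, though, we stick with the SentenceSplitter configured below.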
import re
import nest_asyncio
import os, sys, logging
from llama_index.core import Settings
from llama_index.core import PromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader, PropertyGraphIndex, Document
from llama_index.core.indices.property_graph import SimpleLLMPathExtractor, ImplicitPathExtractor, SchemaLLMPathExtractor, DynamicLLMPathExtractor
from llama_index.core.node_parser import (
SentenceSplitter,
SemanticSplitterNodeParser,
)
from llama_index.llms.llama_cpp import LlamaCPP
from typing import Literal, List
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
nest_asyncio.apply()
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
embed_model = HuggingFaceEmbedding()  # no model_name given, so the default (BAAI/bge-small-en-v1.5) is used
Settings.embed_model = embed_model
Settings.chunk_size = 512   # global defaults; the splitter below sets its own chunk size for the pipeline
Settings.chunk_overlap = 50
content = SimpleDirectoryReader(input_files=["./fedspeech.txt"]).load_data()
document = Document(text=content[0].text, metadata={"title": 'Federal Reserve Speech on Economic Policy'})
splitter = SentenceSplitter(
chunk_size=256,
chunk_overlap=50,
paragraph_separator='\n\n',
secondary_chunking_regex=r'\n[\dA-Za-z]\.'
)
from llama_index.core.schema import TransformComponent
class TextSanitizer(TransformComponent):
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = re.sub(r'\n\s*\n', "\n\n", node.text)  # collapse runs of blank lines
            node.text = re.sub(r'[ ]+', " ", node.text)         # collapse repeated spaces
            node.text = re.sub(r'\b \.', ".", node.text)        # drop a stray space before a period
        return nodes
pipeline = IngestionPipeline(
transformations=[
splitter,
TextSanitizer(),
embed_model,
]
)
nodes = pipeline.run(documents=[document])
Finally, the protagonist of the show: the LLM. We use Mistral 7B-Instruct v0.3 as it is better equipped for function calling than other smaller models. The choice of LLM is vital for graph RAG, since it needs to generate the Cypher statements and identify graph nodes/relations, besides fitting snugly into the LlamaIndex chain of function processing.
The 7B size is a limitation only if we need to fit into a CPU resource-constrained environment. We could alternatively run any of the abundantly available, much larger serverless LLMs that have LlamaIndex adapters, as sketched next.
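For instance, a hosted Mistral model could stand in for the local LlamaCPP setup that follows, roughly like this sketch; it assumes the llama-index-llms-mistralai adapter package and an API key in the MISTRAL_API_KEY environment variable, and the model name is illustrative.

from llama_index.core import Settings
from llama_index.llms.mistralai import MistralAI

# Any hosted adapter can replace the local LlamaCPP instance configured below
Settings.llm = MistralAI(model="mistral-large-latest", temperature=0.1)

Sticking with the local route for this walkthrough, we first define the prompt-formatting helpers and then load the model: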
def completion_to_prompt(completion):
    return f"<|im_start|>system\n<|im_end|>\n<|im_start|>user\n{completion}<|im_end|>\n<|im_start|>assistant\n"

def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|im_start|>system\n{message.content}<|im_end|>\n"
        elif message.role == "user":
            prompt += f"<|im_start|>user\n{message.content}<|im_end|>\n"
        elif message.role == "assistant":
            prompt += f"<|im_start|>assistant\n{message.content}<|im_end|>\n"
    if not prompt.startswith("<|im_start|>system"):
        prompt = "<|im_start|>system\n" + prompt
    prompt = prompt + "<|im_start|>assistant\n"
    return prompt
llm = LlamaCPP(
model_url='https://huggingface.co/lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q5_K_M.gguf',
temperature=0.1,
max_new_tokens=512,
context_window=2048,
generate_kwargs={},
model_kwargs={"n_gpu_layers": -1},
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
verbose=False,
)
Settings.llm = llm
LlamaIndex is bundled with a number of extractors that define the node-generation process for the graph DB; we choose the DynamicLLMPathExtractor as it offers the most flexibility.
You could also experiment with SimpleLLMPathExtractor or the more schema-stipulative SchemaLLMPathExtractor.
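For instance, a schema-constrained extractor might look roughly like the sketch below; the entity and relation types are illustrative choices for this speech, not a prescribed schema.

from typing import Literal
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

# Restrict extraction to a small, domain-specific vocabulary
entities = Literal["PERSON", "ORGANIZATION", "INDICATOR", "POLICY"]
relations = Literal["MEASURED_BY", "AFFECTS", "ANNOUNCED_BY"]

schema_extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=entities,
    possible_relations=relations,
    strict=True,  # drop triplets that fall outside the schema
)

For this demo, though, we let the LLM infer the schema dynamically: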
kg_extractor = DynamicLLMPathExtractor(
llm=llm,
max_triplets_per_chunk=10, # feel free to raise this
num_workers=4, # if multi core
# Let the LLM infer relationships on the fly
allowed_entity_types=None,
allowed_relation_types=None,
allowed_relation_props=[],
allowed_entity_props=[],
)
Finally, the wingman to the graph RAG/LLM party is the graph database itself. Among graph databases there are options like NebulaGraph, Ontotext and others. One of the few options with a permissive open-source license for commercial deployment would probably be Memgraph; however, we chose the graph DB front-runner Neo4j for this demo, simply because Neo4j and its drivers integrate more smoothly with LlamaIndex and it offers multiple deployment options.
However, as an important caveat, it should be stated that if an open-source graph DB is your route to a commercial enterprise, the Neo4j Community Edition carries a GPL v3 license, which is strong copyleft.
But we defer the discussion on the legal viability of that to some other time, and stick to the technical trappings for now.
If Neo4j needs to be installed locally, note that there are a few neo4j.conf changes and APOC jar requirements on the server side; refer to the latest Neo4j deployment instructions for those.
But let's not let deployment testiness get in the way of testing, and instead go with the free tier of Neo4j Aura (https://neo4j.com/product/auradb/).
Connect to the Neo4j instance to create a PropertyGraph store.
username = os.environ.get('NEOUSER') # your DB username (default "neo4j")
password = os.environ.get('NEOPWD') # your DB password
url = "neo4j+s://{#redacted#}.databases.neo4j.io" # Specify the connection URL for Aura
graph_store = Neo4jPropertyGraphStore(
username=username,
password=password,
url=url,
)
Here, the PropertyGraphIndex is instantiated using the nodes that were distilled from the ingestion pipeline.
We could also use documents, instead of nodes, to instantiate the graph index, as sketched after the next block.
index = PropertyGraphIndex(
nodes=nodes,
property_graph_store=graph_store,
embed_model=embed_model,
kg_extractors=[kg_extractor],
show_progress=True)
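The document-based variant mentioned above would look roughly like this sketch, letting the index apply its default chunking transformations instead of our pipeline:

index = PropertyGraphIndex.from_documents(
    [document],
    property_graph_store=graph_store,
    embed_model=embed_model,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Either route runs the same LLM-driven extraction.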
This step can take a while if you are using a local LLM, as the Graph nodes are extracted by the LLM here.
Once done, you can check the nodes created in the database with Cypher, the ordained query language for property graphs:
MATCH p=()-[]-() RETURN p;
Inference:
Once the graph store is created, or if you are using a previously created graph store, we can simply instantiate the index from it (in another program, such as a FastAPI server endpoint) using:
index = PropertyGraphIndex.from_existing(
property_graph_store=graph_store,
show_progress=True,
)
followed by running inference on a question:
retriever = index.as_retriever(
include_text=False, # include source text in returned nodes, default True
)
nodes = retriever.retrieve("how is continued disinflation justified in the speech?")
for node in nodes:
print(f'node............{node.text}')
query_engine = index.as_query_engine(
include_text=True,
similarity_top_k=3)
response = query_engine.query("how is continued disinflation justified in the speech?")
print(response)
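Wrapped into the FastAPI endpoint mentioned earlier, the same query path might look roughly like this sketch; the route name and app wiring are illustrative, and graph_store plus the imports are set up as shown above.

from fastapi import FastAPI

app = FastAPI()

# Reconnect to the existing graph store and build a query engine once at startup
index = PropertyGraphIndex.from_existing(property_graph_store=graph_store)
query_engine = index.as_query_engine(include_text=True, similarity_top_k=3)

@app.get("/ask")
def ask(question: str):
    # Return the LLM-composed answer for the supplied question
    return {"answer": str(query_engine.query(question))}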
Returning to the query we ran directly, the response to that question was the following elaborate text:
Continued disinflation is justified in the speech due to the following reasons:
1. Inflation, as measured by the personal consumption expenditures (PCE) price index, has eased notably from a peak of 7.2 percent in June 2022. Estimates based on the consumer price index and other data released last week indicate that total PCE prices rose 2.3 percent over the 12 months ending in October. Core PCE prices β which exclude the volatile food and energy categories β increased 2.8 percent, down from a peak of 5.6 percent in February 2022.
2. The labor market is in a good position β with the supply and demand for workers being roughly in balance β such that it is no longer a source of inflationary pressure in the economy.
3. Economic activity is moving along at a strong pace, with real gross domestic product (GDP) increasing at a 2.8 percent annual rate in the third quarter.
4. American consumers remain resilient, with broad-based gains in household spending on both goods and services. This supports broader economic growth because consumer spending constitutes roughly two-thirds of GDP.
5. Housing services account for most of the excess of core inflation over the target, and it is expected that housing services inflation will come down gradually over the next two years as the earlier slowing of growth in new tenant rent feeds through into the overall rate.
As we can see, the retrieval covers a substantial amount of context that speaks to the question in play.
The nodes retrieved are a product of several graph-building factors, such as the number of triplets per chunk, the extractors used, the schema specified and, of course, the LLM used. Similarly, the retriever boilerplate in the last code block can be replaced by custom retrieval mechanisms that combine vector retrieval with specific Cypher retrieval functionality better suited to your use case.
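One such combination, sketched below, pairs the built-in vector retriever with a text-to-Cypher retriever as sub-retrievers; the class names come from llama_index.core.indices.property_graph, while the exact parameters and the sample question are illustrative.

from llama_index.core.indices.property_graph import (
    VectorContextRetriever,
    TextToCypherRetriever,
)

# Vector similarity over graph nodes, plus LLM-generated Cypher for structured hops
vector_retriever = VectorContextRetriever(
    index.property_graph_store,
    embed_model=embed_model,
    similarity_top_k=3,
)
cypher_retriever = TextToCypherRetriever(
    index.property_graph_store,
    llm=llm,
)

hybrid_retriever = index.as_retriever(
    sub_retrievers=[vector_retriever, cypher_retriever],
)
nodes = hybrid_retriever.retrieve("which indicators does the speech link to disinflation?")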
So we end this note while looking forward to traversing more edges in the graph of this fast evolving field!
Attributions:
- https://www.federalreserve.gov/ for the speech by Gov. Cook on Nov 20, 2024 titled Economic Outlook.
- Image from Unsplash
Published via Towards AI