The Complete Guide to Implementing RAG Locally: No Cloud or Frameworks Required
Last Updated on January 3, 2025 by Editorial Team
Author(s): BeastBoyJay
Originally published on Towards AI.
Contents :
- Understanding RAG fundamentals and the building blocks of local RAG
- Creating a local RAG pipeline from scratch without using any frameworks like LangChain or LlamaIndex
What is RAG ?
Retrieval-Augmented Generation (RAG) is a natural language processing technique that produces precise and contextually relevant answers by combining the strength of large language models (LLMs) with an external knowledge retrieval system. In contrast to standard generative models that rely only on what they learned during pre-training, RAG dynamically retrieves data from a connected database or document store during inference. This helps ensure that the output is current, grounded in real-world data, and coherent. By utilizing this hybrid mechanism, RAG is especially useful for applications where accuracy and relevance are crucial, such as customer service, document summarization, and question answering.
Components of RAG :
Retrieval-Augmented Generation, as the name suggests, consists of three main components: Retrieval, Augmentation, and Generation. By breaking down the term, we can easily understand its structure and purpose.
Retrieval :
In this stage, when a user submits a query to the RAG pipeline, the system first retrieves relevant resources based on the user's input.
Augmentation :
Once the relevant resources are retrieved, they are augmented with the user's query using a predefined prompt template.
Generation :
After augmentation, the input for the LLM is ready. This input is then passed to the LLM, which generates the final response.
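To make the flow concrete, here is a toy sketch of the three stages in plain Python. The hard-coded documents and keyword-overlap retrieval are only stand-ins for illustration; the real pipeline below uses an embedding model and an LLM.
# Toy sketch of Retrieval -> Augmentation -> Generation with a hard-coded knowledge base.
documents = [
    "RAG retrieves external documents at inference time.",
    "Vitamin D plays a critical role in calcium absorption.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Retrieval: rank documents by how many words they share with the query.
    query_words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(query_words & set(d.lower().split())))[:k]

def augment(query: str, context: list[str]) -> str:
    # Augmentation: merge the retrieved context and the query into one prompt.
    return "Context:\n- " + "\n- ".join(context) + f"\n\nQuery: {query}\nAnswer:"

query = "What does RAG retrieve?"
prompt = augment(query, retrieve(query, documents))
print(prompt)  # Generation: in the real pipeline this prompt is passed to an LLM.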
Implementation :
Now that you have learned about RAG, it's time for some coding.
Step 1: Setting up Environment
pip install torch PyMuPDF spacy tqdm pandas transformers sentence-transformers
For this project we will be using torch (for tensor operations), PyMuPDF (for reading the PDF), spaCy (for sentence splitting), tqdm (for progress bars), pandas (for saving the embeddings), transformers (for the LLM), and sentence-transformers (for the embedding model).
Step 2 : Document/Text Processing
Internal Steps:
- Import PDF document.
- Process the text for embedding (e.g., split it into chunks of sentences).
import fitz # PyMuPDF
from tqdm import tqdm
from spacy.lang.en import English
import re
class PDF_Processor:
def __init__(self, pdf_path):
self.pdf_path = pdf_path
@staticmethod
def text_formatter(text: str) -> str:
        # Replace newlines with spaces and strip surrounding whitespace.
cleaned_text = text.replace("\n", " ").strip()
return cleaned_text
@staticmethod
def split_list(input_list: list, slice_size: int) -> list[list[str]]:
return [
input_list[i : i + slice_size]
for i in range(0, len(input_list), slice_size)
]
def _read_PDF(self) -> list[dict]:
# Opening the PDF.
try:
pdf_document = fitz.open(self.pdf_path)
except fitz.FileDataError:
print(f"Error: Unable to open PDF file '{self.pdf_path}'.")
pages_and_texts = []
for page_number, page in tqdm(
enumerate(pdf_document), total=len(pdf_document), desc="Reading PDF"
):
# Reading all the pages line by line.
text = page.get_text()
text = self.text_formatter(text)
pages_and_texts.append(
{
"page_number": page_number,
"page_char_count": len(text),
"page_word_count": len(text.split(" ")),
"page_sentence_count_raw": len(text.split(". ")),
"page_token_count": len(text) / 4,
"text": text,
}
)
return pages_and_texts
def _split_sentence(self, pages_and_texts: list):
# Splitting each text into sentences and creating its own object.
nlp = English()
nlp.add_pipe("sentencizer")
for item in tqdm(pages_and_texts, desc="Text to sentence"):
item["sentences"] = list(nlp(item["text"]).sents)
item["sentences"] = [str(sentence) for sentence in item["sentences"]]
item["page_sentence_count_spacy"] = len(item["sentences"])
return pages_and_texts
def _chunk_sentence(self, pages_and_texts: list, chunk_size: int = 10):
        # Group sentences into chunks of `chunk_size` (10 by default).
for item in tqdm(pages_and_texts, desc="Sentence to chunk"):
item["sentence_chunks"] = self.split_list(item["sentences"], chunk_size)
item["page_chunk_count"] = len(item["sentence_chunks"])
return pages_and_texts
def _pages_and_chunks(self, pages_and_texts: list):
        # Split each chunk into its own dictionary along with its metadata.
pages_and_chunks = []
for item in tqdm(pages_and_texts, desc="Splitting each chunk into its own"):
for sentence_chunk in item["sentence_chunks"]:
chunk_dict = {}
chunk_dict["page_number"] = item["page_number"]
# Join the sentence chunks into a single string and clean up any excess spaces.
                joined_sentence_chunk = (
                    "".join(sentence_chunk).replace("  ", " ").strip()
                )
# Fix any missing spaces after periods (e.g., "Hello.World" becomes "Hello. World").
joined_sentence_chunk = re.sub(
r"\.([A-Z])", r". \1", joined_sentence_chunk
)
chunk_dict["sentence_chunk"] = joined_sentence_chunk
chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
chunk_dict["chunk_word_count"] = len(
[word for word in joined_sentence_chunk.split(" ")]
)
chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4
pages_and_chunks.append(chunk_dict)
return pages_and_chunks
def _remove_irrelevant_chunks(self, pages_and_chunks: list):
        # Keep only chunks with more than 30 tokens; smaller chunks are mostly small pieces of text (headers, page numbers) that are irrelevant.
relevant_pages_and_chunks = [
item for item in pages_and_chunks if item["chunk_token_count"] > 30
]
return relevant_pages_and_chunks
def run(self):
pages_and_texts = self._read_PDF() # Read the PDF and extract text.
self._split_sentence(pages_and_texts) # Split text into sentences.
self._chunk_sentence(pages_and_texts) # Chunk sentences into smaller sections.
pages_and_chunks = self._pages_and_chunks(
pages_and_texts
) # Create chunks with metadata.
relevant_pages_and_chunks = self._remove_irrelevant_chunks(
pages_and_chunks
) # Filter out small chunks.
return relevant_pages_and_chunks
The above block of code processes the PDF and returns the relevant text as chunks, along with the associated metadata.
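If you want to inspect the output before moving on, a quick check like the following (assuming a PDF named sample.pdf in the working directory) shows what each chunk dictionary looks like:
# Quick inspection of the processor output (assumes a file named "sample.pdf").
processor = PDF_Processor(pdf_path="sample.pdf")
chunks = processor.run()
print(len(chunks))  # number of chunks that survived the token-count filter
print(chunks[0].keys())  # page_number, sentence_chunk, chunk_char_count, chunk_word_count, chunk_token_count
print(chunks[0]["sentence_chunk"][:200])  # first 200 characters of the first chunk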
Step 3 : Generating embedding and Saving it
from sentence_transformers import SentenceTransformer
import torch
from tqdm import tqdm
import pandas as pd
class SaveEmbeddings:
def __init__(self, pdf_path, embedding_model="all-mpnet-base-v2"):
self.device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize the PDF processor to extract text chunks from the PDF.
self.pdf_processor = PDF_Processor(pdf_path=pdf_path)
# Process the PDF and extract page-wise text chunks.
self.pages_and_chunks = self.pdf_processor.run()
# Load the sentence transformer model.
self.embedding_model = SentenceTransformer(
model_name_or_path=embedding_model, device=self.device
)
def _generate_embeddings(self):
# Generating embeddings from the model.
for item in tqdm(self.pages_and_chunks, desc="Generating embeddings"):
# Generate embeddings for the sentence chunk and add it to the item.
item["embedding"] = self.embedding_model.encode(item["sentence_chunk"])
def _save_embeddings(self):
# Convert the list of dictionaries to a DataFrame for saving as CSV.
data_frame = pd.DataFrame(self.pages_and_chunks)
data_frame.to_csv("embeddings.csv", index=False)
def run(self):
self._generate_embeddings() # Generate embeddings for text chunks.
self._save_embeddings() # Save the embeddings to a CSV file.
The above block of code generates embeddings for each chunk. In this project, I have used the all-mpnet-base-v2 embedding model, which is a robust model with a vector size of 768. It then saves all the embeddings along with their corresponding chunks in a CSV file.
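As a quick sanity check (assuming the embeddings.csv produced above), you can reload the file and confirm that each embedding has 768 dimensions:
# Sanity check: reload the CSV and confirm the embedding dimension.
import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv")
first_embedding = np.fromstring(df["embedding"][0].strip("[]"), sep=" ")
print(df.shape)  # (number_of_chunks, number_of_columns)
print(first_embedding.shape)  # (768,) for all-mpnet-base-v2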
Step 4 : Retrieval
import numpy as np
import pandas as pd
import torch
from sentence_transformers import util, SentenceTransformer
class Semantic_search:
def __init__(self, embeddings_csv: str = "embeddings.csv"):
self.device = (
"cuda" if torch.cuda.is_available() else "cpu"
) # Set device based on availability
self.embeddings_csv = embeddings_csv # Path to embeddings CSV
self.embeddings_df = pd.read_csv(
self.embeddings_csv
) # Load embeddings into DataFrame
# Load pre-trained SentenceTransformer model
self.embedding_model = SentenceTransformer(
model_name_or_path="all-mpnet-base-v2", device=self.device
)
def _process_embeddings(self):
self.embeddings_df["embedding"] = self.embeddings_df["embedding"].apply(
lambda x: np.fromstring(
x.strip("[]"), sep=" "
) # Convert string to numpy array
)
def _get_pages_and_chunks_dict(self):
pages_and_chunks = self.embeddings_df.to_dict(orient="records")
return pages_and_chunks
def _convert_embeddings_to_tensor(self):
return torch.tensor(
np.array(self.embeddings_df["embedding"].tolist()), dtype=torch.float32
).to(self.device)
def _retrieve_relevant_resources(
        self, query: str, embeddings: torch.Tensor, n_resources_to_return: int = 5
):
query_embedding = self.embedding_model.encode(
query, convert_to_tensor=True
) # Encode the query
dot_scores = util.dot_score(query_embedding, embeddings)[
0
] # Calculate dot product scores
scores, indices = torch.topk(
input=dot_scores, k=n_resources_to_return
) # Get top results
return scores, indices
def _get_top_results(
self,
query: str,
        embeddings: torch.Tensor,
pages_and_chunks: list[dict],
n_resources_to_return: int = 5,
):
relevant_chunks = [] # List to store relevant chunks
scores, indices = self._retrieve_relevant_resources(
query=query,
embeddings=embeddings,
n_resources_to_return=n_resources_to_return,
)
# Retrieve the relevant sentence chunks based on the indices
for index in indices:
sentence_chunk = pages_and_chunks[index]["sentence_chunk"]
relevant_chunks.append(sentence_chunk)
return relevant_chunks
def run(self, query: str):
self._process_embeddings() # Process the embeddings to convert string to numpy arrays
pages_and_chunks = (
self._get_pages_and_chunks_dict()
) # Convert embeddings DataFrame to list of dictionaries
embeddings_tensor = (
self._convert_embeddings_to_tensor()
) # Convert embeddings to tensor
relevant_chunks = self._get_top_results(
query=query,
embeddings=embeddings_tensor,
pages_and_chunks=pages_and_chunks,
) # Retrieve the top relevant sentence chunks
return relevant_chunks # Return the relevant chunks
In the above code block, we perform semantic search over the chunk embeddings to retrieve relevant resources based on the query. We use the same embedding model that was used to generate the chunk embeddings. For the semantic search, we compute the dot product of each chunk embedding with the query vector to identify the most similar chunks.
Note: For the semantic search I am using the dot product because these embeddings are normalized. If your embeddings are not normalized, you should use cosine similarity instead.
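If you want to verify this yourself, the small check below (it downloads the embedding model on first run) shows that the dot-product scores and cosine-similarity scores are effectively the same for this model, because its embeddings have unit length:
# Compare dot-product and cosine-similarity scores for all-mpnet-base-v2.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(
    ["Water helps regulate body temperature.", "Vitamin K is needed for blood clotting."],
    convert_to_tensor=True,
)
query_embedding = model.encode("Why is hydration important?", convert_to_tensor=True)

print(util.dot_score(query_embedding, embeddings))  # scores used in the retrieval step above
print(util.cos_sim(query_embedding, embeddings))  # nearly identical because the embeddings are normalized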
Step 5 : Augmentation
class Create_prompt:
def __init__(self):
self.semantic_search = (
Semantic_search()
) # Initialize the Semantic_search instance
self.base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}\n
User query: {query}
Answer:"""
    def _get_relevant_chunks(self, query: str):
relevant_chunks = self.semantic_search.run(
query=query
) # Run semantic search to find relevant context
return relevant_chunks
def _join_chunks(self, relevant_chunks: list):
context = "- " + "\n- ".join(
item for item in relevant_chunks
) # Join chunks with list item format
return context
def run(self, query: str):
        relevant_chunks = self._get_relevant_chunks(
query=query
) # Get relevant context for the query
context = self._join_chunks(relevant_chunks) # Format the context into a string
prompt = self.base_prompt.format(
context=context, query=query
) # Format the base prompt with context and query
return prompt
In the above block of code, I simply insert the retrieved context and the query into the predefined prompt template, which is then passed to the LLM.
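To see what the LLM will actually receive, you can build and print a prompt directly (this assumes embeddings.csv already exists so that Semantic_search can load it):
# Preview the augmented prompt before it is sent to the LLM.
prompt_builder = Create_prompt()
prompt = prompt_builder.run(query="What are the fat-soluble vitamins?")
print(prompt)  # instructions, few-shot examples, retrieved context, and the user query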
Step 6 : Setting up LLM
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
class LLM_Model:
def __init__(self, model_id: str = "tiiuae/Falcon3-3B-Instruct"):
# Set the device based on the availability of a GPU.
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model_id = model_id
# Load the tokenizer for the specified model.
self.tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name_or_path=model_id
)
# Load the language model with the specified configuration.
self.llm_model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=False,
).to(self.device)
# Set the pad token ID to the end-of-sequence (EOS) token if it is not already set.
if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
def _get_model_inputs(self, base_prompt):
# Define a dialogue template with the user's role and content.
dialogue_template = [{"role": "user", "content": base_prompt}]
# Use the tokenizer to apply the chat template to the input prompt.
input_data = self.tokenizer.apply_chat_template(
conversation=dialogue_template, tokenize=False, add_generation_prompt=True
)
# Convert the dialogue into input tensors suitable for the model.
input_data = self.tokenizer(input_data, return_tensors="pt").to(self.device)
return input_data
def run(self, base_prompt):
# Get the model inputs from the base prompt.
input_data = self._get_model_inputs(base_prompt=base_prompt)
# Generate the text output from the model.
output_ids = self.llm_model.generate(
input_ids=input_data["input_ids"],
attention_mask=input_data["attention_mask"],
            max_new_tokens=256,  # cap the length of the generated answer, not the full sequence
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
)
# Decode the generated output_ids to get the text response.
response = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Split the response to remove any extra content added by the model.
response = response.split("<|assistant|>")[-1].strip()
return response
The above code block loads an LLM from Hugging Face and generates text based on the input provided.
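If the 3B model does not fit in your GPU memory, one option that is not used in this guide is to load it in 4-bit with bitsandbytes. The snippet below is only a sketch; it assumes `pip install bitsandbytes accelerate` and a CUDA GPU.
# Optional alternative (not part of the pipeline above): load the model in 4-bit to save GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/Falcon3-3B-Instruct"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate; do not call .to(device) afterwards
)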
Step 7 : Completing the whole Pipeline
import os
class Local_RAG:
def __init__(self, pdf_path):
self.pdf_path = pdf_path # Path to the PDF file
# Check if the embeddings CSV file already exists. If not, generate and save embeddings.
if not os.path.exists("embeddings.csv"):
self.save_embeddings = SaveEmbeddings(
pdf_path=self.pdf_path
) # Initialize Save_Embeddings
self.save_embeddings.run() # Run the embedding saving process
self.create_prompt = (
Create_prompt()
) # Initialize Create_prompt for prompt generation
self.llm_model = (
LLM_Model()
) # Initialize LLM_Model for language model response generation
def run(self, query):
print("Creating Prompt....")
base_prompt = self.create_prompt.run(
query=query
) # Generate the base prompt using the query
print("Generating Results....")
response = self.llm_model.run(
base_prompt=base_prompt
) # Generate a response from the language model
return response # Return the generated response
The code block above ties together all the components of the RAG pipeline, making it ready for use.
Example Usage :
pdf_path = r"your/pdf/path"
local_rag = Local_RAG(pdf_path=pdf_path)
query = "What is the purpose of the paper?"
local_rag.run(query=query)
By running the above code, you can query your own documents with Local_RAG.
Improvements :
Despite the effectiveness of the existing implementation, there are a few areas that might be enhanced for increased scalability and performance:
Model Optimization:
- To increase response relevance and accuracy, try more sophisticated embedding models or alternative LLM architectures.
- Use a dedicated vector index such as FAISS (Facebook AI Similarity Search) to speed up retrieval over large collections of embeddings (see the sketch at the end of this section).
Chunk Management:
- Use more sophisticated chunking techniques (for example, overlapping chunks) to increase the coherence of the generated replies and reduce context loss between chunks.
- To capture deeper context, include extra metadata (such as semantic tags or document section headers) in the chunking process.
Dynamic Document Updates:
- By allowing the system to update document embeddings as new data arrives, you can make the RAG system adapt to new content without having to rebuild the entire embedding store.
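As a starting point for the FAISS idea mentioned above, here is a rough sketch of building an inner-product index over the saved embeddings. It is not part of the pipeline above and assumes `pip install faiss-cpu` and the embeddings.csv produced earlier.
# Rough sketch: replace the brute-force dot product with a FAISS inner-product index.
import faiss
import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv")
embeddings = np.stack(
    df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" ")).tolist()
).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product matches the dot-product scoring above
index.add(embeddings)

# At query time: encode the query with the same embedding model into a float32 array
# of shape (1, 768), then call index.search(query_vector, k=5) to get scores and indices.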
Conclusion :
Local Retrieval-Augmented Generation (RAG) is an effective, privacy-conscious way to improve AI systems, particularly in settings where offline operation is required. By combining retrieval, augmentation, and generation, RAG creates a dynamic system that can provide highly relevant and accurate answers, which makes it well suited for applications like customer service, document summarization, and question answering. This tutorial showed how to build RAG from the ground up without third-party frameworks, giving you full control over the process and its privacy properties.
The step-by-step process covers:
- Installing dependencies and configuring the environment.
- Dividing up documents (like PDFs) and getting them ready for embedding.
- Creating and storing embeddings in a file for later access.
- Putting semantic search into practice to get pertinent document sections in response to user inquiries.
- Adding the retrieved context to the query and formatting it for the language model.
- Running the LLM on the augmented query to produce contextual responses.
Check out the full implementation on GitHub: https://github.com/BEASTBOYJAY/Local_RAG