The Complete Guide to Implementing RAG Locally: No Cloud or Frameworks Required
Last Updated on January 3, 2025 by Editorial Team
Author(s): BeastBoyJay
Originally published on Towards AI.
Contents :
- Understanding RAG fundamentals and the building blocks of local RAG
- Creating a local RAG pipeline from scratch without using any frameworks like LangChain or LlamaIndex
What is RAG ?
Retrieval-Augmented Generation (RAG) is a natural language processing technique that produces precise and contextually relevant answers by combining the strength of large language models (LLMs) with an external knowledge retrieval system. In contrast to standard generative models that rely only on what they learned during pre-training, RAG dynamically retrieves data from a connected database or document store during inference. This helps ensure that the output is current, grounded in real-world data, and coherent. By utilizing this hybrid mechanism, RAG is especially useful for applications where accuracy and relevance are crucial, such as customer service, document summarization, and question answering.
Components of RAG :
Retrieval-Augmented Generation, as the name suggests, consists of three main components: Retrieval, Augmentation, and Generation. By breaking down the term, we can easily understand its structure and purpose.
Retrieval :
In this stage, when a user submits a query to the RAG pipeline, the system first retrieves relevant resources based on the user's input.
Augmentation :
Once the relevant resources are retrieved, they are augmented with the user's query using a predefined prompt template.
Generation :
After augmentation, the input for the LLM is ready. This input is then passed to the LLM, which generates the final response.
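To make the flow concrete, here is a toy sketch of the three stages in plain Python. The hard-coded documents and keyword-overlap retrieval are only stand-ins for illustration; the real pipeline below uses an embedding model and an LLM.
# Toy sketch of Retrieval -> Augmentation -> Generation with a hard-coded knowledge base.
documents = [
    "RAG retrieves external documents at inference time.",
    "Vitamin D plays a critical role in calcium absorption.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Retrieval: rank documents by how many words they share with the query.
    query_words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(query_words & set(d.lower().split())))[:k]

def augment(query: str, context: list[str]) -> str:
    # Augmentation: merge the retrieved context and the query into one prompt.
    return "Context:\n- " + "\n- ".join(context) + f"\n\nQuery: {query}\nAnswer:"

query = "What does RAG retrieve?"
prompt = augment(query, retrieve(query, documents))
print(prompt)  # Generation: in the real pipeline this prompt is passed to an LLM.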
Implementation :
Now that you have learned about RAG, it's time for some coding.
Step 1: Setting up Environment
pip install torch PyMuPDF spacy tqdm pandas transformers sentence-transformers
For this project we will be using torch (for tensor operations), PyMuPDF (for reading the PDF), spaCy (for sentence splitting), tqdm (for progress bars), pandas (for saving the embeddings), transformers (for the LLM), and sentence-transformers (for the embedding model).
Step 2 : Document/Text Processing
Internal Steps:
- Import PDF document.
- Process the text for embedding (e.g., split it into chunks of sentences).
import fitz # PyMuPDF
from tqdm import tqdm
from spacy.lang.en import English
import re
class PDF_Processor:
def __init__(self, pdf_path):
self.pdf_path = pdf_path
@staticmethod
def text_formatter(text: str) -> str:
        # Replace newlines with spaces and strip surrounding whitespace.
cleaned_text = text.replace("\n", " ").strip()
return cleaned_text
@staticmethod
def split_list(input_list: list, slice_size: int) -> list[list[str]]:
return [
input_list[i : i + slice_size]
for i in range(0, len(input_list), slice_size)
]
def _read_PDF(self) -> list[dict]:
# Opening the PDF.
try:
pdf_document = fitz.open(self.pdf_path)
except fitz.FileDataError:
print(f"Error: Unable to open PDF file '{self.pdf_path}'.")
pages_and_texts = []
for page_number, page in tqdm(
enumerate(pdf_document), total=len(pdf_document), desc="Reading PDF"
):
# Reading all the pages line by line.
text = page.get_text()
text = self.text_formatter(text)
pages_and_texts.append(
{
"page_number": page_number,
"page_char_count": len(text),
"page_word_count": len(text.split(" ")),
"page_sentence_count_raw": len(text.split(". ")),
"page_token_count": len(text) / 4,
"text": text,
}
)
return pages_and_texts
def _split_sentence(self, pages_and_texts: list):
# Splitting each text into sentences and creating its own object.
nlp = English()
nlp.add_pipe("sentencizer")
for item in tqdm(pages_and_texts, desc="Text to sentence"):
item["sentences"] = list(nlp(item["text"]).sents)
item["sentences"] = [str(sentence) for sentence in item["sentences"]]
item["page_sentence_count_spacy"] = len(item["sentences"])
return pages_and_texts
def _chunk_sentence(self, pages_and_texts: list, chunk_size: int = 10):
        # Group sentences into chunks of `chunk_size` (10 by default).
for item in tqdm(pages_and_texts, desc="Sentence to chunk"):
item["sentence_chunks"] = self.split_list(item["sentences"], chunk_size)
item["page_chunk_count"] = len(item["sentence_chunks"])
return pages_and_texts
def _pages_and_chunks(self, pages_and_texts: list):
        # Split each chunk into its own dictionary along with its metadata.
pages_and_chunks = []
for item in tqdm(pages_and_texts, desc="Splitting each chunk into its own"):
for sentence_chunk in item["sentence_chunks"]:
chunk_dict = {}
chunk_dict["page_number"] = item["page_number"]
# Join the sentence chunks into a single string and clean up any excess spaces.
                joined_sentence_chunk = (
                    "".join(sentence_chunk).replace("  ", " ").strip()
                )
# Fix any missing spaces after periods (e.g., "Hello.World" becomes "Hello. World").
joined_sentence_chunk = re.sub(
r"\.([A-Z])", r". \1", joined_sentence_chunk
)
chunk_dict["sentence_chunk"] = joined_sentence_chunk
chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
chunk_dict["chunk_word_count"] = len(
[word for word in joined_sentence_chunk.split(" ")]
)
chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4
pages_and_chunks.append(chunk_dict)
return pages_and_chunks
def _remove_irrelevant_chunks(self, pages_and_chunks: list):
        # Keep only chunks with more than 30 tokens; smaller chunks are mostly small pieces of text (headers, page numbers) that are irrelevant.
relevant_pages_and_chunks = [
item for item in pages_and_chunks if item["chunk_token_count"] > 30
]
return relevant_pages_and_chunks
def run(self):
pages_and_texts = self._read_PDF() # Read the PDF and extract text.
self._split_sentence(pages_and_texts) # Split text into sentences.
self._chunk_sentence(pages_and_texts) # Chunk sentences into smaller sections.
pages_and_chunks = self._pages_and_chunks(
pages_and_texts
) # Create chunks with metadata.
relevant_pages_and_chunks = self._remove_irrelevant_chunks(
pages_and_chunks
) # Filter out small chunks.
return relevant_pages_and_chunks
The above block of code processes the PDF and returns the relevant text as chunks, along with the associated metadata.
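If you want to inspect the output before moving on, a quick check like the following (assuming a PDF named sample.pdf in the working directory) shows what each chunk dictionary looks like:
# Quick inspection of the processor output (assumes a file named "sample.pdf").
processor = PDF_Processor(pdf_path="sample.pdf")
chunks = processor.run()
print(len(chunks))  # number of chunks that survived the token-count filter
print(chunks[0].keys())  # page_number, sentence_chunk, chunk_char_count, chunk_word_count, chunk_token_count
print(chunks[0]["sentence_chunk"][:200])  # first 200 characters of the first chunk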
Step 3 : Generating embedding and Saving it
from sentence_transformers import SentenceTransformer
import torch
from tqdm import tqdm
import pandas as pd
class SaveEmbeddings:
def __init__(self, pdf_path, embedding_model="all-mpnet-base-v2"):
self.device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize the PDF processor to extract text chunks from the PDF.
self.pdf_processor = PDF_Processor(pdf_path=pdf_path)
# Process the PDF and extract page-wise text chunks.
self.pages_and_chunks = self.pdf_processor.run()
# Load the sentence transformer model.
self.embedding_model = SentenceTransformer(
model_name_or_path=embedding_model, device=self.device
)
def _generate_embeddings(self):
# Generating embeddings from the model.
for item in tqdm(self.pages_and_chunks, desc="Generating embeddings"):
# Generate embeddings for the sentence chunk and add it to the item.
item["embedding"] = self.embedding_model.encode(item["sentence_chunk"])
def _save_embeddings(self):
# Convert the list of dictionaries to a DataFrame for saving as CSV.
data_frame = pd.DataFrame(self.pages_and_chunks)
data_frame.to_csv("embeddings.csv", index=False)
def run(self):
self._generate_embeddings() # Generate embeddings for text chunks.
self._save_embeddings() # Save the embeddings to a CSV file.
The above block of code generates embeddings for each chunk. In this project, I have used the all-mpnet-base-v2 embedding model, which is a robust model with a vector size of 768. It then saves all the embeddings along with their corresponding chunks in a CSV file.
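As a quick sanity check (assuming the embeddings.csv produced above), you can reload the file and confirm that each embedding has 768 dimensions:
# Sanity check: reload the CSV and confirm the embedding dimension.
import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv")
first_embedding = np.fromstring(df["embedding"][0].strip("[]"), sep=" ")
print(df.shape)  # (number_of_chunks, number_of_columns)
print(first_embedding.shape)  # (768,) for all-mpnet-base-v2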
Step 4 : Retrieval
import numpy as np
import pandas as pd
import torch
from sentence_transformers import util, SentenceTransformer
class Semantic_search:
def __init__(self, embeddings_csv: str = "embeddings.csv"):
self.device = (
"cuda" if torch.cuda.is_available() else "cpu"
) # Set device based on availability
self.embeddings_csv = embeddings_csv # Path to embeddings CSV
self.embeddings_df = pd.read_csv(
self.embeddings_csv
) # Load embeddings into DataFrame
# Load pre-trained SentenceTransformer model
self.embedding_model = SentenceTransformer(
model_name_or_path="all-mpnet-base-v2", device=self.device
)
def _process_embeddings(self):
self.embeddings_df["embedding"] = self.embeddings_df["embedding"].apply(
lambda x: np.fromstring(
x.strip("[]"), sep=" "
) # Convert string to numpy array
)
def _get_pages_and_chunks_dict(self):
pages_and_chunks = self.embeddings_df.to_dict(orient="records")
return pages_and_chunks
def _convert_embeddings_to_tensor(self):
return torch.tensor(
np.array(self.embeddings_df["embedding"].tolist()), dtype=torch.float32
).to(self.device)
def _retrieve_relevant_resources(
        self, query: str, embeddings: torch.Tensor, n_resources_to_return: int = 5
):
query_embedding = self.embedding_model.encode(
query, convert_to_tensor=True
) # Encode the query
dot_scores = util.dot_score(query_embedding, embeddings)[
0
] # Calculate dot product scores
scores, indices = torch.topk(
input=dot_scores, k=n_resources_to_return
) # Get top results
return scores, indices
def _get_top_results(
self,
query: str,
        embeddings: torch.Tensor,
pages_and_chunks: list[dict],
n_resources_to_return: int = 5,
):
relevant_chunks = [] # List to store relevant chunks
scores, indices = self._retrieve_relevant_resources(
query=query,
embeddings=embeddings,
n_resources_to_return=n_resources_to_return,
)
# Retrieve the relevant sentence chunks based on the indices
for index in indices:
sentence_chunk = pages_and_chunks[index]["sentence_chunk"]
relevant_chunks.append(sentence_chunk)
return relevant_chunks
def run(self, query: str):
self._process_embeddings() # Process the embeddings to convert string to numpy arrays
pages_and_chunks = (
self._get_pages_and_chunks_dict()
) # Convert embeddings DataFrame to list of dictionaries
embeddings_tensor = (
self._convert_embeddings_to_tensor()
) # Convert embeddings to tensor
relevant_chunks = self._get_top_results(
query=query,
embeddings=embeddings_tensor,
pages_and_chunks=pages_and_chunks,
) # Retrieve the top relevant sentence chunks
return relevant_chunks # Return the relevant chunks
In the above code block, we perform semantic search over the chunk embeddings to retrieve relevant resources based on the query. We use the same embedding model that was used to generate the chunk embeddings. For the semantic search, we compute the dot product of each chunk embedding with the query vector to identify the most similar chunks.
Note: For the semantic search I am using the dot product because these embeddings are normalized. If your embeddings are not normalized, you should use cosine similarity instead.
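If you want to verify this yourself, the small check below (it downloads the embedding model on first run) shows that the dot-product scores and cosine-similarity scores are effectively the same for this model, because its embeddings have unit length:
# Compare dot-product and cosine-similarity scores for all-mpnet-base-v2.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(
    ["Water helps regulate body temperature.", "Vitamin K is needed for blood clotting."],
    convert_to_tensor=True,
)
query_embedding = model.encode("Why is hydration important?", convert_to_tensor=True)

print(util.dot_score(query_embedding, embeddings))  # scores used in the retrieval step above
print(util.cos_sim(query_embedding, embeddings))  # nearly identical because the embeddings are normalized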
Step 5 : Augmentation
class Create_prompt:
def __init__(self):
self.semantic_search = (
Semantic_search()
) # Initialize the Semantic_search instance
self.base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}\n
User query: {query}
Answer:"""
    def _get_relevant_chunks(self, query: str):
relevant_chunks = self.semantic_search.run(
query=query
) # Run semantic search to find relevant context
return relevant_chunks
def _join_chunks(self, relevant_chunks: list):
context = "- " + "\n- ".join(
item for item in relevant_chunks
) # Join chunks with list item format
return context
def run(self, query: str):
        relevant_chunks = self._get_relevant_chunks(
query=query
) # Get relevant context for the query
context = self._join_chunks(relevant_chunks) # Format the context into a string
prompt = self.base_prompt.format(
context=context, query=query
) # Format the base prompt with context and query
return prompt
In the above block of code, I simply insert the retrieved context and the query into the predefined prompt template, which is then passed to the LLM.
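To see what the LLM will actually receive, you can build and print a prompt directly (this assumes embeddings.csv already exists so that Semantic_search can load it):
# Preview the augmented prompt before it is sent to the LLM.
prompt_builder = Create_prompt()
prompt = prompt_builder.run(query="What are the fat-soluble vitamins?")
print(prompt)  # instructions, few-shot examples, retrieved context, and the user query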
Step 6 : Setting up LLM
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
class LLM_Model:
def __init__(self, model_id: str = "tiiuae/Falcon3-3B-Instruct"):
# Set the device based on the availability of a GPU.
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model_id = model_id
# Load the tokenizer for the specified model.
self.tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name_or_path=model_id
)
# Load the language model with the specified configuration.
self.llm_model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=False,
).to(self.device)
# Set the pad token ID to the end-of-sequence (EOS) token if it is not already set.
if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
def _get_model_inputs(self, base_prompt):
# Define a dialogue template with the user's role and content.
dialogue_template = [{"role": "user", "content": base_prompt}]
# Use the tokenizer to apply the chat template to the input prompt.
input_data = self.tokenizer.apply_chat_template(
conversation=dialogue_template, tokenize=False, add_generation_prompt=True
)
# Convert the dialogue into input tensors suitable for the model.
input_data = self.tokenizer(input_data, return_tensors="pt").to(self.device)
return input_data
def run(self, base_prompt):
# Get the model inputs from the base prompt.
input_data = self._get_model_inputs(base_prompt=base_prompt)
# Generate the text output from the model.
output_ids = self.llm_model.generate(
input_ids=input_data["input_ids"],
attention_mask=input_data["attention_mask"],
            max_new_tokens=256,  # cap the length of the generated answer, not the full sequence
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
)
# Decode the generated output_ids to get the text response.
response = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Split the response to remove any extra content added by the model.
response = response.split("<|assistant|>")[-1].strip()
return response
The above code block loads an LLM from Hugging Face and generates text based on the input provided.
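If the 3B model does not fit in your GPU memory, one option that is not used in this guide is to load it in 4-bit with bitsandbytes. The snippet below is only a sketch; it assumes `pip install bitsandbytes accelerate` and a CUDA GPU.
# Optional alternative (not part of the pipeline above): load the model in 4-bit to save GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/Falcon3-3B-Instruct"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate; do not call .to(device) afterwards
)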
Step 7 : Completing the whole Pipeline
import os
class Local_RAG:
def __init__(self, pdf_path):
self.pdf_path = pdf_path # Path to the PDF file
# Check if the embeddings CSV file already exists. If not, generate and save embeddings.
if not os.path.exists("embeddings.csv"):
self.save_embeddings = SaveEmbeddings(
pdf_path=self.pdf_path
) # Initialize Save_Embeddings
self.save_embeddings.run() # Run the embedding saving process
self.create_prompt = (
Create_prompt()
) # Initialize Create_prompt for prompt generation
self.llm_model = (
LLM_Model()
) # Initialize LLM_Model for language model response generation
def run(self, query):
print("Creating Prompt....")
base_prompt = self.create_prompt.run(
query=query
) # Generate the base prompt using the query
print("Generating Results....")
response = self.llm_model.run(
base_prompt=base_prompt
) # Generate a response from the language model
return response # Return the generated response
The code block above ties together all the components of the RAG pipeline, making it ready for use.
Example Usage :
pdf_path = r"your/pdf/path"
local_rag = Local_RAG(pdf_path=pdf_path)
query = "What is the purpose of the paper?"
local_rag.run(query=query)
By running the above code, you can query your own documents with Local_RAG.
Improvements :
Despite the effectiveness of the existing implementation, there are a few areas that might be enhanced for increased scalability and performance:
Model Optimization:
- To increase response relevance and accuracy, try more sophisticated embedding models or alternative LLM architectures.
- Use a dedicated vector index such as FAISS (Facebook AI Similarity Search) to speed up retrieval over large collections of embeddings (see the sketch at the end of this section).
Chunk Management:
- Use more sophisticated chunking techniques (for example, overlapping chunks) to increase the coherence of the generated replies and reduce context loss between chunks.
- To capture deeper context, include extra metadata (such as semantic tags or document section headers) in the chunking process.
Dynamic Document Updates:
- By allowing the system to update document embeddings as new data arrives, you can make the RAG system adapt to new content without having to rebuild the entire embedding store.
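As a starting point for the FAISS idea mentioned above, here is a rough sketch of building an inner-product index over the saved embeddings. It is not part of the pipeline above and assumes `pip install faiss-cpu` and the embeddings.csv produced earlier.
# Rough sketch: replace the brute-force dot product with a FAISS inner-product index.
import faiss
import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv")
embeddings = np.stack(
    df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" ")).tolist()
).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product matches the dot-product scoring above
index.add(embeddings)

# At query time: encode the query with the same embedding model into a float32 array
# of shape (1, 768), then call index.search(query_vector, k=5) to get scores and indices.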
Conclusion :
Local Retrieval-Augmented Generation (RAG) is an effective, privacy-conscious way to improve AI systems, particularly in settings where offline operation is required. By combining retrieval, augmentation, and generation, RAG creates a dynamic system that can provide highly relevant and accurate answers, which makes it well suited for applications like customer service, document summarization, and question answering. This tutorial showed how to build RAG from the ground up without third-party frameworks, giving you full control over the process and its privacy properties.
The step-by-step process covers:
- Installing dependencies and configuring the environment.
- Dividing up documents (like PDFs) and getting them ready for embedding.
- Creating and storing embeddings in a file for later access.
- Putting semantic search into practice to get pertinent document sections in response to user inquiries.
- Adding the retrieved context to the query and formatting it for the language model.
- Running the LLM on the augmented query to produce contextual responses.
Check out the full implementation on GitHub: https://github.com/BEASTBOYJAY/Local_RAG