
Introduction to RAG: Basics to Mastery. 1-Build Your Own Local RAG Pipeline (No Cloud, No API Keys)
Author(s): Taha Azizi
Originally published on Towards AI.
Part 1 of the Introduction to RAG mini-series
A step-by-step guide to running Retrieval-Augmented Generation fully offline with Ollama, ChromaDB, and SentenceTransformers.

Introduction
Large Language Models (LLMs) are powerful, but they come with three big limitations:
- They often “hallucinate” — generating answers that sound correct but are factually wrong.
- Their knowledge is frozen at training time.
- Their answers draw on generic knowledge rather than your specific documents or domain.
Retrieval-Augmented Generation (RAG) addresses all three problems by giving the model access to an external knowledge base. Instead of relying only on what the model has memorized, RAG retrieves relevant facts from your documents and injects them into the prompt.
Originally proposed by Facebook AI Research in 2020 [1], RAG has quickly become a core technique behind production AI systems. OpenAI uses it in enterprise ChatGPT deployments, Cohere builds retrieval into its APIs, and many startups rely on it to deliver trustworthy AI applications.
In this first part of my mini-series on RAG, I’ll show you how to:
- Store documents in a vector database using Chroma
- Create embeddings with SentenceTransformers
- Run a local LLM using Ollama
- Combine them into a working offline RAG pipeline
By the end, you’ll have a private, local assistant that can answer questions with grounded knowledge — no cloud, no API keys, and no cost.
Theory: How RAG Works
At its core, RAG combines two key ideas: retrieval and generation.
- Document Ingestion: You load your dataset (text, PDFs, CSVs, etc.) and split it into manageable chunks (e.g., 500 characters).
- Embedding Generation: Each chunk is converted into a high-dimensional vector (embedding) using a pretrained model. Embeddings capture semantic meaning — chunks about “solar panels” and “wind farms” will be close together in vector space.
- Vector Database Storage: These embeddings are stored in a database optimized for similarity search (e.g., Chroma, FAISS, Pinecone).
- Query + Retrieval: When a user asks a question, it’s also embedded into a vector. The system retrieves the most semantically similar chunks from the database.
- Augmented Generation: Retrieved chunks are added to the LLM’s prompt, grounding its answer in external knowledge rather than its static training set.
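Put together, the whole loop fits in a few lines. The sketch below is purely conceptual: embed, vector_store, and llm are hypothetical placeholders for the components we build later in this article, not real APIs.
# Conceptual sketch of the RAG loop (hypothetical helpers, not meant to run as-is)
def rag_answer(question, vector_store, embed, llm, k=3):
    # 1. Embed the question the same way the documents were embedded
    q_vec = embed(question)
    # 2. Retrieve the k most similar chunks from the vector database
    chunks = vector_store.search(q_vec, k=k)
    # 3. Augment the prompt with the retrieved context
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}\nAnswer:"
    # 4. Generate an answer grounded in that context
    return llm(prompt)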
Why does this matter?
- Hallucination reduction: The LLM is forced to cite context rather than guess.
- Knowledge freshness: You can update the database without retraining the model.
- Domain adaptation: Add proprietary or niche documents that the LLM was never trained on.
This approach has been validated across multiple domains, from open-domain question answering [1], to enterprise search [2], to clinical decision support [3].
In short: RAG makes your LLM smarter, safer, and more useful.

Setup
We’ll use:
- Ollama — to run an LLM like mistral locally.
- SentenceTransformers — to embed our text.
- ChromaDB — to store and search embeddings.
Requirements:
- Python 3.10+
- A capable graphics card — NVIDIA RTX 4060 or above, with CUDA installed
- 32+ GB RAM
Install dependencies:
pip install chromadb sentence-transformers ollama
Install Ollama and pull a model:
ollama pull mistral
Step-by-Step Code
Step 1. Load Documents
We’ll use a simple .txt file for this demo.
# load_docs.py
from pathlib import Path

def load_text_files(folder_path):
    # Read every .txt file in the folder and return their contents as a list of strings
    texts = []
    for file in Path(folder_path).glob("*.txt"):
        with open(file, "r", encoding="utf-8") as f:
            texts.append(f.read())
    return texts

docs = load_text_files("./data")
print(f"Loaded {len(docs)} documents.")
Step 2. Chunk Text
# chunking.py
def chunk_text(text, chunk_size=500, overlap=50):
    # Split text into fixed-size chunks with a small overlap so context isn't cut off at chunk boundaries
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

all_chunks = []
for doc in docs:
    all_chunks.extend(chunk_text(doc))
print(f"Total chunks: {len(all_chunks)}")
Step 3. Create Embeddings
We’ll use all-MiniLM-L6-v2 for fast, GPU-accelerated embeddings.
# embeddings.py
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(all_chunks, convert_to_numpy=True, show_progress_bar=True)
print(f"Created embeddings with shape: {embeddings.shape}")
Step 4. Store in Chroma
# store.py
import chromadb

# Create an in-memory Chroma client and a collection for our chunks
client = chromadb.Client()
collection = client.create_collection(name="local_rag")

# Add each chunk together with its precomputed embedding and a unique ID
for i, chunk in enumerate(all_chunks):
    collection.add(
        ids=[f"chunk_{i}"],
        documents=[chunk],
        embeddings=[embeddings[i].tolist()]  # convert the numpy vector to a plain list of floats
    )
print("Chunks stored in Chroma.")
Step 5. Query + Retrieval
# query.py
query = "What does the document say about renewable energy?"

# Embed the question with the same model used for the document chunks
query_embedding = embedder.encode([query], convert_to_numpy=True)[0]

# Retrieve the 3 most semantically similar chunks
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=3
)

for i, doc in enumerate(results["documents"][0]):
    print(f"Result {i+1}: {doc}\n")
Step 6. Generate Answer with Ollama
We now send retrieved context to the LLM.
# rag.py
import subprocess

# Build a prompt that restricts the model to the retrieved context
context = "\n".join(results["documents"][0])
prompt = f"Answer the question using only the following context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Call the local model through the Ollama CLI
ollama_cmd = ["ollama", "run", "mistral", prompt]  # or your specific model
response = subprocess.run(ollama_cmd, capture_output=True, text=True)
print("LLM Response:\n", response.stdout)
Full Workflow Script
You can merge all the steps into a single rag_basic.py file so you can run:
# refer to the GitHub repository for the complete script
python rag_basic.py
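If you'd rather not jump straight to the repository, a minimal skeleton of what rag_basic.py can look like is sketched below. It simply strings the previous steps together and assumes the load_text_files and chunk_text functions defined above live in the same file.
# rag_basic.py -- minimal skeleton tying the steps together (see the repo for the full version)
import subprocess
import chromadb
from sentence_transformers import SentenceTransformer

def main():
    docs = load_text_files("./data")                          # Step 1: load documents
    chunks = [c for d in docs for c in chunk_text(d)]         # Step 2: chunk text
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vectors = embedder.encode(chunks, convert_to_numpy=True)  # Step 3: embeddings
    collection = chromadb.Client().create_collection("local_rag")
    collection.add(                                           # Step 4: store in Chroma
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=vectors.tolist(),
    )
    query = "What does the document say about renewable energy?"
    q_vec = embedder.encode([query], convert_to_numpy=True)[0]
    hits = collection.query(query_embeddings=[q_vec.tolist()], n_results=3)  # Step 5: retrieve
    context = "\n".join(hits["documents"][0])
    prompt = f"Answer the question using only the following context:\n{context}\n\nQuestion: {query}\nAnswer:"
    out = subprocess.run(["ollama", "run", "mistral", prompt], capture_output=True, text=True)
    print("LLM Response:\n", out.stdout)                      # Step 6: generate

if __name__ == "__main__":
    main()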
Expected Output
Result 1: Renewable energy sources such as solar and wind...
Result 2: The government plans to expand green infrastructure...
Result 3: A report on renewable adoption in rural areas...
LLM Response:
Renewable energy sources, particularly solar and wind, are being prioritized...
Conclusion & Next Steps
Congratulations — you now have a fully offline RAG system running on your own machine. You can:
- Add more documents
- Experiment with different embedding models
- Swap in other Ollama LLMs
In Part 2 of this series, I’ll dive into Hybrid RAG — combining semantic search with keyword-based BM25 for even more accurate retrieval.
💬 I’d love to hear how you’re experimenting with RAG. What challenges have you faced running it locally? Share your experiences in the comments — and follow me to catch the next article in this series.
References
[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020), NeurIPS.
[2] Jay Alammar, The Illustrated Retrieval-Augmented Generation (2023), Cohere Blog.
[3] OpenAI, Reducing Hallucinations in LLMs with Retrieval-Augmented Generation (2023), OpenAI Research.
Github Repository: https://github.com/Taha-azizi/RAG
All images were generated by the author using AI tools.
Published via Towards AI