
Introduction to RAG: Basics to Mastery. 1-Build Your Own Local RAG Pipeline (No Cloud, No API Keys)
Author(s): Taha Azizi
Originally published on Towards AI.
Part 1 of the Introduction to RAG mini-series
A step-by-step guide to running Retrieval-Augmented Generation fully offline with Ollama, ChromaDB, and SentenceTransformers.

Introduction
Large Language Models (LLMs) are powerful, but they come with three big limitations:
- They often “hallucinate” — generating answers that sound correct but are factually wrong.
- Their knowledge is frozen at training time.
- Their answers draw on generic knowledge rather than your specific documents or domain.
Retrieval-Augmented Generation (RAG) addresses all three problems by giving the model access to an external knowledge base. Instead of relying only on what the model has memorized, RAG retrieves relevant facts from your documents and injects them into the prompt.
Originally proposed by Facebook AI Research in 2020 [1], RAG has quickly become a core technique behind production AI systems. OpenAI uses it in enterprise ChatGPT deployments, Cohere builds retrieval into its APIs, and many startups rely on it to deliver trustworthy AI applications.
In this first part of my mini-series on RAG, I’ll show you how to:
- Store documents in a vector database using Chroma
- Create embeddings with SentenceTransformers
- Run a local LLM using Ollama
- Combine them into a working offline RAG pipeline
By the end, you’ll have a private, local assistant that can answer questions with grounded knowledge — no cloud, no API keys, and no cost.
Theory: How RAG Works
At its core, RAG combines two key ideas: retrieval and generation.
- Document Ingestion: You load your dataset (text, PDFs, CSVs, etc.) and split it into manageable chunks (e.g., 500 characters).
- Embedding Generation: Each chunk is converted into a high-dimensional vector (embedding) using a pretrained model. Embeddings capture semantic meaning — chunks about “solar panels” and “wind farms” will be close together in vector space.
- Vector Database Storage: These embeddings are stored in a database optimized for similarity search (e.g., Chroma, FAISS, Pinecone).
- Query + Retrieval: When a user asks a question, it’s also embedded into a vector. The system retrieves the most semantically similar chunks from the database.
- Augmented Generation: Retrieved chunks are added to the LLM’s prompt, grounding its answer in external knowledge rather than its static training set.
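Put together, the whole loop fits in a few lines. The sketch below is purely conceptual: embed, vector_store, and llm are hypothetical placeholders for the components we build later in this article, not real APIs.
# Conceptual sketch of the RAG loop (hypothetical helpers, not meant to run as-is)
def rag_answer(question, vector_store, embed, llm, k=3):
    # 1. Embed the question the same way the documents were embedded
    q_vec = embed(question)
    # 2. Retrieve the k most similar chunks from the vector database
    chunks = vector_store.search(q_vec, k=k)
    # 3. Augment the prompt with the retrieved context
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}\nAnswer:"
    # 4. Generate an answer grounded in that context
    return llm(prompt)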
Why does this matter?
- Hallucination reduction: The LLM is forced to cite context rather than guess.
- Knowledge freshness: You can update the database without retraining the model.
- Domain adaptation: Add proprietary or niche documents that the LLM was never trained on.
This approach has been validated across multiple domains, from open-domain question answering [1], to enterprise search [2], to clinical decision support [3].
In short: RAG makes your LLM smarter, safer, and more useful.

Setup
We’ll use:
- Ollama — to run an LLM like mistral locally.
- SentenceTransformers — to embed our text.
- ChromaDB — to store and search embeddings.
Requirements:
- Python 3.10+
- A capable graphics card — NVIDIA RTX 4060 or above, with CUDA installed
- 32+ GB RAM
Install dependencies:
pip install chromadb sentence-transformers ollama
Install Ollama and pull a model:
ollama pull mistral
Step-by-Step Code
Step 1. Load Documents
We’ll use a simple .txt file for this demo.
# load_docs.py
from pathlib import Path

def load_text_files(folder_path):
    # Read every .txt file in the folder and return their contents as a list of strings
    texts = []
    for file in Path(folder_path).glob("*.txt"):
        with open(file, "r", encoding="utf-8") as f:
            texts.append(f.read())
    return texts

docs = load_text_files("./data")
print(f"Loaded {len(docs)} documents.")
Step 2. Chunk Text
# chunking.py
def chunk_text(text, chunk_size=500, overlap=50):
    # Split text into fixed-size chunks with a small overlap so context isn't cut off at chunk boundaries
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

all_chunks = []
for doc in docs:
    all_chunks.extend(chunk_text(doc))
print(f"Total chunks: {len(all_chunks)}")
Step 3. Create Embeddings
We’ll use all-MiniLM-L6-v2 for fast, GPU-accelerated embeddings.
# embeddings.py
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(all_chunks, convert_to_numpy=True, show_progress_bar=True)
print(f"Created embeddings with shape: {embeddings.shape}")
Step 4. Store in Chroma
# store.py
import chromadb

# Create an in-memory Chroma client and a collection for our chunks
client = chromadb.Client()
collection = client.create_collection(name="local_rag")

# Add each chunk together with its precomputed embedding and a unique ID
for i, chunk in enumerate(all_chunks):
    collection.add(
        ids=[f"chunk_{i}"],
        documents=[chunk],
        embeddings=[embeddings[i].tolist()]  # convert the numpy vector to a plain list of floats
    )
print("Chunks stored in Chroma.")
Step 5. Query + Retrieval
# query.py
query = "What does the document say about renewable energy?"

# Embed the question with the same model used for the document chunks
query_embedding = embedder.encode([query], convert_to_numpy=True)[0]

# Retrieve the 3 most semantically similar chunks
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=3
)

for i, doc in enumerate(results["documents"][0]):
    print(f"Result {i+1}: {doc}\n")
Step 6. Generate Answer with Ollama
We now send retrieved context to the LLM.
# rag.py
import subprocess

# Build a prompt that restricts the model to the retrieved context
context = "\n".join(results["documents"][0])
prompt = f"Answer the question using only the following context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Call the local model through the Ollama CLI
ollama_cmd = ["ollama", "run", "mistral", prompt]  # or your specific model
response = subprocess.run(ollama_cmd, capture_output=True, text=True)
print("LLM Response:\n", response.stdout)
Full Workflow Script
You can merge all the steps into a single rag_basic.py file so you can run:
# refer to the GitHub repository for the complete script
python rag_basic.py
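If you'd rather not jump straight to the repository, a minimal skeleton of what rag_basic.py can look like is sketched below. It simply strings the previous steps together and assumes the load_text_files and chunk_text functions defined above live in the same file.
# rag_basic.py -- minimal skeleton tying the steps together (see the repo for the full version)
import subprocess
import chromadb
from sentence_transformers import SentenceTransformer

def main():
    docs = load_text_files("./data")                          # Step 1: load documents
    chunks = [c for d in docs for c in chunk_text(d)]         # Step 2: chunk text
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vectors = embedder.encode(chunks, convert_to_numpy=True)  # Step 3: embeddings
    collection = chromadb.Client().create_collection("local_rag")
    collection.add(                                           # Step 4: store in Chroma
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=vectors.tolist(),
    )
    query = "What does the document say about renewable energy?"
    q_vec = embedder.encode([query], convert_to_numpy=True)[0]
    hits = collection.query(query_embeddings=[q_vec.tolist()], n_results=3)  # Step 5: retrieve
    context = "\n".join(hits["documents"][0])
    prompt = f"Answer the question using only the following context:\n{context}\n\nQuestion: {query}\nAnswer:"
    out = subprocess.run(["ollama", "run", "mistral", prompt], capture_output=True, text=True)
    print("LLM Response:\n", out.stdout)                      # Step 6: generate

if __name__ == "__main__":
    main()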
Expected Output
Result 1: Renewable energy sources such as solar and wind...
Result 2: The government plans to expand green infrastructure...
Result 3: A report on renewable adoption in rural areas...
LLM Response:
Renewable energy sources, particularly solar and wind, are being prioritized...
Conclusion & Next Steps
Congratulations — you now have a fully offline RAG system running on your own machine. You can:
- Add more documents
- Experiment with different embedding models
- Swap in other Ollama LLMs
In Part 2 of this series, I’ll dive into Hybrid RAG — combining semantic search with keyword-based BM25 for even more accurate retrieval.
💬 I’d love to hear how you’re experimenting with RAG. What challenges have you faced running it locally? Share your experiences in the comments — and follow me to catch the next article in this series.
References
[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020), NeurIPS.
[2] Jay Alammar, The Illustrated Retrieval-Augmented Generation (2023), Cohere Blog.
[3] OpenAI, Reducing Hallucinations in LLMs with Retrieval-Augmented Generation (2023), OpenAI Research.
Github Repository: https://github.com/Taha-azizi/RAG
All images were generated by the author using AI tools.
Published via Towards AI