Introduction to RAG: Basics to Mastery. 1-Build Your Own Local RAG Pipeline (No Cloud, No API Keys)

Author(s): Taha Azizi

Originally published on Towards AI.

Part 1 of the Introduction to RAG mini-series
A step-by-step guide to running Retrieval-Augmented Generation fully offline with Ollama, ChromaDB, and SentenceTransformers.

Introduction

Large Language Models (LLMs) are powerful, but they come with three key limitations:

  1. They often “hallucinate”, generating answers that sound correct but are factually wrong.
  2. Their knowledge is frozen at training time.
  3. Their answers draw on generic training knowledge rather than your specific domain or documents.

Retrieval-Augmented Generation (RAG) addresses all three of these problems by giving models access to an external knowledge base. Instead of relying only on what the model has memorized, RAG retrieves relevant facts from documents and injects them into the prompt.

Originally proposed by Facebook AI Research in 2020 [1], RAG has quickly become a core technique behind production AI systems. OpenAI uses it in enterprise ChatGPT deployments, Cohere builds retrieval into its APIs, and many startups rely on it to deliver trustworthy AI applications.

In this first part of my mini-series on RAG, I’ll show you how to:

  • Store documents in a vector database using Chroma
  • Create embeddings with SentenceTransformers
  • Run a local LLM using Ollama
  • Combine them into a working offline RAG pipeline

By the end, you’ll have a private, local assistant that can answer questions with grounded knowledge — no cloud, no API keys, and no cost.

Theory: How RAG Works

At its core, RAG combines two key ideas: retrieval and generation.

  1. Document Ingestion
    You load your dataset (text, PDFs, CSVs, etc.) and split it into manageable chunks (e.g., 500 characters).
  2. Embedding Generation
    Each chunk is converted into a high-dimensional vector (embedding) using a pretrained model. Embeddings capture semantic meaning: chunks about “solar panels” and “wind farms” end up close together in vector space (see the short similarity sketch after this list).
  3. Vector Database Storage
    These embeddings are stored in a database optimized for similarity search (e.g., Chroma, FAISS, Pinecone).
  4. Query + Retrieval
    When a user asks a question, it’s also embedded into a vector. The system retrieves the most semantically similar chunks from the database.
  5. Augmented Generation
    Retrieved chunks are added to the LLM’s prompt, grounding its answer in external knowledge rather than its static training set.
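
To make “close together in vector space” concrete, here is a minimal, self-contained sketch using the same all-MiniLM-L6-v2 model from the setup below; the sentences are illustrative:

# similarity_demo.py - illustrative sketch, separate from the main pipeline
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "Solar panels convert sunlight into electricity.",
    "Wind farms generate power from moving air.",
    "The recipe calls for two cups of flour.",
]
embeddings = model.encode(sentences, convert_to_numpy=True)

# Pairwise cosine similarity: higher means more semantically related
scores = util.cos_sim(embeddings, embeddings)
print(scores)  # the two energy sentences score far higher with each other than with the recipe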

Why does this matter?

  • Hallucination reduction: The LLM is forced to cite context rather than guess.
  • Knowledge freshness: You can update the database without retraining the model.
  • Domain adaptation: Add proprietary or niche documents that the LLM was never trained on.

This approach has been validated across multiple domains, from open-domain question answering [1], to enterprise search [2], to clinical decision support [3].

In short: RAG makes your LLM smarter, safer, and more useful.

Simple RAG pipeline

Setup

We’ll use:

  • Ollama — to run an LLM like mistral locally.
  • SentenceTransformers — to embed our text.
  • ChromaDB — to store and search embeddings.

Requirements:

  • Python 3.10+
  • A capable graphics card: an NVIDIA RTX 4060 or better, with CUDA installed
  • 32+ GB RAM

Install dependencies:

pip install chromadb sentence-transformers ollama

Install Ollama and pull a model:

ollama pull mistral
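
To confirm the model downloaded successfully, list what Ollama has available locally:

ollama list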

Step-by-Step Code

Step 1. Load Documents

We’ll use a simple .txt file for this demo.

# load_docs.py
from pathlib import Path

def load_text_files(folder_path):
    texts = []
    for file in Path(folder_path).glob("*.txt"):
        with open(file, "r", encoding="utf-8") as f:
            texts.append(f.read())
    return texts

docs = load_text_files("./data")
print(f"Loaded {len(docs)} documents.")

Step 2. Chunk Text

# chunking.py
def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        # Advance by chunk_size - overlap so consecutive chunks share some context
        start += chunk_size - overlap
    return chunks

all_chunks = []
for doc in docs:
    all_chunks.extend(chunk_text(doc))
print(f"Total chunks: {len(all_chunks)}")

Step 3. Create Embeddings

We’ll use all-MiniLM-L6-v2 for fast, GPU-accelerated embeddings.

# embeddings.py
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(all_chunks, convert_to_numpy=True, show_progress_bar=True)
print(f"Created embeddings with shape: {embeddings.shape}")

Step 4. Store in Chroma

# store.py
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="local_rag")

for i, chunk in enumerate(all_chunks):
    collection.add(
        ids=[f"chunk_{i}"],
        documents=[chunk],
        embeddings=[embeddings[i].tolist()],  # plain Python list of floats for Chroma
    )
print("Chunks stored in Chroma.")

Step 5. Query + Retrieval

# query.py
query = "What does the document say about renewable energy?"
query_embedding = embedder.encode([query], convert_to_numpy=True)[0]

results = collection.query(
    query_embeddings=[query_embedding.tolist()],  # plain list, matching what we stored
    n_results=3
)

for i, doc in enumerate(results["documents"][0]):
    print(f"Result {i+1}: {doc}\n")

Step 6. Generate Answer with Ollama

We now send retrieved context to the LLM.

# rag.py
import subprocess

context = "\n".join(results["documents"][0])
prompt = f"Answer the question using only the following context:\n{context}\n\nQuestion: {query}\nAnswer:"
ollama_cmd = ["ollama", "run", "mistral", prompt]  # or your specific model
response = subprocess.run(ollama_cmd, capture_output=True, text=True)
print("LLM Response:\n", response.stdout)

Full Workflow Script

You can merge all the steps into a single rag_basic.py file and run:

# refer to the GitHub repository for the complete script
python rag_basic.py
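
As a rough, condensed sketch of how the pieces chain together (the repository version is the authoritative one):

# rag_basic.py - condensed sketch; see the GitHub repository for the full script
import subprocess
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

def load_text_files(folder_path):
    return [f.read_text(encoding="utf-8") for f in Path(folder_path).glob("*.txt")]

def chunk_text(text, chunk_size=500, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Ingest, chunk, and embed
docs = load_text_files("./data")
all_chunks = [chunk for doc in docs for chunk in chunk_text(doc)]
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(all_chunks, convert_to_numpy=True)

# Store everything in one batched call
client = chromadb.Client()
collection = client.create_collection(name="local_rag")
collection.add(
    ids=[f"chunk_{i}" for i in range(len(all_chunks))],
    documents=all_chunks,
    embeddings=embeddings.tolist(),
)

# Retrieve the top chunks and generate a grounded answer
query = "What does the document say about renewable energy?"
query_embedding = embedder.encode([query], convert_to_numpy=True)[0].tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=3)
context = "\n".join(results["documents"][0])
prompt = (
    f"Answer the question using only the following context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
response = subprocess.run(["ollama", "run", "mistral", prompt], capture_output=True, text=True)
print("LLM Response:\n", response.stdout)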

Expected Output

Result 1: Renewable energy sources such as solar and wind...
Result 2: The government plans to expand green infrastructure...
Result 3: A report on renewable adoption in rural areas...

LLM Response:
Renewable energy sources, particularly solar and wind, are being prioritized...

Conclusion & Next Steps

Congratulations — you now have a fully offline RAG system running on your own machine. You can:

  • Add more documents
  • Experiment with different embedding models
  • Swap in other Ollama LLMs (see the short sketch below)
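
Swapping models is a one-line change on each side. One caveat: after changing the embedding model you must re-embed and re-store your chunks (Steps 3 and 4), because vectors from different models are not comparable. A sketch with example model names, which I am assuming are available on Hugging Face and in the Ollama library respectively:

# model_swap.py - illustrative sketch only
from sentence_transformers import SentenceTransformer

# Example alternative embedder (assumed available); re-run Steps 3-4 after swapping
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

# On the Ollama side (shell): ollama pull llama3
# then use "llama3" in place of "mistral" in Step 6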

In Part 2 of this series, I’ll dive into Hybrid RAG — combining semantic search with keyword-based BM25 for even more accurate retrieval.

💬 I’d love to hear how you’re experimenting with RAG. What challenges have you faced running it locally? Share your experiences in the comments — and follow me to catch the next article in this series.

References

[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020), NeurIPS.
[2] Jay Alammar, The Illustrated Retrieval-Augmented Generation (2023), Cohere Blog.
[3] OpenAI, Reducing Hallucinations in LLMs with Retrieval-Augmented Generation (2023), OpenAI Research.

GitHub Repository: https://github.com/Taha-azizi/RAG

All images were generated by the author using AI tools.
