

🧠 Implementing RAG from Scratch in Google Colab

Last Updated on October 15, 2025 by Editorial Team

Author(s): Alok Ranjan Singh

Originally published on Towards AI.


> Build your own Retrieval-Augmented Generation (RAG) system in Google Colab — understand how LLMs retrieve, reason, and respond with real facts.

💡 Why I Built This

A few weeks ago, I realized most explanations of RAG (Retrieval-Augmented Generation) online are either too abstract or too complex.

So I decided to build something anyone could understand — and more importantly, run inside Google Colab.

The result?
A minimal, fully documented RAG example: drop in your own .txt files and watch an LLM retrieve, reason, and respond with context-aware answers.

🧩 What You’ll Learn

This project walks through every major step of a RAG system:

  • Installing the required libraries
  • Loading and chunking your text documents
  • Converting those chunks into dense embeddings
  • Building a FAISS similarity index
  • Retrieving the most relevant chunks for a user query
  • Building a context-rich prompt
  • Generating a deterministic, reproducible answer using GPT-2

⚙️ How It Works

Here’s what happens under the hood:

1️⃣ Document Chunking

Each .txt file from your /docs/ folder is split into 50-word chunks with a 10-word overlap.
This ensures smooth semantic continuity — no idea gets lost between boundaries.
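If the arithmetic feels abstract, here is a quick way to see the boundaries (toy numbers only, not part of the notebook): with a step of chunk_size - overlap = 40 words, a hypothetical 120-word file splits as follows.

# Toy illustration of 50-word chunks with a 10-word overlap (hypothetical 120-word doc)
chunk_size, overlap, n_words = 50, 10, 120
starts = range(0, n_words, chunk_size - overlap)            # 0, 40, 80
spans = [(s, min(s + chunk_size, n_words)) for s in starts]
print(spans)  # [(0, 50), (40, 90), (80, 120)]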

2️⃣ Embeddings with SentenceTransformers

Each chunk is transformed into a vector — a numerical summary of its meaning.
These embeddings help the system search by meaning, not just keywords.
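To get a quick feel for what "search by meaning" looks like, here is a minimal sketch using the same all-MiniLM-L6-v2 model as the walkthrough below (the example sentences are invented):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "Pack a first-aid kit for the trail.",      # hiking-related
    "Bring medical supplies when hiking.",      # hiking-related, different words
    "The quarterly report is due on Friday.",   # unrelated
]
emb = model.encode(sentences, convert_to_numpy=True)
# The two hiking sentences share almost no keywords, yet their cosine
# similarity should be much higher than either one vs. the office sentence.
print(util.cos_sim(emb[0], emb[1]).item(), util.cos_sim(emb[0], emb[2]).item())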

3️⃣ FAISS Indexing

The embeddings are stored in a FAISS (Facebook AI Similarity Search) index, which makes it possible to instantly find the most relevant chunks for any question.
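At toy scale (random vectors, purely illustrative), the whole index boils down to "add vectors, then search":

import faiss
import numpy as np

dim = 384                                              # all-MiniLM-L6-v2 embedding size
vectors = np.random.rand(100, dim).astype(np.float32)  # FAISS expects float32
index = faiss.IndexFlatL2(dim)
index.add(vectors)
distances, ids = index.search(vectors[:1], 3)          # query with the first vector
print(ids)  # the first hit is the query itself, at distance ~0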

4️⃣ Retrieval

When you ask a question, the system encodes it into the same embedding space, then retrieves the top-3 most similar chunks.

5️⃣ Prompt Construction

The retrieved chunks are stitched into a structured prompt like this:

Context:
<chunk1>
<chunk2>
<chunk3>

Question: <your question>
Answer:

6️⃣ Answer Generation

Finally, a GPT-2 language model reads the context and generates an answer — deterministically (same input = same output every time).
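A minimal way to check the determinism claim yourself (a sketch using the small gpt2 checkpoint as a stand-in for gpt2-large, with the same set_seed(42) / do_sample=False settings used below):

from transformers import pipeline, set_seed

gen = pipeline("text-generation", model="gpt2")  # small model, quick to download
prompt = "Question: What is RAG?\nAnswer:"

set_seed(42)
first = gen(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
set_seed(42)
second = gen(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
print(first == second)  # greedy decoding is reproducible, so this should print True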

🧪 Example Output

Question:

What are the essential kits for hiking or trekking?

Generated Answer:

1. Backpack
2. Water bottle / Hydration pack
3. Trekking poles
4. Map / Compass / GPS
5. First-aid kit
6. Rain gear
7. Snacks / Energy bars

Everything you see is retrieved from your own documents, not from the model’s memory — that’s the beauty of RAG.

🧰 How to Use

  1. Open the Google Colab Notebook.
  2. Create a folder named docs/ in the Colab root.
  3. Place any number of .txt files inside it (a minimal example of creating the folder and a placeholder file is sketched after this list).
  4. Run all cells (or execute python rag_colab.py).
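If you just want to smoke-test the pipeline, here is a minimal sketch for creating the folder and one placeholder file from a Colab cell (the file name and text are made up):

import os

os.makedirs("docs", exist_ok=True)  # step 2: the folder the loader expects
with open("docs/hiking_basics.txt", "w", encoding="utf-8") as f:  # step 3: any .txt works
    f.write(
        "Essential hiking gear includes a backpack, a water bottle, trekking poles, "
        "a map or GPS, a first-aid kit, and rain gear."
    )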

📜 Full Code Walkthrough

# ---------- 1. Install required libraries ----------
# (Run this in Colab or your environment; in a script omit the '!' when using pip)
!pip install -q sentence-transformers faiss-cpu transformers
# ---------- 2. Imports ----------
import os
import faiss
import numpy as np
import torch
from pathlib import Path
from sentence_transformers import SentenceTransformer
from transformers import pipeline, set_seed
# ---------- 3. Document loader ----------
def load_documents(doc_dir: str) -> list:
    """
    Load and preprocess all plain‑text documents in a directory.

    Parameters
    ----------
    doc_dir : str
        Path to a folder that contains one or more *.txt files.

    Returns
    -------
    docs : list of str
        A flat list where each element is a chunk of 50 words taken from the
        original files, with a 10‑word overlap between consecutive chunks.

    Notes
    -----
    • Chunking at a small granularity (50 words) allows the retriever to
      identify highly relevant snippets rather than whole paragraphs.
    • The 10‑word overlap ensures that the boundary words of a chunk
      are not lost when we split a document – this improves semantic
      continuity for the embedding model.
    """

    print("Loading documents...")

    # Helper that splits a single string into overlapping chunks
    def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
        """
        Split a block of text into overlapping word‑based chunks.

        Parameters
        ----------
        text : str
            Raw document text.
        chunk_size : int, optional
            Number of words per chunk (default 50).
        overlap : int, optional
            Number of words that consecutive chunks share (default 10).

        Returns
        -------
        list of str
            List of chunk strings.
        """

        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk = " ".join(words[start:end])
            chunks.append(chunk)
            start += chunk_size - overlap
        return chunks

    docs = []
    for file_path in Path(doc_dir).glob("*.txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        # Extend the master list with the new chunks
        docs.extend(chunk_text(content))

    print(f"Loaded: Number of docs: {len(docs)}")
    return docs
# ---------- 4. Embeddings ----------
def embed_documents(docs: list, model_name: str = "all-MiniLM-L6-v2") -> tuple:
    """
    Convert text chunks into dense vector representations.

    Parameters
    ----------
    docs : list of str
        The list of document chunks to embed.
    model_name : str, optional
        The sentence‑transformer model to use. The default
        “all-MiniLM-L6-v2” is lightweight and works well for quick demos.

    Returns
    -------
    embeddings : np.ndarray
        2‑D array of shape (num_chunks, embedding_dim).
    model : SentenceTransformer
        The loaded embedding model – kept for re‑encoding queries later.
    """

    print("Embedding documents...")
    model = SentenceTransformer(model_name)
    embeddings = model.encode(docs, convert_to_numpy=True)
    return embeddings, model
# ---------- 5. FAISS index ----------
def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """
    Build a FAISS index for fast nearest‑neighbour search.

    Parameters
    ----------
    embeddings : np.ndarray
        2‑D array of document embeddings (float32, as returned by the
        sentence‑transformer encoder).

    Returns
    -------
    faiss.IndexFlatL2
        A FAISS index that can answer distance‑based queries.
    """

    print("Building FAISS index...")
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError(
            f"Embeddings must be a non‑empty 2‑D array. Received shape: {embeddings.shape}"
        )
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index
# ---------- 6. Retrieval ----------
def retrieve(index: faiss.IndexFlatL2, query_embedding: np.ndarray, k: int = 3) -> tuple:
    """
    Find the top‑k most similar document chunks to a query vector.

    Parameters
    ----------
    index : faiss.IndexFlatL2
        The pre‑built FAISS index.
    query_embedding : np.ndarray
        Embedding of the user question, shape (1, dim).
    k : int, optional
        How many neighbours to return (default 3).

    Returns
    -------
    indices : np.ndarray of int
        1‑D array of the top‑k document indices.
    distances : np.ndarray of float
        Corresponding L2 distances – useful for debugging.
    """

    print(f"Retrieving top-{k} documents...")
    distances, indices = index.search(query_embedding, k)
    return indices[0], distances[0]
# ---------- 7. Prompt construction ----------
def build_prompt(context_docs: list, user_query: str) -> str:
    """
    Assemble the final prompt that will be fed to the language model.

    Parameters
    ----------
    context_docs : list of str
        The text chunks that were retrieved for the query.
    user_query : str
        The original user question.

    Returns
    -------
    prompt : str
        A single string that follows the format required by the
        generation step: “Context:\n<docs>\n\nQuestion: <q>\nAnswer:”
    """

    print("Building prompt...")
    context = "\n\n".join(context_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    return prompt
# ---------- 8. Generation ----------
def generate_answer(
    prompt: str,
    model_name: str = "gpt2-large",
    max_new_tokens: int = 50,
    dtype: torch.dtype = None,
) -> str:
    """
    Generate a deterministic answer using a causal language model.

    The generation step is *deterministic* because we fix the random seed
    and set `do_sample=False`. This makes the output reproducible – a
    crucial property for teaching labs.

    Parameters
    ----------
    prompt : str
        Prompt produced by :func:`build_prompt`.
    model_name : str, optional
        Name of the Hugging‑Face transformer model to use.
        `"gpt2-large"` is a good trade‑off between quality and speed.
    max_new_tokens : int, optional
        Maximum number of tokens to generate beyond the prompt.
    dtype : torch.dtype, optional
        Data type for model tensors – `torch.float16` reduces GPU memory
        usage when a GPU is available.

    Returns
    -------
    answer : str
        The generated text after the last “Answer:” marker.
    """

    # Initialise the generation pipeline (will cache the model on disk)
    generator = pipeline(
        "text-generation",
        model=model_name,
        tokenizer=model_name,
        device=0 if torch.cuda.is_available() else -1,
        truncation=True,
        dtype=dtype,
    )

    # Deterministic behaviour
    set_seed(42)

    # Some models (e.g., GPT‑2) do not define a pad token.
    # We fall back to the EOS token to avoid warnings.
    if generator.model.config.pad_token_id is None:
        generator.model.config.pad_token_id = generator.model.config.eos_token_id

    # Generate text without sampling
    output = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        do_sample=False,  # deterministic
    )[0]["generated_text"]

    # Remove the prompt part, keep only the answer text
    answer = output.split("Answer:")[-1].strip()
    return answer
# ---------- 9. Main workflow ----------
"""
Full RAG pipeline executed when the script runs.

1. Load and chunk documents from the `docs/` folder.
2. Embed the chunks with Sentence‑Transformer.
3. Build a FAISS index for efficient similarity search.
4. Encode a sample user query and retrieve the 3 most relevant
context snippets.
5. Build a prompt that combines the retrieved context with
the question.
6. Generate a deterministic answer using GPT‑2‑large.
"""

# 1️⃣ Load documents
docs = load_documents("docs") # ← put your .txt files in `docs/`

# 2️⃣ Create embeddings
embeddings, embed_model = embed_documents(docs)

# 3️⃣ Build FAISS index
faiss_index = build_faiss_index(embeddings)

# 4️⃣ Example query
user_query = "What are the essential kits for hiking/ trekking?"
query_vec = embed_model.encode([user_query], convert_to_numpy=True)

# 5️⃣ Retrieve top‑k contexts
top_k_indices, _ = retrieve(faiss_index, query_vec, k=3)
retrieved_docs = [docs[i] for i in top_k_indices]

# 6️⃣ Build prompt
prompt = build_prompt(retrieved_docs, user_query)

# 7️⃣ Generate answer
answer = generate_answer(
    prompt,
    max_new_tokens=70,    # increase if you want longer answers
    dtype=torch.float16,  # keeps GPU memory down
)

# 8️⃣ Output the results
print(f"\nPrompt:\n<start_of_prompt>\n{prompt}\n<end_of_prompt>")
print(f"\nQuestion:\n<start_of_question>\n{user_query}\n<end_of_question>")
print(f"\nAnswer:\n<start_of_answer>\n{answer}\n<end_of_answer>")
Console output:

Loading documents...
Loaded: Number of docs: 52
Embedding documents...
Building FAISS index...
Retrieving top-3 documents...
Building prompt...
Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Prompt:
<start_of_prompt>
Context:
Equipment** ### 🥾 **For Hiking** * Sturdy **hiking boots** * **Backpack** (20–40 L) * **Water bottle / Hydration pack** * **Trekking poles** * **Map / Compass / GPS** * **Weather-appropriate clothing** * **Snacks / Energy bars** * **First-aid kit** * **Rain gear** ### 🧗‍♀️ **For Mountain Climbing** * **Helmet** *

*
Basic first aid * Weather awareness ### For Mountaineering: * Rope handling & knots * Ice axe use & self-arrest * Crampon walking * Crevasse rescue * Altitude management * Team coordination --- ## 🎒 **6. Essential Gear & Equipment** ### 🥾 **For Hiking** * Sturdy **
hiking boots** *

*
*Expedition Climbing** | Multi-week climbs of massive peaks (e.g., Mount Everest). | | **Indoor Climbing** | Practicing on artificial climbing walls. | --- ## 🧠 **5. Skills Required** ### For Hiking: * Navigation (map, compass, GPS) * Endurance & pacing * Basic first aid * Weather awareness ### For Mountaineering:

Question: What are the essential kits for hiking/ trekking?
Answer:
<end_of_prompt>

Question:
<start_of_question>
What are the essential kits for hiking/ trekking?
<end_of_question>

Answer:
<start_of_answer>
1. Backpack

2. Water bottle / Hydration pack

3. Trekking poles

4. Map / Compass / GPS

5. First-aid kit

6. Rain gear

7. First-aid kit

8. Snacks / Energy bars

9. First-aid
<end_of_answer>

✅ That’s it — your mini RAG pipeline in action!

🔒 Notes & Tips

  • Use SentenceTransformer('all-MiniLM-L6-v2') for quick demos; swap for larger models for better retrieval quality.
  • IndexFlatL2 is exact L2 search — fine for small corpora. Use HNSW/IVF/PQ for millions of vectors (see the sketch after this list).
  • Deterministic output: set_seed(42) + do_sample=False. For creative outputs, enable sampling & temperature.
  • In Colab GPU, use dtype=torch.float16 to reduce memory.
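For the second tip, here is a hedged sketch of what swapping IndexFlatL2 for an approximate HNSW index could look like (the parameters below are illustrative, not tuned values):

import faiss
import numpy as np

dim = 384
hnsw = faiss.IndexHNSWFlat(dim, 32)   # 32 = graph neighbours per node
hnsw.hnsw.efConstruction = 64         # build-time accuracy/speed trade-off
hnsw.hnsw.efSearch = 32               # query-time accuracy/speed trade-off

vectors = np.random.rand(10_000, dim).astype(np.float32)
hnsw.add(vectors)
distances, ids = hnsw.search(vectors[:1], 3)
print(ids)

Because search() has the same signature, the retrieve() function above works unchanged with this index.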

💬 Final Thought

RAG is the simplest, most practical way to ground language models in real knowledge.
This notebook helps you experiment and learn — not just run — the pipeline.
Happy building! ⚡

📂 GitHub Repository

Full code, PDF walkthrough, and examples are available here:
👉 Learn RAG Repository

🧑‍💻 Author

Alok Ranjan Singh
• AI Engineer • Open-Source Enthusiast

🔗 GitHub Profile
🔗 LinkedIn Profile
