Introduction to RAG: Basics to Mastery. 5-Advanced RAG: Fast Retrieval (ANN) and Reranking

Last Updated on September 9, 2025 by Editorial Team

Author(s): Taha Azizi

Originally published on Towards AI.

Part 5 of the mini-series introduction to RAG

Introduction

In the earlier articles, we built RAG pipelines that worked great for small datasets.
But what happens when your knowledge base grows to millions or billions of products, like in an e-commerce catalog?

Exact vector search becomes too slow and memory-heavy.
Approximate Nearest Neighbors (ANN) speeds up retrieval, but may sometimes return slightly less relevant results.

This is where Reranking comes in. By combining ANN retrieval (fast, broad search) with a reranker model (accurate, fine-grained relevance scoring), you get both speed and precision.

In this article, we’ll:

Use FAISS with HNSW for lightning-fast ANN retrieval.
Add a cross-encoder reranker from Hugging Face to refine the top results.
Apply this pipeline to an e-commerce product search example.

Theory

Approximate Nearest Neighbors (ANN)
When building a RAG pipeline, one of the biggest bottlenecks is retrieval speed. If your knowledge base only has a few thousand documents, exact vector search works fine: the query embedding can be compared to every document embedding quickly. But when the dataset grows to millions or even billions of entries (like in e-commerce product catalogs, research archives, or enterprise knowledge graphs), exact search becomes prohibitively slow and memory-intensive. That’s where Approximate Nearest Neighbors (ANN) comes in. ANN algorithms such as HNSW (graph-based) or IVF (cluster-based), both available in the FAISS library, organize embeddings into an index so that, instead of scanning all vectors, a query navigates the index to quickly find candidates that are “close enough.” The trade-off is that you might miss a few true nearest neighbors. In practice, with well-tuned parameters the loss in recall is often small (around 1% is a common rule of thumb), while speedups of 100x to 1000x over brute-force search are achievable on large collections; the exact numbers depend on the data and the index settings. This makes ANN essential for scalable RAG pipelines where real-time responses are required.
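To make the cost of exact search concrete, here is a minimal NumPy sketch of brute-force nearest-neighbor search. The function name exact_search and the toy vectors are illustrative assumptions, not part of the pipeline built later; the point is only that every query touches every stored vector, which is the work an ANN index avoids.

```python
import numpy as np

def exact_search(query_vec, doc_vecs, top_k=3):
    """Brute-force nearest neighbors: compare the query to every document vector."""
    # Squared L2 distance from the query to each row of doc_vecs
    dists = np.sum((doc_vecs - query_vec) ** 2, axis=1)
    # Indices of the top_k smallest distances, closest first
    order = np.argsort(dists)[:top_k]
    return order, dists[order]

# Toy data: five two-dimensional "embeddings" (illustrative values only)
docs = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.9, 0.1]],
    dtype="float32",
)
query = np.array([1.0, 0.0], dtype="float32")

ids, dists = exact_search(query, docs, top_k=2)
print("Nearest ids:", ids)
print("Squared distances:", dists)
```

The distance computation is linear in the number of stored vectors, so doubling the catalog doubles the work per query.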

Reranking
ANN solves the speed problem, but a subtle issue remains: the top-k retrieved documents may not be in the best possible order. Embedding similarity ensures they’re roughly relevant, but sometimes the most contextually useful result is ranked lower. That’s where Reranking comes in. After ANN fetches the candidate documents, a more powerful model (often a cross-encoder or large language model) is used to re-score and reorder them based on the query. This second step can improve retrieval quality substantially by pushing the most relevant results to the top. In e-commerce, for example, if you search for “lightweight running shoes,” ANN may surface 50 products that roughly match, but a reranker can prioritize the exact models that fit best, like shoes specifically optimized for marathon training over casual sneakers. Together, ANN + Reranking give you the best of both worlds: lightning-fast retrieval and high precision in the final ranked list.
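The retrieve-then-rerank flow described above can be sketched in a few lines. In this sketch the two stages are deliberately simple stand-ins (token overlap instead of an ANN index and a cross-encoder), and the names two_stage_search, toy_retrieve, and toy_score are hypothetical; only the control flow mirrors the real pipeline.

```python
def two_stage_search(query, retrieve, score, n_candidates=50, top_n=5):
    """Stage 1 fetches a broad candidate set; stage 2 re-scores and reorders it."""
    candidates = retrieve(query, n_candidates)
    scored = [(doc, score(query, doc)) for doc in candidates]
    # Highest score first; Python's sort is stable, so ties keep stage-1 order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]

catalog = [
    "running shorts with pockets",
    "lightweight travel backpack",
    "lightweight running shoes for marathon training",
    "casual sneakers for everyday wear",
]

def toy_retrieve(query, n):
    # Stand-in for ANN: keep any document sharing at least one token with the query
    q_tokens = set(query.lower().split())
    hits = [doc for doc in catalog if q_tokens & set(doc.lower().split())]
    return hits[:n]

def toy_score(query, doc):
    # Stand-in for a cross-encoder: number of query tokens found in the document
    return len(set(query.lower().split()) & set(doc.lower().split()))

results = two_stage_search(
    "lightweight running shoes", toy_retrieve, toy_score, n_candidates=50, top_n=2
)
print(results)
```

Stage 1 returns the shoes in third position; stage 2 moves them to the top because they match the query most closely.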

Summary of the Theory

Exact vs Approximate Search

Exact Search: Compares a query vector to every product vector. Very accurate, but painfully slow at scale.
Approximate Search (ANN): Uses smart indexing (like HNSW) to quickly jump to likely candidates.

Trade-offs

Speed: ANN can be 100–1000x faster than exact search.
Accuracy: You might miss a few top results, but usually <1% drop in recall.
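The recall drop mentioned above can be measured rather than assumed. A minimal sketch, assuming you have the ids returned by an exact search and by an ANN search for the same query (the id lists below are made-up illustrations):

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)

# Hypothetical result ids for one query
exact_ids = [7, 2, 9, 4, 1]
approx_ids = [7, 9, 2, 5, 1]

print("recall@3:", recall_at_k(exact_ids, approx_ids, 3))
print("recall@5:", recall_at_k(exact_ids, approx_ids, 5))
```

Averaging this value over a sample of real queries gives a direct reading of how much accuracy a given index configuration trades for speed.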

Reranking

ANN gives you a broad top-k set (fast but noisy).
A cross-encoder model reranks those candidates by reading both the query and document together → much better final ranking.
This two-stage retrieval is standard in modern search engines (Google, Amazon, etc.).

Setup

We’ll replace the Chroma vector store used in the earlier articles with a FAISS HNSW index that we build and tune ourselves.

Install FAISS with GPU support, or the CPU-only build (faiss-cpu) if you do not have a capable GPU:

pip install faiss-gpu sentence-transformers
# CPU-only alternative:
# pip install faiss-cpu sentence-transformers

We’ll use:

sentence-transformers/all-MiniLM-L6-v2 → embeddings
cross-encoder/ms-marco-MiniLM-L-6-v2 → reranker

Step-by-Step Code

[Figure: flowchart of the advanced RAG steps in this article]

1. Create Embeddings

(Same as before, but we’ll store them in FAISS instead of Chroma.)

from sentence_transformers import SentenceTransformer
import numpy as np

# Embedding model
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Example product catalog (chunks)
all_products = [
 "Red running shoes with breathable mesh",
 "Wireless noise-cancelling headphones",
 "Ergonomic office chair with lumbar support",
 "4K Ultra HD Smart TV, 55 inch",
 "Stainless steel water bottle, 1L",
 "Gaming laptop with RTX 4070 GPU",
 "Cotton crew neck T-shirt, black",
 "Bluetooth fitness tracker with heart rate monitor",
 "Smartphone with 128GB storage and 5G support",
 "Portable Bluetooth speaker, waterproof",
 "Adjustable standing desk with memory presets",
 "Noise-reducing mechanical keyboard with RGB backlight",
 "Stainless steel cookware set, 10 pieces",
 "Wireless ergonomic mouse with programmable buttons",
 "High-resolution DSLR camera with 24MP sensor",
 "Electric mountain bike with 500W motor",
 "Smartwatch with sleep tracking and GPS",
 "Leather messenger bag with laptop compartment",
 "Robot vacuum cleaner with smart mapping",
 "Air fryer with 6-quart capacity and digital display",
 "Portable power bank, 20,000mAh fast charging",
 "Noise-isolating in-ear monitors for musicians",
 "Adjustable dumbbell set, 5–50 lbs",
 "Smart home hub with voice assistant",
 "Curved 34-inch ultrawide monitor, QHD",
 "Premium mattress with cooling gel memory foam",
 "Electric toothbrush with pressure sensor",
 "Cordless stick vacuum with HEPA filter",
 "Espresso machine with milk frother",
 "Winter jacket with thermal insulation",
 "Camping tent for 4 people, waterproof",
 "LED desk lamp with wireless charging pad",
 "External SSD, 2TB, USB-C",
 "Yoga mat with non-slip surface, 8mm thick",
 "Noise-cancelling over-ear gaming headset",
 "Smart thermostat with energy saver mode",
 "Wireless charging pad for multiple devices",
 "Compact air purifier with HEPA filter",
 "Adjustable kettlebell with quick-lock system"
]

# Encode product embeddings
embeddings = embedder.encode(all_products, convert_to_numpy=True, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")
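For a sense of scale, the raw storage for the vectors can be estimated with simple arithmetic. This sketch assumes 384-dimensional embeddings (the output size of all-MiniLM-L6-v2) and counts only the vectors themselves; any index structure built on top, such as HNSW graph links, adds further overhead that is not modeled here. The helper name raw_vector_memory_gb is hypothetical.

```python
def raw_vector_memory_gb(n_vectors, dim, bytes_per_value=4):
    """Memory for the raw vectors alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e9

DIM = 384  # output dimension of all-MiniLM-L6-v2

for n in (39, 1_000_000, 100_000_000):
    fp32 = raw_vector_memory_gb(n, DIM, bytes_per_value=4)
    fp16 = raw_vector_memory_gb(n, DIM, bytes_per_value=2)
    print(f"{n:>11,} vectors: {fp32:.6f} GB as float32, {fp16:.6f} GB as float16")
```

One million products need roughly 1.5 GB as float32, which is why memory becomes a design constraint long before a catalog reaches billions of items.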

2. Build ANN Index with HNSW

import faiss

dim = embeddings.shape[1] # embedding dimension
index = faiss.IndexHNSWFlat(dim, 32) # 32 neighbors per node
index.hnsw.efConstruction = 200
index.add(embeddings)

print(f"Indexed {index.ntotal} products.")
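One detail worth knowing: faiss.IndexHNSWFlat(dim, 32) ranks by L2 distance unless a different metric is requested. Text embeddings are often compared by cosine similarity instead. For unit-length vectors the two rankings coincide, because the squared L2 distance equals 2 minus twice the cosine similarity. The NumPy sketch below checks that identity on random vectors (the data is synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))   # synthetic "document" vectors
query = rng.normal(size=8)         # synthetic "query" vector

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

docs_n = l2_normalize(docs)
query_n = l2_normalize(query)

cosine = docs_n @ query_n                          # higher is closer
sq_l2 = np.sum((docs_n - query_n) ** 2, axis=1)    # lower is closer

rank_by_cosine = np.argsort(-cosine)
rank_by_l2 = np.argsort(sq_l2)

print("Identity holds:", np.allclose(sq_l2, 2 - 2 * cosine))
print("Top-5 by cosine:", rank_by_cosine[:5])
print("Top-5 by L2:    ", rank_by_l2[:5])
```

If you want cosine-style ranking from the L2 index in this article, one option is to pass normalize_embeddings=True to embedder.encode for both the catalog and the query.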

3. Search with ANN

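def ann_search(query, top_k=5):
    # Embed the query with the same model used for the catalog
    q_emb = embedder.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(q_emb, top_k)
    # FAISS pads with -1 when fewer than top_k neighbors are found
    results = [all_products[i] for i in indices[0] if i != -1]
    return results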

query = "best laptop for gaming"
top_candidates = ann_search(query, top_k=5)
print("ANN Top Candidates:", top_candidates)

4. Rerank with a Cross-Encoder

from sentence_transformers import CrossEncoder

# Load reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates):
    # Score each (query, document) pair jointly with the cross-encoder
    pairs = [[query, doc] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score descending
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked]

# Apply reranking
reranked_results = rerank(query, top_candidates)
print("Final Reranked Results:", reranked_results)

5. Generate Final Answer with Ollama

import subprocess

# Requires a local Ollama install with the model pulled (ollama pull mistral)
context = "\n".join(reranked_results[:3]) # top-3 reranked
prompt = f"""
Answer the customer query using only the following product context:
{context}
Customer Query: {query}
Answer:
"""
ollama_cmd = ["ollama", "run", "mistral", prompt]
response = subprocess.run(ollama_cmd, capture_output=True, text=True)
if response.returncode != 0:
    print("Ollama error:\n", response.stderr)
print("LLM Response:\n", response.stdout)
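If you plan to reuse the prompt across queries, it can help to move prompt construction into a small function so it can be inspected and tested without calling the model. This is a sketch only; build_prompt and its max_docs parameter are hypothetical names, and the wording simply mirrors the prompt above.

```python
def build_prompt(query, docs, max_docs=3):
    """Assemble a grounded prompt from the top reranked documents."""
    # Keep only the first max_docs documents, one per line
    context = "\n".join(docs[:max_docs])
    return (
        "Answer the customer query using only the following product context:\n"
        f"{context}\n"
        f"Customer Query: {query}\n"
        "Answer:"
    )

example_docs = [
    "Gaming laptop with RTX 4070 GPU",
    "Curved 34-inch ultrawide monitor, QHD",
    "Noise-cancelling over-ear gaming headset",
    "4K Ultra HD Smart TV, 55 inch",
]
example_prompt = build_prompt("best laptop for gaming", example_docs)
print(example_prompt)
```

Only the first three documents reach the model, so the order produced by the reranker directly controls what the LLM sees.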

Expected Output

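With this 39-product toy catalog, retrieval is effectively instantaneous. The same pipeline is designed to keep retrieval latency in the millisecond range as your local knowledge base grows to millions of documents, provided the index fits in memory.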

For the query: “best laptop for gaming”

ANN retrieves multiple products (laptop, TV, headphones, etc.).
Reranker boosts the Gaming laptop with RTX 4070 GPU to the top.
LLM generates an answer grounded in that product.

Performance Tips

For ultra-large datasets, use IVF (Inverted File Index) in FAISS.
Tune efSearch in HNSW for better accuracy at the cost of speed:

index.hnsw.efSearch = 50

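Store vectors at half precision for memory savings. FAISS indexes expect float32 input, so casting the array itself to float16 before index.add will not work; instead, use a scalar-quantized index that stores float16 codes internally (shown here with a flat index; faiss.IndexHNSWSQ is the HNSW counterpart):

sq_index = faiss.IndexScalarQuantizer(dim, faiss.ScalarQuantizer.QT_fp16)
sq_index.add(embeddings) # input stays float32; storage is float16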

Use ANN to pull top-100 candidates, then rerank down to top-10.
Trade-off: more rerank candidates = higher accuracy, but slower.
For ultra-large datasets: combine ANN (FAISS) + reranking (CrossEncoder) + caching (Redis).

Why ANN + Reranking is a Game-Changer for RAG

✅ Scalability: Handles millions of products.
✅ Precision: Reranker ensures the “right” items float to the top.
✅ Hybrid-friendly: Can combine keyword + semantic retrieval before reranking.

This two-step approach is exactly what powers real-world search engines and product recommenders.
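The hybrid option in the list above needs some way to merge a keyword result list with a semantic one before reranking. One common choice, used here as an illustration rather than something this article's pipeline implements, is reciprocal rank fusion (RRF). The two hit lists below are made-up examples, and k=60 is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; each item scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

# Hypothetical outputs of a keyword search and a semantic (ANN) search
keyword_hits = [
    "Gaming laptop with RTX 4070 GPU",
    "Noise-cancelling over-ear gaming headset",
    "Leather messenger bag with laptop compartment",
]
semantic_hits = [
    "Gaming laptop with RTX 4070 GPU",
    "Curved 34-inch ultrawide monitor, QHD",
    "Noise-cancelling over-ear gaming headset",
]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for doc in fused:
    print(doc)
```

The fused list would then be passed to the cross-encoder as its candidate set. Items that both retrievers agree on rise to the top, while items found by only one retriever are still kept.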

Next Steps

In Article 6, we’ll focus on evaluating and optimizing RAG pipelines — using real metrics like precision@k and recall@k to measure how well your retrieval is actually performing.

👉 Check out more tutorials and follow me here: https://medium.com/@taha.azizi

and subscribe with your email to be the first to get notified when I publish new articles.

Github repo with practical example:

https://github.com/Taha-azizi/RAG
