Introduction to RAG: Basics to Mastery. 5-Advanced RAG: Fast Retrieval (ANN) and Reranking

Last Updated on September 9, 2025 by Editorial Team

Author(s): Taha Azizi

Originally published on Towards AI.

Part 5 of the mini-series introduction to RAG

Introduction

In the earlier articles, we built RAG pipelines that worked great for small datasets.
But what happens when your knowledge base grows to millions or billions of products, like in an e-commerce catalog?

  • Exact vector search becomes too slow and memory-heavy.
  • Approximate Nearest Neighbors (ANN) speeds up retrieval, but may sometimes return slightly less relevant results.

This is where Reranking comes in. By combining ANN retrieval (fast, broad search) with a reranker model (accurate, fine-grained relevance scoring), you get both speed and precision.

In this article, we’ll:

  • Use FAISS with HNSW for lightning-fast ANN retrieval.
  • Add a cross-encoder reranker from Hugging Face to refine the top results.
  • Apply this pipeline to an e-commerce product search example.

Theory

Approximate Nearest Neighbors (ANN)
When building a RAG pipeline, one of the biggest bottlenecks is retrieval speed. If your knowledge base only has a few thousand documents, exact vector search works fine: the query embedding can be compared to every document embedding quickly. But when the dataset grows to millions or even billions of entries (like in e-commerce product catalogs, research archives, or enterprise knowledge graphs), exact search becomes prohibitively slow and memory-intensive, because every query costs time proportional to the number of stored vectors. That’s where Approximate Nearest Neighbors (ANN) comes in. ANN algorithms such as HNSW (a graph-based index) or IVF (a cluster-based index), both available in the FAISS library, organize the embeddings so that a query visits only a small part of the index instead of scanning all vectors, and returns candidates that are “close enough.” The trade-off is that you might miss a few true nearest neighbors. In practice, the recall loss can often be kept to around 1% with well-tuned parameters, while speedups on large collections can reach 100x to 1000x. This makes ANN essential for scalable RAG pipelines where real-time responses are required.
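To make the cost of exact search concrete, here is a minimal pure-Python sketch of brute-force retrieval. The function names and the 3-dimensional toy vectors are hypothetical and only for illustration; real embeddings have hundreds of dimensions, and a production system would use vectorized code instead.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(query_vec, doc_vecs, top_k=3):
    """Score the query against every document: O(N * d) work per query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:top_k]]

# Hypothetical toy vectors standing in for document embeddings.
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

nearest = exact_search(query_vec, docs, top_k=2)
print(nearest)  # → [0, 1]
```

Every query touches all N vectors, which is exactly the work an ANN index is designed to avoid.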

Reranking
ANN solves the speed problem, but a subtle issue remains: the top-k retrieved documents may not be in the best possible order. Embedding similarity, made approximate by ANN, only guarantees that the candidates are roughly relevant, so the most contextually useful result may be ranked lower. That’s where Reranking comes in. After ANN fetches the candidate documents, a more powerful model (often a cross-encoder or a large language model) re-scores and reorders them against the query. Because this model only sees the small candidate set, it can afford to be much slower per document. This second step often improves retrieval quality substantially by pushing the most relevant results to the top. In e-commerce, for example, if you search for “lightweight running shoes,” ANN may surface 50 products that roughly match, and a reranker can then prioritize the models that fit best, such as shoes optimized for marathon training over casual sneakers. Together, ANN + Reranking give you the best of both worlds: lightning-fast retrieval and high precision in the final ranked list.

Summary of the Theory

Exact vs Approximate Search

  • Exact Search: Compares a query vector to every product vector. Very accurate, but painfully slow at scale.
  • Approximate Search (ANN): Uses smart indexing (like HNSW) to quickly jump to likely candidates.

Trade-offs

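  • Speed: On large collections, ANN can be 100–1000x faster than exact search, and the gap grows with dataset size.
  • Accuracy: You might miss a few true top results, but with well-tuned parameters the drop in recall is usually small (often around 1%).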
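The accuracy side of this trade-off is usually measured as recall@k: the fraction of the true top-k neighbors (from exact search) that the approximate search also returns. The sketch below uses hypothetical document IDs, not results from this article's index.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    if not truth:
        return 0.0  # nothing to recall
    return len(truth & found) / len(truth)

# Hypothetical result lists for one query (document IDs, best first).
exact_top5 = [12, 7, 33, 4, 19]
approx_top5 = [12, 33, 7, 19, 50]  # missed ID 4, returned 50 instead

r = recall_at_k(exact_top5, approx_top5, k=5)
print(f"recall@5 = {r:.2f}")  # → recall@5 = 0.80
```

Note that recall@k ignores ordering inside the top-k. That is acceptable here, because fixing the order is the reranker's job.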

Reranking

  • ANN gives you a broad top-k set (fast but noisy).
  • A cross-encoder model reranks those candidates by reading both the query and document together → much better final ranking.
  • This two-stage (retrieve, then rerank) design is a standard pattern in modern web and e-commerce search.
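The two-stage idea can be sketched without any ML libraries. In the example below both scoring functions are deliberately simple, hypothetical stand-ins: `word_overlap` plays the role of the fast ANN similarity, and `phrase_aware_score` plays the role of the slower, more precise cross-encoder. Neither is what the article's pipeline actually uses.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score,
                         n_candidates=3, top_n=2):
    """Stage 1: broad, cheap retrieval. Stage 2: precise scoring of the survivors."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda d: precise_score(query, d),
                      reverse=True)
    return reranked[:top_n]

def word_overlap(query, doc):
    """Stand-in for ANN similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_aware_score(query, doc):
    """Stand-in for a cross-encoder: overlap plus a bonus for the exact phrase."""
    bonus = 2 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

corpus = [
    "Laptop stand for gaming desks",
    "Gaming laptop with RTX 4070 GPU",
    "Cotton crew neck T-shirt, black",
    "Gaming headset with microphone",
    "Stainless steel water bottle, 1L",
]

results = retrieve_then_rerank("gaming laptop", corpus,
                               word_overlap, phrase_aware_score)
print(results)
```

Stage 1 cannot tell the laptop stand from the laptop, since both share two words with the query. The more careful second-stage scorer moves the actual gaming laptop to the top.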

Setup

We’ll replace the Chroma vector store used in the earlier articles with a FAISS index, which lets us configure the ANN search (HNSW) directly.

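Install FAISS and sentence-transformers. The CPU build is sufficient here, because the HNSW index used below runs on the CPU:

pip install faiss-cpu sentence-transformers

If you have a CUDA-capable GPU and want to experiment with FAISS’s GPU index types, install a GPU build of FAISS instead.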

We’ll use:

  • sentence-transformers/all-MiniLM-L6-v2 → embeddings
  • cross-encoder/ms-marco-MiniLM-L-6-v2 → reranker

Step-by-Step Code

Flowchart of the advanced RAG steps in this article

1. Create Embeddings

(Same as before, but we’ll store them in FAISS instead of Chroma.)

from sentence_transformers import SentenceTransformer
import numpy as np

# Embedding model
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Example product catalog (chunks)
all_products = [
"Red running shoes with breathable mesh",
"Wireless noise-cancelling headphones",
"Ergonomic office chair with lumbar support",
"4K Ultra HD Smart TV, 55 inch",
"Stainless steel water bottle, 1L",
"Gaming laptop with RTX 4070 GPU",
"Cotton crew neck T-shirt, black",
"Bluetooth fitness tracker with heart rate monitor",
"Smartphone with 128GB storage and 5G support",
"Portable Bluetooth speaker, waterproof",
"Adjustable standing desk with memory presets",
"Noise-reducing mechanical keyboard with RGB backlight",
"Stainless steel cookware set, 10 pieces",
"Wireless ergonomic mouse with programmable buttons",
"High-resolution DSLR camera with 24MP sensor",
"Electric mountain bike with 500W motor",
"Smartwatch with sleep tracking and GPS",
"Leather messenger bag with laptop compartment",
"Robot vacuum cleaner with smart mapping",
"Air fryer with 6-quart capacity and digital display",
"Portable power bank, 20,000mAh fast charging",
"Noise-isolating in-ear monitors for musicians",
"Adjustable dumbbell set, 5–50 lbs",
"Smart home hub with voice assistant",
"Curved 34-inch ultrawide monitor, QHD",
"Premium mattress with cooling gel memory foam",
"Electric toothbrush with pressure sensor",
"Cordless stick vacuum with HEPA filter",
"Espresso machine with milk frother",
"Winter jacket with thermal insulation",
"Camping tent for 4 people, waterproof",
"LED desk lamp with wireless charging pad",
"External SSD, 2TB, USB-C",
"Yoga mat with non-slip surface, 8mm thick",
"Noise-cancelling over-ear gaming headset",
"Smart thermostat with energy saver mode",
"Wireless charging pad for multiple devices",
"Compact air purifier with HEPA filter",
"Adjustable kettlebell with quick-lock system"
]

# Encode product embeddings
embeddings = embedder.encode(all_products, convert_to_numpy=True, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")

2. Build ANN Index with HNSW

import faiss

dim = embeddings.shape[1] # embedding dimension
index = faiss.IndexHNSWFlat(dim, 32) # 32 neighbors per node
index.hnsw.efConstruction = 200
index.add(embeddings)

print(f"Indexed {index.ntotal} products.")
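Before scaling this index up, it helps to estimate its memory footprint. The sketch below uses the rule of thumb from the FAISS index-selection guidelines for HNSW with flat storage, roughly (d * 4 + M * 2 * 4) bytes per vector. Treat the result as an approximation, not an exact measurement; the function name is hypothetical.

```python
def hnsw_flat_bytes(n_vectors, dim, m):
    """Approximate RAM for an HNSW index with uncompressed float32 vectors."""
    vector_bytes = dim * 4   # float32 components
    link_bytes = m * 2 * 4   # roughly 2*M 4-byte neighbor IDs per vector
    return n_vectors * (vector_bytes + link_bytes)

# all-MiniLM-L6-v2 produces 384-dimensional embeddings; M = 32 as above.
small = hnsw_flat_bytes(39, 384, 32)
large = hnsw_flat_bytes(1_000_000, 384, 32)

print(f"39 products:  {small / 1e3:.1f} KB")
print(f"1M products:  {large / 1e9:.2f} GB")
```

At one million products the raw vectors dominate (about 1.5 GB of the roughly 1.8 GB total), which is why compressed or quantized storage becomes attractive at larger scales.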

3. Search with ANN

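def ann_search(query, top_k=5):
    """Return the top_k products closest to the query embedding."""
    # FAISS expects a 2-D float32 array of shape (n_queries, dim)
    q_emb = embedder.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(q_emb, top_k)
    # FAISS pads the result with -1 when it finds fewer than top_k neighbors
    return [all_products[i] for i in indices[0] if i != -1]

query = "best laptop for gaming"
top_candidates = ann_search(query, top_k=5)
print("ANN Top Candidates:", top_candidates)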
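The `distances` returned by `index.search` are worth a note. `IndexHNSWFlat` uses the L2 metric by default, and FAISS reports squared L2 distances. Assuming the embeddings are unit length (all-MiniLM-L6-v2 is generally described as normalizing its outputs, but verify this for your model), a squared distance converts to cosine similarity via ||a - b||² = 2 - 2·cos(a, b). Here is a small pure-Python check of that identity on hypothetical 2-D vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance, the quantity FAISS reports for the L2 metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Valid only for unit-length vectors: cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0

a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

d2 = squared_l2(a, b)
cos_direct = sum(x * y for x, y in zip(a, b))
cos_from_d2 = cosine_from_squared_l2(d2)

print(round(d2, 4), round(cos_direct, 4), round(cos_from_d2, 4))
```

For unit vectors, ranking by smallest L2 distance and ranking by largest cosine similarity therefore give the same order.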

4. Rerank with a Cross-Encoder

from sentence_transformers import CrossEncoder

# Load reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

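def rerank(query, candidates):
    """Reorder candidates by cross-encoder relevance to the query."""
    # The cross-encoder reads each (query, document) pair jointly
    pairs = [[query, doc] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score descending
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked]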

# Apply reranking
reranked_results = rerank(query, top_candidates)
print("Final Reranked Results:", reranked_results)

5. Generate Final Answer with Ollama

import subprocess
context = "\n".join(reranked_results[:3]) # top-3 reranked
prompt = f"""
Answer the customer query using only the following product context:
{context}
Customer Query: {query}
Answer:
"""

ollama_cmd = ["ollama", "run", "mistral", prompt]
response = subprocess.run(ollama_cmd, capture_output=True, text=True)
print("LLM Response:\n", response.stdout)
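The snippet above prints whatever the CLI wrote to stdout, so a missing model or a stopped Ollama service shows up as an empty answer. The sketch below is one possible hardening, not the author's code: `build_prompt` and `ask_ollama` are hypothetical helper names, and the timeout value is an arbitrary assumption. Only the prompt builder is executed here; `ask_ollama` needs a local Ollama install.

```python
import subprocess

def build_prompt(query, docs, max_docs=3):
    """Assemble a grounded prompt from the top reranked documents."""
    context = "\n".join(f"- {d}" for d in docs[:max_docs])
    return (
        "Answer the customer query using only the following product context:\n"
        f"{context}\n"
        f"Customer Query: {query}\n"
        "Answer:"
    )

def ask_ollama(prompt, model="mistral", timeout_s=120):
    """Run `ollama run <model> <prompt>` and surface CLI failures as exceptions."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(f"ollama failed: {result.stderr.strip()}")
    return result.stdout.strip()

example_prompt = build_prompt(
    "best laptop for gaming",
    [
        "Gaming laptop with RTX 4070 GPU",
        "Curved 34-inch ultrawide monitor, QHD",
        "Noise-cancelling over-ear gaming headset",
        "4K Ultra HD Smart TV, 55 inch",  # dropped: only the top 3 are used
    ],
)
print(example_prompt)
```

Keeping prompt construction separate from the subprocess call also makes the prompt easy to inspect and test without running a model.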

Expected Output

With an HNSW index, retrieval typically takes milliseconds, and it stays fast as your local knowledge base grows to millions of documents, provided the index fits in memory.

For the query: “best laptop for gaming”

  • ANN retrieves multiple products (laptop, TV, headphones, etc.).
  • Reranker boosts the Gaming laptop with RTX 4070 GPU to the top.
  • LLM generates an answer grounded in that product.

Performance Tips

  • For ultra-large datasets, use IVF (Inverted File Index) in FAISS.
  • Tune efSearch in HNSW for better accuracy at the cost of speed:
    index.hnsw.efSearch = 50
  • Store embeddings in float16 for memory savings when caching or persisting them. FAISS expects float32 input, so cast back to float32 before calling index.add or index.search:
    stored = embeddings.astype("float16")
  • Use ANN to pull top-100 candidates, then rerank down to top-10.
  • Trade-off: more rerank candidates = higher accuracy, but slower.
  • To avoid recomputing popular queries, put caching (for example Redis) in front of ANN (FAISS) + reranking (CrossEncoder).
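The caching tip can be illustrated with the standard library alone. In this sketch `functools.lru_cache` is an in-process stand-in for an external cache such as Redis, and `slow_pipeline` is a hypothetical placeholder for the ANN search plus reranking steps.

```python
from functools import lru_cache

def slow_pipeline(query):
    """Hypothetical placeholder for ANN retrieval followed by reranking."""
    return (f"results for: {query}",)  # tuple, so cached values are immutable

@lru_cache(maxsize=1024)
def cached_search(normalized_query):
    return slow_pipeline(normalized_query)

def search(query):
    # Normalize case and whitespace so trivial variants share one cache entry
    return cached_search(" ".join(query.lower().split()))

first = search("Best laptop for gaming")
second = search("  best LAPTOP   for gaming ")

info = cached_search.cache_info()
print(info.hits, info.misses)  # → 1 1
```

The second call never reaches the pipeline. With a shared cache like Redis the same idea works across processes, at the cost of having to invalidate entries when the catalog changes.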

Why ANN + Reranking is a Game-Changer for RAG

Scalability: Handles millions of products.
Precision: Reranker ensures the “right” items float to the top.
Hybrid-friendly: Can combine keyword + semantic retrieval before reranking.

This two-step approach is the same retrieve-then-rerank pattern widely used in production search engines and product recommenders.
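For the hybrid case, one common way to merge a keyword ranking and a semantic ranking before reranking is Reciprocal Rank Fusion (RRF). This technique is not part of the article's pipeline; the sketch below assumes the conventional constant k = 60 and uses hypothetical document IDs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from two retrievers for the same query (best first).
keyword_hits = ["laptop", "mouse", "keyboard"]
semantic_hits = ["monitor", "laptop", "headset"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused[:2])  # → ['laptop', 'monitor']
```

RRF only needs ranks, not comparable scores, so it works even when keyword scores (such as BM25) and vector distances live on different scales. The fused list can then be passed to the cross-encoder as its candidate set.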

Next Steps

In Article 6, we’ll focus on evaluating and optimizing RAG pipelines — using real metrics like precision@k and recall@k to measure how well your retrieval is actually performing.

👉 Check out more tutorials and follow me here: https://medium.com/@taha.azizi

and subscribe with your email to be the first to get notified when I publish new articles.

Github repo with practical example:

https://github.com/Taha-azizi/RAG
