
Enhance Your LLM Agents with BM25: Lightweight Retrieval That Works
Author(s): Syed Affan
Originally published on Towards AI.
Prerequisites
Before diving in, you should have:
- Basic AI/ML understanding: concepts like language models, embeddings, and model inference.
- Software engineering skills: familiarity with Python, virtual environments, and package installation.
- Python libraries: comfort importing and using packages and file I/O.
If any of these are new, consider reviewing a quick Python tutorial or AI primer before proceeding.
1. Your LLM Agent Is Only as Good as Its Retrieval
Every intelligent GenAI agent today, from chatbots to autonomous assistants, depends on retrieving the right information at the right time. Retrieval is the process of fetching relevant knowledge or context to augment an AI's ability to reason and answer accurately.
As with any data science problem, the choice of retrieval method should depend on the problem you are solving. Yet most practitioners default to embeddings with cosine similarity, whether or not it is actually required.
In real-world applications, retrieval is everywhere: customer support bots fetching updated policies, legal research assistants sourcing statutes, AI tutors pulling examples from textbooks, and AI search engines navigating complex knowledge bases. Retrieval empowers small and large LLMs alike by providing fresh, factual context that may not be captured during model training.
Imagine a customer-support chatbot powered by a compact language model. Without a retrieval layer, it can only rely on its fixed training data, and will inevitably hallucinate or go silent when asked about the latest policy updates. Retrieval fills this gap by fetching fresh, factual context (e.g., support articles, knowledge-base entries) at query time.
2. The Problem: Where Embedding-Based Retrieval Struggles
What Is Embedding + Cosine Similarity?
The dominant retrieval approach today uses vector embeddings. Embeddings are dense vector representations of text. Models like Sentence Transformers map words, sentences, or documents into high-dimensional vectors. To find relevant text, you compare vectors using metrics like cosine similarity, retrieving documents whose embeddings are closest to the query embedding.
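For concreteness, here is a minimal sketch of that approach using the sentence-transformers library; the model name and the example passages are illustrative choices, not part of this article's dataset.

from sentence_transformers import SentenceTransformer, util

# Illustrative model and passages (placeholders)
model = SentenceTransformer("all-MiniLM-L6-v2")
passages = [
    "The warranty covers manufacturing defects for two years.",
    "Refunds are processed within 5-7 business days.",
]
query = "How long does a refund take?"

# Encode query and passages into dense vectors, then rank by cosine similarity
passage_emb = model.encode(passages, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, passage_emb)[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))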
Why It Works
- Captures semantic meaning beyond exact keywords.
- Supports fuzzy matches, paraphrases, and synonyms.
Challenges
- GPU-heavy: computing embeddings for thousands of documents on the fly can stall on CPUs.
- Memory-hungry: storing millions of 768–1024-dimensional vectors requires gigabytes of RAM or specialized vector databases (FAISS, Pinecone).
- Overkill for keyword tasks: when a user asks "When did X happen?", exact date matches may be buried by semantic noise.
In structured domains like encyclopedias, technical manuals, or product catalogs, where users typically ask factual, direct questions, lighter keyword-driven methods often outperform embedding-heavy setups: they are faster, simpler, and cheaper. That's where BM25 shines.
3. The BM25 Alternative: A Faster, Simpler Way
BM25 is a decades-old ranking function at the heart of search engines like Elasticsearch and Apache Lucene. It scores documents based on:
1. Term Frequency (TF): how often query terms appear in a document.
2. Inverse Document Frequency (IDF): how rare those terms are across the entire corpus.
3. Document Length Normalization: prevents long docs from dominating the score.
This blend yields a robust, explainable relevance metric that runs entirely on CPU, with no training needed.
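To make the scoring concrete, below is a minimal, self-contained sketch of the Okapi BM25 formula with the common defaults k1 = 1.5 and b = 0.75; in practice you would use a library such as rank_bm25 (shown later) rather than rolling your own.

import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    N = len(corpus)                                  # number of documents
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    tf = Counter(doc_tokens)                         # term frequencies in this doc
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)     # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarer terms weigh more
        # dampened term frequency, normalized by document length
        denom = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score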
BM25 + Lightweight AI Wrappers
By combining BM25 retrieval with a small LLM (e.g., Gemma-2B or TinyLlama), you can create an info retrieval system that's:
- Cost-effective: no GPU for retrieval, small model inference.
- Responsive: sub-100ms queries on 100k–200k docs.
- Transparent: easy to explain "why" a document scored highly (TF/IDF contributions).
Perplexity's CEO has said that they use BM25 alongside other methods, reflecting the industry practice of combining traditional retrieval algorithms with modern techniques rather than relying solely on embedding models.
"It's not purely vector space. It's not like once the content is fetched, there is some BERT model that runs on all of it and puts it into a big gigantic vector database, which you retrieve from, it's not like that. Because packing all the knowledge about a webpage into one vector space representation is very, very difficult. There's like, first of all, vector embeddings are not magically working for text. It's very hard to like understand what's a relevant document to a particular query."
Source: https://www.reddit.com/r/LocalLLaMA/comments/1ds30l9/perplexity_seems_to_favor_the_traditional/
BM25 vs. Embeddings
4. The Retrieval-Augmented Workflow (RAG and Beyond)
What Is RAG?
Retrieval-Augmented Generation (RAG) is an AI architecture that bridges information retrieval with text generation. Instead of relying only on its trained weights, an LLM is provided with external context retrieved dynamically based on a user query. This greatly enhances factual accuracy, reduces hallucination, and enables models to "know" about recent or external facts.
In practice, RAG is a two-step pipeline:
1. Retrieve: find top-k relevant passages from an external corpus.
2. Generate: feed those passages into an LLM prompt to produce the final answer.
Why RAG Works
- Keeps LLM context windows focused on relevant facts.
- Reduces hallucinations by grounding the model in real data.
- Enables up-to-date answers without retraining the LLM.
Other Agentic Use Cases
- Planner Agents: retrieve tool specs, API docs, or environment states before deciding actions.
- Monitoring Agents: query logs or metrics to detect anomalies, then generate alerts.
- Summarization Bots: fetch the latest articles or emails and summarize key points.
Across all these scenarios, retrieval is the unsung hero enabling the LLM to act knowledgeably.
5. Implementation: Lightweight Retrieval with BM25
Before coding, install the libraries:
pip -q install whoosh rank_bm25 sentence-transformers \
transformers accelerate optimum --upgrade
5.1 Dataset Overview
I used the Wikipedia Structured Contents by Wikimedia dataset on Kaggle: JSONL files containing titles, abstracts, and infoboxes for each article. This structured format makes keyword retrieval highly effective.
My workbook is here: https://www.kaggle.com/code/sulphatet/bm-method-of-rag
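For reference, the snippet below sketches how the titles and abstracts used in the rest of this article could be loaded; the file name and field names (name, abstract) are assumptions about the JSONL layout, so adjust them to the actual schema of the dataset you download.

import json

titles, abstracts = [], []
# Hypothetical file and field names; adapt them to the real dataset schema
with open("enwiki_structured.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        titles.append(record.get("name", ""))
        abstracts.append(record.get("abstract", ""))

# "title. abstract" strings used as the BM25 corpus below
docs = [f"{t}. {a}" for t, a in zip(titles, abstracts)]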
5.2 In-Memory BM25 with rank_bm25
`rank_bm25` is a Python library that bills itself as a "two-line search engine." It implements several BM25 variants (Okapi BM25, BM25+, BM25L, and more); tokenization and preprocessing (lowercasing, stopword removal, stemming) are left to you.
- Best for: Up to ~100k documents on a single CPU; ideal for rapid prototyping.
- Tradeoff: Entire index lives in RAM, so very large corpora may exceed memory.
from rank_bm25 import BM25Okapi

# 1. Tokenize your corpus
corpus = docs  # list of "title. abstract" strings
tokenized = [doc.lower().split() for doc in corpus]

# 2. Initialize BM25
bm25 = BM25Okapi(tokenized)
# Docs: https://github.com/dorianbrown/rank_bm25

# 3. Search function
def search_bm25(query, k=5):
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    top_idx = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(titles[i], abstracts[i]) for i in top_idx]
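A quick illustrative query against the in-memory index (the query string here is arbitrary; results depend on the corpus you indexed):

for title, abstract in search_bm25("first moon landing date", k=3):
    print(title, "->", abstract[:80])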
5.3 Persistent BM25F with Whoosh
Whoosh is a pure-Python search library supporting fielded BM25 (BM25F). You can boost the title field to improve title matches.
Best for: Datasets up to 1M+ documents, where disk persistence and field weighting matter.
Tradeoff: Slower index writes and slightly higher query latency (still fast compared to computing embeddings and cosine similarity on CPU).
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import MultifieldParser
import os, shutil

# 1. Define schema with field boosts
schema = Schema(
    title=TEXT(stored=True, field_boost=2.0),
    abstract=TEXT(stored=True)
)
# Docs: https://whoosh.readthedocs.io/

# 2. Create or clear the index directory
if os.path.exists("indexdir"):
    shutil.rmtree("indexdir")
os.mkdir("indexdir")
ix = create_in("indexdir", schema)

# 3. Index documents
writer = ix.writer()
for t, a in zip(titles, abstracts):
    writer.add_document(title=t, abstract=a)
writer.commit()

# 4. Search function
def search_whoosh(query, k=5):
    ix = open_dir("indexdir")
    with ix.searcher() as searcher:
        parser = MultifieldParser(["title", "abstract"], schema=ix.schema)
        q = parser.parse(query)
        res = searcher.search(q, limit=k)
        return [(r["title"], r["abstract"]) for r in res]
6. Generating Answers: Pairing BM25 with a Small LLM
With retrieval in place, you can feed the top-k passages into a lightweight LLM such as Gemma-2B, GPT-2, or Hugging Face's SmolLM:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model
tok = AutoTokenizer.from_pretrained("google/gemma-2b-it")
mod = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it").eval()

# RAG answer function
def generate_answer(query):
    contexts = search_bm25(query)  # or search_whoosh(query)
    ctx_text = "\n".join(f"{t}: {a}" for t, a in contexts)
    prompt = f"Answer using context:\n{ctx_text}\nQuestion: {query}\nAnswer:"
    inp = tok(prompt, return_tensors="pt", truncation=True, max_length=4096)
    out = mod.generate(**inp, max_new_tokens=150)
    return tok.decode(out[0], skip_special_tokens=True)
This setup runs entirely on CPU and delivers sub-second end-to-end latency for 5 passages.
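If you want to verify latency on your own hardware, a minimal timing check might look like the sketch below (numbers will vary with CPU, corpus size, and max_new_tokens; the example question is arbitrary):

import time

start = time.perf_counter()
answer = generate_answer("When was the Eiffel Tower completed?")
print(answer)
print(f"End-to-end latency: {time.perf_counter() - start:.2f}s")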
7. When (and Why) You Should Use BM25
Use BM25 when:
- Your data is structured (titles, abstracts, FAQs).
❗BONUS: Fielded BM25 (BM25F) lets you assign your own field weights; for example, you can decide that the title is five times more important than the body text and tune your retrieval accordingly.
- Queries are keyword-driven or factoid-focused.
- Youβre limited to CPU or lightweight hosts.
- You want transparent, explainable relevance scores.
Use Embeddings and Cosine when:
🌐 You need deep semantic matches or paraphrase handling.
🔄 You require multilingual retrieval.
🚀 You operate at a massive (or critical) scale (>1M docs) with GPU/cluster resources.
8. Simpler Is Often Better
In modern AI, bigger isn't always better. By combining a time-tested algorithm like BM25 with compact LLMs, you get a nimble, cost-effective agentic system that excels at real-world tasks, without the complexity and expense of full-scale vector search.
Give this approach a try in your next project, and watch how simplicity unlocks performance.
❗Bonus libraries you can try: RagMeUp offers a generic framework for spinning up RAG over your own database easily. If you really want efficiency, BM25S is a high-performance BM25 implementation designed for speed.
Keywords: BM25, RAG, LLM retrieval, lightweight retrieval, agentic AI, rank_bm25, Whoosh, retrieval-augmented generation.
Published via Towards AI