
Enhance Your LLM Agents with BM25: Lightweight Retrieval That Works
Author(s): Syed Affan
Originally published on Towards AI.
Prerequisites
Before diving in, you should have:
- Basic AI/ML understanding: concepts like language models, embeddings, and model inference.
- Software engineering skills: familiarity with Python, virtual environments, and package installation.
- Python libraries: comfort importing and using packages and file I/O.
If any of these are new, consider reviewing a quick Python tutorial or AI primer before proceeding.
1. Your LLM Agent Is Only as Good as Its Retrieval
Every intelligent GenAI agent today, from chatbots to autonomous assistants, depends on retrieving the right information at the right time. Retrieval is the process of fetching relevant knowledge or context to augment an AI's ability to reason and answer accurately.
As with any data science problem, the choice of retrieval method should depend on the problem you are solving. Yet most practitioners default to embeddings with cosine similarity, whether or not it is actually required.
In real-world applications, retrieval is everywhere: customer support bots fetching updated policies, legal research assistants sourcing statutes, AI tutors pulling examples from textbooks, and AI search engines navigating complex knowledge bases. Retrieval empowers small and large LLMs alike by providing fresh, factual context that may not be captured during model training.
Imagine a customer-support chatbot powered by a compact language model. Without a retrieval layer, it can only rely on its fixed training data, and will inevitably hallucinate or go silent when asked about the latest policy updates. Retrieval fills this gap by fetching fresh, factual context (e.g., support articles, knowledge-base entries) at query time.
2. The Problem: Where Embedding-Based Retrieval Struggles
What Is Embedding + Cosine Similarity?
The dominant retrieval approach today uses vector embeddings. Embeddings are dense vector representations of text. Models like Sentence Transformers map words, sentences, or documents into high-dimensional vectors. To find relevant text, you compare vectors using metrics like cosine similarity, retrieving documents whose embeddings are closest to the query embedding.
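For concreteness, here is a minimal sketch of that approach using the sentence-transformers library; the model name and the example passages are illustrative choices, not part of this article's dataset.

from sentence_transformers import SentenceTransformer, util

# Illustrative model and passages (placeholders)
model = SentenceTransformer("all-MiniLM-L6-v2")
passages = [
    "The warranty covers manufacturing defects for two years.",
    "Refunds are processed within 5-7 business days.",
]
query = "How long does a refund take?"

# Encode query and passages into dense vectors, then rank by cosine similarity
passage_emb = model.encode(passages, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, passage_emb)[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))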
Why It Works
- Captures semantic meaning beyond exact keywords.
- Supports fuzzy matches, paraphrases, and synonyms.
Challenges
- GPU-heavy: computing embeddings for thousands of documents on the fly can stall on CPUs.
- Memory-hungry: storing millions of 768–1024-dimensional vectors requires gigabytes of RAM or specialized vector databases (FAISS, Pinecone).
- Overkill for keyword tasks: when a user asks "When did X happen?", exact date matches may be buried by semantic noise.
In structured domains like encyclopedias, technical manuals, or product catalogs, where users typically ask factual, direct questions, lighter keyword-driven methods often outperform embedding-heavy setups: they are faster, simpler, and cheaper. That's where BM25 shines.
3. The BM25 Alternative: A Faster, Simpler Way
BM25 is a decades-old ranking function at the heart of search engines like Elasticsearch and Apache Lucene. It scores documents based on:
1. Term Frequency (TF): how often query terms appear in a document.
2. Inverse Document Frequency (IDF): how rare those terms are across the entire corpus.
3. Document Length Normalization: prevents long docs from dominating the score.
This blend yields a robust, explainable relevance metric that runs entirely on CPU, with no training needed.
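To make the scoring concrete, below is a minimal, self-contained sketch of the Okapi BM25 formula with the common defaults k1 = 1.5 and b = 0.75; in practice you would use a library such as rank_bm25 (shown later) rather than rolling your own.

import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    N = len(corpus)                                  # number of documents
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    tf = Counter(doc_tokens)                         # term frequencies in this doc
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)     # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarer terms weigh more
        # dampened term frequency, normalized by document length
        denom = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score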
BM25 + Lightweight AI Wrappers
By combining BM25 retrieval with a small LLM (e.g., Gemma-2B or TinyLlama), you can create an info retrieval system that's:
- Cost-effective: no GPU for retrieval, small model inference.
- Responsive: sub-100ms queries on 100k–200k docs.
- Transparent: easy to explain "why" a document scored highly (TF/IDF contributions).
Perplexity's CEO has said that they use BM25 alongside other methods, reflecting the industry practice of combining traditional retrieval algorithms with modern techniques rather than relying solely on embedding models.
"It's not purely vector space. It's not like once the content is fetched, there is some BERT model that runs on all of it and puts it into a big gigantic vector database, which you retrieve from, it's not like that. Because packing all the knowledge about a webpage into one vector space representation is very, very difficult. There's like, first of all, vector embeddings are not magically working for text. It's very hard to like understand what's a relevant document to a particular query."
Source: https://www.reddit.com/r/LocalLLaMA/comments/1ds30l9/perplexity_seems_to_favor_the_traditional/
BM25 vs. Embeddings
4. The Retrieval-Augmented Workflow (RAG and Beyond)
What Is RAG?
Retrieval-Augmented Generation (RAG) is an AI architecture that bridges information retrieval with text generation. Instead of relying only on its trained weights, an LLM is provided with external context retrieved dynamically based on a user query. This greatly enhances factual accuracy, reduces hallucination, and enables models to "know" about recent or external facts.
In practice, RAG is a two-step pipeline:
1. Retrieve: find top-k relevant passages from an external corpus.
2. Generate: feed those passages into an LLM prompt to produce the final answer.
Why RAG Works
- Keeps LLM context windows focused on relevant facts.
- Reduces hallucinations by grounding the model in real data.
- Enables up-to-date answers without retraining the LLM.
Other Agentic Use Cases
- Planner Agents: retrieve tool specs, API docs, or environment states before deciding actions.
- Monitoring Agents: query logs or metrics to detect anomalies, then generate alerts.
- Summarization Bots: fetch the latest articles or emails and summarize key points.
Across all these scenarios, retrieval is the unsung hero enabling the LLM to act knowledgeably.
5. Implementation: Lightweight Retrieval with BM25
Before coding, install the libraries:
pip -q install whoosh rank_bm25 sentence-transformers \
transformers accelerate optimum --upgrade
5.1 Dataset Overview
I used the Wikipedia Structured Contents by Wikimedia dataset on Kaggle: JSONL files containing titles, abstracts, and infoboxes for each article. This structured format makes keyword retrieval highly effective.
My workbook is here: https://www.kaggle.com/code/sulphatet/bm-method-of-rag
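For reference, the snippet below sketches how the titles and abstracts used in the rest of this article could be loaded; the file name and field names (name, abstract) are assumptions about the JSONL layout, so adjust them to the actual schema of the dataset you download.

import json

titles, abstracts = [], []
# Hypothetical file and field names; adapt them to the real dataset schema
with open("enwiki_structured.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        titles.append(record.get("name", ""))
        abstracts.append(record.get("abstract", ""))

# "title. abstract" strings used as the BM25 corpus below
docs = [f"{t}. {a}" for t, a in zip(titles, abstracts)]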
5.2 In-Memory BM25 with rank_bm25
`rank_bm25` is a Python library that bills itself as a "two-line search engine." It implements several BM25 variants (Okapi BM25, BM25+, BM25L, and more); tokenization and preprocessing (lowercasing, stopword removal, stemming) are left to you.
- Best for: Up to ~100k documents on a single CPU; ideal for rapid prototyping.
- Tradeoff: Entire index lives in RAM, so very large corpora may exceed memory.
from rank_bm25 import BM25Okapi

# 1. Tokenize your corpus
corpus = docs  # list of "title. abstract" strings
tokenized = [doc.lower().split() for doc in corpus]

# 2. Initialize BM25
bm25 = BM25Okapi(tokenized)
# Docs: https://github.com/dorianbrown/rank_bm25

# 3. Search function
def search_bm25(query, k=5):
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    top_idx = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(titles[i], abstracts[i]) for i in top_idx]
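A quick illustrative query against the in-memory index (the query string here is arbitrary; results depend on the corpus you indexed):

for title, abstract in search_bm25("first moon landing date", k=3):
    print(title, "->", abstract[:80])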
5.3 Persistent BM25F with Whoosh
Whoosh is a pure-Python search library supporting fielded BM25 (BM25F). You can boost the title field to improve title matches.
Best for: Datasets up to 1M+ documents, where disk persistence and field weighting matter.
Tradeoff: Slower index writes and slightly higher query latency (still fast compared to computing embeddings and cosine similarity on CPU).
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import MultifieldParser
import os, shutil

# 1. Define schema with field boosts
schema = Schema(
    title=TEXT(stored=True, field_boost=2.0),
    abstract=TEXT(stored=True)
)
# Docs: https://whoosh.readthedocs.io/

# 2. Create or clear the index directory
if os.path.exists("indexdir"):
    shutil.rmtree("indexdir")
os.mkdir("indexdir")
ix = create_in("indexdir", schema)

# 3. Index documents
writer = ix.writer()
for t, a in zip(titles, abstracts):
    writer.add_document(title=t, abstract=a)
writer.commit()

# 4. Search function
def search_whoosh(query, k=5):
    ix = open_dir("indexdir")
    with ix.searcher() as searcher:
        parser = MultifieldParser(["title", "abstract"], schema=ix.schema)
        q = parser.parse(query)
        res = searcher.search(q, limit=k)
        return [(r["title"], r["abstract"]) for r in res]
6. Generating Answers: Pairing BM25 with a Small LLM
With retrieval in place, you can feed the top-k passages into a lightweight LLM such as Gemma-2B, GPT-2, or Hugging Face's SmolLM:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model
tok = AutoTokenizer.from_pretrained("google/gemma-2b-it")
mod = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it").eval()

# RAG answer function
def generate_answer(query):
    contexts = search_bm25(query)  # or search_whoosh(query)
    ctx_text = "\n".join(f"{t}: {a}" for t, a in contexts)
    prompt = f"Answer using context:\n{ctx_text}\nQuestion: {query}\nAnswer:"
    inp = tok(prompt, return_tensors="pt", truncation=True, max_length=4096)
    out = mod.generate(**inp, max_new_tokens=150)
    return tok.decode(out[0], skip_special_tokens=True)
This setup runs entirely on CPU and delivers sub-second end-to-end latency for 5 passages.
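If you want to verify latency on your own hardware, a minimal timing check might look like the sketch below (numbers will vary with CPU, corpus size, and max_new_tokens; the example question is arbitrary):

import time

start = time.perf_counter()
answer = generate_answer("When was the Eiffel Tower completed?")
print(answer)
print(f"End-to-end latency: {time.perf_counter() - start:.2f}s")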
7. When (and Why) You Should Use BM25
Use BM25 when:
- Your data is structured (titles, abstracts, FAQs).
❗BONUS: Fielded BM25 (BM25F) lets you assign your own field weights; for example, you can decide that the title is five times more important than the body text and tune your retrieval accordingly.
- Queries are keyword-driven or factoid-focused.
- Youβre limited to CPU or lightweight hosts.
- You want transparent, explainable relevance scores.
Use Embeddings and Cosine when:
🌐 You need deep semantic matches or paraphrase handling.
🔄 You require multilingual retrieval.
🚀 You operate at a massive (or critical) scale (>1M docs) with GPU/cluster resources.
8. Simpler Is Often Better
In modern AI, bigger isn't always better. By combining a time-tested algorithm like BM25 with compact LLMs, you get a nimble, cost-effective agentic system that excels at real-world tasks, without the complexity and expense of full-scale vector search.
Give this approach a try in your next project, and watch how simplicity unlocks performance.
❗Bonus libraries you can try: RagMeUp offers a generic framework for spinning up RAG over your own database easily. If you really want efficiency, BM25S is a high-performance BM25 implementation designed for speed.
Keywords: BM25, RAG, LLM retrieval, lightweight retrieval, agentic AI, rank_bm25, Whoosh, retrieval-augmented generation.
Published via Towards AI