Leveling Up Your RAG Chatbot: Enhancing Chatbots with Advanced RAG Techniques
Author(s): Sunil Rao
Originally published on Towards AI.
In my previous article, we laid the groundwork with a basic RAG architecture. Now, we’re moving towards building robust and performant RAG systems. This and the following articles will delve into the specific stages of advanced RAG, focusing on practical implementation.
This article will specifically cover chunking strategies, indexing methods, and re-ranking techniques, providing actionable insights for optimizing your RAG applications.
Chunking
Chunking is a crucial step in RAG pipelines. It involves dividing large documents into smaller, more manageable pieces (chunks) that can be effectively indexed and retrieved. The choice of chunking strategy significantly impacts retrieval performance and, consequently, the chatbot’s accuracy and relevance.
Some common chunking techniques are:
- Recursive Character Text Chunking:
The goal is to divide large text documents into smaller, more manageable chunks while preserving semantic coherence. It aims to avoid splitting sentences or paragraphs in the middle.
Recursive Splitting:
The text splitter uses an ordered list of separators (e.g., "\n\n", "\n", ".", " ") to split the text.
It starts with the first separator in the list.
If a chunk is too large, it tries to split it using the next separator.
This process is repeated recursively until all chunks are within the desired size. The order of separators is crucial.
chunk_size: The maximum size of each chunk.
chunk_overlap: The number of characters that overlap between consecutive chunks.
The overlap is very important, as it provides context between chunks.

Recursive Character Text Chunking with Metadata:
The goal is to add metadata to each chunk, such as the source document, page number, or other relevant information. This metadata can be used for filtering, sorting, or displaying retrieved chunks.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_index_documents(uploaded_files):
    if not uploaded_files:
        return None
    documents = []
    for uploaded_file in uploaded_files:
        with open(uploaded_file.name, "wb") as f:
            f.write(uploaded_file.getbuffer())
        loader = PyPDFLoader(uploaded_file.name)
        documents.extend(loader.load())
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                                   chunk_overlap=200)
    # Create Documents including metadata.
    all_texts = []
    for doc in documents:
        texts = text_splitter.create_documents(
            [doc.page_content],
            metadatas=[{"source": doc.metadata["source"],
                        "page": doc.metadata["page"]}])
        all_texts.extend(texts)
    return all_texts

texts = load_and_index_documents(uploaded_files)
Instead of split_documents, the code now uses create_documents. create_documents allows you to pass metadata to each document; the metadata is extracted from the doc.metadata attribute of each document loaded from the PDF.
2. Fixed-Size Chunking:
Fixed-size chunking is about dividing a large text document into smaller, uniformly sized pieces.
It’s very easy to implement and understand. It’s computationally efficient, requiring minimal processing.

The biggest drawback is that it often splits sentences, paragraphs, or even words in the middle. This can lead to chunks that lack coherent meaning.
LLMs rely on context to understand text, and fragmented chunks can severely hinder their performance.
If a chunk contains mostly irrelevant information, the LLM still has to process it. This can waste computational resources and increase latency.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

def load_and_index_documents(uploaded_files, chunk_size=1000, chunk_overlap=200):
    """Loads and indexes documents with fixed-size chunking."""
    if not uploaded_files:
        return None
    documents = []
    for uploaded_file in uploaded_files:
        with open(uploaded_file.name, "wb") as f:
            f.write(uploaded_file.getbuffer())
        loader = PyPDFLoader(uploaded_file.name)
        documents.extend(loader.load())
    # Use CharacterTextSplitter for fixed-size chunking
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size,
                                          chunk_overlap=chunk_overlap,
                                          separator="\n")
    texts = text_splitter.split_documents(documents)
    return texts

texts = load_and_index_documents(uploaded_files)
3. Semantic Chunking
Semantic chunking aims to create text chunks that are semantically related, improving retrieval accuracy and context for LLMs.
The step-by-step process is:
- Sentence Segmentation:
Divide the document into sentences or other semantically meaningful units (e.g., paragraphs, phrases). This is crucial because we want to create embeddings that capture the meaning of these units.
- Generate Sentence Embeddings:
Use an embedding model (here: OllamaEmbeddings) to convert each sentence into a numerical vector. These vectors represent the semantic meaning of the text.
- Clustering:
Apply a clustering algorithm (e.g., k-means) to the sentence embeddings. Clustering groups sentences with similar meanings together.
- Chunk Formation:
Create chunks by combining the sentences within each cluster. This results in chunks that are semantically coherent.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from sentence_transformers import SentenceTransformer  # For generating embeddings
from sklearn.cluster import KMeans
import numpy as np

def load_and_index_documents(uploaded_files, embedding_model="llama3", num_clusters=5):
    """Loads, indexes, and chunks documents semantically."""
    if not uploaded_files:
        return None
    documents = []
    for uploaded_file in uploaded_files:
        with open(uploaded_file.name, "wb") as f:
            f.write(uploaded_file.getbuffer())
        loader = PyPDFLoader(uploaded_file.name)
        documents.extend(loader.load())
    # Sentence Segmentation
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=0,
        separators=["\n\n", "\n", r"(?<=[.!?])\s"],
        is_separator_regex=True)  # treat the sentence-boundary separator as a regex
    sentences = text_splitter.split_documents(documents)
    sentences_text = [doc.page_content for doc in sentences]
    # Generate Sentence Embeddings (using Sentence Transformers)
    model = SentenceTransformer('all-mpnet-base-v2')
    # or use any other sentence transformer model
    embeddings_array = model.encode(sentences_text)
    # Clustering
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings_array)
    # Chunk Formation
    chunks = [" ".join([sentences_text[i]
                        for i, cluster in enumerate(clusters) if cluster == c])
              for c in range(num_clusters)]
    # Create LangChain-compatible embeddings.
    embeddings = OllamaEmbeddings(model=embedding_model)
    return chunks, embeddings  # return chunks and the embedding function

chunks, embeddings = load_and_index_documents(uploaded_files)
# Then you can use chunks and embeddings to create a vector store.
4. Document Chunking (Code-Specific Documents)
When dealing with code or structured documents, naive chunking methods can severely impact the effectiveness of your LLM application. Code has a hierarchical and semantic structure that needs to be preserved for the LLM to understand it correctly.
Challenges with Code-Specific Documents:
- Syntax and Semantics: Code relies heavily on syntax and semantics. Breaking code at arbitrary points can render it meaningless.
- Function and Class Boundaries: It’s crucial to keep functions, classes, and other code blocks intact.
- Markdown Structure: Markdown documents have a logical structure (headings, lists, code blocks) that should be preserved.
- Code Dependencies: Code snippets often rely on other parts of the code. Splitting them can break dependencies.
- Token efficiency: Code can be very token heavy, and needs to be handled carefully.

Strategies for Code-Specific Chunking:
- Use language-aware text splitters that understand the syntax of the programming language.
Ex: LangChain provides specialized text splitters for Python (PythonCodeTextSplitter), JavaScript (JavascriptCodeTextSplitter), Markdown (MarkdownTextSplitter), and other languages.
- Chunk code at the function or class level to preserve logical units.
- Ensure that code blocks within Markdown or other documents are kept intact.
- Add metadata to the chunks, such as the file name, line numbers, or function/class names. This can help the LLM understand the context of the code.
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    PythonCodeTextSplitter,
    JavascriptCodeTextSplitter,
    MarkdownTextSplitter,
)
import os

def load_and_index_documents(uploaded_files):
    """Loads and indexes documents, with code-specific handling."""
    if not uploaded_files:
        return None
    texts = []
    for uploaded_file in uploaded_files:
        file_name, file_extension = os.path.splitext(uploaded_file.name)
        with open(uploaded_file.name, "wb") as f:
            f.write(uploaded_file.getbuffer())
        if file_extension.lower() == ".py":
            loader = TextLoader(uploaded_file.name)
            text_splitter = PythonCodeTextSplitter(
                chunk_size=1000, chunk_overlap=200)
        elif file_extension.lower() == ".js":
            loader = TextLoader(uploaded_file.name)
            text_splitter = JavascriptCodeTextSplitter(
                chunk_size=1000, chunk_overlap=200)
        elif file_extension.lower() == ".md":
            loader = TextLoader(uploaded_file.name)
            text_splitter = MarkdownTextSplitter(
                chunk_size=1000, chunk_overlap=200)
        elif file_extension.lower() == ".pdf":
            loader = PyPDFLoader(uploaded_file.name)
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000, chunk_overlap=200)
        else:
            # Default to RecursiveCharacterTextSplitter for other file types
            loader = TextLoader(uploaded_file.name)
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000, chunk_overlap=200)
        # Split each file with the splitter that matches its type
        documents = loader.load()
        texts.extend(text_splitter.split_documents(documents))
    return texts

texts = load_and_index_documents(uploaded_files)
5. Agentic Chunking
Agentic Chunking moves beyond static document splitting. It’s about dynamically creating and selecting chunks based on the agent’s current task, the user’s query, and the ongoing conversation history. The goal is to provide the agent with the most relevant and timely information to perform its tasks effectively.
Key Aspects:
- Contextual Awareness: The agent considers the user’s query, the ongoing conversation, and its internal state.
Ex: If a user asks, “What is the capital of France?” and then follows up with “Tell me more about its history,” the agent should retrieve information about Paris’s history, not just general French history.
- Dynamic Retrieval: The agent doesn’t just retrieve a static set of chunks. It adapts its retrieval based on the evolving context.
Ex: In a multi-step task like booking a flight, the agent might first retrieve information about available flights, then about seat selection, and finally about payment options.
- Tool Integration: Chunks can be used to provide input to external tools.
Ex: If the agent has a weather tool, it might create a chunk containing the user’s location to pass to the tool.
- Conversation Memory: The agent considers the previous turns in the conversation to maintain context.
- Task-Specific Chunking: The agent tailors chunks to the specific task it’s trying to perform.
Ex: For a summarization task, the agent might create chunks that represent key sections of the document.
For a question-answering task, it might create chunks that contain relevant facts.
Scenario:
Imagine a customer support chatbot.
User: “My order hasn’t arrived.”
Agent:
- Analyzes the query and identifies the need to check order status.
- Retrieves the user’s order history from a database (using a tool).
- Creates a chunk containing the relevant order details.
- Uses the order details to query a shipping API (using another tool).
- Creates a chunk containing the shipping status.
- Generates a response to the user.
Propositional Retrieval:
Propositional Retrieval breaks down documents into individual “propositions” or atomic facts. A proposition is a simple, declarative statement that expresses a single idea. This allows for highly granular retrieval and reasoning.
Key Aspects:
Proposition Extraction:
Documents are processed to identify individual propositions. This often involves NLP techniques like dependency parsing or semantic role labeling.
Ex: The sentence “The quick brown fox jumps over the lazy dog” could be broken down into propositions like:
- “There is a fox.”
- “The fox is brown.”
- “The fox is quick.”
- “There is a dog.”
- “The dog is lazy.”
- “The fox jumps over the dog.”
Proposition Indexing:
Propositions are embedded and stored in a vector database.
Proposition Retrieval:
When a query is received, the system retrieves the propositions that are most relevant.
Ex: If the query is “What color is the fox?”, the system would retrieve the proposition “The fox is brown.”
Reasoning:
Retrieved propositions can be combined and reasoned over to answer complex questions.
Ex: If the query is “did the fast animal jump over the slow animal?”, the system can combine the propositions from the earlier example to answer positively.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OllamaEmbeddings
from langchain.agents import AgentType, initialize_agent
from langchain.llms import Ollama
from langchain.memory import ConversationBufferMemory
from langchain.chains import RetrievalQA
from langserve import RemoteRunnable
# NLP library for proposition extraction
import spacy

# Load spaCy model for proposition extraction.
nlp = spacy.load("en_core_web_sm")

def extract_propositions(text):
    """Extracts propositions from text using spaCy."""
    doc = nlp(text)
    propositions = []
    for sent in doc.sents:
        # Conceptual: extract propositions based on dependency parsing
        # or other NLP techniques. This part requires more advanced
        # NLP knowledge and implementation.
        propositions.append(sent.text)  # This is a very simple example.
    return propositions
def load_and_index_documents(uploaded_files):
    if not uploaded_files:
        return None
    propositions = []
    for uploaded_file in uploaded_files:
        with open(uploaded_file.name, "wb") as f:
            f.write(uploaded_file.getbuffer())
        loader = PyPDFLoader(uploaded_file.name)
        documents = loader.load()
        for doc in documents:
            propositions.extend(extract_propositions(doc.page_content))
    embeddings = OllamaEmbeddings(model="llama3")
    vectordb = Chroma.from_texts(
        propositions,
        embeddings,
        persist_directory="chroma_db_propositions")
    return vectordb.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 10, "score_threshold": 0.0})
retriever = load_and_index_documents(uploaded_files)
llm = Ollama(model="llama3")
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True)

# Create retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever)

# Initialize agent
tools = [
    # Add tools here
]
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
)

def agentic_propositional_query(query):
    """Handles queries with agentic propositional retrieval."""
    if "information about" in query.lower():
        response = qa_chain.run(query)
        return response
    else:
        response = agent.run(query)
        return response

# Example usage
query = "What is the capital of France?"
response = agentic_propositional_query(query)
print(response)

# Optional: call a chain deployed behind a LangServe endpoint
remote_chain = RemoteRunnable("your_remote_endpoint")
remote_response = remote_chain.invoke({"input": "your_input"})
print(remote_response)
Search Indexing
In a RAG architecture, the core challenge is to efficiently access and retrieve relevant information from a vast knowledge base to augment the LLM responses. Without effective search indexing, the following problems arise:
- If the LLM had to scan the entire knowledge base for every query, it would be extremely slow, making the chatbot unusable for real-time applications.
- As the knowledge base grows, the time required to search it would increase linearly, eventually making the system impractical.
- Without indexing, the LLM might retrieve irrelevant or outdated information, leading to inaccurate or unhelpful responses.
- LLMs have token limits. Without indexing, it would be impossible to fit the entire knowledge base into the LLM’s context window.
- Simple keyword searches might miss relevant information if the query and the knowledge base use different words or phrases.
Search indexing addresses these problems by creating a structured representation of the knowledge base that allows for fast and accurate retrieval.
During the vectorization process, each text chunk is converted into a numerical vector (embedding) using an embedding model. These embeddings capture the semantic meaning of the text and are stored in a vector database, which creates an index that allows for efficient similarity search. This index is the core of the search indexing process. You may wonder why we use vector databases for this:
- Vector databases are designed to perform semantic search, which means they can find information based on meaning, not just keywords.
- This is crucial for RAG chatbots, which need to understand the user’s intent and retrieve relevant context.
- Embeddings are high-dimensional vectors, and vector databases are optimized for storing and searching this type of data.
- Vector databases provide efficient similarity search algorithms that can quickly find the most similar embeddings to a query embedding.
- This allows the chatbot to retrieve relevant information in real-time.
- Vector databases can handle large datasets and high query volumes, making them suitable for production RAG applications.
Indexing:
In general, indexing is a technique used to optimize data retrieval by creating a data structure that allows for fast lookups. Instead of searching through every piece of data, an index allows you to quickly locate the relevant items.
Think of it like an index in a book. It tells you where to find specific topics without reading the entire book. In the context of databases:
- Traditional Databases: Indexes are often created on specific columns (e.g., customer ID, product name) to speed up queries that filter or sort by those columns.
- Vector Databases: Indexes are created on vector embeddings to speed up similarity searches.
Various methods for creating vector indexes:
- Flat Index (Brute-Force Search):
Stores all vector embeddings in a simple, linear data structure (e.g., an array).
When a query embedding arrives, it calculates the distance (e.g., cosine similarity) between the query and every vector in the index.
Returns the top-k vectors with the smallest distances.

Imagine a RAG chatbot with a very small knowledge base (e.g., limited set of FAQs or a very small collection of specialized documents.). A flat index would be sufficient to quickly retrieve relevant chunks. However, if the knowledge base grows to thousands or millions of documents, it quickly becomes impractical for real-world RAG systems.
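To make the brute-force idea concrete, here is a minimal NumPy sketch of a flat index; the 384-dimensional random vectors simply stand in for real chunk and query embeddings:
import numpy as np
def cosine_top_k(query_vec, index_vecs, k=3):
    """Brute-force search: compare the query against every stored vector."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = m @ q                 # cosine similarity with every stored vector
    top = np.argsort(-scores)[:k]  # indices of the k most similar vectors
    return top, scores[top]
index_vecs = np.random.rand(1000, 384)  # stand-in chunk embeddings
query_vec = np.random.rand(384)         # stand-in query embedding
ids, scores = cosine_top_k(query_vec, index_vecs, k=3)
print(ids, scores)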
2. Locality Sensitive Hashing (LSH):
The problem: high-dimensional vector similarity search is computationally expensive, especially with large datasets. We need a method that can quickly find approximate nearest neighbors (ANN) without performing exhaustive comparisons.
LSH uses hash functions that are “locality sensitive,” meaning they hash similar vectors into the same buckets with high probability.
This allows us to quickly identify candidate vectors by comparing hash buckets instead of individual vectors.
How it Works:
- LSH uses a family of hash functions designed so that similar vectors have a higher probability of colliding (being hashed into the same bucket).
- Each vector in the database is hashed using one or more LSH functions.
- The hashed values are used to assign vectors to buckets in hash tables.
- When a query arrives, it’s also hashed using the same LSH functions.
- The query’s hash values are used to look up the corresponding buckets in the hash tables.
- Vectors in the retrieved buckets are considered candidate nearest neighbors.
- The actual distances (e.g., cosine similarity) are calculated between the query vector and the candidate vectors. The k nearest neighbors are returned.
LSH is valuable in RAG systems where speed and scalability are crucial, especially with very large knowledge bases.
Ex: A RAG system used for real-time customer support, where queries need to be answered quickly. LSH allows for rapid retrieval of relevant information from a vast database of customer support documents. When dealing with very high dimensional embeddings, LSH can be very useful.
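As a rough illustration (not how production libraries implement it), the sketch below hashes vectors with random hyperplanes: vectors whose projections have the same signs land in the same bucket and become the only candidates scored exactly; dimensions and data are placeholders.
import numpy as np
from collections import defaultdict
rng = np.random.default_rng(42)
dim, n_planes = 384, 12
planes = rng.normal(size=(n_planes, dim))  # random hyperplanes define the hash
def lsh_hash(vec):
    """Sign of the projection onto each hyperplane forms the bucket key."""
    return tuple((planes @ vec > 0).astype(int))
# Build the hash table over the corpus embeddings (random data as a stand-in)
corpus = rng.normal(size=(10000, dim))
buckets = defaultdict(list)
for i, v in enumerate(corpus):
    buckets[lsh_hash(v)].append(i)
# Query: only vectors in the matching bucket are compared exactly
query = rng.normal(size=dim)
candidates = buckets[lsh_hash(query)]
print(f"{len(candidates)} candidates instead of {len(corpus)} exact comparisons")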
3. Hierarchical Navigable Small Worlds (HNSW):
HNSW is designed to efficiently solve the approximate nearest neighbor (ANN) search problem in high-dimensional vector spaces. It’s particularly useful when dealing with embeddings generated from text, images, or other data, which are common in RAG pipelines.
How it works:
- Multi-Layered Graph:
HNSW constructs a multi-layered graph structure.
Each layer is a “small world” graph, meaning that most nodes are connected to a few nearby nodes.
The top layer is a coarse representation of the vector space, while the lower layers provide increasingly finer-grained representations.
- Node Representation:
Each vector embedding in the dataset becomes a node in the graph.
The nodes are connected to their nearest neighbors in each layer.
- Hierarchical Structure:
The layers are organized hierarchically, with fewer nodes in the higher layers and more nodes in the lower layers.
The top layer provides a high-level overview of the vector space, while the lower layers provide more detailed information.
- Search Process:
When a query embedding arrives, the search starts at the top layer.
When a query embedding arrives, the search starts at the top layer.
The algorithm finds the nearest neighbors to the query in the top layer.
It then moves down to the next layer and refines the search by finding the nearest neighbors to the query within the neighborhood of the nodes found in the previous layer.
This process continues until the bottom layer is reached.
The algorithm returns the k nearest neighbors found in the bottom layer.
Ex: A RAG chatbot with a large knowledge base (e.g., thousands of PDF documents). HNSW allows the chatbot to quickly retrieve relevant chunks, even with a large number of vectors. ChromaDB, for example, uses HNSW.
def create_vector_store(texts, embeddings):
    if not texts:
        return None
    # Explicitly configure the HNSW distance metric used by Chroma
    vectordb = Chroma.from_documents(texts, embeddings,
                                     persist_directory="chroma_db",
                                     collection_metadata={"hnsw:space": "cosine"})  # or "l2", "ip"
    return vectordb.as_retriever(search_type="similarity_score_threshold",
                                 search_kwargs={"k": 10, "score_threshold": 0.0})

retriever = create_vector_store(texts, embeddings)
4. Inverted File System (IVF):
IVF is a method for creating vector indexes that efficiently narrows down the search space for approximate nearest neighbor (ANN) searches. It’s particularly useful for large datasets where brute-force search is impractical.
How it Works:
- Vector Space Partitioning (Clustering):
The vector space is partitioned into a set of clusters.
This is typically done using a clustering algorithm like k-means.
Each cluster is represented by a centroid vector.
Each vector in the dataset is assigned to the cluster with the closest centroid.
- Inverted File Creation:
An inverted file is created, which maps each cluster ID to the list of vectors belonging to that cluster.
This is similar to an inverted index in traditional text search, where terms are mapped to the documents they appear in.
- Query Processing:
When a query vector arrives, the algorithm first identifies the closest cluster(s) to the query.
This is done by comparing the query vector to the cluster centroids.
The algorithm then retrieves the list of vectors belonging to the closest cluster(s) from the inverted file.
- Local Search:
The algorithm performs a local search within the retrieved clusters to find the nearest neighbors to the query.
This involves calculating the distances (e.g., cosine similarity) between the query vector and the vectors in the selected clusters.
- Result Ranking:
The results are ranked based on their distances to the query vector.
The k nearest neighbors are returned.
Ex: A RAG chatbot with a very large knowledge base (e.g., millions of web pages). IVF can quickly identify the most relevant clusters of vectors, reducing the search time. Then, it performs a more detailed search within those clusters.
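If you use FAISS (a library not otherwise used in this article, so treat this as an illustrative aside), an IVF index can be sketched roughly as follows; the dimension, nlist, and nprobe values are arbitrary.
import faiss
import numpy as np
d, nlist = 384, 100  # embedding dimension, number of k-means clusters
xb = np.random.rand(50000, d).astype("float32")  # stand-in corpus embeddings
xq = np.random.rand(1, d).astype("float32")      # stand-in query embedding
quantizer = faiss.IndexFlatL2(d)  # assigns vectors to the nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)   # k-means clustering of the corpus
index.add(xb)     # fill the inverted lists
index.nprobe = 8  # how many clusters to visit at query time
distances, ids = index.search(xq, 5)
print(ids)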
5. Product Quantization (PQ)
PQ is a technique for compressing high-dimensional vectors, which is particularly beneficial in vector databases where memory usage is a concern. It’s often used in conjunction with other indexing methods, such as IVF, to further improve performance.
How PQ Works:
- Subvector Decomposition:
Each high-dimensional vector is divided into multiple subvectors.
For example, a 128-dimensional vector might be divided into 8 subvectors of 16 dimensions each.
- Subvector Quantization:
Each subvector is quantized (approximated) using a codebook.
A codebook is a set of representative vectors (centroids) for each subvector space.
The quantization process involves finding the closest centroid in the codebook for each subvector.
The subvector is then replaced with the index of its closest centroid in the codebook.
- Codebook Creation:
Codebooks are typically created using a clustering algorithm like k-means.
The clustering is performed on the subvectors of the training data.
- Vector Compression:
The original vector is replaced with a sequence of codebook indices, one for each subvector.
This significantly reduces the memory footprint of the vectors.
- Distance Calculation:
When a query vector arrives, it’s also decomposed and quantized in the same way.
The distance between the query vector and a database vector is approximated by calculating the distances between the corresponding codebook indices.
This can be done efficiently using precomputed distance tables.
Ex: A RAG chatbot with a very large knowledge base and limited memory resources. PQ can compress the embeddings, allowing the chatbot to store more vectors in memory.
For instance, PQ may be used in edge devices, where resources are limited.
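Continuing the FAISS aside from the IVF example, a product-quantized index can be sketched like this; m and nbits are illustrative values.
import faiss
import numpy as np
d, m, nbits = 128, 8, 8  # 8 subvectors, 2^8 = 256 centroids per codebook
xb = np.random.rand(20000, d).astype("float32")
xq = np.random.rand(1, d).astype("float32")
index = faiss.IndexPQ(d, m, nbits)
index.train(xb)  # learns the per-subvector codebooks
index.add(xb)    # stores each vector as m one-byte codes (~8 bytes vs 512 raw)
distances, ids = index.search(xq, 5)  # distances approximated from code tables
print(ids)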
6. Random Projection (RP):
Random Projection is a dimensionality reduction technique that aims to preserve the pairwise distances between vectors in a lower-dimensional space. It’s used to create approximate nearest neighbor (ANN) indexes by reducing the dimensionality of high-dimensional vector embeddings, making similarity search faster and more memory-efficient.
How RP Works:
- Random Matrix Generation:
A random matrix is generated with dimensions d’ x d, where d’ is the desired lower dimensionality and d is the original dimensionality.
The elements of the random matrix are typically drawn from a Gaussian distribution or a sparse distribution.
- Vector Projection:
Each high-dimensional vector in the dataset is projected onto the lower-dimensional space by multiplying it with the random matrix.
This results in a lower-dimensional representation of the original vector.
- Indexing:
The lower-dimensional vectors are then indexed using a suitable indexing method, such as a flat index or a tree-based index.
- Query Processing:
When a query vector arrives, it’s also projected onto the lower-dimensional space using the same random matrix.
The algorithm then performs a similarity search in the lower-dimensional space to find the nearest neighbors to the projected query vector.
- Result Ranking:
The results are ranked based on their distances in the lower-dimensional space.
The k nearest neighbors are returned.
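A minimal NumPy sketch of random projection, with made-up dimensions:
import numpy as np
rng = np.random.default_rng(0)
d, d_reduced = 768, 128
# Random Gaussian projection matrix, scaled to roughly preserve distances
R = rng.normal(size=(d_reduced, d)) / np.sqrt(d_reduced)
corpus = rng.normal(size=(10000, d))  # stand-in embeddings
query = rng.normal(size=d)
corpus_low = corpus @ R.T  # project the corpus once, offline
query_low = R @ query      # project the query at search time
# The similarity search now happens entirely in the 128-dimensional space
dists = np.linalg.norm(corpus_low - query_low, axis=1)
print(np.argsort(dists)[:5])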
7. Approximate Nearest Neighbors Oh Yeah (ANNOY)
ANNOY is a library designed for efficient approximate nearest neighbor (ANN) search in high-dimensional spaces. It’s particularly useful for vector embeddings, which are common in RAG pipelines. ANNOY sacrifices some accuracy for speed, making it suitable for applications where near-real-time retrieval is crucial.
How ANNOY Works:
- Random Projection Trees:
ANNOY builds a forest of random projection trees.
Each tree partitions the vector space using random hyperplanes.
A hyperplane is defined by a random vector and a threshold.
Vectors on one side of the hyperplane go to the left child node, and vectors on the other side go to the right child node.
This partitioning process is repeated recursively until each leaf node contains a small number of vectors.
- Tree Construction:
Multiple random projection trees are built to increase the probability of finding good approximate nearest neighbors.
Each tree uses different random hyperplanes, providing diverse views of the vector space.
The number of trees is a parameter that affects the trade-off between accuracy and speed. More trees improve accuracy but increase indexing time and memory usage.
- Search Process:
When a query vector arrives, the algorithm traverses each tree, starting from the root node.
At each node, it determines which side of the hyperplane the query vector falls on and follows the corresponding child node.
This process continues until a leaf node is reached.
The algorithm collects the vectors from the leaf nodes of all trees.
- Candidate Selection and Ranking:
The algorithm calculates the distances (e.g., cosine similarity) between the query vector and the collected candidate vectors.
The k nearest neighbors are returned.
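A minimal sketch using the annoy package; the dimension, tree count, and random data are placeholders.
import numpy as np
from annoy import AnnoyIndex
dim, n_trees = 384, 10
index = AnnoyIndex(dim, "angular")  # "angular" is roughly cosine distance
rng = np.random.default_rng(0)
for i in range(10000):
    # In a RAG system these would be chunk embeddings, keyed by chunk id
    index.add_item(i, rng.normal(size=dim).tolist())
index.build(n_trees)      # more trees: better recall, slower build, more memory
index.save("chunks.ann")  # the index is memory-mapped and cheap to reload
query = rng.normal(size=dim)
neighbor_ids = index.get_nns_by_vector(query.tolist(), 5)
print(neighbor_ids)  # map these ids back to your text chunks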
Vector Store Flat Index
A “flat index” in a vector store is the simplest possible approach to storing and retrieving vector embeddings. It’s essentially a brute-force method.
- Storage:
All vector embeddings are stored in a simple, linear data structure, like a list or array.
There’s no complex organization or partitioning of the vector space.
- Retrieval:
When a query embedding is received, the system calculates the distance (or similarity) between the query embedding and every embedding in the flat index.
The distances are calculated using a chosen metric (e.g., cosine similarity, Euclidean distance).
The vectors are then ranked based on their distances, and the top-k nearest neighbors are returned.
Vector Store Flat Index in RAG:
- Embedding Creation:
Your RAG system first processes the knowledge base (documents, text chunks).
Each chunk is converted into a vector embedding using an embedding model (e.g., OpenAI Embeddings, Ollama Embeddings).
- Flat Index Storage:
These embeddings are stored in the vector store’s flat index.
Along with the embeddings, the system stores pointers or references to the original text chunks.
- Query Processing:
When a user asks a question, the query is also converted into an embedding using the same embedding model.
- Brute-Force Similarity Search:
The query embedding is compared to every embedding in the flat index.
The system calculates the similarity score for each comparison.
- Ranking and Retrieval:
The embeddings are ranked based on their similarity scores.
The top-k embeddings (the most similar ones) are selected.
The system retrieves the corresponding text chunks from the knowledge base.
- Context Augmentation:
The retrieved text chunks are provided as context to the LLM.
The LLM uses this context to generate a response to the user’s query.
Retrieval Methods
In RAG, retrieval methods are crucial for fetching relevant information from an external knowledge base to enhance the language model’s (LLM) responses. Let’s break down how the different search methods work within the context of RAG systems.
1. Keyword Search (Lexical Search) in RAG:
- The user’s query is treated as a string of keywords.
- The system searches an inverted index (a data structure mapping keywords to documents) for documents containing those keywords.
- Results are often ranked based on term frequency-inverse document frequency (TF-IDF) or similar metrics.
In RAG, keyword search can be used as a first-pass filter to narrow down the search space before vector search. It can also be combined with vector search in hybrid retrieval.
Ex: If a user asks, “How do neural networks learn?”, a keyword search would quickly identify documents containing “neural,” “networks,” and “learn.”
Documents containing the exact phrase “neural networks” together with “learn” would rank highly.
Documents mentioning “neural network architecture” or “learning algorithms” would be retrieved.
Documents discussing “machine learning” without explicitly using “neural networks” or “learn” might be missed.
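A minimal keyword-search sketch using the rank_bm25 package (any BM25 or TF-IDF implementation would work; the tiny corpus here is made up):
from rank_bm25 import BM25Okapi
corpus = [
    "Neural networks learn by adjusting weights via backpropagation.",
    "Gradient descent is an optimization algorithm.",
    "Transformers use self-attention for sequence modeling.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "how do neural networks learn"
tokenized_query = query.split()
print(bm25.get_scores(tokenized_query))  # lexical relevance score per document
print(bm25.get_top_n(tokenized_query, corpus, n=2))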
2. Semantic Search (Vector Search) in RAG:
- Both the user’s query and the documents in the knowledge base are converted into vector embeddings using an embedding model.
- The embeddings represent the semantic meaning of the text.
- A similarity search is performed to find the documents with the most similar embeddings to the query embedding.
- Cosine similarity is a common metric used to measure similarity.
Semantic search is the core retrieval method in most modern RAG systems. It allows the system to retrieve documents that are semantically relevant to the query, even if they don’t share keywords.
Ex: If a user asks, “How do neural networks learn?”, a semantic search would retrieve documents discussing “backpropagation,” “gradient descent,” or “optimization algorithms,” even if they don’t explicitly contain the phrase “neural networks learn.”
Documents discussing the general concepts of learning, training, or fitting a model would also be retrieved.
Documents only mentioning the history of neural networks, and not the learning process, would be ranked lower.
3. Hybrid Search:
- Combines keyword search and semantic search.
- This could be done in various ways, such as:
- Performing both searches and combining the results.
- Using keyword search to pre-filter the documents and then performing semantic search on the filtered set.
- Using Reciprocal Rank Fusion (RRF) to combine results.
In the “How do neural networks learn?” example:
Keyword search might miss documents about backpropagation if they don’t explicitly use “learn.”
Semantic search would capture those documents but might also retrieve some irrelevant documents about general machine learning.
Hybrid search aims to find the “sweet spot” by retrieving documents that are both relevant in terms of keywords and semantic meaning.
NOTE: Reciprocal Rank Fusion will be explained later in this article
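A minimal sketch of hybrid retrieval, assuming your LangChain version ships BM25Retriever and EnsembleRetriever (which fuses result lists with weighted Reciprocal Rank Fusion), and assuming texts (the chunked documents) and vectordb (the Chroma store) from earlier are in scope:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# Lexical retriever over the same chunks that were embedded
bm25_retriever = BM25Retriever.from_documents(texts)
bm25_retriever.k = 10
# Semantic retriever backed by the Chroma vector store
vector_retriever = vectordb.as_retriever(search_kwargs={"k": 10})
# Weighted Reciprocal Rank Fusion over the two result lists
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6])
docs = hybrid_retriever.get_relevant_documents("How do neural networks learn?")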
Parent Document Retriever:
The Parent Document Retriever addresses the challenge of balancing detailed information retrieval with the need for broader context. Traditional chunking strategies can sometimes split relevant information across multiple chunks, leading to a loss of context. This method creates a hierarchical structure of “parent” and “child” documents to mitigate this problem.
Chunking:
- Child Chunks: The original document is divided into small, detailed chunks (e.g., sentences, short paragraphs). These are designed to be highly specific.
- Parent Chunks: The original document is also divided into larger, more comprehensive chunks (e.g., full paragraphs, sections). These provide broader context.
Indexing:
- Only the child chunks are embedded and indexed in the vector database.
- A mapping is maintained between each child chunk and its corresponding parent chunk.
Retrieval:
- When a user query is received, it’s embedded, and a similarity search is performed against the indexed child chunks.
- The most relevant child chunks are retrieved.
Parent Context Inclusion:
- For each retrieved child chunk, the corresponding parent chunk is also retrieved from the storage.
- The parent chunk is included as additional context alongside the child chunk.
Context Augmentation:
- The retrieved child chunks and their associated parent chunks are combined and provided as context to the LLM.
- The LLM generates a response based on this enhanced context.
Ex: Imagine a chatbot designed to answer questions about a technical manual.
Document: A technical manual describing how to use a software application.
Chunking:
- Child Chunks: Individual sentences or short code snippets.
- Parent Chunks: Full paragraphs or sections describing specific features or procedures.
Indexing:
- Only the child chunks are embedded and indexed.
- A mapping is created, for example:
- Child Chunk “Click the ‘Save’ button.” -> Parent Chunk “Saving Your Work” section.
- Child Chunk “Use the command ‘/export data’ to export.” -> Parent Chunk “Data Export Features” section.
User Query: “How do I save my work?”
Retrieval:
- The query is embedded, and the vector database retrieves the child chunk: “Click the ‘Save’ button.”
Parent Context Inclusion:
- The corresponding parent chunk, “Saving Your Work” section, is also retrieved.
Context Augmentation:
- The chatbot provides the following context to the LLM:
- Child Chunk: “Click the ‘Save’ button.”
- Parent Chunk: “Saving Your Work: This section describes the various methods for saving your work. You can use the ‘Save’ button, the ‘Save As’ menu option, or the keyboard shortcut Ctrl+S. Ensure you save your work regularly to avoid data loss.”
LLM Response:
The LLM generates a response based on the combined context: “To save your work, you can click the ‘Save’ button, use the ‘Save As’ menu option, or press Ctrl+S. It’s important to save your work regularly to prevent data loss.”
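LangChain provides a ParentDocumentRetriever that implements this pattern directly; here is a minimal sketch, with illustrative chunk sizes and assuming the Ollama embeddings used elsewhere in this article and a documents list loaded from the manual:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OllamaEmbeddings
# Small child chunks get embedded; larger parent chunks are returned as context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
vectorstore = Chroma(collection_name="children",
                     embedding_function=OllamaEmbeddings(model="llama3"))
docstore = InMemoryStore()  # holds the parent chunks, keyed by id
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter)
retriever.add_documents(documents)  # documents: pages loaded from the manual
# The similarity search runs over child chunks, but parent chunks come back
results = retriever.get_relevant_documents("How do I save my work?")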
Sentence Window Retrieval:
The primary goal of Sentence Window Retrieval is to provide the LLM with richer, more contextually relevant information by retrieving a “window” of sentences around the most relevant sentence, instead of just a single isolated sentence. This helps the LLM understand the surrounding context and generate more coherent and accurate responses.
Sentence Segmentation:
- The documents in the knowledge base are first broken down into individual sentences.
Embedding Generation:
- Embeddings are generated for each individual sentence.
Vector Indexing:
- The sentence embeddings are stored in a vector database.
Query Embedding:
- The user’s query (“when will we reach AGI”) is converted into a vector embedding.
Similarity Search:
- A similarity search is performed in the vector database to find the sentence embedding that is most similar to the query embedding.
Window Expansion:
- Instead of just retrieving the most similar sentence, a “window” of sentences is created around it.
- This window typically includes a few sentences before and after the most relevant sentence.
- The size of the window (number of sentences before and after) is a parameter that can be adjusted.
Context Augmentation:
- The expanded sentence window is provided as context to the LLM.
Response Generation:
- The LLM generates a response based on the enhanced context.
Chatbot Example: “When will we reach AGI?”
Knowledge Base:
- Imagine a knowledge base containing various articles and research papers on artificial general intelligence (AGI).
- One document contains these sentences:
"Many researchers believe that AGI is still decades away."
"The complexity of human cognition presents a significant challenge."
"Advances in deep learning and reinforcement learning are promising."
"However, the development of true understanding and reasoning remains elusive."
"Some experts predict that AGI could be achieved within the next 20–30 years."
"Others argue that it may take much longer, or may never be possible."
"The ethical implications of AGI are also a major concern."
Sentence Segmentation and Embedding:
- Each of these sentences is embedded and indexed.
User Query:
- “When will we reach AGI?”
Similarity Search:
- The query embedding is compared to the sentence embeddings.
- The sentence “Some experts predict that AGI could be achieved within the next 20–30 years.” is identified as the most relevant.
Window Expansion:
- A sentence window is created, including:
"However, the development of true understanding and reasoning remains elusive." (sentence before)
"Some experts predict that AGI could be achieved within the next 20–30 years." (most relevant sentence)
"Others argue that it may take much longer, or may never be possible." (sentence after)
Context Augmentation: The following context is provided to the LLM:
"However, the development of true understanding and reasoning remains elusive.
Some experts predict that AGI could be achieved within the next 20–30 years.
Others argue that it may take much longer, or may never be possible."
Response Generation:
The LLM generates a response based on this context:
"While some experts believe AGI could be achieved within the next 20–30 years,
others argue that it may take much longer, or may never be possible.
The development of true understanding and reasoning remains a significant
challenge."
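Implementations differ between frameworks; as a rough standalone sketch using sentence-transformers, the window logic looks like this (the window size and model choice are illustrative):
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")
def sentence_window_retrieve(sentences, query, window=1):
    """Return the best-matching sentence plus `window` sentences on each side."""
    sent_emb = model.encode(sentences, normalize_embeddings=True)
    query_emb = model.encode(query, normalize_embeddings=True)
    scores = sent_emb @ query_emb  # cosine similarity (vectors are normalized)
    best = int(np.argmax(scores))
    start, end = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])
sentences = [
    "Many researchers believe that AGI is still decades away.",
    "Some experts predict that AGI could be achieved within the next 20-30 years.",
    "Others argue that it may take much longer, or may never be possible.",
]
print(sentence_window_retrieve(sentences, "When will we reach AGI?", window=1))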
Re-ranking
Re-ranking is a crucial post-retrieval step designed to refine the initial set of retrieved documents or chunks, improving the overall relevance and accuracy of the context provided to the LLM.

The Need for Re-ranking:
- Initial Retrieval Limitations:
Basic vector search might retrieve a set of documents that are semantically similar to the query but not necessarily the most relevant in terms of answering the specific question.
Vector search can sometimes retrieve documents that are too broad or too narrow, or that contain irrelevant information.
- Improving Context Quality:
Re-ranking helps to identify the most informative and relevant documents within the initial set, ensuring that the LLM receives high-quality context.
It can help to eliminate noise and irrelevant information, improving the LLM’s ability to generate accurate and coherent responses.
How Re-ranking Works:
- Initial Retrieval: The user’s query is used to retrieve a set of candidate documents or chunks from the vector database.
- Re-ranking Model: A separate model, typically a cross-encoder, is used to re-score the candidate documents based on their relevance to the query.
- Re-scoring: The re-ranking model takes the query and each candidate document as input and produces a relevance score.
- Re-ordering: The candidate documents are re-ordered based on their relevance scores.
- Context Selection: The top-ranked documents are selected as the final context for the LLM.

Re-ranking is the process of taking an initial set of retrieved documents or text chunks from a knowledge base and re-ordering them to improve the relevance of the context provided to a Language Model (LLM). This step is crucial because:
- Initial Retrieval Limitations: Vector databases or keyword searches often provide a broad set of results, some of which may be less relevant or noisy.
- Context Optimization: LLMs thrive on precise and relevant context. Re-ranking ensures that the most pertinent information is presented, leading to more accurate and coherent responses.
Two-stage retrieval systems, especially when combined with re-rankers, are a common and effective architecture for improving the performance of information retrieval, particularly in contexts like RAG.
Two-Stage Retrieval Systems: The core idea is to break the retrieval process into two distinct phases:
- First Stage (Candidate Generation):
- This stage aims to quickly retrieve a large set of potentially relevant documents or chunks from the knowledge base.
- Efficiency is paramount in this stage.
- Common methods include:
- Vector Search (Bi-encoders): Fast similarity search using bi-encoders, which generate independent embeddings for queries and documents.
- Keyword Search (Lexical Search): Fast retrieval based on keyword matching.
- Hybrid Search: A combination of vector and keyword search.
The goal is to cast a wide net and retrieve a superset of relevant documents, even if it includes some irrelevant ones.
2. Second Stage (Re-ranking):
- This stage focuses on refining the results from the first stage, selecting the most relevant documents from the candidate set.
- Accuracy is the priority in this stage.
Common methods include:
- Cross-Encoders: More accurate but computationally expensive models that jointly process queries and documents.
- Learning-to-Rank (LTR) Models: Machine learning models trained to predict document relevance.
- LLM assisted reranking: Using LLMs to determine the most relevant documents.
- ColBERT: A model that achieves a good balance of speed and accuracy.
Learning-to-Rank (LTR) Models:
LTR models are machine learning models specifically designed for ranking tasks. They learn to predict the relevance of documents or items to a given query, allowing for optimal re-ordering.
How LTR Works in RAG:
- Feature Extraction:
Features are extracted from the query and the retrieved documents. These features can include:
Similarity scores from the initial retrieval.
Keyword matches.
Metadata (e.g., publication date, source).
Textual features (e.g., term frequency, document length).
LLM-generated features.
- Model Training:
An LTR model is trained on a dataset of queries and documents, labeled with relevance scores or rankings.
The model learns to predict the relevance of documents based on the extracted features.
- Re-ranking:
When a user query is received, the LTR model is used to score the retrieved documents.
The documents are then re-ordered based on their predicted relevance scores.
Different Types of LTR Models: LTR models can be categorized into three main approaches:
Pointwise LTR:
- Treats ranking as a regression or classification problem.
- Predicts the relevance score of each document independently.
- Ignores the relationships between documents.
Ex: Linear regression, Decision trees, Neural networks.
Pairwise LTR:
- Treats ranking as a binary classification problem.
- Learns to predict which of two documents is more relevant to a query.
- Focuses on the relative ordering of document pairs.
Ex: RankSVM, RankNet, LambdaRank.
Listwise LTR:
- Treats ranking as a list-level optimization problem.
- Directly optimizes a ranking metric, such as Normalized Discounted Cumulative Gain (NDCG) or Mean Average Precision (MAP).
- Considers the entire list of retrieved documents.
Ex: LambdaMART, ListNet, RankBoost.
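As a toy pointwise LTR sketch with scikit-learn: the features and relevance labels below are invented for illustration, and a real system would be trained on logged query-document judgments.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
# Features per (query, document) pair:
# [initial similarity score, keyword overlap, document length]
X_train = np.array([[0.91, 0.8, 500],
                    [0.75, 0.2, 1200],
                    [0.60, 0.9, 300],
                    [0.40, 0.1, 2000]])
y_train = np.array([3, 1, 2, 0])  # graded relevance labels
ltr_model = GradientBoostingRegressor().fit(X_train, y_train)
# At query time: score each candidate document and sort by predicted relevance
candidates = np.array([[0.85, 0.7, 450], [0.55, 0.3, 900]])
scores = ltr_model.predict(candidates)
ranking = np.argsort(-scores)
print(ranking, scores[ranking])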
Cross-Encoder Models for Re-ranking:
Cross-encoders are a type of neural network architecture that excels at capturing fine-grained relationships between two pieces of text, making them ideal for re-ranking in RAG.

Input Concatenation:
Unlike bi-encoders (used for initial embedding), cross-encoders take both the query and the document (or chunk) as a single, concatenated input.
For instance, if the query is “What are the benefits of vitamin D?” and a document chunk is “Vitamin D promotes calcium absorption,” the input to the cross-encoder might be: [CLS] What are the benefits of vitamin D? [SEP] Vitamin D promotes calcium absorption. [SEP]
[CLS] and [SEP] are special tokens used by BERT-based models to indicate the start and separation of sequences.
Transformer Encoding:
The concatenated input is fed into a Transformer-based model (like BERT).
The model processes the entire sequence, allowing it to attend to interactions between the query and the document.
This enables the model to understand the nuanced relationships between the two pieces of text.
Relevance Score Prediction:
The output of the Transformer is passed through a classification layer.
This layer produces a relevance score, indicating how well the document answers the query.
The score is typically a single value between 0 and 1 (or a similar range), representing the probability of relevance.
Re-ordering:
The retrieved documents are re-ordered based on their relevance scores.
The documents with the highest scores are placed at the top of the ranking.
Context Selection:
The top-ranked documents are selected as the final context for the LLM.
BERT-Based Cross-Encoder Models:
BERT-based models are a popular choice for cross-encoders due to their strong performance in natural language understanding tasks. Some examples include:
cross-encoder/ms-marco-TinyBERT-L-2-v2: A smaller, faster model, suitable for applications where speed is critical.
cross-encoder/ms-marco-MiniLM-L-12-v2: A more accurate model, providing a good balance between speed and performance.
cross-encoder/ms-marco-distilbert-base-tas-bilingual-en-de: A multilingual model that can rerank in both English and German.
Imagine your RAG chatbot has a knowledge base containing various articles, research papers, and documentation about transformer models and deep learning.
1. Initial Retrieval (Bi-encoder):
The user asks: “Benefits of self-attention in transformers”.
The query is embedded using a bi-encoder model (e.g., Sentence Transformers).
A vector database (e.g., ChromaDB with HNSW) is queried for the most similar documents. The initial retrieval returns a set of candidate documents, including:
- Document A: “Introduction to Transformer Architectures” (general overview)
- Document B: “Self-Attention Mechanism Explained” (detailed explanation)
- Document C: “Applications of Transformers in NLP” (applications)
- Document D: “Recurrent Neural Networks vs. Transformers” (comparison)
2. Cross-Encoder Re-ranking:
Model Selection: We use a pre-trained BERT-based cross-encoder model (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2).
Input Preparation: For each retrieved document, we create a pair of (query, document content). Example:
- Pair 1: (“benefits of self-attention in transformers”, “Introduction to Transformer Architectures”)
- Pair 2: (“benefits of self-attention in transformers”, “Self-Attention Mechanism Explained”)
- Pair 3: (“benefits of self-attention in transformers”, “Applications of Transformers in NLP”)
- Pair 4: (“benefits of self-attention in transformers”, “Recurrent Neural Networks vs. Transformers”)
Cross-Encoder Inference: Each pair is fed into the cross-encoder model. The model processes the query and document content together, generating a relevance score for each pair. Example scores:
- Pair 1: 0.6
- Pair 2: 0.95
- Pair 3: 0.7
- Pair 4: 0.4
Re-ordering: The documents are re-ordered based on their relevance scores:
- Document B: “Self-Attention Mechanism Explained” (0.95)
- Document C: “Applications of Transformers in NLP” (0.7)
- Document A: “Introduction to Transformer Architectures” (0.6)
- Document D: “Recurrent Neural Networks vs. Transformers” (0.4)
3. Context Selection: The top-ranked documents (e.g., Document B and C) are selected as the context for the LLM.
4. LLM Response Generation:
The selected context (Document B and C) is passed to the LLM along with the original query. The LLM generates a response based on the relevant context:
“Self-attention allows transformers to weigh the importance of different words in a sentence, improving their ability to understand context. This leads to better performance in tasks like machine translation and text summarization. Self-attention also allows for parallel processing of input sequences, making transformers more efficient than recurrent neural networks.”
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
# or another BERT-based cross-encoder

def create_vector_store(texts, embeddings):
    # ... build the Chroma vector store as shown earlier ...
    return vectordb.as_retriever(search_type="similarity_score_threshold",
                                 search_kwargs={"k": 20, "score_threshold": 0.0})

def rerank_documents(query, documents, cross_encoder):
    """Reranks documents using a cross-encoder model."""
    pairs = [(query, doc.page_content) for doc in documents]
    scores = cross_encoder.predict(pairs)
    # Sort documents by score in descending order
    ranked_documents = [doc for _, doc in sorted(zip(scores, documents),
                                                 key=lambda pair: pair[0],
                                                 reverse=True)]
    return ranked_documents

def process_query(query, retriever, cross_encoder, source_filter=None):
    if retriever is None:
        return "Please upload documents first."
    if source_filter:
        # Metadata filtering; how the filter is passed depends on the
        # vector store and retriever version you are using.
        results = retriever.get_relevant_documents(
            query, where={"source": source_filter})
    else:
        results = retriever.get_relevant_documents(query)
    # Re-rank the retrieved documents
    ranked_results = rerank_documents(query, results, cross_encoder)
    context = "\n".join([doc.page_content for doc in ranked_results[:10]])
    return context

EMBEDDING_MODEL = "llama3"
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
retriever = create_vector_store(texts, embeddings)

query = "What is the document about?"
# example with metadata filtering
source_filter = "transformers.pdf"
response = process_query(query, retriever, cross_encoder,
                         source_filter=source_filter)
print(response)

# example without metadata filtering
response = process_query(query, retriever, cross_encoder)
print(response)
Contextualized Late Interaction over BERT [ ColBERT ]
ColBERT (Contextualized Late Interaction over BERT) is a unique and powerful retrieval and re-ranking model that stands out for its efficiency and effectiveness, particularly in RAG systems.

ColBERT aims to address the limitations of traditional bi-encoders (which are fast but less accurate) and cross-encoders (which are accurate but slow). It achieves this by:
- Late Interaction: Instead of computing a single embedding for the entire document, ColBERT computes contextualized embeddings for each token in the document and the query.
- Fine-Grained Relevance: It performs a fine-grained relevance scoring by comparing these token-level embeddings, allowing for a more precise understanding of the relationship between the query and the document.
How ColBERT Re-ranking Works in RAG:
Token-Level Embeddings:
- Query: The user’s query is passed through a BERT-based model, and a contextualized embedding is generated for each token in the query.
- Documents: Each document (or chunk) in the retrieved set is also passed through the same BERT-based model, generating contextualized embeddings for each token in the document.
Late Interaction (Fine-Grained Scoring):
- For each query token embedding, ColBERT finds the most similar document token embedding.
- It calculates the maximum similarity score between the query token and any document token.
- These maximum similarity scores are then aggregated (summed) to produce a relevance score for the entire document.
- This process is called “late interaction” because the comparison of token embeddings occurs after the contextualized embeddings have been generated.
Re-ordering:
- The retrieved documents are re-ordered based on their ColBERT relevance scores.
- The documents with the highest scores are placed at the top of the ranking.
Context Selection:
- The top-ranked documents are selected as the final context for the LLM.
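The late-interaction (MaxSim) scoring itself is simple to express; here is a toy NumPy sketch in which random token embeddings stand in for real BERT token outputs:
import numpy as np
def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token, take its best-matching
    document token, then sum those maxima into a single document score."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(5, 128))    # e.g. the 5 query tokens
doc_b_tokens = rng.normal(size=(300, 128))  # token embeddings of one document
print(maxsim_score(query_tokens, doc_b_tokens))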
Imagine your RAG chatbot has a knowledge base containing various articles, research papers, and documentation about transformer models and deep learning.
1. Initial Retrieval (Bi-encoder):
The user asks: “Benefits of self-attention in transformers”. The query is embedded using a bi-encoder model (e.g., Sentence Transformers).
A vector database (e.g., FAISS) is queried for the most similar documents. The initial retrieval returns a set of candidate documents, including:
- Document A: “Introduction to Transformer Architectures” (general overview)
- Document B: “Self-Attention Mechanism Explained” (detailed explanation)
- Document C: “Applications of Transformers in NLP” (applications)
- Document D: “Recurrent Neural Networks vs. Transformers” (comparison)
2. ColBERT Re-ranking:
Token-Level Embeddings:
- ColBERT processes both the query and the documents at the token level using a BERT-based model.
- For the query “Benefits of self-attention in transformers,” ColBERT generates contextualized embeddings for each token: “Benefits,” “of,” “self-attention,” “in,” “transformers.”
- Similarly, each document is tokenized, and contextualized embeddings are created for each token within them.
Late Interaction:
- ColBERT performs a fine-grained comparison between the query token embeddings and the document token embeddings.
- For each query token embedding, it finds the most similar document token embedding.
- The maximum similarity scores are aggregated (summed) to produce a relevance score for the entire document.
- Example:
- For the query token “self-attention,” ColBERT finds the most similar token embeddings in each document.
- Document B, “Self-Attention Mechanism Explained,” will have many high similarity scores for this token, as well as for other query tokens.
- Document A, “Introduction to Transformer Architectures,” will have some matches but fewer high similarity scores.
- Document C, “Applications of Transformers in NLP,” will have matches, but will also have sections that do not match the query well.
- Document D, “Recurrent Neural Networks vs. Transformers,” will have the fewest high-similarity matches.
The sum of these maximum similarity scores determines the document’s relevance score.
Re-ordering: The retrieved documents are re-ordered based on their ColBERT relevance scores:
- Document B: “Self-Attention Mechanism Explained” (highest score)
- Document C: “Applications of Transformers in NLP” (second-highest score)
- Document A: “Introduction to Transformer Architectures” (lower score)
- Document D: “Recurrent Neural Networks vs. Transformers” (lowest score)
3. Context Selection: The top-ranked documents (e.g., Documents B and C) are selected as the context for the LLM.
4. LLM Response Generation:
The selected context (Documents B and C) is passed to the LLM along with the original query.
The LLM generates a response based on the relevant context:
“Self-attention allows transformers to weigh the importance of different words in a sentence, improving their ability to understand context. This leads to better performance in tasks like machine translation and text summarization. Self-attention also allows for parallel processing of input sequences, making transformers more efficient than recurrent neural networks.”
LLM-Based Re-ranking
LLM-based re-ranking leverages the reasoning and understanding capabilities of LLMs to refine the relevance of retrieved documents in a RAG system. Instead of relying solely on vector similarity or keyword matching, LLMs can assess the context and nuanced relationships between the query and the retrieved content.
How LLM-Based Re-ranking Works:
Initial Retrieval:
Similar to other re-ranking approaches, the RAG system first retrieves a set of candidate documents or chunks using methods like vector search or keyword matching.
Prompt Engineering:
A carefully crafted prompt is created to instruct the LLM on how to re-rank the documents. The prompt typically includes:
- The user’s query.
- A list of the retrieved documents or chunks.
- Instructions on how to assess relevance (e.g., “Rank the following documents based on their relevance to the query”).
- Optionally, instructions on the format of the output (e.g., a ranked list or a score for each document).
LLM Inference:
- The prompt is sent to the LLM, which processes the information and generates a response.
- The response contains the re-ranked list of documents or relevance scores.
Context Selection:
- The documents are re-ordered based on the LLM’s output.
- The top-ranked documents are selected as the final context for the RAG system.
Types of LLM Re-rankers:
1. Direct Ranking:
- The LLM is asked to directly rank the retrieved documents.
- The prompt might ask the LLM to provide a ranked list of document IDs or titles.
- Ex: “Rank the following documents from most to least relevant to the query: [query]. Documents: [document1], [document2], [document3].”
2. Score-Based Ranking:
- The LLM is asked to assign a relevance score to each document.
- The prompt might ask the LLM to provide a numerical score or a qualitative assessment (e.g., “highly relevant,” “somewhat relevant”).
- Ex: “For each document, provide a relevance score (1–5) based on its relevance to the query: [query]. Documents: [document1], [document2], [document3].” (See the prompt-and-parsing sketch after this list.)
3. Reasoning-Based Ranking:
- The LLM is asked to explain its reasoning for ranking the documents.
- This can provide insights into the LLM’s decision-making process and improve the quality of the re-ranking.
- Ex: “Rank the following documents and explain your reasoning for each ranking: [query]. Documents: [document1], [document2], [document3].”
4. Multi-Criteria Ranking:
- The prompt includes multiple criteria for ranking the documents, such as relevance, accuracy, and recency.
- This allows for more nuanced and comprehensive re-ranking.
- Ex: “Rank the following documents based on relevance, accuracy, and recency: [query]. Documents: [document1], [document2], [document3].”
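As a small illustration of the score-based variant, the sketch below builds the prompt and parses the scores from the LLM’s reply; the prompt wording, the expected output format, and the parsing logic are all assumptions for this example.
import re

def build_scoring_prompt(query, documents):
    """Score-based re-ranking prompt: ask for one 1-5 relevance score per document."""
    doc_list = "\n".join(f"{i + 1}. {doc}" for i, doc in enumerate(documents))
    return (
        f"For each document, provide a relevance score (1-5) based on its relevance "
        f"to the query: '{query}'.\nDocuments:\n{doc_list}\n"
        "Answer with one line per document in the form '<document number>: <score>'."
    )

def rerank_by_scores(llm_output, documents):
    """Parse '<number>: <score>' lines and re-order documents by score (missing = 0)."""
    scores = {int(m.group(1)): int(m.group(2))
              for m in re.finditer(r"(\d+)\s*:\s*([1-5])", llm_output)}
    return sorted(documents,
                  key=lambda doc: scores.get(documents.index(doc) + 1, 0),
                  reverse=True)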
Ex: A RAG chatbot is used to answer questions about a company’s internal knowledge base.
Query: “What are the company’s policies on remote work?”
Initial Retrieval: The RAG system retrieves three documents:
- Document 1: “Company Benefits Overview”
- Document 2: “Remote Work Policy”
- Document 3: “Office Equipment Guidelines”
LLM-Based Re-ranking (Direct Ranking):
- Prompt: “Rank the following documents from most to least relevant to the query: ‘What are the company’s policies on remote work?’ Documents: ‘Company Benefits Overview’, ‘Remote Work Policy’, ‘Office Equipment Guidelines’.”
- LLM Response: “1. ‘Remote Work Policy’, 2. ‘Company Benefits Overview’, 3. ‘Office Equipment Guidelines’.”
Context Selection: The RAG system selects “Remote Work Policy” and “Company Benefits Overview” as the context for the LLM. A minimal sketch of this direct-ranking step follows below.
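This sketch uses the OpenAI Python client, but any chat-capable LLM would work; the model name is illustrative, and the client assumes an OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

query = "What are the company's policies on remote work?"
documents = [
    "Company Benefits Overview",
    "Remote Work Policy",
    "Office Equipment Guidelines",
]

# Direct-ranking prompt: ask the model to order the titles by relevance to the query.
prompt = (
    f"Rank the following documents from most to least relevant to the query: '{query}'. "
    "Return a numbered list of titles only.\n"
    + "\n".join(f"- {title}" for title in documents)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the ranking deterministic
)
print(response.choices[0].message.content)
# Expected shape: 1. Remote Work Policy, 2. Company Benefits Overview, 3. Office Equipment Guidelines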
Reciprocal Rank Fusion
Reciprocal Rank Fusion (RRF) is a technique used to combine the results of multiple retrieval methods, such as keyword search and semantic search, into a single, unified ranking. It aims to leverage the strengths of each method while mitigating their weaknesses.
How Reciprocal Rank Fusion Works:
Individual Rankings: Each retrieval method (e.g., keyword search, semantic search) produces its own ranked list of documents.
Reciprocal Rank Calculation: For each document in each ranked list, the reciprocal rank (RR) is calculated.
- The reciprocal rank is defined as: RR = 1 / (k + rank), where k is a constant (typically a small number, like 60) and rank is the position of the document in the ranked list.
Fusion and Aggregation:
- The reciprocal ranks for each document are summed across all ranked lists.
- Documents with higher summed reciprocal ranks are considered more relevant.
Final Ranking:
- The documents are re-ranked based on their summed reciprocal ranks, producing the final, fused ranking.
Ex: Consider the user query “How do neural networks learn?”
Let’s assume we have two retrieval methods:
- Keyword Search (KS): Ranks documents based on keyword matches.
- Semantic Search (SS): Ranks documents based on semantic similarity.
Retrieved Documents and Ranks:
KS Results:
- Document A: “Neural Network Training Methods” (Rank 1)
- Document B: “Learning Algorithms in Neural Networks” (Rank 2)
- Document C: “History of Neural Networks” (Rank 3)
- Document D: “Machine Learning Concepts” (Rank 4)
SS Results:
- Document E: “Backpropagation Explained” (Rank 1)
- Document B: “Learning Algorithms in Neural Networks” (Rank 2)
- Document F: “Gradient Descent for Deep Learning” (Rank 3)
- Document D: “Machine Learning Concepts” (Rank 4)
Reciprocal Rank Calculation (with k = 60):
KS:
- Document A: 1 / (60 + 1) = 0.01639
- Document B: 1 / (60 + 2) = 0.01613
- Document C: 1 / (60 + 3) = 0.01587
- Document D: 1 / (60 + 4) = 0.01562
SS:
- Document E: 1 / (60 + 1) = 0.01639
- Document B: 1 / (60 + 2) = 0.01613
- Document F: 1 / (60 + 3) = 0.01587
- Document D: 1 / (60 + 4) = 0.01562
Fusion and Aggregation:
- Document A: 0.01639 (KS) + 0 (SS) = 0.01639
- Document B: 0.01613 (KS) + 0.01613 (SS) = 0.03226
- Document C: 0.01587 (KS) + 0 (SS) = 0.01587
- Document D: 0.01562 (KS) + 0.01562 (SS) = 0.03124
- Document E: 0 (KS) + 0.01639 (SS) = 0.01639
- Document F: 0 (KS) + 0.01587 (SS) = 0.01587
Final Ranking:
- Document B (0.03226)
- Document D (0.03124)
- Document A (0.01639)
- Document E (0.01639)
- Document F (0.01587)
- Document C (0.01587)
Conclusion
- Document B, which appeared in the top ranks of both KS and SS, is ranked highest.
- Document D, which appeared in the lower ranks of both KS and SS, is ranked second.
- Documents A and E, which appeared in the top rank of only one of the methods, are ranked lower.
- This demonstrates how RRF effectively combines the strengths of both retrieval methods; a compact implementation sketch follows below.
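Here is a minimal sketch of RRF that reproduces the ranking above; the document IDs and the two input rankings are taken straight from the worked example.
def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) for each document across all ranked lists."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score means more relevant.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

keyword_results = ["A", "B", "C", "D"]   # keyword-search (KS) ranking
semantic_results = ["E", "B", "F", "D"]  # semantic-search (SS) ranking

for doc_id, score in reciprocal_rank_fusion([keyword_results, semantic_results]):
    print(doc_id, round(score, 5))
# B and D lead the fused ranking; A/E and C/F form tied pairs, so their relative order is arbitrary.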
I appreciate you taking the time to read this article. I’m passionate about exploring the exciting world of RAG and LLMs, and I’d love to hear your thoughts! Leave your suggestions and questions in the comments.
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.