Evaluating Retrieval & Generation Pipelines

Author(s): Nilesh Raghuvanshi

Originally published on Towards AI.

Improving Retrieval Augmented Generation (RAG) Systematically

Evaluating the pipeline (AI-generated image)

Introduction

This is the third and final article in a short series on systematically improving retrieval-augmented generation (RAG). In the first article, we evaluated the performance of multiple embedding models on a domain-specific dataset and selected the best one. In the second, we fine-tuned an embedding model on domain-specific data and compared its performance against the top models from the first step. In this article, we evaluate the retrieval and generation pipelines themselves, which will help you determine the optimal RAG pipeline for your application.

Evaluating the Retrieval Pipeline

Now that we've found the optimal embedding model for our use case, the next step is to evaluate the retrieval pipeline itself. Evaluation lets us compare the top embedding models across various dimensions, potentially considering multiple values of k for nearest-neighbor retrieval. We might also explore different search options, such as vector search, full-text search, or hybrid search. For each, we may conduct a broad search across the entire corpus, a focused search on specific topics based on metadata extracted from the user query, or a combination of both approaches. Additionally, it's worth evaluating the pipeline with and without re-ranking. As you can see, we quickly end up with numerous combinations of components in our retrieval pipeline, so it's crucial to compare them and determine the optimal configuration for your application.
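
To get a sense of how quickly the combinations multiply, here is a minimal sketch that enumerates a hypothetical configuration space; the option names and values are illustrative, not the exact ones we used:

```python
from itertools import product

# Hypothetical retrieval options -- adjust to your own stack.
embedding_models = ["baseline-model", "fine-tuned-model"]
search_types = ["vector", "full_text", "hybrid"]
search_scopes = ["wide", "narrow", "wide+narrow"]
top_k_values = [5, 10, 20]
rerank_options = [True, False]

retrieval_configs = [
    {
        "embedding_model": model,
        "search_type": search,
        "search_scope": scope,
        "top_k": k,
        "rerank": rerank,
    }
    for model, search, scope, k, rerank in product(
        embedding_models, search_types, search_scopes, top_k_values, rerank_options
    )
]

print(f"{len(retrieval_configs)} retrieval configurations to evaluate")  # 108 here
```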

Create a Golden Dataset

The first step is to create a "golden dataset" comprising queries, relevant context (chunks or documents from the corpus), and ground truth answers. We started with about 80–100 samples, manually curating the dataset to ensure it closely represents real-world scenarios. This range was chosen to balance feasibility and diversity, providing enough examples to cover key use cases without becoming overly time-consuming. To achieve this, we engaged domain experts and product owners to create the dataset. You can leverage off-the-shelf labeling tools to streamline this process; Argilla is an excellent choice!
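
As an illustration, the golden dataset can be as simple as a JSONL file in which each record carries a stable ID, the query, the supporting context, and the ground truth answer. The field names and the example query below are hypothetical:

```python
import json
import uuid

# Hypothetical golden-dataset records; in practice these are curated by
# domain experts and product owners rather than written inline like this.
golden_samples = [
    {
        "query_id": str(uuid.uuid4()),
        "query": "How do I pause a simulation from SimTalk?",
        "relevant_context": ["<chunk or document ID from the corpus>"],
        "ground_truth_answer": "<expert-written reference answer>",
    },
]

with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in golden_samples:
        f.write(json.dumps(sample) + "\n")
```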

Configure and Run Retrieval

Next, we need a way to configure multiple retrieval options and execute the retrieval process for each combination independently from the main application. We built a script for this purpose, which creates a separate results dataset for each retrieval configuration while maintaining a trace back to the original golden dataset by including unique identifiers for each query and context pair. This lets us easily compare results across different runs and configurations, ensuring traceability and consistency in the evaluation process. The new dataset logs the top-k documents for each query in the golden dataset for a given retrieval pipeline configuration. Logging results as a dataset in a tool like Argilla can also be beneficial, as it allows for human review and labeling of the results in terms of relevance and ranking, with built-in automation to help scale up human labeling efforts.
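
A minimal sketch of such a script is shown below. The `run_retrieval` function is a placeholder for whatever search backend you use (vector, full-text, or hybrid), and the file layout is illustrative:

```python
import json

def run_retrieval(query: str, config: dict) -> list[dict]:
    """Placeholder for the actual search call against your index."""
    raise NotImplementedError

def evaluate_config(config: dict, golden_path: str, results_path: str) -> None:
    with open(golden_path, encoding="utf-8") as f:
        golden = [json.loads(line) for line in f]

    with open(results_path, "w", encoding="utf-8") as out:
        for sample in golden:
            hits = run_retrieval(sample["query"], config)[: config["top_k"]]
            out.write(json.dumps({
                "query_id": sample["query_id"],   # trace back to the golden dataset
                "query": sample["query"],
                "config": config,
                "retrieved": [
                    {"doc_id": h["doc_id"], "score": h["score"], "rank": i + 1}
                    for i, h in enumerate(hits)
                ],
            }) + "\n")

# One results file per configuration, named after its key options.
for config in retrieval_configs:  # from the enumeration sketch above
    name = f"retrieval_{config['embedding_model']}_{config['search_type']}_k{config['top_k']}"
    evaluate_config(config, "golden_dataset.jsonl", f"{name}.jsonl")
```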

Leverage LLMs for Evaluation

Since human review isn't scalable beyond a certain point, we use a large language model (LLM) to help evaluate retrieval quality. You can use a top LLM like GPT-4 or specialized models like Prometheus for this task.

We keep the LLM's job as simple as possible to ensure higher reliability. Given a query, ground truth answer, and retrieved context, we ask the LLM to answer a straightforward question: Does the retrieved context contain relevant information to answer the query as per the ground truth answer? The LLM responds with "relevant" or "unrelated," along with an explanation for its decision. Again, these results can be logged in Argilla for human review to study how LLM judgment correlates with human evaluation.
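
A sketch of the judge call, here using the OpenAI Python client; the prompt wording and model choice are illustrative rather than the exact ones we used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a retrieval system.
Query: {query}
Ground truth answer: {ground_truth}
Retrieved context: {context}

Does the retrieved context contain the information needed to answer the query
as per the ground truth answer? Reply with a JSON object:
{{"verdict": "relevant" | "unrelated", "explanation": "<one or two sentences>"}}"""

def judge_retrieval(query: str, ground_truth: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",   # any capable judge model works here
        temperature=0,    # deterministic judgments are easier to audit
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, ground_truth=ground_truth, context=context
            ),
        }],
    )
    return response.choices[0].message.content
```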

Visualize the Results

Finally, it's essential to visualize the results to compare configurations and determine the most effective retrieval pipeline. Even a basic visualization helps in understanding the effectiveness of each combination and, when you are dealing with numerous options, makes it easier to decide on the final setup.

Evaluating retrieval pipeline (Image by author)
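
A chart like the one above takes only a few lines of matplotlib. A minimal sketch, with made-up numbers purely for illustration:

```python
import matplotlib.pyplot as plt

# Made-up results: fraction of queries whose retrieved context was judged "relevant".
configs = [
    "baseline + rerank",
    "baseline, no rerank",
    "fine-tuned + rerank",
    "fine-tuned, no rerank",
]
relevant_rate = [0.71, 0.68, 0.78, 0.80]

plt.figure(figsize=(8, 4))
plt.bar(configs, relevant_rate)
plt.ylabel("Share of queries with relevant context")
plt.ylim(0, 1)
plt.xticks(rotation=20, ha="right")
plt.title("Retrieval pipeline comparison")
plt.tight_layout()
plt.savefig("retrieval_eval.png")
```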

When we rolled out the first prototype of the project, we used BAAI/bge-large-en-v1.5 as the embedding model. Our retrieval pipeline was based on vector search with a combination of wide search and narrow search. In the end, we combined the results and used an off-the-shelf re-ranking model to re-rank and pick the top-n documents. As you might have guessed, this added a lot of complexity and contributed to latency.

Looking at the evaluation results, we found that the fine-tuned model with just a wide search and no re-ranking provided the best accuracy. This was a significant insight, as we not only reduced our embedding size by half but also simplified our retrieval pipeline, resulting in substantial savings on storage, memory, latency, and cost, along with an 8% improvement in retrieval performance.

Evaluating the Generation Pipeline

Now that we've identified the most effective retrieval pipeline, our next focus is on evaluating the generation pipeline. Evaluating the generation process typically involves testing different large language models (LLMs) with various hyperparameters, such as temperature and maximum tokens. Additionally, you may experiment with multiple system prompts. The number of chunks included in the context can significantly impact the results. You may also want to include post-processing steps, such as applying guardrails or validation checks, before presenting the final response to the user.

The evaluation process begins by selecting the top two or three retrieval pipelines from the previous evaluation phase. We then build upon these by adding variations in the generation pipeline. Keeping organized logs of results at every stage is beneficial, as these form the foundation for subsequent evaluations. For instance, the results of the retrieval pipeline evaluation become the starting point for evaluating the generation pipeline. Similar to the retrieval evaluation, we need a way to configure and run each pipeline variation while consistently logging all results, such as the AI-generated answers, for each iteration.
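
In code, this mirrors the retrieval script: iterate over the shortlisted retrieval results and a set of generation variants, call the LLM, and log each AI-generated answer against the query ID. The sketch below assumes the retrieval results format from earlier; `generate_answer` and the variant fields are hypothetical:

```python
import json
from itertools import product

# Hypothetical generation variants layered on top of the top retrieval pipelines.
generation_variants = [
    {"model": m, "temperature": t, "num_chunks": c}
    for m, t, c in product(["gpt-4o", "claude-3-5-sonnet"], [0.0, 0.3], [3, 5])
]

def generate_answer(query: str, chunk_ids: list[str], variant: dict) -> str:
    """Placeholder for the real LLM call with your system prompt and guardrails.
    In practice you would look up the chunk text by ID before building the prompt."""
    raise NotImplementedError

def run_generation(retrieval_results_path: str, variant: dict, out_path: str) -> None:
    with open(retrieval_results_path, encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            record = json.loads(line)
            chunk_ids = [d["doc_id"] for d in record["retrieved"][: variant["num_chunks"]]]
            answer = generate_answer(record["query"], chunk_ids, variant)
            out.write(json.dumps({
                "query_id": record["query_id"],   # keeps the trace to the golden dataset
                "retrieval_config": record["config"],
                "generation_variant": variant,
                "ai_answer": answer,
            }) + "\n")
```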

After generating results for each pipeline variation, the next step is evaluation. This can be performed through human judgment or using an LLM-based approach. In our case, we focused on determining whether the AI-generated response was correct in substance, meaning that it aligned with the ground truth rather than being an exact match. For example, we checked whether the response used the correct method and parameters in SimTalk. The LLM-based judge was instructed to classify each answer as either 'correct' or 'incorrect' and provide an explanation for its decision.
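
The judge call mirrors the retrieval judge above; only the prompt changes. The wording here is illustrative:

```python
CORRECTNESS_PROMPT = """You are grading an AI-generated answer.
Query: {query}
Ground truth answer: {ground_truth}
AI answer: {ai_answer}

Is the AI answer correct in substance, i.e. does it use the same approach and
parameters as the ground truth even if the wording differs? Reply with a JSON
object: {{"verdict": "correct" | "incorrect", "explanation": "<one or two sentences>"}}"""

def judge_answer(query: str, ground_truth: str, ai_answer: str) -> str:
    # Reuses the same OpenAI `client` and call pattern as the retrieval judge.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": CORRECTNESS_PROMPT.format(
                query=query, ground_truth=ground_truth, ai_answer=ai_answer
            ),
        }],
    )
    return response.choices[0].message.content
```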

Visualization of Generation Pipeline Evaluation

The charts below show a rather straightforward generation pipeline where we evaluated the performance of GPT-4o and Claude 3.5 Sonnet across four different retrieval pipelines. This setup allowed us to determine not only the best retrieval pipeline but also the optimal generation configuration for our specific use case. Please note that LLM judgment can't be trusted blindly; it is important to establish its correlation with human judgment, at least for a small subset of the evaluation dataset.

Evaluating generation pipeline (Image by author)
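
A simple way to establish that correlation is to have humans label a small subset of the judged answers and measure agreement with the LLM judge, for example with raw agreement and Cohen's kappa. A sketch, assuming both sets of labels cover the same queries in the same order:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for the same subset of queries, in the same order.
human_labels = ["correct", "incorrect", "correct", "correct", "incorrect"]
llm_labels   = ["correct", "incorrect", "correct", "incorrect", "incorrect"]

agreement = sum(h == l for h, l in zip(human_labels, llm_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Raw agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```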

Conclusion

By systematically evaluating both the retrieval and generation pipelines, you can create a robust retrieval-augmented generation system that is accurate and responsive to nuanced information needs. Evaluation helps refine your understanding of what works best, providing insights such as which model configurations are most effective for specific tasks. These iterative improvements help you understand where to focus and will eventually get you closer to achieving your business objectives.
