


Evaluating Retrieval & Generation Pipelines

Author(s): Nilesh Raghuvanshi

Originally published on Towards AI.

Improving Retrieval Augmented Generation (RAG) Systematically

Evaluating the pipeline (AI-generated image)

Introduction

This is the third and final article in a short series on systematically improving retrieval-augmented generation (RAG). In the earlier articles, we evaluated the performance of multiple embedding models on a domain-specific dataset and selected the best one, then fine-tuned an embedding model on domain-specific data and compared its performance against the top models from the first step. In this article, we evaluate the performance of the retrieval and generation pipelines, which will help you determine the optimal RAG pipeline for your application.

Evaluating the Retrieval Pipeline

Now that we’ve found the optimal embedding model for our use case, the next step is to evaluate the retrieval pipeline itself. Evaluation lets us compare the top embedding models across several dimensions, potentially considering multiple values of k for nearest-neighbor search. We might also explore different search options, such as vector search, full-text search, or hybrid search. For each, we may conduct a broad search across the entire corpus, a focused search on specific topics based on metadata extracted from the user query, or a combination of both approaches. It is also worth evaluating the pipeline with and without re-ranking. As you can see, we quickly end up with numerous combinations of components in the retrieval pipeline, so it is crucial to compare them and determine the optimal configuration for your application.
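To make the size of this search space concrete, here is a minimal sketch of enumerating a retrieval configuration grid. The component names and values are illustrative placeholders, not the exact options we used.

```python
# Enumerate a hypothetical grid of retrieval configurations.
from itertools import product

embedding_models = ["baseline-embedding", "fine-tuned-embedding"]  # illustrative labels
search_types = ["vector", "full_text", "hybrid"]
scopes = ["wide", "narrow"]          # whole corpus vs. metadata-filtered search
top_k_values = [5, 10, 20]
rerank_options = [False, True]

configurations = [
    {
        "embedding_model": model,
        "search_type": search,
        "scope": scope,
        "top_k": k,
        "rerank": rerank,
    }
    for model, search, scope, k, rerank in product(
        embedding_models, search_types, scopes, top_k_values, rerank_options
    )
]

print(f"{len(configurations)} retrieval configurations to evaluate")
```

Even this modest grid already yields 72 candidate pipelines, which is why a systematic evaluation loop pays off quickly.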

Create a Golden Dataset

The first step is to create a “golden dataset” comprising queries, relevant context (chunks or documents from the corpus), and ground truth answers. We started with about 80–100 samples, manually curating the dataset to ensure it closely represents real-world scenarios. This range was chosen to balance feasibility and diversity, providing enough examples to cover key use cases without becoming overly time-consuming. To achieve this, we engaged domain experts and product owners to create this dataset. You can leverage off-the-shelf labeling tools to streamline this process; Argilla is an excellent choice!
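As a rough sketch, a golden dataset can be as simple as a JSONL file where each record carries a unique ID, the query, the IDs of the context judged relevant by experts, and the ground truth answer. The field names and sample content below are illustrative assumptions, not the article's actual schema.

```python
# A minimal, hypothetical golden-dataset layout stored as JSONL.
import json
import uuid

golden_samples = [
    {
        "query_id": str(uuid.uuid4()),
        "query": "How do I schedule a recurring task?",        # example query
        "relevant_context_ids": ["doc_123#chunk_4"],           # chunks judged relevant by experts
        "ground_truth_answer": "Use the scheduler with a cron-style expression ...",
    },
    # ... 80-100 expert-curated samples in total
]

with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in golden_samples:
        f.write(json.dumps(sample) + "\n")
```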

Configure and Run Retrieval

Next, we need a way to configure multiple retrieval options and execute the retrieval process for each combination independently from the main application. We built a script for this purpose, which creates a separate results dataset for each retrieval configuration while maintaining a trace back to the original golden dataset by including unique identifiers for each query and context pair. This allows us to easily compare results across different runs and configurations, ensuring traceability and consistency in the evaluation process. The new dataset logs the top k documents for each query in the golden dataset for a given retrieval pipeline configuration. Logging results as a dataset in a tool like Argilla can also be beneficial, as it allows for human review and labeling of the results in terms of relevance and ranking, with built-in automation to help scale up human labeling efforts.
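The shape of such a script can be quite simple. The sketch below assumes the golden dataset layout from above and a hypothetical retrieve() function standing in for your actual vector, full-text, or hybrid search call; it writes one results file per configuration, keyed back to the golden queries.

```python
# Run one retrieval configuration over the golden dataset and log top-k results.
import json


def retrieve(query: str, config: dict) -> list[dict]:
    """Hypothetical retrieval call returning [{'doc_id': ..., 'score': ...}, ...]."""
    raise NotImplementedError


def run_configuration(config: dict, golden_path: str, out_path: str) -> None:
    with open(golden_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            sample = json.loads(line)
            hits = retrieve(sample["query"], config)[: config["top_k"]]
            f_out.write(json.dumps({
                "query_id": sample["query_id"],   # traceability back to the golden dataset
                "config": config,
                "retrieved": hits,
            }) + "\n")
```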

Leverage LLMs for Evaluation

Since human review isn’t scalable beyond a certain point, we use a large language model (LLM) to help evaluate retrieval quality. You can use a top LLM like GPT-4 or specialized models like Prometheus for this task.

We keep the LLM’s job as simple as possible to ensure higher reliability. Given a query, ground truth answer, and retrieved context, we ask the LLM to answer a straightforward question: does the retrieved context contain relevant information to answer the query as per the ground truth answer? The LLM responds with “relevant” or “unrelated,” along with an explanation for its decision. Again, these results can be logged in Argilla for human review to study how LLM judgment correlates with human evaluation.
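A minimal sketch of this judge step is shown below, assuming the OpenAI Python client; the prompt wording and model choice are illustrative, not the exact ones we used.

```python
# LLM-as-judge for retrieval relevance (illustrative prompt and model).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Given a query, a ground truth answer, and a retrieved context,
decide whether the context contains relevant information to answer the query
as per the ground truth answer. Respond with a single word, "relevant" or
"unrelated", followed by a one-sentence explanation.

Query: {query}
Ground truth answer: {ground_truth}
Retrieved context: {context}
"""


def judge_relevance(query: str, ground_truth: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any strong judge model could be used here
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, ground_truth=ground_truth, context=context)}],
    )
    return response.choices[0].message.content
```

Keeping the output to a single label plus a short explanation makes it easy to parse the verdicts and to spot-check the explanations during human review.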

Visualize the Results

Finally, it’s essential to visualize the results to compare configurations and determine the most effective retrieval pipeline. Even a basic visualization helps in understanding the effectiveness of each combination and facilitates better decision-making for selecting the final setup if you are dealing with numerous options.
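Something as simple as a bar chart per configuration is often enough. The sketch below uses matplotlib with placeholder configuration names and numbers, not the article's actual results.

```python
# Compare retrieval configurations with a simple bar chart (placeholder data).
import matplotlib.pyplot as plt

configs = ["vector + rerank", "vector only", "hybrid", "fine-tuned, wide search"]
relevant_fraction = [0.71, 0.68, 0.74, 0.79]  # share of queries judged "relevant"

plt.figure(figsize=(8, 4))
plt.bar(configs, relevant_fraction)
plt.ylabel("Fraction of queries with relevant context")
plt.title("Retrieval pipeline comparison")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.savefig("retrieval_evaluation.png")
```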

Evaluating retrieval pipeline (Image by author)

When we rolled out the first prototype of the project, we used BAAI/bge-large-en-v1.5 as the embedding model. Our retrieval pipeline was based on vector search with a combination of wide search and narrow search. In the end, we combined the results and used an off-the-shelf re-ranking model to re-rank and pick the top n documents. As you might have guessed, this added a lot of complexity and contributed to latency.

Looking at the evaluation results, we found that the fine-tuned model with just a wide search and no re-ranking provided the best accuracy. This was a significant insight, as we not only reduced our embedding size by half but also simplified our retrieval pipeline, resulting in substantial savings on storage, memory, latency, and cost, along with an 8% improvement in retrieval performance.

Evaluating the Generation Pipeline

Now that we’ve identified the most effective retrieval pipeline, our next focus is on evaluating the generation pipeline. Evaluating the generation process typically involves testing different large language models (LLMs) with various hyperparameters, such as temperature and maximum tokens. Additionally, you may experiment with multiple system prompts, and the number of chunks included in the context can significantly impact the results. You may also want to include post-processing steps, such as applying guardrails or validation checks, before presenting the final response to the user.

The evaluation process begins by selecting the top two or three retrieval pipelines from the previous evaluation phase. We then build upon these by adding variations in the generation pipeline. Keeping organized logs of results at every stage is beneficial, as these form the foundation for subsequent evaluations. For instance, the results of the retrieval pipeline evaluation become the starting point for evaluating the generation pipeline. Similar to the retrieval evaluation, we need a way to configure and run each pipeline variation while consistently logging all results, such as the AI-generated answers, for each iteration.
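A minimal sketch of such a sweep is shown below; the model names, temperatures, and prompt labels are illustrative assumptions layered on top of the shortlisted retrieval pipelines.

```python
# Enumerate generation-pipeline variations on top of the shortlisted retrieval pipelines.
from itertools import product

retrieval_pipelines = ["fine-tuned_wide_k5", "fine-tuned_wide_k10"]  # top pipelines from the previous phase
llms = ["gpt-4o", "claude-3-5-sonnet"]
temperatures = [0.0, 0.3]
system_prompts = {"concise": "Answer briefly ...", "detailed": "Answer step by step ..."}

generation_runs = []
for retrieval, llm, temp, prompt_name in product(
    retrieval_pipelines, llms, temperatures, system_prompts
):
    generation_runs.append({
        "retrieval_pipeline": retrieval,
        "llm": llm,
        "temperature": temp,
        "system_prompt": prompt_name,
    })

# Each run would generate answers for every golden query and log them
# alongside the query_id for later judging.
```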

After generating results for each pipeline variation, the next step is evaluation. This can be performed through human judgment or using an LLM-based approach. In our case, we focused on determining whether the AI-generated response was correct in substance, meaning that it aligned with the ground truth rather than being an exact match. For example, we checked whether the response used the correct method and parameters in SimTalk. The LLM-based judge was instructed to classify each answer as either “correct” or “incorrect” and provide an explanation for its decision.
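The correctness judge mirrors the relevance judge from the retrieval evaluation; the sketch below again assumes the OpenAI client and an illustrative prompt.

```python
# LLM-as-judge for answer correctness (illustrative prompt and model).
from openai import OpenAI

client = OpenAI()

CORRECTNESS_PROMPT = """Compare the AI-generated answer to the ground truth answer.
Judge the answer "correct" if it is right in substance (same method, parameters,
and conclusion), even if the wording differs; otherwise judge it "incorrect".
Respond with "correct" or "incorrect" and a one-sentence explanation.

Query: {query}
Ground truth answer: {ground_truth}
AI-generated answer: {answer}
"""


def judge_correctness(query: str, ground_truth: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": CORRECTNESS_PROMPT.format(
            query=query, ground_truth=ground_truth, answer=answer)}],
    )
    return response.choices[0].message.content
```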

Visualization of Generation Pipeline Evaluation

The charts below show a rather straightforward generation pipeline where we evaluated the performance of GPT-4o and Claude 3.5 Sonnet across four different retrieval pipelines. This setup allowed us to determine not only the best retrieval pipeline but also the optimal generation configuration for our specific use case. Please note that LLM judgment can't be trusted blindly; it is important to establish its correlation with human judgment, at least for a small subset of the evaluation dataset.

Evaluating generation pipeline (Image by author)

Conclusion

By systematically evaluating both the retrieval and generation pipelines, you can create a robust retrieval-augmented generation system that is accurate and responsive to nuanced information needs. Evaluation helps refine your understanding of what works best, providing insights such as which model configurations are most effective for specific tasks. These iterative improvements help you understand where to focus and will eventually get you closer to achieving your business objectives.


Published via Towards AI
