Choosing the Best Embedding Model For Your RAG Pipeline
Author(s): Nilesh Raghuvanshi
Originally published on Towards AI.
Improving Retrieval Augmented Generation (RAG) Systematically
Introduction
Through my experience building an extractive question-answering system using Google's QANet and BERT back in 2018, I quickly realized the significant impact that high-quality retrieval has on the overall performance of the system. With the advent of generative models (LLMs), the importance of effective retrieval has only grown. Generative models are prone to "hallucination", meaning they can produce incorrect or misleading information if they lack the correct context or are fed noisy data.
Simply put, the retrieval component (the "R" in RAG) is the backbone of Retrieval Augmented Generation. However, it is also one of the most challenging aspects to get right. Achieving high-quality retrieval requires constant iteration and refinement.
To improve your retrieval, it's essential to focus on the individual components within your retrieval pipeline. Moreover, having a clear methodology for evaluating their performance, both individually and as part of the larger system, is key to driving improvements.
This series is not intended to be an exhaustive guide on improving RAG-based applications, but rather a reflection on key insights I've gained, such as the importance of iterative evaluation and the role of high-quality retrieval, while working on real-world projects. I hope these insights resonate with you and provide valuable perspectives for your own RAG endeavors.
Case Study: Code Generation for SimTalk
The project aimed to generate code for a proprietary programming language called SimTalk. SimTalk is the scripting language used in Siemens' Tecnomatix Plant Simulation software, a tool designed for modeling, simulating, and optimizing manufacturing systems and processes. By utilizing SimTalk, users can customize and extend the behavior of standard simulation objects, enabling the creation of more realistic and complex system models.
Since SimTalk is unfamiliar to LLMs due to its proprietary nature and limited training data, the out-of-the-box code generation quality is quite poor compared to more popular programming languages like Python, which have extensive publicly available datasets and broader community support. However, when provided with the right context through a well-augmented prompt (including relevant code examples, detailed descriptions of SimTalk functions, and explanations of expected behavior), the generated code quality becomes acceptable and useful, even if not perfect. This significantly enhances user productivity, which aligns well with our business objectives.
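To make this concrete, here is a rough sketch of how such an augmented prompt could be assembled. The article does not show our actual prompt, so the template wording and the build_prompt helper below are purely illustrative.

```python
# Illustrative sketch only: one way an augmented code-generation prompt could
# be assembled. `retrieved_chunks` stands for the documentation passages that
# the retrieval step returns for the user's request.
def build_prompt(user_request: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(retrieved_chunks)
    return (
        "You write SimTalk code for Tecnomatix Plant Simulation.\n"
        "Use ONLY the reference material below; if it is insufficient, say so.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Task:\n{user_request}\n\n"
        "Return SimTalk code followed by a short explanation."
    )
```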
Our only knowledge source is high-quality documentation of SimTalk, consisting of approximately 10,000 pages, covering detailed explanations of language syntax, functions, use cases, and best practices, along with some code snippets. This comprehensive documentation serves as the foundational knowledge base for code generation by providing the LLM with the necessary context to understand and generate SimTalk code.
There are several critical components in our pipeline, each designed to provide the LLM with precise context. For instance, we use query rewriting techniques such as expansion, relaxation, and segmentation, and we extract metadata from queries to dynamically build filters for more targeted searches. Rather than diving into these project-specific components, I will focus on the general aspects that apply to any RAG-based project. In this series, we will cover:
- How to evaluate the performance of multiple embedding models on your custom domain data
- How to fine-tune an embedding model on your custom domain data
- How to evaluate the retrieval pipeline
- How to evaluate the generation pipeline
In general, the goal is to make data-driven decisions based on evaluation results, such as precision, recall, and relevance metrics, to optimize your RAG applications, rather than relying on intuition or assumptions.
Evaluating Embedding Models for Domain-Specific Retrieval
Embedding models are a critical component of any RAG application today, as they enable semantic search, which involves understanding the meaning behind user queries to find the most relevant information. This is valuable in the context of RAG because it ensures that the generative model has access to high-quality, contextually appropriate information. However, not all applications require semantic search β full-text search can often be sufficient or at least a good starting point. Establishing a solid baseline with full-text search is often a practical first step in improving retrieval.
The embedding model landscape is as dynamic and competitive as the LLM space, with numerous options from a wide range of vendors. Key differentiators among these models include embedding dimensions, maximum token limit, model size, memory requirements, model architecture, fine-tuning capabilities, multilingual support, and task-specific optimization. Here, we will focus on enterprise-friendly choices like Azure OpenAI, AWS Bedrock, and open-source models from Hugging Face 🤗. It is essential to evaluate and identify the most suitable embedding model for your application in order to optimize accuracy, latency, storage, memory, and cost.
To effectively evaluate and compare the performance of multiple embedding models, it is necessary to establish a benchmarking dataset. If such a dataset is not readily available, a scalable solution is to use LLMs to create one based on your domain-specific data. For example, LLMs can generate a variety of realistic queries and corresponding relevant content by using existing domain-specific documents as input, which can then be used as a benchmarking dataset.
Generating a Synthetic Dataset Based on Domain-Specific Data
Generating a synthetic dataset presented a unique challenge, especially with the goal of keeping costs low. We aimed to create a diverse and effective dataset using a practical, resource-efficient approach. To achieve this, we used quantized small language models (SLMs) running locally on a desktop with a consumer-grade GPU. We wanted a certain level of variety and randomness in the dataset and did not want to spend excessive time selecting the "right" LLM. Therefore, we decided to use a diverse set of SLMs, including Phi, Gemma, Mistral, Llama, Qwen, and DeepSeek. Additionally, we used a mix of code and language-specific models.
Since we wanted the solution to be general-purpose, we developed a custom implementation that allows potential users to specify a list of LLMs they wish to use (including those provided by Azure OpenAI and AWS Bedrock). Users can also provide a custom system prompt tailored to their specific needs. This flexibility makes the solution adaptable to a wide range of use cases. We also extracted our domain-specific data (SimTalk documentation) into JSON format, enriched with useful metadata. The availability of rich metadata provided the flexibility to filter specific sections of the SimTalk documentation for quick tests.
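To give a feel for what that extraction looks like, here is a hypothetical example of a single documentation chunk. The actual schema is not shown in the article, so treat the field names below as assumptions.

```python
# Hypothetical shape of one extracted documentation chunk. The metadata is
# what makes it possible to filter down to specific sections of the SimTalk
# documentation for quick test runs.
chunk = {
    "id": "simtalk-reference-0042",            # made-up identifier
    "text": "The method <name> returns ...",   # the documentation excerpt itself
    "metadata": {                              # assumed fields, for illustration
        "section": "Language Reference",
        "topic": "Methods",
        "has_code_snippet": True,
    },
}
```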
For each context chunk from the domain-specific dataset, the LLM was tasked with generating a question or query that could be answered based on that context. The system prompt was relatively simple but required iterative adjustments, such as adding domain-specific terminology and refining the structure of the prompt, to better capture the nuances of our application needs and improve the quality of the generated questions. The implementation ensured that each LLM in the list had an equal chance of being selected for generation, with tasks processed in parallel to improve efficiency.
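A simplified sketch of that generation loop is shown below. It is not our production code: the model names and the generate client function are placeholders for whatever local or hosted models you use, but it captures the two key ideas, uniform random model selection and parallel processing.

```python
# Simplified sketch of the synthetic-query generation loop. `chunks` is a list
# of documentation chunks (like the example above) and `generate(model, system,
# user)` stands in for your local SLM / Azure OpenAI / AWS Bedrock client call.
import random
from concurrent.futures import ThreadPoolExecutor

MODELS = ["phi3", "gemma2", "mistral", "llama3.1", "qwen2.5-coder", "deepseek-coder"]  # placeholder names

SYSTEM_PROMPT = (
    "You are an expert in SimTalk, the scripting language of Tecnomatix Plant "
    "Simulation. Given a documentation excerpt, write ONE realistic question a "
    "developer could answer using only that excerpt. Return just the question."
)

def make_sample(chunk, generate):
    model = random.choice(MODELS)  # every model has an equal chance of being picked
    question = generate(model=model, system=SYSTEM_PROMPT, user=chunk["text"])
    return {"query": question.strip(), "context": chunk["text"], "model": model}

def build_dataset(chunks, generate, workers=8):
    # Process chunks in parallel to keep total generation time low.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: make_sample(c, generate), chunks))
```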
In the end, we managed to generate about 13,000 samples in less than 30 minutes. While reviewing a few samples, we noticed that they were not perfect (some lacked specificity, while others contained minor inaccuracies), but they provided a solid starting point. These issues could be addressed in future iterations by refining the prompt further, adding more domain-specific details, and using feedback loops to enhance the quality of generated queries.
This dataset provides approximately thirteen thousand examples of potential queries. Each query is paired with its corresponding top-ranked context chunk, which we expect our retrieval system to fetch. However, it is important to note that this dataset is limited, as it only includes single chunks as context. This limitation affects our ability to comprehensively evaluate the retrieval system, particularly in real-world scenarios where some queries require answers spanning multiple chunks. For example, in a code generation scenario, generating a complete piece of code might require information from multiple sections of the documentation, such as syntax definitions, function descriptions, and best practices, which are spread across multiple chunks. As a result, the evaluation may not fully capture the system's performance on more complex queries needing information from several sources. Nevertheless, despite this limitation, the dataset is a solid starting point for understanding the system's capabilities, and its simplicity makes it easier to iterate and improve upon in future evaluations.
Evaluating Embedding Models on Your Dataset
Next, we aimed to evaluate the performance of multiple embedding models on this dataset to determine which one performs best for the domain-specific data. To achieve this, we developed a multi-embedding model loader capable of interacting with any embedding model. The loader caches the embeddings to avoid redundant computations and speed up evaluation. It also supports batching and, for models that support Matryoshka Representation Learning (MRL), truncating and normalizing embeddings to specific dimensions.
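The sketch below shows the core of that idea under a few assumptions: embed_batch(model_name, texts) is a placeholder for whichever embedding client you use, the cache is a simple per-model NumPy file, and MRL support is approximated by truncating to the first mrl_dim dimensions and re-normalizing.

```python
# Minimal sketch of an embedding loader with caching, batching, and optional
# MRL-style truncation. Assumes the same corpus is always embedded per model,
# so the cache is keyed only by model name.
from pathlib import Path
import numpy as np

def load_or_embed(texts, model_name, embed_batch, cache_dir="emb_cache",
                  batch_size=64, mrl_dim=None):
    cache_file = Path(cache_dir) / f"{model_name.replace('/', '_')}.npy"
    if cache_file.exists():
        emb = np.load(cache_file)                       # reuse cached embeddings
    else:
        batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
        emb = np.vstack([np.asarray(embed_batch(model_name, b)) for b in batches])
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        np.save(cache_file, emb)

    if mrl_dim is not None:                             # Matryoshka models: keep the leading dims
        emb = emb[:, :mrl_dim]
    # L2-normalize so cosine similarity becomes a plain dot product.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```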
The main script allows users to specify a list of embedding models to evaluate, a test dataset (20% of the dataset we created), and a list of cutoff (k) values for metrics such as Precision, Recall, NDCG, MRR, and MAP. These metrics and cutoff values help comprehensively assess different aspects of model performance. We used pytrec_eval's RelevanceEvaluator to calculate these metrics at multiple cutoff values (1, 3, 5, 10).
For each embedding model, we first generate the document and query embeddings (if they are not already available in the cache), and then calculate similarities. We then calculate each metric at the specified cutoff values and log the results for comparison. Finally, we visualize the results for each metric at each cutoff and generate an easy-to-read report using an LLM to provide insights, such as identifying the top-performing models and optimal cutoff.
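As a rough illustration of that evaluation step (not our exact code), the snippet below scores one model with pytrec_eval, assuming L2-normalized query and document embeddings and a qrels dict that maps each synthetic query ID to its expected chunk ID with relevance 1.

```python
# Sketch: score one embedding model's retrieval quality with pytrec_eval.
# `query_emb` / `doc_emb` are L2-normalized arrays; `qrels` looks like
# {"q1": {"chunk_17": 1}, ...}, matching the synthetic dataset.
import numpy as np
import pytrec_eval

def evaluate_model(query_emb, query_ids, doc_emb, doc_ids, qrels, cutoffs=(1, 3, 5, 10)):
    scores = query_emb @ doc_emb.T                      # cosine similarity (normalized inputs)

    # Build the "run": top-k retrieved chunks with scores for each query.
    k_max = max(cutoffs)
    run = {
        qid: {doc_ids[j]: float(scores[i, j]) for j in np.argsort(-scores[i])[:k_max]}
        for i, qid in enumerate(query_ids)
    }

    k_str = ",".join(str(k) for k in cutoffs)
    measures = {"recip_rank", f"P.{k_str}", f"recall.{k_str}",
                f"ndcg_cut.{k_str}", f"map_cut.{k_str}"}
    per_query = pytrec_eval.RelevanceEvaluator(qrels, measures).evaluate(run)

    # Average every reported measure (e.g. "ndcg_cut_5") over all queries.
    measure_names = next(iter(per_query.values())).keys()
    return {m: float(np.mean([q[m] for q in per_query.values()])) for m in measure_names}
```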
Below is a brief explanation of the metrics used for evaluation.
NDCG (Normalized Discounted Cumulative Gain) evaluates the quality of a ranked list of items by considering both their relevance and their positions in the ranking. A higher NDCG score indicates better ranking performance.
MRR (Mean Reciprocal Rank) measures how early the first relevant item appears in a ranked list of results; the closer that item is to the top, the higher the MRR score.
MAP (Mean Average Precision) is calculated as the mean of the Average Precision (AP) scores for each query. It is particularly useful for comparing performance across multiple queries, especially when there are varied relevance judgments. It provides a single-figure measure of quality across multiple queries, considering both precision and recall.
Recall is the proportion of relevant items that were retrieved out of the total number of relevant items available. For example, if there are 10 relevant documents and 8 are retrieved, the recall is 0.8.
Precision is the proportion of retrieved documents that are relevant out of the total number retrieved. There is often a trade-off between precision and recall, where optimizing one may lead to a decrease in the other. For example, increasing recall may result in retrieving more irrelevant items, thereby reducing precision. Understanding this trade-off is crucial for determining which metric to prioritize based on the applicationβs needs. For example, if 10 documents are retrieved and 7 are relevant, the precision is 0.7.
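To make these definitions concrete, here is a small worked example (toy numbers, not from our evaluation) that computes the metrics by hand for a single query.

```python
# Toy example: a query has two relevant chunks {"d1", "d4"} and the system
# returns the ranked list [d3, d1, d7, d4, d9]. Metrics at k = 5.
import math

relevant = {"d1", "d4"}
ranked = ["d3", "d1", "d7", "d4", "d9"]
k = 5

hits = [doc for doc in ranked[:k] if doc in relevant]
precision_at_k = len(hits) / k                      # 2 / 5 = 0.4
recall_at_k = len(hits) / len(relevant)             # 2 / 2 = 1.0

# Reciprocal rank: 1 / rank of the first relevant chunk (d1 at rank 2) -> 0.5
reciprocal_rank = next((1 / (i + 1) for i, d in enumerate(ranked) if d in relevant), 0.0)

# Binary-relevance NDCG@k: discount each hit by log2(rank + 1).
dcg = sum(1 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
ndcg_at_k = dcg / idcg                              # ~0.65

print(precision_at_k, recall_at_k, reciprocal_rank, round(ndcg_at_k, 2))
```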
For our domain-specific dataset, azure/text-embedding-3-large (3072 dimensions) emerged as the best performer across all metrics, with azure/text-embedding-3-small (1536 dimensions) and huggingface/BAAI/bge-large-en-v1.5 (1024 dimensions) showing similar performance. Interestingly, cohere.embed-english-v3 (1024 dimensions) performed the worst on this dataset.
So, where do we go from here? You can either choose the top-performing model and focus on optimizing other components of your retrieval pipeline, or you could continue exploring to identify the most suitable model for your needs.
We chose to explore further, so in the next article we will see how we fine-tuned an embedding model on our domain-specific data.