Fine-Tuning Embedding Models: Achieving More with Less

Author(s): Nilesh Raghuvanshi

Originally published on Towards AI.

Improving Retrieval Augmented Generation (RAG) Systematically

Fine-tuning for alignment (AI-generated image)

Introduction

In my last article, we evaluated multiple embedding models on our domain-specific data and saw that the huggingface/BAAI/bge-large-en-v1.5 model (1024 dimensions) was competitive with azure/text-embedding-3-large (3072 dimensions) and azure/text-embedding-3-small (1536 dimensions). What made it more interesting was that it can be fine-tuned on domain-specific data using the sentence-transformers library.

Choosing the Right Model for Fine-Tuning

The BGE family of models comes in multiple sizes (large, base, and small), each with a different parameter count and memory footprint. The large model suits high-resource environments, while the base and small models are more practical for resource-constrained scenarios.

After initial exploration, I decided to run my fine-tuning experiments with the base model, BAAI/bge-base-en-v1.5, because it offers a good balance between resource efficiency and performance. The base model has 109 million parameters, uses 0.41 GB of memory, and outputs 768 dimensions. In comparison, the large model, BAAI/bge-large-en-v1.5, has 335 million parameters, uses 1.25 GB of memory, and outputs 1024 dimensions. The base model's smaller size made it more practical for my GPU (NVIDIA A40 with 48 GB VRAM), allowing for faster iterations within the available memory.
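The memory figures quoted for the two models are consistent with storing every parameter as a 32-bit float. The sketch below is a rough check under two assumptions the article does not state: 4 bytes per parameter, and sizes reported in GiB (2^30 bytes).

```python
# Rough weight-memory estimate.
# Assumptions (not stated in the article): float32 weights, sizes in GiB.
BYTES_PER_PARAM = 4


def weight_memory_gib(num_params: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM / 2**30


base_gib = weight_memory_gib(109_000_000)   # BAAI/bge-base-en-v1.5
large_gib = weight_memory_gib(335_000_000)  # BAAI/bge-large-en-v1.5

print(f"bge-base-en-v1.5:  {base_gib:.2f} GiB")
print(f"bge-large-en-v1.5: {large_gib:.2f} GiB")
```

This covers the weights only. Training also needs memory for gradients, optimizer state, and activations, which is where the smaller model helps most.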

Matryoshka Representation Learning (MRL)

One notable feature of the sentence-transformers library is its support for Matryoshka Representation Learning (MRL). MRL trains an embedding model so that its embeddings can be truncated to smaller dimensions without a significant loss in performance. Smaller embeddings improve computational efficiency and lower memory requirements, which is particularly useful when deploying models in resource-constrained environments. For this evaluation, I experimented with embedding dimensions of [768, 512, 256, 128, 64]. The latest OpenAI embedding models, azure/text-embedding-3-large and azure/text-embedding-3-small, also support MRL, making it an exciting area for comparison.
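With an MRL-trained model, using a smaller dimension amounts to keeping the leading values of the full embedding and rescaling them to unit length. The sketch below illustrates that idea with toy 8-dimensional vectors standing in for real 768-dimensional embeddings. The vectors and helper names are made up for illustration.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors (hypothetical values), not output from a real model.
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, -0.05]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(f"dim={dim}: similarity={dot(q, d):.3f}")
```

Recent sentence-transformers releases expose this, as far as I know, through a truncate_dim argument when loading a model, so the manual slicing above is only meant to show what happens underneath.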

Training

If you remember, we used only 20% of the generated synthetic dataset for evaluation, which gave us a representative test sample while keeping computational requirements manageable. The remaining 80% was reserved for fine-tuning the embedding model, providing ample training data to help the model generalize.

For training, we used a combination of MatryoshkaLoss and MultipleNegativesRankingLoss as the loss function. MatryoshkaLoss helps the model learn embeddings at multiple granularities. MultipleNegativesRankingLoss optimizes the model to produce similar embeddings for positive sentence pairs and dissimilar embeddings for negative pairs. Combining the two trains a model whose embeddings are both dimensionally flexible and semantically robust, so multiple embedding sizes can be used while maintaining high performance on tasks that require precise semantic understanding.

Finally, we used the AdamW optimizer with a learning rate of 2e-5 and trained for 10 epochs. This number of epochs was chosen to balance training time against model performance, providing sufficient learning without overfitting.
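To make the combined loss concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. It assumes in-batch negatives (every other passage in the batch is a negative for a given query), cosine similarity scaled by 20, and equal weights per dimension. Those values reflect my understanding of the sentence-transformers defaults and should be treated as assumptions. The toy vectors are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(queries, docs, scale=20.0):
    """Cross-entropy where docs[i] is the positive for queries[i]
    and every other doc in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, d) for d in docs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


def matryoshka_loss(queries, docs, dims, weights=None):
    """Sum the ranking loss over embeddings truncated to each dimension."""
    weights = weights or [1.0] * len(dims)
    return sum(
        w * in_batch_ranking_loss([q[:d] for q in queries], [x[:d] for x in docs])
        for d, w in zip(dims, weights)
    )


# Toy batch of two (query, passage) pairs with 4-dimensional embeddings.
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
docs = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

aligned = matryoshka_loss(queries, docs, dims=(4, 2))
swapped = matryoshka_loss(queries, docs[::-1], dims=(4, 2))
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

Because the loss is summed over every truncated size, the model is pushed to pack the most useful information into the leading dimensions, which is what makes later truncation cheap.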

Evaluation and Results

The model was fine-tuned on the training set. We evaluated it with InformationRetrievalEvaluator from the sentence-transformers library, using only the test queries but searching the entire corpus (both training and test data). Note that the evaluation of multiple embedding models in my last article used only the test set, that is, both queries and corpus came from the test dataset. Next, we compare the performance of the base model BAAI/bge-base-en-v1.5 with the newly fine-tuned version.
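The evaluation setup described above can be sketched as three dictionaries: test queries, a corpus drawn from both splits, and a mapping from each query to its relevant passage. The row layout, ids, and texts below are hypothetical. The article does not show the dataset format.

```python
# Hypothetical rows: each pairs a synthetic query with the passage it was
# generated from. Field names and contents are illustrative only.
train_rows = [
    {"id": "doc-001", "query": "How do I reset the device?", "passage": "Hold the power button..."},
    {"id": "doc-002", "query": "What is the warranty period?", "passage": "The warranty lasts..."},
]
test_rows = [
    {"id": "doc-003", "query": "How do I update the firmware?", "passage": "Download the latest..."},
]

# Corpus spans BOTH splits, so each test query has to find its passage
# among the training passages as well.
corpus = {row["id"]: row["passage"] for row in train_rows + test_rows}

# Only test queries are evaluated.
queries = {row["id"]: row["query"] for row in test_rows}

# Each query maps to the set of corpus ids considered relevant.
relevant_docs = {row["id"]: {row["id"]} for row in test_rows}

print(len(queries), "queries against", len(corpus), "passages")
```

These three dictionaries have the shape that InformationRetrievalEvaluator takes as its queries, corpus, and relevant_docs arguments. Searching the larger combined corpus makes the task harder than searching test passages alone.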

Comparison of Base and Fine-Tuned Models

To visualize the results, we compared the base and fine-tuned models. Note that the base model does not support MRL. In the visualization, the first gray bar represents the base model at 768 dimensions, while the green bars represent the fine-tuned model at 768, 512, 256, 128, and 64 dimensions. Interestingly, the fine-tuned model at 64 dimensions (last green bar) outperformed the base model at 768 dimensions across all metrics. The fine-tuned model performed best at 512 and 256 dimensions, showing the strength of MRL. Fine-tuning for just 10 epochs on a domain-specific dataset led to an 8% improvement in NDCG@10.
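For readers less familiar with NDCG@10, the sketch below shows how the metric rewards placing relevant passages near the top of the first ten results. It assumes binary relevance (a passage is either relevant or not), which fits a setup where each query has one source passage. It is an illustration, not the evaluator's own code.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: 1 for a relevant id, 0 otherwise."""
    # Discounted gain of the actual ranking (position 0 gets discount log2(2) = 1).
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Best possible gain: all relevant ids packed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"doc-003"}
top_hit = ndcg_at_k(["doc-003", "doc-001", "doc-002"], relevant)
second_hit = ndcg_at_k(["doc-001", "doc-003", "doc-002"], relevant)
miss = ndcg_at_k(["doc-001", "doc-002"], relevant)
print(top_hit, second_hit, miss)
```

A score is computed per query and then averaged, so a gain in NDCG@10 means relevant passages moved up the ranking across the test queries as a whole.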

Comparison with Top-Performing Models

Next, we compared the fine-tuned model against the top-performing models from our last evaluation. Not only did the fine-tuned model huggingface/BAAI/ft-bge-base-en-v1.5 at 512 dimensions outperform the rest of the competition, but it also challenged the top model, azure/text-embedding-3-large at 3072 dimensions. In fact, at higher cutoffs (3, 5, 10), the fine-tuned model edged past azure/text-embedding-3-large at 3072 dimensions if you consider the metrics up to three decimal places (not shown here).

Fair Comparison Across Dimensions

To make the comparisons fair, we also evaluated the fine-tuned model at 512, 256, and 64 dimensions against azure/text-embedding-3-large at corresponding dimensions. Here, the fine-tuned model emerged as the clear winner, though azure/text-embedding-3-large remained competitive at 512 dimensions. However, at 256 and especially at 64 dimensions, the performance of azure/text-embedding-3-large dropped significantly.

Conclusion

Overall, our fine-tuning efforts have paid off well. We now have a model that offers a 6x to 48x storage reduction (512 down to 64 dimensions, versus 3072) compared to the top-performing model from earlier evaluations, with better performance across all metrics. This combination of smaller embeddings and improved retrieval quality translates into lower storage costs, faster search times, reduced memory usage, and ultimately lower overall costs. In the final article of this short series, we will see how to evaluate the retrieval and generation pipeline to determine the best RAG pipeline for your application.
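The 6x to 48x range follows directly from the ratio of embedding dimensions. The sketch below works through the arithmetic, assuming float32 vectors and a hypothetical index of one million passages. Neither figure comes from the article.

```python
# Assumptions for illustration: float32 storage, one million vectors.
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 1_000_000
REFERENCE_DIMS = 3072  # azure/text-embedding-3-large at full size


def index_size_gib(num_vectors: int, dims: int) -> float:
    """Raw vector storage in GiB, ignoring index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 2**30


def reduction_factor(reference_dims: int, dims: int) -> float:
    """How many times smaller a `dims`-wide vector is than the reference."""
    return reference_dims / dims


for dims in (512, 256, 128, 64):
    size = index_size_gib(NUM_VECTORS, dims)
    factor = reduction_factor(REFERENCE_DIMS, dims)
    print(f"{dims:>4} dims: {size:.2f} GiB, {factor:.0f}x smaller than {REFERENCE_DIMS} dims")
```

Vector databases add their own index structures on top of the raw vectors, so real savings will differ somewhat, but the ratio between dimensions is the main driver.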



Published via Towards AI
