Anthropic’s New RAG Approach
Last Updated on October 5, 2024 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.
The Rise of LLMs
LLMs are super-powerful tools. I remember when ChatGPT was first released, I thought, hell no, I’m going to lose my job (before even getting one), but… then I saw it struggling to pick the right CUDA version for Ubuntu just like the rest of us 🙂
Anyway, if you ask someone who doesn’t know much about AI about ChatGPT, they might say it’s just retrieving answers from the internet, like a semantic search. Well, almost, but it’s a bit more complex. These models aren’t retrieving the next word from a database you can instantly point to; they’ve compressed vast amounts of information into their weights, and they literally predict every single word (or rather, sub-word) one at a time.
Even though these LLMs are very powerful at chatting and answering questions, such as writing APIs with Flask or preparing unit tests for us, they struggle in specialized domains.
What About Fine-Tuning Your Own LLM?
While Large Language Models (LLMs) excel at general knowledge tasks, they often underperform in specialized domains. Their broad capabilities come at the cost of domain-specific expertise, akin to being generalists rather than specialists.
This limitation is where fine-tuning becomes relevant. Fine-tuning adapts an LLM to a specific domain or task, enhancing its performance in targeted areas. However, the process is considerably more complex than simply exposing the model to domain-specific data.
The Challenges of Fine-Tuning
Implementing LLM fine-tuning presents significant challenges. Organizations typically have two options:
- Fine-tuning an open-source model
- Utilizing APIs from providers offering fine-tuning services for proprietary models
Regardless of the chosen path, financial considerations are paramount. Costs may include API usage fees or expenses related to cloud GPU resources for training. Even with adequate financial resources, organizations may face a shortage of high-quality, relevant data necessary for effective fine-tuning.
Data sensitivity introduces additional complexities. Organizations must ensure proper attribution and traceability of data sources; citing an LLM in a research paper might raise a few eyebrows. Moreover, protecting proprietary information from potential exposure to LLM providers is a critical concern.
While fine-tuning LLMs is a trending topic in the AI community, it’s not a decision to be made lightly. The process requires careful consideration of resources, data quality, and potential risks, much like adopting a high-maintenance technology stack.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) offers a potential solution to these challenges. RAG connects an LLM directly to an organization’s knowledge base, enabling relevant data retrieval without the need for model retraining. This approach enhances performance in specific domains while maintaining data security and reducing the complexities associated with fine-tuning.
How RAG Systems Work
Retrieval-Augmented Generation (RAG) systems consist of two main components:
1. Knowledge Base Creation
This is the preparation phase where your documents are processed and stored:
- Document Chunking: You take your documents and break them down into smaller, manageable sub-documents or chunks.
- Embedding Computation: For each of these chunks, you compute embeddings. These are numerical representations that capture the semantic meaning of the text.
- Vector Store: The computed embeddings are then stored in a vector store, which allows for efficient similarity searches later on.
2. Generation Part
This is the runtime or inference phase, where the system responds to user queries:
- Query Processing: When a user asks a question, the system computes an embedding for that query.
- Retrieval: The system then retrieves the most relevant chunks from the vector store based on the similarity between the query embedding and the stored chunk embeddings.
- LLM Integration: The retrieved chunks, along with the original query, are fed into a Large Language Model (LLM).
- Response Generation: The LLM generates a response based on the query and the context provided by the retrieved chunks.
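To make the two phases concrete, here is a minimal sketch of a plain RAG pipeline. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 embedding model (both assumptions, not something from Anthropic’s post), keeps the “vector store” as an in-memory array, and uses a hypothetical answer_with_llm helper as a stand-in for whatever LLM you actually call.

```python
# Minimal sketch of both RAG phases, assuming sentence-transformers and an
# in-memory "vector store"; answer_with_llm is a hypothetical stand-in for
# whatever LLM you actually call.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# --- 1. Knowledge base creation ---
documents = ["...long document text...", "...another document..."]

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size character chunking; real systems chunk by tokens or sections.
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = [c for doc in documents for c in chunk(doc)]
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)  # shape (n_chunks, dim)

# --- 2. Generation (runtime) ---
def retrieve(query: str, k: int = 5) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)
    scores = chunk_embeddings @ q.T          # cosine similarity (vectors are normalized)
    top = np.argsort(-scores.ravel())[:k]    # indices of the k most similar chunks
    return [chunks[i] for i in top]

query = "What were the long-term effects of Drug X in the 2023 clinical trial?"
context = "\n\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# response = answer_with_llm(prompt)  # hypothetical LLM call
```

A production setup would swap the in-memory array for a real vector database and chunk by tokens or document structure rather than raw characters.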
Enhancing Retrieval Accuracy
While this technique is great for semantic similarity, it can still have failure points. To improve accuracy, many systems combine this semantic search with traditional keyword-based search mechanisms:
- BM25 Integration: One such technique is BM25 (Best Matching 25), which plays a critical role when the user query and the database involve specific keywords.
- Hybrid Approach: By combining semantic search with keyword-based methods, RAG systems can provide more robust and accurate results, especially in scenarios where exact term matching is important.
This hybrid approach allows RAG systems to leverage the strengths of both semantic understanding and traditional information retrieval techniques, resulting in more reliable and context-aware responses. As reported by Anthropic, this hybrid method improves results by ~1%.
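One common way to combine the two ranked lists is reciprocal rank fusion (RRF). Anthropic’s post describes merging and deduplicating results with rank fusion techniques, though not necessarily this exact formula, so treat the sketch below as an illustration rather than their implementation.

```python
# Reciprocal rank fusion (RRF): merge the semantic and BM25 rankings into one list.
# An illustration of rank fusion in general, not Anthropic's exact formula.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Each ranked list contains chunk ids, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_ranking = ["chunk_7", "chunk_2", "chunk_9"]  # from the vector store
bm25_ranking = ["chunk_2", "chunk_5", "chunk_7"]      # from the keyword index
print(reciprocal_rank_fusion([semantic_ranking, bm25_ranking]))
# Chunks ranked highly by both methods (chunk_2, chunk_7) float to the top.
```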
BM25
Sometimes, the semantic side misses important exact matches. That’s where BM25 comes in. It’s like a super-powered “Ctrl+F” (find function) that looks for exact words or phrases.
Examples Where BM25 Shines
- Error Codes: Finding “Error XYZ-123” exactly.
- Product Numbers: Locating “Model AB-9000” precisely.
- Specific Terms: Finding “mitochondria” in biology texts.
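Here is a small sketch of that kind of exact-match search using the rank_bm25 package (one of several BM25 implementations, chosen here as an assumption rather than taken from Anthropic’s post). The whitespace tokenization is deliberately naive.

```python
# BM25 keyword search with the rank_bm25 package; whitespace tokenization is
# deliberately naive, and a real system would use a proper tokenizer.
from rank_bm25 import BM25Okapi

corpus = [
    "Error XYZ-123 occurs when the CUDA driver version is incompatible.",
    "Model AB-9000 ships with a two-year warranty.",
    "Mitochondria are the powerhouse of the cell.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "what causes Error XYZ-123"
scores = bm25.get_scores(query.lower().split())
print(scores)  # the document containing the exact token "xyz-123" scores highest
```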
Recap Before Moving On
A standard RAG system works by first breaking a large collection of documents into smaller pieces. These pieces are transformed into numbers (called embeddings) to capture their meaning, and they are stored in a special database for fast searching. When you ask a question, the system looks through these stored pieces and picks the ones that seem most relevant using two methods: one based on keyword importance (BM25, which builds on TF-IDF) and another based on the embeddings. It then combines the best pieces and uses a smart model to generate a clear answer based on them.
What’s Wrong with That Approach?
Here is a quick example of where a standard RAG system might fail:
What were the long-term effects of Drug X in the 2023 clinical trial?
A relevant chunk retrieved by the system might contain the text:
Participants showed significant improvements after treatment.
However, this chunk doesn’t specify which drug was used, whether it refers to the 2023 trial, or if the improvements were long-term. Without this additional context, the system cannot accurately answer the question, leading to a potentially misleading or incomplete response.
Introducing Contextual Retrieval
In practice, this is what it’s going to look like. You take a single PDF file or a single document from your corpus and convert it into chunks. After that, you take one chunk at a time and run it through a prompt like the one below, along with the original document.
```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```
This will generate contextual information for each chunk. You combine the given chunk with its generated context and pass the result through an embedding model; those embeddings are stored in a standard vector database. In parallel, you also update the BM25 index with the same contextualized chunks (BM25 builds on TF-IDF, Term Frequency-Inverse Document Frequency, the classic keyword-based search mechanism).
As a result, you are adding roughly 50 to 100 tokens to each chunk. You can probably already see some potential issues with this approach. One of them is overhead: not only from the extra tokens attached to every chunk, but also because every chunk has to go through an LLM, which adds up to a lot of tokens across a large corpus.
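Below is a minimal sketch of that contextualization step using the Anthropic Python SDK. The model name and the max_tokens cap are assumptions, and the original post additionally relies on prompt caching for the repeated document text to keep costs down, which is omitted here.

```python
# Sketch of the contextualization step with the Anthropic Python SDK.
# The model name and max_tokens are assumptions; the original post also uses
# prompt caching for the repeated document text, which is omitted here.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize(chunk: str, whole_document: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed model; any cheap model works
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                whole_document=whole_document, chunk_content=chunk
            ),
        }],
    )
    context = response.content[0].text
    # Prepend the generated context; this combined text is what gets embedded
    # and added to the BM25 index.
    return f"{context}\n\n{chunk}"
```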
Performance Improvements
- Contextual Embeddings reduced the top-20-chunk retrieval failure rate by 35% (5.7% → 3.7%).
- Combining Contextual Embeddings and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 49% (5.7% → 2.9%).
Reranking
Retriever models, which are designed to pull back candidate chunks, are compact and efficient. However, they rely on simple similarity measures, such as Euclidean distance or cosine similarity between embedding vectors. This can lead to suboptimal results, as it prioritizes speed and retrieval volume over accuracy.
In contrast, reranker models are larger and slower, making them unsuitable for scoring an entire corpus directly. Instead, they are used to rerank the smaller selection of chunks already identified by the retriever.
The key distinction lies in the rerankers’ ability to perform cross-attention between the user query and each chunk. This enables them to uncover important relationships between the two texts that simpler similarity measures may overlook.
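As an illustration, here is a sketch of reranking retrieved chunks with an open-source cross-encoder from sentence-transformers. The model name is an assumption, and hosted rerankers such as Cohere’s or Voyage AI’s (linked below) expose a similar query-plus-documents interface.

```python
# Sketch of reranking retrieved chunks with an open-source cross-encoder.
# The model name is an assumption; hosted rerankers (Cohere, Voyage AI) work similarly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What were the long-term effects of Drug X in the 2023 clinical trial?"
retrieved_chunks = [
    "Participants showed significant improvements after treatment.",
    "The 2023 Drug X trial tracked participants for 24 months after dosing.",
    "Drug Y showed no measurable effect in earlier studies.",
]

# Cross-attention between the query and each chunk yields a relevance score.
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)]
print(reranked[0])  # the chunk most relevant to the query after reranking
```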
Learn more about rerankers in the Cohere Rerank and Voyage AI reranker documentation linked in the resources below.
Resources
- Introducing Contextual Retrieval (www.anthropic.com)
- Rerankers (docs.voyageai.com)
- Rerank (cohere.com)
Published via Towards AI