Late Chunking In Long Context Embedding Models
Last Updated on November 8, 2024 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.
In a previous article, we looked at contextual retrieval from Anthropic, which is their context enhancement technique for improving RAG systems. But there's another technique, late chunking in long-context embedding models, which I think is a lot more interesting and could have a significant impact.
Embeddings are one of the most critical components of any retrieval system but are often ignored or misused. When you are selecting an embedding model, you need to consider two very important parameters. One is the max tokens, which is basically the context window, and the second one is the embedding dimension, which is the output size of the embedding vector.
You have probably seen the huge max token sizes of some of the newest embedding models, but there is one major issue with them. In a standard RAG pipeline, irrespective of the size of the chunk, the output size of your embedding vector remains the same. So, whether you are embedding five tokens or 5,000 tokens, the output vector has exactly the same number of dimensions. That means these embedding models have to compress a lot of information for long input chunks.
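To make the fixed-size point concrete, here is a minimal sketch. It does not use a real embedding model: `toy_token_vectors` is a hypothetical stand-in that produces random token vectors, and `EMBEDDING_DIM = 8` is an arbitrary illustrative value. The only thing it shows is that mean pooling returns a vector of the same length for 5 tokens and for 5,000 tokens.

```python
import random

EMBEDDING_DIM = 8  # hypothetical output size, chosen only for illustration


def toy_token_vectors(num_tokens, dim=EMBEDDING_DIM, seed=0):
    """Stand-in for per-token model outputs: random vectors, not a real model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def mean_pool(token_vectors):
    """Average the token vectors component-wise into one fixed-size vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


short_embedding = mean_pool(toy_token_vectors(5))
long_embedding = mean_pool(toy_token_vectors(5000))

# Both inputs end up in a vector of the same length.
print(len(short_embedding), len(long_embedding))
```

The longer input has a thousand times more token vectors to squeeze into the same number of dimensions, which is the compression problem described above.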
In most cases, you want to use smaller chunks, but that has its own problems. To understand that, let's look at this example: here is a small paragraph regarding Tunisia:
If you use a sentence-level chunking strategy, essentially each sentence is going to become its own different chunk. You can see that if you embed these different chunks separately, you will lose the contextual information. For example, "the country" is referring to Tunisia, but if this chunk is in isolation, it's going to lose that context.
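The same effect is easy to see in code. The sketch below applies a simple regex-based sentence splitter (my own illustrative choice, not a recommended production splitter) to the Berlin passage quoted later in this article, and checks which chunks still mention the entity by name.

```python
import re

# The three Berlin sentences quoted later in this article, joined into one passage.
text = (
    "Berlin is the capital and largest city of Germany, both by area and by population. "
    "Its more than 3.85 million inhabitants make it the European Union's most populous city, "
    "as measured by population within city limits. "
    "The city is also one of the states of Germany, and is the third smallest state "
    "in the country in terms of area."
)


def sentence_chunks(document):
    """Split after '.', '!' or '?' when followed by whitespace ('3.85' stays intact)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


chunks = sentence_chunks(text)
mentions_berlin = ["Berlin" in chunk for chunk in chunks]

for chunk, mentioned in zip(chunks, mentions_berlin):
    print(mentioned, "|", chunk[:45])
```

Only the first chunk contains the word "Berlin". The other two rely on "Its" and "The city", so embedded in isolation they carry no explicit signal about which city they describe.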
The contextual retrieval approach tries to summarize the documents and add contextual information to each chunk, but there is a better approach, and that is late chunking.
Let's first look at how "normal" chunking works.
Normal Chunking and Embedding Process
In the traditional chunking approach, the text is first divided into smaller chunks, and then each chunk is passed separately through a neural network, typically a Transformer model. The output of this process is an embedding for each of the individual tokens within the chunk. The contextual information in these token embeddings is limited to the chunk that they belong to. After generating the token embeddings, mean pooling is performed to compute the final embedding for each chunk. This preserves contextual information, but only within each individual chunk; every chunk embedding is independent of the other chunks in the same document.
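Here is a minimal sketch of that chunk-then-embed order. There is no real Transformer here: `toy_encode` is a hypothetical stand-in that blends each token vector with the mean of whatever input it was given, which is just enough to mimic "tokens only see their own input". The two example documents and all names are invented for illustration.

```python
import random

DIM = 4  # toy embedding size, chosen only for illustration


def static_vector(token):
    """Deterministic pseudo-random vector per token (stand-in for a learned embedding)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def mean(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def toy_encode(tokens):
    """Hypothetical stand-in for a Transformer: blend each token with the mean of its input."""
    static = [static_vector(t) for t in tokens]
    context = mean(static)
    return [[0.5 * s + 0.5 * c for s, c in zip(vec, context)] for vec in static]


def naive_chunk_embeddings(chunks):
    # Chunk first, then encode: every chunk is its own input to the model.
    return [mean(toy_encode(chunk.split())) for chunk in chunks]


doc_a = ["Berlin is the capital of Germany.", "The city is also a state."]
doc_b = ["Paris is the capital of France.", "The city is also a state."]

emb_a = naive_chunk_embeddings(doc_a)
emb_b = naive_chunk_embeddings(doc_b)

# The second chunk gets the same vector in both documents.
print(emb_a[1] == emb_b[1])  # True
```

Because each chunk is encoded in isolation, "The city is also a state." gets an identical embedding whether the document is about Berlin or Paris.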
Late Chunking Approach
The late chunking approach reverses this process. Instead of dividing the document into chunks and then computing the embeddings, the whole text of the document is first passed through a Transformer model. This generates an embedding for each token, and these token embeddings now carry contextual information that is not limited to a single chunk but covers the entire document. After this step, the chunking is performed: the original text is divided into chunks, and the token embeddings that fall inside each chunk are mean pooled, resulting in the final representation of that chunk.
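A minimal sketch of the embed-then-chunk order follows. It is not Jina's implementation: `toy_encode` is a hypothetical stand-in for a long-context Transformer that simply blends each token vector with the mean of the full input, and tokens are whitespace-separated words rather than real model tokens. What it does show is the mechanics: one pass over the whole document, then mean pooling over each chunk's token span.

```python
import random

DIM = 4  # toy embedding size, chosen only for illustration


def static_vector(token):
    """Deterministic pseudo-random vector per token (stand-in for a learned embedding)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def mean(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def toy_encode(tokens):
    """Hypothetical stand-in for a Transformer: blend each token with the mean of its input."""
    static = [static_vector(t) for t in tokens]
    context = mean(static)
    return [[0.5 * s + 0.5 * c for s, c in zip(vec, context)] for vec in static]


def late_chunk_embeddings(chunks):
    # 1) Record which token positions belong to which chunk.
    tokens, spans = [], []
    for chunk in chunks:
        chunk_tokens = chunk.split()
        spans.append((len(tokens), len(tokens) + len(chunk_tokens)))
        tokens.extend(chunk_tokens)
    # 2) Encode the whole document in a single pass.
    token_embeddings = toy_encode(tokens)
    # 3) Only now chunk: mean pool the token embeddings inside each span.
    return [mean(token_embeddings[start:end]) for start, end in spans]


doc_a = ["Berlin is the capital of Germany.", "The city is also a state."]
doc_b = ["Paris is the capital of France.", "The city is also a state."]

late_a = late_chunk_embeddings(doc_a)
late_b = late_chunk_embeddings(doc_b)

# Same chunk text, different surrounding document, different embedding.
print(late_a[1] != late_b[1])  # True
```

The number of output vectors is still one per chunk, but the vector for "The city is also a state." now depends on the document it came from.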
Since the chunking process is done at a later stage, this approach is called "late chunking".
The best approach for retrieval
The late chunking approach is closely related to another approach called late interaction, which is the ColBERT-style approach. This is probably the best approach for retrieval quality, but it comes at the cost of storage. In this case, you don't do the final pooling step; instead, you keep the individual token embeddings and store them.
A blog post from the Weaviate team shows that if you embed the same set of about 100,000 documents both ways, you will need about 1.6 million vectors for a naive chunking approach, which is about 5 GB. However, if you were to use late interaction, the ColBERT-style multi-vector representation, you would need about 2.5 terabytes, which is pretty huge. The reason is that you're storing an embedding for each token individually.
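A quick back-of-the-envelope calculation shows where numbers of this size can come from. Only the 100,000 documents come from the figures quoted above. The document length, chunk size, embedding dimension and float width below are my own assumptions, picked because they roughly reproduce those figures; the Weaviate post may have used different settings.

```python
NUM_DOCS = 100_000        # from the figures quoted above
TOKENS_PER_DOC = 8_192    # assumption: every document fills an 8K context window
CHUNK_SIZE_TOKENS = 512   # assumption: fixed-size chunks
EMBEDDING_DIM = 768       # assumption: output size of the embedding model
BYTES_PER_FLOAT = 4       # assumption: float32 storage, no compression

bytes_per_vector = EMBEDDING_DIM * BYTES_PER_FLOAT

# Naive chunking and late chunking: one pooled vector per chunk.
chunks_per_doc = TOKENS_PER_DOC // CHUNK_SIZE_TOKENS
chunk_vectors = NUM_DOCS * chunks_per_doc
chunk_storage_gb = chunk_vectors * bytes_per_vector / 1e9

# Late interaction: one vector per token, no pooling.
token_vectors = NUM_DOCS * TOKENS_PER_DOC
token_storage_tb = token_vectors * bytes_per_vector / 1e12

print(f"chunk-level: {chunk_vectors:,} vectors, {chunk_storage_gb:.1f} GB")
print(f"token-level: {token_vectors:,} vectors, {token_storage_tb:.1f} TB")
```

Under these assumptions, the per-token representation needs 512 times as many vectors as the per-chunk one, which is simply the chunk size in tokens.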
The late chunking approach, on the other hand, gives you the best of both worlds. Not only does it preserve the context in your final chunk embeddings, but it also has about the same storage needs as the naive chunking approach. This makes late chunking a more efficient and practical solution than late interaction or other ColBERT-style approaches, which require significantly more storage.
Results
Late chunking was introduced by Jina AI, who also publish their own embedding models. According to the results presented in the table, the late chunking approach, when combined with other methods, demonstrates promising performance across various benchmarks.
The choice of embedding model plays a critical role, and the results presented here may require further validation from independent sources.
Practical Implementation Guide
The folks behind this new idea have put together a really simple notebook that walks through how you can implement this late chunking approach in your own applications and pipelines. Even though we're not going to dive into the notebook itself, let me give you an idea of the results.
In the notebook, they compute the similarity between the embedding of "Berlin" as a single token and the embedding of each chunk. If you look at the first sentence, which talks about Berlin directly and mentions "Berlin," the traditional and late chunking approaches give you very similar scores.
similarity_new (late chunking): 0.849546
similarity_trad (normal chunking): 0.84862185
similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.84862185
However, the second sentence, "Its more than 3.85 million inhabitants …", refers to Berlin only indirectly, without naming it. In this case, the similarity with the late chunking approach is about 0.82, whereas with the traditional chunking approach it drops to about 0.71.
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.7084338
Similarly, for the third sentence, which refers to Berlin as "The city," the similarity stays high with the late chunking approach (0.85) but is substantially lower with the traditional chunking approach (0.75).
similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.84980094
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.7534553
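For readers who want to reproduce this kind of score, here is a small sketch of cosine similarity, assuming that is the similarity measure behind the numbers above. The vectors are made-up two-dimensional toys, not real embeddings; the function itself is the standard definition.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]            # toy stand-in for the "Berlin" embedding
same_direction = [3.0, 0.0]   # points the same way, different length
partly_related = [1.0, 1.0]   # 45 degrees away from the query
unrelated = [0.0, 2.0]        # orthogonal to the query

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, partly_related))
print(cosine_similarity(query, unrelated))       # 0.0
```

The score depends only on the angle between the two vectors, not on their length, which is why it is a common choice for comparing a query embedding with chunk embeddings.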
Now, this is a simple example, but if the chunks are much larger, there will be a significant loss of information in a naive chunking approach compared to the late chunking approach they propose. Another point to mention is that late chunking is bidirectional. Because the model looks at the entire document during embedding, a chunk retains information related to it regardless of whether that information appears before or after the chunk.
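The toy below illustrates what bidirectional means here. It is not a real attention layer: `attend` is a hypothetical uniform-attention stand-in where each token is a single number, and the causal variant is included only as a contrast. It changes only the text after a chunk and checks whether the chunk's pooled value moves.

```python
def attend(values, bidirectional):
    """Toy uniform attention: each position averages the positions it is allowed to see."""
    outputs = []
    for i in range(len(values)):
        visible = values if bidirectional else values[: i + 1]
        outputs.append(sum(visible) / len(visible))
    return outputs


def chunk_embedding(doc, bidirectional):
    """Mean pool the contextualized values of the first two tokens (the chunk of interest)."""
    contextual = attend(doc, bidirectional)
    return sum(contextual[:2]) / 2


# One scalar "feature" per token; only the tokens AFTER the chunk differ.
doc_original = [1.0, 2.0, 3.0, 4.0]
doc_changed_after = [1.0, 2.0, 30.0, 40.0]

bi_changed = chunk_embedding(doc_original, True) != chunk_embedding(doc_changed_after, True)
causal_changed = chunk_embedding(doc_original, False) != chunk_embedding(doc_changed_after, False)

print(bi_changed)      # True: the chunk picks up context that comes after it
print(causal_changed)  # False: a left-to-right view never sees the later text
```

With the bidirectional view, the first chunk reflects what follows it as well as what precedes it, which is the property described above.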
This bidirectionality makes it even more powerful, showing that long-context models are essential for both LLMs and these embeddings.
Conclusion
You know, after walking through all the details and examples in this article, I have to say that the late chunking approach for long-context embedding models is really fascinating and seems to hold a lot of promise.
The key insight here is that the traditional way of chunking up documents and then separately embedding each chunk can lead to a significant loss of context and important information. By instead taking the whole document, embedding it first to capture that broad context, and then chunking it up, you're able to preserve so much more of the relevant details.
As the examples showed, when you're dealing with references to entities like "Berlin" that might be made both directly and indirectly, the late chunking method is able to maintain a much stronger understanding of the meaning, compared to the traditional chunking and embedding pipeline. And the fact that it's bidirectional, able to pull in context from before and after each chunk, just makes it even more powerful.
Links
Google Colab
colab.research.google.com
Late Chunking in Long-Context Embedding Models
Chunking long documents while preserving contextual information is challenging. We introduce the "Late Chunking" that…
jina.ai
What Late Chunking Really Is & What It's Not: Part II
Part 2 of our exploration of Late Chunking, a deep dive into why it is the best method for chunk embeddings and…
jina.ai
Late Chunking: Balancing Precision and Cost in Long Context Retrieval | Weaviate
Learn about Late Chunking and how it may be the right fit for balancing cost and performance in your long context…
weaviate.io
Introducing Contextual Retrieval
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI…
www.anthropic.com
GitHub – jina-ai/late-chunking: Code for explaining and evaluating late chunking (chunked pooling)
Code for explaining and evaluating late chunking (chunked pooling) – jina-ai/late-chunking
github.com
Published via Towards AI