Late Chunking In Long Context Embedding Models

Last Updated on November 8, 2024 by Editorial Team

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

In a previous article, we looked at contextual retrieval from Anthropic, their context-enhancement technique for improving RAG systems. There is another technique, late chunking in long-context embedding models, which I find more interesting and potentially more significant.
Embeddings are one of the most critical components of any retrieval system, yet they are often ignored or misused. When selecting an embedding model, you need to consider two important parameters: the max tokens, which is the model's context window, and the embedding dimension, which is the size of the output vector.
You have probably seen the huge max token sizes of some of the newest embedding models, but there is one major issue with them. In a standard RAG pipeline, the size of the output embedding vector stays the same irrespective of the size of the chunk. Whether you embed five tokens or 5,000 tokens, you get a vector of exactly the same dimension. That means these models have to compress a lot of information for long input chunks.
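
To make the fixed-size point concrete, here is a minimal sketch in plain Python. The encoder is a hypothetical stand-in (random vectors instead of a real Transformer) and the dimension of 8 is an assumption chosen for readability; only the pooling step mirrors what a real pipeline does.

```python
import random

EMBEDDING_DIM = 8  # assumption for illustration; real models use far more

def fake_token_embeddings(num_tokens, dim=EMBEDDING_DIM, seed=0):
    """Hypothetical stand-in for a Transformer: one random vector per token."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]

def mean_pool(token_vectors):
    """Average the token vectors component-wise into a single vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

short_vec = mean_pool(fake_token_embeddings(5))
long_vec = mean_pool(fake_token_embeddings(5000))

# Both inputs end up as a vector of length EMBEDDING_DIM.
print(len(short_vec), len(long_vec))
```

However many tokens go in, pooling squeezes them into the same number of values, which is why long chunks lose detail.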
In most cases, you want to use smaller chunks, but that has its own problems. To understand why, consider a small paragraph about Tunisia:

If you use a sentence-level chunking strategy, each sentence becomes its own chunk. If you then embed these chunks separately, you lose contextual information. For example, “the country” refers to Tunisia, but a chunk embedded in isolation loses that link.
The contextual retrieval approach tries to fix this by summarizing the document and adding contextual information to each chunk, but there is a better approach: late chunking.

Let’s first look at how “normal” chunking works.

Normal Chunking and Embedding Process

In the traditional chunking approach, the text is first divided into smaller chunks, and then each chunk is passed through a Transformer model. The output of this process is an embedding for each individual token within the chunk. The contextual information in these token embeddings is limited to the chunk they belong to. After the token embeddings are generated, mean pooling is applied to compute one final embedding per chunk. This preserves contextual information, but only within the individual chunk; each chunk embedding is independent of the other chunks in the same document.
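
The pipeline just described can be sketched as follows. This is a toy under stated assumptions, not a real embedding model: `encode_tokens` is a hypothetical stand-in for a Transformer that blends each token's one-hot vector with the average of its input, and the two sample sentences are shortened illustrations.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer; a real model would use subword tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def encode_tokens(tokens, vocab, mix=0.5):
    """Toy stand-in for a Transformer (hypothetical, not a real model):
    each token vector blends the token's own one-hot vector with the
    average one-hot vector of everything in the same input."""
    context = [0.0] * len(vocab)
    for tok in tokens:
        context[vocab[tok]] += 1.0 / len(tokens)
    vectors = []
    for tok in tokens:
        vec = [mix * c for c in context]
        vec[vocab[tok]] += 1.0 - mix
        vectors.append(vec)
    return vectors

def mean_pool(vectors):
    """Average token vectors component-wise into one chunk embedding."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def naive_chunk_embeddings(chunks, vocab):
    """Chunk first, then encode: every chunk is embedded in isolation."""
    return [mean_pool(encode_tokens(tokenize(c), vocab)) for c in chunks]

chunks = [
    "Berlin is the capital of Germany.",
    "Its inhabitants make it a populous city.",
]
vocab = {tok: i for i, tok in
         enumerate(sorted({t for c in chunks for t in tokenize(c)}))}

embeddings = naive_chunk_embeddings(chunks, vocab)
berlin = vocab["berlin"]
print(embeddings[0][berlin] > 0)  # True: the first chunk mentions Berlin
print(embeddings[1][berlin])      # 0.0: the second chunk carries no trace of it
```

In this toy, the second chunk's embedding has exactly zero weight on “berlin”, because the encoder never saw the first sentence while embedding it.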

Late Chunking Approach

The late chunking approach reverses this process. Instead of dividing the document into chunks and then computing the embeddings, the whole text of the document is first passed through a Transformer model. This generates an embedding for each token, and these token embeddings now carry context from the entire document rather than from a single chunk. Only then is the chunking performed: the original text is divided into chunks, and the token embeddings that fall inside each chunk are mean-pooled to produce that chunk's final representation.

Since the chunking process is done at a later stage, this approach is called “late chunking”.
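
Here is a minimal sketch of that reversed order, using a toy encoder in place of a real long-context model. `encode_tokens` and the sample sentences are hypothetical. A real implementation would map chunk boundaries to token spans using the tokenizer's offsets, which this sketch approximates by tokenizing each chunk and concatenating the results.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer; a real model would use subword tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def encode_tokens(tokens, vocab, mix=0.5):
    """Toy stand-in for a Transformer (hypothetical, not a real model):
    each token vector blends the token's own one-hot vector with the
    average one-hot vector of everything in the same input."""
    context = [0.0] * len(vocab)
    for tok in tokens:
        context[vocab[tok]] += 1.0 / len(tokens)
    vectors = []
    for tok in tokens:
        vec = [mix * c for c in context]
        vec[vocab[tok]] += 1.0 - mix
        vectors.append(vec)
    return vectors

def mean_pool(vectors):
    """Average token vectors component-wise into one chunk embedding."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def late_chunk_embeddings(chunks, vocab):
    """Encode the whole document once, then pool each chunk's token span."""
    chunk_tokens = [tokenize(c) for c in chunks]
    all_tokens = [tok for toks in chunk_tokens for tok in toks]
    token_vectors = encode_tokens(all_tokens, vocab)  # full-document context
    embeddings, start = [], 0
    for toks in chunk_tokens:
        end = start + len(toks)
        embeddings.append(mean_pool(token_vectors[start:end]))
        start = end
    return embeddings

chunks = [
    "Berlin is the capital of Germany.",
    "Its inhabitants make it a populous city.",
]
vocab = {tok: i for i, tok in
         enumerate(sorted({t for c in chunks for t in tokenize(c)}))}

embeddings = late_chunk_embeddings(chunks, vocab)
berlin = vocab["berlin"]
print(len(embeddings) == len(chunks))  # True: still one vector per chunk
print(embeddings[1][berlin] > 0)       # True: the second chunk now reflects Berlin
```

The output is still one vector per chunk, but the second chunk's vector now has weight on “berlin” because its tokens were encoded alongside the first sentence.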

The best approach for retrieval

The late chunking approach is closely related to another approach called late interaction, which is the ColBERT-based approach. Late interaction is probably the best approach for retrieval quality, but it comes at the cost of storage. There, you skip the final pooling step and instead store the individual token embeddings.

A blog post from the Weaviate team shows that if you embed about 100,000 documents with the same number of embeddings, you will need about 1.6 million vectors for a naive chunking approach, which is about 5 GB. With late interaction, the ColBERT-based multi-vector representation, you would need about 2.5 terabytes instead, which is pretty huge. The reason is that you are storing an embedding for every individual token.
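
The arithmetic behind those figures can be sketched as below. Only the 100,000 documents come from the text above. The 8,000 tokens per document, 512-token chunks, 768 dimensions and float32 storage are assumptions, picked because they reproduce the quoted numbers.

```python
import math

NUM_DOCS = 100_000      # from the article
TOKENS_PER_DOC = 8_000  # assumption
CHUNK_TOKENS = 512      # assumption
DIM = 768               # assumption: embedding dimension
BYTES_PER_VALUE = 4     # assumption: float32

def storage_bytes(num_vectors):
    """Raw vector storage, ignoring index overhead and metadata."""
    return num_vectors * DIM * BYTES_PER_VALUE

chunks_per_doc = math.ceil(TOKENS_PER_DOC / CHUNK_TOKENS)
chunk_vectors = NUM_DOCS * chunks_per_doc  # naive and late chunking
token_vectors = NUM_DOCS * TOKENS_PER_DOC  # late interaction

print(f"chunk-level: {chunk_vectors:,} vectors, "
      f"{storage_bytes(chunk_vectors) / 1e9:.1f} GB")
# chunk-level: 1,600,000 vectors, 4.9 GB
print(f"token-level: {token_vectors:,} vectors, "
      f"{storage_bytes(token_vectors) / 1e12:.2f} TB")
# token-level: 800,000,000 vectors, 2.46 TB
```

Under these assumptions the gap is a factor of 500, which is simply the number of tokens per document divided by the number of chunks per document.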

The late chunking approach, on the other hand, gives you the best of both worlds. It preserves document context in the final chunk embeddings, and it has about the same storage needs as the naive chunking approach. This makes late chunking a more efficient and practical solution than late interaction or other ColBERT-based approaches, which require significantly more storage.

Results

Late chunking was introduced by Jina AI, who have their own embedding models. According to the results presented in the table, the late chunking approach, when combined with other methods, demonstrates promising performance across various benchmarks.

The choice of embedding model plays a critical role, and the results presented here may require further validation from independent sources.

Practical Implementation Guide

The folks behind this new idea have put together a really simple notebook that walks through how you can implement the late chunking approach in your own applications and pipelines. We’re not going to dive into the notebook itself, but let me give you an idea of the results.

In the notebook, they compute the similarity between the embedding of the single word “Berlin” and the embedding of each chunk. For the first sentence, which mentions “Berlin” directly, traditional and late chunking give almost identical similarities.

In the outputs below, similarity_new is the late chunking score and similarity_trad is the normal chunking score.

similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.84862185

However, the second sentence, “Its more than 3.85 million inhabitants …”, refers to Berlin indirectly without mentioning it by name. In this case, the similarity is about 0.82 with the late chunking approach, whereas it drops to about 0.71 with the traditional chunking approach.

similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.7084338

Similarly, for the third sentence, which refers to Berlin as “The city,” the similarity is high with the late chunking approach (0.85) but substantially lower with the traditional chunking approach (0.75).

similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.84980094
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.7534553
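
For readers who want to see what such a score measures, here is a minimal sketch of cosine similarity, assuming that is the metric behind the numbers above (it is the usual choice for comparing embeddings). The two-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0]))  # 1.0: same direction, length ignored
print(cosine_similarity(query, [0.0, 3.0]))  # 0.0: orthogonal, nothing shared
```

A score near 1 means the chunk embedding points in almost the same direction as the query embedding, so a drop from roughly 0.85 to 0.71 is a meaningful loss of alignment.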

The Berlin example is a simple one, but when the chunks are much larger, a naive chunking approach loses significantly more information than the late chunking approach they propose. Another point worth mentioning is that late chunking is bidirectional. Because the model looks at the entire document during embedding, a chunk retains information related to it regardless of whether that information appears before or after the chunk.

This bidirectionality makes the approach even more powerful, and it shows that long context matters for embedding models just as it does for LLMs.

Conclusion

You know, after walking through all the details and examples in this article, I have to say that the late chunking approach for long-context embedding models is really fascinating and seems to hold a lot of promise.

The key insight here is that the traditional way of chunking up documents and then separately embedding each chunk can lead to a significant loss of context and important information. By instead taking the whole document, embedding it first to capture that broad context, and then chunking it up, you’re able to preserve so much more of the relevant details.

As the examples showed, when you’re dealing with references to entities like “Berlin” that might be made both directly and indirectly, the late chunking method is able to maintain a much stronger understanding of the meaning, compared to the traditional chunking and embedding pipeline. And the fact that it’s bidirectional, able to pull in context from before and after each chunk, just makes it even more powerful.


Links

Google Colab

Late Chunking in Long-Context Embedding Models

Chunking long documents while preserving contextual information is challenging. We introduce the "Late Chunking" that…

What Late Chunking Really Is & What It's Not: Part II

Part 2 of our exploration of Late Chunking, a deep dive into why it is the best method for chunk embeddings and…

Late Chunking: Balancing Precision and Cost in Long Context Retrieval | Weaviate

Learn about Late Chunking and how it may be the right fit for balancing cost and performance in your long context…

Introducing Contextual Retrieval

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI…

GitHub – jina-ai/late-chunking: Code for explaining and evaluating late chunking (chunked pooling)

Code for explaining and evaluating late chunking (chunked pooling) – jina-ai/late-chunking
