Easy Late-Chunking With Chonkie

Last Updated on February 5, 2025 by Editorial Team

Author(s): Michael Ryaboy

Originally published on Towards AI.

Image Source: https://github.com/chonkie-ai/

Late Chunking has just been released in Chonkie, a lean chunking library that already boasts over 2,000 stars on GitHub. This is a welcome update for anyone looking to integrate late chunking into their retrieval pipelines, since implementing it from the ground up can be conceptually tricky and prone to mistakes. This article breaks down what Late Chunking is, why it's essential for embedding larger or more intricate documents, and how to build it into your search pipeline using Chonkie and KDB.AI as the vector store.

What is Late Chunking?

When you have a document that spans thousands of words, encoding it into a single embedding often isn't optimal. In many scenarios, you need to retrieve smaller segments of text, and dense-vector retrieval tends to perform better when those text segments (chunks) are smaller. This is partly because embedding a whole, massive document may "over-compress" its semantics into a single vector.

Retrieval-Augmented Generation (RAG) is a prime example that benefits from splitting documents into smaller text chunks, often around 512 tokens each. In RAG, you store these chunks in a vector database and encode them with a text embedding model.
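For contrast, the naive baseline looks roughly like this (a sketch, with an arbitrary fixed-size split standing in for whatever chunker you prefer): every chunk is embedded on its own, with no visibility into the rest of the document.

# Naive chunk-then-embed baseline: each chunk is encoded independently
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = "..."  # a long document goes here
naive_chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]  # crude character split
naive_embeddings = model.encode(naive_chunks)  # one vector per chunk, no cross-chunk context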

The Lost Context Problem

The typical RAG pipeline of chunk → embed → retrieve → generate is far from perfect. Splitting text naively can inadvertently break longer contextual relationships. If crucial information is spread across multiple chunks, or a chunk requires context from the wider document, retrieving one chunk alone might not provide enough context to answer a query accurately. The chunk embeddings themselves also fail to represent each chunk's full meaning, so the correct chunks might not be retrieved in the first place.

Take, for instance, a query like:

"What is the population of Berlin?"

If an article is split sentence by sentence, one chunk might mention "Berlin," while another mentions the population figure without restating the city name. Without the context from the entire document, these fragments can't answer the query effectively, especially when resolving references like "it" or "the city." This example by Jina AI demonstrates this further:

Image Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/

Late Chunking Solution

Instead of passing each chunk individually to an embedding model, in Late Chunking:

  1. The entire text (or as much as possible) is processed by the transformer layers of your embedding model, generating token embeddings that reflect global context.
  2. Text is split into chunks, and mean pooling is applied to token embeddings within each chunk to create embeddings informed by the whole document.

This preserves document context in every chunk, ensuring the embedding captures more than just the local semantics of the individual chunk. Of course, this doesn't solve the issue of the chunk itself not having enough context. To solve this, check out my article comparing Late Chunking to Contextual Retrieval, a method popularized by Anthropic to add context to chunks with LLMs:
https://medium.com/kx-systems/late-chunking-vs-contextual-retrieval-the-math-behind-rags-context-problem-d5a26b9bbd38.
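To make the two steps above concrete, here is a minimal sketch of the idea in plain transformers code (not Chonkie's internal implementation): the whole document goes through the encoder once, and each chunk's embedding is the mean of the token embeddings that fall inside that chunk's character span. The late_chunk helper and its chunk_spans argument are illustrative names, not Chonkie APIs.

import torch
from transformers import AutoModel, AutoTokenizer

# all-MiniLM-L6-v2 mirrors this article's setup; a long-context embedding model
# is preferable in practice, since late chunking benefits from seeing more text at once.
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def late_chunk(document: str, chunk_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """chunk_spans holds (start, end) character offsets chosen by any chunking strategy."""
    # 1. Encode the full document so every token embedding reflects global context.
    encoded = tokenizer(document, return_tensors="pt",
                        return_offsets_mapping=True, truncation=True)
    offsets = encoded.pop("offset_mapping")[0]  # (num_tokens, 2) character offsets
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state[0]  # (num_tokens, hidden_dim)

    # 2. Mean-pool the token embeddings that fall inside each chunk's span.
    chunk_vectors = []
    for start, end in chunk_spans:
        in_span = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_vectors.append(token_embeddings[in_span].mean(dim=0))
    return chunk_vectors

# Toy usage: two sentence-level chunks of a short document
doc = "Berlin is the capital of Germany. The city has about 3.7 million inhabitants."
vectors = late_chunk(doc, [(0, 33), (34, len(doc))])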

In practice, late chunking reduces the number of failed retrievals and clusters chunk embeddings more tightly around their source document.

Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/

Naive vs Late Chunking Comparison

Late Embedding Process. Image By Author.

In a naive approach, each chunk is encoded independently, producing embeddings that lack context from other chunks. Late Chunking, on the other hand, creates chunk embeddings conditioned on the global context, which has been shown to improve retrieval performance and, in turn, reduce hallucinations and failed responses in RAG systems.

Image Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/

Implementation with Chonkie and KDB.AI

Image Source: KDB.AI

Here's how you can implement Late Chunking using KDB.AI as the vector store.

(Disclaimer: I'm a Developer Advocate for KDB.AI and a contributor to Chonkie.)

1. Install Dependencies and Set Up LateChunker

!pip install "chonkie[st]" kdbai-client sentence-transformers

from chonkie import LateChunker
import kdbai_client as kdbai
import pandas as pd

# Initialize the Late Chunker
chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    mode="sentence",
    chunk_size=512,
    min_sentences_per_chunk=1,
    min_characters_per_sentence=12,
)
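As a quick sanity check, you can run the chunker on a single document. This assumes, as the storage code later in this article does, that each returned chunk exposes .text and .embedding attributes:

# Quick sanity check on a single document
doc = "Berlin is the capital of Germany. Its population is about 3.7 million."
chunks = chunker(doc)

for chunk in chunks:
    print(chunk.text[:60], len(chunk.embedding))  # chunk text and embedding size (384 for MiniLM)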

2. Set Up the Vector Database

You can sign up for a free-tier KDB.AI instance at kdb.ai, which offers up to 4 GB of memory and 32 GB of storage. This is more than enough for most use cases if embeddings are stored efficiently.

# Initialize a KDB.AI session (credentials come from your KDB.AI dashboard)
session = kdbai.Session(
    api_key="your_api_key",
    endpoint="your_endpoint"
)

# Create a database and define the table schema
db = session.create_database("documents")
schema = [
    {"name": "sentences", "type": "str"},
    {"name": "vectors", "type": "float64s"},
]

# Configure an HNSW index for fast similarity search.
# dims must match the embedding model: all-MiniLM-L6-v2 outputs 384-dimensional vectors.
indexes = [{
    'type': 'hnsw',
    'name': 'hnsw_index',
    'column': 'vectors',
    'params': {'dims': 384, 'metric': "L2"},
}]

# Create the table
table = db.create_table(
    table="chunks",
    schema=schema,
    indexes=indexes
)

3. Chunk and Embed

Here's an example using Paul Graham's essays in Markdown format. We'll generate late chunks and store them in the vector database.

import requests

# Fetch two Paul Graham essays as Markdown via the Jina Reader
urls = ["www.paulgraham.com/wealth.html", "www.paulgraham.com/start.html"]
texts = [requests.get('http://r.jina.ai/' + url).text for url in urls]

# Late-chunk each document, then flatten the per-document batches into one list
batch_chunks = chunker(texts)
chunks = [chunk for batch in batch_chunks for chunk in batch]

# Store chunk text and embeddings in KDB.AI
embeddings_df = pd.DataFrame({
    "vectors": [chunk.embedding.tolist() for chunk in chunks],
    "sentences": [chunk.text for chunk in chunks]
})
table.insert(embeddings_df)

embeddings_df.head()

4. Query the Vector Store

Let's test the retrieval pipeline by embedding a search query and finding the most relevant chunks.

import sentence_transformers

# Embed the query with the same model used by the chunker
search_query = "to get rich do this"
search_embedding = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2").encode(search_query)

# Search the HNSW index for the 3 most similar chunks
table.search(vectors={'hnsw_index': [search_embedding]}, n=3)[0]['sentences']
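If you plan to reuse this lookup inside a RAG pipeline, it can be convenient to wrap the query step in a small helper. This is a sketch under the assumption, consistent with the call above, that table.search returns a list with one result DataFrame per query; the retrieve function name is illustrative, not a KDB.AI API:

# Hypothetical helper: embed a query and return the top-k chunk texts
query_model = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, k: int = 3) -> list[str]:
    query_embedding = query_model.encode(query)
    results = table.search(vectors={'hnsw_index': [query_embedding]}, n=k)
    return list(results[0]['sentences'])

print(retrieve("how to get rich", k=3))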

And we are able to get some results! They aren't ideal: the dataset is tiny, we are using a weak embedding model, and we aren't using reranking. But as the dataset grows, late chunking can deliver a significant boost in retrieval accuracy.

5. Clean Up

Remember to drop the database to save resources:

db.drop()

Conclusion

Late Chunking solves the critical issue of preserving long-distance context in retrieval pipelines. When paired with KDB.AI, you get:

  • Context-aware embeddings: Every chunk's embedding reflects the entire document.
  • Sub-100ms latency: Leveraging KDB.AI's HNSW index ensures fast retrieval.
  • Scalability: Capable of handling large-scale datasets in production.

Chonkie makes adding Late Chunking to your pipeline extremely simple. If you've struggled with building this from scratch before (like I have), this library will definitely save you a lot of time and headaches.

For more insights into advanced AI techniques, vector search, and Retrieval-Augmented Generation, follow me on LinkedIn!


Published via Towards AI
