Easy Late-Chunking With Chonkie

Last Updated on February 5, 2025 by Editorial Team

Author(s): Michael Ryaboy

Originally published on Towards AI.

Image Source: https://github.com/chonkie-ai/

Late Chunking has just been released in Chonkie, a lean chunking library that already boasts over 2,000 stars on GitHub. This is a welcome update for anyone looking to integrate late chunking into their retrieval pipelines, since implementing it from the ground up can be conceptually tricky and prone to mistakes. This article breaks down what Late Chunking is, why it's essential for embedding larger or more intricate documents, and how to build it into your search pipeline using Chonkie and KDB.AI as the vector store.

What is Late Chunking?

When you have a document that spans thousands of words, encoding it into a single embedding often isn't optimal. In many scenarios, you need to retrieve smaller segments of text, and dense-vector retrieval tends to perform better when those text segments (chunks) are smaller. This is partly because embedding a whole, massive document may "over-compress" its semantics into a single vector.

Retrieval-Augmented Generation (RAG) is a prime example that benefits from splitting documents into smaller text chunks, often around 512 tokens each. In RAG, you store these chunks in a vector database and encode them with a text embedding model.
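For contrast, the naive baseline looks roughly like this (a sketch, with an arbitrary fixed-size split standing in for whatever chunker you prefer): every chunk is embedded on its own, with no visibility into the rest of the document.

# Naive chunk-then-embed baseline: each chunk is encoded independently
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = "..."  # a long document goes here
naive_chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]  # crude character split
naive_embeddings = model.encode(naive_chunks)  # one vector per chunk, no cross-chunk context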

The Lost Context Problem

The typical RAG pipeline of chunk → embed → retrieve → generate is far from perfect. Splitting text naively can inadvertently break longer contextual relationships. If crucial information is spread across multiple chunks, or a chunk requires context from the wider document, retrieving one chunk alone might not provide enough context to answer a query accurately. The chunk embeddings themselves also fail to represent each chunk's full meaning, so the correct chunks might not be retrieved in the first place.

Take, for instance, a query like:

"What is the population of Berlin?"

If an article is split sentence by sentence, one chunk might mention "Berlin," while another mentions the population figure without restating the city name. Without the context from the entire document, these fragments can't answer the query effectively, especially when resolving references like "it" or "the city." This example by Jina AI demonstrates this further:

Image Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/

Late Chunking Solution

Instead of passing each chunk individually to an embedding model, in Late Chunking:

  1. The entire text (or as much as possible) is processed by the transformer layers of your embedding model, generating token embeddings that reflect global context.
  2. Text is split into chunks, and mean pooling is applied to token embeddings within each chunk to create embeddings informed by the whole document.

This preserves document context in every chunk, ensuring the embedding captures more than just the local semantics of the individual chunk. Of course, this doesn't solve the issue of the chunk itself not having enough context. To solve this, check out my article comparing Late Chunking to Contextual Retrieval, a method popularized by Anthropic to add context to chunks with LLMs:
https://medium.com/kx-systems/late-chunking-vs-contextual-retrieval-the-math-behind-rags-context-problem-d5a26b9bbd38.
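To make the two steps above concrete, here is a minimal sketch of the idea in plain transformers code (not Chonkie's internal implementation): the whole document goes through the encoder once, and each chunk's embedding is the mean of the token embeddings that fall inside that chunk's character span. The late_chunk helper and its chunk_spans argument are illustrative names, not Chonkie APIs.

import torch
from transformers import AutoModel, AutoTokenizer

# all-MiniLM-L6-v2 mirrors this article's setup; a long-context embedding model
# is preferable in practice, since late chunking benefits from seeing more text at once.
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def late_chunk(document: str, chunk_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """chunk_spans holds (start, end) character offsets chosen by any chunking strategy."""
    # 1. Encode the full document so every token embedding reflects global context.
    encoded = tokenizer(document, return_tensors="pt",
                        return_offsets_mapping=True, truncation=True)
    offsets = encoded.pop("offset_mapping")[0]  # (num_tokens, 2) character offsets
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state[0]  # (num_tokens, hidden_dim)

    # 2. Mean-pool the token embeddings that fall inside each chunk's span.
    chunk_vectors = []
    for start, end in chunk_spans:
        in_span = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_vectors.append(token_embeddings[in_span].mean(dim=0))
    return chunk_vectors

# Toy usage: two sentence-level chunks of a short document
doc = "Berlin is the capital of Germany. The city has about 3.7 million inhabitants."
vectors = late_chunk(doc, [(0, 33), (34, len(doc))])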

In practice, late chunking reduces the number of failed retrievals and clusters chunk embeddings more tightly around their source document.

Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/

Naive vs Late Chunking Comparison

Late Embedding Process. Image By Author.

In a naive approach, each chunk is encoded independently, producing embeddings that lack context from other chunks. Late Chunking, on the other hand, creates chunk embeddings conditioned on the global context, which has been shown to improve retrieval performance and, in turn, reduce hallucinations and failed responses in RAG systems.

Image Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/

Implementation with Chonkie and KDB.AI

Image Source: KDB.AI

Here's how you can implement Late Chunking using KDB.AI as the vector store.

(Disclaimer: I'm a Developer Advocate for KDB.AI and a contributor to Chonkie.)

1. Install Dependencies and Set Up LateChunker

!pip install "chonkie[st]" kdbai-client sentence-transformers

from chonkie import LateChunker
import kdbai_client as kdbai
import pandas as pd

# Initialize the Late Chunker
chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    mode="sentence",
    chunk_size=512,
    min_sentences_per_chunk=1,
    min_characters_per_sentence=12,
)
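As a quick sanity check, you can run the chunker on a single document. This assumes, as the storage code later in this article does, that each returned chunk exposes .text and .embedding attributes:

# Quick sanity check on a single document
doc = "Berlin is the capital of Germany. Its population is about 3.7 million."
chunks = chunker(doc)

for chunk in chunks:
    print(chunk.text[:60], len(chunk.embedding))  # chunk text and embedding size (384 for MiniLM)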

2. Set Up the Vector Database

You can sign up for a free-tier KDB.AI instance at kdb.ai, which offers up to 4 GB of memory and 32 GB of storage. This is more than enough for most use cases if embeddings are stored efficiently.

# Initialize a KDB.AI session (credentials come from your KDB.AI dashboard)
session = kdbai.Session(
    api_key="your_api_key",
    endpoint="your_endpoint"
)

# Create a database and define the table schema
db = session.create_database("documents")
schema = [
    {"name": "sentences", "type": "str"},
    {"name": "vectors", "type": "float64s"},
]

# Configure an HNSW index for fast similarity search.
# dims must match the embedding model: all-MiniLM-L6-v2 outputs 384-dimensional vectors.
indexes = [{
    'type': 'hnsw',
    'name': 'hnsw_index',
    'column': 'vectors',
    'params': {'dims': 384, 'metric': "L2"},
}]

# Create the table
table = db.create_table(
    table="chunks",
    schema=schema,
    indexes=indexes
)

3. Chunk and Embed

Here's an example using Paul Graham's essays in Markdown format. We'll generate late chunks and store them in the vector database.

import requests

# Fetch two Paul Graham essays as Markdown via the Jina Reader
urls = ["www.paulgraham.com/wealth.html", "www.paulgraham.com/start.html"]
texts = [requests.get('http://r.jina.ai/' + url).text for url in urls]

# Late-chunk each document, then flatten the per-document batches into one list
batch_chunks = chunker(texts)
chunks = [chunk for batch in batch_chunks for chunk in batch]

# Store chunk text and embeddings in KDB.AI
embeddings_df = pd.DataFrame({
    "vectors": [chunk.embedding.tolist() for chunk in chunks],
    "sentences": [chunk.text for chunk in chunks]
})
table.insert(embeddings_df)

embeddings_df.head()

4. Query the Vector Store

Let's test the retrieval pipeline by embedding a search query and finding the most relevant chunks.

import sentence_transformers

# Embed the query with the same model used by the chunker
search_query = "to get rich do this"
search_embedding = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2").encode(search_query)

# Search the HNSW index for the 3 most similar chunks
table.search(vectors={'hnsw_index': [search_embedding]}, n=3)[0]['sentences']
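If you plan to reuse this lookup inside a RAG pipeline, it can be convenient to wrap the query step in a small helper. This is a sketch under the assumption, consistent with the call above, that table.search returns a list with one result DataFrame per query; the retrieve function name is illustrative, not a KDB.AI API:

# Hypothetical helper: embed a query and return the top-k chunk texts
query_model = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, k: int = 3) -> list[str]:
    query_embedding = query_model.encode(query)
    results = table.search(vectors={'hnsw_index': [query_embedding]}, n=k)
    return list(results[0]['sentences'])

print(retrieve("how to get rich", k=3))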

And we are able to get some results! They aren't ideal: the dataset is tiny, we are using a weak embedding model, and we aren't using reranking. But as the dataset grows, late chunking can deliver a significant boost in retrieval accuracy.

5. Clean Up

Remember to drop the database to save resources:

db.drop()

Conclusion

Late Chunking solves the critical issue of preserving long-distance context in retrieval pipelines. When paired with KDB.AI, you get:

  • Context-aware embeddings: Every chunk's embedding reflects the entire document.
  • Sub-100ms latency: Leveraging KDB.AI's HNSW index ensures fast retrieval.
  • Scalability: Capable of handling large-scale datasets in production.

Chonkie makes adding Late Chunking to your pipeline extremely simple. If you've struggled with building this from scratch before (like I have), this library will definitely save you a lot of time and headaches.

For more insights into advanced AI techniques, vector search, and Retrieval-Augmented Generation, follow me on LinkedIn!


Published via Towards AI
