
Unlocking the Advantages of Semantic Chunking to Supercharge Your RAG Models

Author(s): Aditya Baser

Originally published on Towards AI.

1. Introduction

1.1. What is chunking, and why do we need it?

The intuition behind chunking and how it helps in the retrieval of information

Imagine you are searching for a specific piece of information in a vast library. If the books are arranged haphazardly, some with irrelevant sections bound together and others with critical pages scattered across volumes, you'd spend a frustrating amount of time flipping through unrelated content. Now, consider a library where each book is carefully organized by topic, with coherent sections that neatly encapsulate a single idea or concept. This is the intuition behind chunking in the context of retrieval-augmented generation (RAG): it's about organizing information so it can be easily retrieved and understood.

RAG Workflow: our emphasis is on understanding chunking

Chunking refers to the process of dividing large bodies of text into smaller, self-contained segments called chunks. Each chunk is designed to encapsulate a coherent unit of information that can be efficiently stored, retrieved, and used for downstream tasks like search, indexing, or contextual input for an LLM.

1.2. What are the different types of chunking methods?

Extending the library analogy, imagine you walk into the library to find information about "The Effects of Climate Change on Marine Life." The way the books are organized will determine how easily you can find the specific information you're looking for:

1.2.1. Fixed-Length Chunking

Every book in the library is arbitrarily divided into fixed-sized sections, say, 100 pages each. No matter what the content is, each section stops at the 100-page mark. As a result, a chapter about coral bleaching might be split across two sections, leaving you scrambling to piece together the full information.

Fixed-length chunking splits the text into chunks based on a fixed token, word, or character count. While this method is simple to implement, it often splits relevant information across chunks, or packs unrelated topics into the same chunk, making retrieval less accurate.
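As a rough illustration, a minimal fixed-length splitter (here counting words, with a made-up chunk_size of 100) might look like the sketch below; note it has no awareness of sentences or topics:

def fixed_length_chunks(text: str, chunk_size: int = 100) -> list[str]:
    # naive fixed-length chunking: cut purely by word count,
    # ignoring sentence and topic boundaries entirely
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

Because each boundary falls wherever the count runs out, a discussion of coral bleaching can easily end up straddling two chunks.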

1.2.2. Recursive Chunking (Hierarchical)

The books are structured into sections, chapters, and paragraphs following their natural hierarchy. For instance, a book on climate change might have sections on global warming, rising sea levels, and marine ecosystems. However, if a section about marine life is too large, it may remain unwieldy and difficult to search through quickly.

Recursive chunking breaks text hierarchically, following natural structures such as chapters, sections, or paragraphs. While it preserves the natural structure of the document, it can still produce chunks that are too large when sections are lengthy or poorly organized.
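A minimal sketch of this idea is shown below, splitting on paragraphs first and only falling back to sentences when a piece is still too long (the function and its max_words limit are illustrative, not taken from any particular library):

def recursive_chunks(text: str, max_words: int = 300) -> list[str]:
    # short enough already: keep as a single chunk
    if len(text.split()) <= max_words:
        return [text]
    # try the largest natural boundary first (paragraphs), then sentences
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) > 1:
        return [c for p in paragraphs for c in recursive_chunks(p, max_words)]
    sentences = [s for s in text.split(". ") if s.strip()]
    if len(sentences) > 1:
        return [c for s in sentences for c in recursive_chunks(s, max_words)]
    return [text]  # nothing smaller to split on; accept the oversized piece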

1.2.3. Semantic Chunking

In this case, the books are reorganized based on meaning and topic coherence. Instead of rigidly splitting sections by length or following a strict hierarchy, every section focuses on a specific topic or concept. For example, a section might cover "The Impact of Rising Temperatures on Coral Reefs" in its entirety, regardless of length, ensuring all related content stays together. As a result, you can retrieve exactly what you need without having to sift through unrelated material.

Semantic chunking uses meaning or context to define chunk boundaries, often leveraging embeddings or similarity measures to detect where one topic ends and another begins.
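Conceptually, the boundary detection can be sketched in a few lines: embed each sentence, compare neighbors, and start a new chunk wherever similarity drops below a threshold. In the sketch below, embed_fn and the 0.75 threshold are illustrative placeholders rather than fixed choices:

import numpy as np

def semantic_chunks(sentences: list[str], embed_fn, threshold: float = 0.75) -> list[str]:
    # embed_fn is any function mapping a list of strings to a list of vectors
    embeddings = [np.asarray(e) for e in embed_fn(sentences)]
    chunks, current = [], [sentences[0]]
    for prev, nxt, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        sim = prev @ nxt / (np.linalg.norm(prev) * np.linalg.norm(nxt))
        if sim < threshold:  # low similarity suggests a topic shift: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks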

2. Semantic Chunking: 101

Semantic chunking involves breaking text into smaller, meaningful units (chunks) that retain context and meaning.

2.1. Why Semantic Chunking is Superior

Semantic chunking stands out among chunking methods because it optimizes the retrieval process for contextual relevance, precision, and user satisfaction. In retrieval-augmented generation (RAG), where the goal is to feed highly relevant and coherent information into a large language model (LLM), semantic chunking eliminates many pitfalls associated with fixed-length and hierarchical approaches.

Let’s explore the unique advantages of semantic chunking and why it is crucial for building high-performance RAG systems.

2.1.1. Context Preservation

Semantic chunking ensures that each chunk contains complete, self-contained information related to a single topic. This contrasts with fixed-length chunking, where arbitrary boundaries often split context, leading to incomplete or fragmented information retrieval. When feeding an LLM, context completeness is critical. Missing context forces the LLM to "hallucinate" or generate suboptimal answers, while semantic chunking minimizes this risk by delivering coherent inputs.

2.1.2. Improved Retrieval Precision

Semantic chunking generates chunks that are tightly focused on specific topics. This makes it easier for retrieval systems to match queries to the most relevant chunks, improving the precision of retrieval. Precise retrieval reduces the number of irrelevant chunks passed to the LLM. This saves tokens, minimizes noise, and ensures the LLM focuses only on information that directly answers the query.

2.1.3. Minimized Redundancy

Semantic chunking reduces overlap and redundancy across chunks. While some overlap is necessary for preserving context, semantic chunking ensures this is deliberate and optimized, unlike fixed-length chunking, where overlaps are arbitrary and often wasteful. RAG pipelines often must deal with token constraints. Redundancy wastes valuable token space, while semantic chunking maximizes the information density of each chunk.

3. Implementing Semantic Chunking

3.1. Loading the dataset and setting up the API key

We will use the dataset "jamescalam/ai-arxiv2", which contains research papers on artificial intelligence. These papers are often long and contain distinct sections like abstracts, methodologies, experiments, and conclusions. Chunking this dataset using semantic methods ensures we preserve context within sections and facilitate efficient retrieval for downstream tasks like summarization or question answering.

Snippet of the dataset "jamescalam/ai-arxiv2"

Semantic chunking stands out by splitting text based on meaning and context rather than arbitrary rules, ensuring each chunk is coherent and self-contained.

One of the key tools for implementing semantic chunking is the semantic_router package.

Among its core features, the semantic_router.splitters module is specifically designed for splitting text into meaningful chunks using cutting-edge semantic methods.

The semantic_router.splitters module is central to the chunking functionality. It offers three key chunking methods (consecutive_sim, cumulative_sim, and rolling_window), each catering to different document structures and use cases.

To use OpenAI's tools, you need an API key for authentication, which we securely load from a .env file using the dotenv library. This keeps your key safe and out of your code. The OpenAIEncoder is then initialized to convert text into embeddings: numerical representations of meaning and context. These embeddings are crucial for semantic chunking, enabling us to measure similarity between text segments and create coherent chunks. Make sure your API key is set up in the .env file, and the encoder is configured with a model like text-embedding-3-small for efficient and accurate embedding generation. Below is the code for this step:

from datasets import load_dataset
from dotenv import load_dotenv
from semantic_router.encoders import OpenAIEncoder
import openai
import os

# import the data
dataset = load_dataset("jamescalam/ai-arxiv2", split="train")

# securely load the OpenAI API key from a .env file
load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]

# the OpenAIEncoder is initialized for accurate embedding generation
encoder = OpenAIEncoder(name="text-embedding-3-small")

The code below uses the RollingWindowSplitter from the semantic_router package to semantically chunk the dataset. The rolling window technique creates overlapping chunks to maintain context across boundaries, making it particularly effective for NLP tasks like retrieval-augmented generation (RAG).

The rolling window splits text into chunks of a specified size (defined by window_size) with overlaps between adjacent chunks. This overlap helps preserve context from one chunk to the next, ensuring downstream models, such as large language models (LLMs), receive coherent input for processing.
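As a rough sketch of that intuition, the helper below builds overlapping windows of window_size sentences, embeds each window, and flags a boundary wherever two adjacent windows stop looking alike. It is only an illustration of the rolling-window idea, not the library's internal implementation, and embed_fn and threshold are placeholders:

import numpy as np

def rolling_window_boundaries(sentences: list[str], embed_fn,
                              window_size: int = 2, threshold: float = 0.75) -> list[int]:
    # each window ends at sentence i and reaches back up to window_size sentences
    windows = [
        " ".join(sentences[max(0, i - window_size + 1): i + 1])
        for i in range(len(sentences))
    ]
    embs = [np.asarray(e) for e in embed_fn(windows)]
    boundaries = []
    for i in range(len(embs) - 1):
        a, b = embs[i], embs[i + 1]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:  # adjacent windows diverge: mark a split before sentence i+1
            boundaries.append(i + 1)
    return boundaries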

3.2. RollingWindowSplitter Parameter Breakdown

encoder

  • The encoder generates embeddings for the text, representing its semantic meaning. These embeddings help measure similarity and guide chunking.

dynamic_threshold = False

  • What it Does: Disables automatic adjustment of the similarity threshold based on content. This means chunks will be determined solely by the fixed parameters (window_size, min_split_tokens, etc.).
  • Best Practice: Use False when you have a clear idea of your thresholds or if the dataset is consistent in structure. Use True for varied or unstructured datasets.

min_split_tokens = 100

  • What it Does: Ensures each chunk contains at least 100 tokens. This prevents overly small, uninformative chunks.
  • Best Practice: Set this based on the minimum amount of information required for your task.

max_split_tokens = 500

  • What it Does: Caps each chunk at 500 tokens to fit within token limits for downstream models (e.g., OpenAI models with token constraints).
  • Best Practice: Match this value to your LLM’s token limit, subtracting space for query tokens and prompts.

window_size = 2

  • What it Does: Sets how many sentences are grouped into the rolling window whose embeddings are compared to decide where to split. Smaller windows produce tighter chunks; larger windows preserve more context but may include unrelated content.
  • Best Practice: Adjust based on the granularity of your text (e.g., use 1–2 for short sentences, 3–5 for paragraphs).

plot_splits = True

  • What it Does: Visualizes the chunking process, showing how the text was divided into chunks. This is helpful for debugging and parameter tuning.

enable_statistics = True

  • What it Does: Outputs statistics about the chunking process, such as the number of chunks and their average size. This helps evaluate how well your chunking configuration performs.

from semantic_router.splitters import RollingWindowSplitter
from semantic_router.utils.logger import logger

logger.setLevel("WARNING")  # reduce logs from the splitter

# `config` is assumed to be a user-defined settings object holding values such as
# score_threshold, chunk_index_name, chunk_cloud, and chunk_region
encoder.score_threshold = config.score_threshold

# read the parameter breakdown above for best practices when setting these values
splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=False,
    min_split_tokens=100,
    max_split_tokens=500,
    window_size=2,
    plot_splits=False,  # set this to True to visualize the chunking
    enable_statistics=False  # set this to True to print chunking stats
)

splits = splitter([dataset["content"][0]])

The build_chunk function combines a title and a content chunk into a formatted string, where the title is prefixed with a # (indicating a heading in markdown) followed by the content. This is useful for creating easily readable and structured outputs, particularly when chunking large datasets like research papers. In the example, the title is taken from the first document in the dataset, and the function is applied to the first three chunks from the splits. By looping through these chunks, it prints them as well-organized sections, helping users see how each chunk relates to the overall document title. This approach ensures clarity and structure, making the output more comprehensible for tasks like summarization or retrieval.

def build_chunk(title: str, content: str):
    return f"# {title}\n{content}"

# we use it like:
title = dataset[0]["title"]
for s in splits[:3]:
    print("---")
    print(build_chunk(title=title, content=s.content))

The build_metadata function creates a structured metadata list for a document and its corresponding chunks. It starts by extracting document-level metadata like the ArXiv ID, title, and references, then iterates over the provided chunks (doc_splits) to assign each chunk its own metadata. For each chunk, it adds identifiers for the current chunk, the previous chunk (prechunk_id), and the next chunk (postchunk_id) to maintain contextual links without storing the full neighboring chunks, which helps save storage in systems like Pinecone. This metadata structure is particularly useful for indexing and retrieval tasks, as it combines chunk-level context with document-wide details for efficient querying and navigation.

from semantic_router.schema import DocumentSplit

def build_metadata(doc: dict, doc_splits: list[DocumentSplit]):
    # get document level metadata first
    arxiv_id = doc["id"]
    title = doc["title"]
    refs = list(doc["references"].values())
    # init split level metadata list
    metadata = []
    for i, split in enumerate(doc_splits):
        # get neighboring chunk IDs
        prechunk_id = "" if i == 0 else f"{arxiv_id}#{i-1}"
        postchunk_id = "" if i+1 == len(doc_splits) else f"{arxiv_id}#{i+1}"
        # create dict and append to metadata list
        metadata.append({
            "id": f"{arxiv_id}#{i}",
            "title": title,
            "content": split.content,
            "prechunk_id": prechunk_id,
            "postchunk_id": postchunk_id,
            "arxiv_id": arxiv_id,
            "references": refs
        })
    return metadata

metadata = build_metadata(
    doc=dataset[0],
    doc_splits=splits[:3]
)

Metadata structure
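For reference, a single entry in the returned list has roughly the following shape (all values below are made up for illustration):

{
    "id": "2401.12345#1",
    "title": "An Illustrative Paper Title",
    "content": "the text of this chunk...",
    "prechunk_id": "2401.12345#0",
    "postchunk_id": "2401.12345#2",
    "arxiv_id": "2401.12345",
    "references": ["..."]
}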

This code connects to a Pinecone instance, a vector database optimized for storing and retrieving embeddings, using an API key for authentication. It checks if the specified index (configured with config.chunk_index_name) already exists. If not, it creates a new index with a specified dimensionality (dims), which matches the size of the embeddings generated by the encoder. The index uses the dotproduct similarity metric for vector comparisons, and ServerlessSpec specifies the cloud and region (e.g., us-east-1). The code waits for the index to initialize before connecting and displaying its stats. This setup ensures that your embeddings can be stored, queried, and managed efficiently for downstream tasks like semantic search or retrieval.

To use Pinecone, you first need an API key to authenticate your connection. Head over to Pinecone’s website and sign up for a free account. Once logged in, navigate to the API Keys section in the Pinecone dashboard. Here, you’ll find an automatically generated key or the option to create a new one. Copy the key and store it securely in your environment variables file (e.g., .env) as PINECONE_API_KEY. This ensures your key remains private and can be accessed by your code without hardcoding it directly, enhancing security while enabling seamless integration.
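Assuming both keys live in the same .env file at the project root, its contents would look something like this (placeholder values only):

# .env (keep this file out of version control)
OPENAI_API_KEY=...
PINECONE_API_KEY=...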

from pinecone import Pinecone, ServerlessSpec
import time

# initialize connection to Pinecone (get an API key at app.pinecone.io)
api_key = os.environ["PINECONE_API_KEY"]

# configure the client
pc = Pinecone(api_key=api_key)

# `config` again supplies user-defined settings; us-east-1 is the free serverless region
spec = ServerlessSpec(
    cloud=config.chunk_cloud, region=config.chunk_region
)

# the index dimensionality must match the encoder's embedding size
dims = len(encoder(["some random text"])[0])

index_name = config.chunk_index_name

# check if the index already exists (it shouldn't if this is the first time)
if index_name not in pc.list_indexes().names():
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=dims,  # dimensionality of OpenAI's embed-3 models
        metric='dotproduct',
        spec=spec
    )
    # wait for the index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to the index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

This code processes the dataset in batches to create semantic chunks, embed them, and store the results in a Pinecone index. It first converts the dataset into a Pandas DataFrame (limited to 10,000 documents for efficiency) and prepares a list, full_dataset, to store all chunk metadata. The splitter is configured to suppress statistics and visual outputs for faster processing. For each document, the splitter generates chunks, and build_metadata adds identifiers and metadata. Batches of chunks (batch_size = 128) are then processed, with unique IDs assigned to each chunk, embeddings generated using the encoder, and all data uploaded to the Pinecone index using the upsert method. This approach ensures scalable and efficient processing, embedding, and storage for large datasets in Pinecone, suitable for retrieval-augmented generation and semantic search.

from tqdm.auto import tqdm

# easier to work with the dataset as a pandas dataframe
data = dataset.to_pandas().iloc[:10000]  # limit to 10k docs
# store the dataset *without* embeddings here
full_dataset = []

batch_size = 128

# adjust the splitter to not display stats and visuals
splitter.enable_statistics = False
splitter.plot_splits = False

for doc in tqdm(data.to_dict(orient="records")):
    # create splits
    splits = splitter([doc["content"]])
    # create IDs and metadata for all splits in the doc
    metadata = build_metadata(doc=doc, doc_splits=splits)
    for i in range(0, len(splits), batch_size):
        i_end = min(len(splits), i+batch_size)
        # get batch of data
        metadata_batch = metadata[i:i_end]
        full_dataset.extend(metadata_batch)
        # generate unique ids for each chunk
        ids = [m["id"] for m in metadata_batch]
        # get text content to embed
        content = [
            build_chunk(
                title=x["title"], content=x["content"]
            ) for x in metadata_batch
        ]
        # embed text
        embeds = encoder(content)
        # add to Pinecone (metadata must be batched the same way as ids and embeds)
        index.upsert(vectors=zip(ids, embeds, metadata_batch))

The query function retrieves relevant chunks from the Pinecone index based on a user's input query (text), embedding it using the same encoder used during index creation to ensure consistency in the semantic space. The function searches the index for the top k matches, where top_k=5 retrieves the 5 most similar chunks to the query based on the similarity metric (e.g., dot product, which measures alignment in the embedding space). It includes metadata for each match (include_metadata=True), such as the title, content, and IDs for the preceding (prechunk_id) and following (postchunk_id) chunks. Neighboring chunks are fetched to provide additional context, appending up to 400 characters from their edges to the current chunk. Each result is then formatted with the document's title as a heading and enriched with context for coherence, ensuring that the query response is accurate, relevant, and easy to understand.

def query(text: str):
    # embed the query with the same encoder used to build the index,
    # so the query lands in the same embedding space as the chunks
    xq = encoder([text])[0]
    matches = index.query(
        vector=xq,
        top_k=5,  # how many chunks to retrieve
        include_metadata=True  # allows us to get the metadata
    )
    chunks = []
    for m in matches["matches"]:
        content = m["metadata"]["content"]
        title = m["metadata"]["title"]
        pre = m["metadata"]["prechunk_id"]
        post = m["metadata"]["postchunk_id"]
        # fetch the neighboring chunks (IDs are empty strings at document edges)
        neighbor_ids = [i for i in (pre, post) if i]
        other_chunks = index.fetch(ids=neighbor_ids)["vectors"] if neighbor_ids else {}
        prechunk = other_chunks[pre]["metadata"]["content"] if pre in other_chunks else ""
        postchunk = other_chunks[post]["metadata"]["content"] if post in other_chunks else ""
        # append up to 400 characters of neighboring context on either side
        chunk = f"""# {title}

{prechunk[-400:]}
{content}
{postchunk[:400]}"""
        chunks.append(chunk)
    return chunks

query("what are large language models?")
Output: the five most relevant chunks retrieved from the database, ready to be fed into the LLM
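To close the loop, the retrieved chunks can be stitched into a prompt for the generation step. A minimal sketch is below; the template is just one possible format and the variable names are ours:

question = "what are large language models?"
retrieved = query(question)

# assemble a grounded prompt: instructions, retrieved context, then the question
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n\n---\n\n".join(retrieved) + "\n\n"
    f"Question: {question}\nAnswer:"
)
# `prompt` can now be passed to the LLM of your choice as the user message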



Published via Towards AI
