Chunking Tabular Data for RAG and Search Systems

Last Updated on September 4, 2025 by Editorial Team

Author(s): Kunal

Originally published on Towards AI.

When working with Retrieval-Augmented Generation (RAG) or search systems, we often focus on how to chunk long documents — but tables present a different kind of challenge. Unlike plain text, tabular data carries structured relationships across rows and columns, making naive chunking approaches (like splitting by rows or converting everything into text blocks) prone to losing context. The key is to design a chunking strategy that preserves both the semantics of individual rows and the structure of the table, while still being efficient for retrieval. In this post, we’ll explore practical strategies for chunking tabular data to maximize retrieval accuracy and relevance.

For this walkthrough, I used the Toxicological Reference Values PDF from Government of Canada Publications. The document presents several challenges for data extraction, including multi-header tables, empty cells, varying table orientations, and duplicate column names.

What Happened When I Tried Direct PDF Upload to ChatGPT

I also experimented with directly uploading the PDF to ChatGPT-4o and asking questions about it. While the model could surface some information, it frequently hallucinated answers, especially when handling complex tables with merged headers or empty cells. This reinforced an important lesson: without structured preprocessing and thoughtful chunking strategies, even the most advanced LLMs struggle to reliably interpret tabular data from PDFs. That realization became the foundation for why I built a dedicated chunking pipeline for tables.

From Raw Tables to Usable Data

To extract tables, I used PyMuPDF, which handled the raw parsing well, but extracting text alone wasn’t enough. The real challenge was restructuring these complex tables into a consistent, searchable format and then deciding how to chunk them effectively for RAG and search tasks.

Some of the code snippets from table preprocessing:

Image from the Toxicological Reference Values PDF

df.iat[0, 2] = df.iat[0, 1] — copies the value from the second cell to the third cell in the first row.
df.iat[0, 4] = df.iat[0, 3] — copies the value from the fourth cell to the fifth cell in the first row.
merged_header = df.iloc[0].astype(str) + "_" + df.iloc[1].astype(str) — combines the first and second rows into a single row by joining their values with an underscore, creating a merged header.

df.columns = df.iloc[0] — sets the first row as the DataFrame’s column names.
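The header-merging steps above can be put together on a small example. This is a minimal sketch, not the exact code from the pipeline: the table contents are invented placeholders, and it assumes the extractor returned a two-row header with empty cells where the PDF had merged cells.

```python
import pandas as pd

# Toy stand-in for a two-row header as a PDF extractor might return it.
# Merged cells come back empty, so "Oral" and "Inhalation" each span two columns.
# All names and values here are placeholders, not figures from the source PDF.
raw = pd.DataFrame([
    ["Substance", "Oral", "", "Inhalation", ""],
    ["Name", "Value", "Units", "Value", "Units"],
    ["Chemical A", "0.1", "mg/kg-day", "0.02", "mg/m3"],
    ["Chemical B", "0.5", "mg/kg-day", "0.10", "mg/m3"],
])

df = raw.copy()

# Fill the empty cells left behind by the merged top-level headers.
df.iat[0, 2] = df.iat[0, 1]
df.iat[0, 4] = df.iat[0, 3]

# Join the two header rows into one label per column.
merged_header = df.iloc[0].astype(str) + "_" + df.iloc[1].astype(str)

# Use the merged labels as column names, then drop the two header rows.
df.columns = merged_header.tolist()
df = df.iloc[2:].reset_index(drop=True)

print(df.columns.tolist())
```

Note that assigning df.iloc[0] to df.columns on its own leaves the header row in the data, so the sketch assigns the merged labels directly and slices the header rows off. Merging the two levels also resolves the duplicate "Value" and "Units" column names.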

Image from the Toxicological Reference Values PDF

left = pd.concat([df.iloc[:, 0], df.iloc[:, 2]], axis=0, ignore_index=True) — stacks the first and third columns vertically.
right = pd.concat([df.iloc[:, 1], df.iloc[:, 3]], axis=0, ignore_index=True) — stacks the second and fourth columns vertically.
df = pd.DataFrame({"Substance": left, "RAF Derm": right}) — creates a new DataFrame with just two columns, combining the stacked columns for easier analysis and retrieval.
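Here is the column-stacking step on a toy table, as a minimal sketch. The rows are invented placeholders, and the final dropna call is my own assumption about how to handle a right-hand pair that is shorter than the left one.

```python
import pandas as pd

# Toy stand-in for a table printed as two side-by-side (name, value) pairs.
# Names and values are placeholders, not figures from the source PDF.
df = pd.DataFrame([
    ["Chemical A", "0.1", "Chemical C", "0.03"],
    ["Chemical B", "0.2", "Chemical D", None],
])

# Stack the two name columns and the two value columns vertically.
left = pd.concat([df.iloc[:, 0], df.iloc[:, 2]], axis=0, ignore_index=True)
right = pd.concat([df.iloc[:, 1], df.iloc[:, 3]], axis=0, ignore_index=True)
stacked = pd.DataFrame({"Substance": left, "RAF Derm": right})

# Assumption: rows without a substance name are padding and can be dropped.
stacked = stacked.dropna(subset=["Substance"]).reset_index(drop=True)

print(stacked)
```

The result lists every row of the left-hand pair first, followed by the right-hand pair. After stacking, each row is a self-contained (substance, value) record, which is what makes row-level chunking possible later.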

def detect_block_orientation(block, min_text_len=5, vertical_ratio=3): ... — a helper function to detect if a text block on a PDF page is vertical or horizontal, based on its bounding box aspect ratio, ignoring very short text blocks.
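The body of that helper is not shown here, so the following is only a hypothetical sketch of how an aspect-ratio check like this could work. It is not the original implementation. The function name guess_block_orientation is mine, and it assumes each block is a tuple starting with (x0, y0, x1, y1, text), which is the shape PyMuPDF returns from page.get_text("blocks").

```python
def guess_block_orientation(block, min_text_len=5, vertical_ratio=3):
    """Classify a text block as 'vertical', 'horizontal', or None if undecidable.

    Hypothetical helper: `block` is assumed to start with
    (x0, y0, x1, y1, text), as in PyMuPDF's page.get_text("blocks").
    """
    x0, y0, x1, y1, text = block[:5]

    # Very short blocks (page numbers, stray symbols) are too noisy to judge.
    if len(text.strip()) < min_text_len:
        return None

    width = x1 - x0
    height = y1 - y0
    if width <= 0 or height <= 0:
        return None  # degenerate bounding box

    # A block much taller than it is wide is treated as rotated text.
    return "vertical" if height / width >= vertical_ratio else "horizontal"


# Hand-made bounding boxes for illustration only.
print(guess_block_orientation((10, 10, 30, 200, "Rotated header")))
print(guess_block_orientation((10, 10, 300, 30, "Normal row of text")))
print(guess_block_orientation((0, 0, 5, 50, "ab")))
```

A pure aspect-ratio test can misclassify a tall, narrow block of ordinary multi-line text, so in practice the threshold likely needs tuning per document.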

Chunking and vectorization

The process begins by reading all CSV files that match the pattern output_table*.csv, cleaning the table headers for consistency, and converting each row into a structured JSON object. Each JSON entry contains a text representation of the row along with metadata such as the table name, page number, and column names. These JSON objects are then saved for later use.

Next, the workflow demonstrates how to build a vector store by loading the generated JSON files and converting each entry into a Document suitable for vector databases. Using a sentence transformer model, embeddings are created for each row and stored in a FAISS vector store, enabling efficient similarity-based search. I wrote a small function to illustrate how to query the vector store to retrieve the most relevant table rows for a given input query, providing a practical example of a retrieval-augmented workflow for tabular PDF data.
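To make the row-level chunking idea concrete, here is a dependency-free sketch of the same flow. It is a simplification under stated assumptions: a toy bag-of-words vector stands in for the sentence transformer embeddings, a brute-force cosine search stands in for the FAISS index, and the rows, table name, and page number are invented placeholders. The helper names (row_to_chunk, embed, search) are hypothetical.

```python
import math
import re
from collections import Counter


def row_to_chunk(row, table_name, page):
    """Turn one table row (a dict) into a text chunk plus metadata."""
    # Skip empty cells so they do not add noise to the text representation.
    text = "; ".join(f"{col}: {val}" for col, val in row.items() if val not in ("", None))
    return {
        "text": text,
        "metadata": {"table": table_name, "page": page, "columns": list(row.keys())},
    }


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence transformer model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, chunks, k=2):
    """Brute-force similarity search; a stand-in for a FAISS index lookup."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


# Placeholder rows; names and values are invented for illustration.
rows = [
    {"Substance": "Chemical A", "RAF Derm": "0.1"},
    {"Substance": "Chemical B", "RAF Derm": "0.2"},
    {"Substance": "Chemical C", "RAF Derm": ""},
]
chunks = [row_to_chunk(r, table_name="output_table1", page=12) for r in rows]

top = search("RAF Derm for Chemical B", chunks, k=1)
print(top[0]["text"])
print(top[0]["metadata"])
```

Repeating the column name next to every value keeps each chunk interpretable on its own, and carrying the table name and page in the metadata lets a retrieved row be traced back to its place in the PDF.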

Snippet from the Toxicological Reference Values PDF

You can find the complete code for table extraction, header merging, column stacking, and block orientation detection on GitHub.

Note: I do not own the rights to the Toxicological Reference Values PDF; it is published by the Government of Canada. This post uses the document solely for educational and demonstration purposes.

I hope you found this guide helpful and easy to follow. If it added value to your learning journey, a clap would mean a lot and help me create more content like this. Thanks for reading!


Published via Towards AI
