Mastering RAG: Precision from Table-Heavy PDFs

Last Updated on September 23, 2025 by Editorial Team

Author(s): Vicky’s Notes

Originally published on Towards AI.

I just wrapped a customer pilot where “documents” really meant PDFs stuffed with tables, footnotes, and odd layouts. The goal sounded simple: answer two kinds of questions reliably. For semantic questions like “What changed in Q4?”, the system has to find the right passage and explain it in plain language. For numeric, constrained questions like “What was the average export price (USD/tonne) in Apr-2025 for Canada?”, there’s only one acceptable outcome: the exact number, with the right unit, date, and source.

It didn’t take long to see why naive RAG falls over. The layout zigzags from multi-column prose to full-width tables to sidebars, so fixed-window chunking splits ideas in the middle. Table headers hide meaning across multiple rows, so if you don’t merge them you end up comparing “Market” to “Open” or mixing “USD/tonne” with “% change.” If you embed an 800-row table as text, you flood the index with near-duplicate numbers and bury the signal. Footnotes quietly change totals, so answers drift unless you keep provenance. And plain OCR cracks under merged cells and heavy ruling.

So we rebuilt the pipeline around one principle: find truth in the right place. We use a DOC index for discovery (where to look) and a FACTS store for precision (what the number is).

  • The DOC index holds prose chunks and compact table previews.
  • The FACTS store holds typed numeric rows written as columnar files (Parquet/Delta/Iceberg) in object storage. These files are then registered in a lakehouse catalog (Hive Metastore, Glue, Iceberg catalog), making them queryable with SQL.

This FACTS layer acts as a tool the agent queries when precision is required: measures, units, geographies, and dates are enforced there. Everything else we do (extraction, cleaning, enrichment, governance, chunking, embedding) serves that separation, so answers stay both accurate and explainable.
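To make the split concrete, here is roughly what one record in each store looks like (fields are illustrative and firmed up in Steps 1.3 and 1.4):

# Illustrative only: one DOC-index record (discovery) next to one FACTS row (precision)
doc_chunk = {
    "content": "Table 4: Export prices by market\n| Market | Export Price (USD/tonne) | ...",
    "meta": {"doc_id": "trade_report.pdf", "table_id": "t4", "page": 12, "type": "table_preview"},
}

fact_row = {
    "series_name": "Export Price (USD/tonne)",
    "value": 412.5, "unit_std": "/TONNE", "currency": "USD",
    "period_ym": 202504, "geo": "CA",
    "doc_id": "trade_report.pdf", "table_id": "t4", "page": 12,   # provenance
}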

(Figure) High-level pipeline for building a RAG system over digital PDFs with many tables.

With this separation, semantic questions (“what changed in Q4?”) route to the DOC index, while numeric/constrained questions (“what was the avg export price in Apr-2025 for Canada?”) route to the FACTS store. Over time, the same FACTS schema can power charts, trendlines, and BI dashboards, giving you not only text answers but structured analytics from the exact same source of truth.

In this article, I’ll focus on the ingestion pipeline steps for digital PDFs (not scanned) with lots of tables (I’ll leave handwritten content, graphs and images for another time 🙂).

Step 0: Evaluate first, then design

Before writing a single line of code, I forced myself to slow down and run a discovery sprint. The idea was to make sure the pipeline design matched the documents, the questions, and the constraints we’d really face. That small upfront effort saved me from endless rounds of re-chunking and patch fixes later. Here’s how I approached it:

A. Inventory & sample

  • Collect a handful of representative files across sources.
  • Include variety: tables vs prose, multiple languages, and a mix of short vs long page counts.

B. Define questions (what users will ask)

  • Semantic question example: “Summarize key drivers of Q4 revenue.”
  • Numeric/constrained question example: “What was avg export price (USD/tonne) in Apr 2025 for Canada?”

C. Define success criteria & gold set

  • Metrics: Answer accuracy, citation correctness, unit correctness, retrieval hit-rate@K, latency.
  • Acceptance thresholds.
  • A small gold set (question → expected answer + citation).
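
For example, a couple of gold-set entries can be as simple as this (values and file names are illustrative):

gold_set = [
    {
        "question": "What was the average export price (USD/tonne) in Apr-2025 for Canada?",
        "expected_answer": {"value": 412.5, "unit": "USD/tonne"},
        "citation": {"doc_id": "trade_report.pdf", "table_id": "t4", "page": 12},
        "type": "numeric",
    },
    {
        "question": "Summarize key drivers of Q4 revenue.",
        "expected_answer": "EMEA strength and pricing actions drove Q4 revenue growth.",
        "citation": {"doc_id": "annual_report.pdf", "page": 3},
        "type": "semantic",
    },
]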

D. Identify risks & constraints

  • PII or regulated fields to redact.
  • Multilingual needs.
  • Freshness and update cadence.
  • SLA expectations.

📌 Note: Run the whole pipeline from Step 1 onward under an orchestrator (Airflow, Prefect, Dagster, LangChain pipelines) so every step is reproducible, retries don’t corrupt outputs, and lineage is captured from extraction through indexing.

Step 1: Recover the structure first

Once you know what users will ask, the next step is to turn messy PDF pages into machine-usable structure. Without that, both semantic search and numeric queries will drift or fail. At this stage, I used Docling as the primary extractor for both prose and tables.

Docling is an open-source tool from IBM Research. It reads popular document formats (PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown) and exports to Markdown and JSON.

1.1 Docling for text & tables extraction

Use Docling to split each page into structured elements. At this stage the goal wasn’t accuracy of every token, but to classify what kind of content was on the page. Docling does a good job of distinguishing:

  • Prose blocks like paragraphs, headings, and lists.
  • Tables with cell coordinates and candidate headers.
def extract_pdf(pdf_path):
    # 1) Docling as the default extractor for prose + tables
    dl = docling_convert(pdf_path, table_mode="ACCURATE")           # TableFormer accurate mode
    md_text = dl.document.export_to_markdown()                      # prose/structure for DOC index
    tables = [t.export_to_dataframe() for t in dl.document.tables]  # for FACTS + previews

    # 2) Build DOC chunks
    doc_chunks = build_prose_chunks_from_markdown(md_text)          # headings→paras→sentences
    table_previews = [make_table_preview(t, meta=...) for t in tables]
    doc_chunks.extend(table_previews)

    # 3) Build FACTS rows (tidy melt, units/currency/time explicit)
    facts_rows = []
    for t, meta in zip(tables, dl.table_metadata):
        normalized = normalize_headers_and_units(t)
        facts_rows.extend(melt_to_facts(normalized, meta))          # one fact per row

    return doc_chunks, facts_rows

1.2 Promote & merge header rows (give columns meaning)

Many PDFs put units or category labels in a top header row and the actual series names in a second row. Produce one final header list where each column name already carries the right meaning (e.g., units, percent change).

Input (two header rows) example:

(Example source table image: a marks distribution by gender, with a parent header "Number of Students" spanning the "Males" and "Females" columns. Credit: BrainKart.com)

We merged the two header rows by attaching the parent label to each child and removed the totals (explained in Step 1.4): Males → Males (Number of Students), Females → Females (Number of Students). In tidy form, the result looks like this:

Marks Bin,Gender,Count
30-40,Males,8
30-40,Females,6
40-50,Males,16
40-50,Females,10

Every numeric column now carries its meaning in the name, so downstream retrieval and filtering don’t lose context.
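
Here's a minimal sketch of that header merge with pandas, assuming the extracted grid arrives as a raw DataFrame whose first two rows are the header rows (the helper name and the totals heuristic are illustrative):

import pandas as pd

def merge_two_header_rows(raw: pd.DataFrame) -> pd.DataFrame:
    # Row 0 holds the parent label (e.g., "Number of Students"), row 1 the series names
    parent = raw.iloc[0].ffill()   # forward-fill across merged cells
    child = raw.iloc[1]
    cols = [
        f"{c} ({p})" if pd.notna(p) and str(p) != str(c) else str(c)
        for p, c in zip(parent, child)
    ]
    body = raw.iloc[2:].copy()
    body.columns = cols
    # Drop totals rows so they can be recomputed on demand (Step 1.4)
    body = body[~body.iloc[:, 0].astype(str).str.contains("total", case=False, na=False)]
    return body.reset_index(drop=True)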

1.3 Create a Table Preview (for the DOC index)

A compact preview helps the retriever find the right table without embedding the whole grid. Numeric truth will come from FACTS index later.

Include:

  • Title/caption + merged headers (units baked into names).
  • 3–5 representative rows (sampled, not just the first few).
  • Metadata: table_id, page, columns, row_count, has_facts=true

Exclude:

  • Full table dumps (too noisy).
  • Base64 image bytes (unnecessary bloat).

Pseudo:

def table_preview_markdown(headers, rows, k=5):
    sample = diverse_sample(rows, k)   # e.g., first, middle, rare entity
    hdr = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = [
        "| " + " | ".join("" if v is None else str(v) for v in row) + " |"
        for row in sample
    ]
    if len(rows) > k:
        body.append(f"... ({len(rows) - k} more rows)")
    return "\n".join([hdr, sep, *body])

doc_chunk = {
    "content": f"{table_title}\n" + table_preview_markdown(merged_headers, table_rows),
    "meta": {
        "type": "table_preview",
        "table_id": table_id,
        "page": page_no,
        "columns": merged_headers,
        "row_count": len(table_rows),
        "has_facts": True
    }
}

This preview goes to DOC index. Its job is discovery. It’s intentionally small so retrieval quality stays high and costs stay low.

1.4 Make tables “analysis-ready” (for the FACTS store)

To deliver precision, convert tables into tidy, machine-usable rows stored in columnar files (e.g., Parquet, Delta, Iceberg) on object storage. Then, register those files in a lakehouse catalog (Hive Metastore, Glue, Iceberg catalog) so they’re queryable with SQL engines (Trino, Spark, Presto). This design keeps schema flexibility while still enabling fast, standards-based queries over time. It’s the single biggest lever for accuracy in numeric questions.

Transformations to apply:

  • Tidy melt (wide → long): turn month/year columns into rows, emitting one observation per row with fields like series_name, value, measure, unit, currency, period_ym, geo, product, plus provenance (doc_id, table_id, page); see the sketch after this list.
  • Skip totals: don’t hard-store grand totals; compute them dynamically when users ask via SQL or aggregation functions.
  • Schema-on-read: if you don’t know all dimensions up front, keep a flexible schema.
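
Here's a minimal sketch of the tidy melt with pandas, assuming the merged-header DataFrame from Step 1.2 (column names, paths, and IDs are illustrative):

import pandas as pd

def melt_to_facts(df: pd.DataFrame, doc_id: str, table_id: str, page: int) -> pd.DataFrame:
    facts = df.melt(
        id_vars=["Marks Bin"],     # dimension columns stay as identifiers
        var_name="series_name",    # e.g., "Males (Number of Students)"
        value_name="value",
    )
    facts["measure"] = "count"
    facts["unit"] = "students"
    # Provenance travels with every fact so answers stay citable
    facts["doc_id"], facts["table_id"], facts["page"] = doc_id, table_id, page
    return facts

# Write columnar files so a lakehouse catalog / SQL engine can query them later
facts = melt_to_facts(merged_table, doc_id="report.pdf", table_id="t1", page=3)
facts.to_parquet("facts/report_t1.parquet", index=False)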

📌 Think of the FACTS store as the agent’s numeric tool, not its semantic memory. The AI routes here for queries like “What was the avg export price (USD/tonne) in Sep-2025 for Canada?”

Example query (Trino/Athena over Parquet):

SELECT AVG(value) AS avg_price
FROM facts
WHERE geo = 'CA'
AND period_ym = 202509
AND unit_std = '/TONNE'
AND series_name = 'Export Price (USD/tonne)';

Step 2: Enrichment (business terms take over)

At this stage, enrichment attaches your company’s official meanings (taxonomies, codes, calendars, FX rates) to the data we extracted from PDFs, so search, analytics, and answers all line up. Think of it as moving fields from a flexible attributes bag into standardized columns.

  • For DOC (discovery layer): Normalize metadata fields that improve retrieval and filtering (e.g., doc_type, topic, jurisdiction, language, audience); Keep the original text untouched for embeddings and citations; don’t rewrite or replace entity mentions.
  • For FACTS (numeric layer): Normalize the fields that matter for filters and aggregations (e.g., country codes, product names, fiscal periods, currencies, units). Over time, you can expand coverage as needed.

Common enrichment examples:

  • Alias deduping: “U.S.” / “USA” / “United States” → country_code=US
  • Product synonyms: “fries” / “French fries” → product_std=FROZEN_FRIES
  • Entity resolution: “Woohoo Inc.” / “WOOHOO LTD” → supplier_master=WOOHOO_CORP
  • Fiscal time: period_ym=202509 → fiscal_year=2025, fiscal_qtr=Q3
  • Currency/unit unification: map /t, /ton, /tonne → unit_std=/TONNE; convert to base currency with stored FX rate
  • Series naming: “% change YoY”, “pct chg” → series_name=Pct Change YoY, measure=pct_change
  • Doc type / topic: “Master Agreement”, “SOW” → doc_type=contract, topic=legal
  • Jurisdiction / language: “EU”, “FR” → jurisdiction=EU, language=fr
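
A minimal sketch of applying these normalizations with lookup tables (the mappings are illustrative; in practice they come from your MDM or business glossary):

# Rule-based enrichment sketch; extend the alias tables as coverage grows
COUNTRY_ALIASES = {"u.s.": "US", "usa": "US", "united states": "US", "canada": "CA"}
UNIT_ALIASES = {"/t": "/TONNE", "/ton": "/TONNE", "/tonne": "/TONNE"}

def enrich_fact(fact: dict) -> dict:
    geo = str(fact.get("geo", "")).strip().lower()
    unit = str(fact.get("unit", "")).strip().lower()
    fact["country_code"] = COUNTRY_ALIASES.get(geo, fact.get("geo"))
    fact["unit_std"] = UNIT_ALIASES.get(unit, fact.get("unit"))
    # Derive fiscal fields from period_ym (assumes fiscal calendar == calendar year here)
    year, month = divmod(int(fact["period_ym"]), 100)
    fact["fiscal_year"] = year
    fact["fiscal_qtr"] = f"Q{(month - 1) // 3 + 1}"
    return fact

print(enrich_fact({"geo": "U.S.", "unit": "/ton", "period_ym": 202509, "value": 412.5}))
# -> country_code=US, unit_std=/TONNE, fiscal_year=2025, fiscal_qtr=Q3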

If you don’t have business terms/MDM yet, use a simple rule of thumb:

  • Will users filter by it? → Normalize it.
  • Will users aggregate it? → Normalize it and make units explicit.
  • Footnotes → store as metadata (for citation), not filters.
  • Decorative columns (row numbers, internal IDs)/irrelevant fields → skip.

Step 3: Govern & freeze the document set

Apply governance and lock a reproducible set of “approved” sources. This ensures that whatever ends up in your DOC/FACTS stores can be audited and reproduced.

  • PII/HAP policy: run detection on normalized text/tables. Actions: drop | redact | mask | quarantine.
  • Access labels: add audience (e.g., internal, external) and any row/column-level restrictions you need later at query time.
  • Provenance: ensure every artifact keeps doc_id, table_id, page, source_path, checksum.
  • Freeze the set: persist a manifest (IDs + checksums + policy version). This becomes your reproducible document set for indexing and audits.
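
A minimal sketch of freezing the manifest (paths and the policy version are illustrative):

import hashlib, json, os
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def freeze_manifest(paths: list[str], policy_version: str, out: str = "manifest.json") -> dict:
    manifest = {
        "frozen_at": datetime.now(timezone.utc).isoformat(),
        "policy_version": policy_version,
        "documents": [
            {"doc_id": os.path.basename(p), "source_path": p, "checksum": sha256_of(p)}
            for p in paths
        ],
    }
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest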

Step 4: De-dup & idempotency

Hash normalized content (and table previews) so duplicates don’t produce redundant chunks. If ingestion must stop mid-run, re-runs won’t re-process what’s already done. You’ll never re-extract, re-transform, or re-index identical content.

4.1 File level (skip unchanged files):

Instead of re-parsing every PDF, compute a file signature from size + mtime (or better, checksum) and skip unchanged files.

Pseudo:

import os

def file_signature(path):
    st = os.stat(path)
    # Cheap signature; swap in a content checksum (e.g., sha256) for stronger guarantees
    return f"{st.st_size}-{st.st_mtime_ns}-{st.st_dev}-{st.st_ino}"

4.2 Chunk level — DOC previews (dedupe + resume):

Use a deterministic _id for each DOC record (tenant + doc_id + table_id + part_no), and store a content_hash so re-runs can detect no-ops.

Pseudo:

id_doc = hash(tenant, doc_id, table_id, part_no)   # stable foreign key shape
hash_doc = hash(normalized_preview)                # content hash

if exists(id_doc) and get(id_doc).content_hash == hash_doc:
    skip_write()
else:
    upsert(
        id=id_doc,
        doc={
            "doc_id": doc_id,
            "table_id": table_id,      # <- keep FK right here
            ...
            "has_facts": True,         # handy hint for routing
            "content_hash": hash_doc
        }
    )

Note: the same stable keys you use for idempotent writes also serve as foreign keys across indexes, making retrieval deterministic and preventing brittle, NLP-only joins.

4.3 Fact level — row observations (dedupe + resume):

Hash a composite that defines one fact (the natural key), and use it as the deterministic _id for the FACTS record. On re-run, already-written facts won’t be re-inserted.

Pseudo:

id_fact = hash(tenant, doc_id, table_id, series_name, geo, period_ym, unit, currency)
hash_fact = hash(series_name, geo, period_ym, unit, currency, canonicalize(value))

if exists(id_fact) and get(id_fact).content_hash == hash_fact:
    skip_write()
else:
    upsert(
        id=id_fact,
        doc={
            "doc_id": doc_id,
            "table_id": table_id,      # FK back to DOC preview
            ...
            "content_hash": hash_fact
        }
    )

4.4 Cross-index linkage & retrieval routing (DOC preview → FACTS rows)

Every DOC preview chunk carries doc_id and table_id. On retrieval, if the preview is a top hit, jump to the corresponding FACTS.

Routing pseudo:

def retrieve(q):
    doc_hits = search_doc_previews(q, top_k=10)   # BM25 + vector + rerank
    facts_hits = search_facts_by_filters(q)       # parse geo/period/series

    if wants_numbers(q) and facts_hits:
        return synthesize_from_facts(facts_hits)

    if doc_hits:
        tid = best_table_id(doc_hits)             # from preview.meta.table_id
        rows = facts_by_table_id(tid)
        if rows:
            return synthesize_from_facts(rows)

    return synthesize_from_doc(doc_hits)          # prose-only answers

Step 5: DOC index chunking

Once the document set is governed and ready, the next challenge is chunking. If you just slice text into fixed windows, you cut ideas in half and confuse retrieval. The trick is to respect natural boundaries so chunks carry meaning, while still fitting within your embedding model’s limits.

Principles I followed:

  • Avoid fixed token windows when natural structure is available.
  • Sentence-aware chunks (with small overlaps) work best for prose.
  • Title/section-aware chunks are essential for reports.
  • Recursive splitting works well: start coarse (headings → sections), then split finer (paragraphs → sentences) only if needed.
  • Re-tune by model: different embedding models tolerate different chunk sizes, so don’t assume one size fits all.

This recursive strategy preserved semantics, which made retrieval cheaper and more accurate: the retriever could land directly in the right section rather than piecing together partial sentences.

Pseudo (Example with LangChain):

from langchain.text_splitter import (
    MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
)

# 1) Split by headings first (coarse)
headers_to_split_on = [
    ("#", "H1"), ("##", "H2"), ("###", "H3")
]
hdr_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
sections = hdr_splitter.split_text(markdown_text)   # returns a list of Document objects (page_content + metadata)

# 2) Within each section, split recursively by semantic separators (fine)
#    Order matters: try larger boundaries first, then smaller.
recursive = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],   # paragraph → line → sentence → word → chars
    chunk_size=embed_cfg.chunk_size,            # e.g., 400–700 tokens (tune per model)
    chunk_overlap=embed_cfg.chunk_overlap       # e.g., ~10–15% of size
)

doc_chunks = []
for sec in sections:
    # NOTE: Keep section title/path in metadata; prepend title to content for retrieval signal
    titled_text = f"{sec.metadata.get('H1', '')}\n{sec.page_content}".strip()
    parts = recursive.split_text(titled_text)
    for p in parts:
        doc_chunks.append({
            "content": p,
            "meta": {
                "type": "text",
                "section_path": {k: v for k, v in sec.metadata.items() if k in ("H1", "H2", "H3")},
                "page": guess_page_from_md(sec),   # if you track pages
            }
        })

Important: Chunk size is model-dependent. Re-tune if you swap the embedding model.

Step 6: Embedding Generation

Once chunks are ready, you generate embeddings. This step seems trivial, but consistency matters:

  • Use a single embedding model across the corpus. Mixing dimensions or model types means your search space is no longer comparable, and retrieval becomes unstable.
  • Re-embed on change: If you swap models or preprocessing, re-embed the whole corpus. Never mix dimensions in the same index.
  • Log everything: record model_id, dimension, and pre-processing version in metadata. That way if you re-embed later, you know exactly why. Example:
{
  "content": "Q3 revenue grew 20% YoY driven by EMEA strength...",
  "embedding": [0.021, -0.115, ...],
  "meta": {
    "doc_id": "Q3_Report.pdf",
    "chunk_id": "sec-Q3-Rev:p1",
    "model_id": "openai/text-embedding-3-large",
    "dim": 3072,
    "preproc_version": "v2.1"
  }
}
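Putting these rules together, here is a minimal embedding pass, assuming the OpenAI embeddings API (swap in whatever provider you standardize on; model and version strings are illustrative):

from openai import OpenAI

EMBED_MODEL = "text-embedding-3-large"
PREPROC_VERSION = "v2.1"

client = OpenAI()

def embed_chunks(doc_chunks: list[dict], batch_size: int = 64) -> list[dict]:
    # One model for the whole corpus; stamp the metadata needed for future re-embeds
    for i in range(0, len(doc_chunks), batch_size):
        batch = doc_chunks[i:i + batch_size]
        resp = client.embeddings.create(model=EMBED_MODEL, input=[c["content"] for c in batch])
        for chunk, item in zip(batch, resp.data):
            chunk["embedding"] = item.embedding
            chunk["meta"].update({
                "model_id": f"openai/{EMBED_MODEL}",
                "dim": len(item.embedding),
                "preproc_version": PREPROC_VERSION,
            })
    return doc_chunks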

Step 7: Retrieval Optimization

Even with solid chunks and consistent embeddings, vector similarity alone isn’t enough. Retrieval is where you balance precision vs. recall, and it’s always iterative.

  1. Filters (metadata): Let the retriever avoid drowning in irrelevant chunks by narrowing results. Useful field examples:
  • page or section to keep context tight
  • table_id when the question clearly refers to a specific grid
  • date or geo to restrict facts to the right slice
  • content_type (prose, table_preview, figure)

Example filter query:

vector_db.search(
    query_vector=q_vec,
    top_k=50,
    filter={"geo": "CA", "period_ym": 202509, "content_type": "facts"}
)

2. Re-ranking: Once you have candidate chunks, pass them through a re-ranker (e.g., Cohere Rerank, ColBERT). This step re-orders results using the actual query text + chunk content, and in my experience it substantially improves precision over vector scores alone.

Example pipeline:

# Step 1: vector DB search (recall-oriented)
hits = vector_db.search(q_vec, top_k=50, filter={...})

# Step 2: re-rank (precision-oriented)
reranked = reranker.rank(query_text, [h["content"] for h in hits])
top_chunks = reranked[:5]

3. Iteration: Continuously improve by refining chunking, normalizing metadata, and retraining embeddings/rerankers when systematic misses appear.

Step 8: Scaling responsibly

Once you’re satisfied with pilot results, scaling shifts focus to efficiency and governance:

  • Batch processing for large backlogs of files.
  • Parallel pipelines for faster throughput.
  • Automated monitoring & alerting to catch failures before users notice.
  • Performance tuning based on observed query patterns.
  • Governance baked in: preserve lineage, PII/HAP filtering, and reproducibility at scale.

Advanced features:

  • Add a text-to-SQL layer on top of the FACTS store → enables not just natural-language answers but also structured results (charts, aggregates).
SELECT period_ym, AVG(value) AS avg_price
FROM facts
WHERE geo = 'CA' AND product = 'Laptops'
GROUP BY period_ym;
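
A minimal sketch of such a text-to-SQL layer, where llm and sql_engine are placeholders for your model client and your Trino/Athena/Spark connection (the schema string mirrors the FACTS columns above):

FACTS_SCHEMA = (
    "facts(series_name, value, measure, unit_std, currency, period_ym, geo, product, "
    "doc_id, table_id, page)"
)

def answer_numeric_question(question: str, llm, sql_engine):
    prompt = (
        f"Given the table {FACTS_SCHEMA}, write a single SQL query that answers:\n"
        f"{question}\nReturn only the SQL."
    )
    sql = llm(prompt)                # thin wrapper around your chat/completion API
    rows = sql_engine.execute(sql)   # e.g., Trino/Athena client over the Parquet files
    return {"sql": sql, "rows": rows, "source": "facts"}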

The Future of Your RAG Applications

If you take one thing from this guide, let it be this: split discovery from precision.

  • DOC index → holds prose and compact table previews. It tells you where to look and supports semantic search, context retrieval, and citations.
  • FACTS store → holds tidy, typed numeric rows (in Parquet/Delta/Iceberg). It gives you the exact number with units, dates, and provenance.

Why does this separation matter? Because once you draw that line, every other stage of the pipeline has a natural home:

  • Extraction: prose flows to DOC; structured rows flow to FACTS.
  • Enrichment: normalize metadata for DOC filtering; standardize dimensions (geo, units, product codes) for FACTS precision.
  • Governance: apply PII/HAP policies consistently at the document set level; enforce access labels differently for narrative text vs numeric tables.
  • Chunking: recursive splitting keeps DOC chunks semantically coherent; FACTS tables are already row-grain and don’t need chunking.
  • Embedding: only DOC chunks (prose + table previews) get embeddings; FACTS stay numeric and are queried with SQL.
  • Retrieval: semantic questions → DOC index; numeric/constrained questions → FACTS store.

With this design, your agentic AI knows exactly which tool to reach for:

  • Vector store = knowledge base (semantic search, narrative context).
  • Data lake = ground truth (numeric answers, aggregates, BI queries).

That’s how you build RAG that doesn’t just read well, but can also count, filter, and aggregate with confidence, and that’s the foundation of explainable, enterprise-ready AI.

Additional resources:

  1. Chunk Twice, Retrieve Once: RAG Chunking Strategies Optimized for Different Content Types
  2. Design an index for RAG in Azure AI Search
