Mastering RAG: Precision from Table-Heavy PDFs

Last Updated on September 23, 2025 by Editorial Team

Author(s): Vicky’s Notes

Originally published on Towards AI.

I just wrapped a customer pilot where “documents” really meant PDFs stuffed with tables, footnotes, and odd layouts. The goal sounded simple: answer two kinds of questions reliably. For semantic questions like “What changed in Q4?”, the system has to find the right passage and explain it in plain language. For numeric, constrained questions like “What was the average export price (USD/tonne) in Apr-2025 for Canada?”, there’s only one acceptable outcome: the exact number, with the right unit, date, and source.

It didn’t take long to see why naive RAG falls over. The layout zigzags from multi-column prose to full-width tables to sidebars, so fixed-window chunking splits ideas in the middle. Table headers hide meaning across multiple rows, so if you don’t merge them you end up comparing “Market” to “Open” or mixing “USD/tonne” with “% change.” If you embed an 800-row table as text, you flood the index with near-duplicate numbers and bury the signal. Footnotes quietly change totals, so answers drift unless you keep provenance. And plain OCR cracks under merged cells and heavy ruling.

So we rebuilt the pipeline around one principle: find truth in the right place. We use a DOC index for discovery (where to look) and a FACTS store for precision (what the number is).

  • The DOC index holds prose chunks and compact table previews.
  • The FACTS store holds typed numeric rows written as columnar files (Parquet/Delta/Iceberg) in object storage. These files are then registered in a lakehouse catalog (Hive Metastore, Glue, Iceberg catalog), making them queryable with SQL.

This FACTS layer acts as a tool the agent queries when precision is required: measures, units, geographies, and dates are enforced there. Everything else we do (extraction, cleaning, enrichment, governance, chunking, embedding) serves that separation, so answers stay both accurate and explainable.
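To make the split concrete, here is roughly what one record in each store looks like (fields are illustrative and firmed up in Steps 1.3 and 1.4):

# Illustrative only: one DOC-index record (discovery) next to one FACTS row (precision)
doc_chunk = {
    "content": "Table 4: Export prices by market\n| Market | Export Price (USD/tonne) | ...",
    "meta": {"doc_id": "trade_report.pdf", "table_id": "t4", "page": 12, "type": "table_preview"},
}

fact_row = {
    "series_name": "Export Price (USD/tonne)",
    "value": 412.5, "unit_std": "/TONNE", "currency": "USD",
    "period_ym": 202504, "geo": "CA",
    "doc_id": "trade_report.pdf", "table_id": "t4", "page": 12,   # provenance
}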

(Figure) High-level pipeline for building a RAG system over digital PDFs with many tables.

With this separation, semantic questions (“what changed in Q4?”) route to the DOC index, while numeric/constrained questions (“what was the avg export price in Apr-2025 for Canada?”) route to the FACTS store. Over time, the same FACTS schema can power charts, trendlines, and BI dashboards, giving you not only text answers but structured analytics from the exact same source of truth.

In this article, I’ll focus on the ingestion pipeline steps for digital PDFs (not scanned) with lots of tables (I’ll leave handwritten content, graphs and images for another time 🙂).

Step 0: Evaluate first, then design

Before writing a single line of code, I forced myself to slow down and run a discovery sprint. The idea was to make sure the pipeline design matched the documents, the questions, and the constraints we’d really face. That small upfront effort saved me from endless rounds of re-chunking and patch fixes later. Here’s how I approached it:

A. Inventory & sample

  • Collect a handful of representative files across sources.
  • Include variety: tables vs prose, multiple languages, and a mix of short vs long page counts.

B. Define questions (what users will ask)

  • Semantic question example: “Summarize key drivers of Q4 revenue.”
  • Numeric/constrained question example: “What was avg export price (USD/tonne) in Apr 2025 for Canada?”

C. Define success criteria & gold set

  • Metrics: Answer accuracy, citation correctness, unit correctness, retrieval hit-rate@K, latency.
  • Acceptance thresholds.
  • A small gold set (question → expected answer + citation).
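
For example, a couple of gold-set entries can be as simple as this (values and file names are illustrative):

gold_set = [
    {
        "question": "What was the average export price (USD/tonne) in Apr-2025 for Canada?",
        "expected_answer": {"value": 412.5, "unit": "USD/tonne"},
        "citation": {"doc_id": "trade_report.pdf", "table_id": "t4", "page": 12},
        "type": "numeric",
    },
    {
        "question": "Summarize key drivers of Q4 revenue.",
        "expected_answer": "EMEA strength and pricing actions drove Q4 revenue growth.",
        "citation": {"doc_id": "annual_report.pdf", "page": 3},
        "type": "semantic",
    },
]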

D. Identify risks & constraints

  • PII or regulated fields to redact.
  • Multilingual needs.
  • Freshness and update cadence.
  • SLA expectations.

📌 Note: Run the whole pipeline from Step 1 onward under an orchestrator (Airflow, Prefect, Dagster, LangChain pipelines) so every step is reproducible, retries don’t corrupt outputs, and lineage is captured from extraction through indexing.

Step 1: Recover the structure first

Once you know what users will ask, the next step is to turn messy PDF pages into machine-usable structure. Without that, both semantic search and numeric queries will drift or fail. At this stage, I used Docling as the primary extractor for both prose and tables.

Docling is an open-source tool from IBM Research. It reads popular document formats (PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown) and exports to Markdown and JSON.

1.1 Docling for text & tables extraction

Use Docling to split each page into structured elements. At this stage the goal wasn’t accuracy of every token, but to classify what kind of content was on the page. Docling does a good job of distinguishing:

  • Prose blocks like paragraphs, headings, and lists.
  • Tables with cell coordinates and candidate headers.
def extract_pdf(pdf_path):
    # 1) Docling as the default extractor for prose + tables
    dl = docling_convert(pdf_path, table_mode="ACCURATE")           # TableFormer accurate mode
    md_text = dl.document.export_to_markdown()                      # prose/structure for DOC index
    tables = [t.export_to_dataframe() for t in dl.document.tables]  # for FACTS + previews

    # 2) Build DOC chunks
    doc_chunks = build_prose_chunks_from_markdown(md_text)          # headings→paras→sentences
    table_previews = [make_table_preview(t, meta=...) for t in tables]
    doc_chunks.extend(table_previews)

    # 3) Build FACTS rows (tidy melt, units/currency/time explicit)
    facts_rows = []
    for t, meta in zip(tables, dl.table_metadata):
        normalized = normalize_headers_and_units(t)
        facts_rows.extend(melt_to_facts(normalized, meta))          # one fact per row

    return doc_chunks, facts_rows

1.2 Promote & merge header rows (give columns meaning)

Many PDFs put units or category labels in a top header row and the actual series names in a second row. Produce one final header list where each column name already carries the right meaning (e.g., units, percent change).

Input (two header rows) example:

(Example source table image: a marks distribution by gender, with a parent header "Number of Students" spanning the "Males" and "Females" columns. Credit: BrainKart.com)

We merged the two header rows by attaching the parent label to each child and removed the totals (explained in Step 1.4): Males → Males (Number of Students), Females → Females (Number of Students). In tidy form, the result looks like this:

Marks Bin,Gender,Count
30-40,Males,8
30-40,Females,6
40-50,Males,16
40-50,Females,10

Every numeric column now carries its meaning in the name, so downstream retrieval and filtering don’t lose context.
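
Here's a minimal sketch of that header merge with pandas, assuming the extracted grid arrives as a raw DataFrame whose first two rows are the header rows (the helper name and the totals heuristic are illustrative):

import pandas as pd

def merge_two_header_rows(raw: pd.DataFrame) -> pd.DataFrame:
    # Row 0 holds the parent label (e.g., "Number of Students"), row 1 the series names
    parent = raw.iloc[0].ffill()   # forward-fill across merged cells
    child = raw.iloc[1]
    cols = [
        f"{c} ({p})" if pd.notna(p) and str(p) != str(c) else str(c)
        for p, c in zip(parent, child)
    ]
    body = raw.iloc[2:].copy()
    body.columns = cols
    # Drop totals rows so they can be recomputed on demand (Step 1.4)
    body = body[~body.iloc[:, 0].astype(str).str.contains("total", case=False, na=False)]
    return body.reset_index(drop=True)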

1.3 Create a Table Preview (for the DOC index)

A compact preview helps the retriever find the right table without embedding the whole grid. Numeric truth will come from FACTS index later.

Include:

  • Title/caption + merged headers (units baked into names).
  • 3–5 representative rows (sampled, not just the first few).
  • Metadata: table_id, page, columns, row_count, has_facts=true

Exclude:

  • Full table dumps (too noisy).
  • Base64 image bytes (unnecessary bloat).

Pseudo:

def table_preview_markdown(headers, rows, k=5):
    sample = diverse_sample(rows, k)   # e.g., first, middle, rare entity
    hdr = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = [
        "| " + " | ".join("" if v is None else str(v) for v in row) + " |"
        for row in sample
    ]
    if len(rows) > k:
        body.append(f"... ({len(rows) - k} more rows)")
    return "\n".join([hdr, sep, *body])

doc_chunk = {
    "content": f"{table_title}\n" + table_preview_markdown(merged_headers, table_rows),
    "meta": {
        "type": "table_preview",
        "table_id": table_id,
        "page": page_no,
        "columns": merged_headers,
        "row_count": len(table_rows),
        "has_facts": True
    }
}

This preview goes to DOC index. Its job is discovery. It’s intentionally small so retrieval quality stays high and costs stay low.

1.4 Make tables “analysis-ready” (for the FACTS store)

To deliver precision, convert tables into tidy, machine-usable rows stored in columnar files (e.g., Parquet, Delta, Iceberg) on object storage. Then, register those files in a lakehouse catalog (Hive Metastore, Glue, Iceberg catalog) so they’re queryable with SQL engines (Trino, Spark, Presto). This design keeps schema flexibility while still enabling fast, standards-based queries over time. It’s the single biggest lever for accuracy in numeric questions.

Transformations to apply:

  • Tidy melt (wide → long): turn month/year columns into rows, emitting one observation per row with fields like series_name, value, measure, unit, currency, period_ym, geo, product, plus provenance (doc_id, table_id, page); see the sketch after this list.
  • Skip totals: don’t hard-store grand totals; compute them dynamically when users ask via SQL or aggregation functions.
  • Schema-on-read: if you don’t know all dimensions up front, keep a flexible schema.
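
Here's a minimal sketch of the tidy melt with pandas, assuming the merged-header DataFrame from Step 1.2 (column names, paths, and IDs are illustrative):

import pandas as pd

def melt_to_facts(df: pd.DataFrame, doc_id: str, table_id: str, page: int) -> pd.DataFrame:
    facts = df.melt(
        id_vars=["Marks Bin"],     # dimension columns stay as identifiers
        var_name="series_name",    # e.g., "Males (Number of Students)"
        value_name="value",
    )
    facts["measure"] = "count"
    facts["unit"] = "students"
    # Provenance travels with every fact so answers stay citable
    facts["doc_id"], facts["table_id"], facts["page"] = doc_id, table_id, page
    return facts

# Write columnar files so a lakehouse catalog / SQL engine can query them later
facts = melt_to_facts(merged_table, doc_id="report.pdf", table_id="t1", page=3)
facts.to_parquet("facts/report_t1.parquet", index=False)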

📌 Think of the FACTS store as the agent’s numeric tool, not its semantic memory. The AI routes here for queries like “What was the avg export price (USD/tonne) in Sep-2025 for Canada?”

Example query (Trino/Athena over Parquet):

SELECT AVG(value) AS avg_price
FROM facts
WHERE geo = 'CA'
AND period_ym = 202509
AND unit_std = '/TONNE'
AND series_name = 'Export Price (USD/tonne)';

Step 2: Enrichment (business terms take over)

At this stage, enrichment attaches your company’s official meanings (taxonomies, codes, calendars, FX rates) to the data we extracted from PDFs, so search, analytics, and answers all line up. Think of it as moving fields from a flexible attributes bag into standardized columns.

  • For DOC (discovery layer): Normalize metadata fields that improve retrieval and filtering (e.g., doc_type, topic, jurisdiction, language, audience); Keep the original text untouched for embeddings and citations; don’t rewrite or replace entity mentions.
  • For FACTS (numeric layer): Normalize the fields that matter for filters and aggregations (e.g., country codes, product names, fiscal periods, currencies, units). Over time, you can expand coverage as needed.

Common enrichment examples:

  • Alias deduping: “U.S.” / “USA” / “United States” → country_code=US
  • Product synonyms: “fries” / “French fries” → product_std=FROZEN_FRIES
  • Entity resolution: “Woohoo Inc.” / “WOOHOO LTD” → supplier_master=WOOHOO_CORP
  • Fiscal time: period_ym=202509 → fiscal_year=2025, fiscal_qtr=Q3
  • Currency/unit unification: map /t, /ton, /tonne → unit_std=/TONNE; convert to base currency with stored FX rate
  • Series naming: “% change YoY”, “pct chg” → series_name=Pct Change YoY, measure=pct_change
  • Doc type / topic: “Master Agreement”, “SOW” → doc_type=contract, topic=legal
  • Jurisdiction / language: “EU”, “FR” → jurisdiction=EU, language=fr
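
A minimal sketch of applying these normalizations with lookup tables (the mappings are illustrative; in practice they come from your MDM or business glossary):

# Rule-based enrichment sketch; extend the alias tables as coverage grows
COUNTRY_ALIASES = {"u.s.": "US", "usa": "US", "united states": "US", "canada": "CA"}
UNIT_ALIASES = {"/t": "/TONNE", "/ton": "/TONNE", "/tonne": "/TONNE"}

def enrich_fact(fact: dict) -> dict:
    geo = str(fact.get("geo", "")).strip().lower()
    unit = str(fact.get("unit", "")).strip().lower()
    fact["country_code"] = COUNTRY_ALIASES.get(geo, fact.get("geo"))
    fact["unit_std"] = UNIT_ALIASES.get(unit, fact.get("unit"))
    # Derive fiscal fields from period_ym (assumes fiscal calendar == calendar year here)
    year, month = divmod(int(fact["period_ym"]), 100)
    fact["fiscal_year"] = year
    fact["fiscal_qtr"] = f"Q{(month - 1) // 3 + 1}"
    return fact

print(enrich_fact({"geo": "U.S.", "unit": "/ton", "period_ym": 202509, "value": 412.5}))
# -> country_code=US, unit_std=/TONNE, fiscal_year=2025, fiscal_qtr=Q3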

If you don’t have business terms/MDM yet, use a simple rule of thumb:

  • Will users filter by it? → Normalize it.
  • Will users aggregate it? → Normalize it and make units explicit.
  • Footnotes → store as metadata (for citation), not filters.
  • Decorative columns (row numbers, internal IDs)/irrelevant fields → skip.

Step 3: Govern & freeze the document set

Apply governance and lock a reproducible set of “approved” sources. This ensures that whatever ends up in your DOC/FACTS stores can be audited and reproduced.

  • PII/HAP policy: run detection on normalized text/tables. Actions: drop | redact | mask | quarantine.
  • Access labels: add audience (e.g., internal, external) and any row/column-level restrictions you need later at query time.
  • Provenance: ensure every artifact keeps doc_id, table_id, page, source_path, checksum.
  • Freeze the set: persist a manifest (IDs + checksums + policy version). This becomes your reproducible document set for indexing and audits.
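
A minimal sketch of freezing the manifest (paths and the policy version are illustrative):

import hashlib, json, os
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def freeze_manifest(paths: list[str], policy_version: str, out: str = "manifest.json") -> dict:
    manifest = {
        "frozen_at": datetime.now(timezone.utc).isoformat(),
        "policy_version": policy_version,
        "documents": [
            {"doc_id": os.path.basename(p), "source_path": p, "checksum": sha256_of(p)}
            for p in paths
        ],
    }
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest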

Step 4: De-dup & idempotency

Hash normalized content (and table previews) so duplicates don’t produce redundant chunks. If ingestion must stop mid-run, re-runs won’t re-process what’s already done. You’ll never re-extract, re-transform, or re-index identical content.

4.1 File level (skip unchanged files):

Instead of re-parsing every PDF, compute a file signature from size + mtime (or better, checksum) and skip unchanged files.

Pseudo:

import os

def file_signature(path):
    st = os.stat(path)
    # Cheap signature; swap in a content checksum (e.g., sha256) for stronger guarantees
    return f"{st.st_size}-{st.st_mtime_ns}-{st.st_dev}-{st.st_ino}"

4.2 Chunk level — DOC previews (dedupe + resume):

Use a deterministic _id for each DOC record (tenant + doc_id + table_id + part_no), and store a content_hash so re-runs can detect no-ops.

Pseudo:

id_doc = hash(tenant, doc_id, table_id, part_no)   # stable foreign key shape
hash_doc = hash(normalized_preview)                # content hash

if exists(id_doc) and get(id_doc).content_hash == hash_doc:
    skip_write()
else:
    upsert(
        id=id_doc,
        doc={
            "doc_id": doc_id,
            "table_id": table_id,      # <- keep FK right here
            ...
            "has_facts": True,         # handy hint for routing
            "content_hash": hash_doc
        }
    )

Note: the same stable keys you use for idempotent writes also serve as foreign keys across indexes, making retrieval deterministic and preventing brittle, NLP-only joins.

4.3 Fact level — row observations (dedupe + resume):

Hash a composite that defines one fact (the natural key), and use it as the deterministic _id for the FACTS record. On re-run, already-written facts won’t be re-inserted.

Pseudo:

id_fact = hash(tenant, doc_id, table_id, series_name, geo, period_ym, unit, currency)
hash_fact = hash(series_name, geo, period_ym, unit, currency, canonicalize(value))

if exists(id_fact) and get(id_fact).content_hash == hash_fact:
    skip_write()
else:
    upsert(
        id=id_fact,
        doc={
            "doc_id": doc_id,
            "table_id": table_id,      # FK back to DOC preview
            ...
            "content_hash": hash_fact
        }
    )

4.4 Cross-index linkage & retrieval routing (DOC preview → FACTS rows)

Every DOC preview chunk carries doc_id and table_id. On retrieval, if the preview is a top hit, jump to the corresponding FACTS.

Routing pseudo:

def retrieve(q):
    doc_hits = search_doc_previews(q, top_k=10)   # BM25 + vector + rerank
    facts_hits = search_facts_by_filters(q)       # parse geo/period/series

    if wants_numbers(q) and facts_hits:
        return synthesize_from_facts(facts_hits)

    if doc_hits:
        tid = best_table_id(doc_hits)             # from preview.meta.table_id
        rows = facts_by_table_id(tid)
        if rows:
            return synthesize_from_facts(rows)

    return synthesize_from_doc(doc_hits)          # prose-only answers

Step 5: DOC index chunking

Once the document set is governed and ready, the next challenge is chunking. If you just slice text into fixed windows, you cut ideas in half and confuse retrieval. The trick is to respect natural boundaries so chunks carry meaning, while still fitting within your embedding model’s limits.

Principles I followed:

  • Avoid fixed token windows when natural structure is available.
  • Sentence-aware chunks (with small overlaps) work best for prose.
  • Title/section-aware chunks are essential for reports.
  • Recursive splitting works well: start coarse (headings → sections), then split finer (paragraphs → sentences) only if needed.
  • Re-tune by model: different embedding models tolerate different chunk sizes, so don’t assume one size fits all.

This recursive strategy preserved semantics, which made retrieval cheaper and more accurate: the retriever could land directly in the right section rather than piecing together partial sentences.

Pseudo (Example with LangChain):

from langchain.text_splitter import (
    MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
)

# 1) Split by headings first (coarse)
headers_to_split_on = [
    ("#", "H1"), ("##", "H2"), ("###", "H3")
]
hdr_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
sections = hdr_splitter.split_text(markdown_text)   # returns a list of Document objects (page_content + metadata)

# 2) Within each section, split recursively by semantic separators (fine)
#    Order matters: try larger boundaries first, then smaller.
recursive = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],   # paragraph → line → sentence → word → chars
    chunk_size=embed_cfg.chunk_size,            # e.g., 400–700 tokens (tune per model)
    chunk_overlap=embed_cfg.chunk_overlap       # e.g., ~10–15% of size
)

doc_chunks = []
for sec in sections:
    # NOTE: Keep section title/path in metadata; prepend title to content for retrieval signal
    titled_text = f"{sec.metadata.get('H1', '')}\n{sec.page_content}".strip()
    parts = recursive.split_text(titled_text)
    for p in parts:
        doc_chunks.append({
            "content": p,
            "meta": {
                "type": "text",
                "section_path": {k: v for k, v in sec.metadata.items() if k in ("H1", "H2", "H3")},
                "page": guess_page_from_md(sec),   # if you track pages
            }
        })

Important: Chunk size is model-dependent. Re-tune if you swap the embedding model.

Step 6: Embedding Generation

Once chunks are ready, you generate embeddings. This step seems trivial, but consistency matters:

  • Use a single embedding model across the corpus. Mixing dimensions or model types means your search space is no longer comparable, and retrieval becomes unstable.
  • Re-embed on change: If you swap models or preprocessing, re-embed the whole corpus. Never mix dimensions in the same index.
  • Log everything: record model_id, dimension, and pre-processing version in metadata. That way if you re-embed later, you know exactly why. Example:
{
  "content": "Q3 revenue grew 20% YoY driven by EMEA strength...",
  "embedding": [0.021, -0.115, ...],
  "meta": {
    "doc_id": "Q3_Report.pdf",
    "chunk_id": "sec-Q3-Rev:p1",
    "model_id": "openai/text-embedding-3-large",
    "dim": 3072,
    "preproc_version": "v2.1"
  }
}
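Putting these rules together, here is a minimal embedding pass, assuming the OpenAI embeddings API (swap in whatever provider you standardize on; model and version strings are illustrative):

from openai import OpenAI

EMBED_MODEL = "text-embedding-3-large"
PREPROC_VERSION = "v2.1"

client = OpenAI()

def embed_chunks(doc_chunks: list[dict], batch_size: int = 64) -> list[dict]:
    # One model for the whole corpus; stamp the metadata needed for future re-embeds
    for i in range(0, len(doc_chunks), batch_size):
        batch = doc_chunks[i:i + batch_size]
        resp = client.embeddings.create(model=EMBED_MODEL, input=[c["content"] for c in batch])
        for chunk, item in zip(batch, resp.data):
            chunk["embedding"] = item.embedding
            chunk["meta"].update({
                "model_id": f"openai/{EMBED_MODEL}",
                "dim": len(item.embedding),
                "preproc_version": PREPROC_VERSION,
            })
    return doc_chunks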

Step 7: Retrieval Optimization

Even with solid chunks and consistent embeddings, vector similarity alone isn’t enough. Retrieval is where you balance precision vs. recall, and it’s always iterative.

  1. Filters (metadata): Let the retriever avoid drowning in irrelevant chunks by narrowing results. Useful field examples:
  • page or section to keep context tight
  • table_id when the question clearly refers to a specific grid
  • date or geo to restrict facts to the right slice
  • content_type (prose, table_preview, figure)

Example filter query:

vector_db.search(
    query_vector=q_vec,
    top_k=50,
    filter={"geo": "CA", "period_ym": 202509, "content_type": "facts"}
)

2. Re-ranking: Once you have candidate chunks, pass them through a re-ranker (e.g., Cohere Rerank, ColBERT). This step re-orders results using the actual query text + chunk content, and in my experience it substantially improves precision over vector scores alone.

Example pipeline:

# Step 1: vector DB search (recall-oriented)
hits = vector_db.search(q_vec, top_k=50, filter={...})

# Step 2: re-rank (precision-oriented)
reranked = reranker.rank(query_text, [h["content"] for h in hits])
top_chunks = reranked[:5]

3. Iteration: Continuously improve by refining chunking, normalizing metadata, and retraining embeddings/rerankers when systematic misses appear.

Step 8: Scaling responsibly

Once you’re satisfied with pilot results, scaling shifts focus to efficiency and governance:

  • Batch processing for large backlogs of files.
  • Parallel pipelines for faster throughput.
  • Automated monitoring & alerting to catch failures before users notice.
  • Performance tuning based on observed query patterns.
  • Governance baked in: preserve lineage, PII/HAP filtering, and reproducibility at scale.

Advanced features:

  • Add a text-to-SQL layer on top of the FACTS store → enables not just natural-language answers but also structured results (charts, aggregates).
SELECT period_ym, AVG(value) AS avg_price
FROM facts
WHERE geo = 'CA' AND product = 'Laptops'
GROUP BY period_ym;
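
A minimal sketch of such a text-to-SQL layer, where llm and sql_engine are placeholders for your model client and your Trino/Athena/Spark connection (the schema string mirrors the FACTS columns above):

FACTS_SCHEMA = (
    "facts(series_name, value, measure, unit_std, currency, period_ym, geo, product, "
    "doc_id, table_id, page)"
)

def answer_numeric_question(question: str, llm, sql_engine):
    prompt = (
        f"Given the table {FACTS_SCHEMA}, write a single SQL query that answers:\n"
        f"{question}\nReturn only the SQL."
    )
    sql = llm(prompt)                # thin wrapper around your chat/completion API
    rows = sql_engine.execute(sql)   # e.g., Trino/Athena client over the Parquet files
    return {"sql": sql, "rows": rows, "source": "facts"}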

The Future of Your RAG Applications

If you take one thing from this guide, let it be this: split discovery from precision.

  • DOC index → holds prose and compact table previews. It tells you where to look and supports semantic search, context retrieval, and citations.
  • FACTS store → holds tidy, typed numeric rows (in Parquet/Delta/Iceberg). It gives you the exact number with units, dates, and provenance.

Why does this separation matter? Because once you draw that line, every other stage of the pipeline has a natural home:

  • Extraction: prose flows to DOC; structured rows flow to FACTS.
  • Enrichment: normalize metadata for DOC filtering; standardize dimensions (geo, units, product codes) for FACTS precision.
  • Governance: apply PII/HAP policies consistently at the document set level; enforce access labels differently for narrative text vs numeric tables.
  • Chunking: recursive splitting keeps DOC chunks semantically coherent; FACTS tables are already row-grain and don’t need chunking.
  • Embedding: only DOC chunks (prose + table previews) get embeddings; FACTS stay numeric and are queried with SQL.
  • Retrieval: semantic questions → DOC index; numeric/constrained questions → FACTS store.

With this design, your agentic AI knows exactly which tool to reach for:

  • Vector store = knowledge base (semantic search, narrative context).
  • Data lake = ground truth (numeric answers, aggregates, BI queries).

That’s how you build RAG that doesn’t just read well, but can also count, filter, and aggregate with confidence, and that’s the foundation of explainable, enterprise-ready AI.

Additional resources:

  1. Chunk Twice, Retrieve Once: RAG Chunking Strategies Optimized for Different Content Types
  2. Design an index for RAG in Azure AI Search
