Extracting Data from Unstructured Documents

Author(s): Felix Pappe

Originally published on Towards AI.

Image created by the author using gpt-image-1

Introduction

In the past, extracting specific information from documents or images with traditional methods could quickly become cumbersome and frustrating, especially when the final results strayed far from what you intended. The reasons are diverse: overly complex document layouts, improperly formatted files, or an avalanche of visual elements that machines struggle to interpret.

However, vision-enabled language models (vLMs) have come to the rescue. Over the past months and years, these models have gained ever-greater capabilities, from rough image descriptions to detailed text extraction. Notably, the extraction of complex textual information from images has seen astonishing progress. This allows rapid knowledge extraction from diverse document types without brittle, rule-based systems that break as soon as the document structure changes, and without the time-, data-, and cost-intensive training of specialized custom models.

However, there is one flaw: vLMs, like their text-only counterparts, tend to produce verbose output around the information you actually want. Phrases such as "Of course, here is the information you requested" or "This is the extracted information about XYZ" commonly surround the essential content.

You could use regular expressions together with advanced prompt engineering to constrain the vLM to output only the requested information. However, crafting the perfect prompt and a matching regex for a given task is difficult and requires much trial and error. In this blog post, I'd like to introduce a simpler approach: combining the rich capabilities of vLMs with the strict validation offered by Pydantic classes to extract exactly the desired information for your document processing pipeline.

The problem this post tackles

The example in this blog post describes a situation that every job applicant has experienced many times, I am sure of it.
After carefully and thoroughly crafting your CV, weighing every word and maybe even every letter, you upload the file to a job portal. Yet after the upload succeeds, you are asked once again to fill out the same details in standard HTML forms, copying and pasting the information from your CV into the correct fields.
Some companies attempt to autofill these fields based on the information extracted from your CV, but the results are often far from accurate or complete.

In the following code, I combine Pixtral, LangChain, and Pydantic to provide a simple solution.
The code extracts the first name, last name, phone number, email address, and birthday from the CV, if they exist. This keeps the example simple and focused on the technical aspects.
The code can be easily adapted for other use cases or extended to extract all required information from a CV.

So let us dive into the code.

Code walkthrough

Importing required libraries

In the first step, the required libraries are imported, including:

  • os, pathlib, and typing: standard Python modules for filesystem access, paths, and type annotations
  • base64 for encoding binary image data as text
  • dotenv to load environment variables from a .env file into os.environ
  • pydantic for defining a schema for the structured LLM output
  • ChatMistralAI from LangChain's Mistral integration as the vision-enabled LLM interface
  • HumanMessage from LangChain core to build the multimodal prompt
  • PIL for opening and resizing images

import os
import base64
from pathlib import Path
from typing import Optional
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import HumanMessage
from PIL import Image

Loading environment variables

Subsequently, the environment variables are loaded using load_dotenv(), and the MISTRAL_API_KEY is retrieved.

load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
if not MISTRAL_API_KEY:
    raise ValueError("MISTRAL_API_KEY not set in environment")

Defining the output schema with Pydantic

Following that, the output schema is defined using Pydantic. Pydantic is a Python library for data parsing and validation based on Python type hints. At its core, Pydantic's BaseModel offers various useful features, such as the declaration of data types (e.g., str, int, List[str], nested models) and the automatic coercion of incoming data into the required types when possible (e.g., converting "102" into 102).
Moreover, it validates whether the incoming data matches the predefined schema and raises an error if it does not. Thanks to these clearly defined schemas, the data can be quickly serialized into other formats such as JSON. Likewise, Pydantic allows fields to be documented with metadata that tools such as LLMs can inspect and utilize.
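
To make this concrete, here is a minimal, self-contained sketch of that coercion and validation behaviour (the Person model is purely illustrative and not part of the CV example):

from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

# The string "102" is coerced into the integer 102
print(Person(name="Ada", age="102").age)  # 102

# Data that cannot be coerced raises a ValidationError
try:
    Person(name="Ada", age="not a number")
except ValidationError as err:
    print(err)
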
The next code block defines the structure of the expected output using Pydantic. These are the data points that the model should extract from the CV image.

class BasicCV(BaseModel):
    first_name: Optional[str] = Field(None, description="First name")
    last_name: Optional[str] = Field(None, description="Last name")
    phone: Optional[str] = Field(None, description="Telephone number")
    email: Optional[str] = Field(None, description="Email address")
    birthday: Optional[str] = Field(None, description="Date of birth (e.g., YYYY-MM-DD)")
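
Because the schema is an ordinary Pydantic model, it can be extended for richer extractions without touching the rest of the pipeline. A hedged sketch of what such an extension could look like (the ExperienceEntry and ExtendedCV classes are illustrative and not part of the original script):

from typing import List, Optional
from pydantic import BaseModel, Field

class ExperienceEntry(BaseModel):
    employer: Optional[str] = Field(None, description="Company or organization name")
    role: Optional[str] = Field(None, description="Job title held")
    start_year: Optional[int] = Field(None, description="Year the position started")

class ExtendedCV(BasicCV):
    # Inherits the basic fields and adds richer, list-valued ones
    skills: List[str] = Field(default_factory=list, description="List of skills")
    experience: List[ExperienceEntry] = Field(default_factory=list, description="Work experience entries")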

Converting images to base64

Subsequently, the first function of the script is defined. The function encode_image_to_base64() does exactly what its name suggests: it loads an image and converts it into a base64 string, which is later passed to the vLM.

Moreover, an upscaling factor has been integrated. Although no additional information is gained by simply increasing the height and width, in my experience, the results tend to improve, especially in situations where the original resolution of the image is low.

def encode_image_to_base64(image_path: Path, upscale_factor: float = 1.0) -> str:
    with Image.open(image_path) as img:
        # Optionally upscale the image before encoding
        if upscale_factor != 1.0:
            new_size = (int(img.width * upscale_factor), int(img.height * upscale_factor))
            img = img.resize(new_size, Image.LANCZOS)
        # Serialize the (possibly resized) image to PNG bytes in memory
        from io import BytesIO
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        image_bytes = buffer.getvalue()
    return base64.b64encode(image_bytes).decode()

Processing the CV with a vision language model

Now, let's move on to the main function of this script. The process_cv() function begins by initializing the Mistral interface with the previously loaded API key. The model is then wrapped via .with_structured_output(BasicCV), to which the Pydantic model defined above is passed. If you are using a different vLM, make sure that it supports structured output, as not all vLMs do.

Afterwards, the input image is converted into a base64 (b64) string, which is then turned into a data URI (Uniform Resource Identifier) by prepending a metadata prefix to the b64 string.

Next, a simple system prompt is defined, which leaves room for improvement in more complex extraction tasks but works perfectly for this scenario.

Finally, the URI and system prompt are combined into a LangChain HumanMessage, which is passed to the structured vLM. The model then returns the requested information in the previously defined Pydantic format.

def process_cv(
    image_path: Path,
    api_key: Optional[str] = None
) -> BasicCV:
    # Initialize the vision-enabled Mistral model
    llm = ChatMistralAI(
        model="pixtral-12b-latest",
        mistral_api_key=api_key or MISTRAL_API_KEY,
    )

    # Constrain the model output to the BasicCV schema
    structured_llm = llm.with_structured_output(BasicCV)

    # Encode the CV image as a base64 data URI
    image_b64 = encode_image_to_base64(image_path)
    data_uri = f"data:image/png;base64,{image_b64}"

    system_text = (
        "Extract only the following fields from this CV: first name, last name, "
        "telephone number, email address, and birthday. Return JSON matching the schema."
    )

    # Combine the instruction text and the image into one multimodal message
    message = HumanMessage(
        content=[
            {"type": "text", "text": system_text},
            {"type": "image_url", "image_url": data_uri},
        ]
    )

    result: BasicCV = structured_llm.invoke([message])

    return result

Running the script

This function is called from the main block, where the image path is defined and the extracted information is printed.

if __name__ == "__main__":
    image_file = Path("cv-test.png")
    cv_data = process_cv(image_file)

    print(f"First Name: {cv_data.first_name}")
    print(f"Last Name: {cv_data.last_name}")
    print(f"Phone: {cv_data.phone}")
    print(f"Email: {cv_data.email}")
    print(f"Birthday: {cv_data.birthday}")

Conclusion

This simple Python script offers only a first impression of how powerful and flexible vLMs have become. Combined with Pydantic and supported by the LangChain framework, vLMs can be turned into a meaningful solution for many document processing workflows, such as application processing or invoice handling.

What experience have you had with vision Large Language Models?
Do you have other fields in mind where such a workflow might be beneficial?

Source

felix-pappe.medium.com/subscribe 🔔
www.linkedin.com/in/felix-pappe 🔗
https://felixpappe.de 🌐


Published via Towards AI
