Extracting Data from Unstructured Documents

Author(s): Felix Pappe

Originally published on Towards AI.

Image created by the author using gpt-image-1

Introduction

In the past, extracting specific information from documents or images with traditional methods could quickly become cumbersome and frustrating, especially when the final results strayed far from what you intended. The reasons are diverse: overly complex document layouts, improperly formatted files, or an avalanche of visual elements that machines struggle to interpret.

However, vision-enabled language models (vLMs) have come to the rescue. Over the past months and years, these models have gained ever-greater capabilities, from rough image descriptions to detailed text extraction. Notably, the extraction of complex textual information from images has seen astonishing progress. This allows rapid knowledge extraction from diverse document types without brittle, rule-based systems that break as soon as the document structure changes, and without the time-, data-, and cost-intensive training of specialized custom models.

However, there is one flaw: vLMs, like their text-only counterparts, tend to produce verbose output around the information you actually want. Phrases such as "Of course, here is the information you requested" or "This is the extracted information about XYZ" commonly surround the essential content.

You could use regular expressions together with advanced prompt engineering to constrain the vLM to output only the requested information. However, crafting the perfect prompt and a matching regex for a given task is difficult and requires much trial and error. In this blog post, I'd like to introduce a simpler approach: combining the rich capabilities of vLMs with the strict validation offered by Pydantic classes to extract exactly the desired information for your document processing pipeline.

The problem this post tackles

The example in this blog post describes a situation that every job applicant has experienced many times, I am sure of it.
After carefully and thoroughly crafting your CV, weighing every word and maybe even every letter, you upload the file to a job portal. Yet after the upload succeeds, you are asked once again to fill out the same details in standard HTML forms, copying and pasting the information from your CV into the correct fields.
Some companies attempt to autofill these fields based on the information extracted from your CV, but the results are often far from accurate or complete.

In the following code, I combine Pixtral, LangChain, and Pydantic to provide a simple solution.
The code extracts the first name, last name, phone number, email address, and birthday from the CV, if they exist. This keeps the example simple and focused on the technical aspects.
The code can be easily adapted for other use cases or extended to extract all required information from a CV.

So let us dive into the code.

Code walkthrough

Importing required libraries

In the first step, the required libraries are imported, including:

  • os, pathlib, and typing: standard Python modules for filesystem access, paths, and type annotations
  • base64 for encoding binary image data as text
  • dotenv to load environment variables from a .env file into os.environ
  • pydantic for defining a schema for the structured LLM output
  • ChatMistralAI from LangChain's Mistral integration as the vision-enabled LLM interface
  • HumanMessage from LangChain core to build the multimodal prompt
  • PIL for opening and resizing images

import os
import base64
from pathlib import Path
from typing import Optional
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import HumanMessage
from PIL import Image

Loading environment variables

Subsequently, the environment variables are loaded using load_dotenv(), and the MISTRAL_API_KEY is retrieved.

load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
if not MISTRAL_API_KEY:
    raise ValueError("MISTRAL_API_KEY not set in environment")

Defining the output schema with Pydantic

Following that, the output schema is defined using Pydantic. Pydantic is a Python library for data parsing and validation based on Python type hints. At its core, Pydantic's BaseModel offers various useful features, such as the declaration of data types (e.g., str, int, List[str], nested models) and the automatic coercion of incoming data into the required types when possible (e.g., converting "102" into 102).
Moreover, it validates whether the incoming data matches the predefined schema and raises an error if it does not. Thanks to these clearly defined schemas, the data can be quickly serialized into other formats such as JSON. Likewise, Pydantic allows fields to be documented with metadata that tools such as LLMs can inspect and utilize.
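
To make this concrete, here is a minimal, self-contained sketch of that coercion and validation behaviour (the Person model is purely illustrative and not part of the CV example):

from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

# The string "102" is coerced into the integer 102
print(Person(name="Ada", age="102").age)  # 102

# Data that cannot be coerced raises a ValidationError
try:
    Person(name="Ada", age="not a number")
except ValidationError as err:
    print(err)
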
The next code block defines the structure of the expected output using Pydantic. These are the data points that the model should extract from the CV image.

class BasicCV(BaseModel):
    first_name: Optional[str] = Field(None, description="First name")
    last_name: Optional[str] = Field(None, description="Last name")
    phone: Optional[str] = Field(None, description="Telephone number")
    email: Optional[str] = Field(None, description="Email address")
    birthday: Optional[str] = Field(None, description="Date of birth (e.g., YYYY-MM-DD)")
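
Because the schema is an ordinary Pydantic model, it can be extended for richer extractions without touching the rest of the pipeline. A hedged sketch of what such an extension could look like (the ExperienceEntry and ExtendedCV classes are illustrative and not part of the original script):

from typing import List, Optional
from pydantic import BaseModel, Field

class ExperienceEntry(BaseModel):
    employer: Optional[str] = Field(None, description="Company or organization name")
    role: Optional[str] = Field(None, description="Job title held")
    start_year: Optional[int] = Field(None, description="Year the position started")

class ExtendedCV(BasicCV):
    # Inherits the basic fields and adds richer, list-valued ones
    skills: List[str] = Field(default_factory=list, description="List of skills")
    experience: List[ExperienceEntry] = Field(default_factory=list, description="Work experience entries")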

Converting images to base64

Subsequently, the first function of the script is defined. The function encode_image_to_base64() does exactly what its name suggests: it loads an image and converts it into a base64 string, which is later passed to the vLM.

Moreover, an upscaling factor has been integrated. Although no additional information is gained by simply increasing the height and width, in my experience, the results tend to improve, especially in situations where the original resolution of the image is low.

def encode_image_to_base64(image_path: Path, upscale_factor: float = 1.0) -> str:
    with Image.open(image_path) as img:
        # Optionally upscale the image before encoding
        if upscale_factor != 1.0:
            new_size = (int(img.width * upscale_factor), int(img.height * upscale_factor))
            img = img.resize(new_size, Image.LANCZOS)
        # Serialize the (possibly resized) image to PNG bytes in memory
        from io import BytesIO
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        image_bytes = buffer.getvalue()
    return base64.b64encode(image_bytes).decode()

Processing the CV with a vision language model

Now, let's move on to the main function of this script. The process_cv() function begins by initializing the Mistral interface with the previously loaded API key. The model is then wrapped via .with_structured_output(BasicCV), to which the Pydantic model defined above is passed. If you are using a different vLM, make sure that it supports structured output, as not all vLMs do.

Afterwards, the input image is converted into a base64 (b64) string, which is then turned into a data URI (Uniform Resource Identifier) by prepending a metadata prefix to the b64 string.

Next, a simple system prompt is defined, which leaves room for improvement in more complex extraction tasks but works perfectly for this scenario.

Finally, the URI and system prompt are combined into a LangChain HumanMessage, which is passed to the structured vLM. The model then returns the requested information in the previously defined Pydantic format.

def process_cv(
    image_path: Path,
    api_key: Optional[str] = None
) -> BasicCV:
    # Initialize the vision-enabled Mistral model
    llm = ChatMistralAI(
        model="pixtral-12b-latest",
        mistral_api_key=api_key or MISTRAL_API_KEY,
    )

    # Constrain the model output to the BasicCV schema
    structured_llm = llm.with_structured_output(BasicCV)

    # Encode the CV image as a base64 data URI
    image_b64 = encode_image_to_base64(image_path)
    data_uri = f"data:image/png;base64,{image_b64}"

    system_text = (
        "Extract only the following fields from this CV: first name, last name, "
        "telephone number, email address, and birthday. Return JSON matching the schema."
    )

    # Combine the instruction text and the image into one multimodal message
    message = HumanMessage(
        content=[
            {"type": "text", "text": system_text},
            {"type": "image_url", "image_url": data_uri},
        ]
    )

    result: BasicCV = structured_llm.invoke([message])

    return result

Running the script

This function is called from the main block, where the image path is defined and the extracted information is printed.

if __name__ == "__main__":
    image_file = Path("cv-test.png")
    cv_data = process_cv(image_file)

    print(f"First Name: {cv_data.first_name}")
    print(f"Last Name: {cv_data.last_name}")
    print(f"Phone: {cv_data.phone}")
    print(f"Email: {cv_data.email}")
    print(f"Birthday: {cv_data.birthday}")

Conclusion

This simple Python script offers only a first impression of how powerful and flexible vLMs have become. Combined with Pydantic and supported by the LangChain framework, vLMs can be turned into a meaningful solution for many document processing workflows, such as application processing or invoice handling.

What experience have you had with vision Large Language Models?
Do you have other fields in mind where such a workflow might be beneficial?

Source

felix-pappe.medium.com/subscribe 🔔
www.linkedin.com/in/felix-pappe 🔗
https://felixpappe.de 🌐


Published via Towards AI
