ExtractThinker: AI Document Intelligence with LLMs
Author(s): Júlio Almeida
Originally published on Towards AI.
Introduction
It's been a long road to get here. I've been working with LLMs since October 2021 (OpenAI Codex) and immediately started working on extraction. I tried to build a solution in .NET similar to what is presented here, but creating an agnostic solution was not easy, especially without access to the useful tools available in Python.
I spent over a month migrating all the code that you'll see in this GitHub repo. This is the first version, more of a proof of concept than anything else, to be expanded over time and hopefully gain as much traction as LiteLLM or instructor.
Motivation
When using LLMs for document extraction, you sometimes get asked questions of this nature:
"Can I take this group of files and separate them according to this classification?"
"Can we just extract these fields here? I want another format."
"Can you classify them and extract a piece of information?"
You already have several tools to achieve this, like AWS Textract or Azure AI Document Intelligence, which allow you to do this with some code work on your side. They offer a range of templates and a training ecosystem so you can add your own document type, and it's worth mentioning that they are also transformer models underneath. The big issues are usually vendor lock-in and the developer cost of training the model.
Let's compare the costs of a solution built with Azure Document Intelligence alone against an alternative that pairs LLMs with Azure Document Intelligence's basic OCR. I will be using this pricing table from Azure.
Azure offers a basic "read" document type for extraction, which extracts text without fields or business logic. But it is good enough to recover structure like paragraphs, checkboxes, and tables, and that's plenty to use with GPT-3.5. The other cloud providers have similar services at similar prices, so the math should work out the same.
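To make the comparison concrete, here is a back-of-the-envelope calculation. All the prices in it are illustrative assumptions (custom extraction at $50 per 1,000 pages, "read" at $1.50 per 1,000 pages, GPT-3.5 input at $0.50 per 1M tokens, roughly 1,000 tokens per page); check the current pricing tables before relying on them.

# Back-of-the-envelope cost comparison. All prices are illustrative
# assumptions; check the current Azure and OpenAI pricing tables.
PAGES = 1_000

AZURE_CUSTOM_PER_1K_PAGES = 50.00   # assumed: custom extraction model
AZURE_READ_PER_1K_PAGES = 1.50      # assumed: basic "read" OCR
GPT35_INPUT_PER_1M_TOKENS = 0.50    # assumed: GPT-3.5 Turbo input price
TOKENS_PER_PAGE = 1_000             # rough average for a dense page

custom_cost = AZURE_CUSTOM_PER_1K_PAGES * PAGES / 1_000
llm_cost = (AZURE_READ_PER_1K_PAGES * PAGES / 1_000
            + GPT35_INPUT_PER_1M_TOKENS * PAGES * TOKENS_PER_PAGE / 1_000_000)

print(f"Custom Azure DI model:  ${custom_cost:.2f} per {PAGES} pages")
print(f"Azure 'read' + GPT-3.5: ${llm_cost:.2f} per {PAGES} pages")

Under these assumed prices, the "read"-plus-LLM pipeline comes out at around $2 per 1,000 pages versus $50 for the custom model, which is the gap that motivates this project.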
Also, these tools are not oriented toward a normal domain model, so you will need to do a lot of mapping, similar to converting SQL results into a class in an OOP language. So the best way to think about this project is as an "ORM, but for document extraction". The image below expresses this idea well:
In a traditional ORM, a database driver takes care of the mapping from database tables to classes. Similarly, ExtractThinker uses OCR and an LLM as the "driver" to map document fields to class attributes. This ORM-style interaction simplifies the process, turning unstructured document data into structured data.
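To make the analogy concrete, here is a minimal side-by-side sketch. The ORM half uses SQLAlchemy purely as an illustration; the Contract half mirrors the code example shown later in this article.

# In an ORM, the database driver fills class attributes from table columns.
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class InvoiceRow(Base):
    __tablename__ = "invoices"
    id: Mapped[int] = mapped_column(primary_key=True)
    invoice_number: Mapped[str] = mapped_column()

# In ExtractThinker, OCR plus an LLM fill class attributes from the
# document's content instead of from table columns.
from extract_thinker import Contract

class InvoiceContract(Contract):
    invoice_number: str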
Functionalities
In terms of project purpose and size, you should compare it to LiteLLM and instructor. They each solve a specific use case: LiteLLM acts as a load balancer across several LLMs, and instructor makes sure the output is parsed into pydantic models. This project draws inspiration from and, for now, relies heavily on them. The image below shows a good way to think about this project.
The project will reach into the "document intelligence" side, offering a mapper for Textract or Azure DI, to be paired with a low-cost LLM. More advanced tools, such as anti-hallucination checks and Tree of Thoughts (ToT), can also be added to the pipeline to increase the quality of the results.
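As a sketch of what that pairing could look like, the snippet below loads documents through Azure Document Intelligence and sends the result to a cheap model via LiteLLM. The loader name, its constructor arguments, and the environment variable names are assumptions based on the repo at the time of writing; verify them against the official documentation.

import os
from extract_thinker import Extractor, DocumentLoaderAzureForm

# Assumed loader name and constructor arguments; check the docs.
extractor = Extractor()
extractor.load_document_loader(
    DocumentLoaderAzureForm(
        subscription_key=os.environ["AZURE_DI_KEY"],    # assumed env var
        endpoint=os.environ["AZURE_DI_ENDPOINT"],       # assumed env var
    )
)

# Pair the "read"-level OCR output with a low-cost model via LiteLLM.
extractor.load_llm("gpt-3.5-turbo")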
Here is a list of features that you should expect:
- Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
- Customizable extraction using contract definitions.
- Asynchronous processing for efficient document handling.
- Built-in support for various document formats.
- ORM-style interaction between files and LLMs.
Code example
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

# Contract definition, based on instructor/pydantic
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

path = "some/path"

# Create the extractor
extractor = Extractor()

# Load a document loader. This example uses Tesseract;
# tesseract_path points to the Tesseract executable
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)

# Load the LLM model. Uses LiteLLM for the heavy lifting
extractor.load_llm("claude-3-haiku-20240307")

# Extract the data with the contract above
result = extractor.extract(path, InvoiceContract)

print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
Why not just LangChain?
While LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). That's the difference: LangChain is limited on the extraction side, even though extraction is now becoming a core part of their product. You can see more in this GitHub repo.
What you should expect from this project is a group of complete, tested templates that match your use case, or come quite close. This reason by itself justifies the existence of the project, and this is the way you should see it: an aggregation of tools for Document Intelligence using LLMs.
If compared to LangChain, the structure is quite similar, divided into:
- Core: Third-party projects like LiteLLM and Instructor, plus internal pieces such as the DocumentLoader, cache, classification, and abstract split code.
- Components: Implementations of the DocumentLoader for the supported OCRs (e.g., DocumentLoaderTesseract), image splitting, support for LLMs, and so on.
- Templates: A group of templates that work out of the box, each a combination of several components fulfilling a use case.
NOTE: This is likely to change soon since it's just the first version. Make sure you check the official documentation.
Use cases
- Different groups of documents in a PDF file
- Agnostic AI Document Intelligence
- Extraction from multiple document sources
- Classification (mixture of models); see the sketch after this list
- Agnostic classification of forms by image comparison
- Extracting data "parsed" to another language
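As a rough sketch of the classification use case: the snippet below asks the LLM to pick the best-matching class for a file, which can then be routed to the right extraction contract. The Classification model and the classify_from_path method are assumptions based on the repo's examples at the time of writing; check the docs for the exact API. path and tesseract_path come from the earlier example.

from extract_thinker import Extractor, Classification, DocumentLoaderTesseract

# Assumed names and signatures; verify against the official documentation.
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
extractor.load_llm("claude-3-haiku-20240307")

classifications = [
    Classification(name="Invoice",
                   description="An invoice with a number and a date"),
    Classification(name="Driver License",
                   description="A driver license with a name and a license number"),
]

# The LLM picks the classification that best matches the document.
result = extractor.classify_from_path(path, classifications)
print(result.name)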
More examples will be added to the documentation, as well as to this Medium account.
Conclusion
ExtractThinker is a library designed to bring Document Intelligence to LLMs. It is based on a previous .NET project, migrated to and reimplemented in Python. It uses an ORM-style approach to document extraction, combining OCR with LLMs for performance and agnostic usability.
As the project evolves, it is expected to offer a diverse set of completed and tested templates based on different use cases, further solidifying its position as a go-to resource for document intelligence using LLMs.
Feel free to contribute to the project, which will always remain open source. And please give it a 🌟 if you can!