ExtractThinker: AI Document Intelligence with LLMs
Author(s): Júlio Almeida
Originally published on Towards AI.
Introduction
It's been a long road to get here. I've been working with LLMs since October 2021 (OpenAI Codex) and immediately started working on extraction. I tried to build a solution in .NET similar to what is presented here, but creating an agnostic solution was not easy, especially without access to the useful tools available in Python.
I spent over a month migrating all the code that you'll see in this GitHub repo. This is the first version, more of a proof of concept than anything else, to be expanded over time and hopefully gain as much traction as LiteLLM or instructor.
Motivation
When using LLMs for document extraction, you sometimes get asked questions of this nature:
"Can I take this group of files and separate them according to this classification?"
"Can we just extract these fields here? I want another format."
"Can you classify them and extract a piece of information?"
You already have several tools to achieve this, like AWS Textract or Azure AI Document Intelligence, which allow you to do this with some code work on your side. They offer a range of templates and a training ecosystem so you can add your own document type, and it's worth mentioning that they are also transformer models underneath. The big issues are usually vendor lock-in and the developer cost of training the model.
Let's compare the costs of a solution built with Azure Document Intelligence alone against an alternative that pairs LLMs with Azure Document Intelligence's basic OCR. I will be using this pricing table from Azure.
Azure offers a basic "read" document type for extraction, which extracts text without fields or business logic. But it is good enough to recover structure like paragraphs, checkboxes, and tables, and that's plenty to use with GPT-3.5. The other cloud providers have similar services at similar prices, so the math should work out the same.
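To make the comparison concrete, here is a back-of-the-envelope calculation. All the prices in it are illustrative assumptions (custom extraction at $50 per 1,000 pages, "read" at $1.50 per 1,000 pages, GPT-3.5 input at $0.50 per 1M tokens, roughly 1,000 tokens per page); check the current pricing tables before relying on them.

# Back-of-the-envelope cost comparison. All prices are illustrative
# assumptions; check the current Azure and OpenAI pricing tables.
PAGES = 1_000

AZURE_CUSTOM_PER_1K_PAGES = 50.00   # assumed: custom extraction model
AZURE_READ_PER_1K_PAGES = 1.50      # assumed: basic "read" OCR
GPT35_INPUT_PER_1M_TOKENS = 0.50    # assumed: GPT-3.5 Turbo input price
TOKENS_PER_PAGE = 1_000             # rough average for a dense page

custom_cost = AZURE_CUSTOM_PER_1K_PAGES * PAGES / 1_000
llm_cost = (AZURE_READ_PER_1K_PAGES * PAGES / 1_000
            + GPT35_INPUT_PER_1M_TOKENS * PAGES * TOKENS_PER_PAGE / 1_000_000)

print(f"Custom Azure DI model:  ${custom_cost:.2f} per {PAGES} pages")
print(f"Azure 'read' + GPT-3.5: ${llm_cost:.2f} per {PAGES} pages")

Under these assumed prices, the "read"-plus-LLM pipeline comes out at around $2 per 1,000 pages versus $50 for the custom model, which is the gap that motivates this project.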
Also, these tools are not oriented toward a normal domain model, so you will need to do a lot of mapping, similar to converting SQL results into a class in an OOP language. So the best way to think about this project is as an "ORM, but for document extraction". The image below expresses this idea well:
In a traditional ORM, a database driver takes care of the mapping from database tables to classes. Similarly, ExtractThinker uses OCR and an LLM as the "driver" to map document fields to class attributes. This ORM-style interaction simplifies the process, turning unstructured document data into structured data.
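To make the analogy concrete, here is a minimal side-by-side sketch. The ORM half uses SQLAlchemy purely as an illustration; the Contract half mirrors the code example shown later in this article.

# In an ORM, the database driver fills class attributes from table columns.
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class InvoiceRow(Base):
    __tablename__ = "invoices"
    id: Mapped[int] = mapped_column(primary_key=True)
    invoice_number: Mapped[str] = mapped_column()

# In ExtractThinker, OCR plus an LLM fill class attributes from the
# document's content instead of from table columns.
from extract_thinker import Contract

class InvoiceContract(Contract):
    invoice_number: str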
Functionalities
In terms of project purpose and size, you should compare it to LiteLLM and instructor. They each solve a specific use case: LiteLLM acts as a load balancer across several LLMs, and instructor makes sure the output is parsed into pydantic models. This project draws inspiration from and, for now, relies heavily on them. The image below shows a good way to think about this project.
The project will reach into the "document intelligence" side, offering a mapper for Textract or Azure DI, to be paired with a low-cost LLM. More advanced tools, such as anti-hallucination checks and Tree of Thoughts (ToT), can also be added to the pipeline to increase the quality of the results.
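As a sketch of what that pairing could look like, the snippet below loads documents through Azure Document Intelligence and sends the result to a cheap model via LiteLLM. The loader name, its constructor arguments, and the environment variable names are assumptions based on the repo at the time of writing; verify them against the official documentation.

import os
from extract_thinker import Extractor, DocumentLoaderAzureForm

# Assumed loader name and constructor arguments; check the docs.
extractor = Extractor()
extractor.load_document_loader(
    DocumentLoaderAzureForm(
        subscription_key=os.environ["AZURE_DI_KEY"],    # assumed env var
        endpoint=os.environ["AZURE_DI_ENDPOINT"],       # assumed env var
    )
)

# Pair the "read"-level OCR output with a low-cost model via LiteLLM.
extractor.load_llm("gpt-3.5-turbo")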
Here is a list of features that you should expect:
- Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
- Customizable extraction using contract definitions.
- Asynchronous processing for efficient document handling.
- Built-in support for various document formats.
- ORM-style interaction between files and LLMs.
Code example
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

# Contract definition, based on instructor/pydantic
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

path = "some/path"

# Create the extractor
extractor = Extractor()

# Load a document loader. This example uses Tesseract;
# tesseract_path points to the Tesseract executable
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)

# Load the LLM model. Uses LiteLLM for the heavy lifting
extractor.load_llm("claude-3-haiku-20240307")

# Extract the data with the contract above
result = extractor.extract(path, InvoiceContract)

print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
Why not just LangChain?
While LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). That's the difference: LangChain is limited on the extraction side, even though extraction is now becoming a core part of their product. You can see more in this GitHub repo.
What you should expect from this project is a group of complete, tested templates that match your use case, or come quite close. This reason by itself justifies the existence of the project, and this is the way you should see it: an aggregation of tools for Document Intelligence using LLMs.
If compared to LangChain, the structure is quite similar, divided into:
- Core: Third-party projects like LiteLLM and Instructor, plus internal pieces such as the DocumentLoader, cache, classification, and abstract split code.
- Components: Implementations of the DocumentLoader for the supported OCRs (e.g., DocumentLoaderTesseract), image splitting, support for LLMs, and so on.
- Templates: A group of templates that work out of the box, each a combination of several components fulfilling a use case.
NOTE: This is likely to change soon since it's just the first version. Make sure you check the official documentation.
Use cases
- Different groups of documents in a PDF file
- Agnostic AI Document Intelligence
- Extraction from multiple document sources
- Classification (mixture of models); see the sketch after this list
- Agnostic classification of forms by image comparison
- Extracting data "parsed" to another language
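As a rough sketch of the classification use case: the snippet below asks the LLM to pick the best-matching class for a file, which can then be routed to the right extraction contract. The Classification model and the classify_from_path method are assumptions based on the repo's examples at the time of writing; check the docs for the exact API. path and tesseract_path come from the earlier example.

from extract_thinker import Extractor, Classification, DocumentLoaderTesseract

# Assumed names and signatures; verify against the official documentation.
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
extractor.load_llm("claude-3-haiku-20240307")

classifications = [
    Classification(name="Invoice",
                   description="An invoice with a number and a date"),
    Classification(name="Driver License",
                   description="A driver license with a name and a license number"),
]

# The LLM picks the classification that best matches the document.
result = extractor.classify_from_path(path, classifications)
print(result.name)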
More examples will be added to the documentation, as well as to this Medium account.
Conclusion
ExtractThinker is a library designed to bring Document Intelligence to LLMs. It is based on a previous .NET project, migrated to and reimplemented in Python. It uses an ORM-style approach to document extraction, combining OCR with LLMs for performance and agnostic usability.
As the project evolves, it is expected to offer a diverse set of completed and tested templates based on different use cases, further solidifying its position as a go-to resource for document intelligence using LLMs.
Feel free to contribute to the project, which will always remain open source. And please give it a 🌟 if you can!