
ExtractThinker: AI Document Intelligence with LLMs

Last Updated on June 11, 2024 by Editorial Team

Author(s): Júlio Almeida

Originally published on Towards AI.

Introduction

It's been a long road to get here. I've been working with LLMs since October 2021 (OpenAI Codex) and immediately started working on extraction. I tried to build a solution in .NET similar to what is presented here, but creating an agnostic solution was not easy, especially without access to the useful tools available in Python.

I spent over a month migrating all the code that you'll see in this GitHub repo. This is the first version, more of a proof of concept than anything else, to be expanded over time and hopefully gain as much traction as LiteLLM or instructor.

Motivation

When using LLMs for document extraction, you often get asked questions of this nature:

"Can I take this group of files and separate them according to this classification?"

"Can we just extract these fields here? I want another format."

"Can you classify them and extract a piece of information?"

You already have several tools to achieve this, like AWS Textract or Azure AI Document Intelligence, which allow you to do this with some coding work on your side. They offer a range of templates and a training ecosystem so you can add your own document type, and it's worth mentioning that they are also transformer models underneath. The big issues are usually vendor lock-in and the developer cost of training the model.

Let's compare the costs of a solution built with Azure Document Intelligence alone and an alternative that combines LLMs with Azure Document Intelligence. I will be using this pricing table from Azure.

Comparison of prices between both techniques

Azure offers a basic "read" document type for extraction, which extracts content without fields or business logic. It is good enough to recover structure like paragraphs, checkboxes, and tables, and that's plenty to use with GPT-3.5. The other cloud providers have similar services at similar prices, so the math should work out the same.
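To make the comparison concrete, here is a back-of-the-envelope sketch. The per-page prices below are hypothetical placeholders, not Azure's or OpenAI's actual rates; substitute the numbers from the official pricing table before drawing conclusions.

```python
# Hypothetical per-page prices -- placeholders, NOT real Azure/OpenAI rates.
AZURE_CUSTOM_PER_PAGE = 0.05    # trained custom-extraction model (assumed)
AZURE_READ_PER_PAGE = 0.0015    # basic "read" layout extraction (assumed)
LLM_PER_PAGE = 0.002            # GPT-3.5-class call on the "read" output (assumed)


def custom_model_cost(pages: int) -> float:
    """Cost of extracting with a trained custom model alone."""
    return pages * AZURE_CUSTOM_PER_PAGE


def read_plus_llm_cost(pages: int) -> float:
    """Cost of the cheap 'read' layer paired with a low-cost LLM."""
    return pages * (AZURE_READ_PER_PAGE + LLM_PER_PAGE)


for pages in (1_000, 100_000):
    print(f"{pages} pages: custom={custom_model_cost(pages):.2f} "
          f"read+LLM={read_plus_llm_cost(pages):.2f}")
```

Whatever the exact rates, the structure of the math is the point: the "read" layer plus a cheap LLM scales at a fraction of the per-page price of a trained custom model.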

Also, these tools are not oriented toward a normal domain model, so you will need to do a lot of mapping, similar to converting SQL results into a class in an OOP language. So the best way to think about this project is as an "ORM, but for document extraction". The image below expresses this idea well:

Comparing ExtractThinker with an ORM

In a traditional ORM, a database driver takes care of the mapping from database tables to classes. Similarly, ExtractThinker uses OCR and an LLM as a "driver" to map document fields to class attributes. This ORM-style interaction simplifies the process, turning unstructured document data into structured objects.
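To see what the "driver" saves you, here is the manual mapping it replaces, sketched with a plain dataclass (the field names and values are illustrative, not part of the library):

```python
from dataclasses import dataclass

# Raw OCR/LLM output arrives as loose key-value data,
# much like a SQL result row (values here are made up).
raw_result = {"invoice_number": "INV-001", "invoice_date": "2024-05-31"}


@dataclass
class Invoice:
    invoice_number: str
    invoice_date: str


# Without an "ORM", the developer hand-maps every field onto the
# domain model -- the boilerplate an extraction driver absorbs.
invoice = Invoice(
    invoice_number=raw_result["invoice_number"],
    invoice_date=raw_result["invoice_date"],
)
print(invoice)
```

Two fields are tolerable; dozens of fields across many document types is exactly the repetitive mapping work an ORM-style layer exists to remove.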

Functionalities

In terms of project purpose and size, you should compare it to LiteLLM and instructor. Each of those solves a specific use case, like load balancing across several LLMs or making sure the output is parsed into a pydantic model. This project draws inspiration from them and, for now, relies heavily on both. The image below shows a good way to think about this project.

The project will reach into the "document intelligence" side, offering a mapper for Textract or Azure DI to be paired with a low-cost LLM. More advanced tools, such as anti-hallucination checks and Tree of Thoughts (ToT), can also be added to the pipeline to increase the quality of the results.
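As an illustration of the anti-hallucination idea (this is a generic sketch, not the library's actual implementation), one cheap check is verifying that every extracted value appears verbatim in the OCR text:

```python
def grounded(extracted: dict[str, str], source_text: str) -> dict[str, bool]:
    # Flag any extracted value that does not occur verbatim in the OCR
    # text -- a cheap first line of defense against hallucinated fields.
    return {field: value in source_text for field, value in extracted.items()}


ocr_text = "Invoice INV-001, issued 2024-05-31, total 1,200.00"
checks = grounded(
    {"invoice_number": "INV-001", "invoice_date": "2024-06-31"},
    ocr_text,
)
print(checks)  # invoice_date fails the check, so it is likely hallucinated
```

Verbatim matching is crude (it misses reformatted dates, for instance), but it catches the worst failure mode: values the model invented out of thin air.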

Here is a list of features that you should expect:

  • Multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
  • Customizable extraction using contract definitions.
  • Asynchronous processing for efficient document handling.
  • Built-in support for various document formats.
  • ORM-style interaction between files and LLMs.

Code example

from extract_thinker import DocumentLoaderTesseract, Extractor, Contract


# Contract definition, based on instructor/pydantic
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str


path = "some/path"

# Create the extractor
extractor = Extractor()

# Load the document loader; this example uses Tesseract
# (tesseract_path points to the Tesseract binary on your machine)
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)

# Load the LLM model; LiteLLM does the heavy lifting
extractor.load_llm("claude-3-haiku-20240307")

# Extract the data with the contract above
result = extractor.extract(path, InvoiceContract)

print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)

Why not just LangChain?

While LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). That's the difference: LangChain is limited on the extraction side, even though they are now making it a core part of their product. You can see more in this GitHub repo.

What you should expect from this project is a group of complete, tested templates matching your use case, or something quite close. This alone justifies the project's existence, and it is how you should see it: an aggregation of tools for Document Intelligence using LLMs.

Compared to LangChain, the structure is quite similar, divided into:

Core: Third-party projects like LiteLLM and Instructor, plus internal pieces such as DocumentLoader, cache, classification, and the abstract split code.

Components: Implementations of the DocumentLoader for the supported OCRs (e.g., DocumentLoaderTesseract), image splitting, support for LLMs, and so on.

Templates: A group of templates that work out of the box, each a combination of several components to fulfill a use case.

NOTE: This is likely to change soon since it's just the first version. Make sure you check the official documentation.

Use cases

  • Different groups of documents in a single PDF file
  • Agnostic AI Document Intelligence
  • Extraction from multiple document sources
  • Classification (mixture of models)
  • Agnostic classification of forms by image comparison
  • Extraction of data "parsed" into another language
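For the first use case, once each page has been classified, the splitting step reduces to grouping consecutive pages that share a label. A minimal sketch of that logic, independent of any OCR or LLM:

```python
def group_pages(page_labels: list[str]) -> list[dict]:
    # Collapse consecutive pages with the same classification into one
    # document group: ["invoice", "invoice", "contract"] -> 2 documents.
    groups: list[dict] = []
    for page, label in enumerate(page_labels):
        if groups and groups[-1]["label"] == label:
            groups[-1]["pages"].append(page)
        else:
            groups.append({"label": label, "pages": [page]})
    return groups


print(group_pages(["invoice", "invoice", "contract", "invoice"]))
```

In the real pipeline the labels would come from an LLM classification pass over each page; the grouping itself stays this simple.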

More examples will be added to the documentation and to this Medium account.

Conclusion

ExtractThinker is a library designed to bring Document Intelligence to LLMs. It is based on a previous .NET project, migrated and reimplemented in Python. It uses an ORM-style approach to document extraction, combining OCR with LLMs for performance and provider-agnostic usability.

As the project evolves, it is expected to offer a diverse set of completed and tested templates based on different use cases, further solidifying its position as a go-to resource for document intelligence using LLMs.

Feel free to contribute to the project, which will always remain open source. And please give it a 🌟 if you can!

[1] Ziwei X., Sanjay J., Mohan K., "Hallucination is Inevitable: An Innate Limitation of Large Language Models" (2024), arXiv

Instructor GitHub

LiteLLM GitHub


Published via Towards AI
