Mastering Document Intelligence: Navigate LLM Hallucinations with the LLM Challenge Framework!
Last Updated on January 6, 2025 by Editorial Team
Author(s): Dr. Sreeram Mullankandy
Originally published on Towards AI.
Picture this: Your company just processed 10,000 insurance claims using LLMs. Everything seems efficient until you discover that 5% of the processed data contains subtle but critical errors. That's 500 potentially problematic cases that could lead to incorrect payouts, compliance issues, or worse: loss of customer trust.
Welcome to the real world of AI-powered document processing, where hallucinations (an LLM's tendency to generate plausible but incorrect information) pose a significant business risk.
But is that the LLM's fault?
The LLM has no "hallucination problem". Hallucination is not a bug; it is the LLM's greatest feature. The LLM assistant (like ChatGPT) has a hallucination problem, and we should fix it.
- Andrej Karpathy
The $3 Trillion a Year Problem
According to recent studies, businesses lose an estimated $3 trillion annually to their inability to extract data accurately [1]. Traditional methods offer a difficult choice:
- Manual processing: Accurate but painfully slow and expensive
- OCR-based systems: Fast, but inaccurate due to lack of context
- Rule-based systems: Rigid, error-prone, and in need of constant updates
- LLM-based systems: Fast and context-aware, but prone to errors due to hallucinations
Introducing the LLM Challenge Framework
What if you could combine the speed of AI with human-level accuracy? That's exactly what the LLM Challenge Framework achieves, and exactly what we tried out. By pitting multiple LLMs against each other and incorporating strategic human oversight, we were able to achieve:
- 99% accuracy in field-level data extraction
- 70% reduction in human review time
- 85% cost savings compared to manual processing
How It Works
Let's break down the magic:
1. Document Pre-processing:
– Ingest the document to generate images or text of consistent quality
– Convert to images if you are using visual LLMs (our choice after trial and error)
– Convert the document to text (using OCR) if you are using text-based LLMs
2. Dual AI Processing:
– Two independent visual LLMs process the same document
– Each LLM generates structured output
– The two results are automatically compared with each other
3. Smart Verification:
– Matching results = high confidence → automatic approval
– Discrepancies = targeted human review via a human-in-the-loop approach
4. Reinforcement Learning:
– Collect human input (feedback) to identify the correct response
– Use the collected data to further fine-tune the LLMs
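The dual-processing and verification steps above can be sketched as a field-by-field comparison of the two models' structured outputs. This is a minimal illustration rather than the framework's actual code; the field names and claim values below are hypothetical.

```python
def compare_extractions(result_a, result_b):
    """Compare the structured outputs of two independent LLMs.

    Fields where both models agree are auto-approved; any
    discrepancy is routed to human review.
    """
    approved, needs_review = {}, {}
    for field in result_a.keys() | result_b.keys():
        a, b = result_a.get(field), result_b.get(field)
        if a is not None and a == b:
            approved[field] = a
        else:
            needs_review[field] = {"model_a": a, "model_b": b}
    return approved, needs_review

# Hypothetical outputs from two models for the same insurance claim:
claim_a = {"policy_id": "P-1043", "amount": "1250.00", "date": "2024-11-02"}
claim_b = {"policy_id": "P-1043", "amount": "1280.00", "date": "2024-11-02"}
approved, needs_review = compare_extractions(claim_a, claim_b)
# policy_id and date match and are auto-approved; amount is flagged
```

The human's resolution of each flagged field doubles as labeled training data for the reinforcement-learning step.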
Why Itβs Better Than Existing Solutions
Here is a high-level comparison chart of the current mainstream approaches to this problem.
Implementation
We used an AWS-based, cloud-native architecture:
- AWS SageMaker to build, train, and deploy the LLMs.
- NVIDIA L40 GPUs for compute. We chose the L40 over the A100 or H100 for cost efficiency, though it came with its own limitations.
- Qwen (Qwen2-VL-2B) and InternVL (InternViT-6B) as the two competing visual LLMs. We tried other models as well (including Gemma and LLaMA) but settled on Qwen and InternVL given the memory capacity of the L40 GPUs and the accuracy of their results.
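With both models deployed as SageMaker real-time endpoints, each one can be invoked with the same document image via the `sagemaker-runtime` client's `invoke_endpoint` call. The sketch below is an assumption about the wiring, not our production code; the endpoint name and payload format are hypothetical, and a stub client stands in for boto3 so the snippet runs offline.

```python
import io
import json

def invoke_extractor(runtime_client, endpoint_name, image_bytes):
    """Send a document image to one deployed visual-LLM endpoint and
    parse the JSON extraction it returns. `runtime_client` is a boto3
    sagemaker-runtime client (or any stub exposing invoke_endpoint)."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=image_bytes,
    )
    return json.loads(response["Body"].read())

class StubRuntime:
    """Local stand-in for the SageMaker runtime, for offline testing."""
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        return {"Body": io.BytesIO(b'{"policy_id": "P-1043"}')}

result = invoke_extractor(StubRuntime(), "qwen2-vl-endpoint", b"...")
```

In production, the same image would be sent to both endpoints (one per competing model) and the two results compared.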
Potential Improvements
Here are some additional approaches that could yield even better results.
- Multiple LLM Integration: Expanding beyond two LLMs to increase confidence levels.
- Model Diversity: Optimizing the mix of large and small models. Keep in mind that the larger models may need GPUs with larger RAM, which could bump up the cost.
- Multi-modal Capabilities: Leveraging models capable of processing both text and images.
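If the framework expands beyond two models, pairwise agreement generalizes to a per-field majority vote: a value that wins a strict majority is auto-approved, and anything short of that goes to human review. A minimal sketch, with hypothetical model outputs:

```python
from collections import Counter

def field_consensus(values):
    """Return the most common value for one field across N model
    outputs, plus whether it won a strict majority (if not, the
    field should be routed to human review)."""
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    return value, votes > len(values) // 2

value, confident = field_consensus(["1250.00", "1250.00", "1280.00"])
# Two of three models agree, so "1250.00" wins a strict majority
```

Note that adding models raises confidence at the cost of more GPU inference per document, so the break-even point depends on your error tolerance.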
ROI Metrics
Here are some of the metrics we considered while operationalizing this workflow:
- Accuracy: 99% field-level accuracy
- Throughput: 85% faster document processing
- Cost reduction: 70% lower processing costs
- Payback period: 6 months
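The payback period follows directly from upfront cost and monthly savings. The dollar amounts below are hypothetical, chosen only to illustrate how a 6-month figure would arise:

```python
def payback_period_months(upfront_cost, monthly_savings):
    """Months of accumulated savings needed to recover the upfront
    implementation cost (assumes constant monthly savings)."""
    months, recovered = 0, 0.0
    while recovered < upfront_cost:
        recovered += monthly_savings
        months += 1
    return months

# e.g. a $300k implementation recovered at $50k/month in savings
months = payback_period_months(300_000, 50_000)
```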
Relevance
The LLM Challenge Framework is a natural fit for industries that handle high volumes of critical documents, where accuracy is paramount and errors can have significant financial or legal consequences. Here are some of them:
1. Healthcare & Insurance: Processing medical claims, clinical notes, and insurance policies to extract diagnosis codes, treatment plans, and coverage details while maintaining HIPAA compliance.
2. Banking & Financial Services: Automating loan application processing, KYC documents, and financial statements to expedite customer onboarding and risk assessment.
3. Legal Services: Analyzing contracts, court documents, and legal correspondence to extract key clauses, deadlines, and legal obligations with high accuracy.
4. Supply Chain & Logistics: Processing bills of lading, customs declarations, and shipping manifests to extract shipment details, compliance information, and tracking data.
5. Manufacturing: Processing quality control reports, safety certificates, and compliance documentation to extract specifications and maintain regulatory standards.
The Bottom Line
In a world where data accuracy can make or break a business, the LLM Challenge Framework offers a revolutionary solution. It's not just about processing documents faster; it's about processing them right.
By combining the power of multiple LLMs with strategic human oversight, we can deliver unprecedented accuracy while maintaining the speed and cost benefits of automation.
The question isn't whether to modernize your document processing; it's whether you can afford not to.
Reference:
[1] T. Redman, Bad Data Costs the U.S. $3 Trillion Per Year (2016), Harvard Business Review