Building a Local Committee-of-Expert (CoE) RAG Application for Document Discovery

Last Updated on January 7, 2025 by Editorial Team

Author(s): Kamban Parasuraman

Originally published on Towards AI.

In today’s fast-paced world, where access to timely and accurate information can be a critical differentiator, organizations across sectors constantly seek innovative ways to stay ahead of the competition. This is particularly true in the insurance and reinsurance industry, where underwriting expenses have grown significantly over the last decade, driven by rising costs, salaries, and regulatory compliance requirements.

U.S. Property and Casualty Insurance Industry — Underwriting Expenses (Source: WWW.NAIC.ORG)

Insurance companies spend considerable time and resources parsing the information in policy and claims forms. The information extracted from these documents underpins the rest of the underwriting and claims decision-making process. Large Language Models (LLMs) present a transformative opportunity for the insurance industry to reduce underwriting expenses by automating and streamlining information extraction from the vast amount of text found in policy and claims forms.

However, despite their immense potential, there are concerns about using models like ChatGPT or any third-party platforms due to data privacy issues. Insurance companies handle vast amounts of sensitive information, including personal details and financial records. The prospect of transmitting sensitive information through external servers raises legitimate worries about data breaches and regulatory compliance, such as GDPR and CCPA. Organizations are also concerned about third-party platforms using companies' proprietary data for training their models.

Organizations can leverage AI’s advanced capabilities by deploying LLMs locally without compromising data privacy and security. In this blog, we will explore a simple RAG (Retrieval-Augmented Generation) application for document discovery using Streamlit, Ollama, and ChromaDB, all hosted locally to safeguard sensitive data while demonstrating the effectiveness of advanced AI technology.

What is RAG?

Retrieval-Augmented Generation (RAG) is an advanced AI technique that enhances the capabilities of Large Language Models (LLMs) by combining the strengths of information retrieval and text generation to create more accurate and contextually aware responses. RAG involves two steps:

Retrieval: The model retrieves relevant information from an external source and/or an internal knowledge base.

Generation: The retrieved information is then used to generate responses, making them more accurate and contextually relevant.
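The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the application's code: the word-overlap retriever stands in for a real vector search, and `retrieve`, `build_prompt`, `answer`, and the `generate` callable are hypothetical names introduced here.

```python
from typing import Callable, List


def retrieve(question: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved passages in front of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}"
    )


def answer(question: str, documents: List[str],
           generate: Callable[[str], str]) -> str:
    """Retrieval step followed by the generation step."""
    context = retrieve(question, documents)
    return generate(build_prompt(question, context))
```

In a real deployment, `generate` would wrap a call to a locally hosted LLM; here any function that maps a prompt string to a response string will do.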

The chart below highlights the key benefits of building a local RAG application.

Transformers

LLMs are based on the Transformer architecture. Transformers are neural network architectures designed to handle sequential data, such as text, and they excel at “transforming” one sequence into another. They are highly effective for natural language processing (NLP) tasks because they capture long-range dependencies and relationships within text. The original Transformer consists of an encoder-decoder structure: the encoder processes the input sequence, while the decoder generates the output sequence. Many modern LLMs, including the LLaMA family, use only the decoder stack. The key component of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of different words in a sequence relative to each other. Please refer to the Attention Is All You Need paper for more information on transformers and attention mechanisms.
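To make the self-attention idea concrete, here is a small pure-Python sketch of scaled dot-product attention. It is deliberately simplified: it uses a single head and, as an assumption for brevity, reuses the raw token vectors as queries, keys, and values, whereas a real Transformer first applies learned projection matrices.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Each output is a weighted average of all token vectors."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs
```

Tokens whose vectors are more similar to the query receive larger weights, which is how one word can "attend" more strongly to the words most relevant to it.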

Ollama

Ollama is a free and open-source tool that allows users to run large language models (LLMs) locally on their machines, keeping data private and under their control. A key enabler is model quantization: the models distributed for Ollama are typically quantized. In simple terms, quantization reduces model size by converting the floating-point numbers used in the model’s parameters to lower-bit representations, such as 4-bit integers. This reduces the memory footprint and allows for quick deployment on devices with limited resources. You can download Ollama using this link. Ollama provides a Command-Line Interface (CLI) for easy installation and model management. Below is a partial list of available commands in Ollama.

# Check installed version of ollama
C:\>ollama --version

# Download a particular LLM or embedding model
C:\>ollama pull llama3

# List of installed models
C:\>ollama list

# Start the Ollama server (to chat with a model, use: ollama run llama3)
C:\>ollama serve

# Show model information
C:\>ollama show llama3
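The quantization idea described earlier in this section can be illustrated with a short sketch. This is a simplified symmetric 4-bit scheme with one shared scale, chosen for clarity. It is an assumption for illustration and not the exact format used by the model files Ollama runs.

```python
from typing import List, Tuple


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit integers in [-8, 7] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.70, -0.20, 0.00, 0.13]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# Each restored weight differs from the original by at most scale / 2.
```

Each weight now needs 4 bits instead of 32, at the cost of a small rounding error per weight.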

Streamlit

Streamlit is an open-source Python library designed to facilitate rapid development and sharing of custom applications. It empowers developers and data scientists to create interactive, data-driven apps with minimal effort and maximum efficiency. To help new users become familiar with its capabilities, Streamlit offers an extensive App gallery. Streamlit can be easily installed using the following command:

(envn) C:\>pip install streamlit

ChromaDB

Chroma is an open-source vector database for AI applications, designed for storing and retrieving vector embeddings. Unlike traditional databases that store data in structured tables with rows and columns using a relational model, a vector database stores data as vector representations, making it well suited to complex, unstructured data. Vector databases excel at similarity searches and complex matching. A quick overview of Chroma can be found here. Chroma can be easily installed using the following command:

(envn) C:\>pip install chromadb
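To show what a vector database does conceptually, the sketch below implements a toy in-memory store. It is not Chroma's API: `TinyVectorStore` and its methods are hypothetical names, and the search is a brute-force scan using squared Euclidean distance, whereas a real vector database uses an index to stay fast at scale.

```python
from typing import Dict, List, Tuple


class TinyVectorStore:
    """Toy stand-in for a vector database: store embeddings, find nearest."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[List[float], str]] = {}

    def add(self, item_id: str, embedding: List[float], document: str) -> None:
        """Store a document alongside its embedding."""
        self._items[item_id] = (embedding, document)

    def query(self, embedding: List[float], n_results: int = 1) -> List[str]:
        """Return the documents whose embeddings are closest to the query."""
        def squared_distance(other: List[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(embedding, other))

        ranked = sorted(
            self._items.values(),
            key=lambda item: squared_distance(item[0]),
        )
        return [document for _, document in ranked[:n_results]]


store = TinyVectorStore()
store.add("doc1", [0.9, 0.1], "Flood coverage terms")
store.add("doc2", [0.1, 0.9], "Claims filing deadlines")
nearest = store.query([0.8, 0.2], n_results=1)
```

The two-dimensional embeddings here are made up for illustration; in the application they would come from an embedding model.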

Transforming Texts to Latent-Space Representations

Vector embeddings provide a way to represent words, sentences, and even documents as dense numerical vectors. This numerical representation is essential because it allows machine learning algorithms to process and understand text data. Embeddings capture the semantic meaning of words and phrases: words with similar meanings have embeddings that are close together in the latent vector space. For example, the words “king” and “queen” will be closer to each other than “king” and “transformer”. For LLaMA 3 8B, the embedding vector has a dimension of 4096, which means each token is represented by a 4096-dimensional vector. To experiment with and visualize tokenization and embeddings from different models, please see this app.
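The "king" and "queen" example can be checked numerically with cosine similarity. The three-dimensional vectors below are invented for illustration only; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not taken from any real model.
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.90, 0.15]
transformer = [0.10, 0.20, 0.95]

royal_pair = cosine_similarity(king, queen)
unrelated_pair = cosine_similarity(king, transformer)
# royal_pair is larger than unrelated_pair for these vectors.
```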

Committee of Experts

In building a robust RAG application using quantized versions of LLMs, one significant challenge is ensuring the accuracy and completeness of the generated responses while avoiding hallucinations. To address this, let’s employ a Committee of Experts (CoE). By leveraging multiple LLMs (specifically LLaMA 3 8B, Mistral 7B, and Phi-3-mini) for each query, the application can provide users with multiple perspectives on the same question. Each query is passed through all three models, so users receive three distinct responses. While all three models are designed for general-purpose NLP tasks, their performance may vary slightly depending on the specific task and the datasets used to train them. LLaMA 3 8B is the largest model, followed by Mistral 7B, and then Phi-3-mini. Generally, larger models can capture more complex patterns and nuances in the data, which often leads to higher accuracy.

The multi-model strategy enhances users’ confidence in the system’s outputs, since convergence on contextually similar answers across different models suggests reliability. By cross-verifying the answers, users can identify the most accurate and comprehensive response and build confidence scores for the different models in the CoE. This collaborative methodology also helps identify and mitigate individual model biases and errors.
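The fan-out pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: `ask_committee` is a hypothetical helper, and each expert is modeled as a plain Python callable. In the actual application, each callable would wrap a request to a locally served model.

```python
from typing import Callable, Dict


def ask_committee(question: str,
                  experts: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same question to every expert and collect each response."""
    responses = {}
    for name, expert in experts.items():
        try:
            responses[name] = expert(question)
        except Exception as exc:
            # One failing model should not hide the other answers.
            responses[name] = f"[{name} failed: {exc}]"
    return responses


# Stub experts standing in for the three locally hosted models.
experts = {
    "llama3": lambda q: f"Detailed answer to: {q}",
    "mistral": lambda q: f"Concise answer to: {q}",
    "phi3": lambda q: f"Itemized answer to: {q}",
}
answers = ask_committee("Summarize the important findings", experts)
```

Keeping the responses keyed by model name makes it straightforward to display them side by side or to compare them afterward.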

Using real insurance documents for testing is not feasible due to the presence of personally identifiable information and strict data protection regulations. Instead, we will use publicly available research papers to demonstrate the application’s capabilities. The structure and complexity of research papers provide a robust testbed, highlighting the application’s features without compromising data privacy.

The source code for the application can be found on my GitHub page.

We will use this paper published by the American Meteorological Society to evaluate the application. When prompted to “Summarize the important findings from the paper”, each member of the CoE generated a response (see Table below). Each LLM in the CoE is trained on distinct datasets and exhibits unique characteristics and strengths, akin to the diverse expertise found in a panel of human experts.

For instance, the Phi-3 model provided a detailed response, itemizing the important findings. In contrast, the Mistral model delivered a more concise summary. This diversity in viewpoints ensures that the final output is comprehensive, nuanced, and well-rounded, effectively capturing the multifaceted nature of the query posed to the application.

Evaluate Model Outputs

One crucial consideration for building a robust RAG framework is assessing the accuracy of responses generated by LLMs. Typically, the most reliable method for evaluating these responses involves human feedback, where individuals review and rate the AI-generated outputs. However, obtaining high-quality human feedback is often both expensive and time-consuming.

To streamline this process, we automate the evaluation: each model’s response is encoded into an embedding, and the cosine similarity between every pair of responses is calculated. If the cosine similarity among all three models exceeds a certain threshold, we can assign a high confidence score to the responses. A cosine similarity of 1 means the two embeddings point in the same direction (an essentially identical response), while a score of 0 indicates no similarity.
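The agreement check can be sketched as below. Two things here are assumptions made for illustration: a simple word-count vector stands in for a real embedding model so the example runs without extra dependencies, and the threshold of 0.5 is an arbitrary placeholder rather than a value from the application.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts instead of a learned dense vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def committee_confidence(
    responses: Dict[str, str], threshold: float = 0.5
) -> Tuple[Dict[Tuple[str, str], float], str]:
    """Score every pair of responses; 'high' only if all pairs agree."""
    vectors = {name: embed(text) for name, text in responses.items()}
    scores = {
        (m1, m2): cosine(vectors[m1], vectors[m2])
        for m1, m2 in combinations(vectors, 2)
    }
    agree = bool(scores) and min(scores.values()) >= threshold
    return scores, "high" if agree else "low"
```

Requiring the minimum pairwise score to clear the threshold is a conservative choice: a single outlier response is enough to lower the confidence label.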

Sentence Transformers are a type of machine learning model designed to generate dense vector representations (embeddings) of sentences and paragraphs. all-MiniLM-L6-v2 is a compact but powerful Sentence Transformer model trained to generate high-quality sentence embeddings. This model maps sentences and paragraphs to a 384-dimensional dense vector space, making it suitable for tasks such as clustering or semantic search. By default, input text longer than 256 word pieces is truncated. Overall, the responses show meaningful similarity, indicating a higher level of confidence in the responses from the CoE.

Cosine Similarity between responses from different LLMs for the Prompt “Summarize the important findings from the paper”

Closing Thoughts

In this blog, we developed a minimalistic CoE RAG application for document discovery. It serves as a proof of concept that demonstrates the application’s foundational capabilities. The CoE RAG framework can be further customized based on specific needs and preferences.

General-purpose LLMs like LLaMA, Mistral, and Phi may not natively comprehend the insurance-specific terms and acronyms used in the Property and Casualty industry. Due to this limitation, these models might not perform optimally out of the box. To address this, we need to fine-tune the models to familiarize them with insurance-specific terminology. In the next blog, we will explore methods for generating synthetic data to custom-train general-purpose LLMs for insurance-specific use cases.

Thanks for reading this article! All feedback is appreciated. For any questions, feel free to contact me.

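If you liked this article, here are some other articles you may enjoy:

Hurricane Path Prediction using Deep Learning

Stochastic Weather Generator using Generative Adversarial Networks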

The views expressed in this article are my own and do not necessarily reflect the views of my employer.


Published via Towards AI
