Building a Local Committee-of-Experts (CoE) RAG Application for Document Discovery
Last Updated on January 15, 2025 by Editorial Team
Author(s): Kamban Parasuraman
Originally published on Towards AI.
In today's fast-paced world, where access to timely and accurate information can be a critical differentiator, organizations across various sectors constantly seek innovative solutions to stay ahead of the competition. This is particularly true in the insurance and reinsurance industry, where underwriting expenses have grown significantly in the last decade. Rising costs, salaries, and regulatory compliance requirements drive this consistent increase in underwriting expenses.
Insurance companies spend considerable time and resources in parsing out the information in the policy and claims forms. The extracted information from these documents underpins the rest of the underwriting & claims decision-making process. Large Language Models (LLMs) present a transformative opportunity for the insurance industry to bring down the underwriting expenses by automating and streamlining the information extraction process from the vast amount of text data found in policy and claims forms.
However, despite their immense potential, there are concerns about using models like ChatGPT or any third-party platforms due to data privacy issues. Insurance companies handle vast amounts of sensitive information, including personal details and financial records. The prospect of transmitting sensitive information through external servers raises legitimate worries about data breaches and regulatory compliance, such as GDPR and CCPA. Organizations are also concerned about third-party platforms using companies' proprietary data for training their models.
By deploying LLMs locally, organizations can leverage AI's advanced capabilities without compromising data privacy and security. In this blog, we will explore a simple RAG (Retrieval-Augmented Generation) application for document discovery using Streamlit, Ollama, and ChromaDB, all hosted locally to safeguard sensitive data while demonstrating the effectiveness of advanced AI technology.
What is RAG?
Retrieval-Augmented Generation (RAG) is an advanced AI technique that enhances the capabilities of Large Language Models (LLMs) by combining the strengths of information retrieval and text generation to create more accurate and contextually aware responses. RAG involves two steps:
Retrieval: The model retrieves relevant information from an external source and/or an internal knowledge base.
Generation: The retrieved information is then used to generate responses, making them more accurate and contextually relevant.
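The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: the word-overlap retriever stands in for a real vector search, and the names `tokenize`, `retrieve`, `build_prompt`, `rag_answer`, and the `generate` callable are all hypothetical.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Retrieval step: rank documents by word overlap with the query (toy retriever)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context: List[str]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def rag_answer(
    query: str,
    documents: List[str],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Generation step: pass the augmented prompt to the language model."""
    return generate(build_prompt(query, retrieve(query, documents, top_k)))


documents = [
    "The policy deductible is 500 dollars.",
    "Claims must be filed within 30 days.",
    "The office is closed on Sundays.",
]

# Stand-in for a locally hosted LLM (assumption): it simply echoes the prompt.
echo_model = lambda prompt: prompt
print(rag_answer("What is the policy deductible?", documents, echo_model, top_k=1))
```

In a real deployment, `generate` would call a locally hosted model and `retrieve` would query a vector database, but the retrieve-then-generate flow stays the same.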
The chart below highlights the key benefits of building a local RAG application.
Transformers
LLMs are based on the Transformer architecture. Transformers are neural network architectures designed to handle sequential data, such as text. The architecture excels at "transforming" one data sequence into another. Transformers are highly effective for tasks in natural language processing (NLP) because they can capture long-range dependencies and relationships within text. The original Transformer consists of an encoder-decoder structure: the encoder processes the input sequence, while the decoder generates the output sequence. The key component of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of different words in a sequence relative to each other. Please refer to the Attention Is All You Need paper for more information on transformers and attention mechanisms.
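To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a deliberate simplification: real Transformers apply learned query, key, and value projections and use several attention heads, whereas this sketch assumes identity projections and a single head. The function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score the query token against every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Two toy 2-dimensional token vectors (made-up values).
for row in self_attention([[1.0, 0.0], [0.0, 1.0]]):
    print(row)
```

Each output vector is a blend of every token in the sequence, with the blend weights determined by how strongly the tokens relate to one another.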
Ollama
Ollama is a free and open-source tool that allows users to run large language models (LLMs) locally on their machines, ensuring data privacy and control. Ollama makes this practical by serving quantized models. In simple terms, quantization reduces the model size by converting the floating-point numbers used in the model's parameters to lower-bit representations, such as 4-bit integers. This reduces the memory footprint and allows for quick deployment on devices with limited resources. You can download Ollama using this link. Ollama provides a Command-Line Interface (CLI) for easy installation and model management. Below is a partial list of available commands in Ollama.
# Check installed version of ollama
C:\>ollama --version
# Download a particular LLM or embedding model
C:\>ollama pull llama3
# List of installed models
C:\>ollama list
# Start the Ollama server so that installed models can be queried
C:\>ollama serve
# Show model information
C:\>ollama show llama3
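The quantization idea described above can be illustrated with a small sketch. It assumes a simple symmetric scheme with a single scale factor for the whole list of weights; the quantization formats used by real model files are block-wise and more elaborate, so treat this only as an intuition aid. The `quantize` and `dequantize` helpers are hypothetical.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.62, -0.31, 0.05, -0.48]
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Each weight now needs only 4 bits plus a shared scale, at the cost of a small rounding error that is bounded by half the scale.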
Streamlit
Streamlit is an open-source Python library designed to facilitate rapid development and sharing of custom applications. It empowers developers and data scientists to create interactive, data-driven apps with minimal effort and maximum efficiency. To help new users familiarize themselves with its capabilities, Streamlit offers an extensive App gallery. Streamlit can be easily installed using the following command:
(envn) C:\>pip install streamlit
ChromaDB
Chroma is an open-source vector database for AI applications, designed for storing and retrieving vector embeddings. Unlike traditional databases that store data in structured tables with rows and columns using a relational model, a vector database stores data as vector representations, making it well-suited for handling complex, unstructured data. Vector databases excel at similarity searches and complex matching. A quick overview of Chroma can be found here. Chroma can be easily installed using the following command:
(envn) C:\>pip install chromadb
Vector embeddings provide a way to represent words, sentences, and even documents as dense numerical vectors. This numerical representation is essential because it allows machine learning algorithms to process and understand text data. Embeddings capture the semantic meaning of words and phrases. Words with similar meanings will have embeddings that are close together in the latent vector space. For example, the words "king" and "queen" will be closer to each other than "king" and "transformer". For LLaMA 3 8B, the embedding vector has a dimension of 4096. This means each token is represented by a 4096-dimensional vector. To experiment and visualize tokenization and embeddings from different models, please see this app.
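The following sketch shows, under simplifying assumptions, what a similarity search over embeddings looks like. `ToyVectorStore` is a hypothetical in-memory class, not Chroma's API, and the 3-dimensional vectors are made-up values chosen so that "king" and "queen" sit close together. Euclidean distance is used here as the closeness measure.

```python
import math
from typing import Dict, List, Tuple


class ToyVectorStore:
    """A tiny in-memory stand-in for a vector database (illustration only)."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, item_id: str, embedding: List[float]) -> None:
        """Store an embedding under the given identifier."""
        self._vectors[item_id] = embedding

    def query(self, embedding: List[float], k: int = 2) -> List[Tuple[str, float]]:
        """Return the k stored items closest to the query embedding."""
        distances = [
            (item_id, math.dist(embedding, vector))
            for item_id, vector in self._vectors.items()
        ]
        distances.sort(key=lambda pair: pair[1])  # smallest distance first
        return distances[:k]


store = ToyVectorStore()
store.add("king", [0.90, 0.80, 0.10])         # hypothetical embedding
store.add("queen", [0.85, 0.82, 0.15])        # hypothetical embedding
store.add("transformer", [0.10, 0.20, 0.95])  # hypothetical embedding

for item_id, distance in store.query([0.90, 0.80, 0.10], k=2):
    print(item_id, round(distance, 3))
```

A production vector database adds persistence, metadata filtering, and approximate nearest-neighbor indexes, but the core operation is the same nearest-vector lookup.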
Committee of Experts
When building a robust RAG application with quantized versions of LLMs, one significant challenge is ensuring the accuracy and completeness of the generated responses while avoiding hallucinations. To address this, let's employ a Committee of Experts (CoE). By leveraging multiple LLMs (specifically LLaMA 3-8B, Mistral 7B, and Phi3-mini) for each query, the application can provide users with multiple perspectives on the same question. This method involves passing each query through all three models, so users receive three distinct responses. While all three models are designed for general-purpose NLP tasks, their performance may vary slightly depending on the specific task and the datasets used to train them. LLaMA 3-8B is the largest model, followed by Mistral 7B, and then Phi3-mini. Generally, larger models can capture more complex patterns and nuances in the data, leading to higher accuracy.
The multi-model strategy enhances users' confidence in the system's outputs, as the convergence of contextually similar answers across different models suggests reliability. By cross-verifying the answers, users can identify the most accurate and comprehensive response, and build confidence scores for different models in the CoE. This collaborative methodology also helps identify and mitigate individual model biases and errors.
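The fan-out described above, where one query is sent to every member of the committee, can be sketched as follows. This is not the application's source code: `ask_committee` is a hypothetical helper, the expert labels are illustrative, and each expert is assumed to be a plain Python callable that wraps a locally hosted model. Stub functions stand in for the real models so the sketch runs on its own.

```python
from typing import Callable, Dict

Expert = Callable[[str], str]


def ask_committee(prompt: str, experts: Dict[str, Expert]) -> Dict[str, str]:
    """Send the same prompt to every expert and collect one answer per model."""
    answers: Dict[str, str] = {}
    for name, expert in experts.items():
        try:
            answers[name] = expert(prompt)
        except Exception as exc:
            # One failing model should not block the rest of the committee.
            answers[name] = f"[{name} failed: {exc}]"
    return answers


# Stub experts (assumption): real ones would call the local LLMs.
committee: Dict[str, Expert] = {
    "llama3": lambda prompt: f"llama3 answer to: {prompt}",
    "mistral": lambda prompt: f"mistral answer to: {prompt}",
    "phi3": lambda prompt: f"phi3 answer to: {prompt}",
}

responses = ask_committee("Summarize the important findings from the paper", committee)
for model_name, answer in responses.items():
    print(f"{model_name}: {answer}")
```

Keeping the experts behind a common callable interface makes it easy to add or swap committee members without touching the rest of the application.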
Using real insurance documents for testing is not feasible due to the presence of personally identifiable information and strict data protection regulations. Instead, we will use publicly available research papers to demonstrate the application's capabilities. The structure and complexity of research papers provide a robust testbed, highlighting the application's features without compromising data privacy.
The source code for the application can be found on my GitHub page.
We will use this paper published by the American Meteorological Society to evaluate the application. When prompted to "Summarize the important findings from the paper", responses were generated by each member of the CoE (see Table below). Each LLM in the CoE is trained on distinct datasets and exhibits unique characteristics and strengths, akin to the diverse expertise found in a panel of human experts.
For instance, the Phi-3 model provided a detailed response, itemizing the important findings. In contrast, the Mistral model delivered a more concise summary. This diversity in viewpoints ensures that the final output is comprehensive, nuanced, and well-rounded, effectively capturing the multifaceted nature of the query posed to the application.
Evaluate Model Outputs
One crucial consideration for building a robust RAG framework is assessing the accuracy of responses generated by LLMs. Typically, the most reliable method for evaluating these responses involves human feedback, where individuals review and rate the AI-generated outputs. However, obtaining high-quality human feedback is often both expensive and time-consuming.
To streamline this process, we will automate the evaluation: the responses from the different LLMs are encoded into embeddings, and the cosine similarity between each pair of outputs is calculated. If the cosine similarity for every pair of models exceeds a certain threshold, we can assign a high confidence score to the responses. A cosine similarity score of 1 indicates that two embeddings point in the same direction (effectively an exact match), while a score of 0 indicates no similarity.
Sentence Transformers are a type of machine learning model designed to generate dense vector representations (embeddings) of sentences and paragraphs. all-MiniLM-L6-v2 is a compact but powerful Sentence Transformer model trained to generate high-quality sentence embeddings. This model maps sentences and paragraphs to a 384-dimensional dense vector space, making it suitable for tasks such as clustering or semantic search. By default, input text longer than 256 word pieces is truncated. Overall, the responses show meaningful similarity, indicating a higher level of confidence in the responses from the CoE.
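The agreement check described above can be sketched as follows. The embeddings here are short made-up vectors standing in for the 384-dimensional sentence embeddings, and the threshold of 0.7 is an arbitrary assumption, not a value taken from the application. `committee_confidence` is a hypothetical helper name.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def committee_confidence(
    embeddings: Dict[str, List[float]], threshold: float = 0.7
) -> Tuple[Dict[Tuple[str, str], float], str]:
    """Score every pair of responses; confidence is high only if all pairs agree."""
    scores = {
        (first, second): cosine_similarity(embeddings[first], embeddings[second])
        for first, second in combinations(embeddings, 2)
    }
    label = "high" if all(s > threshold for s in scores.values()) else "low"
    return scores, label


# Hypothetical response embeddings for the three committee members.
response_embeddings = {
    "llama3": [0.80, 0.10, 0.55],
    "mistral": [0.75, 0.20, 0.60],
    "phi3": [0.70, 0.15, 0.65],
}

pair_scores, confidence = committee_confidence(response_embeddings)
for pair, score in pair_scores.items():
    print(pair, round(score, 3))
print("confidence:", confidence)
```

Requiring every pair to clear the threshold is a conservative choice: a single outlier response is enough to lower the confidence label and prompt a closer look.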
Closing Thoughts
In this blog, we developed a minimalistic CoE RAG application for document discovery. This serves as a proof-of-concept, demonstrating the foundational capabilities of the application for document discovery. We can further customize the CoE RAG framework based on specific needs and preferences.
General-purpose LLMs like LLaMA, Mistral, and Phi may not natively comprehend the insurance-specific terms and acronyms used in the Property and Casualty industry. Due to this limitation, these models might not perform optimally out-of-the-box. To address this, we need to fine-tune the models to familiarize them with insurance-specific terminology. In the next blog, we will explore methods for generating synthetic data to custom-train general-purpose LLMs for insurance-specific use cases.
Thanks for reading this article! All feedback is appreciated. For any questions, feel free to contact me.
If you liked this article, here are some other articles you may enjoy:
Hurricane Path Prediction using Deep Learning
Every year, the time window between June 1 and November 30 signifies the North Atlantic Hurricane season. During this…
medium.com
Stochastic Weather Generator using Generative Adversarial Networks
Modeling Multivariate Distributions using GANs
towardsdatascience.com
The views expressed in this article are my own and do not necessarily reflect the views of my employer.