
Harnessing the power of LLMs and LangChain for structured data extraction from unstructured data

Last Updated on September 25, 2025 by Editorial Team

Author(s): Leapfrog Technology

Originally published on Towards AI.


In today’s ever-evolving tech landscape, the rise of Large Language Models (LLMs) has brought about a transformative shift in how we engage with digital applications and content. These advanced language models, exemplified by well-known systems like ChatGPT and Google Bard, have opened entire gateways to innovative application development. As a testament to their potential, tech giants like Microsoft and Meta have entered this arena with TypeChat and Llama 2, respectively, offering open-source tools and models for developers to harness the power of LLMs.

Moreover, the growing influence of LLMs is not just evident in their adoption by tech giants but is also mirrored in the substantial investments pouring into this field. As of fall 2023, some of the top-funded companies working with LLMs have raised staggering amounts of funding. OpenAI, the pioneering organization behind ChatGPT, has secured a whopping $14 billion in funding, highlighting the immense interest and belief in the potential of these models. Joining the league are companies like Anthropic with $1.55 billion, Cohere with $435 million, Adept with $415 million, Hugging Face with $160.60 million, and Mistral AI with $112.93 million in funding. These eye-popping figures underscore the burgeoning importance of LLMs in the tech world. (Source: cbinsights.com)

In this blog article, we will not only introduce you to LLMs but also dive deeper into LangChain — a cutting-edge LLM framework that is making waves in the industry. Furthermore, we will demonstrate how to harness the power of OpenAI’s GPT-3.5 LLM in conjunction with the LangChain framework to obtain structured outputs, paving the way for exciting new applications and developments. Let’s embark on this journey into the world of Large Language Models and explore the limitless possibilities they offer.

Introduction to Large Language Models (LLMs)

Large Language Models are artificial intelligence systems created to handle extensive quantities of natural language data. They leverage this data to generate responses to user queries (prompts). These systems are trained on massive data sets using advanced machine learning algorithms. This training enables them to grasp the intricacies and structures of human language, empowering them to produce coherent and contextually relevant text in response to a diverse array of written inputs. The significance of Large Language Models is steadily growing across various domains, including natural language processing, machine translation, and code and text generation, among others.

Application Areas of LLMs

  • Chatbots and Virtual Assistants
  • Code Generation and Debugging
  • Sentiment Analysis
  • Text Classification and Clustering
  • Language Translation
  • Summarization and Paraphrasing
  • Content Generation

Note: Most Large Language Models are not specifically trained to serve as repositories of factual information. While they possess language generation capabilities, they may not have knowledge about specific details like the winner of a major sporting event from the previous year. It is crucial to exercise caution by fact-checking and thoroughly comprehending their responses before relying on them as reliable references.

Applying Large Language Models

When considering the utilization of Large Language Models for a specific purpose, there exist several approaches one can explore. Broadly speaking, these approaches can be categorized into two distinct groups, although there may be some overlap between them. In the following discussion, we will provide a brief overview of the pros and cons associated with each approach and identify the scenarios that are most suitable for each.

Proprietary Services

OpenAI’s ChatGPT marked the introduction of Large Language Models (LLMs) to the mainstream, setting the stage for their widespread use. ChatGPT provides a user-friendly interface, and OpenAI offers an API, through which users can send prompts to various models, including GPT-3.5 and GPT-4, and receive responses in a timely manner. These models are highly proficient and capable of handling intricate technical tasks like code generation and creative endeavors such as composing poetry in specific styles.

Nevertheless, there are notable downsides to these services. First and foremost, they demand an immense amount of computational resources, not only for their development (with GPT-4 costing over $100 million to create) but also for serving responses. Consequently, these extremely large models are typically controlled by organizations, necessitating users to transmit their data to third-party servers for interactions. This arrangement raises concerns about privacy, security, and the use of “black box” models, where users lack influence over their training and operational constraints. Furthermore, due to the substantial computational requirements, these services often come with associated costs, making budget considerations a significant factor in their widespread adoption.

In summary, proprietary LLM services are an excellent choice when tackling complex tasks. However, users should be willing to share data with third parties, anticipate costs when scaling up, and recognize the limited control they have over these models’ inner workings.

Open Source Models

An alternative route in the realm of language models is engaging with the thriving open source community, exemplified by platforms like Hugging Face. Here, a multitude of models contributed by various sources are available to address specific language-related tasks such as text generation, summarization, and classification. Although open source models have made significant progress, they have not yet matched the peak performance of proprietary models like GPT-4. However, ongoing developments are simplifying the process of using open source models, making them more user-friendly.

These models are often significantly smaller than proprietary alternatives like ChatGPT, facilitating local hosting and ensuring data control for privacy and governance. One notable advantage of open source models is their adaptability, allowing fine-tuning to specific datasets, thereby enhancing performance in domain-specific applications.

Furthermore, the introduction of Llama 2, an innovative open source language model, has added another dimension to this landscape. Llama 2 offers competitive performance, enhanced accessibility, data control, cost management, and fine-tuning capabilities, making it an appealing choice for various language-related tasks.

In summary, the open source community provides a viable alternative to proprietary models, with Llama 2 strengthening this option by offering a powerful toolset for language tasks while enabling data control and cost efficiency.

Example of Applying LLM using OpenAI API in Python

This Python code snippet demonstrates how to utilize OpenAI’s API to interact with its Large Language Models (LLMs), such as GPT-3.5 Turbo and GPT-4. It begins by importing the necessary libraries and setting the API key for authentication. Then, it specifies the chosen model (in this case, “gpt-3.5-turbo”) and defines a user message (“Mary had a little”) as input. The code uses this message to create a chat-like interaction with the chosen LLM through the openai.ChatCompletion.create() function. Finally, it prints the generated response from the LLM. This code offers a straightforward way to integrate OpenAI’s language models into various applications, making it accessible and user-friendly for developers.

Code:

import openai

# Configure the API key and choose a model
openai.api_key = "<openai_api_key>"
model = "gpt-3.5-turbo"
message = "Mary had a little"

# Send the prompt as a single user message (legacy openai<1.0 ChatCompletion interface)
completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": message}]
)

# Print the model's reply
print(completion["choices"][0]["message"]["content"])

LangChain — A Framework for LLM Applications

The popularity of LLMs has skyrocketed, drawing the attention of users and developers alike. However, beneath the surface of this excitement lies a challenge — how to effectively and seamlessly integrate these language models into applications. While LLMs excel at understanding and generating text, their true potential shines when they are harmoniously blended with other sources of computation and knowledge, resulting in dynamic and truly powerful applications.

Enter LangChain, a cutting-edge framework engineered to unlock the full potential of LLMs by streamlining their integration with a diverse range of resources. LangChain empowers data professionals and developers to create applications that not only exhibit linguistic intelligence but also tap into a rich ecosystem of information and computation.

What sets LangChain apart is its versatility. Unlike other solutions that might be limited to a specific LLM’s API, LangChain is designed to work seamlessly with various LLMs, including not only OpenAI’s but also those from Cohere, Hugging Face, Llama, and more. This flexibility ensures that developers can choose the language model that best suits their project’s requirements.

But LangChain’s capabilities don’t stop at LLM integration. It goes further by incorporating what it terms “Tools” into the development process. These tools can encompass a wide array of resources, from Wikipedia for knowledge enrichment to Zapier for automation and the file system for data management. By leveraging these tools, LangChain offers a comprehensive toolkit for developers to create applications that are not just linguistically proficient but also well-equipped to access, process, and utilize various data sources and computational services.
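
For instance, here is a minimal sketch of loading a couple of built-in tools with the 2023-era langchain package (the tool names and the API key placeholder are illustrative assumptions, not code from this article):

from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="<OPENAI_API_KEY>")

# Each tool wraps an external resource the agent can call;
# some tools (e.g., "wikipedia") require extra pip packages.
tools = load_tools(["llm-math", "wikipedia"], llm=llm)
print([tool.name for tool in tools])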

In summary, while the emergence of LLMs has ushered in a new era of application development, integrating them effectively into projects can be challenging. LangChain steps in as a groundbreaking framework that simplifies this process, allowing developers to harness the true potential of LLMs while seamlessly incorporating diverse resources and tools into their applications. With LangChain, the possibilities are boundless, making it the go-to choice for those looking to build intelligent, data-rich, and dynamic applications in the age of LLMs.

Key Components of LangChain

LangChain’s core framework is a powerful tool for language model applications, built around several key components that serve as building blocks. These components include Models (comprising LLMs, Chat Models, and Text Embedding Models), Prompts, Memory (both Short-Term and Long-Term), Chains (including LLMChains and Index-related Chains), Agents (comprising Action and Plan-and-Execute Agents), Callback, and Indexes. Let’s delve into these components in detail.

Model

At the heart of LangChain are Models, which come in three primary types:

  • Large Language Models (LLMs): These models are trained on extensive text data and excel at generating meaningful output.
  • Chat Models: They offer a structured approach, enabling interactive conversations with users through messages.
  • Text Embedding Models: These models convert text into numerical representations, facilitating semantic-style searches across a vector space.
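
As a quick illustration, here is a minimal sketch of the three model types using LangChain’s OpenAI integrations (the model names and the API key placeholder are illustrative assumptions):

from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import HumanMessage

api_key = "<OPENAI_API_KEY>"

# LLM: plain text in, plain text out
llm = OpenAI(model_name="text-davinci-003", openai_api_key=api_key)
print(llm("Complete this sentence: Mary had a little"))

# Chat Model: structured messages in, a message out
chat = ChatOpenAI(model_name="gpt-3.5-turbo", openai_api_key=api_key)
print(chat([HumanMessage(content="Say hello in French")]).content)

# Text Embedding Model: text in, a vector of numbers out
embeddings = OpenAIEmbeddings(openai_api_key=api_key)
print(len(embeddings.embed_query("feedback scoring")))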

Prompt

The Prompt component serves as the entry point for interacting with LLMs and directing the flow of information. It includes three essential elements:

  • Prompt Templates: These templates guide the format of the model’s responses, including questions and few-shot examples.
  • Example Selectors: They dynamically choose examples based on user input to enhance interaction.
  • Output Parsers: Output Parsers structure and format the model’s responses to meet specific requirements.
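
For example, a Prompt Template captures the prompt format once and fills in user input at call time (a minimal sketch; the template text is an assumption for illustration):

from langchain.prompts import PromptTemplate

# The {product} placeholder is filled in when the prompt is formatted
prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest one catchy name for a company that makes {product}.",
)
print(prompt.format(product="AI-powered feedback tools"))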

Memory

Memory plays a pivotal role in creating a seamless and interactive user experience within LangChain. It is divided into two parts:

  • Short-Term Memory: This component keeps track of the current conversation, providing context for responses in real-time.
  • Long-Term Memory: Long-Term Memory stores past interactions, enabling personalized and relevant responses based on historical data.
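
A minimal sketch of short-term memory using a conversation buffer (the model settings and example inputs are illustrative assumptions):

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="<OPENAI_API_KEY>")

# ConversationBufferMemory keeps the running transcript and injects it into each new prompt
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())
conversation.predict(input="Hi, my name is Rob.")
print(conversation.predict(input="What is my name?"))  # the model can now answer from the stored history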

Chain

Chains bring together various components to generate meaningful responses from language models. There are two common types:

  • LLMChain: This combines Prompt Template, Model, and optional Guardrails for standard interactions with language models.
  • Index-related Chains: These interact with Indexes and combine data with LLMs using various methods to generate responses.
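
For example, an LLMChain simply wires a prompt template to a model (a minimal sketch; the prompt wording and model settings are illustrative assumptions):

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="<OPENAI_API_KEY>")
prompt = ChatPromptTemplate.from_template("Summarize this feedback in one sentence: {feedback}")

# The chain formats the prompt with the input and sends it to the model
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(feedback="Rob took ownership of the database course but should communicate more proactively."))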

Agents

Agents are autonomous decision-makers within LangChain that interact with other components. They come in two main types:

  • Action Agents: These handle small tasks and contribute to the smooth operation of the system.
  • Plan-and-Execute Agents: These agents are responsible for managing complex or long-running tasks, vital for coordinating and managing information flow within the system.
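
A minimal sketch of an action-style agent that decides on its own when to call a calculator tool (the agent type, tool name, and question are illustrative assumptions):

from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools, initialize_agent, AgentType

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="<OPENAI_API_KEY>")
tools = load_tools(["llm-math"], llm=llm)

# A ReAct-style action agent: it reasons step by step and calls tools as needed
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("If each of 12 reviewers scores 5 feedback notes, how many scores are collected in total?")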

Indexes

Indexes efficiently organize and retrieve data within LangChain and consist of several elements:

  • Document Loaders: These bring data into LangChain from various sources.
  • Text Splitters: Text Splitters break down large text chunks into manageable pieces for processing.
  • VectorStores: VectorStores store numerical representations of text, facilitating semantic-style searches.
  • Retrievers: Retrievers fetch relevant documents for interaction with language models, ensuring the system operates efficiently.
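
Putting a few of these pieces together, here is a minimal sketch that splits raw text, embeds it into a vector store, and retrieves the most relevant chunks (FAISS is chosen purely as an illustrative local vector store and requires the faiss-cpu package; the sample text is made up):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

raw_text = "Rob revised the Basic Database Course. Rob should practice active listening. Rob built several pet projects in Python."

# Text Splitter: break the text into small, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=10)
chunks = splitter.split_text(raw_text)

# VectorStore + Retriever: embed the chunks and fetch the ones most relevant to a query
vectorstore = FAISS.from_texts(chunks, OpenAIEmbeddings(openai_api_key="<OPENAI_API_KEY>"))
retriever = vectorstore.as_retriever()
print(retriever.get_relevant_documents("What did Rob do with the database course?"))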

In summary, LangChain’s comprehensive framework combines these key components to create a versatile and efficient tool for developing language model applications. It empowers developers to harness the power of language models while managing data, interactions, and tasks effectively.

Note: The details of some of these components will be discussed in other blogs in the series.

Implementing a Feedback Scoring System with LangChain to get Structured Output

Scenario

In the ever-evolving landscape of team collaboration and performance management, efficient feedback sharing among team members is pivotal. Feedback, often composed in plain English, carries the potential to drive improvements, foster growth, and enhance productivity. However, when it comes to summarizing and analyzing this feedback at the end of each semester, organizations are confronted with the Herculean task of sifting through an avalanche of information. It is this challenge that necessitates the implementation of a robust feedback scoring system, one that not only scores feedback for each individual but also ensures a consistent format, ready for integration into the Performance Management System and seamless presentation on dashboards and other applications.

Problem

The core issue at hand might initially appear straightforward. Leveraging advanced AI tools such as ChatGPT and related APIs, scoring feedback should, in theory, be a simple endeavor. However, the complexity arises from the inherent variability in the structure of the feedback. Each piece of feedback is unique, and while AI models excel at interpreting plain English text, they do not consistently conform to requests for standardized output formats. For instance, as illustrated in the accompanying image, the output format can vary significantly. Moreover, these outputs sometimes contain explanations that are not required for database insertion, resulting in additional effort to format the data.

Below are two example outputs generated by ChatGPT (without the use of any framework such as LangChain) in response to a request for structured JSON output (scores) for a given piece of feedback. The same prompt was used for both runs, yet the outputs differ: the first aligns with our desired format, while the second deviates from the specified requirements by including additional information beyond the expected JSON structure.

(The prompt and the two example outputs, Output 1 and Output 2, appear as screenshots in the original post.)

To address these challenges related to consistent output format from similar prompts, we introduce the LangChain framework, designed to streamline the process of feedback scoring and ensure consistent, structured output. LangChain is engineered to bridge the gap between the inherent flexibility of AI language models (LLMs) and the need for standardized data formatting. By leveraging LangChain, organizations can enjoy the benefits of AI-powered feedback scoring while maintaining control over the output format.

In the following sections, we will explore how LangChain can be effectively utilized to achieve structured feedback scoring, simplifying the integration of scores into your database and improving the overall efficiency of your feedback management process.

Solution

Within LangChain, several key components play pivotal roles in achieving structured and standardized outputs. These components include ChatOpenAI, ChatPromptTemplate, ResponseSchema, and StructuredOutputParser.

ChatOpenAI serves as the entry point for interacting with language models. It facilitates the communication between the user and the model, making it a crucial component for requests like feedback scoring. ChatOpenAI streamlines the process of sending prompts to the AI model and receiving responses, ensuring that the model understands and responds appropriately to specific requests.

ChatPromptTemplate is a critical part of LangChain’s toolkit for structuring interactions with language models. It allows users to define templates that guide the conversation with the model, making it easier to request structured data, like JSON output for feedback scoring. By providing predefined prompts and contexts, ChatPromptTemplate helps ensure consistency in interactions with the model and in the data generated.

ResponseSchema is the linchpin of LangChain when it comes to structuring AI-generated outputs. This component enables users to define the expected format for model responses, such as JSON structures. By specifying the schema, users can explicitly request structured data, aligning the AI model’s responses with their intended data format. This is particularly valuable in scenarios where data uniformity is critical, like the case of feedback scoring.

StructuredOutputParser complements the LangChain framework by providing tools to parse and extract structured data from the model’s responses. It plays a vital role in the feedback scoring process, where the model’s outputs may not always conform to the requested structure. StructuredOutputParser allows users to extract the essential information from model responses, discarding extraneous data and ensuring that the desired output is in a consistent format for easy integration into databases and other applications.

Together, these components within the LangChain framework empower organizations to harness the capabilities of AI language models while maintaining control over the structure and format of the data they generate. Whether it’s scoring feedback, generating reports, or any other structured data task, LangChain provides the tools needed to bridge the gap between the flexibility of AI models and the need for structured, standardized data.

Below, you will find code snippets that illustrate how to leverage ChatOpenAI, ChatPromptTemplate, ResponseSchema, and StructuredOutputParser to tackle the real-world challenges presented by our fictional Employee Feedback scoring system effectively.

Importing crucial modules from the LangChain framework for communication with language models, defining interaction templates, and structuring/parsing AI-generated responses.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

Setting up OpenAI API credentials and defining parameters for generating feedback scores based on a provided piece of feedback text.

openai_api_key = '<OPENAI_API_KEY>'
model = 'gpt-3.5-turbo'
temperature = 0
min_score = 0
max_score = 10
feedback = "- Took ownership of revising the Basic Database Course\n\
- Took ownership of his own learning and worked on pet projects\n\
- Has Python programming skills\n\
\n\
Area of Improvement: \n\
- As an SE2, Rob needs to expand his technical depth and breadth. \
He needs to grasp the technical concepts required for a Data Engineer such as data modeling, dimensional modeling, and analytics among
others.\n\
- Rob needs to learn best practices and implement them in projects, but without overengineering. \
He needs to learn how to balance technicalities and simplification.\n\
- Rob needs to be a better Active listener. \
He needs to make sure he listens to people and understands their viewpoints before answering or interrupting.\n\
- Rob needs to communicate more proactively (or ask questions when required) leading to being more reliable.\n\
- Rob needs to try to be more concise and clear when communicating."

Defining response schemas for feedback scores, including descriptions and score ranges, within the LangChain framework to structure and interpret AI-generated feedback data.

# Define response schemas for the feedback scores
overall_score_schema = ResponseSchema(
    name="Overall_Score",
    description="Overall rating of the appraisee considering all aspects/areas. "
                "The scores should range from {min_score} to {max_score}, {min_score} being the worst performer, and {max_score} being an exceptional performer.",
    type="float",
)
technical_score_schema = ResponseSchema(
    name="Technical_Score",
    description="Rating of the appraisee considering technical skill. "
                "Technical Expertise (TE) is a multidisciplinary quality that determines an individual's ability to use the right solution, tool, and/or processes in the best possible manner to get the job done....(further details) "
                "The scores should range from {min_score} to {max_score}, {min_score} being the worst performer, and {max_score} being an exceptional performer.",
    type="float",
)
communication_score_schema = ResponseSchema(
    name="Communication_Score",
    description="Rating of the appraisee considering communication skill. "
                "For someone to be effective with communication, one has to be an active listener who understands their audience or the speaker....(further details) "
                "The scores should range from {min_score} to {max_score}, {min_score} being the worst performer, and {max_score} being an exceptional performer.",
    type="float",
)
ownership_score_schema = ResponseSchema(
    name="Ownership_Score",
    description="Rating of the appraisee considering ownership traits. "
                "Ownership means being responsible for the successful implementation and execution of a project from beginning to end....(further details) "
                "The scores should range from {min_score} to {max_score}, {min_score} being the worst performer, and {max_score} being an exceptional performer.",
    type="float",
)
teamplayer_score_schema = ResponseSchema(
    name="TeamPlayer_Score",
    description="Rating of the appraisee considering team player traits....(further details) "
                "A team player is a person who works well as a member of a team. "
                "The scores should range from {min_score} to {max_score}, {min_score} being the worst performer, and {max_score} being an exceptional performer.",
    type="float",
)

response_schemas = [
    overall_score_schema,
    technical_score_schema,
    communication_score_schema,
    ownership_score_schema,
    teamplayer_score_schema,
]

Initializing an output parser and obtaining format instructions based on predefined response schemas for structured data processing within the LangChain framework.

# Initialize output parser and format instructions
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()

Creating a template for extracting feedback score information, including descriptions, from the appraiser to the appraisee using predefined response schemas and format instructions in the LangChain framework.

# Define the feedback score template (an f-string: the schema names and descriptions
# are inserted here, while {feedback} and {format_instructions} are left as prompt
# variables for LangChain to fill in later)
feedback_score_template = f"""\
For the following feedback from appraiser to appraisee, extract the following information:
{overall_score_schema.name}: {overall_score_schema.description}
{technical_score_schema.name}: {technical_score_schema.description}
{communication_score_schema.name}: {communication_score_schema.description}
{ownership_score_schema.name}: {ownership_score_schema.description}
{teamplayer_score_schema.name}: {teamplayer_score_schema.description}
feedback: {{feedback}}
{{format_instructions}}"""

Creating a feedback scoring prompt template, initializing a ChatOpenAI instance for feedback scoring, and defining a function to score feedback based on the LangChain framework, OpenAI model parameters, and format instructions.

# Create feedback score prompt template
feedback_score_prompt_template = ChatPromptTemplate.from_template(feedback_score_template)

# Initialize ChatOpenAI instance for feedback scoring
feedback_scorer = ChatOpenAI(model_name=model, temperature=temperature, openai_api_key=openai_api_key)

def score_feedback(feedback, max_score, min_score, format_instructions):
    # Fill in the remaining prompt variables and build the chat messages
    feedback_score_prompt = feedback_score_prompt_template.format_messages(
        feedback=feedback, max_score=max_score, min_score=min_score, format_instructions=format_instructions)
    # Call the model and parse its response into a dictionary of scores
    feedback_score_response = feedback_scorer(feedback_score_prompt)
    feedback_score_dict = output_parser.parse(feedback_score_response.content)
    return feedback_score_dict

Scoring the provided feedback using the LangChain framework and printing the resulting feedback score dictionary.

feedback_score_dict = score_feedback(
    feedback=feedback, max_score=max_score, min_score=min_score, format_instructions=format_instructions)
print(feedback_score_dict)

Output:

{'Overall_Score': 6.5, 'Technical_Score': 5, 'Communication_Score': 6, 'Ownership_Score': 7, 'TeamPlayer_Score': 6}

In this exploration, we’ve demonstrated how to harness the power of Language Models (LLMs) in conjunction with the LangChain framework to obtain structured outputs for quantifying feedback effectively. This methodology not only ensures uniformity in data formatting but also simplifies the integration of feedback scores into databases and other applications. Importantly, the techniques showcased here can be readily extended to address a wide array of similar use cases, offering a versatile solution to organizations seeking to extract structured insights from unstructured data in the era of AI-powered analytics.
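
Because the parsed result is already a plain Python dictionary, loading it into a database takes only a few lines. Below is a minimal sketch using the standard-library sqlite3 module (the database file, table name, and columns are assumptions for illustration):

import sqlite3

conn = sqlite3.connect("feedback_scores.db")
conn.execute("""CREATE TABLE IF NOT EXISTS feedback_scores
    (overall REAL, technical REAL, communication REAL, ownership REAL, teamplayer REAL)""")

# Insert the structured scores returned by score_feedback()
conn.execute(
    "INSERT INTO feedback_scores VALUES (?, ?, ?, ?, ?)",
    (feedback_score_dict["Overall_Score"], feedback_score_dict["Technical_Score"],
     feedback_score_dict["Communication_Score"], feedback_score_dict["Ownership_Score"],
     feedback_score_dict["TeamPlayer_Score"]),
)
conn.commit()
conn.close()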

In the next blog in this series, we will delve into the world of vector databases and their crucial role when working with LLMs. We will highlight the advantages of using vector databases and compare this approach with fine-tuning LLMs. Additionally, we will demonstrate their application by developing a simple yet powerful QA chatbot tailored for organizational data. Stay tuned for an exciting exploration into the synergy between LLMs, LangChain, and vector databases!


Published via Towards AI


