
Structured Data Extraction from LLMs using DSPy Assertions and Qdrant

Last Updated on July 3, 2024 by Editorial Team

Author(s): Ashish Abraham

Originally published on Towards AI.

Photo by Kelly Sikkema on Unsplash

Prompt templates and techniques have been around since the advent of Large Language Models (LLMs). LLMs are sensitive to the prompts used for tasks, especially in pipelines where multiple LLMs need to work together. Currently, most LLMs and developer frameworks use fixed ‘prompt templates’, which are long, manually created instruction strings. However, this method can be fragile and hard to scale, similar to manually adjusting a classifier’s weights. Also, a specific prompt may not work well across different pipelines, LLMs, data types, or inputs.

While methods such as fine-tuning and prompt engineering have seen increased usage, these strategies can be labor-intensive and heavily dependent on manual intervention to ensure LLMs comply with certain constraints. When LLMs started to be used for function calling and specific tasks, they were required to deliver outputs in certain formats. Yet, the consistency of these outputs is not always assured with traditional prompt engineering.

Table of Contents

· DSPy
1. Signatures
2. Modules
3. Metrics
4. Optimizers (formerly Teleprompters)
· Assertions in DSPy
· RAG for Structured Data Extraction
Prerequisites
Setting Up the Large Language Model
Setting Up Qdrant
Database Preparation
Implementing the Information Extraction Pipeline
Implementing the Pipeline with DSPy TypedPredictors
Implementing the Pipeline with DSPy Assertions
· Wrapping Up
· References

DSPy

Programming — not prompting

The Stanford NLP team released DSPy (Declarative Self-improving Python) in 2023, a radically different approach to building and managing language model pipelines. As the tagline suggests, it is about 'programming' rather than prompting. DSPy introduced a strategy of algorithmically optimizing prompts instead of adjusting them manually. It combines methods for prompting, fine-tuning LLMs, and even compiling whole pipelines, enhancing them with logical reasoning and retrieval augmentation. All of these are represented through a streamlined set of Python operations that are capable of composition and learning. This enables models to refine their responses and even backtrack to present the user with the preferred type of response. You will see how this works later in this blog.

DSPy revolves around the following concepts:

1. Signatures

Signatures serve as explicit blueprints of the anticipated input/output actions for LLMs. They instruct the model on what exactly to do rather than providing information on how to do it.

For simple tasks, inline signatures can be used. They are simple strings that define the task precisely.

Question Answering: "question -> answer"

Sentiment Classification: "sentence -> sentiment"

Summarization: "document -> summary"

For more complex tasks, custom signatures can be defined as classes. These classes are subclasses of dspy.Signature.

2. Modules

A DSPy module encapsulates a prompting technique and is designed to work with any DSPy Signature. These modules contain learnable parameters and can process inputs to produce outputs. They can be combined to form larger modules, drawing inspiration from PyTorch’s neural network modules but tailored for LLM programs.

There are specific modules for abstracting prompting techniques:

dspy.Predict: For simple prompting.

dspy.ChainOfThought: For step-by-step reasoning.

dspy.ProgramOfThought: Instructs the model to generate code, with the execution outcomes determining the response.

dspy.ReAct: An agent that can use tools to implement the given signature.

dspy.MultiChainComparison: Compares multiple outputs from ChainOfThought and produces a final result.

3. Metrics

As in all machine learning frameworks, a metric compares the model's output with the expected output and assigns a score; the optimizer works from this score.
Some of the available built-in metrics are dspy.evaluate.metrics.answer_exact_match and dspy.evaluate.metrics.answer_passage_match.
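A custom metric is just a Python function with the signature `(example, prediction, trace=None)` that returns a score. Here is a minimal, LLM-free sketch; the `SimpleNamespace` objects stand in for DSPy's `Example` and `Prediction`, and the question/answer values are made up:

```python
from types import SimpleNamespace

def answer_match_metric(example, pred, trace=None) -> float:
    """Score 1.0 when the predicted answer matches the gold answer (case-insensitive)."""
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

# Stand-ins for dspy.Example / dspy.Prediction objects.
gold = SimpleNamespace(question="What is the capital of France?", answer="Paris")
prediction = SimpleNamespace(answer="paris")

print(answer_match_metric(gold, prediction))  # 1.0
```

An optimizer then tunes the program to maximize the average of such scores over a small training set.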

4. Optimizers (formerly Teleprompters)

The prompt, LLM weights, and signatures are the parameters of the DSPy program, and Optimizers are used to tune them according to the performance indicated by Metrics.

You will need a few training examples to get started. BootstrapFewShot, BootstrapFewShotWithRandomSearch, BootstrapFinetune, and MIPRO are some of the built-in optimizers available in DSPy.

DSPy has great integrations with the Qdrant vector database which enable you to seamlessly use them together to develop robust Retrieval-Augmented Generation (RAG) systems.

Assertions in DSPy

For guiding the LLMs toward the desired outputs without manual work, DSPy offers a feature called Assertions. When the output format or constraint is not met, assertions offer functionalities like:

  1. Backtracking: The model is provided with a feedback message and a validation function that helps it refine the prompt and produce the output again until the constraint is met.
  2. Dynamic Signature Modification: The past output and feedback message are used to make changes in the signature to guide the model toward the desired output.

Assertions can be implemented with two constructs:

  • dspy.Suggest: It enables gradual refining of the prompt and model output by backtracking, without explicitly halting the pipeline, in case constraints are not met.
  • dspy.Assert: It retries upon failure but if failures continue, it stops the operation and triggers a dspy.AssertionError.

Further details will be elaborated in the tutorial section of this blog.

RAG for Structured Data Extraction

That was a quick overview of DSPy and its relevance. In this tutorial, we will implement a Retrieval-Augmented Generation (RAG) system with Llama 3, which will retrieve the best resume from the database based on the user’s query, parse the resume text, and return the extracted information in JSON format. We will explore how the output changes as we implement a simple DSPy extractor, extractor with DSPy TypedPredictors, and finally, with DSPy Assertions.

Prerequisites

For the tutorial, we will use Qdrant as the vector database and Groq as the LLM engine.
Install the required libraries before proceeding.

pip install -U qdrant-client
pip install --upgrade protobuf google-cloud-aiplatform
pip install dspy-ai[qdrant]
pip install groq

Setting Up the Large Language Model

As of now, Groq provides LLMs as easy-to-use, free APIs on its cloud platform. Go to this page to get started and create an API key.

Image By Author

Setting Up Qdrant

Make sure you have Docker installed and keep the Docker engine running if you are using a local environment. Qdrant can be installed by downloading its Docker image.

docker pull qdrant/qdrant

Run the Qdrant Docker container using the following command.

docker run -p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant

Alternatively, you can start the container from the Docker desktop console.

Image By Author

Only then will you be able to start the Qdrant client.

Database Preparation

Start a Qdrant client in localhost, memory, or remote server as required.

from qdrant_client import QdrantClient, models

# Option 1: Connecting to a remote Qdrant server
# Replace "URL" with the actual URL of your remote Qdrant server
# Replace "API_KEY" with your Qdrant API key
# qdrant_client = QdrantClient(
#     url="URL",
#     api_key="API_KEY",
# )

# Option 2: Connecting to an in-memory Qdrant instance (for development)
# This launches a temporary Qdrant instance in memory, useful for development purposes.
# qdrant_client = QdrantClient(":memory:")

# Option 3: Connecting to a local Qdrant server
# This connects to a Qdrant server running on the same machine (localhost) on port 6333 (the default).
qdrant_client = QdrantClient(host='localhost', port=6333)

Let’s populate the database with resume data. Here I have used a sample dataset from Hugging Face. Feel free to use any data you like or as required.

from datasets import load_dataset
import random

# Load the dataset from HuggingFace
dataset = load_dataset("DevashishBhake/resume_section_classification")

Extract the resume text field and create a collection in the database.

rows = dataset["train"]
rows_list = list(rows)

random_rows = random.sample(rows_list, 10)
resume_content = []

# Iterate over the random rows and append the resume content to the list
for row in random_rows:
    resume_content.append(row["Resume"])

# Create the collection
ids = list(range(0, 10))
documents = resume_content
qdrant_client.add(
    collection_name="resume-collection",
    documents=documents,
    ids=ids,
)
print(qdrant_client.get_collections())

If you are running Qdrant Client on localhost, you can confirm the collection at http://localhost:6333/dashboard.

Implementing the Information Extraction Pipeline

Now we will cover each component of the pipeline step-by-step. Define the retriever component using the Qdrant integrations in DSPy.

from dspy.retrieve.qdrant_rm import QdrantRM

qdrant_retriever = QdrantRM(
    qdrant_collection_name="resume-collection",
    qdrant_client=qdrant_client,
)

You can test the retriever using the following code.

results = qdrant_retriever("Candidates with HR(Human Resource) profiles", k=3)

for result in results:
    print("Document:", result.long_text, "\n")

Next, define and test the LLM component using the Groq LLM integration in DSPy. I have used the Llama-3-8B endpoint here. Llama 3 is the latest state-of-the-art language model family developed by Meta, known for its exceptional performance across various applications. It stands out due to its large-scale training on a diverse dataset and its efficient language encoding capabilities.

import dspy

LLM = dspy.GROQ(model='llama3-8b-8192', api_key="API_KEY")
LLM("Say Hi!!")

Configure both retriever and LLM components.

dspy.configure(lm=LLM, rm=qdrant_retriever)

Signature

Define the signatures for both input and output. For the resume parser functionality, I have given single input and output fields. For more complex tasks, you can use multiple input fields.

Create the signature class and define the fields:

  1. description: The textual content of the resume. This is the input field.
  2. parsed_content: The extracted JSON string with three fields: name, email, and years_of_experience. This is the output field.

Notice the descriptions included with each field to help the LLM better understand how to structure the input and output.

class Candidate(dspy.Signature):
    """
    You are a resume parser. Parse this textual resume and return the extracted information as a single JSON string with the following format:

    {
        "name": "<candidate name>",
        "email": "<candidate email>",
        "years_of_experience": <number of years>
    }
    """

    description = dspy.InputField(desc="Textual content of the candidate's resume. This should be plain text containing the candidate's work experience, skills, and other relevant information.")
    parsed_content = dspy.OutputField(desc="Extracted JSON string. This must follow a JSON format with three fields: name, email, and years_of_experience only!")

Module

This is the main RAG module that performs the task. At first glance, DSPy modules might appear more intricate than the RAG functions provided by systems such as LangChain or LlamaIndex. But the class itself is self-explanatory.

class CandidateExtractor(dspy.Module):

    def __init__(self):
        super().__init__()
        # Retriever module to get relevant documents
        self.retriever = qdrant_retriever
        # Predictor module for the created signature
        self.predict = dspy.Predict(Candidate)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever(query)
        candidate = self.predict(description=results[0]["long_text"])
        return candidate.parsed_content

The class has two components that perform the task as shown in the workflow below.

Workflow with dspy.Predict (Image By Author)

Run the extractor to see how it works.

extractor = CandidateExtractor()
response = extractor("Get the details of candidate with business majors.")
response

Here is the expected output.

'Here is the parsed JSON string:\n\n{\n "name": "Phyllis Physical",\n "email": "[email protected]",\n "years_of_experience": 0\n}\n\nNote: The years of experience is set to 0 as the resume does not explicitly mention the number of years of experience.'

As you can see, the output is not just a single JSON string as we wanted. It includes acknowledgments and additional information from the LLM.
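Before reaching for DSPy's own machinery, one quick framework-independent workaround is to post-process the chatty response and pull out the JSON object it contains. A minimal sketch; the `extract_json` helper and the sample string are illustrative, not part of the DSPy API, and the candidate details are made up:

```python
import json
import re

def extract_json(raw_response: str) -> dict:
    """Pull the first {...} block out of a chatty LLM response and parse it."""
    match = re.search(r"\{.*\}", raw_response, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in response")
    return json.loads(match.group(0))

raw_response = ('Here is the parsed JSON string:\n\n'
                '{\n "name": "Jane Doe",\n "email": "jane@example.com",\n "years_of_experience": 0\n}\n\n'
                'Note: years of experience not stated.')
parsed = extract_json(raw_response)
print(parsed["name"])  # Jane Doe
```

This is brittle (it assumes a single well-formed JSON object in the text), which is exactly why the sections below move the constraint into the pipeline itself.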

Implementing the Pipeline with DSPy TypedPredictors

DSPy Signatures utilize InputField and OutputField to characterize the inputs and outputs of a field. However, these fields always accept and return string data, necessitating string processing for both inputs and outputs.

While Pydantic BaseModel is an excellent tool for applying type restrictions to fields, it doesn’t seamlessly integrate with dspy.Signature. This is where TypedPredictors come in, offering a solution to enforce type constraints on the inputs and outputs of fields within a dspy.Signature.
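As a standalone illustration of the guarantee this adds, here is plain Pydantic validation, independent of DSPy; the model mirrors the Selection class used in this section, and the field values are made up:

```python
import pydantic

# Stand-alone Pydantic model mirroring the Selection class used in this section.
class Selection(pydantic.BaseModel):
    name: str
    email_id: str
    years_of_experience: str

# Conforming input validates cleanly...
ok = Selection(name="Jane Doe", email_id="jane@example.com", years_of_experience="8")

# ...while a wrong-typed field is rejected, which is the guarantee
# TypedPredictor layers onto an LLM's free-form output.
try:
    Selection(name=["not", "a", "string"], email_id="jane@example.com", years_of_experience="8")
    rejected = False
except pydantic.ValidationError:
    rejected = True
print(ok.name, rejected)  # Jane Doe True
```

A validated object can still be serialized afterwards (`.model_dump_json()` in Pydantic v2, `.json()` in v1), which foreshadows the JSON-output requirement addressed later with assertions.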

Here we’ll try a different approach by providing a class to define the structure of the output.


from dspy.functional import TypedPredictor
import pydantic

class Selection(pydantic.BaseModel):
    name: str
    email_id: str
    years_of_experience: str

Signature

Define a signature similar to the previous one, but this time the output field is an object of the class we just defined. Notice how the field descriptions have been tweaked.

class Candidate(dspy.Signature):
    """
    You are a resume parser. Parse this textual resume and return the extracted information with the following fields only:
    "name": "<candidate name>",
    "email": "<candidate email>",
    "years_of_experience": <number of years>
    """

    description: str = dspy.InputField(desc="Textual content of the candidate's resume. This should be plain text containing the candidate's work experience, skills, and other relevant information.")
    parsed_content: Selection = dspy.OutputField(desc="Extracted information with name, email and years_of_experience only! Don't include any other information.")

Module

Define the custom module as required.

class CandidateExtractor(dspy.Module):

    def __init__(self):
        super().__init__()
        # Retriever module to get relevant documents
        self.retriever = qdrant_retriever
        # TypedPredictor module for the created signature
        self.predict = dspy.functional.TypedPredictor(Candidate)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever(query)
        print(results)
        candidate = self.predict(description=results[0]["long_text"])
        return candidate.parsed_content

extractor = CandidateExtractor()
response = extractor("Get the details of candidate with business majors.")
response

Workflow with dspy.TypedPredictor (Image By Author)

Here is the expected output.

Selection(name='Sarah K. Davis', email_id='[email protected]', years_of_experience=8)

That didn’t exactly meet our requirements, did it? What if we need responses in a format that can be directly used? A format such as JSON is often more desirable than a custom-defined class object.

Implementing the Pipeline with DSPy Assertions

Now we will try extracting a JSON format response using DSPy assertions. Define a helper function to validate the required JSON format and also define a feedback message to prompt the LLM when the assertion condition is not met.

import json

def assert_valid_json(candidate_json):
    """
    Check whether the provided JSON string matches the expected format exactly.

    Expected format:
    {
        "name": "<candidate name>",
        "email": "<candidate email>",
        "years_of_experience": <number of years>
    }

    Args:
        candidate_json (str): The JSON string to be validated.

    Returns:
        bool: True if the JSON matches the format exactly, False otherwise.
    """
    try:
        # Parse the JSON string
        json_data = json.loads(candidate_json)

        # Check for exact key match and data types
        return (
            len(json_data) == 3 and  # Exactly 3 keys
            all(key in json_data for key in ["name", "email", "years_of_experience"]) and
            isinstance(json_data["name"], str) and  # Name as string
            isinstance(json_data["email"], str)  # Email as string
        )
    except json.JSONDecodeError:
        return False


failed_assertion_message = """
Output must be a single JSON string only, in this format!
{
"name": "<candidate name>",
"email": "<candidate email>",
"years_of_experience": <number of years>
}Remove any additional information from the response, including your instructions or greetings.
"""
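Before wiring the constraint into the pipeline, it is worth exercising the validator on hand-made strings. The check is restated in condensed form below so the snippet runs on its own; the sample candidate values are made up:

```python
import json

def assert_valid_json(candidate_json: str) -> bool:
    """Condensed restatement of the validator: exactly three expected keys, string name/email."""
    try:
        data = json.loads(candidate_json)
    except json.JSONDecodeError:
        return False
    return (
        len(data) == 3
        and all(k in data for k in ("name", "email", "years_of_experience"))
        and isinstance(data["name"], str)
        and isinstance(data["email"], str)
    )

good = '{"name": "Jane Doe", "email": "jane@example.com", "years_of_experience": 4}'
chatty = 'Sure! Here is the JSON: {"name": "Jane Doe", ...}'
print(assert_valid_json(good), assert_valid_json(chatty))  # True False
```

The chatty string fails because it is not a bare JSON object, which is precisely the failure mode the feedback message is designed to correct.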

This works by backtracking, offering the model a chance to self-refine and proceed. The signature is also adjusted dynamically by adding the previous output and the feedback message.

Signature

The signature is the same as in the first instance.

class Candidate(dspy.Signature):
"""
You are a resume parser. Parse this textual resume and return the extracted information as a single JSON string with the following format:

{
"name": "<candidate name>",
"email": "<candidate email>",
"years_of_experience": <number of years>
}
"""

description = dspy.InputField(desc="Textual content of the candidate's resume. This should be plain text containing the candidate's work experience, skills, and other relevant information.")
parsed_content= dspy.OutputField(desc="Extracted JSON string. This must follow a json format with three fields name, email and years_of_experience only!")

Module

In this part, we add the DSPy assertion to check the output format and backtrack if necessary. dspy.Suggest offers a more lenient form of enforcement that does not halt the pipeline when the output constraints are not met. The program records the ongoing failure and proceeds with executing the remaining data.

dspy.Suggest takes two parameters:

  1. A helper function that produces a boolean result.
  2. The feedback message string.

class CandidateExtractorWithAssertions(dspy.Module):

    def __init__(self):
        super().__init__()
        # Retriever module to get relevant documents
        self.retriever = qdrant_retriever
        # Predictor module for the created signature
        self.predict = dspy.Predict(Candidate)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever(query)
        candidate = self.predict(description=results[0]["long_text"])
        dspy.Suggest(assert_valid_json(candidate.parsed_content), failed_assertion_message)
        return candidate.parsed_content

The correct output is produced once the constraint is met, or the pipeline keeps retrying until it reaches max_backtracking_attempts.
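Mechanically, this backtracking behaves like a bounded retry loop: generate, validate, and on failure re-prompt with the previous output and the feedback message appended. A toy pure-Python simulation of that control flow, where `fake_llm` is a stub standing in for the real model and the constant mirrors DSPy's max_backtracking_attempts setting:

```python
import json

MAX_BACKTRACKING_ATTEMPTS = 2  # mirrors DSPy's max_backtracking_attempts setting

def is_valid_json(text: str) -> bool:
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

def fake_llm(prompt: str) -> str:
    """Stub model: produces clean JSON only after the feedback appears in the prompt."""
    if "single JSON string only" in prompt:
        return '{"name": "Jane Doe", "email": "jane@example.com", "years_of_experience": 4}'
    return "Sure! Here is the candidate's information as JSON: ..."

def run_with_backtracking(prompt: str, feedback: str) -> str:
    output = fake_llm(prompt)
    for _ in range(MAX_BACKTRACKING_ATTEMPTS):
        if is_valid_json(output):
            break
        # Backtrack: re-prompt with the previous output and the feedback message,
        # which is what DSPy's dynamic signature modification amounts to.
        prompt = f"{prompt}\n\nPrevious output:\n{output}\n\n{feedback}"
        output = fake_llm(prompt)
    return output

result = run_with_backtracking(
    "Parse this resume into JSON.",
    "Output must be a single JSON string only, in this format! ...",
)
print(json.loads(result)["name"])  # Jane Doe
```

The first attempt fails validation, the feedback is appended, and the second attempt succeeds — the same sequence visible in the log output below.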

Workflow with dspy assertions (Image By Author)

Observe the output to understand the sequence of operations that occurred.

ERROR:dspy.primitives.assertions:2024-06-23T18:14:36.187020Z [error ] SuggestionFailed: 
Output must be a single JSON string only, in this format!
{
"name": "<candidate name>",
"email": "<candidate email>",
"years_of_experience": <number of years>
}Remove any additional information from the response, including your instructions or greetings.
[dspy.primitives.assertions] filename=assertions.py lineno=111
'{\n "name": "Sarah K. Davis",\n "email": "[email protected]",\n "years_of_experience": 8\n}'

Now we have received the output in the required JSON format. If required, we can wrap the whole thing in an API to implement a resume parser.
Find the complete notebook here.

Wrapping Up

Kudos on making it this far! To sum up, we have learned:

  • Concepts behind DSPy and why it is needed.
  • How to develop a RAG system using DSPy and Qdrant.
  • Getting structured responses from an LLM by imposing constraints on the output using prompt templates and DSPy TypedPredictors.
  • Using DSPy Assertions to refine and precisely structure the desired response.

LLMs find extensive applications in data analysis and data engineering. In some contexts, precise output structuring becomes critical, especially when an LLM is integrated into a larger pipeline. This is where DSPy plays a pivotal role. Be sure to check out the DSPy paper and codebase.

If you enjoyed reading this article, a clap would mean a lot! 👏

Follow and stay tuned for more! 💖

References

About DSPy | DSPy — dspy-docs.vercel.app

Stanford DSPy – Qdrant — qdrant.tech

Prompt Like a Pro Using DSPy: A guide to build a better local RAG model using DSPy, Qdrant and… — medium.com

GroqCloud: Experience the fastest inference in the world — console.groq.com


