Structured Data Extraction from LLMs using DSPy Assertions and Qdrant
Last Updated on July 3, 2024 by Editorial Team
Author(s): Ashish Abraham
Originally published on Towards AI.
Prompt templates and techniques have been around since the advent of Large Language Models (LLMs). LLMs are sensitive to the prompts used for tasks, especially in pipelines where multiple LLMs need to work together. Currently, most LLMs and developer frameworks use fixed "prompt templates", which are long, manually created instruction strings. However, this method can be fragile and hard to scale, similar to manually adjusting a classifier's weights. Also, a specific prompt may not work well across different pipelines, LLMs, data types, or inputs.
While methods such as fine-tuning and prompt engineering have seen increased usage, these strategies can be labor-intensive and heavily dependent on manual intervention to ensure LLMs comply with certain constraints. When LLMs started to be used for function calling and specific tasks, they were required to deliver outputs in certain formats. Yet, the consistency of these outputs is not always assured with traditional prompt engineering.
Table of Contents
· DSPy
  ∘ 1. Signatures
  ∘ 2. Modules
  ∘ 3. Metrics
  ∘ 4. Optimizers (earlier, Teleprompters)
· Assertions in DSPy
· RAG for Structured Data Extraction
  ∘ Prerequisites
  ∘ Setting Up the Large Language Model
  ∘ Setting Up Qdrant
  ∘ Database Preparation
  ∘ Implementing the Information Extraction Pipeline
  ∘ Implementing the Pipeline with DSPy TypedPredictors
  ∘ Implementing the Pipeline with DSPy Assertions
· Wrapping Up
· References
DSPy
"Programming, not prompting"
The Stanford NLP team released DSPy in 2023, a radically different approach to building and managing language model pipelines. As the tagline suggests, it is about "programming" rather than prompting. DSPy introduced a strategy of algorithmically optimizing prompts instead of adjusting them manually. It combines methods for prompting, refining LLMs, and even compiling whole pipelines, enhancing them with logical reasoning and retrieval augmentation. All of these are represented through a streamlined set of Python operations that are capable of composition and learning. This enables models to refine their responses and even backtrack to present the user with the preferred type of response. You will see how this works later in this blog.
DSPy revolves around the following concepts:
1. Signatures
Signatures serve as explicit blueprints of the anticipated input/output actions for LLMs. They instruct the model on what exactly to do rather than providing information on how to do it.
For simple tasks, inline signatures can be used. They are simple strings that define the task precisely.
Question Answering: "question -> answer"
Sentiment Classification: "sentence -> sentiment"
Summarization: "document -> summary"
For more complex tasks, custom signatures can be defined as classes that subclass dspy.Signature.
2. Modules
A DSPy module encapsulates a prompting technique and is designed to work with any DSPy Signature. These modules contain learnable parameters and can process inputs to produce outputs. They can be combined to form larger modules, drawing inspiration from PyTorchβs neural network modules but tailored for LLM programs.
There are specific modules for abstracting prompting techniques:
- dspy.Predict: For simple prompting.
- dspy.ChainOfThought: For step-by-step reasoning.
- dspy.ProgramOfThought: Instructs the model to generate code, with the execution outcomes determining the response.
- dspy.ReAct: An agent that can use tools to implement the given signature.
- dspy.MultiChainComparison: Compares multiple outputs from ChainOfThought and produces a final result.
3. Metrics
As in other machine learning frameworks, a metric is used to compare the model output with the expected output and assign a score. The optimizer then uses this score to tune the program.
Some of the available built-in metrics are dspy.evaluate.metrics.answer_exact_match and dspy.evaluate.metrics.answer_passage_match.
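A DSPy metric is just a Python function that takes an example, a prediction, and an optional trace, and returns a score. As a minimal sketch (this helper is hypothetical, not a built-in DSPy metric; it is written with the JSON-extraction task later in this post in mind):

```python
import json

def valid_json_metric(example, pred, trace=None):
    """Return 1.0 if pred.parsed_content is a JSON object with exactly the
    three expected keys, else 0.0."""
    try:
        data = json.loads(pred.parsed_content)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if set(data) == {"name", "email", "years_of_experience"} else 0.0
```

A function like this can then be handed to an optimizer or an evaluator, which calls it on every example to score the program.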
4. Optimizers (earlier, Teleprompters)
The prompt, LLM weights, and signatures are the parameters of the DSPy program, and Optimizers are used to tune them according to the performance indicated by Metrics.
You will need a few training examples to get started. BootstrapFewShot, BootstrapFewShotWithRandomSearch, BootstrapFinetune, and MIPRO are some of the built-in optimizers available in DSPy.
DSPy has great integrations with the Qdrant vector database, enabling you to use the two together seamlessly to develop robust Retrieval-Augmented Generation (RAG) systems.
Assertions in DSPy
To guide LLMs toward the desired outputs without manual work, DSPy offers a feature called Assertions. When the output format or a constraint is not met, assertions offer functionalities like:
- Backtracking: The model is provided with a feedback message and a validation function that helps it refine the prompt and produce the output again until the constraint is met.
- Dynamic Signature Modification: The past output and feedback message are used to make changes in the signature to guide the model toward the desired output.
Assertions can be implemented with two constructs:
- dspy.Suggest: Enables gradual refinement of the prompt and model output by backtracking, without explicitly halting the pipeline when constraints are not met.
- dspy.Assert: Retries upon failure, but if failures continue, it stops the operation and raises a dspy.AssertionError.
Further details will be elaborated in the tutorial section of this blog.
RAG for Structured Data Extraction
That was a quick overview of DSPy and its relevance. In this tutorial, we will implement a Retrieval-Augmented Generation (RAG) system with Llama 3, which will retrieve the best resume from the database based on the userβs query, parse the resume text, and return the extracted information in JSON format. We will explore how the output changes as we implement a simple DSPy extractor, extractor with DSPy TypedPredictors, and finally, with DSPy Assertions.
Prerequisites
For the tutorial, we will use Qdrant as the vector database and Groq as the LLM engine.
Install the required libraries before proceeding.
pip install -U qdrant-client
pip install --upgrade protobuf google-cloud-aiplatform
pip install dspy-ai[qdrant]
pip install groq
Setting Up the Large Language Model
As of now, Groq provides LLMs as easy-to-use, free APIs on its cloud platform. Go to the GroqCloud console to get started and create an API key.
Setting Up Qdrant
Make sure you have Docker installed and keep the Docker engine running if you are using a local environment. Qdrant can be installed by downloading its Docker image.
docker pull qdrant/qdrant
Run the Qdrant Docker container using the following command.
docker run -p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant
Alternatively, you can start the container from the Docker desktop console.
Only then will you be able to start the Qdrant client.
Database Preparation
Start a Qdrant client against localhost, an in-memory instance, or a remote server as required.
from qdrant_client import QdrantClient, models
# Option 1: Connecting to a Remote Qdrant Server
# Replace "URL" with the actual URL of your remote Qdrant server
# Replace "API_KEY" with your Qdrant API key
# qdrant_client = QdrantClient(
# url="URL",
# api_key="API_KEY",
# )
# Option 2: Connecting to an In-Memory Qdrant Instance (for development)
# This option launches a temporary Qdrant instance in memory, useful for development purposes.
# qdrant_client = QdrantClient(":memory:")
# Option 3: Connecting to a Local Qdrant Server
# This option connects to a Qdrant server running on the same machine (localhost) on port 6333 (default port).
qdrant_client = QdrantClient(host='localhost', port=6333)
Let's populate the database with resume data. Here I have used a sample dataset from Hugging Face. Feel free to use any dataset you like.
from datasets import load_dataset
import random
# Load the dataset from HuggingFace
dataset = load_dataset("DevashishBhake/resume_section_classification")
Extract the resume text field and create a collection in the database.
rows = dataset["train"]
rows_list = list(rows)
random_rows = random.sample(rows_list, 10)

resume_content = []
# Iterate over the random rows and append the resume content to the list
for row in random_rows:
    resume_content.append(row["Resume"])

# Create the collection
ids = list(range(0, 10))
documents = resume_content

qdrant_client.add(
    collection_name="resume-collection",
    documents=documents,
    ids=ids,
)
print(qdrant_client.get_collections())
print(qdrant_client.get_collections())
If you are running Qdrant Client on localhost, you can confirm the collection at http://localhost:6333/dashboard.
Implementing the Information Extraction Pipeline
Now we will cover each component of the pipeline step-by-step. Define the retriever component using the Qdrant integrations in DSPy.
from dspy.retrieve.qdrant_rm import QdrantRM
qdrant_retriever = QdrantRM(
    qdrant_collection_name="resume-collection",
    qdrant_client=qdrant_client,
)
You can test the retriever using the following code.
results = qdrant_retriever("Candidates with HR (Human Resource) profiles", k=3)
for result in results:
    print("Document:", result.long_text, "\n")
Next, define and test the LLM component using the Groq LLM integration in DSPy. I have used the Llama-3 8B endpoint here. Llama-3 is a state-of-the-art language model family developed by Meta, known for its strong performance across a wide range of applications, its large-scale training on a diverse dataset, and its efficient language encoding capabilities.
import dspy

LLM = dspy.GROQ(model='llama3-8b-8192', api_key="API_KEY")
LLM("Say Hi!!")
Configure both retriever and LLM components.
dspy.configure(lm=LLM, rm=qdrant_retriever)
Signature
Define the signatures for both input and output. For the resume parser functionality, I have given single input and output fields. For more complex tasks, you can use multiple input fields.
Create the signature class and define the fields:
- description: Textual content from the resume. This is the input field.
- parsed_content: The extracted JSON string with three fields: name, email, and years_of_experience. This is the output field.
Notice the descriptions included with each field to help the LLM better understand how to structure the input and output.
class Candidate(dspy.Signature):
    """
    You are a resume parser. Parse this textual resume and return the extracted information as a single JSON string with the following format:
    {
        "name": "<candidate name>",
        "email": "<candidate email>",
        "years_of_experience": <number of years>
    }
    """
    description = dspy.InputField(desc="Textual content of the candidate's resume. This should be plain text containing the candidate's work experience, skills, and other relevant information.")
    parsed_content = dspy.OutputField(desc="Extracted JSON string. This must follow a JSON format with three fields: name, email and years_of_experience only!")
Module
This is the main RAG module that performs the task. At first glance, DSPy modules might appear more intricate than the RAG functions provided by systems such as LangChain or LlamaIndex. But the class itself is self-explanatory.
class CandidateExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retriever module to get the relevant document
        self.retriever = qdrant_retriever
        # Predict module for the created signature
        self.predict = dspy.Predict(Candidate)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever(query)
        candidate = self.predict(description=results[0]["long_text"])
        return candidate.parsed_content
The class has two components, which together perform the task.
Run the extractor to see how it works.
extractor = CandidateExtractor()
response = extractor("Get the details of candidate with business majors.")
response
Here is the expected output.
'Here is the parsed JSON string:\n\n{\n "name": "Phyllis Physical",\n "email": "[email protected]",\n "years_of_experience": 0\n}\n\nNote: The years of experience is set to 0 as the resume does not explicitly mention the number of years of experience.'
As you can see, the output is not just a single JSON string as we wanted. It includes acknowledgments and additional information from the LLM.
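One quick, purely mechanical workaround at this point (not part of the DSPy pipeline itself; the sample response below is made up for illustration) is to post-process the chatty response and pull out the first JSON object with a regular expression:

```python
import json
import re

def extract_json(llm_response: str) -> dict:
    """Pull the first {...} block out of a chatty LLM response and parse it.
    Assumes the JSON object itself contains no nested braces."""
    match = re.search(r"\{.*?\}", llm_response, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in response")
    return json.loads(match.group(0))

noisy = ('Here is the parsed JSON string:\n\n'
         '{\n "name": "Phyllis Physical",\n "email": "phyllis@example.com",\n'
         ' "years_of_experience": 0\n}\n\nNote: years of experience set to 0.')
print(extract_json(noisy)["name"])  # Phyllis Physical
```

This kind of string surgery is brittle, though, which is exactly the motivation for the TypedPredictor and Assertions approaches that follow.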
Implementing the Pipeline with DSPy TypedPredictors
DSPy Signatures utilize InputField and OutputField to characterize the inputs and outputs of a field. However, these fields always accept and return string data, necessitating string processing for both inputs and outputs.
While Pydantic BaseModel is an excellent tool for applying type restrictions to fields, it doesnβt seamlessly integrate with dspy.Signature. This is where TypedPredictors come in, offering a solution to enforce type constraints on the inputs and outputs of fields within a dspy.Signature.
Here weβll try a different approach by providing a class to define the structure of the output.
from dspy.functional import TypedPredictor
import pydantic

class Selection(pydantic.BaseModel):
    name: str
    email_id: str
    years_of_experience: str
Signature
Define the signature similar to the previous one, but the output field will now be an object of the class we just defined. Notice how the descriptions are tweaked.
class Candidate(dspy.Signature):
    """
    You are a resume parser. Parse this textual resume and return the extracted information with the following fields only:
    "name": "<candidate name>",
    "email": "<candidate email>",
    "years_of_experience": <number of years>
    """
    description: str = dspy.InputField(desc="Textual content of the candidate's resume. This should be plain text containing the candidate's work experience, skills, and other relevant information.")
    parsed_content: Selection = dspy.OutputField(desc="Extracted information with name, email and years_of_experience only! Don't include any other information.")
Module
Define the custom module as required.
class CandidateExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retriever module to get relevant documents
        self.retriever = qdrant_retriever
        # TypedPredictor module for the created signature
        self.predict = TypedPredictor(Candidate)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever(query)
        print(results)
        candidate = self.predict(description=results[0]["long_text"])
        return candidate.parsed_content

extractor = CandidateExtractor()
response = extractor("Get the details of candidate with business majors.")
response
Here is the expected output.
Selection(name='Sarah K. Davis', email_id='[email protected]', years_of_experience=8)
That didn't exactly meet our requirements, did it? What if we need responses in a format that can be used directly? A format such as JSON is often more desirable than a custom-defined class object.
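That said, if you do keep the TypedPredictor approach, a Selection object can be turned into JSON directly, since it is a Pydantic model. A small sketch (the field values here are illustrative; model_dump_json is the Pydantic v2 method, with .json() as the v1 fallback):

```python
import pydantic

class Selection(pydantic.BaseModel):
    name: str
    email_id: str
    years_of_experience: str

selection = Selection(name="Sarah K. Davis",
                      email_id="sarah@example.com",
                      years_of_experience="8")

# Prefer Pydantic v2's model_dump_json, fall back to v1's .json()
dump = getattr(selection, "model_dump_json", None) or selection.json
json_str = dump()
print(json_str)
```

This gives you a JSON string without touching the pipeline, at the cost of an extra conversion step after every call.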
Implementing the Pipeline with DSPy Assertions
Now we will try extracting a JSON format response using DSPy assertions. Define a helper function to validate the required JSON format and also define a feedback message to prompt the LLM when the assertion condition is not met.
import json

def assert_valid_json(candidate_json):
    """
    Check if the provided JSON string matches the expected format exactly.
    Expected format:
    {
        "name": "<candidate name>",
        "email": "<candidate email>",
        "years_of_experience": <number of years>
    }

    Args:
        candidate_json (str): The JSON string to be validated.

    Returns:
        bool: True if the JSON matches the format exactly, False otherwise.
    """
    try:
        # Parse the JSON string
        json_data = json.loads(candidate_json)
        # Check for an exact key match and data types
        return (
            len(json_data) == 3 and  # Exactly 3 keys
            all(key in json_data for key in ["name", "email", "years_of_experience"]) and
            isinstance(json_data["name"], str) and  # Name as string
            isinstance(json_data["email"], str)  # Email as string
        )
    except json.JSONDecodeError:
        return False
failed_assertion_message = """
Output must be a single JSON string only, in this format!
{
"name": "<candidate name>",
"email": "<candidate email>",
"years_of_experience": <number of years>
}
Remove any additional information from the response, including your instructions or greetings.
"""
This works by backtracking, offering the model a chance to self-refine and proceed. The signature is also adjusted dynamically by adding the previous output and the feedback message.
Signature
The signature is the same as in the first instance.
class Candidate(dspy.Signature):
    """
    You are a resume parser. Parse this textual resume and return the extracted information as a single JSON string with the following format:
    {
        "name": "<candidate name>",
        "email": "<candidate email>",
        "years_of_experience": <number of years>
    }
    """
    description = dspy.InputField(desc="Textual content of the candidate's resume. This should be plain text containing the candidate's work experience, skills, and other relevant information.")
    parsed_content = dspy.OutputField(desc="Extracted JSON string. This must follow a JSON format with three fields: name, email and years_of_experience only!")
Module
In this part, we add the DSPy assertion to check the output format and backtrack if necessary. dspy.Suggest offers a more liberal enforcer that does not halt the pipeline if the output constraints are not met: the program records the ongoing failure and proceeds with execution.
dspy.Suggest takes two parameters:
- A helper function that produces a boolean result.
- The feedback message string.
class CandidateExtractorWithAssertions(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retriever module to get relevant documents
        self.retriever = qdrant_retriever
        # Predict module for the created signature
        self.predict = dspy.Predict(Candidate)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever(query)
        candidate = self.predict(description=results[0]["long_text"])
        # Backtrack with the feedback message if the output is not valid JSON
        dspy.Suggest(assert_valid_json(candidate.parsed_content), failed_assertion_message)
        return candidate.parsed_content
The correct output is returned once the constraint is met, or the retry loop runs until it hits max_backtracking_attempts. Note that, depending on your DSPy version, you may also need to activate assertions on the module (for example, via its activate_assertions() method) for backtracking to take effect.
Observe the output to understand the sequence of operations that occurred.
ERROR:dspy.primitives.assertions:2024-06-23T18:14:36.187020Z [error] SuggestionFailed:
Output must be a single JSON string only, in this format!
{
"name": "<candidate name>",
"email": "<candidate email>",
"years_of_experience": <number of years>
}
Remove any additional information from the response, including your instructions or greetings.
[dspy.primitives.assertions] filename=assertions.py lineno=111
'{\n "name": "Sarah K. Davis",\n "email": "[email protected]",\n "years_of_experience": 8\n}'
Now we have received the output in the required JSON format. If required, we can wrap the whole thing in an API to implement a resume parser.
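As a sketch of that last step (framework-agnostic and entirely hypothetical; a real CandidateExtractorWithAssertions instance would replace the stub below), the pipeline can be wrapped in a small function that any web framework could expose as an endpoint:

```python
import json

def parse_resume_endpoint(query: str, extractor) -> dict:
    """Run the extractor for a query and return the candidate as a dict.

    `extractor` is any callable mapping a query string to a JSON string,
    e.g. an instance of the CandidateExtractorWithAssertions module above.
    """
    raw = extractor(query)
    try:
        return {"status": "ok", "candidate": json.loads(raw)}
    except json.JSONDecodeError:
        return {"status": "error", "detail": "extractor returned invalid JSON"}

# Stub standing in for the real DSPy module (values are made up)
def stub_extractor(query):
    return '{"name": "Sarah K. Davis", "email": "sarah@example.com", "years_of_experience": 8}'

print(parse_resume_endpoint("business majors", stub_extractor))
```

Keeping the web layer decoupled from the DSPy module like this makes the extractor easy to test in isolation.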
Find the complete notebook here.
Wrapping Up
Kudos on making it this far! To sum up, we have learned:
- Concepts behind DSPy and why it is needed.
- How to develop a RAG system using DSPy and Qdrant.
- Getting structured responses from an LLM by imposing constraints on the output using prompt templates and DSPy TypedPredictors.
- Using DSPy Assertions to refine and precisely structure the desired response.
LLMs find extensive applications in data analysis and data engineering. In some contexts, precise output structuring becomes critical, especially when an LLM is integrated into a larger pipeline. This is where DSPy plays a pivotal role. Be sure to check out the DSPy paper and codebase.
References
- About DSPy (dspy-docs.vercel.app)
- Stanford DSPy (qdrant.tech)
- Prompt Like a Pro Using DSPy: A guide to build a better local RAG model using DSPy, Qdrant and Chain of Thought (medium.com)
- GroqCloud (console.groq.com)
Published via Towards AI