
Synthetic Data Generation with Language Models: A Practical Guide

Last Updated on October 5, 2024 by Editorial Team

Author(s): Ehssan

Originally published on Towards AI.

Created with Nightcafe — Image property of Author

In the evolving landscape of artificial intelligence, data remains the fuel that powers innovation. But what happens when acquiring real-world data becomes challenging, expensive, or even impossible?

Enter synthetic data generation — a groundbreaking technique that leverages language models to create high-quality, realistic datasets. Consider training a language model on medical records without breaching privacy laws, or developing a customer interaction model without access to private conversation logs, or designing autonomous driving systems where collecting data on rare edge cases is nearly impossible. Synthetic data bridges gaps in data availability while maintaining the realism needed for effective AI training.

Beyond addressing data shortages, synthetic data enhances AI development by balancing imbalanced datasets (e.g., in fraud detection or rare medical conditions), simulating rare events, and augmenting limited data with realistic variations. Companies can accelerate development, improve model robustness, and experiment with datasets otherwise unavailable.

While the benefits of synthetic data — such as scalability, privacy preservation, and the ability to simulate hard-to-capture scenarios — are clear, it also has limitations, including limited real-world credibility, overfitting, and bias, which require careful consideration.

In this article, we’ll explore synthetic data generation, discuss its limitations and ways to overcome them, and show you how to implement your own synthetic data generator in Python.

How to Overcome the Limitations of Synthetic Data

1. Lack of Real-World Authenticity

Synthetic data may not fully capture the nuances and variability of real-world data, leading to models that perform well in controlled environments but fail in real-world applications.

How to Overcome:

  • Hybrid Approach: Use synthetic data to augment real data, not replace it. A combination ensures that the model can generalize to unseen, real-world scenarios (a minimal sketch follows this list).
  • Validation on Real Data: Always validate models on real-world datasets, even if training is done with synthetic data, to assess performance in practical applications and to ensure robustness.
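As a minimal sketch of the hybrid approach (the file names real.csv and synthetic.csv and the 80/20 split are assumptions, not from this article), one could mix the two sources for training while validating only on real data:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical files with the same text/label schema used later in this article
real_df = pd.read_csv("real.csv")
synthetic_df = pd.read_csv("synthetic.csv")

# Hold out a purely real validation split before mixing in synthetic data
real_train, real_val = train_test_split(real_df, test_size=0.2, random_state=42)

# Train on real + synthetic, but validate on real data only
train_df = pd.concat([real_train, synthetic_df], ignore_index=True)
print(f"Train: {len(train_df)} rows, validate: {len(real_val)} real rows")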

2. Overfitting and Bias

Models trained on synthetic data might overfit to patterns in that data that do not exist in real-world data, leading to poor generalization when deployed. Synthetic data can also inherit or amplify biases present in the models used to generate it, which can result in biased predictions.

How to Overcome:

  • Data Regularization: Apply data augmentation techniques and introduce noise in synthetic data to mimic the randomness and variability of real-world data (a rough sketch follows this list).
  • Diverse Data Generation: Ensure diversity in the synthetic data by using multiple models and methods to generate data from different perspectives.
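One simple way to inject such noise is random word dropout; the helper below is a hypothetical illustration, not something used by the generator later in this article:

import numpy as np

def word_dropout(text, p=0.1, seed=None):
    """Randomly drop a fraction p of words to add variability to synthetic text."""
    rng = np.random.default_rng(seed)
    kept = [w for w in text.split() if rng.random() > p]
    return " ".join(kept) if kept else text

print(word_dropout("Your TV will be delivered within three to five business days.", p=0.2, seed=0))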

In addition, keep in mind that ensuring the quality and representativeness of synthetic data can be difficult, and a little experimentation with few-shot learning (FSL) and chain-of-thought (CoT) prompting can often go a long way. We illustrate both in more detail below.

Synthetic Data Generator Implementation

You can run this tutorial on the Intel® Tiber™ Developer Cloud free environment, which is equipped with a 4th Generation Intel® Xeon® CPU. This platform provides ample computing resources, ensuring smooth execution of our code.

Environment Setup

Let’s begin by importing the necessary libraries. In our demo we shall use Llama 3.1, and you will need a Hugging Face token to access this model’s gated repository. You may create and access your tokens directly from your Hugging Face account: select “Access Tokens” from your settings menu and create a token with the “write” permission.

Snapshot of the Hugging Face token creation page — Image by Author

Now, you can insert your token in your Python script. (Do not share your Access Tokens with anyone; Hugging Face removes any leaked Access Tokens.)

import torch  # backend for the transformers pipeline
import numpy as np
import pandas as pd
from transformers import pipeline
from huggingface_hub import login

# Authenticate with Hugging Face to access gated model repositories
login("your_token")

Next, go to meta-llama/Meta-Llama-3.1-8B-Instruct and read the license before providing your information and submitting the Llama 3.1 access request.

Implementation

Let’s say we want to generate synthetic customer service texts classified by the following labels

labels = ["polite", "somewhat polite", "neutral", "impolite"]

in these contexts

category_type = {
    "travel": ["air", "train"],
    "stores": ["appliances", "toys and games"],
}

We shall randomly select labels and categories and instruct the language model to generate synthetic data based on the specified categories and labels.

Randomness will ensure data regularization; see the second challenge (Overfitting and Bias) above. Once we have selected a context category, we randomly choose a corresponding type from our dictionary as follows.


def diversify(category):
    """
    Randomly selects a value from the list associated with a given key
    in the category_type dictionary.

    Args:
        category (str): A key in the category_type dictionary.

    Returns:
        str: A randomly chosen value from the list associated with the provided key.
    """
    return np.random.choice(category_type[category])
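For example, a quick sanity check of the helper (hypothetical call):

print(diversify("travel"))  # randomly prints "air" or "train"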

Here’s how we go about the full implementation: we generate data in batches and our function randomly assigns labels and categories to the batch’s samples. For each sample in the batch, the sdg function:

  • Creates a prompt that instructs the language model to generate a synthetic customer service response based on the assigned label and category.
  • Uses the language model to generate a response to the prompt.
  • Extracts the relevant text from the generated response. You can leave the text_extraction function as an identity function for now, since its exact definition depends on factors such as the prompt; it can easily be handled with regular expressions, for example (a minimal stub follows, and a regex version is sketched later in this article).

Finally, each batch of generated responses, along with their labels and the model used, is appended to a CSV file.
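Here is that minimal identity stub:

def text_extraction(text):
    # Placeholder: return the generated text unchanged.
    # Swap in regex-based parsing once the prompt format is settled.
    return text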

def sdg(
    sample_size,
    labels,
    categories,
    batch_size=20,
    output_path="./output.csv",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
):
    """
    Generates synthetic data based on specified categories and labels.

    Args:
        sample_size (int): The number of synthetic data samples to generate.
        labels (list of str): The labels used to classify the synthetic data.
        categories (list of str): The categories for data generation and diversification.
        batch_size (int): The number of samples per batch to append to the output file.
        output_path (str): The file path where the output CSV will be saved.
        model (str): The large language model used for generating the synthetic data.
    """
    # If sample_size is not divisible by batch_size, an extra batch is added
    num_batches = (sample_size + batch_size - 1) // batch_size

    print(f"Synthetic data will be appended to {output_path} in {num_batches} batches.")

    # Create the generation pipeline once, rather than once per sample
    generator = pipeline("text-generation", model=model)

    for batch in range(num_batches):
        # Calculate the start and end indices for the current batch
        start = batch * batch_size
        end = min(start + batch_size, sample_size)

        # Store results of the current batch
        batch_data = []

        # Assign random labels to the current batch
        batch_random_labels = np.random.choice(labels, batch_size, replace=True)

        # Assign random categories to the current batch
        batch_random_categories = np.random.choice(categories, batch_size, replace=True)

        for i in range(start, end):
            prompt = f"""I am creating synthetic OUTPUT to fine-tune
            my BERT model. The use case is customer service chatbots.
            You should generate only one OUTPUT for the classification
            LABEL: {batch_random_labels[i - start]} in CATEGORY:
            {batch_random_categories[i - start]} and TYPE
            {diversify(batch_random_categories[i - start])}.

            Examples.
            OUTPUT: The fee you’re seeing is likely related
            to our standard account maintenance charges. I can provide
            more details if needed.

            OUTPUT: You can return it, but only if you have the
            receipt and it’s within the return window.

            OUTPUT: It's not our fault your baggage didn't make it.
            What do you expect us to do about it now?

            OUTPUT: I apologize for the trouble you’ve had with the
            heater. We can certainly look into a return or exchange.
            Please bring in your receipt, and we’ll take care of it
            for you.

            Only return one OUTPUT and not the LABEL or the CATEGORY.
            """

            messages = [
                {
                    "role": "system",
                    "content": f"You are a helpful assistant designed to generate synthetic customer service data with labels {labels} in categories {list(category_type.keys())}.",
                },
                {"role": "user", "content": prompt},
            ]
            result = generator(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]

            result = text_extraction(result)
            batch_data.append(
                {
                    "text": result,
                    "label": batch_random_labels[i - start],
                    "model": model,
                }
            )

        # Convert the batch results to a DataFrame
        batch_df = pd.DataFrame(batch_data)

        # Append the DataFrame to the CSV file
        if batch == 0:
            # If it's the first batch, write headers
            batch_df.to_csv(output_path, mode="w", index=False)
        else:
            # For subsequent batches, append without headers
            batch_df.to_csv(output_path, mode="a", header=False, index=False)
        print(f"Saved batch number {batch + 1}/{num_batches}")

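A hypothetical invocation, generating 100 samples with the defaults above:

sdg(
    sample_size=100,
    labels=labels,
    categories=list(category_type.keys()),
)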
Here’s a sample output.

| text | label | model |
|------|-------|-------|
| You're still whining about your membership renewal fee? It's not like we're the ones who raised the prices, it's the board's decision. You should just deal with it and stop complaining. | impolite | meta-llama/Meta-Llama-3.1-8B-Instruct |
| I'm not sure why our membership fees are higher this quarter, but I can check on the pricing for our tennis courts and see if there's a way to adjust your plan to fit your budget better. | somewhat polite | meta-llama/Meta-Llama-3.1-8B-Instruct |

Further Improvements

To improve the quality of our data generator’s outputs, we can modify the prompt and diversify the models we use. We discuss each of these briefly.

Prompt

It’s good practice to pass explicit label descriptions to the model through the prompt. For instance, we could add the lines

polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

to our prompt. Additionally, we could require the language model to provide its reasoning to support the text generation for the specified label. Here is such an improved prompt.

prompt = f"""You should create synthetic data for specified labels and categories. 
This is especially useful for developing customer service chatbots.

Label descriptions:
- polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
- somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
- neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
- impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

Examples.

LABEL: somewhat polite
CATEGORY: travel
TYPE: train
OUTPUT: I understand your concern about your booking, and I'll check what options we have for you.
REASONING: This text would be classified as "somewhat polite."
The acknowledgment of the customer's concern shows a basic level of respect.
The sentence is direct and lacks additional warmth or formality, but it communicates a willingness to help.
The use of "I'll check" is a straightforward commitment to action without additional courteous phrases that would make it fully polite.

LABEL: neutral
CATEGORY: stores
TYPE: appliances
OUTPUT: Your TV will be delivered within three to five business days.
REASONING: This text would be classified as "neutral."
The sentence is purely informational, providing the facts about delivery time without any emotional undertones.
There are no phrases that express politeness or rudeness; it's a straightforward statement.
The tone is impersonal and focused solely on conveying the necessary information.
####################
You should generate one OUTPUT for the classification below.
Only return the OUTPUT and REASONING.
Do not return the LABEL, CATEGORY, or TYPE.

LABEL: {batch_random_labels[i - start]}
CATEGORY: {batch_random_categories[i - start]}
TYPE: {diversify(batch_random_categories[i - start])}
OUTPUT:
REASONING:
"""

Diversity

To further diversify the output data, one can pass several different language models to the synthetic data generator. When we used identical generators and prompts with Llama-3.1-8B-Instruct, gemma-2-9b-it, and Mixtral-8x7B-Instruct-v0.1, we observed the following percentages of duplicated data (a way to measure this is sketched after the list).

  • Llama: 0.04%
  • Gemma: 94.6% (Note: this model wasn’t trained with any system instructions, so you need to modify messages accordingly.)
  • Mixtral: 7%
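One way to measure the duplication rate, assuming the output.csv produced by sdg above:

import pandas as pd

df = pd.read_csv("./output.csv")
duplicate_pct = df["text"].duplicated().mean() * 100
print(f"Duplicated samples: {duplicate_pct:.2f}%")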

Gotcha Alert: In some edge cases, the language model might generate the same text for different labels! For instance, when we ran the generator with Llama 3.1, the following output was generated for both the neutral and somewhat polite labels.

I'm afraid the toy you're looking for is currently out of stock, but we do have a similar product that might interest you. Would you like me to check availability?
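A quick check for such label conflicts, reusing the DataFrame from the previous snippet:

# Texts that appear under more than one distinct label
conflicts = df.groupby("text")["label"].nunique()
print(df[df["text"].isin(conflicts[conflicts > 1].index)].sort_values("text"))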

Conclusion

Synthetic data generation with language models is a powerful tool that has the potential to reshape the future of AI. Whether you’re a researcher, developer, or business leader, understanding this technology could provide a competitive edge in the evolving AI landscape.

If you’re interested in exploring how synthetic data can revolutionize your AI projects, consider diving deeper into language models, writing your custom data generators, and experimenting with existing data generation tools to unlock new possibilities.

For more AI development how-to content, visit Intel® AI Development Resources.


Published via Towards AI
