
Synthetic Data Generation with Language Models: A Practical Guide

Last Updated on October 5, 2024 by Editorial Team

Author(s): Ehssan

Originally published on Towards AI.

Created with Nightcafe — Image property of Author

In the evolving landscape of artificial intelligence, data remains the fuel that powers innovation. But what happens when acquiring real-world data becomes challenging, expensive, or even impossible?

Enter synthetic data generation — a groundbreaking technique that leverages language models to create high-quality, realistic datasets. Consider training a language model on medical records without breaching privacy laws, or developing a customer interaction model without access to private conversation logs, or designing autonomous driving systems where collecting data on rare edge cases is nearly impossible. Synthetic data bridges gaps in data availability while maintaining the realism needed for effective AI training.

Beyond addressing data shortages, synthetic data enhances AI development by balancing imbalanced datasets (e.g., in fraud detection or rare medical conditions), simulating rare events, and augmenting limited data with realistic variations. Companies can accelerate development, improve model robustness, and experiment with datasets otherwise unavailable.

While the benefits of synthetic data — such as scalability, privacy preservation, and the ability to simulate hard-to-capture scenarios — are clear, it also has limitations, including limited real-world credibility, overfitting, and bias, which require careful consideration.

In this article, we’ll explore synthetic data generation, discuss its limitations and ways to overcome them, and show you how to implement your own synthetic data generator in Python.

How to Overcome the Limitations of Synthetic Data

1. Lack of Real-World Authenticity

Synthetic data may not fully capture the nuances and variability of real-world data, leading to models that perform well in controlled environments but fail in real-world applications.

How to Overcome:

  • Hybrid Approach: Use synthetic data to augment real data, not replace it. A combination ensures that the model can generalize to unseen, real-world scenarios (a minimal sketch follows this list).
  • Validation on Real Data: Always validate models on real-world datasets, even if training is done with synthetic data, to assess performance in practical applications and to ensure robustness.
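As a minimal sketch of the hybrid approach (the file names and the 30% synthetic cap below are illustrative assumptions, not prescriptions):

import pandas as pd

# Hypothetical file names for illustration
real_df = pd.read_csv("real_data.csv")
synthetic_df = pd.read_csv("synthetic_data.csv")

# Augment rather than replace: cap synthetic samples at 30% of the real data
max_synthetic = min(int(len(real_df) * 0.3), len(synthetic_df))
combined_df = pd.concat(
    [real_df, synthetic_df.sample(n=max_synthetic, random_state=42)],
    ignore_index=True,
).sample(frac=1, random_state=42)  # shuffle the combined dataset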

2. Overfitting and Bias

Models trained on synthetic data might overfit to patterns in that data that do not exist in real-world data, leading to poor generalization when deployed. Moreover, synthetic data can inherit or amplify biases present in the models used to generate it, which can result in biased predictions.

How to Overcome:

  • Data Regularization: Apply data augmentation techniques and introduce noise in synthetic data to mimic the randomness and variability of real-world data.
  • Diverse Data Generation: Ensure diversity in the synthetic data by using multiple models and methods to generate data from different perspectives.

In addition, keep in mind that ensuring the quality and representativeness of synthetic data can be difficult; often, a little experimentation with few-shot learning (FSL) and chain-of-thought (CoT) prompting can go a long way. We illustrate both in more detail below.

Synthetic Data Generator Implementation

You can run this tutorial on the Intel® Tiber™ Developer Cloud free environment, which is equipped with a 4th Generation Intel® Xeon® CPU. This platform provides ample computing resources, ensuring smooth execution of our code.

Environment Setup

Let’s begin by importing the necessary libraries. In our demo, we shall use Llama 3.1, and you will need a Hugging Face token to access this model’s gated repository. You may create and access your tokens directly from your Hugging Face account: select “Access Tokens” from your settings menu and create a token with the “write” permission.

Snapshot of the Hugging Face token creation page — Image by Author

Now, you can insert your token in your Python script. (Do not share your Access Tokens with anyone; Hugging Face removes any leaked Access Tokens.)

import torch
import numpy as np
import pandas as pd
from transformers import pipeline
from huggingface_hub import login

# Authenticate with Hugging Face to access the gated model repository.
# (Avoid hardcoding tokens in shared code; never commit them to version control.)
login("your_token")

Next, go to meta-llama/Meta-Llama-3.1-8B-Instruct and read the license before providing your information and submitting the Llama 3.1 access request.

Implementation

Let’s say we want to generate synthetic customer service texts classified by the following labels

labels = ["polite", "somewhat polite", "neutral", "impolite"]

in these contexts

category_type = {
    "travel": ["air", "train"],
    "stores": ["appliances", "toys and games"],
}

We shall randomly select labels and categories and instruct the language model to generate synthetic data based on the specified categories and labels.

Randomness will ensure data regularization; see the second challenge (Overfitting and Bias) above. Once we have selected a context category, we randomly choose a corresponding type from our dictionary as follows.


def diversify(category):
    """
    Randomly selects a value from the list associated with a given key
    in the category_type dictionary.

    Args:
        category (str): A key in the category_type dictionary.

    Returns:
        str: A randomly chosen value from the list associated with the provided key.
    """
    return np.random.choice(category_type[category])
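For example, with the category_type dictionary above:

print(diversify("travel"))  # prints "air" or "train", chosen at random
print(diversify("stores"))  # prints "appliances" or "toys and games", chosen at random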

Here’s how we go about the full implementation: we generate data in batches and our function randomly assigns labels and categories to the batch’s samples. For each sample in the batch, the sdg function:

  • Creates a prompt that instructs the language model to generate a synthetic customer service response based on the assigned label and category.
  • Uses the language model to generate a response to the prompt.
  • Extracts the relevant text from the generated response. You can leave the text_extraction function as an identity function for now, since its exact definition depends on factors like the prompt; it can easily be handled with regular expressions, for example. A minimal placeholder is sketched after the next paragraph.

Finally, each batch of generated responses, together with the corresponding labels and the model used, is appended to a CSV file.
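As noted above, text_extraction can start out as an identity function. A minimal placeholder you can refine later:

def text_extraction(text):
    # Placeholder: return the generated text unchanged.
    # Replace with prompt-specific parsing (e.g., regular expressions) as needed.
    return text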

def sdg(
    sample_size,
    labels,
    categories,
    batch_size=20,
    output_path="./output.csv",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
):
    """
    Generates synthetic data based on specified categories and labels.

    Args:
        sample_size (int): The number of synthetic data samples to generate.
        labels (list of str): The labels used to classify the synthetic data.
        categories (list of str): The categories for data generation and diversification.
        batch_size (int): The number of samples per batch to append to the output file.
        output_path (str): The file path where the output CSV will be saved.
        model (str): The large language model used for generating the synthetic data.
    """
    # If sample_size is not divisible by batch_size, an extra batch is added
    num_batches = (sample_size + batch_size - 1) // batch_size
    print(f"Synthetic data will be appended to {output_path} in {num_batches} batches.")

    # Create the generation pipeline once, rather than once per sample
    generator = pipeline("text-generation", model=model)

    for batch in range(num_batches):
        # Calculate the start and end indices for the current batch
        start = batch * batch_size
        end = min(start + batch_size, sample_size)

        # Store results of the current batch
        batch_data = []

        # Assign random labels to the current batch
        batch_random_labels = np.random.choice(labels, batch_size, replace=True)

        # Assign random categories to the current batch
        batch_random_categories = np.random.choice(categories, batch_size, replace=True)

        for i in range(start, end):
            prompt = f"""I am creating synthetic OUTPUT to fine-tune
            my BERT model. The use case is customer service chatbots.
            You should generate only one OUTPUT for the classification
            LABEL: {batch_random_labels[i - start]} in CATEGORY:
            {batch_random_categories[i - start]} and TYPE
            {diversify(batch_random_categories[i - start])}.

            Examples.
            OUTPUT: The fee you’re seeing is likely related
            to our standard account maintenance charges. I can provide
            more details if needed.

            OUTPUT: You can return it, but only if you have the
            receipt and it’s within the return window.

            OUTPUT: It's not our fault your baggage didn't make it.
            What do you expect us to do about it now?

            OUTPUT: I apologize for the trouble you’ve had with the
            heater. We can certainly look into a return or exchange.
            Please bring in your receipt, and we’ll take care of it
            for you.

            Only return one OUTPUT and not the LABEL or the CATEGORY.
            """

            messages = [
                {
                    "role": "system",
                    "content": f"You are a helpful assistant designed to generate synthetic customer service data with labels {labels} in categories {list(category_type.keys())}.",
                },
                {"role": "user", "content": prompt},
            ]
            result = generator(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]
            result = text_extraction(result)
            batch_data.append(
                {
                    "text": result,
                    "label": batch_random_labels[i - start],
                    "model": model,
                }
            )

        # Convert the batch results to a DataFrame
        batch_df = pd.DataFrame(batch_data)

        # Append the DataFrame to the CSV file
        if batch == 0:
            # If it's the first batch, write headers
            batch_df.to_csv(output_path, mode="w", index=False)
        else:
            # For subsequent batches, append without headers
            batch_df.to_csv(output_path, mode="a", header=False, index=False)
        print(f"Saved batch number {batch + 1}/{num_batches}")

Here’s a sample output.

| text | label | model |
|------|-------|-------|
| You're still whining about your membership renewal fee? It's not like we're the ones who raised the prices, it's the board's decision. You should just deal with it and stop complaining. | impolite | meta-llama/Meta-Llama-3.1-8B-Instruct |
| I'm not sure why our membership fees are higher this quarter, but I can check on the pricing for our tennis courts and see if there's a way to adjust your plan to fit your budget better. | somewhat polite | meta-llama/Meta-Llama-3.1-8B-Instruct |

Further Improvements

To improve the quality of the outputs of our data generator, we could modify the prompt and diversify the models used. We discuss each of these briefly.

Prompt

It’s good practice to pass explicit label descriptions to the model through the prompt. For instance, we could add the lines

polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

to our prompt. Additionally, we could require the language model to provide its reasoning to support the text generation for the specified label. Here is such an improved prompt.

prompt = f"""You should create synthetic data for specified labels and categories. 
This is especially useful for developing customer service chatbots.

Label descriptions:
- polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
- somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
- neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
- impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

Examples.

LABEL: somewhat polite
CATEGORY: travel
TYPE: train
OUTPUT: I understand your concern about your booking, and I'll check what options we have for you.
REASONING: This text would be classified as "somewhat polite."
The acknowledgment of the customer's concern shows a basic level of respect.
The sentence is direct and lacks additional warmth or formality, but it communicates a willingness to help.
The use of "I'll check" is a straightforward commitment to action without additional courteous phrases that would make it fully polite.

LABEL: neutral
CATEGORY: stores
TYPE: appliances
OUTPUT: Your TV will be delivered within three to five business days.
REASONING: This text would be classified as "neutral."
The sentence is purely informational, providing the facts about delivery time without any emotional undertones.
There are no phrases that express politeness or rudeness; it's a straightforward statement.
The tone is impersonal and focused solely on conveying the necessary information.
####################
You should generate one OUTPUT for the classification below.
Only return the OUTPUT and REASONING.
Do not return the LABEL, CATEGORY, or TYPE.

LABEL: {batch_random_labels[i - start]}
CATEGORY: {batch_random_categories[i - start]}
TYPE: {diversify(batch_random_categories[i - start])}
OUTPUT:
REASONING:
"""

Diversity

To further diversify the output data, one can pass multiple different language models to the synthetic data generator. When we used identical generators and prompts on Llama-3.1-8B-Instruct, gemma-2-9b-it, and Mixtral-8x7B-Instruct-v0.1, we observed the following percentages of duplicated data.

  • Llama: 0.04%
  • Gemma: 94.6% (Note: this model wasn’t trained with any system instructions, so you need to modify messages accordingly.)
  • Mixtral: 7%
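A sketch of how several models could be passed to the same generator and how a duplicate rate like those above could be measured (the model IDs and file naming here are assumptions):

models = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",  # no system role: adjust messages in sdg accordingly
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]

for m in models:
    path = f"./output_{m.split('/')[-1]}.csv"
    sdg(
        sample_size=100,
        labels=labels,
        categories=list(category_type.keys()),
        output_path=path,
        model=m,
    )
    df = pd.read_csv(path)
    # Fraction of rows whose text duplicates an earlier row
    print(m, f"{df['text'].duplicated().mean():.2%}")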

Gotcha Alert: In some edge cases, the language model might generate the same text for different labels! For instance, when we ran the generator with Llama 3.1, the following output was generated for both the neutral and somewhat polite labels.

I'm afraid the toy you're looking for is currently out of stock, but we do have a similar product that might interest you. Would you like me to check availability?
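One simple mitigation, assuming the CSV format above, is to drop any text that appears more than once (and hence possibly under different labels) before training:

df = pd.read_csv("./output.csv")
# Keep only texts that appear exactly once across the whole dataset
df = df[~df["text"].duplicated(keep=False)]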

Conclusion

Synthetic data generation with language models is a powerful tool that has the potential to reshape the future of AI. Whether you’re a researcher, developer, or business leader, understanding this technology could provide a competitive edge in the evolving AI landscape.

If you’re interested in exploring how synthetic data can revolutionize your AI projects, consider diving deeper into language models, writing your custom data generators, and experimenting with existing data generation tools to unlock new possibilities.

For more AI development how-to content, visit Intel® AI Development Resources.
