

Simple Method to Generate Useful Synthetic Data

Last Updated on March 13, 2024 by Editorial Team

Author(s): Peter Chung

Originally published on Towards AI.


It’s become a very common phrase, but ‘data is the new oil’ rings louder and truer with each passing year.

Working with large language models, developers typically face two big bottlenecks: compute and data.

While compute remains a scarce, in-demand resource (see NVIDIA’s latest results for context), data has become an even tougher problem.

Recently, a trend has emerged with content platforms and media distributors making deals with AI teams to license out their data (see Reddit’s pre-IPO activity for example).

This makes access and availability increasingly difficult for developers without the budget for similar licensing deals. In most cases, it’s a complete non-starter.

One possible solution, however, has become increasingly more popular: synthetic data.

Data is the new oil

Understanding Synthetic Data

What is synthetic data? Formally, it can be defined as:

Synthetic data is information that’s artificially manufactured rather than generated by real-world events. It’s created algorithmically and is used as a stand-in for test data sets of production or operational data, to validate mathematical models and to train machine learning models.
(Source: techtarget.com)

Synthetic data is not new. In fact, by some accounts, the idea originated sometime in the 1990s.

LLMs, however, have made generating it much more accessible.

In a 2022 paper titled “Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor” (https://arxiv.org/abs/2212.09689), researchers built an instruction-tuning dataset “of creative and diverse instructions, collected with virtually no human labor”. Starting from a very small seed set of examples, they generated 64,000 unique examples, then used models to rephrase and expand that set into a nearly fully synthetic dataset of 240,000 examples, all at a fraction of the cost and time it would have taken a human team to write and annotate.
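To give a rough sense of that rephrasing step, an expansion prompt might look something like the sketch below. This is purely illustrative, not the paper’s actual prompt, and the instruction text is invented.

# Illustrative only -- not the paper's actual prompt. The idea is to take one
# generated instruction and ask a model to restate it in different words,
# multiplying the dataset without new human input.
rephrase_prompt = (
    "Rewrite the following instruction so it asks for the same task in different words.\n\n"
    "Instruction: Summarize the paragraph below in one sentence.\n\n"
    "Rewritten instruction:"
)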

To simplify, their method of generation used this basic template:

(Figure: the few-shot generation template from https://arxiv.org/abs/2212.09689)

In the paper’s figure, three examples (black text) are provided to a language model, which is asked to produce a fourth to complete the pattern (pink text). This few-shot prompting gives the model enough context that the generated examples follow the same structure as the seed data, i.e. an instruction plus any accompanying ‘input’ text.
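As a rough sketch of that template (the three instructions here are invented placeholders, not taken from the paper), the prompt looks something like this:

# A minimal sketch of the few-shot generation template. The three example
# instructions are placeholders; the model is asked to complete "Example 4".
template = (
    "Example 1:\n"
    "Instruction: Translate the following sentence into French.\n"
    "Input: The weather is nice today.\n\n"
    "Example 2:\n"
    "Instruction: Summarize the paragraph below in one sentence.\n"
    "Input: <paragraph text>\n\n"
    "Example 3:\n"
    "Instruction: Classify the sentiment of this review as positive or negative.\n"
    "Input: <review text>\n\n"
    "Example 4:"
)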

With only a few starting examples, we can use this approach to create batches of our own data, specific to any task for which we can find examples to replicate.

Implementation with Code

NOTE: The below is for demonstration purposes only. Please check licensing requirements before using sample data and models for commercial use.

We’ll start by loading in some sample data.

Recently, @teknium1, one of the founders of Nous Research, a notable open-source AI research company, released the underlying datasets he used to train the Open Hermes and Nous Hermes model series. Both series are well regarded and widely used on the Hugging Face Hub, and they perform strongly across many benchmark evaluations.

Using Hugging Face’s datasets library, we’ll read in this dataset.

from datasets import load_dataset
from pprint import pprint

dataset = load_dataset("teknium/OpenHermes-2.5")
pprint(dataset)

Next, we’ll take a look at an example of this data, reading the first record from the ‘train’ split.

dataset['train'][0]

Examining the ‘source’ values in this dataset, we can see that a large number of notable research projects were aggregated together, including the Unnatural Instructions dataset from the paper cited earlier.
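As a quick sanity check, you can count the samples per source. This is a minimal sketch, assuming the train split exposes the ‘source’ column the way the filter below relies on.

from collections import Counter
from pprint import pprint

# Tally how many samples each source project contributed to the train split.
source_counts = Counter(dataset['train']['source'])
pprint(source_counts.most_common(10))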

Using a filtering function, we can pare down our dataset to a more specific subset to work with. We’ll narrow down to just the Unnatural Instructions subset for our example.

To filter down to different segments of this larger dataset, just update the key, value pairs in the key_value_pairs dictionary below.

from pprint import pprint

def filter_dataset(sample, key_value_pairs):
    # Keep a sample only if it matches every key/value pair.
    for key, value in key_value_pairs.items():
        if key not in sample or sample[key] != value:
            return False
    return True

key_value_pairs = {
    'source': 'UnnaturalInstructions',
    # 'category': 'general'
}

filtered_dataset = dataset.filter(lambda sample: filter_dataset(sample, key_value_pairs))
pprint(filtered_dataset)

Now with our dataset filtered down, we’ll use 3 examples from our narrowed dataset to generate our inference prompt.

For our example, we’ll randomly select 3.

import random
from pprint import pprint

# Randomly pick 3 indices from the filtered train split.
examples_index_ref = random.sample(range(len(filtered_dataset['train'])), 3)
pprint(examples_index_ref)

Using the randomly selected examples, we then create a dictionary isolating just the conversation text between human and assistant.

from pprint import pprint

examples = {}
for i, j in enumerate(examples_index_ref):
    conversation = filtered_dataset['train'][j]['conversations']
    formatted_conversation = (f"human: {conversation[0]['value']}\n"
                              f"assistant: {conversation[1]['value']}")
    examples[f'example{i+1}'] = formatted_conversation
pprint(examples)

We take the example conversations and format them into a generation_prompt following the pattern from the paper.

from pprint import pprint

generation_prompt = (f"Example 1:\n{examples['example1']}\n\n"
                     f"Example 2:\n{examples['example2']}\n\n"
                     f"Example 3:\n{examples['example3']}\n\n"
                     f"Example 4:")
pprint(generation_prompt)

Now we can take this prompt and invoke a completion from a model. For our purposes, we’ll use the public inference client from Hugging Face to generate a sample.

Notably, we’ll use the Mistral-7B model for inference and pass a temperature of 0.7 to add some variability to the response. For your own purposes, you may consider a different model and parameters to generate your desired responses. These models allow for a lot of flexibility, so as long as licensing and resources allow, you should always experiment!

from huggingface_hub import InferenceClient
from pprint import pprint

client = InferenceClient()
output = client.text_generation(
    prompt=generation_prompt,
    model='mistralai/Mistral-7B-Instruct-v0.2',
    max_new_tokens=512,
    temperature=0.7,
    stop_sequences=['Example 5:']
)
# The stop sequence may be echoed at the end of the generation; trim it off.
output = output.removesuffix('Example 5:').strip()
pprint(output)

And that’s it!

A Colab notebook with all the code in this article can be found here.

With some allocated compute, you can create powerful data examples to leverage for few-shot prompting or model fine-tuning.
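If you want to fold the generated text back into a dataset for fine-tuning, a minimal sketch might look like the following. The parse_generated_example helper is hypothetical, and it assumes the model kept the ‘human:’/‘assistant:’ pattern from the prompt; real outputs can be messier, so expect to add validation.

import re
from pprint import pprint

def parse_generated_example(text):
    # Hypothetical helper: split a generated "human: ... assistant: ..." block
    # back into a conversation pair. Returns None if the pattern isn't found.
    match = re.search(r"human:\s*(.*?)\s*assistant:\s*(.*)", text, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # The dict keys below are an assumption for illustration; match them to
    # whatever schema your training pipeline expects.
    return [
        {"from": "human", "value": match.group(1).strip()},
        {"from": "assistant", "value": match.group(2).strip()},
    ]

new_conversation = parse_generated_example(output)
if new_conversation is not None:
    pprint(new_conversation)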

I hope this was helpful!

Please don’t hesitate to comment or reach out with any questions.

Thanks for reading!

Peter Chung is the founder and principal engineer of Innova Forge, a Machine Learning development studio working with enterprise and startup customers to deploy ML and LLM applications. www.innova-forge.com

References & Resources

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor https://arxiv.org/abs/2212.09689v1


Published via Towards AI
