Simple Method to Generate Useful Synthetic Data
Author(s): Peter Chung
Originally published on Towards AI.
It's become a very common phrase, but "data is the new oil" rings louder and truer with each passing year.
Working with large language models, developers typically face two big bottlenecks: compute and data.
While compute remains a scarce and ever more in-demand resource (see NVIDIA's latest results for context), datasets have hit a tougher impasse.
Recently, a trend has emerged of content platforms and media distributors striking deals with AI teams to license out their data (see Reddit's pre-IPO activity, for example).
This makes access and availability increasingly tough for developers without the budget for similar licensing deals. In most cases, it's a complete non-starter.
One possible solution, however, has become increasingly popular: synthetic data.
Understanding Synthetic Data
What is synthetic data? Formally, it can be defined as:
Synthetic data is information that's artificially manufactured rather than generated by real-world events. It's created algorithmically and is used as a stand-in for test data sets of production or operational data, to validate mathematical models and to train machine learning models.
Source: techtarget.com
Synthetic data is not new. In fact, by some accounts the idea originated sometime in the 1990s.
LLMs, however, have made their adoption much more accessible.
In a paper published in 2022 titled "Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor" (https://arxiv.org/abs/2212.09689), researchers developed an instruction-tuning dataset "of creative and diverse instructions, collected with virtually no human labor". Through their method, they generated 64,000 unique examples from a very small seed set. They then used models to rephrase and expand this set, producing a nearly completely synthetic dataset of 240,000 examples, all at a fraction of the cost and time a human team would need to write and annotate it.
To simplify, their method of generation used a basic few-shot template: they provided three complete examples to a language model and asked it to produce a fourth to complete the inference. The multi-shot prompt gave the model enough context that the generated examples followed the same pattern as the provided data (i.e., giving instructions) and grounded the generation with additional information such as "input" text. A rough sketch of the idea is shown below.
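To make the pattern concrete, here is a minimal sketch of that kind of few-shot template in Python. The seed examples and field labels below are illustrative placeholders, not the actual prompts from the paper:

# Build a few-shot generation prompt in the spirit of the paper's template.
# The three seed examples below are made-up placeholders, not the paper's data.
seed_examples = [
    "Instruction: Translate the sentence to French.\nInput: I like apples.",
    "Instruction: Summarize the paragraph in one sentence.\nInput: <paragraph text>",
    "Instruction: Classify the sentiment of the review.\nInput: <review text>",
]

template = ""
for i, example in enumerate(seed_examples, start=1):
    template += f"Example {i}:\n{example}\n\n"

# The model is asked to continue the pattern with a brand-new fourth example.
template += "Example 4:"
print(template)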
With only a few starting examples, we can leverage this approach to create batches of our own data, specific to any task for which we can find examples to replicate.
Implementation with Code
NOTE: The below is for demonstration purposes only. Please check licensing requirements before using sample data and models for commercial use.
We'll start by loading in some sample data.
Recently, @teknium1, one of the founders of Nous Research, a notable open-source AI research company, released the underlying datasets he used to train the Open Hermes and Nous Hermes model series. Both series are very well regarded and widely used on the Hugging Face Hub, and they perform strongly across many benchmark evaluations.
Using Hugging Face's datasets library, we'll read in this dataset.
from datasets import load_dataset
from pprint import pprint

# Download the full OpenHermes-2.5 dataset from the Hugging Face Hub
dataset = load_dataset("teknium/OpenHermes-2.5")
pprint(dataset)
Next, we'll take a look at an example of this data, reading the first record from the 'train' split.
dataset['train'][0]
Examining the 'sources' of this dataset, we can see that a large number of notable research projects were aggregated together. This includes the Unnatural Instructions dataset from the paper we cited earlier. One quick way to see which sources are present is sketched below.
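As a quick check (and assuming most records carry a populated 'source' field, which may not hold for every entry), you can list the distinct sources with the datasets library's unique() helper:

# List the distinct values of the 'source' column in the train split.
# Some records may have an empty or missing source, so filter those out.
sources = dataset['train'].unique('source')
pprint(sorted(s for s in sources if s))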
Using a filtering function, we can pare down our dataset to a more specific subset to work with. We'll narrow it down to just the Unnatural Instructions data for our example.
To filter down to different segments of this larger dataset, just update the key, value pairs in the key_value_pairs dictionary below.
from pprint import pprint

def filter_dataset(sample, key_value_pairs):
    # Keep a sample only if it matches every key/value pair we ask for
    for key, value in key_value_pairs.items():
        if key not in sample or sample[key] != value:
            return False
    return True

key_value_pairs = {
    'source': 'UnnaturalInstructions',
    # 'category': 'general'
}

filtered_dataset = dataset.filter(lambda sample: filter_dataset(sample, key_value_pairs))
pprint(filtered_dataset)
Now, with our dataset filtered down, we'll use three examples from the narrowed set to build our inference prompt.
For our example, we'll select three at random.
import random
from pprint import pprint

# Pick 3 random indices from the filtered train split
examples_index_ref = random.sample(range(len(filtered_dataset['train'])), 3)
pprint(examples_index_ref)
Using the randomly selected examples, we then create a dictionary isolating just the conversation text between human and assistant.
from pprint import pprint

examples = {}
for i, j in enumerate(examples_index_ref):
    # Each record stores the conversation as a list of human/assistant turns
    conversation = filtered_dataset['train'][j]['conversations']
    formatted_conversation = (f"human: {conversation[0]['value']}\n"
                              f"assistant: {conversation[1]['value']}")
    examples[f'example{i+1}'] = formatted_conversation

pprint(examples)
We then take the example conversations and format them into a generation_prompt, following the pattern from the paper.
from pprint import pprint

generation_prompt = (f"Example 1:\n{examples['example1']}\n\n"
                     f"Example 2:\n{examples['example2']}\n\n"
                     f"Example 3:\n{examples['example3']}\n\n"
                     f"Example 4:")
pprint(generation_prompt)
Now we can take this prompt and invoke a completion from a model. For our purposes, we'll use the public inference client from Hugging Face to generate a sample.
Notably, we'll use the Mistral-7B model for inference and pass a temperature of 0.7 to add some variability to the response. For your own purposes, you may consider a different model and parameters to generate your desired responses. These models allow for a lot of flexibility, so as long as licensing and resources allow, you should always experiment!
from huggingface_hub import InferenceClient
from pprint import pprint

client = InferenceClient()

output = client.text_generation(
    prompt=generation_prompt,
    model='mistralai/Mistral-7B-Instruct-v0.2',
    max_new_tokens=512,
    temperature=0.7,
    stop_sequences=['Example 5:']
)

# The stop sequence may be echoed at the end of the completion, so cut it off
output = output.split('Example 5:')[0].strip()
pprint(output)
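If you want to fold the completion back into a dataset, the sketch below shows one way to parse it into a record shaped like the OpenHermes conversations. It assumes the model kept the human:/assistant: layout from the prompt, which you should verify before trusting the result:

# A minimal sketch for turning the raw completion into a structured record.
# It assumes the generated text follows the "human: ... assistant: ..." layout
# used in the prompt; generations that don't match are skipped.
def parse_generated_example(text):
    if 'human:' not in text or 'assistant:' not in text:
        return None  # generation did not follow the expected format
    human_part, assistant_part = text.split('assistant:', 1)
    return {
        'conversations': [
            {'from': 'human', 'value': human_part.replace('human:', '', 1).strip()},
            {'from': 'gpt', 'value': assistant_part.strip()},
        ],
        'source': 'synthetic',
    }

new_record = parse_generated_example(output)
pprint(new_record)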
And that's it!
A Colab notebook with all the code in this article can be found here.
With allocated compute, you can create powerful data examples to leverage for multi-shot prompting or model fine-tuning.
I hope this was helpful!
Please don't hesitate to comment or reach out with any questions.
Thanks for reading!
Peter Chung is the founder and principal engineer of Innova Forge, a Machine Learning development studio working with enterprise and startup customers to deploy ML and LLM applications. www.innova-forge.com
References & Resources
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor https://arxiv.org/abs/2212.09689v1