

Simple Method to Generate Useful Synthetic Data

Last Updated on March 13, 2024 by Editorial Team

Author(s): Peter Chung

Originally published on Towards AI.


It’s become a very common phrase, but ‘data is the new oil’ rings louder and truer with each passing year.

Working with large language models, developers typically face two big bottlenecks: compute and data.

While compute remains a scarce, in-demand resource (see NVIDIA’s latest results for context), data has become an even tougher problem.

Recently, a trend has emerged with content platforms and media distributors making deals with AI teams to license out their data (see Reddit’s pre-IPO activity for example).

This makes access and availability increasingly difficult for developers without the budget for similar licensing deals. In most cases, it’s a complete non-starter.

One possible solution, however, has become increasingly more popular: synthetic data.

Data is the new oil

Understanding Synthetic Data

What is synthetic data? Formally, it can be defined as:

Synthetic data is information that’s artificially manufactured rather than generated by real-world events. It’s created algorithmically and is used as a stand-in for test data sets of production or operational data, to validate mathematical models and to train machine learning models.
(Source: techtarget.com)

Synthetic data is not new. In fact, by some accounts, the idea originated sometime in the 1990s.

LLMs, however, have made generating it much more accessible.

In a 2022 paper titled “Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor” (https://arxiv.org/abs/2212.09689), researchers built an instruction-tuning dataset “of creative and diverse instructions, collected with virtually no human labor”. Starting from a very small seed set of examples, they generated 64,000 unique examples, then used models to rephrase and expand that set into a nearly fully synthetic dataset of 240,000 examples, all at a fraction of the cost and time it would have taken a human team to write and annotate.
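To give a rough sense of that rephrasing step, an expansion prompt might look something like the sketch below. This is purely illustrative, not the paper’s actual prompt, and the instruction text is invented.

# Illustrative only -- not the paper's actual prompt. The idea is to take one
# generated instruction and ask a model to restate it in different words,
# multiplying the dataset without new human input.
rephrase_prompt = (
    "Rewrite the following instruction so it asks for the same task in different words.\n\n"
    "Instruction: Summarize the paragraph below in one sentence.\n\n"
    "Rewritten instruction:"
)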

To simplify, their method of generation used this basic template:

(Figure: the few-shot generation template from https://arxiv.org/abs/2212.09689)

In the paper’s figure, three examples (black text) are provided to a language model, which is asked to produce a fourth to complete the pattern (pink text). This few-shot prompting gives the model enough context that the generated examples follow the same structure as the seed data, i.e. an instruction plus any accompanying ‘input’ text.
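As a rough sketch of that template (the three instructions here are invented placeholders, not taken from the paper), the prompt looks something like this:

# A minimal sketch of the few-shot generation template. The three example
# instructions are placeholders; the model is asked to complete "Example 4".
template = (
    "Example 1:\n"
    "Instruction: Translate the following sentence into French.\n"
    "Input: The weather is nice today.\n\n"
    "Example 2:\n"
    "Instruction: Summarize the paragraph below in one sentence.\n"
    "Input: <paragraph text>\n\n"
    "Example 3:\n"
    "Instruction: Classify the sentiment of this review as positive or negative.\n"
    "Input: <review text>\n\n"
    "Example 4:"
)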

With only a few starting examples, we can use this approach to create batches of our own data, specific to any task for which we can find examples to replicate.

Implementation with Code

NOTE: The below is for demonstration purposes only. Please check licensing requirements before using sample data and models for commercial use.

We’ll start by loading in some sample data.

Recently, @teknium1, one of the founders of Nous Research, a notable open-source AI research company, released the underlying datasets he used to train the Open Hermes and Nous Hermes model series. Both series are well regarded and widely used on the Hugging Face Hub, and they perform strongly across many benchmark evaluations.

Using Hugging Face’s datasets library, we’ll read in this dataset.

from datasets import load_dataset
from pprint import pprint

dataset = load_dataset("teknium/OpenHermes-2.5")
pprint(dataset)

Next, we’ll take a look at an example of this data, reading the first record from the ‘train’ split.

dataset['train'][0]

Examining the ‘source’ values in this dataset, we can see that a large number of notable research projects were aggregated together, including the Unnatural Instructions dataset from the paper cited earlier.
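As a quick sanity check, you can count the samples per source. This is a minimal sketch, assuming the train split exposes the ‘source’ column the way the filter below relies on.

from collections import Counter
from pprint import pprint

# Tally how many samples each source project contributed to the train split.
source_counts = Counter(dataset['train']['source'])
pprint(source_counts.most_common(10))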

Using a filtering function, we can pare down our dataset to a more specific subset to work with. We’ll narrow down to just the Unnatural Instructions subset for our example.

To filter down to different segments of this larger dataset, just update the key, value pairs in the key_value_pairs dictionary below.

from pprint import pprint

def filter_dataset(sample, key_value_pairs):
    # Keep a sample only if it matches every key/value pair.
    for key, value in key_value_pairs.items():
        if key not in sample or sample[key] != value:
            return False
    return True

key_value_pairs = {
    'source': 'UnnaturalInstructions',
    # 'category': 'general'
}

filtered_dataset = dataset.filter(lambda sample: filter_dataset(sample, key_value_pairs))
pprint(filtered_dataset)

Now with our dataset filtered down, we’ll use 3 examples from our narrowed dataset to generate our inference prompt.

For our example, we’ll randomly select 3.

import random
from pprint import pprint

# Randomly pick 3 indices from the filtered train split.
examples_index_ref = random.sample(range(len(filtered_dataset['train'])), 3)
pprint(examples_index_ref)

Using the randomly selected examples, we then create a dictionary isolating just the conversation text between human and assistant.

from pprint import pprint

examples = {}
for i, j in enumerate(examples_index_ref):
    conversation = filtered_dataset['train'][j]['conversations']
    formatted_conversation = (f"human: {conversation[0]['value']}\n"
                              f"assistant: {conversation[1]['value']}")
    examples[f'example{i+1}'] = formatted_conversation
pprint(examples)

We take the example conversations and format them into a generation_prompt following the pattern from the paper.

from pprint import pprint

generation_prompt = (f"Example 1:\n{examples['example1']}\n\n"
                     f"Example 2:\n{examples['example2']}\n\n"
                     f"Example 3:\n{examples['example3']}\n\n"
                     f"Example 4:")
pprint(generation_prompt)

Now we can take this prompt and invoke a completion from a model. For our purposes, we’ll use the public inference client from Hugging Face to generate a sample.

Notably, we’ll use the Mistral-7B model for inference and pass a temperature of 0.7 to add some variability to the response. For your own purposes, you may consider a different model and parameters to generate your desired responses. These models allow for a lot of flexibility, so as long as licensing and resources allow, you should always experiment!

from huggingface_hub import InferenceClient
from pprint import pprint

client = InferenceClient()
output = client.text_generation(
    prompt=generation_prompt,
    model='mistralai/Mistral-7B-Instruct-v0.2',
    max_new_tokens=512,
    temperature=0.7,
    stop_sequences=['Example 5:']
)
# The stop sequence may be echoed at the end of the generation; trim it off.
output = output.removesuffix('Example 5:').strip()
pprint(output)

And that’s it!

A Colab notebook with all the code in this article can be found here.

With some allocated compute, you can create powerful data examples to leverage for few-shot prompting or model fine-tuning.
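If you want to fold the generated text back into a dataset for fine-tuning, a minimal sketch might look like the following. The parse_generated_example helper is hypothetical, and it assumes the model kept the ‘human:’/‘assistant:’ pattern from the prompt; real outputs can be messier, so expect to add validation.

import re
from pprint import pprint

def parse_generated_example(text):
    # Hypothetical helper: split a generated "human: ... assistant: ..." block
    # back into a conversation pair. Returns None if the pattern isn't found.
    match = re.search(r"human:\s*(.*?)\s*assistant:\s*(.*)", text, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # The dict keys below are an assumption for illustration; match them to
    # whatever schema your training pipeline expects.
    return [
        {"from": "human", "value": match.group(1).strip()},
        {"from": "assistant", "value": match.group(2).strip()},
    ]

new_conversation = parse_generated_example(output)
if new_conversation is not None:
    pprint(new_conversation)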

I hope this was helpful!

Please don’t hesitate to comment or reach out with any questions.

Thanks for reading!

Peter Chung is the founder and principal engineer of Innova Forge, a Machine Learning development studio working with enterprise and startup customers to deploy ML and LLM applications. www.innova-forge.com

References & Resources

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor https://arxiv.org/abs/2212.09689v1


Published via Towards AI
