Create an Instruction Dataset From Scratch
Last Updated on September 18, 2024 by Editorial Team
Author(s): Arthur Lagacherie
Originally published on Towards AI.
My goal today is to create an instruction dataset from Wikipedia texts.
But first, what is an instruct dataset? An instruct dataset is a dataset used to fine-tune LLMs: after pre-training, LLMs can't answer real questions, they can only recite knowledge. That's why the second step of their training is the instruction part: training them to answer real questions.
For that, we need an instruction dataset: a dataset composed of one column for the question and one column for the answer.
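For instance, two rows of such a dataset might look like this (a made-up illustration, not actual rows from the final dataset):

rows = [
    {"question": "What is albedo?",
     "answer": "Albedo is the measure of the diffuse reflection of solar radiation."},
    {"question": "Where did modern anarchism emerge from?",
     "answer": "Modern anarchism emerged from the Enlightenment."},
]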
How will I create the dataset?
For each row, we will take a Wikipedia text and extract a question and an answer from it with an LLM.
To get the Wikipedia texts, I'm not going to scrape all of Wikipedia, because dozens of Hugging Face datasets already provide all the texts we need. I found this dataset, which is of good quality and in English only.
Now that we have the dataset, we need an LLM to generate the questions and the answers. I chose the Gemma 2 model, in its 2B or 9B version, because they are small and smart: to process more than a thousand rows, we need a model as small as possible.
Let's begin.
Testing the LLMs
First, I quantized the LLMs to make them faster.
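A minimal sketch of how such a 4-bit quantization can be done with bitsandbytes (the settings below are common defaults, not necessarily the exact ones used for the uploaded model):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (assumed defaults)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the base Gemma 2 2B instruct model with the 4-bit config
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)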
I want to test whether the 2B version is usable for our task, so I download it.
from transformers import AutoTokenizer, TextStreamer, pipeline

model_id = "Arthur-LAGACHERIE/Gemma-2-2b-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream tokens as they are generated

model = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    streamer=streamer,
)
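As a quick way to verify the footprint, transformers exposes get_memory_footprint() on the underlying model (a small snippet I'm adding here, not in the original):

# print the quantized model's memory footprint in GB
print(f"{model.model.get_memory_footprint() / 1e9:.2f} GB")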
The quantized model only takes 2.22 GB of memory.
Now letβs ask a question.
prompt = """
### Context
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including, though not necessarily limited to, governments, nation states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations. As a historically left-wing movement, usually placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement. Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished in most parts of the world and had a significant role in workers' struggles for emancipation. Various anarchist schools of thought formed during this period. Anarchists have taken part in several revolutions, most notably in the Paris Commune, the Russian Civil War and the Spanish Civil War, whose end marked the end of the classical era of anarchism. In the last decades of the 20th and into the 21st century, the anarchist movement has been resurgent once more, growing in popularity and influence within anti-capitalist, anti-war and anti-globalisation movements.
### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""
chat = [
    {"role": "user", "content": prompt},
]
# the pipeline returns the whole chat; the assistant's reply is the second message
out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
question: What is the historical context of modern anarchism? <endofthequestion>answer: Modern anarchism emerged from the Enlightenment and flourished in the latter half of the 19th and the first decades of the 20th century, with a significant role in workers' struggles for emancipation.
🤯 Gemma 2 2B works so well!!
Now I need to execute this code to separate the question and the answer.
out = out.split("<endofthequestion>")
question = out[0]
answer = out[1]
print(question, answer)
“question: What is the historical context of modern anarchism? ”
“answer: Modern anarchism emerged from the Enlightenment and flourished in the latter half of the 19th and the first decades of the 20th century, with a significant role in workers' struggles for emancipation. \n”
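Note that both parts keep stray whitespace (the trailing space and the \n above); a .strip() tidies them up (a small addition of mine, not in the original code):

# optional cleanup: drop leading/trailing whitespace from both parts
question = question.strip()
answer = answer.strip()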
It's decided, I'll use Gemma 2 2B.
Dataset
First, let's download the dataset. It is composed of 6M rows, so I download it with streaming to avoid using too much memory.
from datasets import load_dataset
dataset = load_dataset('vietgpt/wikipedia_en', split='train', streaming=True)
dataset = iter(dataset)
With streaming, we can loop over the texts.
for i in range(2):
    text = next(dataset)["text"]
    print(text[:500])
    print("\n")
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary…
Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, …
Create the loop
Now that we know how to download the dataset and the model, we can combine them in a loop to create the instruction dataset. So let's begin by loading the model and the dataset.
!pip install bitsandbytes

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, TextStreamer
model_id = "Arthur-LAGACHERIE/Gemma-2-2b-4bit"
model = pipeline('text-generation', model=model_id)

from datasets import load_dataset
dataset = load_dataset('vietgpt/wikipedia_en', split='train', streaming=True)
dataset = iter(dataset)
After downloading, we can write the start of the main loop:
- load the text
- define the prompt
for i in range(1):
    text = next(dataset)["text"][:1000]
    prompt = f"""
### Context
{text}
### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""
Now we can generate the output and separate the question from the answer.
    # still in the loop
    chat = [
        {"role": "user", "content": prompt},
    ]
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
    out = out.split("<endofthequestion>")
    question = out[0]
    answer = out[1]
But after some tests, I noticed that Gemma 2 writes “question” and “answer” before the question and the answer. This is a problem because if we train an LLM on this dataset, when we ask it a question it will answer: “Answer: blah blah blah…”.
So I created a function to strip prefixes like “question:” or “answer:”.
def clear(text, words):
    # split on the prefix; if it occurs, keep everything after it
    text = text.split(words)
    if len(text) > 1:
        text = ''.join(text[1:])
        done = True
    else:
        text = ''.join(text)
        done = False
    return text, done
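A quick check of how clear behaves (the example strings are made up):

# prefix present: it is stripped and done is True
q, done = clear("question: What is albedo?", "question:")
print(repr(q), done)   # ' What is albedo?' True

# prefix absent: the text comes back unchanged and done is False
q, done = clear("What is albedo?", "question:")
print(repr(q), done)   # 'What is albedo?' False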
Then I integrate it into the loop and add two lists to save the questions and the answers.
word_question = ["Question:", "question:", "Question :", "question :", "question", "Question"]
word_answer = ["answer:", "Answer:", "answer :", "Answer :", "answer", "Answer"]
questions = []
answers = []

for i in range(1):
    # rest of the code

    for word in word_question:
        text, done = clear(question, word)
        if done:
            break
    question = text

    for word in word_answer:
        text, done = clear(answer, word)
        if done:
            break
    answer = text

    questions.append(question)
    answers.append(answer)
To push the dataset to the hub we need to execute the following lines of code:
import pandas as pd
from datasets import Dataset

data = {"questions": questions, "answers": answers}
data = pd.DataFrame.from_dict(data)
data = Dataset.from_pandas(data)
data.push_to_hub("Arthur-LAGACHERIE/wikipedia-instruct", "01", token="hf_token")
I run it… and an error appears. Sometimes the model doesn't write the separation tag correctly, so an error occurs when we try to take the second part of the output.
out = out.split("<endofthequestion>")  # there is no <endofthequestion> in the output
question = out[0]
answer = out[1]  # <== IndexError here: the list has only one element
To solve the problem, I add an “if” to verify that <endofthequestion> is in the output.
from tqdm import tqdm

word_question = ["Question:", "question:", "Question :", "question :", "question", "Question"]
word_answer = ["answer:", "Answer:", "answer :", "Answer :", "answer", "Answer"]
questions = []
answers = []

for i in tqdm(range(1000)):
    text = next(dataset)["text"][:1000]
    prompt = f"""
### Context
{text}
### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""
    chat = [
        {"role": "user", "content": prompt},
    ]
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]

    # skip the row if the model did not write the separation tag
    if "<endofthequestion>" in out:
        out = out.split("<endofthequestion>")
        question = out[0]
        answer = out[1]

        for word in word_question:
            text, done = clear(question, word)
            if done:
                break
        question = text

        for word in word_answer:
            text, done = clear(answer, word)
            if done:
                break
        answer = text

        questions.append(question)
        answers.append(answer)

data = {"questions": questions, "answers": answers}
data = pd.DataFrame.from_dict(data)
data = Dataset.from_pandas(data)
data.push_to_hub("Arthur-LAGACHERIE/wikipedia-instruct", token="hf_token")
And voilà, the code is fully functional. So let's run it.
Finally, the dataset has been created and pushed to the Hub 1 hour, 28 minutes, and 6 seconds later. 👍
You can see it here.
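To double-check the upload, the dataset can be loaded back from the Hub (a quick verification snippet of mine; pass the config name too if one was used when pushing):

from datasets import load_dataset

# load the pushed dataset back and inspect the first row
ds = load_dataset("Arthur-LAGACHERIE/wikipedia-instruct", split="train")
print(ds[0])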
A little problem
It seems to work well, except for one thing: the length.
1000 − 828 = 172 rows were skipped because the model produced no separation tag. It's not too serious, but it does matter.
I could solve the issue by having Gemma verify each output, but that would take too much time. So I'll leave it like that; it's not so bad.
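One cheap alternative (a sketch of mine, not what I did, since it multiplies the generation cost) would be to simply retry the generation when the tag is missing:

MAX_RETRIES = 3  # hypothetical setting

# regenerate until the model emits the separation tag, or give up
for attempt in range(MAX_RETRIES):
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
    if "<endofthequestion>" in out:
        break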
Conclusion
I will continue to grow this dataset until it reaches a respectable size (a few thousand rows). You can like it if you want.
Arthur-LAGACHERIE/wikipedia-instruct Β· Datasets at Hugging Face
I hope you enjoyed this article. If so, you can clap for it (you can also follow me if you want).
Published via Towards AI