Create an Instruction Dataset From Scratch
Last Updated on September 18, 2024 by Editorial Team
Author(s): Arthur Lagacherie
Originally published on Towards AI.
My goal today is to create an instruction dataset from Wikipedia texts.
But first, what is an instruct dataset? An instruct dataset is a dataset used to fine-tune LLMs: after pre-training, LLMs can't answer real questions, they can only recite knowledge. That's why the second step of their training is the instruction part: training them to answer real questions.
For that, we need an instruction dataset: a dataset composed of one column for the question and one column for the answer.
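For instance, two rows of such a dataset might look like this (a made-up illustration, not actual rows from the final dataset):

rows = [
    {"question": "What is albedo?",
     "answer": "Albedo is the measure of the diffuse reflection of solar radiation."},
    {"question": "Where did modern anarchism emerge from?",
     "answer": "Modern anarchism emerged from the Enlightenment."},
]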
How will I create the dataset?
For each row, we will take a Wikipedia text and extract a question and an answer from it with an LLM.
To get the Wikipedia texts, I'm not going to scrape all of Wikipedia, because dozens of Hugging Face datasets already provide all the texts we need. I found this dataset, which is of good quality and in English only.
Now that we have the dataset, we need an LLM to generate the questions and the answers. I chose the Gemma 2 model, in its 2B or 9B version, because they are small and smart: to process more than a thousand rows, we need a model as small as possible.
Let's begin.
Testing the LLMs
First, I quantized the LLMs to make them faster.
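A minimal sketch of how such a 4-bit quantization can be done with bitsandbytes (the settings below are common defaults, not necessarily the exact ones used for the uploaded model):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (assumed defaults)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the base Gemma 2 2B instruct model with the 4-bit config
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)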
I want to test whether the 2B version is usable for our task, so I download it.
from transformers import AutoTokenizer, TextStreamer, pipeline

model_id = "Arthur-LAGACHERIE/Gemma-2-2b-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream tokens as they are generated

model = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    streamer=streamer,
)
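As a quick way to verify the footprint, transformers exposes get_memory_footprint() on the underlying model (a small snippet I'm adding here, not in the original):

# print the quantized model's memory footprint in GB
print(f"{model.model.get_memory_footprint() / 1e9:.2f} GB")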
The quantized model only takes 2.22 GB of memory.
Now letβs ask a question.
prompt = """
### Context
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including, though not necessarily limited to, governments, nation states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations. As a historically left-wing movement, usually placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement. Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished in most parts of the world and had a significant role in workers' struggles for emancipation. Various anarchist schools of thought formed during this period. Anarchists have taken part in several revolutions, most notably in the Paris Commune, the Russian Civil War and the Spanish Civil War, whose end marked the end of the classical era of anarchism. In the last decades of the 20th and into the 21st century, the anarchist movement has been resurgent once more, growing in popularity and influence within anti-capitalist, anti-war and anti-globalisation movements.
### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""
chat = [
    {"role": "user", "content": prompt},
]
# the pipeline returns the whole chat; the assistant's reply is the second message
out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
question: What is the historical context of modern anarchism? <endofthequestion>answer: Modern anarchism emerged from the Enlightenment and flourished in the latter half of the 19th and the first decades of the 20th century, with a significant role in workers' struggles for emancipation.
🤯 Gemma 2 2B works so well!!
Now I need to execute this code to separate the question and the answer.
out = out.split("<endofthequestion>")
question = out[0]
answer = out[1]
print(question, answer)
“question: What is the historical context of modern anarchism? ”
“answer: Modern anarchism emerged from the Enlightenment and flourished in the latter half of the 19th and the first decades of the 20th century, with a significant role in workers' struggles for emancipation. \n”
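Note that both parts keep stray whitespace (the trailing space and the \n above); a .strip() tidies them up (a small addition of mine, not in the original code):

# optional cleanup: drop leading/trailing whitespace from both parts
question = question.strip()
answer = answer.strip()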
It's decided, I'll use Gemma 2 2B.
Dataset
First, let's download the dataset. It is composed of 6M rows, so I download it with streaming to avoid using too much memory.
from datasets import load_dataset
dataset = load_dataset('vietgpt/wikipedia_en', split='train', streaming=True)
dataset = iter(dataset)
With streaming, we can loop over the texts.
for i in range(2):
    text = next(dataset)["text"]
    print(text[:500])
    print("\n")
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary…
Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, …
Create the loop
Now that we know how to download the dataset and the model, we can combine them in a loop to create the instruction dataset. So let's begin by loading the model and the dataset.
!pip install bitsandbytes

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, TextStreamer
model_id = "Arthur-LAGACHERIE/Gemma-2-2b-4bit"
model = pipeline('text-generation', model=model_id)

from datasets import load_dataset
dataset = load_dataset('vietgpt/wikipedia_en', split='train', streaming=True)
dataset = iter(dataset)
After downloading, we can write the start of the main loop:
- load the text
- define the prompt
for i in range(1):
    text = next(dataset)["text"][:1000]
    prompt = f"""
### Context
{text}
### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""
Now we can generate the output and separate the question from the answer.
    # still in the loop
    chat = [
        {"role": "user", "content": prompt},
    ]
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
    out = out.split("<endofthequestion>")
    question = out[0]
    answer = out[1]
But after some tests, I noticed that Gemma 2 writes “question” and “answer” before the question and the answer. This is a problem because if we train an LLM on this dataset, when we ask it a question it will answer: “Answer: blah blah blah…”.
So I created a function to strip prefixes like “question:” or “answer:”.
def clear(text, words):
    # split on the prefix; if it occurs, keep everything after it
    text = text.split(words)
    if len(text) > 1:
        text = ''.join(text[1:])
        done = True
    else:
        text = ''.join(text)
        done = False
    return text, done
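A quick check of how clear behaves (the example strings are made up):

# prefix present: it is stripped and done is True
q, done = clear("question: What is albedo?", "question:")
print(repr(q), done)   # ' What is albedo?' True

# prefix absent: the text comes back unchanged and done is False
q, done = clear("What is albedo?", "question:")
print(repr(q), done)   # 'What is albedo?' False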
Then I integrate it into the loop and add two lists to save the questions and the answers.
word_question = ["Question:", "question:", "Question :", "question :", "question", "Question"]
word_answer = ["answer:", "Answer:", "answer :", "Answer :", "answer", "Answer"]
questions = []
answers = []

for i in range(1):
    # rest of the code

    for word in word_question:
        text, done = clear(question, word)
        if done:
            break
    question = text

    for word in word_answer:
        text, done = clear(answer, word)
        if done:
            break
    answer = text

    questions.append(question)
    answers.append(answer)
To push the dataset to the hub we need to execute the following lines of code:
import pandas as pd
from datasets import Dataset

data = {"questions": questions, "answers": answers}
data = pd.DataFrame.from_dict(data)
data = Dataset.from_pandas(data)
data.push_to_hub("Arthur-LAGACHERIE/wikipedia-instruct", "01", token="hf_token")
I run it… and an error appears. Sometimes the model doesn't write the separation tag correctly, so an error occurs when we try to take the second part of the output.
out = out.split("<endofthequestion>")  # there is no <endofthequestion> in the output
question = out[0]
answer = out[1]  # <== IndexError here: the list has only one element
To solve the problem, I add an “if” to verify that <endofthequestion> is in the output.
from tqdm import tqdm

word_question = ["Question:", "question:", "Question :", "question :", "question", "Question"]
word_answer = ["answer:", "Answer:", "answer :", "Answer :", "answer", "Answer"]
questions = []
answers = []

for i in tqdm(range(1000)):
    text = next(dataset)["text"][:1000]
    prompt = f"""
### Context
{text}
### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""
    chat = [
        {"role": "user", "content": prompt},
    ]
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]

    # skip the row if the model did not write the separation tag
    if "<endofthequestion>" in out:
        out = out.split("<endofthequestion>")
        question = out[0]
        answer = out[1]

        for word in word_question:
            text, done = clear(question, word)
            if done:
                break
        question = text

        for word in word_answer:
            text, done = clear(answer, word)
            if done:
                break
        answer = text

        questions.append(question)
        answers.append(answer)

data = {"questions": questions, "answers": answers}
data = pd.DataFrame.from_dict(data)
data = Dataset.from_pandas(data)
data.push_to_hub("Arthur-LAGACHERIE/wikipedia-instruct", token="hf_token")
And voilà, the code is fully functional. So let's run it.
Finally, the dataset has been created and pushed to the Hub 1 hour, 28 minutes, and 6 seconds later. 👍
You can see it here.
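To double-check the upload, the dataset can be loaded back from the Hub (a quick verification snippet of mine; pass the config name too if one was used when pushing):

from datasets import load_dataset

# load the pushed dataset back and inspect the first row
ds = load_dataset("Arthur-LAGACHERIE/wikipedia-instruct", split="train")
print(ds[0])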
A little problem
It seems to work well, except for one thing: the length.
1000 − 828 = 172 rows were skipped because the model produced no separation tag. It's not too serious, but it does matter.
I could solve the issue by having Gemma verify each output, but that would take too much time. So I'll leave it like that; it's not so bad.
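One cheap alternative (a sketch of mine, not what I did, since it multiplies the generation cost) would be to simply retry the generation when the tag is missing:

MAX_RETRIES = 3  # hypothetical setting

# regenerate until the model emits the separation tag, or give up
for attempt in range(MAX_RETRIES):
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
    if "<endofthequestion>" in out:
        break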
Conclusion
I will continue to grow this dataset until it reaches a respectable size (a few thousand rows). You can like it if you want.
Arthur-LAGACHERIE/wikipedia-instruct Β· Datasets at Hugging Face
I hope you enjoyed this article. If so, you can clap for it (you can also follow me if you want).
Published via Towards AI