Create an Instruction Dataset From Scratch

Last Updated on September 18, 2024 by Editorial Team

Author(s): Arthur Lagacherie

Originally published on Towards AI.

image by me

My goal today is to create an instruction dataset from Wikipedia texts.

But first, what is an instruction dataset? It is a dataset used to fine-tune an LLM. After pre-training, an LLM can’t really answer questions; it can only recite knowledge. That’s why the second step of its training is instruction tuning: training it to answer real questions.

For that, we need an instruction dataset, a dataset composed of one column for the question and one column for the answer.

alpaca-cleaned dataset
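To make the two-column layout concrete, here is a minimal sketch of what a couple of rows could look like in plain Python. The rows and the column names `question` and `answer` are made up for illustration; published datasets may use different column names.

```python
import json

# Hypothetical rows, written by hand for illustration only.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "How many days are there in a week?", "answer": "Seven."},
]

# One JSON object per line (JSON Lines) is a common way to store such rows.
jsonl = "\n".join(json.dumps(row) for row in rows)
print(jsonl)

# Reading the lines back gives the same rows.
restored = [json.loads(line) for line in jsonl.splitlines()]
```

Each row pairs one question with one answer, which is the shape the rest of this article builds up.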

How will I create the dataset?

For each row, we will take a Wikipedia text and use an LLM to extract a question and its answer from it.

image by me

To get the Wikipedia texts, I’m not going to scrape all of Wikipedia, because dozens of Hugging Face datasets already provide all the texts we need. I found this dataset, which is of good quality and in English only.

Now that we have the dataset, we need an LLM to generate the questions and the answers. I chose the Gemma 2 model, in its 2b or 9b version, because these models are small and smart: to process more than one thousand rows, we need a model that is as small as possible.

Let’s begin.

LLMs test

First, I quantized the models to make them faster.

I want to test whether the 2b version is usable for our task, so I download it.

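from transformers import pipeline, AutoTokenizer, TextStreamer

model_id = "Arthur-LAGACHERIE/Gemma-2-2b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# stream the generated tokens to the console, without repeating the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)

model = pipeline('text-generation',
                 model=model_id,
                 tokenizer=tokenizer,
                 streamer=streamer)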

2.22 GB of memory

Now let’s ask a question.

prompt = """
### Context
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including, though not necessarily limited to, governments, nation states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations. As a historically left-wing movement, usually placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement. Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished in most parts of the world and had a significant role in workers' struggles for emancipation. Various anarchist schools of thought formed during this period. Anarchists have taken part in several revolutions, most notably in the Paris Commune, the Russian Civil War and the Spanish Civil War, whose end marked the end of the classical era of anarchism. In the last decades of the 20th and into the 21st century, the anarchist movement has been resurgent once more, growing in popularity and influence within anti-capitalist, anti-war and anti-globalisation movements.

### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""


chat = [
{"role": "user", "content": prompt},
]
out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]

question: What is the historical context of modern anarchism? <endofthequestion>answer: Modern anarchism emerged from the Enlightenment and flourished in the latter half of the 19th and the first decades of the 20th century, with a significant role in workers’ struggles for emancipation.

🤯 Gemma 2 2B works so well!!

Now I need to execute this code to separate the question and the answer.

out = out.split("<endofthequestion>")
question = out[0]
answer = out[1]
print(question, answer)

‘question: What is the historical context of modern anarchism? ’

“answer: Modern anarchism emerged from the Enlightenment and flourished in the latter half of the 19th and the first decades of the 20th century, with a significant role in workers’ struggles for emancipation. \n”

It’s decided, I’ll use Gemma 2 2b.

Dataset

First, let’s download the dataset. It contains 6M rows, so I load it in streaming mode to avoid using too much memory.

from datasets import load_dataset
dataset = load_dataset('vietgpt/wikipedia_en', split='train', streaming=True)
dataset = iter(dataset)

With streaming, we can loop over the texts.

for i in range(2):
    text = next(dataset)["text"]
    print(text[:500])
    print("\n")

Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary…

Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, …
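Calling next() by hand works, but itertools.islice expresses "take the first n items" directly and stops cleanly if the stream ends early. The sketch below uses a small stand-in generator (`fake_stream`, a hypothetical name) instead of the real dataset so that it runs without downloading anything. The streamed dataset is also iterable, so the same pattern should carry over, but that is an assumption rather than something tested here.

```python
from itertools import islice


def fake_stream():
    """Stand-in for the streamed dataset: yields one dict per article."""
    for title in ["Anarchism", "Albedo", "Alabama"]:
        yield {"text": f"{title} is the subject of this article. " * 20}


# Take only the first two items; the third one is never produced.
previews = [row["text"][:50] for row in islice(fake_stream(), 2)]
for preview in previews:
    print(preview)
```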

Create the loop

Now that we know how to download the dataset and the model, we can combine them in a loop to create the instruction dataset. So let’s begin by downloading the model and the dataset.

!pip install bitsandbytes

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "Arthur-LAGACHERIE/Gemma-2-2b-4bit"
model = pipeline('text-generation',
                 model=model_id)

from datasets import load_dataset
dataset = load_dataset('vietgpt/wikipedia_en', split='train', streaming=True)
dataset = iter(dataset)

After downloading we can write the start of the main loop:

  • load the text
  • define the prompt
for i in range(1):
    # keep only the first 1000 characters of the article
    text = next(dataset)["text"][:1000]
    prompt = f"""
### Context
{text}

### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""

Now we can generate the output and separate the question from the answer.

# in the loop
chat = [
{"role": "user", "content": prompt},
]
out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]
out = out.split("<endofthequestion>")
question = out[0]
answer = out[1]
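A side note on the context passed to the prompt: slicing with [:1000] can cut the text in the middle of a sentence. A possible refinement, sketched below, is to cut at the last full stop before the limit. The helper names `truncate_at_sentence` and `build_prompt` are hypothetical and not part of the original code; the template is the one used in this article.

```python
def truncate_at_sentence(text: str, limit: int = 1000) -> str:
    """Cut text to at most `limit` characters, preferably at a sentence end."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_stop = cut.rfind(". ")
    # Fall back to the hard cut when no sentence boundary is found.
    return cut[: last_stop + 1] if last_stop != -1 else cut


def build_prompt(text: str, limit: int = 1000) -> str:
    """Fill the article's prompt template with a truncated context."""
    context = truncate_at_sentence(text, limit)
    return (
        "\n### Context\n"
        f"{context}\n\n"
        "### Instruct\n"
        "From the context information generate a question and an answer.\n"
        "Generate it in this specific format:\n"
        "question<endofthequestion>answer\n"
    )


prompt = build_prompt("First sentence. Second sentence. Third one is cut", limit=40)
print(prompt)
```

With a limit of 40 characters, the unfinished third sentence is dropped and the context ends on a complete sentence.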

But after some tests, I noticed that Gemma 2 wrote “question” and “answer” before the question and the answer. This is a problem because if we train an LLM on this dataset, then when we ask it a question it will answer: “Answer: blah blah blah…”.

So I created a function to strip words like “question:” or “answer:”.

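def clear(text, words):
    """Remove `words` and everything before it; `done` tells whether it was found."""
    text = text.split(words)
    if len(text) > 1:
        # keep only what comes after the first occurrence
        text = ''.join(text[1:])
        done = True
    else:
        text = ''.join(text)
        done = False
    return text, done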

Then, I integrate it into the loop and add a list system to save the questions and the answers.

word_question = ["Question:", "question:", "Question :", "question :", "question", "Question"]
word_answer = ["answer:", "Answer:", "answer :", "Answer :", "answer", "Answer"]
questions = []
answers = []

for i in range(1):
    # rest of the code

    for word in word_question:
        text, done = clear(question, word)
        if done:
            break
    question = text

    for word in word_answer:
        text, done = clear(answer, word)
        if done:
            break
    answer = text

    questions.append(question)
    answers.append(answer)
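One caveat with clear: it splits on the label wherever it occurs, so an answer such as "The answer is 42." would lose everything up to and including the word "answer". A possible alternative, sketched here with a regular expression, removes the label only when it sits at the very start of the text and is followed by a colon. The name `strip_label` is hypothetical, and requiring the colon is a design choice of this sketch: a bare "Question" without a colon would be left in place.

```python
import re


def strip_label(text: str, label: str) -> str:
    """Remove a leading 'label:' (any letter case) from the start of text."""
    pattern = rf"^\s*{re.escape(label)}\s*:\s*"
    return re.sub(pattern, "", text, count=1, flags=re.IGNORECASE).strip()


print(strip_label("question: What is the historical context of modern anarchism? ", "question"))
print(strip_label("Answer : Modern anarchism emerged from the Enlightenment.\n", "answer"))
# The label is not at the start here, so the text is returned unchanged.
print(strip_label("The answer is 42.", "answer"))
```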

To push the dataset to the hub we need to execute the following lines of code:

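import pandas as pd
from datasets import Dataset

data = {"questions":questions, "answers":answers}
data = pd.DataFrame.from_dict(data)
data = Dataset.from_pandas(data)
# replace "hf_token" with your own Hugging Face token
data.push_to_hub("Arthur-LAGACHERIE/wikipedia-instruct", "01", token="hf_token")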

I run it… and an error appears. The model doesn’t always write the separation tag correctly, so an error occurs when we try to take the second part of the output.

out = out.split("<endofthequestion>") # there is no <endofthequestion> in out
question = out[0]
answer = out[1] # <== IndexError here

To solve the problem, I add an “if” to check whether <endofthequestion> is in the output.

from tqdm import tqdm
import pandas as pd
from datasets import Dataset

word_question = ["Question:", "question:", "Question :", "question :", "question", "Question"]
word_answer = ["answer:", "Answer:", "answer :", "Answer :", "answer", "Answer"]
questions = []
answers = []

for i in tqdm(range(1000)):
    text = next(dataset)["text"][:1000]
    prompt = f"""
### Context
{text}

### Instruct
From the context information generate a question and an answer.
Generate it in this specific format:
question<endofthequestion>answer
"""

    chat = [
        {"role": "user", "content": prompt},
    ]
    out = model(chat, max_length=4024)[0]["generated_text"][1]["content"]

    # keep the row only if the separation tag is present
    if "<endofthequestion>" in out:
        out = out.split("<endofthequestion>")
        question = out[0]
        answer = out[1]

        for word in word_question:
            text, done = clear(question, word)
            if done:
                break
        question = text

        for word in word_answer:
            text, done = clear(answer, word)
            if done:
                break
        answer = text

        questions.append(question)
        answers.append(answer)

data = {"questions":questions, "answers":answers}
data = pd.DataFrame.from_dict(data)
data = Dataset.from_pandas(data)
# replace "hf_token" with your own Hugging Face token
data.push_to_hub("Arthur-LAGACHERIE/wikipedia-instruct", token="hf_token")

And voila, the code is totally functional. So let’s run it.

Finally, the dataset has been created and pushed to the hub 1 hour, 28 minutes, and 6 seconds later.👍

You can see it here.

A little problem

It seems to work well, except for one thing: the length.

1000 – 828 = 172 rows were skipped because the output had no separation tag. It is not too serious, but it does matter.
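The numbers from this run give a rough idea of the throughput. The sketch below only redoes the arithmetic from the figures reported in this article (1000 texts, 828 rows kept, 1 hour 28 minutes 6 seconds, upload included). The projection for a larger dataset uses a hypothetical target of 5000 rows and assumes the skip rate and speed stay the same, which may not hold.

```python
import math

texts_processed = 1000
rows_kept = 828
run_seconds = 1 * 3600 + 28 * 60 + 6  # 1 h 28 min 6 s

rows_skipped = texts_processed - rows_kept
skip_rate = rows_skipped / texts_processed
seconds_per_text = run_seconds / texts_processed

print(f"skipped: {rows_skipped} ({skip_rate:.1%})")
print(f"time per text: {seconds_per_text:.2f} s")

# Assumption: same skip rate and speed on future runs.
target_rows = 5000  # hypothetical target size
texts_needed = math.ceil(target_rows / (1 - skip_rate))
hours_needed = texts_needed * seconds_per_text / 3600
print(f"texts needed for {target_rows} rows: {texts_needed} (~{hours_needed:.1f} h)")
```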

I could solve the issue by having Gemma verify the sentence, but that would take too much time. So I’ll leave it as it is; it’s not so bad.
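A cheaper option than a second model call might be to make the parsing more tolerant. The sketch below accepts small variations of the tag (different letter case, extra spaces) and, when no tag is found, falls back to splitting on an "answer:" label. The function name `split_qa` is hypothetical, and the idea that the skipped outputs look like this is an assumption: the skipped rows were not inspected, so this may recover only some of them.

```python
import re
from typing import Optional, Tuple

# The tag, allowing spaces and any letter case, e.g. "<End Of The Question>".
TAG = re.compile(r"<\s*end\s*of\s*the\s*question\s*>", re.IGNORECASE)
# Fallback separator: an "answer:" label anywhere in the output.
ANSWER_LABEL = re.compile(r"\banswer\s*:", re.IGNORECASE)


def split_qa(output: str) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None when the output cannot be split."""
    for pattern in (TAG, ANSWER_LABEL):
        parts = pattern.split(output, maxsplit=1)
        # Accept the split only if both sides contain some text.
        if len(parts) == 2 and parts[0].strip() and parts[1].strip():
            return parts[0].strip(), parts[1].strip()
    return None


print(split_qa("question: Who?<endofthequestion>answer: Me."))
print(split_qa("question: Who? <End Of The Question> answer: Me."))
print(split_qa("question: Who?\nanswer: Me."))
print(split_qa("no structure at all"))
```

Outputs that still return None can be skipped exactly as before, so the fallback only adds rows and never changes the ones that already parsed. Note that the fallback consumes the "answer:" label while the tag split leaves it in place, so the label-stripping step is still needed afterwards.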

Conclusion

I will continue to create this dataset until I reach a respectable size (a few thousand). You can like it if you want.

Arthur-LAGACHERIE/wikipedia-instruct · Datasets at Hugging Face


I hope you enjoyed this article. If so, you can clap for it (you can also follow me if you want).
