You Can No Longer Fail To Understand How To Use Large Language Models
Last Updated on March 16, 2023 by Editorial Team
Author(s): Michaël Karpe
Originally published on Towards AI.
A hands-on approach to learning how Large Language Models work in practice.
Why a new article on Large Language Models?
The launch and incredible speed of adoption of ChatGPT in the last few months have turned this artificial intelligence chatbot into a genuine general knowledge topic. Anyone who hasn't heard of ChatGPT by now is immediately suspected of having been disconnected from the real world for several months, even though the intelligence of this chatbot is more artificial than real.
It is therefore becoming urgent for self-proclaimed experts in artificial intelligence to understand in concrete detail how to use these Large Language Models (LLMs), if they want not only to understand what comes out of them but also, above all, to avoid falling behind and to defend the legitimacy of their status...
Where to start?
Computer science courses from top universities are traditionally available online for free and are probably the best starting point for someone who has already studied in this field. Since that is the author's case, after reviewing the latest resources on LLMs from universities such as UC Berkeley, Stanford, and MIT, we found Stanford's CS 324 course on Advances in Foundation Models [5] to be the most suitable for what we want to do in this article: get a quick but detailed understanding of how to use LLMs. The coursework description says it all:
"Both the early assignment and the quarter-long project are designed to get you hands-on experience with foundation models."
The early assignment comes in the form of a Google Colab Notebook, which, once completed, will allow us to know how to use LLMs such as BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) [4]. In this article, we will analyze and go through three main steps to run our first LLM and understand the main challenge when using such models: task selection, prompt development, and model comparison.
The first step is about choosing the task we want our LLM to perform, as well as the dataset and evaluation metric used for this task, using the HuggingFace datasets library. We will choose a well-known benchmark dataset: GLUE (General Language Understanding Evaluation).
By browsing the tasks associated with this dataset (from its original site or its HuggingFace dataset card), we discover 11 different tasks and their associated metrics.
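As a quick sanity check, the available GLUE configurations can also be listed programmatically with the HuggingFace datasets library (a small sketch, not part of the original notebook):
from datasets import get_dataset_config_names

# Print the list of GLUE task configurations available on the HuggingFace Hub
print(get_dataset_config_names("glue"))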
In this article, we will choose the sentiment classification task with the Stanford Sentiment Treebank (sst2), whose associated evaluation metric is accuracy, i.e. the ratio of correct predictions to total predictions made.
from datasets import load_dataset
from evaluate import load as load_metric

# Load the sst2 validation split as a pandas DataFrame, and the matching GLUE metric
dataset = load_dataset("glue", "sst2", split="validation").to_pandas()
metric = load_metric("glue", "sst2")
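Before going further, we can quickly check what the data and the metric interface look like (a small sketch, not part of the original notebook):
# Inspect a few validation examples: each row has a sentence and a binary label (0 = negative, 1 = positive)
print(dataset[["sentence", "label"]].head())

# The metric takes integer predictions and references; this toy example yields an accuracy of 2/3
print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))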
The second part of the notebook is about developing effective prompts for LLMs. In other words, how do we talk to an LLM?
"LLMs are whatever you prompt them to be." (Andrej Karpathy)
Indeed, while most machine learning practitioners have learned and are used to hyperparameter tuning to improve the performance of a machine learning model, LLMs are all about prompt engineering or, in simpler terms, input tuning. We will see that to improve the performance of an LLM, the simplest approach is first to improve the format of the request we submit to it.
Zero-shot and few-shot prompting
We distinguish two main techniques for prompting language models: zero-shot prompting [2] and few-shot prompting [1]. To cite the Stanford CS 324 authors:
"In zero-shot prompting, an instruction for the task is usually specified in natural language. The model is expected to follow the specification and output a correct response, without any examples (hence "zero shots").
In few-shot prompting, we provide a few examples in the prompt, optionally including task instructions as well (all as natural language). Even without said instructions, our hope is that the LLM can use the examples to autoregressively complete what comes next to solve the desired task."
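As an illustration, a zero-shot prompt for the sentiment task considered below could look like the following (the wording is an assumption for illustration and is not taken from the course notebook):
# Hypothetical zero-shot prompt: the task is described in natural language, with no examples
def zero_shot_prompt(review: str) -> str:
    return f'''Classify the sentiment of the following movie review as Positive or Negative.
Review: {review}
Sentiment:'''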
In this article, we will focus on few-shot prompting, although similar work can be performed for zero-shot prompting. As a first step, we will observe the performance of the sentiment classification prompt proposed in the notebook on the selected sst2 task, using the BLOOM model with 1.7 billion parameters. Although this model is already large by definition, since it is an LLM, it remains relatively small compared to the largest BLOOM model, which has 176 billion parameters.
Experimenting with few-shot prompting using BLOOM
A proposed prompt for the sentiment classification task is written as follows:
# Prompt 1
f'''Review: The movie was horrible
Sentiment: Negative
Review: The movie was the best movie I have watched all year!!!
Sentiment: Positive
Review: The film was a disaster
Sentiment: Negative
Review: {review}
Sentiment:'''
We import the torch and transformers libraries to build our LLM from the BLOOM model with 1.7 billion parameters: bigscience/bloom-1b7.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Use a GPU if available; 8-bit loading requires the bitsandbytes and accelerate libraries
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "bigscience/bloom-1b7"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
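As a quick check that everything is set up correctly, we can generate a single token after a short prompt (the exact completion will vary with the model):
# Generate one token after a minimal prompt to verify the pipeline works
print(generator("Review: The movie was horrible\nSentiment:", max_new_tokens=1)[0]["generated_text"])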
We created a generator object associated with a "text-generation" task. We can now define the following functions to prompt our LLM to generate one new token (note the argument max_new_tokens=1) from the presented prompt: prompt_1 builds the prompt, and generate_sentiment performs the generation. We expect the newly generated token to be either "Positive" or "Negative".
def prompt_1(review: str) -> str:
    return f'''Review: The movie was horrible
Sentiment: Negative
Review: The movie was the best movie I have watched all year!!!
Sentiment: Positive
Review: The film was a disaster
Sentiment: Negative
Review: {review}
Sentiment:'''

def generate_sentiment(review: str) -> str:
    # Generate a single new token and return it (expected to be "Positive" or "Negative")
    generated_text = generator(prompt_1(review), max_new_tokens=1)[0]['generated_text']
    return generated_text.split()[-1]
With the previous functions being defined, we are now able to generate one token from the proposed prompt and convert the generated token to an integer value to evaluate the performance of our LLM.
dataset["prediction"] = dataset["sentence"].apply(generate_sentiment)
dataset["prediction_int"] = dataset["prediction"].str.lower().map({"negative": 0, "positive": 1}).fillna(-1)
accuracy = metric.compute(predictions=dataset["prediction_int"], references=dataset["label"])["accuracy"]
print(accuracy)
Testing this prompt with the bigscience/bloom-1b7 model on the sst2 validation dataset results in an accuracy of 66.28%, which is not amazing but already better than tossing a coin.
Experimenting with different prompts
Let's now propose another few-shot prompt to the LLM, with new sentences that are not movie reviews but still express a positive or negative sentiment.
# Prompt 2
f'''Review: This has been the worst trade deal in the history of trade deals, maybe ever
Sentiment: Negative
Review: Amazing introduction assignment on how to use large language models
Sentiment: Positive
Review: This code is full of bugs, it's impossible to run it
Sentiment: Negative
Review: {review}
Sentiment:'''
Testing this prompt with the bigscience/bloom-1b7 model on the sst2 validation dataset results in an accuracy of 59.98%, which is lower than the accuracy obtained with the first prompt. This second trial already shows the importance of the examples given to the LLM for understanding the task it needs to perform.
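Since we are going to test several prompts, it can be convenient to factor the evaluation code above into a small helper that accepts any prompt function (a sketch, not part of the original notebook):
# Evaluate any prompt function on the sst2 validation set and return the accuracy
def evaluate_prompt(prompt_fn, dataset, generator, metric) -> float:
    def generate(review: str) -> str:
        generated_text = generator(prompt_fn(review), max_new_tokens=1)[0]["generated_text"]
        return generated_text.split()[-1]
    predictions = dataset["sentence"].apply(generate).str.lower().map({"negative": 0, "positive": 1}).fillna(-1)
    return metric.compute(predictions=predictions, references=dataset["label"])["accuracy"]

# Example: re-run the evaluation of the first prompt
print(evaluate_prompt(prompt_1, dataset, generator, metric))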
On the third attempt, let's change not only the sentences used as few-shot examples for the LLM but also the format of the examples provided. Here, we pay less attention to the format of the prompt, expecting the LLM to complete the proposed sentence with either "positive" or "negative".
# Prompt 3
f'''The sentiment of the sentence "I hate this world." is negative
The sentiment of the sentence "I love you all!" is positive
The sentiment of the sentence "It will never work." is negative
The sentiment of the sentence {review} is'''
Testing this prompt with the bigscience/bloom-1b7 model on the sst2 validation dataset results in an accuracy of 31.77%, which is much lower than the accuracies obtained with the first two tested prompts. When looking deeper at the results, we even observe that the proposed word is not always "positive" or "negative".
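One way to look deeper is to count the generated tokens that fall outside the two expected labels, reusing the prediction columns computed earlier (a small sketch):
# Show the most frequent generated tokens that are neither "positive" nor "negative"
unexpected = dataset[dataset["prediction_int"] == -1]
print(unexpected["prediction"].value_counts().head(10))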
From this section, we can conclude that the choice of prompt matters greatly for our LLM text generation task.
Model comparison
In this final step, we propose to compare the performance of BLOOM models along two dimensions: model size (i.e., the number of model parameters) and model training objectives (depending on the objective function and the training dataset).
Model size
Regarding model size, we will first compare bigscience/bloom-1b7 with bigscience/bloom-3b, the latter being composed of 3 billion parameters.
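Switching models only requires changing the checkpoint name when loading; the rest of the pipeline stays the same (reusing the loading code above):
# Load the 3-billion-parameter BLOOM model; tokenizer and generator are rebuilt accordingly
model_name = "bigscience/bloom-3b"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)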
Testing the first prompt with the bigscience/bloom-3b model on the sst2 validation dataset results in an accuracy of 68.69%, which is a slightly higher accuracy than the one obtained with bigscience/bloom-1b7 (which was 66.28%).
From this experiment, we note that the performance of our LLM increases with the model size.
Model training objectives
Regarding model training objectives, we will now compare the performance of bigscience/bloom-3b with bigscience/bloomz-3b, the latter having been instruction fine-tuned, which means that the model has been explicitly trained to follow instructions.
Testing the first prompt with the bigscience/bloomz-3b model on the sst2 validation dataset results in an accuracy of 73.51%, which is again slightly higher than the accuracy obtained with bigscience/bloom-3b (which was 68.69%).
From this experiment, we note that the performance of our LLM is better with fine-tuned models than with pre-trained models.
Instruction fine-tuning
Instruction fine-tuning is described in detail in the Crosslingual Generalization through Multitask Finetuning paper [3], which also reports all the results obtained on benchmark datasets, as well as the prompts used to obtain them.
In Table 7 of this paper, which shows all experimental results, we can partially confirm the observations made in the two previous subsections.
Regarding the increase in performance with model size, we observe that for a given number of tasks and datasets, the performance of the different pre-trained (but not multitask fine-tuned) models is similar regardless of the number of parameters. In other words, we do not always observe performance increasing with the number of parameters for pre-trained models, as we did in our experiment.
However, for pre-trained and multitask fine-tuned models, not only do we observe, in most cases, better performance than for pre-trained but not multitask fine-tuned models, but we also observe performance improving with the number of parameters. With multitask fine-tuned models, it appears that the largest models still have learning capacity when provided with additional data.
After reading the previously mentioned paper and analyzing its results, we would want to experiment with larger multitask fine-tuned models, aiming to get even better results with minimal changes to the code. However, GPU memory available with a free Google Colab Notebook is not sufficient for running BLOOM multitask fine-tuned models with 7.1 billion parameters and beyond.
Thus, the second easiest way to improve the classification results appears to be prompt engineering, as we experimented with before.
Contextualizing and refining the task
A first approach to improving the developed prompts is to contextualize the prompt and describe more precisely the task we want the Autoregressive Language Model to perform. This first finding comes from studying the prompts used for training and evaluating BLOOM models, as described in the Appendix of the previously mentioned BLOOM Multitask Finetuning paper.
Here are two examples of prompt templates used for training and evaluating topic classification tasks for BLOOM multitask fine-tuned models:
# clue csl
'''After John wrote the abstract "{{abst}}", he wrote these keywords "{{ keyword |
join(', ') }}". Do you think his choice of keywords was correct? Answer {{
answer_choices[1]}} or {{ answer_choices[0]}}.'''
# clue tnews
'''Given the topics of {{answer_choices[:-1] | join(', ') }}, and {{
answer_choices[-1] }}, specify which of them best represents the following
sentence:
{{ sentence }}
Best:'''
Leveraging these templates, we decided to evaluate bigscience/bloomz-3b again with a prompt that gives more context about the task the model has to perform.
# Prompt 4
f'''Given single sentences extracted from movie reviews, specify which of sentiments "Positive" or "Negative" best classifies the following sentences.
Movie review: The movie was horrible
Best sentiment classification: Negative
Movie review: The movie was the best movie I have watched all year!!!
Best sentiment classification: Positive
Movie review: The film was a disaster
Best sentiment classification: Negative
Movie review: {review}
Best sentiment classification:'''
Testing this more contextualized prompt with the bigscience/bloomz-3b model on the sst2 validation dataset results in an accuracy of 77.29%, which is, once again, slightly higher than the previously obtained accuracy (which was 73.51% with the first tested prompt).
Increasing shot number
A second approach for improving the developed prompts is to increase the shot number, i.e., the number of task examples provided as input. We will now evaluate bigscience/bloomz-3b again with the previous prompt, increasing the number of provided examples from 3 to 5 (as shown below) and then to 10 (the 5 below plus 5 others omitted for the sake of the article's length).
# Prompt 5
f'''Given single sentences extracted from movie reviews, specify which of sentiments "Positive" or "Negative" best classifies the following sentences.
Movie review: The movie was horrible
Best sentiment classification: Negative
Movie review: The movie was the best movie I have watched all year!!!
Best sentiment classification: Positive
Movie review: The film was a disaster
Best sentiment classification: Negative
Movie review: I will never watch this film again...
Best sentiment classification: Negative
Movie review: This film deserves its 5-star reputation.
Best sentiment classification: Positive
Movie review: {review}
Best sentiment classification:'''
Testing this more contextualized, 5-shot prompt (respectively 10-shot prompt) with the bigscience/bloomz-3b model on the sst2 validation dataset results in an accuracy of 90.48% (respectively 92.66%), which is a great improvement over the accuracy obtained with only 3 examples!
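For readers who want to experiment with other shot counts, the contextualized prompt can be generalized to an arbitrary number of examples with a small helper (a sketch; the (review, sentiment) pairs are passed in explicitly):
from typing import List, Tuple

# Build an n-shot version of the contextualized prompt from a list of (review, sentiment) pairs
def n_shot_prompt(review: str, examples: List[Tuple[str, str]]) -> str:
    header = ('Given single sentences extracted from movie reviews, specify which of sentiments '
              '"Positive" or "Negative" best classifies the following sentences.\n')
    shots = "".join(f"Movie review: {text}\nBest sentiment classification: {sentiment}\n" for text, sentiment in examples)
    return header + shots + f"Movie review: {review}\nBest sentiment classification:"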
Results summary
The following table summarizes the accuracy obtained on the sst2 validation dataset depending on the model and prompt used.
Model | Prompt | Accuracy
bigscience/bloom-1b7 | Prompt 1 (3-shot) | 66.28%
bigscience/bloom-1b7 | Prompt 2 (3-shot) | 59.98%
bigscience/bloom-1b7 | Prompt 3 (3-shot) | 31.77%
bigscience/bloom-3b | Prompt 1 (3-shot) | 68.69%
bigscience/bloomz-3b | Prompt 1 (3-shot) | 73.51%
bigscience/bloomz-3b | Prompt 4 (contextualized, 3-shot) | 77.29%
bigscience/bloomz-3b | Prompt 5 (contextualized, 5-shot) | 90.48%
bigscience/bloomz-3b | Contextualized, 10-shot | 92.66%
Without going into more detail in this article (stay tuned!), we note that the classification accuracy depends on both the model and the prompt used. As we already noted in the Model comparison section, for a given number of parameters, a fine-tuned Autoregressive Language Model is more likely to give better results, and so is a more contextualized prompt with an explicit structure highlighting the task to be performed.
Concluding remarks
In this article, we leveraged Stanford's CS 324 course on Advances in Foundation Models to get hands-on experience with Autoregressive Language Models.
After selecting a sentiment classification task, the Stanford Sentiment Treebank dataset, and accuracy as the metric for evaluating several BLOOM models, we recalled the definitions of zero-shot and few-shot prompting and experimented with prompt development.
The evaluation of the few-shot experiments and the model comparison highlighted not only the higher performance of fine-tuned models but, most importantly, the importance of prompt engineering for getting the best results from Autoregressive Language Models.
References
[1] T. Brown et al., Language Models are Few-Shot Learners (2020), Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
[2] J. Wei et al., Finetuned Language Models Are Zero-Shot Learners (2021), arXiv preprint arXiv:2109.01652
[3] N. Muennighoff et al., Crosslingual Generalization through Multitask Finetuning (2022), arXiv preprint arXiv:2211.01786
[4] T. Le Scao et al., BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022), arXiv preprint arXiv:2211.05100
[5] C. Ré et al., CS 324 - Advances in Foundation Models (2023), Stanford University Computer Science Department