Fine-Tuning and Evaluating Large Language Models: Key Benchmarks and Metrics
Author(s): Saif Ali Kheraj
Originally published on Towards AI.
In generative AI, we must first define the problem statement and then select the model that best fits the task at hand. For example, we can use the FLAN-T5 model to summarize dialogues, or choose any other model. We then try one-shot, two-shot, and more-shot prompting to see how it performs. If it does not produce the desired results, we may need to fine-tune the model, and then evaluate it. In this post, we will go into greater detail about fine-tuning and evaluating the model.
In-context learning has limitations for certain cases and does not work well for smaller models. In-context learning is the process of trying zero-shot, one-shot, or few-shot prompting, where you provide examples to the LLM in the prompt so that the model can generate a response for an unseen input.
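For instance, a one-shot prompt for dialogue summarization might look like the sketch below; the dialogue, summary, and template wording are made up for illustration rather than taken from any specific dataset.

```python
# Illustrative one-shot (in-context learning) prompt for dialogue summarization.
# The example dialogue and summary are invented for demonstration purposes.
one_shot_prompt = """Summarize the following conversation.

Conversation:
Tom: Are we still meeting at 5?
Sara: Yes, see you at the cafe.
Summary: Tom and Sara confirm they will meet at the cafe at 5.

Conversation:
{dialogue}
Summary:"""

print(one_shot_prompt.format(dialogue="Ann: The report is ready.\nBen: Great, send it over."))
```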
LLM Finetuning
Fine-tuning is a supervised learning process that uses a dataset of labeled examples to update the LLM's weights. The labeled examples are prompt-completion pairs. The fine-tuning process extends the model's training to improve its ability to generate high-quality completions for a specific task.
For example, if we want to fine-tune the model to improve its sentiment analysis capability, we would build a dataset of examples that begin with the instruction "Classify." We would build a dataset with many such example prompts, as shown below.
Classify the following sentence into positive or negative: Text: {input_text} Sentiment: {expected_sentiment}
We can use many such example prompts as our training dataset. Each example includes the instruction to classify the text along with the associated label.
For translation:
Translate this sentence to Spanish: English: {input_sentence} Spanish: {expected_translation}
To summarize what we have said:
- Use a Pretrained Model: A model already trained on a large, general dataset.
- Task-Specific Examples: Prompt-completion pairs specific to the desired task.
- Prepared Instruction Dataset Split: We divide the dataset into training, validation, and test sets.
- Fine-Tuning Process: We fine-tune the model using the training and validation sets and then evaluate its performance on the test set using cross-entropy loss.
Surprisingly, good results can be obtained with relatively few examples. In comparison to the billions of pieces of text that the model saw during pre-training, only 500β1,000 examples can consistently produce good results.
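As a concrete illustration of this workflow, here is a minimal, hedged sketch using the Hugging Face transformers and datasets libraries. The checkpoint name, column names, and hyperparameters are illustrative assumptions, not values prescribed by this article.

```python
# A minimal sketch of instruction fine-tuning with Hugging Face transformers and datasets.
# The checkpoint, column names, and hyperparameters below are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prompt-completion pairs built from instruction templates like the ones above.
examples = [
    {"prompt": "Classify the following sentence into positive or negative: Text: I loved the movie.",
     "completion": "positive"},
    {"prompt": "Translate this sentence to Spanish: English: Good morning.",
     "completion": "Buenos días"},
    # ... in practice, 500-1,000+ examples
]

def preprocess(batch):
    # Tokenize prompts as inputs and completions as labels.
    model_inputs = tokenizer(batch["prompt"], truncation=True, padding="max_length", max_length=128)
    labels = tokenizer(batch["completion"], truncation=True, padding="max_length", max_length=32)
    model_inputs["labels"] = labels["input_ids"]  # for real training, mask padding tokens with -100
    return model_inputs

splits = Dataset.from_list(examples).train_test_split(test_size=0.5)
tokenized = splits.map(preprocess, batched=True, remove_columns=["prompt", "completion"])

args = Seq2SeqTrainingArguments(output_dir="flan-t5-instruction-tuned",
                                num_train_epochs=3, per_device_train_batch_size=8)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()  # cross-entropy loss on the completions drives the weight updates
```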
Drawbacks of finetuning on a single task:
- Catastrophic forgetting happens because the full fine-tuning process modifies the weights of the original LLM. While this leads to great performance on a single fine-tuning task, it can degrade performance on other tasks.
How to avoid catastrophic forgetting?
1. Multi-Task Fine-Tuning
Catastrophic forgetting can be avoided by providing a variety of examples to the model. For example, we can provide examples of summarization prompts, translation prompts, and rating prompts. This requires numerous examples of each instruction together with their completions. The "instruct" version of a model is fine-tuned in this way so that it can follow prompted instructions.
One example is the FLAN family of models. FLAN (Fine-tuned LAnguage Net) refers to a specific collection of instruction datasets used to fine-tune various models. For example, FLAN-T5 is the T5 model fine-tuned on the FLAN instruction collection, and SAMSum is one of the datasets it was trained on. There are several pre-trained FLAN-T5 models that have been further fine-tuned on SAMSum, including philschmid/flan-t5-base-samsum and jasonmcaffee/flan-t5-large-samsum on Hugging Face. If we want to fine-tune the FLAN-T5 model specifically for formal dialogue conversations, we can do so using the DialogSum dataset.
Models fine-tuned on DialogSum can be applied to areas like customer support, meeting minutes generation, chatbot summarization, and more.
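As a quick, hedged illustration, such a SAMSum-fine-tuned checkpoint can be loaded directly for dialogue summarization; the model id below is assumed to be available on the Hugging Face Hub, and the dialogue is made up.

```python
# Hedged example: using a FLAN-T5 checkpoint fine-tuned on SAMSum for dialogue summarization.
# The model id is assumed to exist on the Hugging Face Hub; the dialogue is invented.
from transformers import pipeline

summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum")

dialogue = (
    "Amanda: I baked cookies. Do you want some?\n"
    "Jerry: Sure!\n"
    "Amanda: I'll bring you some tomorrow :-)"
)
print(summarizer(dialogue)[0]["summary_text"])
```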
2. PEFT (Parameter-Efficient Fine-Tuning)
Training LLMs is computationally intensive, and full fine-tuning is expensive because it can change every weight in the model. First, we start with a pretrained LLM such as GPT-3, which already has a vast amount of knowledge and understanding of language. Then we provide a task-specific dataset, which could be data for question answering, sentiment analysis, or any other custom dataset. During training, the full fine-tuning process makes adjustments to every weight in the pretrained model. Beyond the model weights themselves, other components of training (optimizer states, gradients, forward activations, and temporary memory) must also be held in memory, and these add substantially to the cost, as sketched in the rough estimate below.
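The following back-of-the-envelope sketch illustrates why those extra components matter. The per-parameter byte counts are rough assumptions (fp32 weights and an Adam-style optimizer), not exact figures for any particular framework.

```python
# Rough, back-of-the-envelope memory estimate for full fine-tuning with fp32 weights and
# an Adam-style optimizer. The per-parameter byte counts are assumptions, not exact figures.
def full_finetune_memory_gb(num_params: float) -> float:
    bytes_per_param = (
        4    # model weights (fp32)
        + 4  # gradients
        + 8  # optimizer states (two moments for Adam)
        + 4  # rough allowance for activations and temporary buffers
    )
    return num_params * bytes_per_param / 1e9

print(f"~{full_finetune_memory_gb(1e9):.0f} GB for a 1B-parameter model")      # ~20 GB
print(f"~{full_finetune_memory_gb(175e9):.0f} GB for a 175B-parameter model")  # ~3,500 GB
```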
Three main approaches are used in PEFT: selective, reparameterization, and additive.
1. Selective
Here, we select a subset of initial LLM parameters to fine-tune.
2. Reparameterization
We reparameterize model weights using a low-rank representation. We will discuss LoRA in detail below.
LoRA: Low-Rank Adaptation
Each layer in a transformer architecture has multiple weight matrices for different operations, like self-attention or feed-forward networks. These matrices can have different sizes depending on the specific layer and configuration. Let us take an example by picking a matrix of size 512 x 64 = 32,768 parameters. Let us now see LoRA with rank = 8.
- Original Weight Matrix: Dimensions: 512 x 64, Parameters: 32,768 (512 x 64)
- Matrix A (Rank Decomposition): Dimensions: 8 x 64 (rank x original column dimension), Parameters: 512 (8 x 64)
- Matrix B (Rank Decomposition): Dimensions: 512 x 8 (original row dimension x rank), Parameters: 4,096 (512 x 8)
- Total LoRA Parameters: 512 (A) + 4,096 (B) = 4,608
Approximation:
The original weight matrix (W) is frozen, and its update is approximated by the low-rank product of B and A:
ΔW ≈ B · A, so the adapted weights are W + B · A (the product B · A has the same 512 x 64 shape as W).
Reasoning Behind the Dimensions:
- The dimensions of A and B are chosen to capture the essence of the original weight matrix (W) with fewer parameters.
- The rank (here, 8) controls the trade-off between efficiency and accuracy. A lower rank leads to fewer parameters but might result in a slightly less accurate approximation.
- We can also create task-specific decomposition matrices.
In the example we discussed, LoRA achieves a reduction of approximately 86% in the number of trainable parameters needed for fine-tuning. Here's the summary.
- Original Weight Matrix: 32,768 parameters (512 x 64)
- Total LoRA Parameters: 4,608 parameters (512 + 4,096)
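A small NumPy sketch makes the arithmetic concrete. The initialization values are illustrative; in practice B is typically initialized to zero so the update starts as a no-op.

```python
import numpy as np

d, k, r = 512, 64, 8               # original dimensions and LoRA rank
W = np.random.randn(d, k)          # frozen pretrained weights: 512 x 64 = 32,768 parameters

A = np.random.randn(r, k) * 0.01   # trainable: 8 x 64  = 512 parameters
B = np.zeros((d, r))               # trainable: 512 x 8 = 4,096 parameters (zero init => no change at start)

delta_W = B @ A                    # low-rank update, same 512 x 64 shape as W
W_adapted = W + delta_W            # effective weights used at inference time

trainable = A.size + B.size
print(trainable, f"~{1 - trainable / W.size:.0%} fewer trainable parameters")  # 4608, ~86%
```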
3. Additive
We add trainable layers or parameters to the model in the form of adapter modules.
The two main additive approaches are:
- Adapter Modules: These are small, trainable neural network modules strategically inserted into specific layers of the pre-trained LLM. They help the LLM learn task-specific information without drastically changing its underlying knowledge.
- Prompt Tuning: This approach doesn't involve adding any new modules to the model itself. Instead, it focuses on crafting specific prompts (essentially instructions or questions) that guide the pre-trained LLM toward the desired task.
All these approaches are similar to transfer learning, but they are more efficient in that they fine-tune only a small subset of parameters rather than complete layers. Even adapter modules are lightweight.
PEFT is particularly beneficial when dealing with large LLMs that have billions or even trillions of parameters, as fine-tuning all of them can be computationally expensive and resource-intensive.
PEFT is less prone to the catastrophic forgetting problems of full fine-tuning. Full fine-tuning results in a new version of the model for every task you train on.
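In practice, libraries such as Hugging Face's peft implement these ideas. A minimal LoRA setup might look like the sketch below, where the checkpoint and the hyperparameters (r, lora_alpha, lora_dropout) are illustrative assumptions rather than recommended values.

```python
# Hedged sketch of applying LoRA with the Hugging Face peft library.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # assumed checkpoint

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # sequence-to-sequence task
    r=8,                              # rank of the decomposition matrices
    lora_alpha=32,                    # scaling factor for the LoRA update
    lora_dropout=0.05,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```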
Metrics to assess the performance
For language models, evaluation is more challenging since the output is non-deterministic.
Let us explore some of the metrics that we can use to evaluate.
ROUGE-1: (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE-1 is a recall-oriented metric, which means it prioritizes identifying how many of the important words from the reference summary are included in the generated summary. ROUGE-1 focuses on individual words (unigrams); similarly, ROUGE-2 focuses on bigrams, and so on.
Let us take an example of ROUGE-1 and walk through it step-by-step:
- Reference Text: "Mike really loves drinking tea."
- Generated Text: "Mike adores sipping tea."
Step 1: Identify Unigrams
- Reference Text Unigrams: {Mike, really, loves, drinking, tea}
- Generated Text Unigrams: {Mike, adores, sipping, tea}
Step 2: Count Overlapping Unigrams
- Overlapping Unigrams: {Mike, tea}
- Number of Overlapping Unigrams: 2
ROUGE-1 Recall = overlapping unigrams / unigrams in reference = 2/5 = 0.4
ROUGE-1 Precision = overlapping unigrams / unigrams in generated text = 2/4 = 0.5
ROUGE-1 F1 Score = 2 · (Precision · Recall) / (Precision + Recall) = 2 · (0.2/0.9) ≈ 0.44
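These numbers can be reproduced with a few lines of plain Python; this is a hand-rolled illustration for the example above, not a full ROUGE implementation.

```python
# Hand-rolled ROUGE-1 for the example above; an illustration, not a full implementation.
from collections import Counter

reference = "Mike really loves drinking tea".lower().split()
generated = "Mike adores sipping tea".lower().split()

overlap = sum((Counter(reference) & Counter(generated)).values())   # {mike, tea} -> 2

recall = overlap / len(reference)                                   # 2 / 5 = 0.4
precision = overlap / len(generated)                                # 2 / 4 = 0.5
f1 = 2 * precision * recall / (precision + recall)                  # ~0.44
print(recall, precision, round(f1, 2))
```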
ROUGE-L:
ROUGE-L is a metric used to evaluate the quality of text by measuring the longest common subsequence (LCS) between a generated text and a reference text. The LCS takes the order of words into account, making it more sensitive to the overall structure of the text than simple n-gram overlap.
Let's walk through an example step-by-step:
- Reference Text: "It is cold outside" (it shares two common subsequences with the generated text: "It is" and "cold outside")
- Generated Text: "It is very cold outside" (the same two subsequences, "It is" and "cold outside", appear here)
ROUGE-L Recall = LCS(Gen, Ref) / unigrams in reference = 2/4 = 0.5
ROUGE-L Precision = LCS(Gen, Ref) / unigrams in generated text = 2/5 = 0.4
ROUGE-L F1 = 2 · (0.2/0.9) ≈ 0.44
ROUGE Clipping
ROUGE can sometimes give misleading results. Let us explore this:
Example 1: Repetitive Generated Text
- Reference (human): "The sun is shining brightly."
- Generated output: "shining shining shining shining"
Without clipping:
- Unigram Matches: "shining" (matches four times)
- ROUGE-1 Precision: 4/4 = 1.0
This perfect score is misleading because the generated text is repetitive and lacks meaningful content.
With clipping:
- Clipped Unigram Matches: "shining" (counted only once, as it appears once in the reference)
- Modified Precision: 1/4 = 0.25
Clipping provides a more accurate reflection of the generated textβs quality.
Example 2: Reordered Generated Text
- Reference (human): "The sun is shining brightly."
- Generated output: "brightly the sun is shining"
With clipping:
- Clipped Unigram Matches: "the", "sun", "is", "shining", "brightly" (each appears exactly as often as in the reference)
- Modified Precision: 5/5 = 1.0
Despite the different word order, clipping correctly identifies that the generated text includes all relevant unigrams in the correct frequency, giving it a perfect score. This could also be misleading.
To sum up, ROUGE clipping improves evaluation accuracy by limiting unigram matches to the count present in the reference text, preventing artificially inflated scores from repetitive words. As the second example shows, however, clipping still ignores word order, so a reordered sentence can receive a perfect score. A small helper that implements clipping is sketched below.
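Here is a hand-rolled sketch of clipped unigram precision for the two examples above; tokenization is simplified by stripping punctuation and lowercasing.

```python
# Hand-rolled clipped unigram precision for the two examples above.
from collections import Counter

def tokens(text: str) -> list[str]:
    # Simplified tokenization: lowercase and strip punctuation.
    return [w.strip(".,!?").lower() for w in text.split()]

def clipped_unigram_precision(generated: str, reference: str) -> float:
    gen_counts = Counter(tokens(generated))
    ref_counts = Counter(tokens(reference))
    # Each generated word counts only up to the number of times it appears in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    return clipped / sum(gen_counts.values())

print(clipped_unigram_precision("shining shining shining shining", "The sun is shining brightly."))  # 0.25
print(clipped_unigram_precision("brightly the sun is shining", "The sun is shining brightly."))      # 1.0
```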
BLEU
BLEU primarily focuses on n-gram precision, which means it counts how often sequences of words (n-grams) in the machine translation match those in the reference translations. It considers 1-grams (single words), 2-grams (phrases), and so on. You can think of it as the average precision across a range of n-gram sizes.
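In practice, you would rarely compute these metrics by hand. A hedged sketch using the Hugging Face evaluate library is shown below; it assumes evaluate and rouge_score are installed and that the metric scripts can be fetched.

```python
# Hedged sketch using the Hugging Face `evaluate` library
# (assumes `pip install evaluate rouge_score` and network access to fetch the metrics).
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["Mike adores sipping tea."]
references = ["Mike really loves drinking tea."]

print(rouge.compute(predictions=predictions, references=references))                # rouge1, rouge2, rougeL, ...
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))  # bleu plus n-gram precisions
```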
Other Metrics and Benchmarks
There are other important metrics and benchmarks used for evaluation; one notable example is HELM (Holistic Evaluation of Language Models).
One important feature of HELM is that it assesses models on metrics beyond basic accuracy measures, such as precision and F1 score. The benchmark also includes metrics for fairness, bias, and toxicity, which are becoming increasingly important as LLMs become more capable of human-like language generation and, in turn, of exhibiting potentially harmful behavior. HELM is a living benchmark that aims to continuously evolve with the addition of new scenarios, metrics, and models.
Conclusion
In this post, we covered important aspects of fine-tuning a large language model. We started by discussing zero-shot, one-shot, and few-shot prompting to see whether the model generates the correct output. If it does not, we need to fine-tune the model: we pick a relevant model based on our task requirements and then fine-tune it by giving it more examples along with labels. We also saw how fine-tuning on a single task can lead to catastrophic forgetting, and that the way to avoid it is to fine-tune on multiple tasks so that the model generalizes well. In addition, we can use parameter-efficient fine-tuning, where we discussed three techniques that also avoid computational problems; techniques like LoRA are very beneficial. We then moved on to evaluating the model, where we studied important metrics like ROUGE and BLEU, as well as other available benchmarks.