

Fine-Tuning and Evaluating Large Language Models: Key Benchmarks and Metrics

Last Updated on July 20, 2024 by Editorial Team

Author(s): Saif Ali Kheraj

Originally published on Towards AI.

Figure 1: Generative AI Project Lifecycle by Author (Referred from deeplearning.ai)

In generative AI, we must first define the problem statement and then select the model that best fits the specific task at hand. For example, we can use the FLAN-T5 model to summarize dialogues, or choose any other model. We then try one-shot, two-shot, and few-shot prompting to see how the model performs. If it does not produce the desired results, we may need to fine-tune it, and after that we evaluate it. In this post, we will go into greater detail about fine-tuning and evaluating the model.

In-context learning is a process in which you try zero-shot, one-shot, or few-shot prompting, providing examples to the LLM inside the prompt so that the model can generalize to a new, unseen prompt. It has limitations in certain cases and does not work well for smaller models.
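As a rough illustration, here is what zero-shot and one-shot prompts for dialogue summarization could look like; the wording of the instruction and the example dialogues are made up for this sketch:

```python
dialogue = "Tom: Are we still on for dinner? Anna: Yes, 7 pm at the usual place."

# Zero-shot: only the instruction and the input, no examples.
zero_shot_prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"

# One-shot: prepend one worked example so the model can imitate the pattern;
# few-shot prompts simply repeat this with more examples.
example_dialogue = "Sam: Can you send the report? Lee: Sure, I'll email it tonight."
example_summary = "Lee will email Sam the report tonight."
one_shot_prompt = (
    f"Summarize the following conversation.\n\n{example_dialogue}\n\n"
    f"Summary: {example_summary}\n\n"
    f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
)
```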

LLM Finetuning

Figure 2 by Author: Prompt-Completion Pairs

Fine-tuning is a supervised learning process that uses a dataset of labeled examples to update the LLM's weights. The labeled examples are prompt-completion pairs, as illustrated in the diagram above. The fine-tuning process extends the model's training to improve its ability to generate high-quality completions for a specific task.

For example, if we want to fine-tune the model to improve its sentiment analysis capability, we would build up a dataset of examples that begin with the instruction "Classify", such as:

  • Classify the following sentence into positive or negative: Text: {input_text} Sentiment: {expected_sentiment}

We can use many such example prompts as our training dataset; each includes the instruction to classify the text along with the associated label.

For translation:

  • Translate this sentence to Spanish: English: {input_sentence} Spanish: {expected_translation}

To summarize what we have said:

  1. Use Pretrained Model: A model already trained on a large, general dataset.
  2. Task-Specific Examples: Prompt-completion pairs specific to the desired task.
  3. Prepared Instruction Dataset Split: We divide the dataset into training, validation, and test sets (sketched in code below).
  4. Fine-Tuning Process: We fine-tune the model using the training and validation datasets and then evaluate performance on the test set using cross-entropy loss.
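A minimal sketch of steps 2 and 3, assuming we build the dataset by hand as a list of dictionaries; the example texts, field names, and split ratios are illustrative:

```python
import random

# Step 2: task-specific prompt-completion pairs (in practice, hundreds of examples or more).
raw_examples = [
    ("The battery life is fantastic.", "positive"),
    ("The screen cracked after one day.", "negative"),
    ("Customer support resolved my issue quickly.", "positive"),
    ("The app crashes every time I open it.", "negative"),
]

dataset = [
    {
        "prompt": f"Classify the following sentence into positive or negative:\nText: {text}\nSentiment:",
        "completion": f" {label}",
    }
    for text, label in raw_examples
]

# Step 3: shuffle and split into training, validation, and test sets
# (80/10/10 here; a real dataset would be much larger).
random.seed(42)
random.shuffle(dataset)
n = len(dataset)
train = dataset[: int(0.8 * n)]
validation = dataset[int(0.8 * n) : int(0.9 * n)]
test = dataset[int(0.9 * n) :]
```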

Surprisingly, good results can be obtained with relatively few examples: in comparison to the billions of pieces of text the model saw during pre-training, 500–1,000 examples can consistently produce good results.

Drawbacks of finetuning on a single task:

  • Catastrophic forgetting happens because the full fine-tuning process modifies the weights of the original LLM. While this leads to great performance on a single fine-tuning task, it can degrade performance on other tasks.

How to Avoid Catastrophic Forgetting?

1. Multi-Task Fine-Tuning

Figure3 by Author: Multitask fine-tuning

Catastrophic forgetting can be avoided by providing a variety of examples to the model. For example, we can provide summarization prompts, translation prompts, and rating prompts. This requires numerous examples of each instruction with its completion. The 'instruct' version of a model is fine-tuned in this way so that it can follow prompted instructions.

One example is the FLAN family of models. FLAN (fine-tuned language net) refers to a specific set of instruction datasets used to fine-tune various models; FLAN-T5, for example, is the T5 model fine-tuned on the FLAN instructions. SAMSum is one of the datasets used in FLAN-T5's instruction tuning, and there are pre-trained FLAN-T5 models that have been further fine-tuned on SAMSum, including philschmid/flan-t5-base-samsum and jasonmcaffee/flan-t5-large-samsum on Hugging Face. If we want to fine-tune the FLAN-T5 model specifically for formal dialogue conversations, we can do so using the DialogSum dataset.

Models fine-tuned on DialogSum can be applied to areas like customer support, meeting minutes generation, chatbot summarization, and more.
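As a quick sketch, one of the SAMSum-tuned checkpoints mentioned above can be loaded with the Hugging Face transformers pipeline; the dialogue below is made up, and the generation settings are illustrative:

```python
from transformers import pipeline

# A FLAN-T5 checkpoint that has already been fine-tuned on the SAMSum dialogue dataset.
summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum")

dialogue = (
    "John: Did you finish the slides for tomorrow?\n"
    "Maria: Almost, I just need to add the revenue chart.\n"
    "John: Great, send them over tonight so I can review."
)

print(summarizer(dialogue, max_length=60)[0]["summary_text"])
```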

2. PEFT (Parameter-Efficient Fine-Tuning)

Training LLMs is computationally intensive, and full fine-tuning is especially expensive because it can change every weight in the model. We start with a pretrained LLM such as GPT-3, which already has a vast amount of knowledge and understanding of language. We then provide a task-specific dataset, which could be data for question answering, sentiment analysis, or any other custom dataset. During training, the full fine-tuning process makes small adjustments to every weight in the pretrained model. Beyond the model weights themselves, other components of training also consume memory: optimizer states, gradients, forward activations, and temporary buffers. These additional components add substantially to the training cost.

Three main approaches are used in PEFT: selective, reparameterization, and additive.

1. Selective

Here, we select a subset of initial LLM parameters to fine-tune.

2. Reparameterization

We reparameterize model weights using a low-rank representation. We will discuss LoRA in detail below.

LoRA: Low-Rank Adaptation

Each layer in a transformer architecture has multiple weight matrices for different operations, such as self-attention or the feed-forward network, and these matrices can have different sizes depending on the specific layer and configuration. Let us take an example by picking a matrix of size 512 x 64, i.e., 32,768 parameters, and apply LoRA with rank = 8.

  • Original Weight Matrix: Dimensions: 512 x 64, Parameters: 32,768 (512 x 64)
  • Matrix A (Rank Decomposition): Dimensions: 8 x 64 (rank x original dimension), Parameters: 512 (8 x 64)
  • Matrix B (Rank Decomposition): Dimensions: 512 x 8 (original dimension x rank), Parameters: 4,096 (512 x 8)
  • Total LoRA Parameters: 512 (A) + 4,096 (B) = 4,608

Approximation:

The original weight matrix (W) is approximated by the product of B and A, whose shapes multiply back to 512 x 64:

W ≈ B · A

Reasoning Behind the Dimensions:

  • The dimensions of A and B are chosen to capture the essence of the original weight matrix (W) with fewer parameters.
  • The rank (here, 8) controls the trade-off between efficiency and accuracy. A lower rank leads to fewer parameters but might result in a slightly less accurate approximation.
  • We can also create task-specific decomposition matrices.

In the example we discussed, LoRA achieves a reduction of approximately 86% in the number of trainable parameters needed for fine-tuning. Here's the summary, followed by a short code sketch.

  • Original Weight Matrix: 32,768 parameters (512 x 64)
  • Total LoRA Parameters: 4,608 parameters (512 + 4,096)
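A minimal NumPy sketch of the arithmetic above; the shapes follow the example, and in LoRA the low-rank product is the trainable update added to the frozen pretrained weights:

```python
import numpy as np

d_out, d_in, rank = 512, 64, 8

W = np.random.randn(d_out, d_in)        # frozen pretrained weights: 512 x 64 = 32,768 params
B = np.zeros((d_out, rank))             # trainable: 512 x 8 = 4,096 params
A = np.random.randn(rank, d_in) * 0.01  # trainable: 8 x 64 = 512 params

delta_W = B @ A          # low-rank update with the same 512 x 64 shape as W
W_adapted = W + delta_W  # effective weights used at inference time

trainable = A.size + B.size               # 4,608
print(trainable, 1 - trainable / W.size)  # ~86% fewer trainable parameters
```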

3. Additive

We add trainable layers or parameters to the model in the form of adapter modules.

The two main additive approaches are:

  • Adapter Modules: These are small, trainable neural network modules strategically inserted into specific layers of the pre-trained LLM. They help the LLM learn task-specific information without drastically changing its underlying knowledge.
  • Prompt Tuning: Rather than inserting new modules inside the model, this approach prepends a small set of trainable "soft prompt" embeddings to the input. Only these prompt parameters are learned during fine-tuning, and they guide the frozen pre-trained LLM toward the desired task.

All these approaches are similar to transfer learning, but they are more efficient in that they fine-tune only a small subset of parameters rather than complete layers; even adapter modules are lightweight.

PEFT is particularly beneficial when dealing with large LLMs that have billions or even trillions of parameters, as fine-tuning all of them can be computationally expensive and resource-intensive.

PEFT is also less prone to the catastrophic forgetting problems of full fine-tuning, and it avoids producing a complete new copy of the model for every task you train on, as full fine-tuning does.
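In practice, libraries such as Hugging Face peft wrap these ideas. Below is a minimal sketch that applies LoRA (rank 8, as in the example above) to a FLAN-T5 base model; the hyperparameters and target modules are illustrative choices, not recommendations:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the decomposition matrices
    lora_alpha=32,              # scaling factor applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections to adapt in T5-style models
)

# The base weights stay frozen; only the small LoRA matrices are trained.
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```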

Metrics to assess the performance

With language models, evaluation is more challenging since the output is non-deterministic.

Let us explore some of the metrics that we can use to evaluate.

ROUGE-1: (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE-1 is a recall-oriented metric, which means it prioritizes identifying how many of the important words from the reference summary are included in the generated summary. ROUGE-1 focuses on individual words (unigrams); similarly, ROUGE-2 focuses on bigrams, and so on.

Let us walk through a ROUGE-1 example step by step:

  1. Reference Text: "Mike really loves drinking tea."
  2. Generated Text: "Mike adores sipping tea."

Step 1: Identify Unigrams

  • Reference Text Unigrams: {Mike, really, loves, drinking, tea}
  • Generated Text Unigrams: {Mike, adores, sipping, tea}

Step 2: Count Overlapping Unigrams

  • Overlapping Unigrams: {Mike, tea}
  • Number of Overlapping Unigrams: 2

ROUGE-1 Recall

Figure4 by Author: ROUGE-1 Recall

ROUGE-1 Precision

Figure 5 by Author: ROUGE-1 Precision

ROUGE-1 F1 Score

Figure6 by Author: ROUGE-1 F1 Score
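The figures above give the formulas; the same arithmetic for this example can be written out in plain Python:

```python
from collections import Counter

reference = "Mike really loves drinking tea.".lower().rstrip(".").split()
generated = "Mike adores sipping tea.".lower().rstrip(".").split()

# Unigram overlap, capping each word by its count in the reference: {mike, tea} -> 2
ref_counts, gen_counts = Counter(reference), Counter(generated)
overlap = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())

recall = overlap / len(reference)                   # 2 / 5 = 0.4
precision = overlap / len(generated)                # 2 / 4 = 0.5
f1 = 2 * precision * recall / (precision + recall)  # ~0.44

print(recall, precision, f1)
```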

ROUGE-L:

ROUGE-L is a metric used to evaluate the quality of text by measuring the longest common subsequence (LCS) between a generated text and a reference text. The LCS takes the order of words into account, making it more sensitive to the overall structure of the text than simple n-gram overlap.

Let us walk through an example step by step:

  1. Reference Text: "It is cold outside" (which contains the two common subsequences "It is" and "cold outside")
  2. Generated Text: "It is very cold outside" (which contains the same two subsequences)

ROUGE-L Recall = LCS(Gen, Ref) / (unigrams in reference) = 2/4 = 0.5

ROUGE-L Precision = LCS(Gen, Ref) / (unigrams in generated) = 2/5 = 0.4

ROUGE-L F1 = 2 · (Precision · Recall) / (Precision + Recall) = 2 · (0.2/0.9) = 0.44

ROUGE Clipping

ROUGE sometimes gives misleading results. Let us explore this:

Example 1: Repetitive Generated Text

  • Reference (human): "The sun is shining brightly."
  • Generated output: "shining shining shining shining"

Without clipping:

  • Unigram Matches: "shining" (matches four times)
  • ROUGE-1 Precision: 4/4 = 1.0

This perfect score is misleading because the generated text is repetitive and lacks meaningful content.

With clipping:

  • Clipped Unigram Matches: "shining" (matches only once, as in the reference)
  • Modified Precision: 1/4

Clipping provides a more accurate reflection of the generated text's quality.

Example 2: Reordered Generated Text

  • Reference (human): "The sun is shining brightly."
  • Generated output: "brightly the sun is shining"

With clipping:

  • Clipped Unigram Matches: "The", "sun", "is", "shining", "brightly" (each matches exactly as in the reference)
  • Modified Precision: 5/5 = 1.0

Despite the different word order, clipping correctly identifies that the generated text includes all relevant unigrams in the correct frequency, giving it a perfect score. This could also be misleading.

To sum up, ROUGE clipping improves evaluation accuracy by limiting unigram matches to the counts present in the reference text, preventing artificially inflated scores from repeated words; however, because unigram matching ignores word order, reordered text can still receive a perfect score.
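A small sketch of clipped unigram precision for the two examples above, using the same plain-Python counting idea:

```python
from collections import Counter

def clipped_precision(reference: str, generated: str) -> float:
    ref = Counter(reference.lower().rstrip(".").split())
    gen = Counter(generated.lower().rstrip(".").split())
    # Each generated word counts only up to the number of times it appears in the reference.
    clipped = sum(min(count, ref[word]) for word, count in gen.items())
    return clipped / sum(gen.values())

print(clipped_precision("The sun is shining brightly.", "shining shining shining shining"))  # 1/4 = 0.25
print(clipped_precision("The sun is shining brightly.", "brightly the sun is shining"))      # 5/5 = 1.0
```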

BLEU

BLEU primarily focuses on n-gram precision, which means it counts how often sequences of words (n-grams) in the machine-generated translation match those in the reference translations. It considers 1-grams (single words), 2-grams (word pairs), and so on; you can think of it as an average precision across a range of n-gram sizes.
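A minimal sketch using NLTK's sentence-level BLEU; the sentences are made up, and smoothing is applied because short sentences often have zero higher-order n-gram matches:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "sitting", "on", "the", "mat"]]  # list of tokenized reference translations
candidate = ["the", "cat", "sits", "on", "the", "mat"]             # tokenized machine translation

# BLEU combines the clipped precision of 1- to 4-grams (equal weights here).
score = sentence_bleu(
    reference,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(score)
```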

Other Metrics and Benchmarks

There are several other important metrics and benchmarks used for evaluation beyond ROUGE and BLEU. One notable example is HELM (Holistic Evaluation of Language Models).

One important feature of HELM is that it assesses models on metrics beyond basic accuracy measures, such as precision and F1 score. The benchmark also includes metrics for fairness, bias, and toxicity, which are becoming increasingly important as LLMs become more capable of human-like language generation and, in turn, of exhibiting potentially harmful behavior. HELM is a living benchmark that aims to continuously evolve with the addition of new scenarios, metrics, and models.

Conclusion

In this post, we looked at important aspects of fine-tuning a large language model. We started with zero-shot, one-shot, and few-shot prompting to see whether the model already generates the correct output. If it does not, we need to fine-tune it: we pick a relevant model based on our task requirements and fine-tune it on example prompts along with their labels. We also saw how fine-tuning on a single task can lead to catastrophic forgetting, and that the way to avoid it is to fine-tune on multiple tasks so that the model generalizes well. In addition, we can use parameter-efficient fine-tuning, where we discussed three techniques that also avoid the computational cost of full fine-tuning; techniques like LoRA are particularly beneficial. We then moved on to evaluating the model, where we studied important metrics such as ROUGE and BLEU, along with other available benchmarks.

References

[1] https://cobusgreyling.medium.com/catastrophic-forgetting-in-llms-bf345760e6e2

[2] https://arxiv.org/html/2401.05605v1

[3] https://www.linkedin.com/pulse/catastrophic-forgetting-side-effect-fine-tuning-large-karan-sehgal-jjkqe/

[4] https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb

[5] https://www.deeplearning.ai/courses/generative-ai-with-llms/


Published via Towards AI
