

Fine-Tuning and Evaluating Large Language Models: Key Benchmarks and Metrics

Last Updated on July 20, 2024 by Editorial Team

Author(s): Saif Ali Kheraj

Originally published on Towards AI.

Figure 1: Generative AI Project Lifecycle by Author (Referred from deeplearning.ai)

In generative AI, we must first define the problem statement and then select the model that best fits the specific task at hand. For example, we can use the FLAN-T5 model to summarize dialogues, or choose any other model. We then try one-shot, two-shot, and few-shot prompting to see how the model performs. If it does not produce the desired results, we may need to fine-tune it, and after that we evaluate it. In this post, we will go into greater detail about fine-tuning and evaluating the model.

In-context learning is a process in which you try zero-shot, one-shot, or few-shot prompting, providing examples to the LLM inside the prompt so that the model can generalize to a new, unseen prompt. It has limitations in certain cases and does not work well for smaller models.
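As a rough illustration, here is what zero-shot and one-shot prompts for dialogue summarization could look like; the wording of the instruction and the example dialogues are made up for this sketch:

```python
dialogue = "Tom: Are we still on for dinner? Anna: Yes, 7 pm at the usual place."

# Zero-shot: only the instruction and the input, no examples.
zero_shot_prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"

# One-shot: prepend one worked example so the model can imitate the pattern;
# few-shot prompts simply repeat this with more examples.
example_dialogue = "Sam: Can you send the report? Lee: Sure, I'll email it tonight."
example_summary = "Lee will email Sam the report tonight."
one_shot_prompt = (
    f"Summarize the following conversation.\n\n{example_dialogue}\n\n"
    f"Summary: {example_summary}\n\n"
    f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
)
```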

LLM Finetuning

Figure 2 by Author: Prompt-Completion Pairs

Fine-tuning is a supervised learning process that uses a dataset of labeled examples to update the LLM's weights. The labeled examples are prompt-completion pairs, as illustrated in the diagram above. The fine-tuning process extends the model's training to improve its ability to generate high-quality completions for a specific task.

For example, if we want to fine-tune the model to improve its sentiment analysis capability, we would build up a dataset of examples that begin with the instruction "Classify", such as:

  • Classify the following sentence into positive or negative: Text: {input_text} Sentiment: {expected_sentiment}

We can use many such example prompts as our training dataset; each includes the instruction to classify the text along with the associated label.

For translation:

  • Translate this sentence to Spanish: English: {input_sentence} Spanish: {expected_translation}

To summarize what we have said:

  1. Use Pretrained Model: A model already trained on a large, general dataset.
  2. Task-Specific Examples: Prompt-completion pairs specific to the desired task.
  3. Prepared Instruction Dataset Split: We divide the dataset into training, validation, and test sets (sketched in code below).
  4. Fine-Tuning Process: We fine-tune the model using the training and validation datasets and then evaluate performance on the test set using cross-entropy loss.
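A minimal sketch of steps 2 and 3, assuming we build the dataset by hand as a list of dictionaries; the example texts, field names, and split ratios are illustrative:

```python
import random

# Step 2: task-specific prompt-completion pairs (in practice, hundreds of examples or more).
raw_examples = [
    ("The battery life is fantastic.", "positive"),
    ("The screen cracked after one day.", "negative"),
    ("Customer support resolved my issue quickly.", "positive"),
    ("The app crashes every time I open it.", "negative"),
]

dataset = [
    {
        "prompt": f"Classify the following sentence into positive or negative:\nText: {text}\nSentiment:",
        "completion": f" {label}",
    }
    for text, label in raw_examples
]

# Step 3: shuffle and split into training, validation, and test sets
# (80/10/10 here; a real dataset would be much larger).
random.seed(42)
random.shuffle(dataset)
n = len(dataset)
train = dataset[: int(0.8 * n)]
validation = dataset[int(0.8 * n) : int(0.9 * n)]
test = dataset[int(0.9 * n) :]
```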

Surprisingly, good results can be obtained with relatively few examples: in comparison to the billions of pieces of text the model saw during pre-training, 500–1,000 examples can consistently produce good results.

Drawbacks of finetuning on a single task:

  • Catastrophic forgetting happens because the full fine-tuning process modifies the weights of the original LLM. While this leads to great performance on a single fine-tuning task, it can degrade performance on other tasks.

How to Avoid Catastrophic Forgetting?

1. Multi-Task Fine-Tuning

Figure3 by Author: Multitask fine-tuning

Catastrophic forgetting can be avoided by providing a variety of examples to the model. For example, we can provide summarization prompts, translation prompts, and rating prompts. This requires numerous examples of each instruction with its completion. The 'instruct' version of a model is fine-tuned in this way so that it can follow prompted instructions.

One example is the FLAN family of models. FLAN (fine-tuned language net) refers to a specific set of instruction datasets used to fine-tune various models; FLAN-T5, for example, is the T5 model fine-tuned on the FLAN instructions. SAMSum is one of the datasets used in FLAN-T5's instruction tuning, and there are pre-trained FLAN-T5 models that have been further fine-tuned on SAMSum, including philschmid/flan-t5-base-samsum and jasonmcaffee/flan-t5-large-samsum on Hugging Face. If we want to fine-tune the FLAN-T5 model specifically for formal dialogue conversations, we can do so using the DialogSum dataset.

Models fine-tuned on DialogSum can be applied to areas like customer support, meeting minutes generation, chatbot summarization, and more.
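As a quick sketch, one of the SAMSum-tuned checkpoints mentioned above can be loaded with the Hugging Face transformers pipeline; the dialogue below is made up, and the generation settings are illustrative:

```python
from transformers import pipeline

# A FLAN-T5 checkpoint that has already been fine-tuned on the SAMSum dialogue dataset.
summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum")

dialogue = (
    "John: Did you finish the slides for tomorrow?\n"
    "Maria: Almost, I just need to add the revenue chart.\n"
    "John: Great, send them over tonight so I can review."
)

print(summarizer(dialogue, max_length=60)[0]["summary_text"])
```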

2. PEFT (Parameter-Efficient Fine-Tuning)

Training LLMs is computationally intensive, and full fine-tuning is especially expensive because it can change every weight in the model. We start with a pretrained LLM such as GPT-3, which already has a vast amount of knowledge and understanding of language. We then provide a task-specific dataset, which could be data for question answering, sentiment analysis, or any other custom dataset. During training, the full fine-tuning process makes small adjustments to every weight in the pretrained model. Beyond the model weights themselves, other components of training also consume memory: optimizer states, gradients, forward activations, and temporary buffers. These additional components add substantially to the training cost.

Three main approaches are used in PEFT: selective, reparameterization, and additive.

1. Selective

Here, we select a subset of initial LLM parameters to fine-tune.

2. Reparameterization

We reparameterize model weights using a low-rank representation. We will discuss LoRA in detail below.

LoRA: Low-Rank Adaptation

Each layer in a transformer architecture has multiple weight matrices for different operations, such as self-attention or the feed-forward network, and these matrices can have different sizes depending on the specific layer and configuration. Let us take an example by picking a matrix of size 512 x 64, i.e., 32,768 parameters, and apply LoRA with rank = 8.

  • Original Weight Matrix: Dimensions: 512 x 64, Parameters: 32,768 (512 x 64)
  • Matrix A (Rank Decomposition): Dimensions: 8 x 64 (rank x original dimension), Parameters: 512 (8 x 64)
  • Matrix B (Rank Decomposition): Dimensions: 512 x 8 (original dimension x rank), Parameters: 4,096 (512 x 8)
  • Total LoRA Parameters: 512 (A) + 4,096 (B) = 4,608

Approximation:

The original weight matrix (W) is approximated by the product of B and A, whose shapes multiply back to 512 x 64:

W ≈ B · A

Reasoning Behind the Dimensions:

  • The dimensions of A and B are chosen to capture the essence of the original weight matrix (W) with fewer parameters.
  • The rank (here, 8) controls the trade-off between efficiency and accuracy. A lower rank leads to fewer parameters but might result in a slightly less accurate approximation.
  • We can also create task-specific decomposition matrices.

In the example we discussed, LoRA achieves a reduction of approximately 86% in the number of trainable parameters needed for fine-tuning. Here's the summary, followed by a short code sketch.

  • Original Weight Matrix: 32,768 parameters (512 x 64)
  • Total LoRA Parameters: 4,608 parameters (512 + 4,096)
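A minimal NumPy sketch of the arithmetic above; the shapes follow the example, and in LoRA the low-rank product is the trainable update added to the frozen pretrained weights:

```python
import numpy as np

d_out, d_in, rank = 512, 64, 8

W = np.random.randn(d_out, d_in)        # frozen pretrained weights: 512 x 64 = 32,768 params
B = np.zeros((d_out, rank))             # trainable: 512 x 8 = 4,096 params
A = np.random.randn(rank, d_in) * 0.01  # trainable: 8 x 64 = 512 params

delta_W = B @ A          # low-rank update with the same 512 x 64 shape as W
W_adapted = W + delta_W  # effective weights used at inference time

trainable = A.size + B.size               # 4,608
print(trainable, 1 - trainable / W.size)  # ~86% fewer trainable parameters
```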

3. Additive

We add trainable layers or parameters to the model in the form of adapter modules.

The two main additive approaches are:

  • Adapter Modules: These are small, trainable neural network modules strategically inserted into specific layers of the pre-trained LLM. They help the LLM learn task-specific information without drastically changing its underlying knowledge.
  • Prompt Tuning: Rather than inserting new modules inside the model, this approach prepends a small set of trainable "soft prompt" embeddings to the input. Only these prompt parameters are learned during fine-tuning, and they guide the frozen pre-trained LLM toward the desired task.

All these approaches are similar to transfer learning, but they are more efficient in that they fine-tune only a small subset of parameters rather than complete layers; even adapter modules are lightweight.

PEFT is particularly beneficial when dealing with large LLMs that have billions or even trillions of parameters, as fine-tuning all of them can be computationally expensive and resource-intensive.

PEFT is also less prone to the catastrophic forgetting problems of full fine-tuning, and it avoids producing a complete new copy of the model for every task you train on, as full fine-tuning does.
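In practice, libraries such as Hugging Face peft wrap these ideas. Below is a minimal sketch that applies LoRA (rank 8, as in the example above) to a FLAN-T5 base model; the hyperparameters and target modules are illustrative choices, not recommendations:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the decomposition matrices
    lora_alpha=32,              # scaling factor applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections to adapt in T5-style models
)

# The base weights stay frozen; only the small LoRA matrices are trained.
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```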

Metrics to assess the performance

With language models, evaluation is more challenging since the output is non-deterministic.

Let us explore some of the metrics that we can use to evaluate.

ROUGE-1: (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE-1 is a recall-oriented metric, which means it prioritizes identifying how many of the important words from the reference summary are included in the generated summary. ROUGE-1 focuses on individual words (unigrams); similarly, ROUGE-2 focuses on bigrams, and so on.

Let us walk through a ROUGE-1 example step by step:

  1. Reference Text: "Mike really loves drinking tea."
  2. Generated Text: "Mike adores sipping tea."

Step 1: Identify Unigrams

  • Reference Text Unigrams: {Mike, really, loves, drinking, tea}
  • Generated Text Unigrams: {Mike, adores, sipping, tea}

Step 2: Count Overlapping Unigrams

  • Overlapping Unigrams: {Mike, tea}
  • Number of Overlapping Unigrams: 2

ROUGE-1 Recall

Figure4 by Author: ROUGE-1 Recall

ROUGE-1 Precision

Figure 5 by Author: ROUGE-1 Precision

ROUGE-1 F1 Score

Figure6 by Author: ROUGE-1 F1 Score
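The figures above give the formulas; the same arithmetic for this example can be written out in plain Python:

```python
from collections import Counter

reference = "Mike really loves drinking tea.".lower().rstrip(".").split()
generated = "Mike adores sipping tea.".lower().rstrip(".").split()

# Unigram overlap, capping each word by its count in the reference: {mike, tea} -> 2
ref_counts, gen_counts = Counter(reference), Counter(generated)
overlap = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())

recall = overlap / len(reference)                   # 2 / 5 = 0.4
precision = overlap / len(generated)                # 2 / 4 = 0.5
f1 = 2 * precision * recall / (precision + recall)  # ~0.44

print(recall, precision, f1)
```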

ROUGE-L:

ROUGE-L is a metric used to evaluate the quality of text by measuring the longest common subsequence (LCS) between a generated text and a reference text. The LCS takes the order of words into account, making it more sensitive to the overall structure of the text than simple n-gram overlap.

Let us walk through an example step by step:

  1. Reference Text: "It is cold outside" (which contains the two common subsequences "It is" and "cold outside")
  2. Generated Text: "It is very cold outside" (which contains the same two subsequences)

ROUGE-L Recall = LCS(Gen, Ref) / (unigrams in reference) = 2/4 = 0.5

ROUGE-L Precision = LCS(Gen, Ref) / (unigrams in generated) = 2/5 = 0.4

ROUGE-L F1 = 2 · (Precision · Recall) / (Precision + Recall) = 2 · (0.2/0.9) = 0.44

ROUGE Clipping

ROUGE sometimes gives misleading results. Let us explore this:

Example 1: Repetitive Generated Text

  • Reference (human): "The sun is shining brightly."
  • Generated output: "shining shining shining shining"

Without clipping:

  • Unigram Matches: "shining" (matches four times)
  • ROUGE-1 Precision: 4/4 = 1.0

This perfect score is misleading because the generated text is repetitive and lacks meaningful content.

With clipping:

  • Clipped Unigram Matches: "shining" (matches only once, as in the reference)
  • Modified Precision: 1/4

Clipping provides a more accurate reflection of the generated text's quality.

Example 2: Reordered Generated Text

  • Reference (human): "The sun is shining brightly."
  • Generated output: "brightly the sun is shining"

With clipping:

  • Clipped Unigram Matches: "The", "sun", "is", "shining", "brightly" (each matches exactly as in the reference)
  • Modified Precision: 5/5 = 1.0

Despite the different word order, clipping correctly identifies that the generated text includes all relevant unigrams in the correct frequency, giving it a perfect score. This could also be misleading.

To sum up, ROUGE clipping improves evaluation accuracy by limiting unigram matches to the counts present in the reference text, preventing artificially inflated scores from repeated words; however, because unigram matching ignores word order, reordered text can still receive a perfect score.
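A small sketch of clipped unigram precision for the two examples above, using the same plain-Python counting idea:

```python
from collections import Counter

def clipped_precision(reference: str, generated: str) -> float:
    ref = Counter(reference.lower().rstrip(".").split())
    gen = Counter(generated.lower().rstrip(".").split())
    # Each generated word counts only up to the number of times it appears in the reference.
    clipped = sum(min(count, ref[word]) for word, count in gen.items())
    return clipped / sum(gen.values())

print(clipped_precision("The sun is shining brightly.", "shining shining shining shining"))  # 1/4 = 0.25
print(clipped_precision("The sun is shining brightly.", "brightly the sun is shining"))      # 5/5 = 1.0
```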

BLEU

BLEU primarily focuses on n-gram precision, which means it counts how often sequences of words (n-grams) in the machine-generated translation match those in the reference translations. It considers 1-grams (single words), 2-grams (word pairs), and so on; you can think of it as an average precision across a range of n-gram sizes.
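A minimal sketch using NLTK's sentence-level BLEU; the sentences are made up, and smoothing is applied because short sentences often have zero higher-order n-gram matches:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "sitting", "on", "the", "mat"]]  # list of tokenized reference translations
candidate = ["the", "cat", "sits", "on", "the", "mat"]             # tokenized machine translation

# BLEU combines the clipped precision of 1- to 4-grams (equal weights here).
score = sentence_bleu(
    reference,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(score)
```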

Other Metrics and Benchmarks

There are several other important metrics and benchmarks used for evaluation beyond ROUGE and BLEU. One notable example is HELM (Holistic Evaluation of Language Models).

One important feature of HELM is that it assesses models on metrics beyond basic accuracy measures, such as precision and F1 score. The benchmark also includes metrics for fairness, bias, and toxicity, which are becoming increasingly important as LLMs become more capable of human-like language generation and, in turn, of exhibiting potentially harmful behavior. HELM is a living benchmark that aims to continuously evolve with the addition of new scenarios, metrics, and models.

Conclusion

In this post, we looked at important aspects of fine-tuning a large language model. We started with zero-shot, one-shot, and few-shot prompting to see whether the model already generates the correct output. If it does not, we need to fine-tune it: we pick a relevant model based on our task requirements and fine-tune it on example prompts along with their labels. We also saw how fine-tuning on a single task can lead to catastrophic forgetting, and that the way to avoid it is to fine-tune on multiple tasks so that the model generalizes well. In addition, we can use parameter-efficient fine-tuning, where we discussed three techniques that also avoid the computational cost of full fine-tuning; techniques like LoRA are particularly beneficial. We then moved on to evaluating the model, where we studied important metrics such as ROUGE and BLEU, along with other available benchmarks.

References

[1] https://cobusgreyling.medium.com/catastrophic-forgetting-in-llms-bf345760e6e2

[2] https://arxiv.org/html/2401.05605v1

[3] https://www.linkedin.com/pulse/catastrophic-forgetting-side-effect-fine-tuning-large-karan-sehgal-jjkqe/

[4] https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb

[5] https://www.deeplearning.ai/courses/generative-ai-with-llms/


Published via Towards AI
