LLM Finetuning Strategies
Last Updated on September 27, 2024 by Editorial Team
Author(s): Raghunaathan
Originally published on Towards AI.
Large language models, trained on extensive datasets thanks to organizations like Common Crawl, are capable of performing a multitude of tasks with zero-shot to few-shot prompting. With the rise of retrieval-augmented generation (RAG) approaches, these generalist models are increasingly employed by organizations for various applications, ranging from simple chatbots to more complex agentic automations (personas). Although techniques like GraphRAG have been developed to extract relationships across documents based on entities, they may not fully address every domain-specific need due to the lack of substantial context in the base model. This limitation has led to a continuous stream of new models being released each month.
For these domain-specific models, it is possible to take an existing LLM architecture and adapt its weights to learn context specific to a particular domain, a process known as fine-tuning. In this article, we will explore the fine-tuning process for language models, examining the various types, key considerations involved, and an example of an (almost) no-code open-source tool. Let's dive in and uncover the intricacies of fine-tuning language models!
Fine-Tuning: A Simple Analogy
To understand fine-tuning, let's use an analogy. Imagine you are a student preparing for a science exam. You begin with a strong foundation from your classes. As the exam approaches, you focus on the specific topics that will be tested. You solve practice questions to assess your understanding and then review the material based on your performance on these questions. You might also seek guidance from friends, consult online resources, or revisit key topics.
This process mirrors fine-tuning: we take a pre-trained model (the student with a solid foundation), direct it toward specific tasks (revising particular topics), evaluate its performance (through practice tests), and iterate on this process until we achieve optimal results (based on performance metrics). Just as a student can become proficient in specific areas, we can develop a language model that excels at tasks in a single domain or across multiple domains. Ultimately, the effectiveness of this process depends on the chosen model, the specific tasks, and the quality of the training data used.
Common Fine-Tuning Use-Cases
Before we delve into fine-tuning, let's examine why it's necessary by analyzing a few scenarios.
Language Learning
Let us compare two versions of Llama tasked with answering in Tamil.
As illustrated in the example above, the base variant of Llama struggles to understand the requested language, while the finetuned model is able to respond fluently in that language. This capability arises from the fine-tuning process, which allows the model to learn and recognize patterns in the new language. In contrast, simple Retrieval-Augmented Generation (RAG) applications are limited, as they do not effectively connect new contexts with existing knowledge. Fine-tuning becomes essential in scenarios where a model must acquire and integrate diverse contexts.
Safeguarding LLMs Effectively
One significant challenge in AI development is establishing effective guardrails for models. Consider a tax assistant AI that unexpectedly begins answering questions about mental health. While it's impressive that the AI can handle diverse topics, this can also be risky. Not all models are trained on appropriate data, particularly in sensitive areas like mental health.
Even if we instruct the model not to address certain questions, two major issues arise: prompt hacking and context windows. Prompt hacking occurs when users manipulate the input to bypass restrictions. Additionally, while larger models, such as Llama 3.1 with its 128k context window, provide more space for instructions, this does not fully resolve the issue. Although the context window can accommodate more information, if too many tokens are used for setting context, it diminishes the space available for actual content.
Effective prompt templates can help, but they cannot account for every possible nuance. While a large context window is beneficial, it is not a comprehensive solution, making fine-tuning a more reliable option. Even major players like Meta have introduced LlamaGuard, a finetuned version of the Llama model designed to enforce chat guardrails and prevent harmful responses.
AI Personas
News media often cover the same stories, but each outlet presents them from a unique perspective. Imagine a chat assistant designed to help write articles by gathering information from various sources. If your organization utilizes a pre-trained model like ChatGPT, effective prompts (both user and system instructions) can generate useful news snippets. However, these snippets may not always align with your organization's specific style or guidelines.
To ensure consistency and adherence to your organization's tone and standards, you can finetune the model using news articles written by your own team. This approach creates a more personalized AI that accurately reflects your organization's voice and can be relied upon for consistent and precise content. Additionally, several startups are now focusing on developing enterprise-level AI personas to streamline manual activities.
Smarter, Smaller Models
You don't always need a massive model to achieve great results. A smaller model, with a billion (or even a few million) parameters, can often be more efficient and cost-effective for your specific needs compared to very large language models. This approach significantly reduces the costs associated with running and maintaining these models.
In this article, we will explore a technique called Parameter-Efficient Fine-Tuning (PEFT). This method employs matrix decomposition to represent a large model in a smaller, more manageable form. This means you don't need to utilize all of the model's parameters to achieve your objectives, though there may be minor trade-offs in performance. As a result, you can work with powerful models on consumer hardware without incurring excessive costs. Thus, very large models are not always necessary.
Before fine-tuning, consider these factors:
- Sufficient Data: Do you have enough data to effectively train the model?
- Hardware Availability: Is the necessary hardware available to train and run the model?
- RAG Strategies: Can your problem be solved using RAG strategies with existing LLM APIs?
- Time to Market: How quickly do you need the service to be operational?
- Existing APIs: Could combining APIs from different service providers solve your problem as a unified product? These hosted models are of high quality and are continuously updated to meet the latest standards; a thorough exploration can help you identify the ones most suitable for your specific needs.
Fine-Tuning Process
Now that we understand what fine-tuning is and its applications, letβs explore the different types and how each one functions. There are three popular approaches for fine-tuning large language models (LLMs) based on the learning methodologies:
- Supervised Learning: In this approach, a model acquires new concepts by training on input-output pairs. Techniques such as instruction fine-tuning exemplify this method, where we teach the model to provide precise responses to specific instructions.
Imagine a classroom where a student is learning to write essays. Initially, the student writes essays on various topics, but their work isnβt perfect. A teacher reviews the essays, provides detailed feedback, and suggests improvements. Over time, the student revises their essays based on this feedback and becomes a better writer.
In supervised fine-tuning for large language models (LLMs), the model begins with general knowledge and is then "taught" through a similar process: it is trained on specific examples with correct outputs and feedback to enhance its performance on particular tasks, much like the student refining their writing skills.
- Self-Supervised Learning: This powerful methodology is employed in language model tuning to help the model understand the nuances of data for language modeling. It leverages the inherent structure of the data to generate supervisory signals, eliminating the need for manually labeled data.
Imagine a classroom where students receive incomplete or scrambled notes and must deduce the missing pieces themselves. This scenario parallels self-supervised learning in AI. Just as students use context and their own knowledge to fill in the gaps, a self-supervised model learns by predicting hidden parts of data and refining its understanding through these predictions. It represents a way of learning from the data itself, without requiring explicit labels or direct answers. Some popular strategies are masked language modeling (BERT), autoregressive language modeling (GPT), contrastive learning (SimCLR), next sentence prediction (BERT), and permutation language modeling (XLNet), among many others, and they are often used in combination.
- Reinforcement Learning: Reinforcement learning (RL) for language models involves training them to produce better responses through a reward system that evaluates their outputs. The model generates responses based on prompts and receives positive rewards for high-quality answers and penalties for poor ones. Through this feedback, the model adjusts its parameters to enhance its performance over time. The template used to maximize rewards (ideally) is called a policy, and various policy optimization strategies help RL achieve results closer to human outputs.
Imagine a student in a classroom learning to solve math problems. Every time the student correctly solves a problem, the teacher gives them a gold star as a reward, encouraging further effort and improvement. Occasionally, the teacher also provides constructive feedback on mistakes, helping the student adjust their approach. Over time, the student learns which strategies yield more gold stars and fewer mistakes.
In this analogy, the gold stars and feedback represent rewards and penalties in reinforcement learning, guiding the student to learn and optimize their problem-solving skills through trial and error. The severity of the penalties can vary depending on the teacher's policy. The goal is to identify the right strategy that achieves the best results.
Next, we need to decide between vertical and horizontal fine-tuning strategies based on the task at hand.
Horizontal fine-tuning involves adapting a model to perform well across a range of similar tasks or domains. The model is finetuned on data that spans multiple related areas without specializing in any single one. A notable strength of this approach is its ability to handle various tasks while retaining the base modelβs generalist nature.
Vertical fine-tuning, on the other hand, focuses on adapting a model to excel in a specific task or domain. The model is finetuned using highly specialized or domain-specific data, enabling it to better understand and generate responses that are relevant to the particular area it has been fine-tuned for, resulting in highly accurate outcomes.
We also need to determine the number of parameters necessary for fine-tuning the model. The primary concern will be the computational resources required for both tuning and inference, as costs can escalate quickly. Additionally, we should consider the specific task and the type of data we want the base model to adapt to. There are three different strategies involved: full parameter retraining, Parameter-Efficient Fine-Tuning (PEFT), and transfer learning. In this discussion, we will focus on various PEFT strategies for fine-tuning the model on consumer GPUs.
These are just the key considerations; there are also further choices, such as deciding between a top-down or bottom-up approach and whether to train individual layers or batch them together. The key considerations, along with the selection of an appropriate fine-tuning module, give us the right infrastructure baseline to work with. Since fine-tuning is an ocean of its own, let us focus on the most commonly referenced subset for the implementation part: parameter-efficient vertical fine-tuning. You can refer to the previous article to get an idea about the various quantized model formats seen on HuggingFace.
PEFT for Fine-Tuning
Parameter-efficient fine-tuning (PEFT) is a technique that capitalizes on the idea that not all parameters in a large language model need to be updated to achieve optimal performance. By freezing most parameters and focusing on a smaller subset, we can significantly reduce the computational resources and time required for fine-tuning.
Imagine a classroom where a student excels in many subjects but needs improvement in just a few specific areas. Instead of overhauling the entire curriculum, the teacher provides targeted extra practice in those areas. This approach is efficient because it builds on the student's existing knowledge, concentrating resources where they are most needed. Similarly, in PEFT we focus only on optimizing the most impactful weights.
By freezing most parameters, continuing to use residual connections, and applying appropriate regularization techniques, these models retain their prior knowledge, thereby avoiding catastrophic forgetting. Methods like GaLore have made it possible to finetune large models, such as Llama-3, on personal computers, making advanced language modeling more accessible.
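As a rough sketch of this freezing idea on a generic PyTorch model (the layer sizes and the stand-in adapter below are illustrative assumptions, not any particular LLM):

import torch.nn as nn

# Minimal sketch: freeze every base parameter so only a small module is trainable.
base_model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
adapter = nn.Linear(768, 64)  # stand-in for a small trainable module

for param in base_model.parameters():
    param.requires_grad = False  # frozen: receives no gradient updates

trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in base_model.parameters())
print(f"Trainable: {trainable:,} | Frozen: {frozen:,} "
      f"({100 * trainable / (trainable + frozen):.2f}% trainable)")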
Let us explore a few PEFT techniques, noting that these are not mutually exclusive.
Techniques for Parameter Efficient Fine-tuning
PEFT methods are classified under three broad categories: addition-based (adding new trainable parameters), selection-based (selecting a subset of parameters from the base model), and reparametrization-based (working with an alternate representation). In this article we will only be looking at sample code snippets for each approach. You can find all the HuggingFace Transformers implementations of PEFT in their official documentation.
Adapters fall under the addition class. They are small feed-forward modules added to an existing transformer architecture to reduce the parameter space between the fully-connected layers.
If the fully-connected layers are going to downscale the dimensions in one and then rescale them back to the input dimension in the next, how does this reduce the feature space? For instance, letβs say the first fully connected layer reduces a 256-dimensional input to 16 dimensions, and the second layer brings it back to 256 dimensions. This results in a total of 256 x 16 + 16 x 256 = 8,192 weight parameters. In comparison, a single fully connected layer that maps a 256-dimensional input to a 256-dimensional output would have 256 x 256 = 65,536 parameters. In adapter tuning, only the adapters, layer norms, and final head are trained on the downstream data, making the tuning quicker and more efficient. This method was proven successful based on the following observation: a BERT model trained with the adapter method reaches a modeling performance comparable to a fully finetuned BERT model while only requiring the training of 3.6% of the parameters. A simple adapter block looks like this.
import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    def __init__(self, input_dim, adapter_dim=64):
        super(AdapterBlock, self).__init__()
        self.down_proj = nn.Linear(input_dim, adapter_dim)
        self.activation = nn.ReLU()
        self.up_proj = nn.Linear(adapter_dim, input_dim)

    def forward(self, x):
        # Apply the adapter block
        return x + self.up_proj(self.activation(self.down_proj(x)))
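A quick usage sketch for the block above (the batch size, sequence length, and hidden size are arbitrary illustrative choices):

# Illustrative usage: batch=2, seq_len=4, hidden=256 are arbitrary
adapter = AdapterBlock(input_dim=256, adapter_dim=16)
hidden_states = torch.randn(2, 4, 256)
out = adapter(hidden_states)  # residual connection keeps the output shape identical
print(out.shape)              # torch.Size([2, 4, 256])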
There is another unique adapter called Llama-Adapter, which has its own adapter architecture specifically designed for turning Llama into an instruction-following model.
Prompt tuning is another additive method that utilizes the power of soft prompts (trainable prompt embeddings updated dynamically using loss feedback) instead of explicit, static human-written prompts (hard prompts). This PEFT method aims to achieve improved model performance through the inputs alone, without changing the model weights. If that is the case, why not simply use prompt engineering? Prompt engineering takes a lot of effort to design the ideal prompt, and there is the problem of context window length. Even if a prompt proves to work on test scenarios, it might not work in other scenarios, as there are plenty of ways of asking the same question.
We add additional trainable tokens (the same size as the input embedding vectors) to the input embeddings. These are not fixed points in the embedding space and can therefore take on the representation of any word. The goal is to continuously find the best representation of these trainable tokens to guide the completion of the model's output. According to the base paper, around 20 such tokens are sufficient for classification tasks (beyond this yields only marginal gains), so instead of working on 11B parameters we only work on roughly 20K parameters related to the task prompts. This method has a black-box nature, as the tokens can take any representation in the embedding space, making it difficult to control. Nearest-neighbor analysis has shown that these tokens take on semantic word representations, which means this method cannot be used for highly specialized tasks. Secondly, research observations indicate that this method is effective as the model size increases, but less so for smaller models. A simple prompt-tuning block is shown below.
import torch
import torch.nn as nn

class PromptTuningBlock(nn.Module):
    def __init__(self, vocab_size, embed_dim, prompt_length, output_dim):
        super().__init__()
        self.prompt_length = prompt_length
        # Trainable soft-prompt embeddings prepended to every input
        self.prompt_embeddings = nn.Parameter(torch.randn(prompt_length, embed_dim))
        # Token embedding table (frozen or pre-trained in practice)
        self.input_embedding = nn.Embedding(vocab_size, embed_dim)
        # A simple linear head for the output
        self.fc = nn.Linear(embed_dim, output_dim)

    def forward(self, input_ids):
        # Get input token embeddings: (batch, seq_len, embed_dim)
        input_embeds = self.input_embedding(input_ids)
        # Expand the soft prompts across the batch and prepend them
        prompt_embeds = self.prompt_embeddings.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        combined_embeds = torch.cat([prompt_embeds, input_embeds], dim=1)
        # Pass through the head (a real model would run a frozen transformer here)
        return self.fc(combined_embeds)
Prefix tuning is an enhanced variant of prompt tuning where we add soft prompts to each transformer block (before positional encoding) instead of only to the input embeddings. This way, the prefix tokens become the only trainable parameters across the layers and have a stronger influence on the output. How is it different from prompt tuning? Prefix tuning enhances multiple layers of the model by adding a task-specific prefix to the input sequence, which requires more parameters to be finetuned. In contrast, prompt tuning focuses solely on adjusting the input prompt embeddings, leading to fewer updated parameters and potentially greater parameter efficiency, though it may limit adaptability to the target task. While prefix tuning may yield better performance due to its larger parameter set, it could also demand more computational resources and increase the risk of overfitting. It is safe to assume that prompt tuning, though more efficient, might not perform as well as prefix tuning due to the reduced number of finetuned parameters. A simple prefix tuning block looks like the one below.
import torch
import torch.nn as nn

class PrefixTuningBlock(nn.Module):
    def __init__(self, num_prefix_tokens, hidden_size, num_layers):
        super().__init__()
        # Trainable prefix embeddings shared across the batch
        self.prefix_tokens = nn.Parameter(torch.randn(num_prefix_tokens, hidden_size))
        self.transformer_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, dropout=0.1, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, input_embeds):
        # Expand the prefix tokens across the batch: (batch, num_prefix, hidden)
        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        # Prepend the prefix embeddings to the input embeddings
        output = torch.cat([prefix_tokens, input_embeds], dim=1)
        # Pass the concatenated sequence through the transformer layers
        for layer in self.transformer_layers:
            output = layer(output)
        return output
Low-Rank Adaptation of Large Language Models(LoRA Family)
This is a reparameterization method that operates on an alternative representation of the language model for fine-tuning. This technique simplifies a large weight matrix in the attention layers by breaking it down into two smaller matrices, significantly reducing the number of parameters that need to be adjusted during fine-tuning. Instead of directly decomposing the matrix, it learns from the decomposed representation (pseudo decomposition).
Rather than adding new parameters to the model, we focus on this alternative representation. The general consensus is to set the rank r proportional to the amount of training data and the model size to mitigate overfitting issues and manage model budgets effectively. It has also been observed that LoRA learns less and forgets less, which is expected.
Let us understand the decomposition with an example.
Assume the weight update matrix ΔW has dimensions 100 x 500, i.e., 100 x 500 = 50,000 parameters. Now let's assume a rank of 5. The two new weight matrices become W_A with 100 x 5 = 500 parameters and W_B with 5 x 500 = 2,500 parameters, and ΔW is approximated by their product W_A x W_B. The trainable parameter count is therefore 500 + 2,500 = 3,000, a 94% decrease. A sample LoRA block looks like the one mentioned below.
import torch
import torch.nn as nn
import math

class LoRA(nn.Module):
    def __init__(self, input_dim, output_dim, rank=8, alpha=1.0):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.rank = rank
        self.alpha = alpha
        # Create LoRA weight matrices
        self.W_A = nn.Parameter(torch.empty(input_dim, rank))
        self.W_B = nn.Parameter(torch.empty(rank, output_dim))
        # Initialize LoRA weights
        nn.init.kaiming_uniform_(self.W_A, a=math.sqrt(5))
        nn.init.zeros_(self.W_B)

    def forward(self, x, W):
        h = x @ W
        # Apply LoRA
        h += self.alpha * x @ (self.W_A @ self.W_B)
        return h
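To tie this back to the parameter-count example, here is an illustrative usage of the block above, assuming the same hypothetical 100 x 500 weight:

# Illustrative usage mirroring the 100 x 500 example (sizes are hypothetical)
W = torch.randn(100, 500)              # frozen pretrained weight matrix
lora = LoRA(input_dim=100, output_dim=500, rank=5)
x = torch.randn(8, 100)                # a batch of 8 inputs
y = lora(x, W)                         # shape: (8, 500)

trainable = sum(p.numel() for p in lora.parameters())
print(trainable)                       # 100*5 + 5*500 = 3,000 (vs. 50,000 in W)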
There are various flavors of LoRA, such as DoRA, QLoRA (a popular variant that keeps the base model quantized to low precision while training the LoRA adapters), LoHA, and others. You can find some popular ones in this article and their transformers implementations in the HuggingFace documentation.
Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3)
This is another additive method involving three phases: adding vectors, rescaling (inhibiting/amplifying), and tuning on downstream data. The three added vectors are a key rescaling vector (multiplied with the keys in the self-attention layers), a value rescaling vector (multiplied with the values in the self-attention and encoder-decoder attention layers), and an intermediate activation rescaling vector (multiplied with the intermediate activations in the position-wise feed-forward network). The learned vectors rescale the corresponding elements in the model; this rescaling can either inhibit (reduce) or amplify (increase) the activations, depending on the values in the learned vectors. Finally, the model is finetuned on a downstream task, with the learned vectors updated during fine-tuning to optimize the model's performance. The method was proposed as a better alternative to few-shot prompting strategies (in-context learning).
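The core operation is simply an element-wise multiplication by learned vectors. Below is a simplified sketch of that rescaling, not the full IA3/T-Few implementation, with assumed dimensions:

import torch
import torch.nn as nn

class IA3Rescaling(nn.Module):
    # Simplified sketch of IA3-style rescaling: learned vectors multiply the keys,
    # values, and intermediate feed-forward activations. Initializing to ones means
    # the starting behaviour matches the frozen base model.
    def __init__(self, hidden_dim, ffn_dim):
        super().__init__()
        self.l_k = nn.Parameter(torch.ones(hidden_dim))   # key rescaling vector
        self.l_v = nn.Parameter(torch.ones(hidden_dim))   # value rescaling vector
        self.l_ff = nn.Parameter(torch.ones(ffn_dim))     # FFN activation rescaling vector

    def forward(self, keys, values, ffn_activations):
        # Element-wise inhibition (entries < 1) or amplification (entries > 1)
        return keys * self.l_k, values * self.l_v, ffn_activations * self.l_ff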
Orthogonal Finetuning via Butterfly Factorization (BOFT)
This is another reparameterization strategy in which we perform an orthogonal transformation of the weight matrices using butterfly factorization. Let us understand butterfly factorization: the goal is to represent a given weight matrix as a product of two matrices, namely a diagonal matrix and a permutation matrix.
By introducing orthogonality between the gradients of the fine-tuning loss and the pre-training loss, we maintain a structural constraint. As stated in the HuggingFace conceptual guide, this helps keep the hyperspherical energy unchanged during finetuning. Energy can be thought of as the distance of a point from the origin on the hypersphere. Maintaining the hyperspherical energy during fine-tuning ensures that the learned representations remain close to their original positions, reducing the risk of forgetting previously learned information. This sparse representation also helps the model generalize better. As the authors put it, "The butterfly structure serves as a smooth interpolation between different block number hyperparameters in OFT, making the orthogonal finetuning framework more flexible and more importantly, more parameter-efficient." This approach has been predominantly trained and tested on image models and is preferable for text-to-image models.
Other popular alternatives to LoRA include sharing the same low-rank matrices across layers and initializing the adapter layers with the principal singular values and singular vectors of the original model's weights. This remains an actively researched problem, as the primary goal is to fit and run models on local hardware such as smartphones and PCs (mobile chips).
LlamaFactory (fine-tuning framework)
Let us fine-tune a pretrained Llama model using the PEFT strategy to learn Docker queries. LlamaFactory is a unified framework for fine-tuning all large language models (LLMs) with a suite of cutting-edge, efficient training methods. We will explore the user interface (UI), though this can also be done via the command line interface (CLI), including how to add new models and datasets from Hugging Face, a glossary of fine-tuning terms, and the process for exporting the fine-tuned model.
Installation
Installation can be done following these GitHub repo instructions. It is advisable to build specific tools like this in a separate virtual environment.
Let us look at the various fields in the LlamaFactory UI.
Lang: sets the language for the UI. It currently supports English, Russian, Chinese, and Korean.
Model Name: sets the base model for finetuning. It already supports all the popular models. What if we need to add custom models? Add your model to the model templates inside src/llamafactory/extras/constants.py, under the appropriate model registry. The download source can also be a local directory containing all the model information.
Model Path: the path to load the model from.
Finetuning Method: supports the full, freeze, and LoRA tuning methods, corresponding to full fine-tuning, transfer learning, and PEFT approaches respectively.
Checkpoint Path: once we finetune a model, all the model information is checkpointed (meta information about the finetuned model's core attributes, which can later be loaded and reused for downstream tasks). This shows up only when a finetuned variant of the selected model is available (trained via LlamaFactory, or with a local directory path registered in the LlamaFactory template).
Advanced configurations: since local hardware might not be able to handle tuning at full or half precision, we may need to quantize the model (learn more from here) to make it operable on our PC. All the quantization parameters can be adjusted from this tab.
- Quantization bit (QLoRA): there are two supported options, 4 bits and 8 bits. This is the number of bits allocated to represent a model weight, so the lower the precision, the less information is stored and, in theory, the lower the accuracy. In practice, quantization methods like GGUF and bitsandbytes help retain most of the model's inference quality even at 4-bit precision, making it possible to tune and run models on CPUs (a short loading sketch follows this list).
- Quantization Method: the method to use for quantization. A more detailed discussion of quantization can be found in my previous article. The supported methods are half-quadratic quantization (hqq), eetq, and bitsandbytes. bitsandbytes has become the most well-known tool for quantizing models to 4 and 8 bits. You can learn more about it from the HuggingFace documentation.
- Prompt Template: there are plenty of ready-made templates available. We can register a new template by adding one in src/llamafactory/data/template.py under the _register_template section.
- RoPE scaling: RoPE scaling involves adjusting the parameters of Rotary Position Embedding (RoPE) to enhance the extrapolation abilities of large language models beyond their original training context lengths. This technique allows LLMs to effectively manage longer text sequences than those encountered during training by modifying the base value in RoPE calculations. By fine-tuning RoPE with a smaller or larger base value, the model can better capture positional information over extended contexts, ultimately improving its performance on tasks involving long texts. Dynamic NTK and linear are the two extrapolation strategies supported by LlamaFactory. Learn more about RoPE scaling from this article.
- Booster: as the name suggests, these are techniques designed to accelerate the training and inference of LLMs. LlamaFactory currently supports FlashAttention-2, Unsloth, and Liger kernels (kernels written in the Triton language). You can read more about these in the resources.
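For reference, here is a hedged sketch of loading a model in 4-bit with bitsandbytes through the transformers API; the model id is a placeholder, and exact option names may vary across library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only: the model id below is a placeholder, not a recommendation.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights (QLoRA-style)
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # placeholder model id
    quantization_config=bnb_config,
)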
Let us now look at the basic training strategies and hyperparameters for finetuning LLMs.
Stage: there are several options available, including supervised tuning, pre-training, and reinforcement learning strategies. The codebase for each strategy can be found under src/llamafactory/train in their respective folders. Let me give a one-liner about each strategy.
- Supervised fine-tuning: uses a sequence-to-sequence trainer. You can find a conceptual overview from here and the list of parameters to finetune from here.
- Reward Modelling: you can find an overview from this video and the list of parameters to finetune from here.
- Proximal Policy Optimization: you can read about PPO from the RL course on HuggingFace. This is the basic RL strategy that comes to mind, using policies to gradually improve model performance.
- Direct Preference Optimization: an efficient strategy that replaces the separate reward model with a cross-entropy loss function, thereby minimizing the reinforcement learning part and the associated computation and time. You can learn about HuggingFace DPO trainers from here.
- Kahneman-Tversky Optimization: this does away with the pairwise response preferences in the dataset used by the two methods above. Instead, we work with a binary label (true or false) for directing the model. You can learn about the KTO trainer from the HuggingFace docs.
- Pre-Training: you can find the difference between pre-training, supervised finetuning, and reinforcement learning in this quick read. The goal of pre-training is to make the model learn all the nuances from the data without any explicit instructions.
Data dir: the path to the folder containing your training data/data templates.
Dataset: quite a few HuggingFace datasets are already referenced for the various models. To add a custom dataset, modify the dataset template present in data/dataset_info.json.
Hyperparameters
Learning Rate: decides how much the model's weights are adjusted during each training iteration; it essentially controls the step size taken in the optimization process. The lower the value, the slower the learning.
Epochs: an epoch is one complete pass through the entire training dataset, so the number of epochs determines how many times the model reads through the data. Too many epochs can lead to overfitting; use strategies like early stopping to monitor the model's performance on the validation set before moving on to the next epoch.
Maximum gradient norm: gradient clipping is a technique used to prevent the gradients from becoming excessively large during training, and this norm value serves as the clipping threshold. The smaller the norm, the more aggressively gradients are clipped.
Max Samples: caps the number of samples drawn from the dataset, since running over an entire large dataset may not be feasible for every experiment.
Compute type: determines whether to use mixed precision during training. Some quantization strategies rely on the idea that retaining the most impactful weights at their original precision while quantizing the relatively less impactful ones can yield performance close to the original. Here's a Wikipedia link to bf16. LlamaFactory supports bf16 (mixed), fp32, fp16, and pure bf16 formats.
Cutoff length: maximum token length for the input (varies based on the model).
Batch Size: number of samples processed at a time.
Gradient Accumulation: the number of sub-batches over which gradients are accumulated before a weight update is applied. This is needed for training models on consumer GPUs with limited memory and compute.
Val Size: percentage of data to use for validation.
LR scheduler: the learning rate scheduler dynamically adjusts the learning rate over the course of training.
Extra Configurations
Logging steps: the number of iterations between log entries (if 5, metrics are logged every 5 steps).
Save steps: the number of steps between two checkpoints, which determines how frequently checkpoints are saved. A higher number means checkpoints are saved less frequently, while a lower number means they are saved more frequently. This helps in scenarios where the training process gets interrupted.
Warmup steps: a training technique where the learning rate is gradually increased from a small initial value to a target learning rate over a specified number of steps.
NEFTune Alpha: NEFTune adds noise to the input embeddings of the language model, introducing randomness into the training process. You can learn more in the HuggingFace docs. Applying NEFTune to well-distributed datasets may only add marginal gains (sometimes none).
Optimizer: as the name suggests, it is used to optimize the model toward the lowest loss value by iteratively adjusting the model's parameters (weights and biases) to minimize the error between the model's predictions and the true values. The supported optimizers are adamw_torch, adamw_8bit, and adafactor. You can learn more in the PyTorch documentation.
Pack sequences: packing combines sequences of different lengths into a single tensor, eliminating unnecessary padding. PyTorch allows us to pack sequences; internally, a packed sequence is a tuple of two lists, one containing the elements of the sequences and the other the batch size at each step.
Use neat packing: when using packed sequences, it is generally recommended to avoid cross-attention between different sequences within the packed tensor.
Train on prompt: disables the label masks on the prompt tokens, so the loss is computed over the prompt as well as the response during training.
Resize token embeddings: resizes the tokenizer vocabulary and the embedding layers to accommodate newly added tokens.
Enable s² attention: you can learn about LongLoRA from here. Shift short attention is like breaking a textbook down into smaller chapters or sections: you focus on understanding each chapter individually, without trying to memorize every connection between them all at once.
Enable external logger: activates TensorBoard monitoring for the training process. By default, you can monitor training activity using the bottom section of the LlamaFactory UI or the CLI.
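Many of these knobs map onto the standard HuggingFace TrainingArguments that trainers like LlamaFactory build on under the hood. The sketch below is only an illustration with arbitrary values and is not LlamaFactory's own configuration format.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./finetune-output",
    learning_rate=2e-5,                  # step size of each weight update
    num_train_epochs=3,                  # full passes over the training data
    max_grad_norm=1.0,                   # gradient clipping threshold
    per_device_train_batch_size=4,       # samples processed per step per device
    gradient_accumulation_steps=8,       # accumulate gradients before each update
    lr_scheduler_type="cosine",          # learning-rate schedule
    warmup_steps=100,                    # ramp the learning rate up gradually
    bf16=True,                           # mixed-precision compute type
    logging_steps=5,                     # log every 5 steps
    save_steps=500,                      # checkpoint every 500 steps
    optim="adamw_torch",                 # optimizer choice
)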
Specific Configurations
Freeze tuning configurations
Trainable layers: the number of hidden layers to leave trainable (the rest are frozen).
Trainable and extra modules: we can specify which modules of the LLM to train; if set to 'all', everything is trained. You can find the corresponding code in src/llamafactory/model/adapter.py and the supported modules via the named_parameters() function of pretrained transformers models.
RLHF configurations
Beta value: a hyperparameter controlling the balance between human feedback and the reward signal. A high beta value makes the model prioritize human feedback, while a low beta value focuses more on the reward structure.
Ftx gamma: denotes the discount factor in reinforcement learning. It determines how future rewards are valued compared to immediate rewards: a gamma close to 1 means future rewards are valued highly, while a gamma close to 0 prioritizes immediate rewards.
Loss type: the loss function. The supported types are sigmoid, hinge, IPO, KTO_pair, ORPO, and SimPO.
Reward model: the path to the reward model.
Score norm and Whiten rewards: normalizing scores refers to adjusting the rewards or advantage estimates to a standard scale, which helps stabilize and improve the learning process. Whitening the rewards standardizes them to have a mean of zero and a standard deviation of one.
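As a quick standalone illustration of what whitening means here (not LlamaFactory code):

import torch

def whiten_rewards(rewards, eps=1e-8):
    # Standardize a batch of rewards to zero mean and unit standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(whiten_rewards(torch.tensor([1.2, -0.5, 3.0, 0.1])))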
GaLore (Gradient Low-Rank Projection) configurations
This is a memory-efficient low-rank training strategy that allows full-parameter learning while being more memory-efficient than common low-rank adaptation methods such as LoRA. It is the first method that allows pre-training a Llama-2-7B model on 28GB of VRAM without using model parallelism, checkpointing, or offloading.
GaLore Rank: a higher rank uses more memory and gives higher accuracy, and vice versa.
Update Interval: the frequency of updates can significantly impact the performance of GaLore. If the updates are too frequent, the memory savings may be limited; if they are too infrequent, the approximation may become outdated and degrade the model's performance.
GaLore Scale: the larger the scaling factor, the larger the updates.
BAdam configurations
This is a full-parameter optimization technique for commercial GPUs. The authors were able to finetune Llama 2-7B and Llama 3-8B on a single RTX 3090 using Adam's update rule and mixed-precision training. It uses a technique called block coordinate optimization, which divides the model's parameters into smaller blocks and iteratively updates each block while keeping the others fixed, significantly reducing the memory footprint required for training while maintaining good performance.
BAdam mode: we can run the BAdam optimizer either layer-wise (the BlockOptimizer class is imported) or ratio-wise (the BlockOptimizerRatio class is imported). Layer-wise training requires a DeepSpeed installation to work.
Update ratio: the ratio for the sparse mask.
Switch mode: the mode for switching between different blocks.
Switch interval: the number of optimization steps before switching to the next block.
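Conceptually, block coordinate optimization just rotates which block of parameters is trainable at any given time. The toy sketch below illustrates that idea; it is not the actual BAdam implementation, and the model and switch interval are arbitrary.

import torch.nn as nn

# Toy sketch of block coordinate optimization (not the actual BAdam code):
# parameters are grouped into blocks, and only one block is trainable at a time.
model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 64))
blocks = [list(layer.parameters()) for layer in model]

def activate_block(active_idx):
    for idx, block in enumerate(blocks):
        for param in block:
            param.requires_grad = (idx == active_idx)

switch_interval = 50  # optimization steps before switching to the next block
for step in range(300):
    activate_block((step // switch_interval) % len(blocks))
    # ... forward pass, loss, backward(), and optimizer step would go here ...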
You can follow this video tutorial for finetuning using LlamaFactory.
Conclusion
In this article, we explored the diverse landscape of fine-tuning strategies, each offering unique advantages depending on the specific requirements of your project. From traditional methods like transfer learning to more advanced techniques such as adapters and prompt tuning, understanding these approaches empowers you to select the most effective strategy for your use case.
Transitioning into practical implementation, we delved into the LlamaFactory tool, which simplifies the fine-tuning process. By examining various parameters and hyperparameters, we highlighted how choices like learning rate, batch size, and gradient accumulation can significantly influence model performance. This practical insight equips you with the knowledge to tailor your fine-tuning efforts effectively.
Hope this overview enhances your understanding of the LLM fine-tuning landscape and empowers you to implement these strategies successfully.
Resources
- Modern Approaches in Natural Language Processing, LMU Munich, seminar booklet
- Deep Reinforcement Learning, OpenAI, Book
- Parameter Efficient Finetuning, Vinija Jain, Primer Notes
- AI blog by Sebastian Raschka, Sebastian Raschka, blog