How to Fine-Tune Any Large Language Model (LLM)
Last Updated on January 29, 2025 by Editorial Team
Author(s): Pranjal Khadka
Originally published on Towards AI.
Fine-tuning large language models (LLMs) has become an easier task today thanks to the availability of low-code/no-code tools that let you simply upload your data, select a base model and obtain a fine-tuned model. However, it is important to understand the fundamentals before diving into these tools. In this article, we'll explore the entire process of fine-tuning LLMs in detail.
LLMs operate in two main stages: pre-training and fine-tuning.
1. Pre-training
During the pre-training phase, LLMs are exposed to massive datasets of text. This stage involves defining the model architecture, selecting the tokenizer and processing the data using the tokenizer's vocabulary. In autoregressive models like GPT and LLaMA, the model learns to predict the next word in a sentence. For encoder-only architectures like BERT, the model learns to predict missing words in a sentence. These two approaches are known as causal language modeling (CLM) and masked language modeling (MLM) respectively.
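The difference between the two objectives can be sketched in a few lines. This is an illustrative toy (word-level "tokens" instead of a real tokenizer), showing how CLM and MLM derive different training targets from the same sequence:

```python
# Illustrative sketch (not a real tokenizer): how CLM and MLM
# training targets differ for the same token sequence.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal LM (GPT, LLaMA): predict the next token from the prefix.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked LM (BERT): hide some tokens, predict them from full context.
masked_positions = [2, 4]  # positions chosen for illustration
mlm_input = [t if i not in masked_positions else "[MASK]"
             for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

print(clm_pairs[0])  # (['the'], 'cat')
print(mlm_input)     # ['the', 'cat', '[MASK]', 'on', '[MASK]', 'mat']
```

In CLM every position becomes a prediction target given its prefix; in MLM only the masked positions are predicted, but with context from both sides.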
Most commonly, we use causal language modeling, where the model predicts the next word in the sequence based on the previous context. However, pre-trained models are general-purpose: they lack domain-specific knowledge and can't perform specialized tasks. This is where fine-tuning comes in.
2. Fine-tuning
Fine-tuning allows us to specialize a model's capabilities for a particular task by adjusting the model's parameters in a way that minimizes the task-specific loss. However, conducting a full fine-tuning process, which involves retraining all parameters of the model, can be computationally expensive.
Recently, advanced techniques like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) have emerged, making fine-tuning more efficient and accessible. The original LoRA paper demonstrated on GPT-3 175B that LoRA can reduce the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3, while performing on par with or better than full fine-tuning, with higher training throughput and no additional inference latency.
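The parameter savings follow from simple arithmetic. A sketch with illustrative numbers (a hypothetical 4096-dimensional projection matrix, not figures from the paper):

```python
# Back-of-envelope arithmetic (illustrative sizes, not from the paper):
# LoRA replaces a full d_out x d_in weight update with two low-rank
# factors B (d_out x r) and A (r x d_in), so the trainable-parameter
# count per adapted matrix drops from d_out*d_in to r*(d_out + d_in).
d_in = d_out = 4096   # hypothetical hidden size of one projection matrix
r = 8                 # LoRA rank

full_params = d_out * d_in
lora_params = r * (d_out + d_in)

print(full_params)                 # 16777216
print(lora_params)                 # 65536
print(full_params // lora_params)  # 256x fewer trainable parameters
```

The ratio grows with the matrix size and shrinks with the rank, which is why LoRA pays off most on very large models with small ranks.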
Let's take a closer look at the fine-tuning process:
1. Training Compute Requirements
Full fine-tuning of 7B-parameter models like LLaMA 7B and Mistral 7B can require roughly 150–195 GB of GPU memory. Unless you have such hardware locally, you'll need to rent GPUs from cloud providers like AWS SageMaker or use hosted notebooks like Google Colab.
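A rough sense of where that memory goes can be sketched with a common rule of thumb for full fine-tuning with Adam in mixed precision (the 16-bytes-per-parameter figure is an assumed approximation, not a measurement, and excludes activations):

```python
# Rough rule-of-thumb estimate for full fine-tuning with Adam in mixed
# precision: ~2 bytes for fp16 weights, ~2 for fp16 gradients, and
# ~12 for fp32 optimizer state (moments + master weights), per
# parameter. Activations and framework overhead come on top.
params_billion = 7
bytes_per_param = 2 + 2 + 12   # weights + gradients + optimizer state
gb = params_billion * 1e9 * bytes_per_param / 1024**3
print(round(gb))  # ~104 GB before activations and overhead
```

Activations, gradient checkpointing choices and batch size account for the rest, which is how a 7B model ends up in the 150–195 GB range and why LoRA/QLoRA are so attractive.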
2. Preparing the Dataset
You can either create your own dataset or use publicly available datasets from sources like Hugging Face. High-quality data is an essential ingredient of successful fine-tuning.
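If you build your own dataset, a common approach is to store instruction–response pairs as JSONL, a format most fine-tuning tools can ingest. The field names below (`instruction`, `input`, `output`) are a widespread convention rather than a fixed standard, and the examples are placeholders:

```python
import json

# A minimal sketch of preparing an instruction-tuning dataset as JSONL:
# one JSON object per line, with instruction/input/output fields.
examples = [
    {"instruction": "Summarize the text.", "input": "LLMs are...", "output": "A short summary."},
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
]

lines = [json.dumps(ex) for ex in examples]
with open("train.jsonl", "w") as f:
    f.write("\n".join(lines))

print(len(lines))  # 2
```

Whatever schema you pick, keep it consistent across the whole dataset so the fine-tuning tool's prompt template can be applied uniformly.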
3. Using Low-Code Tools
Tools like Axolotl simplify the fine-tuning process by offering pre-defined configurations for LoRA, QLoRA and open-source LLMs. Axolotl requires minimal coding: you just need to clone the GitHub repository and follow the setup instructions, and you'll be able to fine-tune any supported LLM with a single command.
Understanding LoRA and QLoRA
LoRA is a technique that allows efficient fine-tuning of LLMs by freezing the pre-trained model weights and introducing low-rank matrices. These low-rank matrices are fine-tuned while the original model weights remain frozen. This approach reduces the number of trainable parameters and makes the fine-tuning process memory efficient while maintaining high performance.
Key parameters when fine-tuning with LoRA:
- lora_r: the rank of the low-rank decomposition matrices. A higher value captures more information (potentially better performance) but increases memory usage. If you have a very complex dataset, consider setting this value higher.
- lora_alpha: a scaling factor that controls the impact of the LoRA weight updates on the original model weights. A lower value gives more weight to the pre-trained weights, preserving the model's existing knowledge to a greater extent.
- lora_target_modules: determines which specific layers/matrices are trained. Typically, the query and value projection matrices in the self-attention mechanism are chosen because they have the most impact on model adaptation.
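In practice these knobs end up together in a tool's config. A hedged sketch as a plain Python dict (the exact key names vary between tools such as Axolotl and Hugging Face PEFT; the values here are common illustrative defaults, not recommendations):

```python
# Illustrative LoRA configuration mirroring the parameters above.
lora_config = {
    "lora_r": 16,                  # rank of the decomposition matrices
    "lora_alpha": 32,              # scaling factor for the LoRA updates
    "lora_target_modules": ["q_proj", "v_proj"],  # attention Q/V projections
}

# Many implementations scale the low-rank update by alpha / r, so these
# two values interact: doubling alpha at fixed r doubles the update's weight.
effective_scale = lora_config["lora_alpha"] / lora_config["lora_r"]
print(effective_scale)  # 2.0
```

A common starting heuristic is to set `lora_alpha` to around twice `lora_r` and tune from there.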
QLoRA enhances LoRA by applying low-rank updates to a model that has been quantized to lower precision. This allows fine-tuning large models with significantly reduced memory requirements, making it easier to work with limited resources.
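The quantization idea behind QLoRA can be illustrated with a simple round trip. Real QLoRA uses the NF4 data type with per-block scaling; the symmetric scheme below is a simplified sketch for intuition only:

```python
# Illustrative 4-bit quantize/dequantize round trip: frozen weights are
# stored as small integers plus one scale, and only the LoRA factors
# stay in full precision. (Simplified vs. real QLoRA's NF4 format.)
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7   # signed 4-bit range: -8..7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.2, -0.4, 0.1, 0.7]
q, scale = quantize_4bit(weights)
approx = dequantize(q, scale)
print(q)       # small integers in [-8, 7]
print(approx)  # close to the originals, with quantization error
```

The memory win is that each frozen weight now needs 4 bits instead of 16 or 32, at the cost of a small, bounded rounding error.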
Key parameters in QLoRA:
- load_in_4bit: loads the model in 4-bit precision for memory efficiency; can be set to either True or False. Similarly, load_in_8bit loads the model in 8-bit precision.
In addition to the LoRA- and QLoRA-specific parameters, you'll encounter the usual machine learning hyperparameters when fine-tuning, such as num_epochs, batch_size, optimizer, learning_rate, lr_scheduler, and wandb parameters for experiment tracking.
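One of these hyperparameters, the learning-rate schedule, is worth seeing concretely. A sketch of a common choice, linear warmup followed by cosine decay (the values are illustrative; fine-tuning tools typically expose this via an `lr_scheduler` option):

```python
import math

# Linear warmup to base_lr over warmup_steps, then cosine decay to 0.
def lr_at(step, total_steps, base_lr=2e-4, warmup_steps=100):
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(50, total))    # mid-warmup: half the base rate
print(lr_at(100, total))   # warmup done: full base rate
print(lr_at(1000, total))  # end of training: decayed to ~0
```

Warmup avoids destabilizing the pre-trained weights with large early updates, and the decay lets the model settle into a minimum.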
Fine-tuning LLMs is no longer a daunting task thanks to no-code and low-code tools and techniques like LoRA and QLoRA. By understanding these core principles and training parameters, you can efficiently fine-tune any LLM to meet the specific needs of your application.