

Fast and Efficient Model Finetuning using the Unsloth Library

Last Updated on January 10, 2024 by Editorial Team

Author(s): Eduardo Muñoz

Originally published on Towards AI.

Image generated by the author on Leonardo.ai

Introduction

Recently, a new framework, or library, for optimizing the training and fine-tuning of large language models was released: Unsloth. The library started as a moonshot project built by two brothers, Daniel and Michael Han, who promise much faster and more memory-efficient finetuning.

In the blog post, they announced:

30x faster. Alpaca takes 3 hours instead of 85.

60% less memory usage, allowing 6x larger batches.

0% loss in accuracy or +20% increased accuracy with our Max offering.

Supports NVIDIA, Intel and AMD GPUs with our Max offering.

Manual autograd and chained matrix multiplication optimizations.

Rewrote all kernels in OpenAI’s Triton language.

Flash Attention via xformers and Tri Dao’s implementation.

Free open source version makes finetuning 2x faster with 50% less memory.

— Blog post “Introducing Unsloth: 30x faster LLM training” [1]

The authors highlight that while PyTorch’s Autograd is generally efficient for most tasks, achieving extreme performance requires manually deriving matrix differentials. The authors perform simple matrix dimension FLOP (floating-point operation) calculations and find that bracketing the LoRA weight multiplications significantly enhances performance.
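
A quick back-of-the-envelope FLOP count shows why the bracketing matters. For a LoRA update X(AB), with X of shape (n, d), A of shape (d, r) and B of shape (r, d), evaluating (XA)B only ever creates rank-r intermediates, while X(AB) first materializes a full d×d matrix. The sketch below uses illustrative Llama-2-7B-like dimensions chosen by me, not numbers from the Unsloth post:

# Illustrative shapes: n tokens in the batch, hidden size d, LoRA rank r
n, d, r = 2048, 4096, 16

# FLOPs of a matmul (m x k) @ (k x p) is roughly 2*m*k*p
def matmul_flops(m, k, p):
    return 2 * m * k * p

# (X @ A) @ B: two skinny matmuls through the rank-r bottleneck
bracketed = matmul_flops(n, d, r) + matmul_flops(n, r, d)

# X @ (A @ B): first build the full d x d update, then apply it
unbracketed = matmul_flops(d, r, d) + matmul_flops(n, d, d)

print(f"(XA)B : {bracketed / 1e9:.2f} GFLOPs")
print(f"X(AB) : {unbracketed / 1e9:.2f} GFLOPs ({unbracketed / bracketed:.0f}x more)")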

All these features are impressive: they can greatly reduce the time and resources needed to fine-tune LLMs. Here we will try the open-source version, which can achieve a 2x speedup, but there are also PRO and MAX versions that promise up to 30x faster training and up to 60% less memory consumption.

How can it achieve that performance?

To achieve a better performance, they have developed a few techniques:

  1. Reduce weight upscaling during QLoRA: fewer weight conversions mean less memory consumption and faster training.
  2. Bitsandbytes works with float16 and then converts to bfloat16; Unsloth directly uses bfloat16.
  3. Use PyTorch’s Scaled Dot Product Attention implementation (see the sketch after this list).
  4. Integrate Xformers and Flash Attention 2 to optimize the transformer model.
  5. Use a causal mask to speed up training instead of a separate attention mask.
  6. Implement fast RoPE embeddings with OpenAI’s Triton.
  7. Accelerate RMS normalization with Triton.
  8. Optimize the cross-entropy loss computation to significantly reduce memory consumption.
  9. Implement a manual autograd for the MLP and self-attention layers to optimize PyTorch’s default implementation.
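
To give an idea of what items 3 and 5 refer to, here is a minimal PyTorch sketch of the building block involved. This is plain torch.nn.functional.scaled_dot_product_attention, not Unsloth’s own Triton kernels: on recent GPUs the call dispatches to a fused, Flash-Attention-style implementation, and passing is_causal=True applies the causal mask inside the kernel instead of materializing a separate attention-mask tensor.

import torch
import torch.nn.functional as F

# Toy tensors with shape (batch, heads, seq_len, head_dim)
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(2, 8, 128, 64, device=device)
k = torch.randn(2, 8, 128, 64, device=device)
v = torch.randn(2, 8, 128, 64, device=device)

# is_causal=True builds the causal masking into the fused kernel,
# so no explicit (seq_len x seq_len) mask tensor is needed
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])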

You can read a more detailed explanation in the excellent article by Benjamin Marie, “Unsloth: Faster and Memory-Efficient QLoRA Fine-tuning” [2], and in the blog post “Finetune Mistral 14x faster” by Unsloth [3].

Finetune a Llama-2 model

In the Unsloth GitHub account [4], you can find a number of example notebooks on how to fine-tune a model with this library. The steps are the same as those we follow when applying QLoRA, so if you are familiar with them, you can easily adapt a notebook to your use case.

In this article, we will walk through the relevant steps to finetune a Llama-2 model for a code-generation task. The code is based on a notebook from the official Unsloth repo, adapted to my use case and with some parameter modifications.

You can find my code and notebook in my repo “unsloth-llama-2-finetune”.

Install the library and modules

First, you need to keep in mind which GPU you are going to use. To take advantage of many of this library's performance improvements, a recent GPU of the Ampere generation is required. In Colab, you can train on a GPU of that family, the A100, but you can also use a V100 or even a T4, although not all the benefits will apply.

Select the library installation command depending on your CUDA version and GPU architecture. The recommended setup is CUDA 12.1 and an Ampere GPU.

pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"

Then, you’ll need to install and load the “transformer family” libraries.

%%capture
!pip install flash-attn
!pip install transformers datasets

from unsloth import FastLlamaModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
from peft import LoraConfig, PeftModel

from datasets import load_dataset
from random import randrange

Image generated by the author on Leonardo.ai

Load the model and LoRA configuration

Then, we can load the model using the methods in the Unsloth library. Their method also loads the tokenizer, so we just need to set a few parameters: the maximum input length, load_in_4bit (which indicates whether the model should be quantized with the NormalFloat4 data type), and the data type used for computation during training. You can pass None as the dtype to automatically detect the proper type for the hardware.

In the next step, we prepare the model for PEFT tuning, adding the usual LoRA parameters (r and alpha; dropout currently has to be 0). In this use case, I have only added LoRA adapters to the MLP modules.

# Check if bfloat16 is supported
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()

# Base model, maximum sequence length and seed (example values)
model_name = "meta-llama/Llama-2-7b-hf"
max_seq_length = 2048
random_state = 42

# Load the Llama-2 model (the tokenizer is returned as well)
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = None,           # None = auto-detect the best dtype for the hardware
    load_in_4bit = True,    # quantize the base weights to NormalFloat4
    # token = "hf_...",     # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Adapt the model for QLoRA training
model = FastLlamaModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["gate_proj", "up_proj", "down_proj"],  # LoRA only on the MLP blocks
    lora_alpha = 16,
    lora_dropout = 0,       # currently only supports dropout = 0
    bias = "none",          # currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = random_state,
    max_seq_length = max_seq_length,
)
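
Since get_peft_model returns a standard PEFT-wrapped model, you can sanity-check how small the trainable footprint is. This is an optional step I add here for illustration; it is not part of the original notebook:

# Only the LoRA matrices attached to the MLP projections are trainable;
# the 4-bit base weights stay frozen
model.print_trainable_parameters()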

Load the dataset

For our tuning process, we will use a dataset containing about 18,000 examples in which the model is asked to write Python code that solves a given task. It is an extract of a dataset with code in multiple languages, from which only the Python examples were selected. Each row contains a description of the task to be solved, an example of input data for the task (if applicable), and the code fragment that solves it [5].

# Dataset on the Hugging Face Hub [5] and split to load
dataset_name = "iamtarun/python_code_instructions_18k_alpaca"
dataset_split = "train"

# Load the dataset from the hub
dataset = load_dataset(dataset_name, split=dataset_split)
# Show dataset size
print(f"dataset size: {len(dataset)}")
# Show a random example
print(dataset[randrange(len(dataset))])

Creating the prompt

To carry out instruction fine-tuning, we must transform each of our data examples into an instruction, outlining its main sections as follows:

# Create the formatting prompt
instruction_prompt = """### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
{}

### Input:
{}

### Response:
{}
"""


def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = instruction_prompt.format(instruction, input, output)
        texts.append(text)
    return { "text" : texts, }

# Transform each dataset sample into an instruction prompt
dataset = dataset.map(formatting_prompts_func, batched = True,)

Once we apply the transformation function, our samples are ready to be used for finetuning:

Output:

### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
Develop a Python program that prints "Hello, World!" whenever it is run.

### Input:


### Response:
#Python program to print "Hello World!"

print("Hello, World!")

Create the Trainer

As I mentioned and discussed in my previous article about fine-tuning a Llama-2 model using QLoRA [6], the next steps are well known to Hugging Face users: setting up the training arguments and creating an SFTTrainer object.

# Name of the local output directory / adapter (example value)
adapter_name = "llama-2-7b-unsloth-python-coder"

training_arguments = TrainingArguments(
    output_dir = adapter_name,
    evaluation_strategy = "no",       # no evaluation during training
    eval_steps = 50,                  # ignored while evaluation_strategy="no"
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 4,
    #num_train_epochs = num_train_epochs, # train for a number of epochs...
    max_steps = 1500,                 # ...or for a fixed number of steps
    warmup_steps = 10,                # takes precedence over warmup_ratio
    warmup_ratio = 0.01,
    learning_rate = 2e-4,
    optim = "adamw_8bit",
    save_strategy = "steps",
    save_steps = 500,
    logging_steps = 100,
    fp16 = not HAS_BFLOAT16,
    bf16 = HAS_BFLOAT16,
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = random_state,
)

The defined parameters are the most common ones in QLoRA-style training; there is nothing unusual to discuss. Depending on the VRAM of your GPU, you will need to adjust per_device_train_batch_size and gradient_accumulation_steps to fit the memory requirements; with the values above, the effective batch size is 4 × 4 = 16 examples per optimizer step. The SFTTrainer call is also very similar to the one in other PEFT tuning jobs:

# Set the logging properties
logging.set_verbosity_info()

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = training_arguments,
)
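
With the trainer configured, launching the run is the usual trl call. The peak-memory report afterwards is an optional extra I add here, using standard PyTorch utilities:

# Launch the finetuning run
trainer_stats = trainer.train()

# Optional: peak GPU memory reserved during training
print(f"Peak reserved GPU memory: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")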

Save the adapter to the Hugging Face Hub

Finetuning is a heavy and time-consuming task, so I highly recommend saving your model and outputs to disk or the hub right after training.

import os

from huggingface_hub import login
from dotenv import load_dotenv

# Hub repository that will host the adapter (example repo id)
adapter_repo = "your-user/llama-2-7b-unsloth-python-coder"

# Load the environment variables
load_dotenv()
# Login to the Hugging Face Hub
login(token=os.getenv("HF_HUB_TOKEN"))
# Push the trained LoRA adapter and the tokenizer to the hub
trainer.model.push_to_hub(adapter_repo)
tokenizer.push_to_hub(adapter_repo)
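
If you later want to reuse the adapter for inference, you can reattach it to the base model. Below is a minimal sketch using plain transformers and peft, assuming the same adapter_repo pushed above; it is not taken from the original notebook:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach the LoRA adapter pushed to the hub in the previous step
model = PeftModel.from_pretrained(base_model, adapter_repo)
model.eval()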

All code is available in my repo “unsloth-llama-2-finetune”.

Conclusion

Although this experiment was simple, I observed training that was almost 2x faster than when I fine-tuned this model with the standard transformers stack. At the moment, only the Llama-2, Mistral, and CodeLlama 34B models are supported, but it is a really interesting option for future finetuning jobs. I hope they keep evolving and improving the library and the techniques applied.

References

[1]. Blog post “Introducing Unsloth: 30x faster LLM training”

[2]. “Unsloth: Faster and Memory-Efficient QLoRA Fine-tuning” by Benjamin Marie

[3]. “Finetune Mistral 14x faster” by Unsloth

[4]. GitHub account of Unsloth

[5]. “iamtarun/python_code_instructions_18k_alpaca” Dataset

[6]. “Fine-Tuning a Llama-2 7B Model for Python Code Generation” by Eduardo Muñoz
