
Fine-Tuning 101: Unlocking the Power of AI Customization
Last Updated on March 10, 2025 by Editorial Team
Author(s): Dhruv Tiwari
Originally published on Towards AI.
Fine-tuning is the process of training a model on specific examples to shape its responses in a desired way. However, it should not be used to teach the model new knowledge.
1. What are the methods for fine-tuning?
There are currently three methods for fine-tuning:
- Full parameter training
- Low Rank Adaptation (LoRA)
- Quantized LoRA (QLoRA)
1. Full Parameter Fine-Tuning
Full-parameter fine-tuning is a method of adapting a pre-trained model to a specific task by updating all of its parameters. Unlike parameter-efficient tuning techniques (such as LoRA), which modify only a subset of parameters, full fine-tuning allows the model to learn task-specific knowledge comprehensively.
2. Low Rank Adaptation
LoRA is a technique designed to efficiently fine-tune large models like Llama 3 8B, which has 8 billion parameters. Training such a model is extremely GPU- and memory-intensive because all of the parameters must be stored and updated in memory.
How does LoRA work?
Let’s say we have a weight matrix W of shape (200,200), which means it has:
200 x 200 = 40,000 trainable parameters
Instead of updating this entire matrix, LoRA freezes W and introduces two much smaller matrices:
- A of shape (200,1) → 200 parameters
- B of shape (1,200) → 200 parameters
This gives a total of 200+200 = 400 trainable parameters
Low-Rank Approximation
- Instead of updating 40,000 parameters, we only train 400 parameters.
- However, the matrix multiplication A x B reconstructs a full-sized (200,200) matrix.
A x B = (200,1) x (1,200) = (200,200)
Now what?
We only store and train 400 parameters, yet during the forward pass the model still operates on a full (200,200) matrix, because the product A x B is added to the frozen W. Training 400 parameters instead of 40,000 is what significantly reduces memory usage and speeds up fine-tuning.
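To make this concrete, here is a minimal sketch of the idea in PyTorch (a toy layer for illustration, not the actual PEFT implementation):
import torch
import torch.nn as nn

class ToyLoRALinear(nn.Module):
    def __init__(self, in_features=200, out_features=200, rank=1):
        super().__init__()
        # The pre-trained weight W is frozen: it receives no gradients.
        self.W = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Only the two small low-rank matrices A and B are trained.
        # B starts at zero so the initial update A @ B contributes nothing.
        self.A = nn.Parameter(torch.randn(out_features, rank))  # (200, 1)
        self.B = nn.Parameter(torch.zeros(rank, in_features))   # (1, 200)

    def forward(self, x):
        # The forward pass uses W plus the low-rank update A @ B,
        # which reconstructs a full (200, 200) delta on the fly.
        return x @ (self.W + self.A @ self.B).T

layer = ToyLoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 400 = 200 (A) + 200 (B)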
3. Quantized LoRA
It’s straightforward: we quantize the base model to 4 bits before applying LoRA.
Final model = Quantization to 4 bit + LoRA
What does this help with?
Although LoRA reduces the number of trainable parameters, we still have to load all of the model’s parameters onto the GPU for training. Quantizing the weights to 4 bits shrinks them by roughly 4x compared to 16-bit storage (8x compared to FP32).
What is Quantization?
Normally, the model’s parameters are stored in FP32, i.e., floating point with 32 bits, which preserves accuracy during training. Quantization reduces each weight to 4 bits (the NF4 data type). While this costs some precision, the loss is minor since the parameters are already trained.
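As a rough illustration, here is a minimal sketch of the idea behind quantization, using simple absmax rounding to 4 bits (the real NF4 scheme used by QLoRA is more sophisticated, with normal-distribution-aware levels and block-wise scaling):
import torch

def absmax_quantize_4bit(w):
    # Map float weights onto 15 integer levels (-7..7) using a per-tensor scale.
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights; the small rounding error is the accuracy cost.
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = absmax_quantize_4bit(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # small reconstruction error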
Now let’s get started with the practical implementation.
2. The Process of Fine-Tuning an LLM
Libraries that are used:
- Datasets — Loading the dataset
- Transformers — Loading the LLM and storing it
- PEFT — LoRA configuration
- TRL — Training the model using Supervised Fine-Tuning
- Optional: Unsloth — Fast fine-tuning
We will be working with the SocraticChat dataset:
from datasets import load_dataset
dataset = load_dataset('FreedomIntelligence/SocraticChat', split='train[0:500]')
This loads the first 500 rows of the dataset.
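A quick inspection confirms the structure we will rely on later: each row has a 'conversations' list of turns, where 'from' names the speaker and 'value' holds the message text.
print(dataset)                         # Dataset({features: [...], num_rows: 500})
print(dataset[0]['conversations'][0])  # e.g. {'from': 'human', 'value': '...'}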
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import setup_chat_format

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 quantization data type
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B',
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation='eager'
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
model, tokenizer = setup_chat_format(model, tokenizer)
The model and tokenizer are loaded, and bitsandbytes quantizes the Llama 3 8B model to 4 bits.
setup_chat_format is a function imported from TRL that configures the model and tokenizer to use the ChatML conversation format.
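For reference, here is roughly what ChatML looks like once applied (an illustrative snippet; the exact string comes from the template):
sample = [{'role': 'user', 'content': 'Hello'},
          {'role': 'assistant', 'content': 'Hi there!'}]
print(tokenizer.apply_chat_template(sample, tokenize=False))
# <|im_start|>user
# Hello<|im_end|>
# <|im_start|>assistant
# Hi there!<|im_end|>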
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)
This code converts the standard model into a LoRA model by attaching the two small low-rank matrices to each of the target modules.
- r=8 – Defines the rank of the low-rank matrices. For example, a (200,1) matrix has rank 1.
- target_modules – Specifies where the LoRA layers will be added. These layers are the ones that will be fine-tuned.
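Once the LoRA layers are attached, PEFT can report how few parameters will actually be trained:
model.print_trainable_parameters()
# Prints the trainable vs. total parameter counts; with r=8 on an 8B model,
# the trainable share is a small fraction of one percent.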
def formatting_prompts_func(example):
    # Map each turn into the {'role': ..., 'content': ...} chat format:
    # 'from' tells us who is speaking, 'value' holds the message text.
    messages = []
    for converse in example['conversations']:
        role = 'assistant' if converse['from'] == 'gpt' else 'user'
        messages.append({'role': role, 'content': converse['value']})
    example['text'] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

dataset = dataset.map(formatting_prompts_func, num_proc=4)
This goes through all 500 conversations, converts each turn into the {'role': ..., 'content': ...} format, and applies the chat template to produce a single text field per row.
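A quick look at one transformed row verifies that the chat template was applied:
print(dataset[0]['text'][:300])  # start of the ChatML-formatted conversation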
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="model_training_outputs",
        report_to="wandb",
        max_seq_length=512,
        dataset_num_proc=4,
        packing=False,
    ),
)
There are several hyperparameters that determine how well our model performs. However, you don’t need to worry about them too much, as the values here follow the official documentation.
Key Hyperparameters (Abstract Overview):
- peft_config – The LoRA configuration we created.
- per_device_train_batch_size – The number of samples sent to each GPU per step. Increase this if you have a powerful GPU for faster training. With per_device_train_batch_size=1 and gradient_accumulation_steps=2, the effective batch size is 2.
- num_train_epochs – Defines how many times the model trains on the dataset. You can use fractional values (e.g., 0.1 for quick training) or increase it (e.g., 2 for better accuracy). However, be mindful of overfitting.
- max_seq_length – The maximum number of tokens per training sequence; longer examples are truncated.
And finally, to start training, we run:

import wandb

trainer.train()
wandb.finish()
model.config.use_cache = True  # re-enable the KV cache for faster generation
To evaluate our fine-tuned model, we use Weights & Biases for a comprehensive training report. After saving the model with trainer.model.save_pretrained(new_model), we can test it using the following code:

new_model = "llama3-socratic"  # choose any output directory name for the adapters
trainer.model.save_pretrained(new_model)

messages = [{"role": "user", "content": "What is the sum of 2+2"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text.split("assistant")[1])
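If you later want a standalone model for deployment, a common follow-up is to merge the LoRA adapters back into the base weights with PEFT. Here is a sketch, assuming you reload the base model in fp16 (merging is not supported on 4-bit weights) and re-apply the same chat format so the resized embeddings match:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B', torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
base_model, tokenizer = setup_chat_format(base_model, tokenizer)

# Load the saved adapters and fold them into the base weights.
merged_model = PeftModel.from_pretrained(base_model, new_model).merge_and_unload()
merged_model.save_pretrained("merged_model")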
With this, our fine-tuned model is ready! 🎉 We’ve successfully fine-tuned a model and gained hands-on experience with the fundamentals of AI.
And here’s the final repo:
GitHub – dhruv1710/FinetuneSocraticLLM: a project for fine-tuning large language models (specifically Llama 3 8B) on the SocraticChat dataset (github.com).
Conclusion
While new AI technologies continue to emerge, mastering fundamentals such as fine-tuning is crucial for applying AI effectively. In this article, we successfully fine-tuned Llama 3 to perform Socratic questioning. The same technique can be leveraged for high-impact applications, such as fine-tuning models on radiological data to improve disease-detection accuracy.
Keep Learning!
About Me
I’m Dhruv, a 17-year-old builder passionate about applied AI. I’ve been coding since I was 7, dropped out of high school to pursue startups, and have built several AI-driven projects. I love working with technology and sharing insights on cutting-edge AI topics. Follow me for in-depth, knowledge-packed articles on key topics in AI!
Published via Towards AI