
Fine-Tuning 101: Unlocking the Power of AI Customization
Last Updated on March 10, 2025 by Editorial Team
Author(s): Dhruv Tiwari
Originally published on Towards AI.
Fine-tuning is the process of training a model on specific examples to shape its responses in a desired way. However, it should not be used to teach the model new knowledge.
1. What are the methods for fine-tuning?
There are currently three methods for fine-tuning:
- Full parameter training
- Low Rank Adaptation (LoRA)
- Quantized LoRA (QLoRA)
1. Full Parameter Fine-Tuning
Full-parameter fine-tuning is a method of adapting a pre-trained model to a specific task by updating all of its parameters. Unlike parameter-efficient tuning techniques (such as LoRA), which modify only a subset of parameters, full fine-tuning allows the model to learn task-specific knowledge comprehensively.
2. Low Rank Adaptation
LoRA is a technique designed to efficiently fine-tune large models like Llama 3 8B, which has 8 billion parameters. Training such a model is extremely GPU- and memory-intensive because all of the parameters must be stored and updated in memory.
How does LoRA work?
Let’s say we have a weight matrix W of shape (200,200), which means it has:
200 x 200 = 40,000 trainable parameters
Instead of updating this entire matrix, LoRA freezes W and introduces two much smaller matrices:
- A of shape (200,1) → 200 parameters
- B of shape (1,200) → 200 parameters
This gives a total of 200+200 = 400 trainable parameters
Low-Rank Approximation
- Instead of updating 40,000 parameters, we only train 400 parameters.
- However, the matrix multiplication A x B reconstructs a full-sized (200,200) matrix.
A x B = (200,1) x (1,200) = (200,200)
Now what?
We only store and train 400 parameters, yet during the forward pass the model still operates on a full (200,200) matrix, because the product A x B is added to the frozen W. Training 400 parameters instead of 40,000 is what significantly reduces memory usage and speeds up fine-tuning.
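To make this concrete, here is a minimal sketch of the idea in PyTorch (a toy layer for illustration, not the actual PEFT implementation):
import torch
import torch.nn as nn

class ToyLoRALinear(nn.Module):
    def __init__(self, in_features=200, out_features=200, rank=1):
        super().__init__()
        # The pre-trained weight W is frozen: it receives no gradients.
        self.W = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Only the two small low-rank matrices A and B are trained.
        # B starts at zero so the initial update A @ B contributes nothing.
        self.A = nn.Parameter(torch.randn(out_features, rank))  # (200, 1)
        self.B = nn.Parameter(torch.zeros(rank, in_features))   # (1, 200)

    def forward(self, x):
        # The forward pass uses W plus the low-rank update A @ B,
        # which reconstructs a full (200, 200) delta on the fly.
        return x @ (self.W + self.A @ self.B).T

layer = ToyLoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 400 = 200 (A) + 200 (B)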
3. Quantized LoRA
It’s straightforward: we quantize the base model to 4 bits before applying LoRA.
Final model = Quantization to 4 bit + LoRA
What does this help with?
Although LoRA reduces the number of trainable parameters, we still have to load all of the model’s parameters onto the GPU for training. Quantizing the weights to 4 bits shrinks them by roughly 4x compared to 16-bit storage (8x compared to FP32).
What is Quantization?
Normally, the model’s parameters are stored in FP32, i.e., floating point with 32 bits, which preserves accuracy during training. Quantization reduces each weight to 4 bits (the NF4 data type). While this costs some precision, the loss is minor since the parameters are already trained.
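As a rough illustration, here is a minimal sketch of the idea behind quantization, using simple absmax rounding to 4 bits (the real NF4 scheme used by QLoRA is more sophisticated, with normal-distribution-aware levels and block-wise scaling):
import torch

def absmax_quantize_4bit(w):
    # Map float weights onto 15 integer levels (-7..7) using a per-tensor scale.
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights; the small rounding error is the accuracy cost.
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = absmax_quantize_4bit(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # small reconstruction error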
Now let’s get started with the practical implementation.
2. The Process of Fine-Tuning an LLM
Libraries that are used:
- Datasets — Loading the dataset
- Transformers — Loading the LLM and storing it
- PEFT — LoRA configuration
- TRL — Training the model using Supervised Fine-Tuning
- Optional: Unsloth — Fast fine-tuning
We will be working with the SocraticChat dataset:
from datasets import load_dataset
dataset = load_dataset('FreedomIntelligence/SocraticChat', split='train[0:500]')
This loads the first 500 rows of the dataset.
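A quick inspection confirms the structure we will rely on later: each row has a 'conversations' list of turns, where 'from' names the speaker and 'value' holds the message text.
print(dataset)                         # Dataset({features: [...], num_rows: 500})
print(dataset[0]['conversations'][0])  # e.g. {'from': 'human', 'value': '...'}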
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import setup_chat_format

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 quantization data type
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B',
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation='eager'
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
model, tokenizer = setup_chat_format(model, tokenizer)
The model and tokenizer are loaded, and bitsandbytes quantizes the Llama 3 8B model to 4 bits.
setup_chat_format is a function imported from TRL that configures the model and tokenizer to use the ChatML conversation format.
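For reference, here is roughly what ChatML looks like once applied (an illustrative snippet; the exact string comes from the template):
sample = [{'role': 'user', 'content': 'Hello'},
          {'role': 'assistant', 'content': 'Hi there!'}]
print(tokenizer.apply_chat_template(sample, tokenize=False))
# <|im_start|>user
# Hello<|im_end|>
# <|im_start|>assistant
# Hi there!<|im_end|>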
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)
This code converts the standard model into a LoRA model by attaching the two small low-rank matrices to each of the target modules.
- r=8 – Defines the rank of the low-rank matrices. For example, a (200,1) matrix has rank 1.
- target_modules – Specifies where the LoRA layers will be added. These layers are the ones that will be fine-tuned.
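Once the LoRA layers are attached, PEFT can report how few parameters will actually be trained:
model.print_trainable_parameters()
# Prints the trainable vs. total parameter counts; with r=8 on an 8B model,
# the trainable share is a small fraction of one percent.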
def formatting_prompts_func(example):
    # Map each turn into the {'role': ..., 'content': ...} chat format:
    # 'from' tells us who is speaking, 'value' holds the message text.
    messages = []
    for converse in example['conversations']:
        role = 'assistant' if converse['from'] == 'gpt' else 'user'
        messages.append({'role': role, 'content': converse['value']})
    example['text'] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

dataset = dataset.map(formatting_prompts_func, num_proc=4)
This goes through all 500 conversations, converts each turn into the {'role': ..., 'content': ...} format, and applies the chat template to produce a single text field per row.
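A quick look at one transformed row verifies that the chat template was applied:
print(dataset[0]['text'][:300])  # start of the ChatML-formatted conversation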
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="model_training_outputs",
        report_to="wandb",
        max_seq_length=512,
        dataset_num_proc=4,
        packing=False,
    ),
)
There are several hyperparameters that determine how well our model performs. However, you don’t need to worry about them too much, as the values here follow the official documentation.
Key Hyperparameters (Abstract Overview):
- peft_config – The LoRA configuration we created.
- per_device_train_batch_size – The number of samples sent to each GPU per step. Increase this if you have a powerful GPU for faster training. With per_device_train_batch_size=1 and gradient_accumulation_steps=2, the effective batch size is 2.
- num_train_epochs – Defines how many times the model trains on the dataset. You can use fractional values (e.g., 0.1 for quick training) or increase it (e.g., 2 for better accuracy). However, be mindful of overfitting.
- max_seq_length – The maximum number of tokens per training sequence; longer examples are truncated.
And finally, to start training, we run:

import wandb

trainer.train()
wandb.finish()
model.config.use_cache = True  # re-enable the KV cache for faster generation
To evaluate our fine-tuned model, we use Weights & Biases for a comprehensive training report. After saving the model with trainer.model.save_pretrained(new_model), we can test it using the following code:

new_model = "llama3-socratic"  # choose any output directory name for the adapters
trainer.model.save_pretrained(new_model)

messages = [{"role": "user", "content": "What is the sum of 2+2"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text.split("assistant")[1])
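If you later want a standalone model for deployment, a common follow-up is to merge the LoRA adapters back into the base weights with PEFT. Here is a sketch, assuming you reload the base model in fp16 (merging is not supported on 4-bit weights) and re-apply the same chat format so the resized embeddings match:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B', torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
base_model, tokenizer = setup_chat_format(base_model, tokenizer)

# Load the saved adapters and fold them into the base weights.
merged_model = PeftModel.from_pretrained(base_model, new_model).merge_and_unload()
merged_model.save_pretrained("merged_model")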
With this, our fine-tuned model is ready! 🎉 We’ve successfully fine-tuned a model and gained hands-on experience with the fundamentals of AI.
And here’s the final repo:
GitHub – dhruv1710/FinetuneSocraticLLM: a project for fine-tuning large language models (specifically Llama 3 8B) on the SocraticChat dataset (github.com).
Conclusion
While new AI technologies continue to emerge, mastering fundamentals such as fine-tuning is crucial for applying AI effectively. In this article, we successfully fine-tuned Llama 3 to perform Socratic questioning. The same technique can be leveraged for high-impact applications, such as fine-tuning models on radiological data to improve disease-detection accuracy.
Keep Learning!
About Me
I’m Dhruv, a 17-year-old builder passionate about applied AI. I’ve been coding since I was 7, dropped out of high school to pursue startups, and have built several AI-driven projects. I love working with technology and sharing insights on cutting-edge AI topics. Follow me for in-depth, knowledge-packed articles on key topics in AI!
Published via Towards AI