
From Quantization to Inference: A Beginner's Guide to Practical Fine-Tuning (QLoRA with Mistral 7B)

Last Updated on April 17, 2025 by Editorial Team

Author(s): Akhil Shekkari

Originally published on Towards AI.


Hi! If you are a beginner looking for a detailed fine-tuning tutorial, you are in the right place. I will explain the concepts and walk you through the code.

Go through my previous blogs if you want a refresher on:

  1. Quantization (math details and intuition): Link
  2. LoRA (theory and intuition): Link

Contents:

1. Setting up the Colab notebook
2. Checking if the model is indeed loaded in 4-bit
3. Playing with the tokenizer and understanding outputs
4. Checking model output before fine-tuning
5. Loading and inspecting the dataset for fine-tuning
6. Formatting the dataset to make it ready for fine-tuning
7. Setting up LoRA Adapters
8. Model Training
9. Model Inference

1. Setting up the Colab notebook

I would always suggest getting started in Colab instead of a local environment; it saves you a lot of time and lets you focus on and experiment faster with the actual fine-tuning part. With that said, install the dependencies below.

## Install these dependencies if not already installed.
!pip install bitsandbytes
!pip install datasets
!pip install huggingface_hub
!pip install peft accelerate  ## peft is used later for the LoRA adapters; accelerate enables device_map="auto"

For this tutorial, we are going to use Mistral-7B. You need to accept the license to use this model, which you can do by logging into your Hugging Face account. Accept it here: https://huggingface.co/mistralai/Mistral-7B-v0.1
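Since the model files sit behind that license acceptance, the notebook may also need your Hugging Face access token before it can download the weights. One way to authenticate (assuming you have a token with read access) is:

from huggingface_hub import notebook_login

notebook_login()  ## paste your Hugging Face access token when prompted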

Let’s start with a few imports:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Basic info regarding the imports:

  1. AutoTokenizer converts text into the corresponding token IDs.
  2. AutoModelForCausalLM: causal language models predict the next token based on the previous tokens. This class covers autoregressive models such as GPT-style models, Llama, Mistral, etc.
  3. BitsAndBytesConfig specifies how we want to load the model and how we want the computation to be done. Useful when we have a GPU.
# loading the model in 4-bit

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',           ## instead of uniform buckets, the levels follow a Gaussian-like distribution concentrated around zero
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

Let’s understand what those parameters mean.

  1. load_in_4bit: loads the Mistral model with its weights quantized to 4-bit.
  2. bnb_4bit_quant_type: the most common values are
    a. 'nf4' → NormalFloat4
    b. 'fp4' → Float4

In NF4 the quantization levels are packed more tightly around the center and follow a normal distribution, whereas in FP4 they are spread uniformly.

For example: around the center we find levels like +0.01, +0.02, -0.01, -0.02, while at the extremes we have values like 3.1, 3.5, -4.1, -4.5 (see the small illustrative sketch a little further below).

3. bnb_4bit_use_double_quant: if you have gone through my first blog, you will recall the scaling factor used in quantization and de-quantization. When this parameter is set to True, the per-block scaling factors are themselves quantized, which gives a further reduction in memory footprint.

4. bnb_4bit_compute_dtype: this decides the precision used for the actual computation during training/inference.

Remember:
→ Weights are quantized to 4-bit (super compact)
→ But computation still happens in higher precision (to avoid losing accuracy)
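To build intuition for the NF4 vs. FP4 difference, here is a small, purely illustrative sketch (not the actual bitsandbytes implementation) comparing a uniform 16-level grid with one spaced by normal-distribution quantiles:

import torch

## Illustrative only: an FP4-style grid has evenly spaced levels, while an
## NF4-style grid places its levels at quantiles of a normal distribution,
## so they cluster around zero where most weight values live.
uniform_grid = torch.linspace(-1.0, 1.0, 16)

normal = torch.distributions.Normal(0.0, 1.0)
quantiles = torch.linspace(0.02, 0.98, 16)            ## avoid the infinite tails
normal_grid = normal.icdf(quantiles)
normal_grid = normal_grid / normal_grid.abs().max()   ## rescale to [-1, 1]

print("uniform levels:        ", uniform_grid)
print("normal-quantile levels:", normal_grid)

## Quantizing a weight just means snapping it to the nearest level.
w = torch.tensor(0.03)
print("0.03 snaps to:", normal_grid[(normal_grid - w).abs().argmin()].item())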

Once this is done, we load the model according to this configuration by running the code block below.

model_id = 'mistralai/Mistral-7B-v0.1'
model = AutoModelForCausalLM.from_pretrained(
    model_id,                        ## load this model
    quantization_config=bnb_config,  ## using the 4-bit config defined above
    device_map="auto"
)

Most of the arguments are self-explanatory. device_map="auto" places the model on whatever hardware is available (CUDA if there is an NVIDIA GPU, MPS on Apple Silicon, otherwise CPU).
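As an optional sanity check (assuming the model loaded as above), you can see where the layers ended up and roughly how much memory the quantized model occupies:

print(model.hf_device_map)                       ## layer → device mapping chosen by device_map="auto"
print(model.get_memory_footprint() / 1e9, "GB")  ## approximate memory footprint of the loaded model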

2. Checking if the model is indeed loaded in 4-bit

for name, module in model.named_modules():
    if "Linear" in str(type(module)) or "4bit" in str(type(module)):
        print(f"{name} -> {type(module)}")

Now, what does the above chunk of code do?

We will be using named_modules() in various situations.

For Example:

1. Inspecting Model Structure
2. Checking layer types
3. Applying LoRA
4. Replacing Layers
5. Adding hooks for visualization, logging, activation tracking
6. Freezing specific layers.

Note: there is also a related function, named_parameters(), which yields the actual weights and biases.
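For example, a common use of named_parameters() is to count how many parameters are trainable versus frozen; a quick sketch with the model we just loaded:

## named_parameters() yields (name, tensor) pairs for every weight and bias.
trainable = sum(p.numel() for _, p in model.named_parameters() if p.requires_grad)
total = sum(p.numel() for _, p in model.named_parameters())
print(f"trainable: {trainable:,} / total: {total:,}")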

Upon running the above block of code we get:

bitsandbytes confirmation

Now we know that we have loaded the quantized model.

Up next, we load the tokenizer.

## Now we set up the tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

Every model ships with some kind of tokenizer (which converts text into numbers).
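For instance, you can inspect the vocabulary size and special tokens; Mistral’s tokenizer ships without a dedicated pad token, which is why we reuse the EOS token above:

print(tokenizer.vocab_size)                         ## size of the vocabulary
print(tokenizer.bos_token, tokenizer.bos_token_id)  ## start-of-sequence token
print(tokenizer.eos_token, tokenizer.eos_token_id)  ## end-of-sequence token, reused here as the pad token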

3. Playing with tokenizer outputs

sentence = " Quantization reduces model size and speeds up inference"

tokens = tokenizer(sentence)
#print(tokens)
print(f"Input IDs: {tokens['input_ids']}")
print(f"tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'])}")

# decode back to see what original tokens were

decoded = tokenizer.decode(tokens['input_ids'])
print(f'decoded text: {decoded}')

We get the following output when we run the code above.

output

  1. <s> is the start-of-sequence token. The remaining numbers are the token IDs for the different words/sub-words.
  2. tokenizer.decode() reconstructs the original sentence from the IDs.

Let’s see an example of what batching looks like.

sentences = ['Quantization is great',
             'LoRA is much more better than Full Finetuning']
batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt'
)
print(f'Input IDs: {batch["input_ids"]}')
print('Tokens: ')
for ids in batch['input_ids']:
    print(tokenizer.convert_ids_to_tokens(ids))

Let’s observe the output below.

output

1. Here we process two sentences at once, hence the padding.
2. </s> (token id 2) is used as the padding token, since we set pad_token = eos_token earlier.
3. truncation=True cuts off any sequence that exceeds the maximum allowed length.
4. return_tensors='pt' returns the output as PyTorch tensors for further computation.
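One more thing worth printing from the same batch is the attention mask, which marks real tokens with 1 and padded positions with 0 so the model ignores the padding:

print(batch["attention_mask"])  ## 1 = real token, 0 = padding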

4. Checking model output before fine-tuning

The whole purpose of this blog is fine-tuning. Before we do that, let’s check the base model’s output.

prompt_1 = """### Instruction:
How to cook pasta.

### Response:"""


prompt_2 = """### Instruction:
How to win chess championship.

### Response:"""


prompt = [prompt_1, prompt_2]
inputs = tokenizer(prompt,padding = True, truncation = True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,   #### random sampling instead of greedy search; beam search is another option
        temperature=0.7,
        top_p=0.95
    )

# Strip the prompt if you just want the model's generated part
for i, output in enumerate(outputs):
    response = tokenizer.decode(output, skip_special_tokens=True)  #### removes <s>, <PAD>, etc.
    completion = response[len(prompt[i]):].strip()  ### removes the prompt from the decoded text
    print(f"\n Prompt {i+1} Response:")
    print(completion)

Nothing new in the code here; we are checking the outputs for two different prompts.

torch.no_grad() disables gradient tracking during inference, which is what we want since no weights are being updated. After that we just print the outputs. The output looks like:

output response of prompts

It is not giving actual steps; it is generating some kind of HTML and git/file instructions instead. That is a good enough reason for us to fine-tune.

5. Loading and inspecting the dataset for fine-tuning

Okay, we are getting there step by step. Please see the dataset format here: https://huggingface.co/datasets/tatsu-lab/alpaca

from datasets import load_dataset

# load a small subset to test quickly

dataset = load_dataset("tatsu-lab/alpaca", split="train[:1%]")
print(dataset[0]['output']) ### Accessing output only

output

Here the outputs are nicely ordered step by step, without any HTML. We take only 1% of the data, which is enough for practice purposes.

But we need to format this dataset, because each example has multiple fields: instruction, input, and output. Our model understands only a single continuous prompt string and produces a single response.
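If you want to see exactly which columns each example carries before formatting, a quick peek at the subset loaded above looks like this:

print(dataset.column_names)        ## the fields available in each example (instruction, input, output, ...)
print(dataset[0]["instruction"])   ## the task description for the first example
print(dataset[0]["input"])         ## optional extra context (often an empty string)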

6. Formatting the dataset to make it ready for fine-tuning

def format_alpaca(example):
    if example["input"]:
        prompt = f"""### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"""
    else:
        prompt = f"""### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"""
    return {"text": prompt}

# Apply formatting
dataset = dataset.map(format_alpaca)
dataset[0]

After formatting

We can clearly see that after formatting, each example has a single text field containing the full prompt, and that is what will be used for fine-tuning. We do this to teach the model that as soon as it sees:

### Instruction: → read the task
### Response: → generate a meaningful response
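To confirm the formatting worked, you can print the new text field of the first example:

print(dataset[0]["text"])  ## the single prompt string (instruction + response) the model will train on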

Now the next obvious step is to tokenize this text field in the dataset. Run the code below.

def tokenize(example):
    return tokenizer(
        example['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

tokenized_dataset = dataset.map(tokenize, batched=True)

print(tokenized_dataset[0]["input_ids"][:10])
print(tokenizer.decode(tokenized_dataset[0]["input_ids"]))

output after running the above code

I have cut the printout down to the first ten input_ids; to see the full IDs, remove the slicing.

7. Setting up LoRA Adapters

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,                                 #### reducing the rank gives you fewer trainable parameters
    lora_alpha=16,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.1,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

We just define the rank and the other useful parameters. Please refer to my blog linked at the start to understand the theory and intuition.

output

We see that we train only about 0.18% of the parameters (the LoRA parameters; the original weights stay frozen). This significantly reduces the memory footprint and speeds up training.
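If you are curious where the adapters actually went, you can filter named_modules() for the LoRA sub-layers (an optional check; with the config above they live inside the targeted q_proj and v_proj layers):

lora_modules = [name for name, _ in model.named_modules()
                if "lora_A" in name or "lora_B" in name]
print(len(lora_modules), "LoRA sub-modules, e.g.:", lora_modules[:4])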

8. Model Training

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./qlora-mistral-checkpoint",  # Folder to save checkpoints and logs
    per_device_train_batch_size=4,            # Batch size per GPU/core
    gradient_accumulation_steps=4,            # Accumulate gradients over 4 steps → simulates batch_size = 4 x 4 = 16
    num_train_epochs=1,                       # Train for 1 epoch over the dataset
    learning_rate=2e-4,                       # Learning rate (typically 1e-4 to 2e-4 for LoRA)
    fp16=True,                                # Use mixed precision (float16) for faster training and lower memory
    logging_steps=10,                         # Log loss every 10 steps
    save_steps=50,                            # Save a checkpoint every 50 steps
    save_total_limit=1,                       # Keep only the most recent checkpoint (frees up disk space)
    report_to="none"                          # Disable logging to W&B, TensorBoard, etc. (can set to "wandb" if needed)
)

trainer = Trainer(
    model=model,                              # The model with LoRA adapters applied
    args=training_args,                       # Training configuration defined above
    train_dataset=tokenized_dataset,          # Tokenized dataset with input_ids, attention_mask
    tokenizer=tokenizer,                      # Tokenizer for saving and syncing with the model
    data_collator=DataCollatorForLanguageModeling(  # Prepares batches with correct padding/masking
        tokenizer, mlm=False                  # mlm=False → causal LM task (not BERT-style masking)
    )
)
trainer.train()

I tried to add comments for all the parameters we are using; I felt that was sufficient, as they are fairly simple. However, let me explain what gradient accumulation does here: the model only updates its weights once every 4 steps, after seeing a total batch of 4 x 4 = 16 samples. This way it is memory efficient and can train even on a small GPU.
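In code terms, the effective batch size the optimizer sees is simply the product of those two arguments:

effective_batch = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
print(effective_batch)  ## 4 x 4 = 16 samples per weight update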

DataCollatorForLanguageModeling:

  1. This is very important because it handles several things for us. Earlier in the blog we saw padding and masking; the data collator takes care of that job when building batches, so padded positions are treated as padding (and ignored) rather than as real tokens.
  2. MLM (masked language modeling) randomly hides some input tokens and asks the model to predict the hidden tokens themselves, BERT-style. Since we are training a causal (next-token) model, we set mlm=False.
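To see what the collator actually produces, you can call it directly on a couple of tokenized examples (an optional check): it pads them into one batch and builds the labels, setting positions equal to the pad token id to -100 so the loss ignores them.

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
features = [{"input_ids": tokenized_dataset[i]["input_ids"],
             "attention_mask": tokenized_dataset[i]["attention_mask"]}
            for i in range(2)]
collated = collator(features)
print(collated.keys())        ## input_ids, attention_mask, labels
print(collated["labels"][0])  ## -100 marks positions the loss will ignore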
Training output (screenshot)

Training for just one epoch already gave a noticeable result. Let’s see how to do inference.

9. Model Inference

Save the model first.

model.save_pretrained("qlora-mistral-lora")
tokenizer.save_pretrained("qlora-mistral-lora")

Now load the base model plus the saved QLoRA adapters.

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

# Load LoRA adapters on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "qlora-mistral-lora")
tokenizer = AutoTokenizer.from_pretrained("qlora-mistral-lora")
tokenizer.pad_token = tokenizer.eos_token

Finally, we have reached the fun part. Let’s check the output on the same prompts.

prompt = """### Instruction:
How to become chess champion
### Response:"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,   # sampling instead of greedy decoding
        temperature=0.7,
        top_p=0.9
    )

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(" Model Output:\n", output_text[len(prompt):].strip())

Final output image

Here, we see the output is much more organized, readable, and relevant. I hope you enjoyed the blog and learned something. Let me know in the comments if you have any questions.

If you found this blog helpful, please clap, share, subscribe, and stay tuned for the next one.

My LinkedIn: https://www.linkedin.com/in/akhilshekkari/


Published via Towards AI
