
From Quantization to Inference: A Beginner's Guide to Practical Fine-Tuning (QLoRA with Mistral 7B)

Last Updated on April 17, 2025 by Editorial Team

Author(s): Akhil Shekkari

Originally published on Towards AI.


Hi! If you are a beginner looking for a detailed fine-tuning tutorial, you are in the right place. I will explain the concepts and walk you through the code.

Go through my previous blogs if you want a refresher on:

  1. Quantization (math details and intuition): Link
  2. LoRA (theory and intuition): Link

Contents:

1. Setting up the Colab notebook
2. Checking if the model is indeed loaded in 4-bit
3. Playing with the tokenizer and understanding outputs
4. Checking model output before fine-tuning
5. Loading and inspecting the dataset for fine-tuning
6. Formatting the dataset to make it ready for fine-tuning
7. Setting up LoRA Adapters
8. Model Training
9. Model Inference

1. Setting up the Colab notebook

I would always suggest getting started in Colab instead of a local environment; it saves you a lot of time and lets you focus on and experiment faster with the actual fine-tuning part. With that said, install the dependencies below.

## Install these dependencies if not already installed.
!pip install bitsandbytes
!pip install datasets
!pip install huggingface_hub
!pip install peft accelerate  ## peft is used later for the LoRA adapters; accelerate enables device_map="auto"

For this tutorial, we are going to use Mistral-7B. You need to accept the license to use this model, which you can do by logging into your Hugging Face account. Accept it here: https://huggingface.co/mistralai/Mistral-7B-v0.1
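Since the model files sit behind that license acceptance, the notebook may also need your Hugging Face access token before it can download the weights. One way to authenticate (assuming you have a token with read access) is:

from huggingface_hub import notebook_login

notebook_login()  ## paste your Hugging Face access token when prompted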

Let’s start with a few imports:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Basic info regarding the imports:

  1. AutoTokenizer converts text into the corresponding token IDs.
  2. AutoModelForCausalLM: causal language models predict the next token based on the previous tokens. This class covers autoregressive models such as GPT-style models, Llama, Mistral, etc.
  3. BitsAndBytesConfig specifies how we want to load the model and how we want the computation to be done. Useful when we have a GPU.
# loading the model in 4-bit

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',           ## instead of uniform buckets, the levels follow a Gaussian-like distribution concentrated around zero
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

Let’s understand what those parameters mean.

  1. load_in_4bit: loads the Mistral model with its weights quantized to 4-bit.
  2. bnb_4bit_quant_type: the most common values are
    a. 'nf4' → NormalFloat4
    b. 'fp4' → Float4

In NF4 the quantization levels are packed more tightly around the center and follow a normal distribution, whereas in FP4 they are spread uniformly.

For example: around the center we find levels like +0.01, +0.02, -0.01, -0.02, while at the extremes we have values like 3.1, 3.5, -4.1, -4.5 (see the small illustrative sketch a little further below).

3. bnb_4bit_use_double_quant: if you have gone through my first blog, you will recall the scaling factor used in quantization and de-quantization. When this parameter is set to True, the per-block scaling factors are themselves quantized, which gives a further reduction in memory footprint.

4. bnb_4bit_compute_dtype: this decides the precision used for the actual computation during training/inference.

Remember:
→ Weights are quantized to 4-bit (super compact)
→ But computation still happens in higher precision (to avoid losing accuracy)
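To build intuition for the NF4 vs. FP4 difference, here is a small, purely illustrative sketch (not the actual bitsandbytes implementation) comparing a uniform 16-level grid with one spaced by normal-distribution quantiles:

import torch

## Illustrative only: an FP4-style grid has evenly spaced levels, while an
## NF4-style grid places its levels at quantiles of a normal distribution,
## so they cluster around zero where most weight values live.
uniform_grid = torch.linspace(-1.0, 1.0, 16)

normal = torch.distributions.Normal(0.0, 1.0)
quantiles = torch.linspace(0.02, 0.98, 16)            ## avoid the infinite tails
normal_grid = normal.icdf(quantiles)
normal_grid = normal_grid / normal_grid.abs().max()   ## rescale to [-1, 1]

print("uniform levels:        ", uniform_grid)
print("normal-quantile levels:", normal_grid)

## Quantizing a weight just means snapping it to the nearest level.
w = torch.tensor(0.03)
print("0.03 snaps to:", normal_grid[(normal_grid - w).abs().argmin()].item())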

Once this is done, we load the model according to this configuration by running the code block below.

model_id = 'mistralai/Mistral-7B-v0.1'
model = AutoModelForCausalLM.from_pretrained(
    model_id,                        ## load this model
    quantization_config=bnb_config,  ## using the 4-bit config defined above
    device_map="auto"
)

Most of the arguments are self-explanatory. device_map="auto" places the model on whatever hardware is available (CUDA if there is an NVIDIA GPU, MPS on Apple Silicon, otherwise CPU).
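As an optional sanity check (assuming the model loaded as above), you can see where the layers ended up and roughly how much memory the quantized model occupies:

print(model.hf_device_map)                       ## layer → device mapping chosen by device_map="auto"
print(model.get_memory_footprint() / 1e9, "GB")  ## approximate memory footprint of the loaded model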

2. Checking if the model is indeed loaded in 4-bit

for name, module in model.named_modules():
    if "Linear" in str(type(module)) or "4bit" in str(type(module)):
        print(f"{name} -> {type(module)}")

Now, what does the above chunk of code do?

We will be using named_modules() in various situations.

For Example:

1. Inspecting Model Structure
2. Checking layer types
3. Applying LoRA
4. Replacing Layers
5. Adding hooks for visualization, logging, activation tracking
6. Freezing specific layers.

Note: there is also a related function, named_parameters(), which yields the actual weights and biases.
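For example, a common use of named_parameters() is to count how many parameters are trainable versus frozen; a quick sketch with the model we just loaded:

## named_parameters() yields (name, tensor) pairs for every weight and bias.
trainable = sum(p.numel() for _, p in model.named_parameters() if p.requires_grad)
total = sum(p.numel() for _, p in model.named_parameters())
print(f"trainable: {trainable:,} / total: {total:,}")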

Upon running the above block of code we get:

bitsandbytes confirmation

Now we know that we have loaded the quantized model.

Up next, we load the tokenizer.

## Now we set up the tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

Every model ships with some kind of tokenizer (which converts text into numbers).
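For instance, you can inspect the vocabulary size and special tokens; Mistral’s tokenizer ships without a dedicated pad token, which is why we reuse the EOS token above:

print(tokenizer.vocab_size)                         ## size of the vocabulary
print(tokenizer.bos_token, tokenizer.bos_token_id)  ## start-of-sequence token
print(tokenizer.eos_token, tokenizer.eos_token_id)  ## end-of-sequence token, reused here as the pad token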

3. Playing with tokenizer outputs

sentence = " Quantization reduces model size and speeds up inference"

tokens = tokenizer(sentence)
#print(tokens)
print(f"Input IDs: {tokens['input_ids']}")
print(f"tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'])}")

# decode back to see what original tokens were

decoded = tokenizer.decode(tokens['input_ids'])
print(f'decoded text: {decoded}')

We get the following output when we run the code above.

output

  1. <s> is the start-of-sequence token. The remaining numbers are the token IDs for the different words/sub-words.
  2. tokenizer.decode() reconstructs the original sentence from the IDs.

Let’s see an example of what batching looks like.

sentences = ['Quantization is great',
             'LoRA is much more better than Full Finetuning']
batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt'
)
print(f'Input IDs: {batch["input_ids"]}')
print('Tokens: ')
for ids in batch['input_ids']:
    print(tokenizer.convert_ids_to_tokens(ids))

Let’s observe the output below.

output

1. Here we process two sentences at once, hence the padding.
2. </s> (token id 2) is used as the padding token, since we set pad_token = eos_token earlier.
3. truncation=True cuts off any sequence that exceeds the maximum allowed length.
4. return_tensors='pt' returns the output as PyTorch tensors for further computation.
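One more thing worth printing from the same batch is the attention mask, which marks real tokens with 1 and padded positions with 0 so the model ignores the padding:

print(batch["attention_mask"])  ## 1 = real token, 0 = padding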

4. Checking model output before fine-tuning

The whole purpose of this blog is fine-tuning. Before we do that, let’s check the base model’s output.

prompt_1 = """### Instruction:
How to cook pasta.

### Response:"""


prompt_2 = """### Instruction:
How to win chess championship.

### Response:"""


prompt = [prompt_1, prompt_2]
inputs = tokenizer(prompt,padding = True, truncation = True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,   #### random sampling instead of greedy search; beam search is another option
        temperature=0.7,
        top_p=0.95
    )

# Strip the prompt if you just want the model's generated part
for i, output in enumerate(outputs):
    response = tokenizer.decode(output, skip_special_tokens=True)  #### removes <s>, <PAD>, etc.
    completion = response[len(prompt[i]):].strip()  ### removes the prompt from the decoded text
    print(f"\n Prompt {i+1} Response:")
    print(completion)

Nothing new in the code here; we are checking the outputs for two different prompts.

torch.no_grad() disables gradient tracking during inference, which is what we want since no weights are being updated. After that we just print the outputs. The output looks like:

output response of prompts

It is not giving actual steps; it is generating some kind of HTML and git/file instructions instead. That is a good enough reason for us to fine-tune.

5. Loading and inspecting the dataset for fine-tuning

Okay, we are getting there step by step. Please see the dataset format here: https://huggingface.co/datasets/tatsu-lab/alpaca

from datasets import load_dataset

# load a small subset to test quickly

dataset = load_dataset("tatsu-lab/alpaca", split="train[:1%]")
print(dataset[0]['output']) ### Accessing output only

output

Here the outputs are nicely ordered step by step, without any HTML. We take only 1% of the data, which is enough for practice purposes.

But we need to format this dataset, because each example has multiple fields: instruction, input, and output. Our model understands only a single continuous prompt string and produces a single response.
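If you want to see exactly which columns each example carries before formatting, a quick peek at the subset loaded above looks like this:

print(dataset.column_names)        ## the fields available in each example (instruction, input, output, ...)
print(dataset[0]["instruction"])   ## the task description for the first example
print(dataset[0]["input"])         ## optional extra context (often an empty string)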

6. Formatting the dataset to make it ready for fine-tuning

def format_alpaca(example):
    if example["input"]:
        prompt = f"""### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"""
    else:
        prompt = f"""### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"""
    return {"text": prompt}

# Apply formatting
dataset = dataset.map(format_alpaca)
dataset[0]

After formatting

We can clearly see that after formatting, each example has a single text field containing the full prompt, and that is what will be used for fine-tuning. We do this to teach the model that as soon as it sees:

### Instruction: → read the task
### Response: → generate a meaningful response
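To confirm the formatting worked, you can print the new text field of the first example:

print(dataset[0]["text"])  ## the single prompt string (instruction + response) the model will train on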

Now the next obvious step is to tokenize this text field in the dataset. Run the code below.

def tokenize(example):
    return tokenizer(
        example['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

tokenized_dataset = dataset.map(tokenize, batched=True)

print(tokenized_dataset[0]["input_ids"][:10])
print(tokenizer.decode(tokenized_dataset[0]["input_ids"]))

output after running the above code

I have cut the printout down to the first ten input_ids; to see the full IDs, remove the slicing.

7. Setting up LoRA Adapters

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,                                 #### reducing the rank gives you fewer trainable parameters
    lora_alpha=16,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.1,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

We just define the rank and the other useful parameters. Please refer to my blog linked at the start to understand the theory and intuition.

output

We see that we train only about 0.18% of the parameters (the LoRA parameters; the original weights stay frozen). This significantly reduces the memory footprint and speeds up training.
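If you are curious where the adapters actually went, you can filter named_modules() for the LoRA sub-layers (an optional check; with the config above they live inside the targeted q_proj and v_proj layers):

lora_modules = [name for name, _ in model.named_modules()
                if "lora_A" in name or "lora_B" in name]
print(len(lora_modules), "LoRA sub-modules, e.g.:", lora_modules[:4])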

8. Model Training

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./qlora-mistral-checkpoint",  # Folder to save checkpoints and logs
    per_device_train_batch_size=4,            # Batch size per GPU/core
    gradient_accumulation_steps=4,            # Accumulate gradients over 4 steps → simulates batch_size = 4 x 4 = 16
    num_train_epochs=1,                       # Train for 1 epoch over the dataset
    learning_rate=2e-4,                       # Learning rate (typically 1e-4 to 2e-4 for LoRA)
    fp16=True,                                # Use mixed precision (float16) for faster training and lower memory
    logging_steps=10,                         # Log loss every 10 steps
    save_steps=50,                            # Save a checkpoint every 50 steps
    save_total_limit=1,                       # Keep only the most recent checkpoint (frees up disk space)
    report_to="none"                          # Disable logging to W&B, TensorBoard, etc. (can set to "wandb" if needed)
)

trainer = Trainer(
    model=model,                              # The model with LoRA adapters applied
    args=training_args,                       # Training configuration defined above
    train_dataset=tokenized_dataset,          # Tokenized dataset with input_ids, attention_mask
    tokenizer=tokenizer,                      # Tokenizer for saving and syncing with the model
    data_collator=DataCollatorForLanguageModeling(  # Prepares batches with correct padding/masking
        tokenizer, mlm=False                  # mlm=False → causal LM task (not BERT-style masking)
    )
)
trainer.train()

I tried to add comments for all the parameters we are using; I felt that was sufficient, as they are fairly simple. However, let me explain what gradient accumulation does here: the model only updates its weights once every 4 steps, after seeing a total batch of 4 x 4 = 16 samples. This way it is memory efficient and can train even on a small GPU.
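In code terms, the effective batch size the optimizer sees is simply the product of those two arguments:

effective_batch = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
print(effective_batch)  ## 4 x 4 = 16 samples per weight update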

DataCollatorForLanguageModeling:

  1. This is very important because it handles several things for us. Earlier in the blog we saw padding and masking; the data collator takes care of that job when building batches, so padded positions are treated as padding (and ignored) rather than as real tokens.
  2. MLM (masked language modeling) randomly hides some input tokens and asks the model to predict the hidden tokens themselves, BERT-style. Since we are training a causal (next-token) model, we set mlm=False.
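To see what the collator actually produces, you can call it directly on a couple of tokenized examples (an optional check): it pads them into one batch and builds the labels, setting positions equal to the pad token id to -100 so the loss ignores them.

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
features = [{"input_ids": tokenized_dataset[i]["input_ids"],
             "attention_mask": tokenized_dataset[i]["attention_mask"]}
            for i in range(2)]
collated = collator(features)
print(collated.keys())        ## input_ids, attention_mask, labels
print(collated["labels"][0])  ## -100 marks positions the loss will ignore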
Training output (screenshot)

Training for just one epoch already gave a noticeable result. Let’s see how to do inference.

9. Model Inference

Save the model first.

model.save_pretrained("qlora-mistral-lora")
tokenizer.save_pretrained("qlora-mistral-lora")

Now load the base model plus the saved QLoRA adapters.

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

# Load LoRA adapters on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "qlora-mistral-lora")
tokenizer = AutoTokenizer.from_pretrained("qlora-mistral-lora")
tokenizer.pad_token = tokenizer.eos_token

Finally, we have reached the fun part. Let’s check the output on the same prompts.

prompt = """### Instruction:
How to become chess champion
### Response:"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,   # sampling instead of greedy decoding
        temperature=0.7,
        top_p=0.9
    )

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(" Model Output:\n", output_text[len(prompt):].strip())

Final output image

Here, we see the output is much more organized, readable, and relevant. I hope you enjoyed the blog and learned something. Let me know in the comments if you have any questions.

If you found this blog helpful, please clap, share, subscribe, and stay tuned for the next one.

My LinkedIn: https://www.linkedin.com/in/akhilshekkari/


Published via Towards AI
