Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!


Fine-Tuning a Llama-2 7B Model for Python Code Generation
Data Science   Latest   Machine Learning

Fine-Tuning a Llama-2 7B Model for Python Code Generation

Last Updated on August 11, 2023 by Editorial Team

Author(s): Eduardo Muñoz

Originally published on Towards AI.

A demo on how to fine-tune the new Llama-2 using PEFT, QLoRa, and the Huggingface utilities

Image by author created in

About 2 weeks ago, the world of generative AI was shocked by the company Meta's release of the new Llama-2 AI model. Its predecessor, Llama-1, was a breaking point in the LLM industry, as with the release of its weights along with new finetuning techniques, there was a massive creation of open-source LLM models that led to the emergence of high-performance models such as Vicuna, Koala, …

In this article, we will briefly discuss some of this model's relevant points but will focus on showing how we can quickly train the model for a specific task using libraries and tools standard in this world. We will not make an exhaustive analysis of the new model, there are already numerous articles published on the subject.

New Llama-2 model

In mid-July, Meta released its new family of pre-trained and finetuned models called Llama-2, with an open source and commercial character to facilitate its use and expansion. The base model was released with a chat version and sizes 7B, 13B, and 70B. Together with the models, the corresponding papers were published describing their characteristics and relevant points of the learning process, which provide very interesting information on the subject.

An updated version of Llama 1, trained on a new mix of publicly available data. The pretraining corpus size was increased by 40%, the model’s context length was doubled, and grouped-query attention was adopted. Variants with 7B, 13B, and 70B parameters are released, along with 34B variants reported in the paper but not released.[1]

For pre-training, 40% more tokens were used, reaching 2T, the context length was doubled and the grouped-query attention (GQA) technique was applied to speed up inference on the heavier 70B model. On the standard transformer architecture, RMSNorm normalization, SwiGLU activation, and rotatory positional embedding are used, the context length reaches 4096 tokens, and an Adam optimizer is applied with a cosine learning rate schedule, a weight decay of 0.1 and gradient clipping.

The Supervised Fine-Tuning (SFT) stage is characterized by a prioritization of quality examples over quantity, as numerous reports show that the use of high-quality data results in improved final model performance.
Finally, a Reinforcement Learning with Human Feedback (RLHF) step is applied to align the model with user preferences. A multitude of examples are collected where annotators select their preferred model output over a binary comparison. This data is used to train a reward model, where the focus is on helpfulness and safety.

In short:

· Trained on 2T Tokens

· Commercial use allowed

· Chat models for dialogue use cases

· 4096 default context window (can be increased)

· 7B, 13B & 70B parameter version

· 70B model adopted grouped-query attention (GQA)

· Chat models can use tools & plugins

· LLaMA 2-CHAT as good as OpenAI ChatGPT

The dataset for tuning

For our tuning process, we will take a dataset containing about 18,000 examples where the model is asked to build a Python code that solves a given task. This is an extraction of the original dataset [2], where only the Python language examples are selected. Each row contains the description of the task to be solved, an example of data input to the task if applicable, and the generated code fragment that solves the task is provided [3].

# Load dataset from the hub
dataset = load_dataset(dataset_name, split=dataset_split)
# Show dataset size
print(f"dataset size: {len(dataset)}")
# Show an example

Creating the prompt

To carry out an instruction fine-tuning, we must transform each one of our data examples as if it were an instruction, outlining its main sections as follows:

def format_instruction(sample):
return f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:

### Input:

### Response:


### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
Develop a Python program that prints "Hello, World!" whenever it is run.

### Input:

### Response:
#Python program to print "Hello World!"

print("Hello, World!")
Image by Irvan Smith from Unsplash

Fine-tuning the model

To carry out this stage, we have used the Google Colab environment, where we have developed a notebook that allows us to run the training in an interactive way and also a Python script to run the training in unattended mode. For the first test runs, a T4 instance with a high RAM capacity is enough, but when it comes to running the whole dataset and epochs, we have opted to use an A100 instance in order to speed up the training and ensure that its execution time is reasonable.

In order to be able to share the model, we will log in to the Huggingface hub using the appropriate token, so that at the end of the whole process, we will upload the model files so that they can be shared with the rest of the users.

from huggingface_hub import login
from dotenv import load_dotenv
import os

# Load the enviroment variables
# Login to the Hugging Face Hub

Fine-tuning techniques: PEFT, Lora, and QLora

In recent months, some papers have appeared showing how PEFT techniques can be used to train large language models with a drastic reduction of RAM requirements and consequently allowing fine-tuning of these models on a single GPU of reasonable size.
The usual steps to train an LLM consist, first, an intensive pre-training on billions or trillions of tokens to obtain a foundation model, and then a fine-tuning is performed on this model to specialize it on a downstream task. In this fine-tuning phase is where the PEFT technique has its purpose.

Parameter Efficient Fine-Tuning (PEFT) allows us to considerably reduce RAM and storage requirements by only fine-tuning a small number of additional parameters, with virtually all model parameters remaining frozen. PEFT has been found to produce good generalization with relatively low-volume datasets. Furthermore, it enhances the reusability and portability of the model, as the small checkpoints obtained can be easily added to the base model, and the base model can be easily fine-tuned and reused in multiple scenarios by adding the PEFT parameters. Finally, since the base model is not adjusted, all the knowledge acquired in the pre-training phase is preserved, thus avoiding catastrophic forgetting.

Most widely used PEFT techniques aim to keep the pre-trained base model untouched and add new layers or parameters on top of it. These layers are called “Adapters” and the technique of their adjustment “adapter-tuning”, we add these layers to the pre-trained base model and only train the parameters of these new layers. However, a serious problem with this approach is that these layers lead to increased latency in the inference phase, which makes the process inefficient in many scenarios.
In the LoRa technique, a Low-Rank Adaptation of Large Language Models, the idea is not to include new layers but to add values to the parameters in a way that avoids this scary problem of latency in the inference phase. LoRa trains and stores the changes of the additional weights while freezing all the weights of the pre-trained model. Therefore, we train a new weights matrix with the changes in the pre-trained model matrix, and this new matrix is decomposed into 2 Low-rank matrices as explained here:

Let all the parameters of a LLM in the matrix W0 and the additional weight changes in the matrix ∆W, the final weights become W0 + ∆W. The authors of LoRA [1] proposed that the change in weight change matrix ∆W can be decomposed into two low-rank matrices A and B. LoRA does not train the parameters in ∆W directly, but the parameters in A and B. So the number of trainable parameters is much less. Hypothetically suppose the dimension of A is 100 * 1 and that of B is 1 * 100, the number of parameters in ∆W will be 100 * 100 = 10000. There are only 100 + 100 = 200 to train in A and B, instead of 10000 to train in ∆W

[4]. Explanation by Dr. Dataman in Fine-tuning a GPT — LoRA

The size of these low-rank matrices is defined by the r parameter. The smaller this value is, the fewer parameters to train, therefore, less effort and faster, but on the other hand, a potential loss of information and performance.
If you want a more detailed explanation, you can refer to the original paper, or there are plenty of articles that explain it in detail, such as [4].

Finally, QLoRa [6] consists of applying quantization to the LoRa method allowing 4-bit normal quantization, nf4, a type optimized for normally distributed weights; double quantization to reduce the memory footprint and the optimization of the NVIDIA unified memory. These are techniques to optimize memory usage to achieve “lighter” and less expensive training.

Image by the author from

Implementing QLoRa in our experiment requires specifying the BitsAndBytes configuration, downloading the pretrained model in 4-bit quantization, and defining a LoraConfig. Finally, we need to retrieve the tokenizer.

# Get the type
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id,
quantization_config=bnb_config, use_cache = False, device_map=device_map)
model.config.pretraining_tp = 1
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Parameters defined,

# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_double_nested_quant = False
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1

And the next steps are well-known for all Hugging Face users, setting up the training arguments, and creating a Trainer. As we are executing an instruction fine-tuning we call to the SFTTrainer method that encapsulates the PEFT model definition and other steps.

# Define the training arguments
args = TrainingArguments(
per_device_train_batch_size=per_device_train_batch_size, # 6 if use_flash_attention else 4,
# Create the trainer
trainer = SFTTrainer(
# train the model
trainer.train() # there will not be a progress bar since tqdm is disabled

# save model in local

The parameters can be found on my GitHub repository, most of them are commonly used in other fine-tuning scripts on LLMs and are the following ones:

# Number of training epochs
num_train_epochs = 1
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = True
# Batch size per GPU for training
per_device_train_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule
lr_scheduler_type = "cosine" #"constant"
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = False
# Save checkpoint every X updates steps
save_steps = 0
# Log every X updates steps
logging_steps = 25
# Disable tqdm
disable_tqdm= True

Merge the base model and the adapter weights

As we mention, we have trained “modification weights” on the base model, our final model requires merging the pretrained model and the adapters in a single model.

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(

# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
# push merged model to the hub

You can find and download the model in my Hugging Face account edumunozsala/llama-2–7b-int4-python-code-20k. Give it a try!

Inferencing or generating Python code

And finally, we will show you how you can download the model from the Hugging Face Hub and call the model to generate an accurate result:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
# Load the model
model = AutoModelForCausalLM.from_pretrained(hf_model_repo, load_in_4bit=True,
# Create an instruction
instruction="Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2."

prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:

### Input:

### Response:

# Tokenize the input
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# Run the model to infere an output
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.5)

# Print the result
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2.

### Input:
arr = []
for i in range(10):
if i % 2 == 0:

### Response:

Generated instruction:
arr = [i for i in range(10) if i % 2 == 0]

Ground truth:
arr = [i for i in range(11) if i % 2 == 0]

Thanks to Maxime Labonne for an excellent article [9] and Philipp Schmid who provides an inspiring code [8]. Their articles are a must-read for everyone interested in Llama 2 and model fine-tuning.

And it is all I have to mention, I hope you find useful this article and claps are welcome!! You can Follow me and Subscribe to my articles, or even connect to me via Linkedin. The code is available in my Github Repository.


[1] Llama-2 paper

[2] Link to the original dataset in the Huggingface hub

[3] Link to the used dataset in the Huggingface hub

[4] Fine-tuning a GPT — LoRA by Chris Kuo/Dr. Dataman

[5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, & Weizhu Chen. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685

[6]. QLoRa: Efficient Finetuning of QuantizedLLMs

[7] Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

[8] Extended Guide: Instruction-tune Llama 2 by Philipp Schmid.

[9] Fine-Tune Your Own Llama 2 Model in a Colab Notebook by Maxime Labonne

[10]. My Github Repository

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓