Fine Tuning Pytorch ViT for CIFAR10
Author(s): Ahmad Mustapha
Originally published on Towards AI.
In the previous article here we created a ViT model from scratch and trained it on the CIFAR10 dataset. However, the model accuracy peaked at 67% without deliberate hyperparameter tuning. This is expected: the original creators of the ViT model noted that these models perform modestly compared to CNNs when trained on small datasets. When trained on large datasets, however, they become on par with CNNs or even better. That is why it is recommended to fine-tune ViT models that have been pretrained on large datasets such as ImageNet, and this is exactly what we will do in this post.
The Training Loop
We start by writing the boilerplate code for training and testing any model on the CIFAR10 dataset. You will notice that we resize the images to 224×224 in both the training and testing transforms, even though the original CIFAR10 images are 32×32. This is because the pretrained model we will load from PyTorch expects 224×224 inputs, as it was trained on ImageNet.
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CIFAR10

transform_train = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

train_set = CIFAR10(root='./datasets', train=True, download=True, transform=transform_train)
test_set = CIFAR10(root='./datasets', train=False, download=True, transform=transform_test)

train_loader = DataLoader(train_set, shuffle=True, batch_size=64)
test_loader = DataLoader(test_set, shuffle=False, batch_size=64)
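To confirm that the resize is applied, you can pull a single batch from the loader and inspect its shape; a minimal check looks like this:

# Grab one batch and verify the images are now 3x224x224 (CIFAR10 originals are 3x32x32)
images, labels = next(iter(train_loader))
print(images.shape)   # expected: torch.Size([64, 3, 224, 224])
print(labels[:8])     # a few integer class labels in [0, 9]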
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import Adam

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n_epochs = 10
lr = 0.0001

# `model` is defined in the next section and moved to `device` before training
optimizer = Adam(model.parameters(), lr=lr)
criterion = CrossEntropyLoss()

for epoch in range(n_epochs):
    train_loss = 0.0
    for i, batch in enumerate(train_loader):
        x, y = batch
        x, y = x.to(device), y.to(device)
        y_hat = model(x)
        loss = criterion(y_hat, y)
        batch_loss = loss.detach().cpu().item()
        train_loss += batch_loss / len(train_loader)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 100 == 0:
            print(f"Batch {i}/{len(train_loader)} loss: {batch_loss:.03f}")

    print(f"Epoch {epoch + 1}/{n_epochs} loss: {train_loss:.03f}")
Loading The Model
Now we have to load the vit_b_16 model from torchvision.models. All ViT models available in torchvision are listed here. If you check the link, you will find several models with labels such as b, l, and h. These labels correspond to the model size: base, large, and huge. The architectures of these models are exactly the ones published in the original ViT paper, An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. The number attached to these labels, such as 16, 32, and 14, corresponds to the patch size the model uses. All these models have been trained on ImageNet. We start by loading the model. The default model is not pretrained, so to make sure we load pretrained weights we have to pass the weights argument as ViT_B_16_Weights.IMAGENET1K_V1.
from torchvision.models import ViT_B_16_Weights, vit_b_16

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
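If you want to see exactly which ViT variants your torchvision version ships, you can query the model registry. This is a minimal sketch, assuming torchvision 0.14 or newer (which provides list_models) and that the weights metadata exposes num_params and categories:

import torchvision.models as models

# List every registered ViT builder, e.g. vit_b_16, vit_b_32, vit_l_16, vit_l_32, vit_h_14
vit_names = [name for name in models.list_models() if name.startswith("vit_")]
print(vit_names)

# Inspect the metadata attached to the pretrained weights we are about to use
weights = models.ViT_B_16_Weights.IMAGENET1K_V1
print(weights.meta["num_params"])      # roughly 86M parameters for ViT-B/16
print(weights.meta["categories"][:5])  # the first few ImageNet class names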
By default, this model outputs logits for 1000 classes, as it has been trained on ImageNet. However, our dataset contains only 10 classes, so we need to change the head of this model from 1000 to 10 logits. The outer layer of the loaded model is the "heads" layer, which is a sequential layer that includes only one linear layer. To adapt the model, we simply assign a new Linear layer to the "heads" layer, preserving the layer's input features and replacing the output features with 10.
from torch import nn

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Sequential(
    nn.Linear(model.heads.head.in_features, 10)
)
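To verify that the new head is wired correctly, you can push a dummy batch through the model and check that it produces 10 logits per image:

import torch

# Sanity check: two random "images" at the expected 224x224 resolution
with torch.no_grad():
    dummy = torch.randn(2, 3, 224, 224)
    out = model(dummy)
print(out.shape)  # expected: torch.Size([2, 10])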
Rather than training all the transformer blocks in the loaded model, we can freeze every layer except the last transformer block and the new head. By doing this, we make the fine-tuning procedure much less compute-intensive. We finally move the model to the GPU device and train it using the previous training loop.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Sequential(
    nn.Linear(model.heads.head.in_features, 10)
)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last encoder layer and the head
for param in model.encoder.layers[-1].parameters():
    param.requires_grad = True
for param in model.heads.parameters():
    param.requires_grad = True

# Move the model to the GPU (or CPU fallback) before running the training loop
model = model.to(device)
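To see how much compute the freezing actually saves, you can count how many parameters remain trainable; with only the last encoder block and the new head unfrozen, it is a small fraction of the full model:

# Compare trainable vs. total parameters after freezing
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")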
Testing Loop
We finally test our model on the CIFAR10 test set. You will find that the model reaches a very high accuracy even after training for only one epoch. This is because of the powerful features the model learned while being pretrained on ImageNet.
from tqdm import tqdm

model.eval()  # switch to evaluation mode for testing
with torch.no_grad():
    correct, total = 0, 0
    test_loss = 0.0
    for batch in tqdm(test_loader, desc="Testing"):
        x, y = batch
        x, y = x.to(device), y.to(device)
        y_hat = model(x)
        loss = criterion(y_hat, y)
        test_loss += loss.detach().cpu().item() / len(test_loader)
        correct += torch.sum(torch.argmax(y_hat, dim=1) == y).detach().cpu().item()
        total += len(x)

print(f"Test loss: {test_loss:.2f}")
print(f"Test accuracy: {correct / total * 100:.2f}%")