Exploring MobileCLIP: A lightweight solution for Zero-Shot Image Classification

Last Updated on April 15, 2025 by Editorial Team

Author(s): Antonio Guerra

Originally published on Towards AI.

An example of a Zero-Shot Image Classification Model identifying a cat in an image with class probabilities for "cat", "dog", and "bird" (source: https://huggingface.co/tasks/zero-shot-image-classification)

Introduction

In today’s rapidly evolving world of computer vision, there’s a growing need for technology that can adapt to new situations quickly and efficiently. One of the most exciting developments in this area is zero-shot image classification. But what does that mean in plain terms?

Imagine showing a computer an image of an object it has never seen before. Traditionally, you would have to train the system with tons of labeled images to help it recognize different objects. But with zero-shot classification, you’re bypassing that lengthy process. Instead of teaching the model what every object looks like, you give it descriptions and let it figure out the match on its own. Pretty neat, right?

This capability is a game changer in environments where new categories pop up frequently, and it’s hard (or costly) to get enough labeled data.

One of the standout models that makes this possible is CLIP (Contrastive Language-Image Pretraining), developed by OpenAI. It connects the dots between images and their text descriptions, allowing the model to recognize new objects without the usual heavy training. The downside? CLIP is a bit of a resource hog. It needs a lot of computational power, making it difficult to run on smaller devices like mobile phones or IoT gadgets.

That’s where MobileCLIP steps in. It’s an optimized, lightweight version of CLIP, designed for devices with fewer resources. In this article, I’ll show you how MobileCLIP works and how it brings zero-shot classification to the palm of your hand.

Note
Rather than diving into the technical architecture of CLIP (and, by extension, MobileCLIP), this article focuses on how to practically implement MobileCLIP for zero-shot classification. For a deep dive into the technical details, you can visit the official GitHub repositories and papers for both MobileCLIP and CLIP:

MobileCLIP Paper

CLIP Paper

Understanding MobileCLIP

The foundation: CLIP

At its core, CLIP is a model that’s learned to link images with text. It’s trained to predict which piece of text best matches a given image, or which image best matches a description. This process involves huge datasets with pairs of images and text.

For instance, if you provide an image of a cat and the text prompts "dog," "cat," and "car," CLIP can determine that "cat" is the most likely match, even if it has never seen that image before.

This general understanding of visual concepts sets CLIP apart from traditional image classifiers. A conventional classifier typically needs labeled examples of each category you want it to recognize; if you add a new category, you have to collect more labels and retrain. CLIP, on the other hand, just needs a text label or description; no additional image data is required.
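
To make this concrete, here is a minimal sketch of zero-shot classification using the Hugging Face transformers pipeline with a generic CLIP checkpoint (the checkpoint name and the local file cat.jpg are assumptions for illustration; the rest of this article uses open_clip instead):

from transformers import pipeline

# Load a generic CLIP checkpoint behind the zero-shot image classification pipeline.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Adding a new category is as simple as adding another string to this list.
candidate_labels = ["dog", "cat", "car"]

# Returns a list of {"label": ..., "score": ...} dicts, sorted by score.
predictions = classifier("cat.jpg", candidate_labels=candidate_labels)
print(predictions[0])  # the top-scoring label and its probability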

The power of CLIP comes from its ability to generalize across a wide range of visual concepts, thanks to its training on diverse data. However, this power comes at a cost: CLIP requires significant computational resources, making it less suitable for applications where resources are limited.

The need for optimization: enter MobileCLIP

To make CLIP more practical for devices with limited resources, MobileCLIP was created. It's a simplified version of CLIP, optimized for efficiency without losing the accuracy of the original model; in zero-shot classification, it can even outperform traditional CLIP models.

The key differences between CLIP and MobileCLIP include:

  • Smaller model size: it’s trimmed down to use less memory, which is crucial for mobile or edge devices with limited storage.
  • Computational efficiency: MobileCLIP is designed to perform well even on devices with limited processing power, such as smartphones or IoT devices.
  • Low latency: MobileCLIP offers lower latency in inference, which is critical for real-time applications like live video analysis.

Comparison between CLIP and MobileCLIP models on latency and zero-shot classification accuracy (data extracted from MobileCLIP Paper)

Potential use cases for MobileCLIP

Now that we know what MobileCLIP can do, let’s explore where it can be used. Since it’s designed for devices with limited resources, the potential applications are pretty exciting:

  • Mobile apps: think about the apps you use every day on your phone. With the push towards on-device intelligence, MobileCLIP can enhance your experiences in augmented reality apps, personal assistants, or even real-time photo classification. Instead of sending data to the cloud for processing (which takes time and bandwidth), your phone can do all the hard work locally.
  • Edge computing: MobileCLIP is ideal for edge computing environments where bandwidth and processing power are limited. Devices such as drones, robots, and remote sensors can leverage the model for visual recognition tasks, enabling real-time decision-making without constant cloud connectivity.
  • IoT devices: The integration of MobileCLIP into Internet of Things (IoT) devices, like security cameras or smart home assistants, allows these systems to perform local visual recognition. This brings benefits in terms of privacy, latency, and the ability to operate in environments with intermittent internet connectivity.

Implementing MobileCLIP

Let’s dive into how you can actually use MobileCLIP for zero-shot classification. If you’re ready to get your hands dirty, here’s a step-by-step guide on setting it up.

Step-by-Step code: Zero-Shot Image Classification with MobileCLIP

1. Environment setup

import os
import time
import argparse
from typing import List, Tuple

import cv2
import torch
import matplotlib.pyplot as plt
from PIL import Image
import open_clip
from timm.utils import reparameterize_model
import numpy as np

# Check CUDA availability and set the device (GPU if available, otherwise CPU)
cuda = torch.cuda.is_available()
device = torch.device("cuda" if cuda else "cpu")
print(f"Torch device: {device}")

First, we need to import the necessary libraries, including open_clip for model loading, torch for tensor operations, cv2 for image processing, and matplotlib for visualizing the results. If you have a GPU, MobileCLIP can take advantage of it to speed things up. If not, it still runs well on a CPU.
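
If you're on an Apple Silicon machine rather than a CUDA GPU, you can optionally fall back to PyTorch's MPS backend. This fallback is an assumption about your setup and is not part of the original script:

# Optional: use Apple's MPS backend when CUDA is unavailable
# (assumes a PyTorch build with MPS support).
if not cuda and torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Torch device: {device}")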

2. Model and preprocessing

# Load MobileCLIP model and preprocessing transforms
model, _, preprocess = open_clip.create_model_and_transforms(
    'MobileCLIP-S1', pretrained='datacompdr'
)
tokenizer = open_clip.get_tokenizer('MobileCLIP-S1')

# Set model to evaluation mode, reparameterize for efficiency,
# and move it to the selected device
model.eval()
model = reparameterize_model(model)
model.to(device)

Next, we load the MobileCLIP model (let's go with MobileCLIP-S1, a lighter variant) along with its preprocessing transforms. We also need the tokenizer, which converts your text prompts into token sequences the model can understand. Finally, we set the model to evaluation mode, reparameterize it for faster inference, and move it to the selected device.
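
If you're unsure which MobileCLIP checkpoints your installed version of open_clip exposes (the exact names can vary between releases), you can list them with open_clip.list_pretrained():

# Print the (model_name, pretrained_tag) pairs whose name mentions MobileCLIP.
for model_name, pretrained_tag in open_clip.list_pretrained():
    if "MobileCLIP" in model_name:
        print(model_name, pretrained_tag)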

3. Image Classification function

def classify_image(img: np.ndarray, labels_list: List[str]) -> Tuple[str, float]:
    """
    Classify an image using MobileCLIP.

    This function preprocesses the input image, tokenizes the provided
    text prompts, extracts features from both image and text,
    computes the similarity, and returns the label with the highest
    probability along with the probability value.

    Args:
        img (numpy.ndarray): Input image in RGB format.
        labels_list (list): List of labels to classify against.

    Returns:
        tuple: A tuple containing the predicted label (str) and
        the probability (float).
    """

    # Convert the image from a NumPy array to a PIL image, preprocess it,
    # add batch dimension, and move to device.
    preprocessed_img = preprocess(Image.fromarray(img)).unsqueeze(0).to(device)

    # Tokenize the labels inside the function and move tokens to the device.
    text = tokenizer(labels_list).to(device)

    # Disable gradient calculation and enable automatic mixed precision.
    with torch.no_grad(), torch.cuda.amp.autocast():

        # Extract features from the image using the model.
        image_features = model.encode_image(preprocessed_img)

        # Extract text features from the tokenized text.
        text_features = model.encode_text(text)

        # Normalize image and text features to unit vectors.
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)

        # Compute the similarity (dot product) and apply softmax to
        # obtain probabilities.
        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    # Get the label with the highest probability from the provided label list.
    selected_label = labels_list[text_probs.argmax(dim=-1)]
    selected_prob = text_probs.max(dim=-1)[0].item()

    return selected_label, selected_prob

The heart of the process is the image classification function. It takes an image as input, preprocesses it, and passes it through the MobileCLIP image encoder to extract image features. It then computes the similarity with the given labels (e.g., "cat," "dog," "car"), which are also encoded with MobileCLIP, and returns the most likely label along with its probability.
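
If you want to try the function on its own before wiring up the full loop, a quick call might look like this (the file name cat.jpg is an assumption for illustration):

# Read a test image with OpenCV and convert it from BGR to RGB.
img = cv2.cvtColor(cv2.imread('cat.jpg'), cv2.COLOR_BGR2RGB)

# Classify it against a small set of candidate labels.
label, prob = classify_image(img, ['dog', 'cat', 'car'])
print(f"Predicted: {label} ({prob:.2%})")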

4. Visualizing the results

def plot_results(results: List[Tuple[np.ndarray, str, float, float]]) -> None:
    """
    Plot the classification results.

    This function creates a horizontal plot for each image in the results,
    displaying the image along with its predicted label, probability,
    and processing time.

    Args:
        results (list): List of tuples (img, label, probability, elapsed_time).
    """

    # Create subplots with one image per subplot.
    fig, axes = plt.subplots(1, len(results), figsize=(len(results) * 5, 5))

    # If there is only one image, make axes a list to handle it uniformly.
    if len(results) == 1:
        axes = [axes]

    # Iterate over results and plot each one.
    for ax, (img, label, prob, elapsed_time) in zip(axes, results):
        ax.imshow(img)
        ax.set_title(
            f"Label: {label},\nProbability: {prob:.2%},\nTime: {elapsed_time:.2f}s"
        )
        ax.axis('off')

    plt.tight_layout()
    plt.show()

This section introduces a visualization function that plots the classified images along with their predicted labels, probabilities, and processing times.

5. Main loop for classifying images

def main(data_folder: str, labels_list: List[str]) -> None:
    """
    Process images and perform zero-shot image classification.

    This function processes each image in the specified folder,
    classifies it using MobileCLIP, and then plots the results.

    Args:
        data_folder (str): Path to the folder containing input images.
        labels_list (List[str]): List of labels to classify against.
    """
    results: List[Tuple[np.ndarray, str, float, float]] = []

    for image_file in os.listdir(data_folder):
        image_path = os.path.join(data_folder, image_file)
        # Read the image using OpenCV.
        img = cv2.imread(image_path)
        # Skip files that are not valid images.
        if img is None:
            print(f"Warning: Unable to read image {image_file}. Skipping.")
            continue

        # Convert the image from BGR (OpenCV default) to RGB.
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        start_time = time.time()
        selected_label, selected_prob = classify_image(img, labels_list)
        elapsed_time = time.time() - start_time

        print(f"{image_file} - Label: {selected_label}, Prob: {selected_prob:.2%} (Time: {elapsed_time:.2f}s)")

        results.append((img, selected_label, selected_prob, elapsed_time))

    plot_results(results)


if __name__ == '__main__':
    data_folder = 'data'
    labels_list = ['dog', 'cat', 'car']

    main(data_folder, labels_list)

This final section is where the magic happens. We iterate over images in the data folder, classify each image using classify_image(), and append the results for visualization. The results are then passed to plot_results() to generate a visual output.
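
One optional tweak worth experimenting with: CLIP-style models often respond better to full-sentence prompts than to bare nouns. A small variation along these lines (the prompt template is my assumption, not part of the original script) would be:

# Wrap each label in a simple prompt template before passing it to main().
labels_list = ['dog', 'cat', 'car']
prompts = [f"a photo of a {label}" for label in labels_list]

# Note: the label returned by classify_image() will then be the full prompt string.
main('data', prompts)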

Note: For the full code, check out this GitHub repo.

MobileCLIP performing zero-shot classification on sample images

Conclusion and next steps

In this article, we’ve explored how MobileCLIP makes zero-shot image classification accessible on resource-constrained devices. By leveraging the power of language-vision models, MobileCLIP can classify objects it has never seen before, opening up a world of possibilities in various applications where labeled data is scarce.

This is just the beginning. In the next articles, we will explore how to apply zero-shot classification to live video streams, enabling real-time object recognition in dynamic environments. We'll also discuss advanced techniques like integrating MobileCLIP with GPT-2 for rapid caption generation. Stay tuned!

Full code: the exploring_MobileCLIP_A_Lightweight_Solution_for_ZeroShot_Image_Classification folder in the vargroup-datascience/medium-repo repository on GitHub (github.com/vargroup-datascience/medium-repo).


Published via Towards AI
