Notes on CLIP: Connecting Text and Images

Last Updated on November 5, 2023 by Editorial Team

Author(s): Nieves Crasto

Originally published on Towards AI.

Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.

The authors of the above paper aim to produce good representations (features) for images that can be used for various tasks with minimal or no supervision.

Limitations with supervised learning

Off-the-shelf features generated by image classification models have been used in other tasks like image retrieval. However, these features do not generalize very well, as the classification models were trained to recognize a fixed set of classes. Any new category added to this set of classes would require collecting additional annotated images for this new category and then retraining the model. This is a time-consuming and expensive process.

Can self-supervised learning techniques be leveraged to address this problem?

Can image captions be used as a means to produce better image representations and avoid the cost of annotation? That is, can natural language be used as supervision to learn visual perception?

Main Contribution

The authors propose a pre-training task (CLIP = Contrastive Language-Image Pre-training) of predicting which caption goes with which image in order to learn SOTA image representations from scratch. For this, they created a dataset of 400 million (image, text) pairs collected from the internet. This pre-trained model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training.

Background

CLIP draws its inspiration from the field of supervised image captioning. In that setting, each (image, caption) pair is used to train a model that predicts the exact words of the caption for the given image. This is a difficult task, as an image can be described in various ways and still convey the same meaning.

To still leverage the supervision provided by captions, the authors propose an easier proxy task: predict whether a caption matches a particular image, rather than predicting the caption word-for-word.

Contrastive Pre-training

Consider a batch of N images and their corresponding N captions. With these, we can create N × N possible (image, text) pairings across the batch. The task is to predict which N of these pairings are the real ones.

To do so, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder (see Figure 1). The image encoder produces a feature vector, I; similarly, the text encoder produces a feature vector, T.

For the N real pairs, we want to maximize the cosine similarity between I and T.
For the N² − N incorrect pairings, we want to minimize the cosine similarity between I and T.
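The objective above can be sketched as a symmetric cross-entropy loss over the N × N similarity matrix, which is the form the paper's pseudocode takes. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, the feature vectors are toy stand-ins for encoder outputs, and the temperature is fixed at 0.07 here as an assumption (in the paper it is a learned parameter).

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so that dot products equal cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # temperature=0.07 is an illustrative fixed value (assumption).
    n = len(image_feats)
    I = [l2_normalize(v) for v in image_feats]
    T = [l2_normalize(v) for v in text_feats]
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(I[i], T[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: for image i, the correct caption is column i.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: for caption j, the correct image is row j.
    loss_texts = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (loss_images + loss_texts) / 2

# Toy batch of N = 2: correctly paired features give a much lower loss
# than the same features with the captions swapped.
matched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)  # True
```

Minimizing this loss pushes the diagonal entries of the similarity matrix (the real pairs) up and the off-diagonal entries (the incorrect pairings) down.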

Figure 1: Contrastive Pre-training (Image courtesy: paper)

Zero-shot Prediction

Consider the task of image classification (see Figure 2). At test time, for a single image, the image encoder produces a feature vector I₁. To identify the class of the image, the text encoder embeds the class names of the target dataset to produce N feature vectors T₁, T₂, …, and so on, where N is the number of classes in the target dataset. The predicted class is the one whose text feature vector has the highest cosine similarity with I₁.
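A minimal sketch of this zero-shot step, assuming the predicted class is the one whose text embedding has the highest cosine similarity with I₁ (as in the paper). The function names and the three-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_feat, class_text_feats, class_names):
    # Compare the single image embedding with every class-name embedding
    # and return the best-matching class along with all similarities.
    sims = [cosine_similarity(image_feat, t) for t in class_text_feats]
    best = max(range(len(sims)), key=sims.__getitem__)
    return class_names[best], sims

# Toy embeddings (assumptions, not real CLIP features).
class_names = ["dog", "cat", "car"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feat = [0.9, 0.1, 0.0]

predicted, sims = zero_shot_classify(image_feat, text_feats, class_names)
print(predicted)  # dog
```

No classifier weights are trained here: adding a new class only requires embedding one more class name.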

Model Details

For the image encoder, the authors evaluate two different architectures:

ResNet-50: The authors use a modified ResNet-D architecture (see the paper) with anti-aliased rect-2 blur pooling (see the paper). They also replace the global average pooling layer with a “transformer-style” attention pooling mechanism.
Vision Transformer (ViT): The authors add a layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.

For the text encoder, the authors use a Transformer (described in this paper) with 63M parameters: 12 layers, a width of 512, and 8 attention heads.
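As a rough sanity check on the 63M figure, the parameter count can be estimated from the layer count and width. This is a back-of-the-envelope sketch under stated assumptions: a standard Transformer block with a 4× MLP expansion, biases and layer-norm parameters ignored, and a vocabulary size of 49,152, which comes from the CLIP paper rather than from this note. The function name is hypothetical.

```python
def transformer_param_estimate(n_layers, width, vocab_size):
    # Per layer: 4 * d^2 for attention (Q, K, V, output projections)
    # plus 8 * d^2 for the MLP (d -> 4d -> d). Biases and layer norms ignored.
    per_layer = 12 * width * width
    # Token embedding table: one width-sized vector per vocabulary entry.
    embedding = vocab_size * width
    return n_layers * per_layer + embedding

estimate = transformer_param_estimate(n_layers=12, width=512, vocab_size=49152)
print(f"{estimate / 1e6:.1f}M")  # 62.9M
```

With a width of 512 split over 8 heads, each attention head works on 64 dimensions.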

Training

The authors train 5 ResNets (ResNet-50, ResNet-101, and 3 ResNets scaled up in the EfficientNet style) and 3 Vision Transformers (ViT-B/32, ViT-B/16, and ViT-L/14). All models are trained for 32 epochs using the Adam optimizer with decoupled weight decay regularization, and the learning rate is decayed using a cosine schedule. A very large minibatch size of 32,768 is used.
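A cosine learning-rate schedule of the kind mentioned above can be sketched as follows. This is an illustration only: the function name, the base learning rate, and the step counts are assumptions, and any warm-up phase is omitted.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    # Fraction of training completed, clamped to [0, 1].
    progress = min(step, total_steps) / total_steps
    # Half a cosine period: starts at base_lr and decays smoothly to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative values (assumptions): 1,000 steps and a base rate of 5e-4.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, base_lr=5e-4))
```

The rate falls slowly at first, fastest around the midpoint of training, and slowly again as it approaches the minimum.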

Some Results and Discussion

Effect of Prompt Engineering

Image classification datasets are annotated with label IDs that are mapped to class names. Since CLIP is trained on text that is usually a full sentence, the authors found the prompt template “A photo of a {label}.” to be a good default for the text associated with an image. In Figure 3, classification accuracy improves by almost 5 points on average across 36 classification datasets when prompt engineering and ensembling are used.
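Turning class names into prompts with this template can be sketched as below. The helper name and the example labels are hypothetical; in practice the resulting strings would be passed to the text encoder in place of the bare class names.

```python
def build_prompts(class_names, template="A photo of a {label}."):
    # Wrap each bare class name in a full-sentence prompt.
    return [template.format(label=name) for name in class_names]

prompts = build_prompts(["dog", "cat", "truck"])
print(prompts)
# ['A photo of a dog.', 'A photo of a cat.', 'A photo of a truck.']
```

Because the template is just a string, a dataset-specific variant can be swapped in through the `template` argument.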

Zero-shot CLIP versus Linear Probe

The zero-shot CLIP classifier outperforms a supervised linear classifier fitted on ResNet-50 features on 16 out of 27 datasets (Figure 4). However, the performance of CLIP is still below that of state-of-the-art for most of these datasets.

Limitations

CLIP does not perform well on tasks like counting objects in an image or finding the distance to the nearest object in an image.
It performs poorly on data that is truly out-of-distribution for it. Although its performance on digitally rendered text (OCR) is good, it reaches only 88% accuracy on the handwritten digits of MNIST.
Using CLIP for few-shot learning leads to poor performance. There is a counter-intuitive drop in performance when going from zero-shot to few-shot learning.
As CLIP is trained on the text-image pairs queried from the internet, it will learn many social biases.


Published via Towards AI
