Notes on CLIP: Connecting Text and Images
Last Updated on November 5, 2023 by Editorial Team
Author(s): Nieves Crasto
Originally published on Towards AI.
Radford, Alec, et al. "Learning Transferable Visual Models From Natural Language Supervision." International Conference on Machine Learning. PMLR, 2021.
The authors of the above paper aim to produce good representations (features) for images that can be used for various tasks with minimal or no supervision.
Limitations of supervised learning
Off-the-shelf features generated by image classification models have been used in other tasks like image retrieval. However, these features do not generalize very well, as the classification models were trained to recognize a fixed set of classes. Any new category added to this set of classes would require collecting additional annotated images for this new category and then retraining the model. This is a time-consuming and expensive process.
Can self-supervised learning techniques be leveraged to address this problem?
Can image captions be used as a means to produce better image representations and avoid the cost of annotation? That is, can natural language be used as supervision to learn visual perception?
Main Contribution
The authors propose a pre-training task (CLIP = Contrastive Language-Image Pre-training) of predicting which caption goes with which image in order to learn SOTA image representations from scratch. For this, they created a dataset of 400 million (image, text) pairs collected from the internet. This pre-trained model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training.
Background
CLIP draws its inspiration from the field of supervised image captioning. Each image and its corresponding caption are used to train a model that predicts the exact words of the caption for a given image. This is a difficult task, as an image can be described in many different ways that all convey the same meaning.
To still leverage the supervision provided by captions, the authors propose a proxy task: predict whether a caption matches a particular image, rather than predicting the caption word for word.
Contrastive Pre-training
Consider a batch of N images and their corresponding N captions. With these, we can create N x N possible (image, text) pairings across a batch. Now, the task is to predict the N real pairs in the batch.
To do so, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder (see Figure 1). The image encoder produces a feature vector, I; similarly, the text encoder produces a feature vector, T.
- For the N real pairs, we want to maximize the cosine similarity between I and T.
- For the N² − N incorrect pairings, we want to minimize the cosine similarity between I and T (a sketch of this loss follows below).
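Concretely, this objective can be written as a symmetric cross-entropy loss over the N × N cosine-similarity matrix. Below is a minimal PyTorch-style sketch closely following the pseudocode in the paper; note that the temperature is a learned parameter in the actual model, whereas a fixed value is assumed here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # image_features: (N, d) image-encoder outputs for a batch of N images
    # text_features:  (N, d) text-encoder outputs for the matching N captions
    # L2-normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N matrix of scaled pairwise cosine similarities
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption, so the correct "class" for
    # row i (and column i) is index i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right caption for each image and
    # the right image for each caption
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```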
Zero-shot Prediction
Consider the task of image classification (see Figure 2). At test time, the image encoder produces a feature vector I₁ for a single input image. To identify the class of the image, the text encoder embeds the class names of the target dataset to produce N feature vectors T₁, T₂, …, where N is the number of classes in the target dataset. The class whose text embedding has the highest cosine similarity with I₁ is chosen as the prediction (see the sketch below).
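As a minimal sketch of this step, assume `image_encoder`, `text_encoder`, and `tokenize` stand in for CLIP's trained towers and tokenizer (they are not defined here):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Embed the single test image (I1) and the class names (T1, T2, ...)
    with torch.no_grad():
        i1 = F.normalize(image_encoder(image), dim=-1)                # (1, d)
        t = F.normalize(text_encoder(tokenize(class_names)), dim=-1)  # (N, d)
    # The class whose text embedding is most similar to I1 wins
    similarities = i1 @ t.t()                                         # (1, N)
    return class_names[similarities.argmax(dim=-1).item()]
```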
Model Details
For the image encoder, the authors evaluate two different architectures:
- ResNet-50: They use the modified ResNet-D architecture with anti-aliased rect-2 blur pooling (see the papers cited by the authors). They also replace the global average pooling layer with a "transformer-style" attention pooling mechanism.
- Vision Transformer (ViT): The authors add an extra layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
For the text encoder, the authors use a Transformer (as described in the referenced paper) with 63M parameters (12 layers, 512-wide) and 8 attention heads.
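As a rough sanity check on the 63M figure, the parameter count of a standard 12-layer, 512-wide transformer plus a token embedding lands in the right ballpark. The ~49K BPE vocabulary size is taken from the paper; this sketch ignores CLIP-specific details such as causal attention masking and positional embeddings.

```python
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 512, 8, 12, 49_152  # vocab size from the paper

token_embedding = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                               dim_feedforward=4 * d_model, batch_first=True),
    num_layers=n_layers,
)

n_params = sum(p.numel() for p in token_embedding.parameters()) \
         + sum(p.numel() for p in encoder.parameters())
print(f"~{n_params / 1e6:.0f}M parameters")  # comes out close to the reported 63M
```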
Training
The authors train 5 ResNets (ResNet-50, ResNet-101, and 3 more ResNets with EfficientNet-style scaling) and 3 Vision Transformers (ViT-B/32, ViT-B/16, and ViT-L/14). The models are trained for 32 epochs using the Adam optimizer with decoupled weight decay regularization and a cosine learning-rate decay schedule. They use a very large minibatch size of 32,768.
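"Adam with decoupled weight decay" corresponds to the AdamW optimizer, and the cosine schedule can be reproduced with a standard cosine-annealing scheduler. A minimal sketch of this setup, assuming a placeholder `model` and illustrative hyperparameter values (not the paper's exact settings):

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the full CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.2)

epochs, steps_per_epoch = 32, 1_000  # 32 epochs as in the paper; step count is illustrative
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)

for step in range(epochs * steps_per_epoch):
    optimizer.zero_grad()
    # ... forward pass and clip_contrastive_loss on a minibatch of
    #     32,768 (image, text) pairs, then loss.backward() and optimizer.step() ...
    scheduler.step()  # cosine learning-rate decay
```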
Some Results and Discussion
Effect of Prompt Engineering:
Image classification datasets are annotated with label IDs that are mapped to class names. Since CLIP is trained on full-sentence text, the authors found the prompt template "A photo of a {label}." to be a good default for the text paired with each image. In Figure 3, we see classification accuracy improve by 5 points with prompt engineering across 36 classification datasets.
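To illustrate, the released CLIP weights can be queried with such prompts through, for example, the Hugging Face transformers implementation of CLIP; the checkpoint name and image path below are assumptions made for this sketch.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "airplane"]
prompts = [f"A photo of a {label}." for label in class_names]  # the default template
image = Image.open("example.jpg")                              # hypothetical local image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # (1, num_classes) similarity scores
probs = logits.softmax(dim=-1)
print(class_names[probs.argmax().item()], float(probs.max()))
```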
Zero-shot CLIP versus Linear Probe
The zero-shot CLIP classifier outperforms a supervised linear classifier fitted on ResNet-50 features on 16 out of 27 datasets (Figure 4). However, CLIP's performance is still below the state of the art for most of these datasets.
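A linear probe here means fitting a logistic regression classifier on frozen image features. A minimal scikit-learn sketch, where the random arrays stand in for pre-extracted ResNet-50 (or CLIP) image embeddings and their labels; the paper additionally sweeps the regularization strength, which is omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for frozen image features extracted from a pre-trained encoder
train_features, train_labels = rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000)
test_features, test_labels = rng.normal(size=(200, 512)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000)   # L-BFGS logistic regression by default
probe.fit(train_features, train_labels)
print("linear-probe accuracy:", probe.score(test_features, test_labels))
```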
Limitations
- CLIP does not perform well on tasks like counting objects in an image or finding the distance to the nearest object in an image.
- It performs poorly on data that is truly out-of-distribution for its training set, such as MNIST. Although its OCR performance on digitally rendered text is good, it reaches only 88% accuracy on MNIST's hand-written digits.
- Using CLIP for few-shot learning leads to poor performance. There is a counter-intuitive drop in performance when going from zero-shot to few-shot learning.
- As CLIP is trained on the text-image pairs queried from the internet, it will learn many social biases.