Learning Transferable Visual Models From Natural Language Supervision

Introduction

Over the past few years, NLP has been revolutionized by pre-training methods that learn directly from raw text. Task-agnostic objectives such as autoregressive and masked language modeling have been scaled across many orders of magnitude of compute, model capacity, and data, steadily improving performance. The development of text-to-text as a standardized input-output interface allows task-agnostic architectures to transfer zero-shot to downstream datasets, removing the need for specialized output heads or dataset-specific customization.

The aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. In other fields such as computer vision, however, it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet. Could scalable pre-training methods that learn directly from web text lead to a similar breakthrough in computer vision? Prior work is encouraging:

More than 20 years ago, Mori et al. explored improving content-based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. demonstrated that it is possible to learn more data-efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava and Salakhutdinov explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features.

Despite these encouraging proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks has been much lower than that of alternative approaches. For example, Li et al. reached only 11.5% accuracy on ImageNet in a zero-shot setting, far below the 88.4% accuracy of the current state of the art and even below the 50% accuracy of classical computer vision methods.

Using a large amount of publicly available data on the Internet, the paper creates a new dataset of 400 million (image, text) pairs and demonstrates that a simplified version of ConVIRT, called CLIP (Contrastive Language-Image Pre-training), is an efficient way to learn from natural language supervision. The scalability of CLIP is studied by training a series of eight models spanning almost two orders of magnitude of compute and data, showing that transfer performance is a smoothly predictable function of compute. The zero-shot transfer performance of CLIP is benchmarked on more than 30 existing datasets and found to be competitive with prior task-specific supervised models.

An overview of the approach is shown in Figure 1.

Methods

Natural language supervision

The core idea of the method is to learn perception from the supervision contained in natural language, which has several potential advantages over other training approaches. It is much easier to scale natural language supervision than standard crowd-sourced labeling for image classification, since it does not require annotations to be in a classic "machine learning-compatible format" such as the canonical 1-of-N majority-vote "gold label". Methods that work on natural language can instead learn passively from the vast amount of text on the Internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised approaches: it doesn't "just" learn a representation, it also connects that representation to language, which enables flexible zero-shot transfer.

Creating a sufficiently large dataset

One of the main motivations for natural language supervision is that a large quantity of data in this form is publicly available on the Internet. Since existing datasets do not adequately reflect this possibility, results on them alone would underestimate the potential of this line of research. To address this, we built a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet, called WIT (WebImageText).

Choosing an efficient pre-training method

State-of-the-art computer vision systems require so much computation that the task of learning an open set of visual concepts from natural language may seem daunting. In the course of our experiments, we found that training efficiency was key to successfully scaling natural language supervision, and we selected the final pre-training method on this basis. The initial approach, similar to VirTex, jointly trains an image CNN and a text Transformer from scratch to predict the caption of an image.

However, this approach is difficult to scale efficiently. Figure 2 shows that a 63-million-parameter Transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times more slowly than a much simpler baseline that predicts a bag-of-words encoding of the same text.

The paper therefore trains a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image, rather than the exact words of that text. Starting from the same bag-of-words encoding baseline, the predictive objective is swapped for the contrastive objective in Figure 2, and a further 4x improvement in the rate of zero-shot transfer to ImageNet is observed.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across the batch actually occurred. To do this, CLIP learns a multimodal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings.

The pseudo-code at the core of the CLIP implementation is shown in Figure 3. This batch construction technique and objective were first introduced in the area of deep metric learning.
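A minimal PyTorch sketch of that objective, written in the spirit of the Figure 3 pseudo-code rather than as the exact released implementation; the encoder modules, the projection matrices W_i and W_t, and the learned temperature are illustrative assumptions:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, W_i, W_t, log_temperature, images, texts):
    # Extract feature representations of each modality
    I_f = image_encoder(images)              # [n, d_i]
    T_f = text_encoder(texts)                # [n, d_t]

    # Joint multimodal embedding: linear projection + L2 normalization
    I_e = F.normalize(I_f @ W_i, dim=-1)     # [n, d_e]
    T_e = F.normalize(T_f @ W_t, dim=-1)     # [n, d_e]

    # Scaled pairwise cosine similarities, shape [n, n]
    logits = I_e @ T_e.t() * log_temperature.exp()

    # Symmetric cross-entropy: the i-th image should match the i-th text
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, labels)        # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), labels)     # text -> image direction
    return (loss_images + loss_texts) / 2

The diagonal of the similarity matrix corresponds to the N real pairs, and the off-diagonal entries are the N² − N incorrect pairings that the loss pushes apart.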

Choosing and scaling the model

Two different architectures were considered for the image encoder. The first uses ResNet-50 as the base architecture, since it is widely adopted and has proven performance. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT).

The text encoder is a Transformer; as the base size, a 63M-parameter, 12-layer, 512-wide model with 8 attention heads is used.
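Both encoder families are available through the released clip package; a short sketch of loading them (the model names assume the checkpoints published in the openai/CLIP repository, which can be listed with clip.available_models()):

import clip

# List the released checkpoints, e.g. 'RN50', 'ViT-B/32', ...
print(clip.available_models())

# ResNet-based image encoder
model_rn, preprocess_rn = clip.load("RN50")

# Vision Transformer-based image encoder
model_vit, preprocess_vit = clip.load("ViT-B/32")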

Source code analysis

Source link

Usage

First, install PyTorch 1.7.1 and torchvision, along with a few additional dependencies. On a CUDA GPU machine, this looks like the following:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

On a machine without a GPU, replace cudatoolkit=11.0 with the appropriate CUDA version or with cpuonly.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # e.g. [[0.9927937  0.00421068 0.00299572]]
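Building on that snippet, here is a short sketch of zero-shot classification over an arbitrary label set. The label list and the image path "photo.jpg" are hypothetical placeholders; the prompt template "a photo of a {label}" is one simple choice for turning class names into text queries.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical label set and image; substitute your own
labels = ["airplane", "bird", "cat", "dog", "ship"]
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Turn each class name into a natural-language prompt and tokenize it
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity between the image and each prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Report the most likely label
best = similarity[0].argmax().item()
print(f"Predicted label: {labels[best]} ({similarity[0][best].item():.2%})")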