
This paper was published at ECCV 2020; the first author is Yen-Chun Chen from Microsoft. UNITER: UNiversal Image-TExt Representation Learning.

Preface

When I first read this paper, I had just started getting into deep learning and had read barely a couple of papers, so I understood almost none of it at the time; even now I can hardly remember its content. Having read a few more papers since then and feeling I have made some progress, I went back to reread it and wrote these reading notes along the way, to help me accumulate knowledge and share it.

Motivation

In recent years, a large number of models have emerged in the vision-and-language (V+L) field, constantly refreshing SOTA on various tasks (such as Visual Question Answering and Image Captioning). But these models are designed for one particular task and are tightly coupled to it. The authors instead aim to learn a universal image-text representation that can serve all V+L tasks.

Model

The model structure is relatively simple. The authors design a Transformer-based architecture: visual information (region features and bounding boxes extracted by Faster R-CNN) and textual information (text tokens and their positions) are each encoded into a common embedding space and fed to the Transformer as input. The overall structure is as follows:
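To make the input construction concrete, here is a minimal PyTorch sketch of a UNITER-style joint embedder. It is a sketch under my own assumptions (layer sizes, the 7-dimensional bounding-box geometry vector, and the use of nn.TransformerEncoder), not the released implementation:

```python
import torch
import torch.nn as nn

class UniterStyleEmbedder(nn.Module):
    """Rough sketch of a UNITER-style joint embedder (dimensions are assumptions)."""

    def __init__(self, vocab_size=30522, hidden=768, region_feat_dim=2048, bbox_dim=7):
        super().__init__()
        # Text side: token embeddings + position embeddings (as in BERT)
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(512, hidden)
        # Image side: project region features and bounding-box geometry, then sum
        self.feat_proj = nn.Linear(region_feat_dim, hidden)
        self.bbox_proj = nn.Linear(bbox_dim, hidden)
        self.norm = nn.LayerNorm(hidden)
        # Shared multi-layer Transformer encoder over the concatenated sequence
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, token_ids, region_feats, region_boxes):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        text = self.norm(self.tok_emb(token_ids) + self.pos_emb(positions))
        image = self.norm(self.feat_proj(region_feats) + self.bbox_proj(region_boxes))
        joint = torch.cat([image, text], dim=1)   # one sequence over both modalities
        return self.encoder(joint)

# Toy usage with random inputs
emb = UniterStyleEmbedder()
out = emb(torch.randint(0, 30522, (2, 16)),       # 16 text tokens
          torch.randn(2, 36, 2048),               # 36 detected regions
          torch.randn(2, 36, 7))                  # bbox geometry per region
print(out.shape)  # (2, 52, 768)
```

The key design choice is that both modalities end up in a single sequence, so every self-attention layer can freely mix words and regions.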

Pre-training

The authors pre-train UNITER with four pre-training tasks:

  • Masked Language Modeling (MLM), conditioned on the image;
  • Masked Region Modeling (MRM), conditioned on the text. To further study the effectiveness of MRM, the authors propose three MRM variants: Masked Region Classification (MRC), Masked Region Feature Regression (MRFR), and Masked Region Classification with KL-divergence (MRC-kl);
  • Image-Text Matching (ITM);
  • Word-Region Alignment (WRA).

Masked Language Modeling

Similar to MLM in BERT, words in the input are randomly masked with a 15% probability, and the model learns to recover them from the Transformer output. The masked words are predicted from the surrounding words and all image regions. The loss is the negative log-likelihood:
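From the description above, my reconstruction of the MLM objective, where $\mathbf{w}$ are the text tokens, $\mathbf{v}$ are the image regions, $m$ indexes the masked positions, and $D$ is the training corpus:

$$
\mathcal{L}_{\text{MLM}}(\theta) = -\,\mathbb{E}_{(\mathbf{w},\mathbf{v})\sim D}\,\log P_\theta\!\left(\mathbf{w}_m \mid \mathbf{w}_{\setminus m}, \mathbf{v}\right)
$$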

MLM and MRM masking are kept separate: in each training instance only one modality (words or regions) is masked, to avoid the case where a masked word happens to describe a masked region, which would hurt the semantic alignment between the two modalities.

Masked Region Modeling

Similar to MLM, input region features are masked with a 15% probability, and the model is trained to reconstruct the masked regions from the remaining region features and all text tokens. Unlike text tokens, however, image features are high-dimensional and continuous, so they cannot be predicted over a discrete vocabulary in the same way. The authors therefore design three different variants that share the same base objective:
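My reconstruction of the shared base objective, where $f_\theta$ is the variant-specific term computed for the masked regions $\mathbf{v}_m$ given the remaining regions $\mathbf{v}_{\setminus m}$ and all words $\mathbf{w}$:

$$
\mathcal{L}_{\text{MRM}}(\theta) = \mathbb{E}_{(\mathbf{w},\mathbf{v})\sim D}\, f_\theta\!\left(\mathbf{v}_m \mid \mathbf{v}_{\setminus m}, \mathbf{w}\right)
$$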

Masked Region Feature Regression (MRFR): the Transformer output at each masked region is projected to the same dimension as the input ROI feature, and the model regresses this prediction onto the original visual feature.

Masked Region Classification (MRC): classify the masked region into an object category. Since there is no ground-truth object category for the regions, the author uses the object class detected by Faster R-CNN (its most confident class) as a pseudo label.

Masked Region Classification with KL-divergence (MRC-kl): instead of a hard pseudo label, the class probability distribution output by the detector is used as a soft target, and KL divergence measures the difference between the model's predicted distribution and this (self-supervised, pseudo ground-truth) distribution.
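To make the three variants concrete, here is a minimal PyTorch sketch of the three losses for a small batch of masked regions. The head layers, the 2048-dimensional region features, and the 1601-class detector vocabulary are my assumptions for illustration, not the released code:

```python
import torch
import torch.nn.functional as F

# h: Transformer outputs at the masked regions; roi_feat: original detector
# features (regression targets); det_probs: detector class distributions
# (soft pseudo labels). All sizes below are illustrative assumptions.
hidden, feat_dim, num_classes = 768, 2048, 1601
h = torch.randn(4, hidden)              # outputs for 4 masked regions
roi_feat = torch.randn(4, feat_dim)     # original Faster R-CNN features
det_probs = torch.softmax(torch.randn(4, num_classes), dim=-1)

feat_head = torch.nn.Linear(hidden, feat_dim)      # MRFR regression head
cls_head = torch.nn.Linear(hidden, num_classes)    # MRC / MRC-kl classification head

# MRFR: regress the original region feature (L2 loss)
loss_mrfr = F.mse_loss(feat_head(h), roi_feat)

# MRC: cross-entropy against the detector's hard (most confident) class
hard_labels = det_probs.argmax(dim=-1)
loss_mrc = F.cross_entropy(cls_head(h), hard_labels)

# MRC-kl: KL divergence to the detector's full (soft) class distribution
log_pred = F.log_softmax(cls_head(h), dim=-1)
loss_mrc_kl = F.kl_div(log_pred, det_probs, reduction="batchmean")
```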

Image-Text Matching

The input of ITM is a sentence and a set of image regions, and the output is whether the sentence and the image regions match (0/1); negative pairs are created by replacing the image or the text of a matched pair with one randomly drawn from another sample. The task is optimized with a binary cross-entropy loss:
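Written out (my reconstruction), with $s_\theta(\mathbf{w},\mathbf{v})$ the predicted matching score and $y \in \{0,1\}$ the label:

$$
\mathcal{L}_{\text{ITM}}(\theta) = -\,\mathbb{E}_{(\mathbf{w},\mathbf{v})\sim D}\left[\, y\,\log s_\theta(\mathbf{w},\mathbf{v}) + (1-y)\,\log\!\left(1 - s_\theta(\mathbf{w},\mathbf{v})\right) \right]
$$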

Word-Region Alignment

To provide more fine-grained alignment between word tokens and image regions, the authors propose WRA, which uses Optimal Transport (OT) to efficiently compute the minimum cost of transporting the image-region embeddings to the word embeddings (and vice versa); this transport cost is used as the WRA loss.
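As a rough illustration of the OT idea, the following sketch computes an approximate OT distance between word and region embeddings with plain Sinkhorn iterations; the paper itself approximates OT with the IPOT algorithm, and the uniform marginals and cosine cost here are my simplifying assumptions:

```python
import torch

def ot_distance(word_emb, region_emb, eps=0.1, n_iters=50):
    """Approximate OT distance between word and region embeddings (Sinkhorn sketch)."""
    # Cost matrix: 1 - cosine similarity between every word and every region
    w = torch.nn.functional.normalize(word_emb, dim=-1)
    r = torch.nn.functional.normalize(region_emb, dim=-1)
    cost = 1.0 - w @ r.t()                       # (n_words, n_regions)

    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)                # uniform mass over words
    b = torch.full((m,), 1.0 / m)                # uniform mass over regions
    K = torch.exp(-cost / eps)                   # entropic kernel

    u = torch.ones_like(a)
    for _ in range(n_iters):                     # Sinkhorn iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)

    T = torch.diag(u) @ K @ torch.diag(v)        # transport plan
    return (T * cost).sum()                      # transport cost used as the WRA loss

# Toy usage: 12 word embeddings aligned against 36 region embeddings
loss_wra = ot_distance(torch.randn(12, 768), torch.randn(36, 768))
```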

Datasets

The authors pre-train on four large image-caption datasets: COCO, Visual Genome (VG) Dense Captions, Conceptual Captions, and SBU Captions.

Experiments

The authors transfer the pre-trained model to the target tasks and test the performance of UNITER on six V+L downstream tasks by fine-tuning. They also compare the four pre-training tasks and their variants across different pre-training datasets and downstream tasks.

Downstream Tasks

  • Visual Question Answering: the input is an image and a text question, and the output is a distribution over candidate answers; the answer with the highest probability is chosen as the prediction, and binary cross-entropy is used as the loss (a minimal sketch of such a task head follows this list).

  • Visual Entailment: the input is an image and a text description, and the output is whether the description holds for the image {Entailment/Neutral/Contradiction}, similar to the {True/False/Not Given} questions in IELTS reading; cross-entropy is used as the loss.

  • Natural Language for Visual Reasoning: the input is two images and a text description, and the goal is to predict whether the description matches the image pair {True/False}. Since UNITER still takes one image and one text sequence as input, the author pairs the text with each of the two images, encodes the two pairs with UNITER, and obtains the prediction after a bi-attention layer fuses the two outputs. This demonstrates the extensibility of UNITER, which can be adapted to a rather different downstream task with minor modifications.

  • Visual Commonsense Reasoning: similar to multiple-choice reading comprehension; the author scores each of the four candidate choices with its own UNITER pass and uses cross-entropy over the four choices as the loss.

  • Referring Expression Comprehension: the input is a sentence and an image, and the model must locate the image region the expression refers to. A score is output for every region, and the region with the highest score is taken as the prediction.

  • Image-Text Retrieval: UNITER predicts whether an input image-text pair matches, and candidates are ranked by this matching score.
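As referenced in the VQA item above, here is a minimal sketch of how a task head can sit on top of the fused sequence output (VQA-style, trained with binary cross-entropy); the pooling choice, layer sizes, and the 3129-answer vocabulary are my assumptions for illustration, not the released fine-tuning code:

```python
import torch
import torch.nn as nn

# Toy fused output of the joint Transformer and soft VQA answer targets
hidden, num_answers = 768, 3129          # answer-vocab size is an assumption
joint_seq = torch.randn(2, 52, hidden)   # (batch, regions + tokens, hidden)
target = torch.zeros(2, num_answers)
target[:, 0] = 1.0

pooled = joint_seq[:, 0]                 # take the first ([CLS]-like) position
vqa_head = nn.Sequential(nn.Linear(hidden, hidden * 2), nn.GELU(),
                         nn.Linear(hidden * 2, num_answers))
logits = vqa_head(pooled)                # score every candidate answer
loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
```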

Performance

UNITER-large achieves SOTA on all of these downstream tasks.

Ablation Study

The authors draw several conclusions:

  • Pre-trained models are better than models trained from scratch;
  • Joint V+L pre-training is better than pre-training on vision or language alone;
  • The best task combination is MLM + ITM + MRC-kl + MRFR + WRA;
  • The more pre-training data, the better the model.

Training Speedups

In addition, the authors use three techniques to speed up training:

  • Dynamic Batching: the complexity of the Transformer is quadratic in the input sequence length. If every batch is padded to the length of the longest sequence, a lot of computation is wasted on padding. The author instead sorts the input sequences so that sequences in the same batch have similar lengths, letting batches of short sequences use a shorter padded length.
  • Gradient Accumulation: the main bottleneck in UNITER-large training is the network communication overhead between nodes, so gradient accumulation is used to reduce the number of communication rounds.
  • Mixed-precision Training: compute in 16-bit floating point where possible while keeping 32-bit where needed, saving memory and speeding up training (see the sketch after this list).
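As a rough illustration of the last two tricks, here is a minimal PyTorch sketch combining gradient accumulation with mixed-precision training via torch.cuda.amp; the toy model, data, and accumulation factor are placeholders (a CUDA device is assumed), not the UNITER training code:

```python
import torch
import torch.nn as nn

# Placeholder model, optimizer, and random data; only the training loop matters here.
model = nn.Linear(128, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                          # assumed accumulation factor
data = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    x, y = x.cuda(), y.cuda()
    with torch.cuda.amp.autocast():              # fp16 forward pass where safe
        loss = loss_fn(model(x), y) / accum_steps  # average loss across micro-batches
    scaler.scale(loss).backward()                # accumulate scaled gradients locally
    if (step + 1) % accum_steps == 0:            # parameter update (and, in multi-node
        scaler.step(optimizer)                   # training, gradient sync) happens only
        scaler.update()                          # once every accum_steps micro-batches
        optimizer.zero_grad()
```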

Summary

The authors design a large-scale pre-training model that provides a universal image-text representation for vision-and-language tasks, and combine four pre-training tasks to achieve SOTA on numerous downstream tasks.