
This post covers the ACL 2020 paper "Multimodal Transformer for Multimodal Machine Translation", whose authors are from Peking University.

Motivation

Most work in multimodal machine translation (MMT) does not consider the relative importance of the language and visual modalities. Treating both modalities equally in MMT leads to encoding a lot of irrelevant information, when in fact language should be relatively more important. To address this, the authors propose a multimodal self-attention method that learns image representations under the guidance of the text, avoiding the encoding of redundant visual information.

Methods

The authors introduce the proposed method from three aspects: multimodal fusion, self-attention, and decoding.

Incorporating Method

In the Transformer, the representation of each word is produced from all words through self-attention. Thus, if we treat each word as a node, the Transformer can be seen as a variant of a GNN, with each sentence being a fully connected graph. (I am not sure of the significance of this analogy; it seems to have little to do with the main idea of the paper, and the authors do not mention graphs again later. There is, however, work that uses graph structures in multimodal machine translation: [Paper note] Graph-based Multimodal Fusion Encoder: When GNN Meets Multimodal Machine Translation on Nuggets (juejin.cn), which was uploaded to arXiv shortly after this paper was published.)

For the visual modality, the authors extract regional features from the image, treat them as pseudo-words, and concatenate them with the source sentence as input to the multimodal self-attention layer.
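To make the pseudo-word idea concrete, here is a minimal PyTorch sketch (not the authors' code): region features are projected into the word-embedding space and concatenated with the source-token embeddings. The hidden sizes, the 2048-dimensional region features, and the `img_proj` layer are assumptions for illustration.

```python
# Sketch: image region features as "pseudo-words" concatenated with the source sentence.
import torch
import torch.nn as nn

d_model = 512          # Transformer hidden size (assumed)
d_region = 2048        # e.g. CNN region-feature size (assumed)
vocab_size = 10000

embed = nn.Embedding(vocab_size, d_model)   # source word embeddings
img_proj = nn.Linear(d_region, d_model)     # project regions into the embedding space

src_tokens = torch.randint(0, vocab_size, (1, 20))   # (batch, src_len)
region_feats = torch.randn(1, 36, d_region)          # (batch, num_regions, d_region)

text_emb = embed(src_tokens)        # (1, 20, d_model)
img_emb = img_proj(region_feats)    # (1, 36, d_model): pseudo-word embeddings

# Concatenate along the sequence dimension to form the multimodal input sequence.
multimodal_input = torch.cat([text_emb, img_emb], dim=1)   # (1, 56, d_model)
```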

Multimodal Self-attention

In this paper, the authors design a special multimodal self-attention: the text and visual embeddings are concatenated to generate the Q matrix, while only the text embeddings are used to generate the K and V matrices. Attention is thus adjusted under the guidance of the text to exploit visual information, finally yielding a multimodal context representation, as shown in the figure.
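Below is a hedged, single-head sketch of such a layer (not the authors' implementation; multi-head attention, masking, residual connections, and layer normalization are omitted). The class name MultimodalSelfAttention and the dimensions are assumptions.

```python
# Sketch: queries come from the concatenated [text; image] sequence,
# while keys and values come from the text only, so the text guides
# how visual information is attended to.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # Q is built from the full multimodal sequence [text; image].
        q = self.w_q(torch.cat([text_emb, img_emb], dim=1))
        # K and V are built from the text only.
        k = self.w_k(text_emb)
        v = self.w_v(text_emb)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v   # multimodal context representation

# Usage with embeddings shaped like the previous sketch:
layer = MultimodalSelfAttention()
context = layer(torch.randn(1, 20, 512), torch.randn(1, 36, 512))   # (1, 56, 512)
```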

Experiments

Results

The authors selected several text-only machine translation models and previous multimodal machine translation models as baselines, and also tested them with additional back-translated data. The results on the Multi30k dataset are compared as follows:

The proposed model outperforms the existing SoTA; in particular, the comparison with the text-only baseline shows that the model does benefit from the image information.

In addition, while all models improve with the additional data, the model in this paper improves the most, suggesting that it would perform even better on larger datasets.

The authors select some examples from the test set for a case study. The figure shows the attention distribution of different attention heads. It can be seen that, under the guidance of the text, the model pays more attention to the image regions related to the translated content, namely the people and the buildings.

Ablation Study

The authors investigate the effects of the multimodal self-attention and of the image input on model performance.

First, replacing the multimodal self-attention designed in this paper with ordinary self-attention, i.e., treating the visual and text modalities equally, clearly degrades performance. The authors then replace the image input with an empty image, and performance deteriorates further. When the input is replaced with a random, unrelated image, the mismatched image-text pairs cause performance to fall even below the text-only baseline.
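For concreteness, here is a minimal sketch of the three image-input conditions (an illustration, not the authors' evaluation script); the helper ablate_image and the zero-tensor treatment of the "empty image" are assumptions.

```python
# Sketch: three image-input conditions used in the ablation.
import torch

def ablate_image(region_feats: torch.Tensor, mode: str) -> torch.Tensor:
    """region_feats: (batch, num_regions, d_region) features of the matched images."""
    if mode == "matched":
        return region_feats                              # the original paired image
    if mode == "empty":
        return torch.zeros_like(region_feats)            # "empty image" input (assumed: all zeros)
    if mode == "random":
        perm = torch.randperm(region_feats.size(0))      # shuffle images within the batch
        return region_feats[perm]                        # mismatched image-text pairs
    raise ValueError(f"unknown mode: {mode}")
```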

Summary

The core idea is that in multimodal machine translation, the image and text modalities are not equally important. The authors encode the inputs with a special multimodal self-attention in which K and V come from the text while Q comes from the concatenation of text and image. In other words, the text guides the attention over the image, reducing the amount of irrelevant visual information that is introduced. The work reached SoTA.