
Previous article in this series: Multimodal translation is just like that. Is there any use for visual information? (I) – Nuggets (juejin.cn)

This article covers a best paper of NAACL-HLT 2019, Probing the Need for Visual Context in Multimodal Machine Translation, whose authors are from Le Mans University and Imperial College London.

Experiments

Let’s first clarify the four models used for comparison:

  • NMT: Plain-text baseline
  • INIT: Initializes the encoder and decoder with visual features
  • DIRECT: Multimodal attention in the decoder over both the source words and the image features
  • HIER: Improved version of DIRECT that adds a hierarchical attention layer over the textual and visual contexts (a minimal sketch of both fusion schemes follows this list)
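
For concreteness, here is a minimal PyTorch sketch of the two fusion schemes, written from the descriptions above rather than from the authors' code; all class names, dimensions, and the simple dot-product attention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, keys):
    """Dot-product attention: query (B, d), keys (B, N, d) -> context (B, d)."""
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)    # (B, N)
    alpha = F.softmax(scores, dim=-1)
    return torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)      # (B, d)

class DirectFusion(nn.Module):
    """DIRECT-style fusion: attend to text states and image regions separately,
    then concatenate and project the two contexts."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)

    def forward(self, dec_state, txt_states, img_feats):
        c_txt = attend(dec_state, txt_states)    # textual context
        c_img = attend(dec_state, img_feats)     # visual context
        return self.proj(torch.cat([c_txt, c_img], dim=-1))

class HierFusion(nn.Module):
    """HIER-style fusion: a second attention decides how much to rely on
    the textual context versus the visual context."""
    def forward(self, dec_state, txt_states, img_feats):
        c_txt = attend(dec_state, txt_states)
        c_img = attend(dec_state, img_feats)
        modalities = torch.stack([c_txt, c_img], dim=1)         # (B, 2, d)
        return attend(dec_state, modalities)

# Toy usage: batch of 4, hidden size 8, 10 source tokens, 196 image regions.
dec, txt, img = torch.randn(4, 8), torch.randn(4, 10, 8), torch.randn(4, 196, 8)
print(DirectFusion(8)(dec, txt, img).shape)   # torch.Size([4, 8])
print(HierFusion()(dec, txt, img).shape)      # torch.Size([4, 8])
```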

Normal

The authors first report the METEOR scores of the four models on Test2017 (left part of the table below); the MMT models show only a slight improvement over NMT.

Color Deprivation

The authors then perform color deprivation on the source sentences: every word describing a color is replaced by a mask token before translation, and the model is expected to recover the correct color from the image. A hypothetical example of this masking is sketched below:
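
A minimal sketch of such a color-deprivation step, assuming a hypothetical color list and a [v] mask token (the authors' exact vocabulary and token may differ):

```python
# Hypothetical color vocabulary; the paper's actual list may differ.
COLORS = {"red", "green", "blue", "white", "black", "brown",
          "orange", "yellow", "pink", "purple", "grey", "gray"}

def deprive_colors(sentence: str, mask: str = "[v]") -> str:
    """Replace every color word in the source sentence with the mask token."""
    return " ".join(mask if tok.lower() in COLORS else tok
                    for tok in sentence.split())

print(deprive_colors("a woman in a blue dress holds a white umbrella"))
# -> a woman in a [v] dress holds a [v] umbrella
```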

The results are shown in the last column (Dc) of the table above: MMT improves significantly over the plain-text baseline. Note that this is measured over the entire dataset; the improvement is even more pronounced on the subset of sentences that actually contain color words, where the multimodal-attention models gain about 12% and INIT gains about 4%, suggesting that the more sophisticated models are better at exploiting visual information.

Entity Masking

The authors also perform entity masking, replacing the entities in the source sentence with a mask token; a sketch of this is shown below:
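
A minimal sketch of entity masking, assuming the entity spans are already known from an annotation layer; the span indices and the [v] token are illustrative:

```python
def mask_entities(tokens, entity_spans, mask="[v]"):
    """Replace each annotated entity span (start, end) with a single mask token."""
    out, i = [], 0
    starts = {s: e for s, e in entity_spans}
    while i < len(tokens):
        if i in starts:
            out.append(mask)
            i = starts[i]        # jump past the whole entity span
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "a little boy plays with a soccer ball".split()
print(" ".join(mask_entities(tokens, [(2, 3), (6, 8)])))
# -> a little [v] plays with a [v]
```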

The source sentences are degraded much more severely this time, because far more words are masked, and MMT improves much more noticeably: DIRECT alone gains 4.2 points. Following Elliott et al.'s idea, the authors also compare against incongruent decoding, in which the visual features are fed in reverse sample order so that each sentence is paired with the wrong image, breaking the image-sentence alignment. Performance drops sharply under this setting. The experimental results are shown in the figure below, and the authors observe similar results for the other languages.
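
A minimal sketch of such incongruent decoding; `model.translate` and the feature list are hypothetical placeholders, and reversing the sample order is just one simple way to break the pairing:

```python
def incongruent_decode(model, src_sentences, img_feats):
    """Decode each sentence with the *wrong* image by reversing the feature order."""
    mismatched = list(reversed(img_feats))          # same features, wrong pairing
    return [model.translate(src, feats)
            for src, feats in zip(src_sentences, mismatched)]
```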

This suggests that the image information really matters here. The authors also visualize the attention weights: in the example shown, MMT attends to the correct region of the image and correctly translates the erroneous source word "song" as "son".

Progressive Masking

Finally, the authors run a progressive masking experiment, in which a source sentence is degraded step by step. As the number of words kept in the source sentence drops toward 0, the advantage of the three MMT models keeps growing; when the source sentence is completely masked, they score about 7 points higher than NMT. A sketch of this degradation is given below:
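
A minimal sketch of progressive masking, keeping only the first k source tokens and masking the rest; the [v] token and the step size are illustrative assumptions:

```python
def progressive_mask(tokens, k, mask="[v]"):
    """Keep the first k tokens of the source sentence and mask everything else."""
    return tokens[:k] + [mask] * max(0, len(tokens) - k)

sent = "a man rides a bike down the street".split()
for k in range(len(sent), -1, -2):       # 8, 6, 4, 2, 0 words kept
    print(k, " ".join(progressive_mask(sent, k)))
```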

The authors also test the models' sensitivity to the image under different degrees of source degradation: as more of the source sentence is available, the performance gap between the models shrinks and the image becomes less and less important.

Finally, the authors run a contrasting experiment in which a DIRECT model is trained from scratch under incongruent decoding, i.e. with mismatched image features. The model learns to ignore the visual information and ends up performing on par with the NMT baseline.

Summary

The authors probe how MMT uses visual information by degrading the source sentences and by feeding incongruent image features, and show that visual context is indeed useful in MMT, for example for recovering masked words and correcting errors in the source sentence during translation.