
Distilling Translations with Visual Awareness

This paper appeared at ACL 2019; lead author Julia Ive is from the University of Sheffield.

Experiments

Compared with Baseline

The author compares a Transformer-based MMT model against a Transformer with a deliberation decoder, evaluating both architectures under three image-input strategies. The results are shown below:

First, the deliberation-decoder model performs significantly better than the baseline overall, with an average improvement of about 1 point on both METEOR and BLEU.

Second, adding image information brings little improvement to either the baseline model or the deliberation model.

Manual inspection

To probe this result further, the author ran a manual evaluation: professional translators and native speakers were hired to score, given the source image, the outputs of the base+att, del, and del+obj models. The average scores are as follows:

The manual evaluation shows that the del models tend to improve the grammaticality and accuracy of the first-pass output. For German, the most common edits made by the second-pass decoder are substitutions of adjectives and verbs (15% and 12% respectively). The adjective changes are mostly grammatical, while the verb changes are contextual (e.g. replacing "run" with a near-synonym implying greater speed). For French, 15% of the changes are noun substitutions, which the authors relate to characteristics of the French language.

Source degradation

In addition, the author also modifies the source sentence by introducing noise (source degradation) to investigate how well these models cope with such problems. Three types of modification are used:

  • Random content words: randomly drop source content words. Sentences are POS-tagged with the spaCy toolkit, and content words of various parts of speech are replaced with a blank placeholder;
  • Ambiguous words: polysemous words in the source sentence are identified using the MLT dataset (which annotates ambiguous words in Multi30K) and replaced with a blank;
  • Person words: words belonging to the person category are replaced with a blank.
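The masking operations above can be sketched as follows. This is a minimal sketch only: the hand-tagged toy sentence and the `[v]` placeholder are assumptions standing in for real spaCy tagger/NER output and whatever blank token the paper actually uses.

```python
import random

BLANK = "[v]"  # assumed placeholder for a dropped word
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def degrade_random_content(tagged_tokens, p=0.5, seed=0):
    """Randomly replace content words (by POS tag) with a blank."""
    rng = random.Random(seed)
    return [BLANK if pos in CONTENT_POS and rng.random() < p else word
            for word, pos in tagged_tokens]

def mask_indices(tagged_tokens, indices):
    """Replace tokens at the given positions with a blank; used for both
    the ambiguous-word and person-word degradations, where the positions
    come from the MLT annotations or an NER tagger respectively."""
    return [BLANK if i in indices else word
            for i, (word, _) in enumerate(tagged_tokens)]

# Toy pre-tagged sentence (stand-in for spaCy output):
toks = [("a", "DET"), ("man", "NOUN"), ("plays", "VERB"),
        ("a", "DET"), ("guitar", "NOUN"), ("outside", "ADV")]

print(mask_indices(toks, {1}))          # person word "man" blanked
print(degrade_random_content(toks, p=1.0))  # every content word blanked
```

With `p=1.0` all content words are blanked while function words survive, which makes the behaviour easy to check by eye.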

The results are as follows:

del+obj is the most successful configuration for German, but del and del+sum are no better than base; for French, all del models improve significantly over base. This difference stems from characteristics of French versus German, which the paper discusses in detail and will not be repeated here.

Image information is most helpful when person words are replaced: del+obj resolves 10% more blanks than del.

Summary

The authors propose a stronger MMT scheme based on the idea of translating first and then refining, which makes better use of textual and visual context. They verify the effectiveness of deliberation networks for MMT, and show that the deliberation design, together with the added visual features, makes the model more robust to noisy input.
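The translate-then-refine control flow can be sketched as below. All names here are hypothetical stand-ins, not the paper's actual implementation: the real first and second passes are trained Transformer decoders, and the second pass attends jointly to the source, the draft, and (in the +obj/+sum variants) visual features.

```python
def deliberate_translate(src_tokens, first_pass, second_pass, visual_feats=None):
    """Two-pass (deliberation) decoding: produce a draft with a standard
    decoder, then refine it with a second decoder that sees the source,
    the draft, and optionally visual features."""
    draft = first_pass(src_tokens)                      # first-pass translation
    refined = second_pass(src_tokens, draft, visual_feats)  # refinement pass
    return refined

# Toy stand-ins that only illustrate the data flow:
first = lambda src: ["ein", "mann", "lauft"]
second = lambda src, draft, vis: [w.replace("lauft", "rennt") for w in draft]
print(deliberate_translate(["a", "man", "runs"], first, second))
# → ['ein', 'mann', 'rennt']
```

The key design point is that the second pass conditions on a complete draft rather than decoding left-to-right from scratch, which is what lets it make the targeted verb/adjective substitutions observed in the manual evaluation.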