
This paper, "Distilling Translations with Visual Awareness", was published at ACL 2019. The lead author, Julia Ive, is from the University of Sheffield.

Motivation

Past MMT work has focused on using image information to help the model translate in special cases (e.g., disambiguating polysemous words). The authors argue that MMT can be implemented more effectively by using textual context together with visual information to refine translation results.

Method

The authors propose a two-stage, translate-then-refine approach: a first pass generates a draft translation directly, and a second pass then refines that draft using target-language context and visual information.
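To make this flow concrete, below is a minimal sketch of the two-stage pipeline; the function names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of translate-then-refine (all names are illustrative).
def translate_then_refine(src_tokens, image_feats, encoder, draft_decoder, refiner):
    enc_states = encoder(src_tokens)    # encode the source sentence once
    draft = draft_decoder(enc_states)   # first pass: generate a draft translation
    # Second pass: refine the draft using the source states, the draft itself,
    # and visual features extracted from the paired image.
    return refiner(enc_states, draft, image_feats)
```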

Related Works

The authors first survey more than twenty existing multi-modal machine translation works, summarizing each in one sentence (this paragraph is worth reading as a mini-survey), and then review several earlier translation refinement methods:

  • Methods based on iterative optimization:
    • Hardmeier et al., 2012: explores the context of the entire document via hill-climbing to make local improvements at the sentence level;
    • Novak et al., 2016: predicts discrete substitutions in translation drafts using attention mechanisms;
    • Lee et al., 2018: uses a non-autoregressive method, focusing on speeding up decoding;
  • Methods that learn a separate refinement model (these require additional training data, i.e., pairs of draft translations and their corrected references):
    • Niehues et al., 2016;
    • Junczys-Dowmunt and Grundkiewicz, 2017;
    • Chatterjee et al., 2018

Model

The model has a two-pass decoding structure. The first-pass decoder and the encoder before it follow the standard Transformer architecture and are used to generate translation drafts. The second-pass decoder uses a deliberation network, which extends the standard end-to-end architecture with an extra decoder.

Deliberation networks were proposed in "Deliberation Networks: Sequence Generation Beyond One-Pass Decoding" and were also used in "Achieving Human Parity on Automatic Chinese to English News Translation".

In addition to the encoder's hidden states, the second-pass decoder uses the output of the first-pass decoder (the best candidate selected by beam search) and the image input (see the architecture figure in the paper).

The authors argue that images are needed only in a few cases, so visual information is added only to the second-pass decoder. A learnable matrix projects the visual features into vectors that are concatenated with the encoder's output.
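As a concrete illustration of this step, here is a minimal PyTorch sketch of a learnable projection followed by concatenation; the class name, dimensions, and defaults are assumptions for the example, not details from the paper.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Project visual features to the model dimension and append them
    to the encoder output (the concatenation described above)."""
    def __init__(self, vis_dim=2048, model_dim=512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, model_dim)  # the learnable matrix

    def forward(self, enc_out, vis_feats):
        # enc_out:   (batch, src_len, model_dim)
        # vis_feats: (batch, n_vis, vis_dim), spatial or object features
        vis = self.proj(vis_feats)               # (batch, n_vis, model_dim)
        return torch.cat([enc_out, vis], dim=1)  # append along the sequence axis
```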

The authors design three visual input strategies (a shape-level sketch follows the list):

  • Att: spatial visual features extracted by a CNN, taken from the last convolutional layer of ResNet-50;
  • Sum: object detection is performed with an off-the-shelf detector trained on the Open Images dataset, and each detected object is represented by a 545-dimensional vector;
  • Obj: object detection is likewise performed, and each object category is represented by a pre-trained 50-dimensional word vector.
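Tying the three strategies back to the projection sketch above: each one only changes the visual feature matrix that gets projected, so each needs its own input dimension. The batch size, object counts, and grid shape below are assumptions for illustration, and the snippet reuses the hypothetical VisualProjection class from the earlier sketch.

```python
import torch  # assumes VisualProjection from the sketch above is in scope

# Dummy features for a batch of 4 images (shapes assumed for illustration):
spatial = torch.randn(4, 49, 2048)  # Att: 7x7 grid from ResNet-50's last conv layer
objects = torch.randn(4, 10, 545)   # Sum: 10 detected objects, 545-dim each
labels = torch.randn(4, 10, 50)     # Obj: 50-dim embeddings of object categories

# One projection per strategy, since the input dimensions differ.
att_proj = VisualProjection(vis_dim=2048, model_dim=512)
sum_proj = VisualProjection(vis_dim=545, model_dim=512)
obj_proj = VisualProjection(vis_dim=50, model_dim=512)

enc_out = torch.randn(4, 20, 512)        # dummy encoder output for 20 source tokens
print(att_proj(enc_out, spatial).shape)  # torch.Size([4, 69, 512])
```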
