This paper, Dynamic Context-Guided Capsule Network for Multimodal Machine Translation, was published at ACM MM 2020. The first author is Huan Lin, a master's student (class of 2019) at Xiamen University, and the corresponding author is Jinsong Su.

Motivation

Previous multimodal machine translation (MMT) models generally process visual information in one of three ways:

  1. The global visual features of the image are taken as the visual context.

    [Paper Notes] Attention-based Multimodal Neural Machine Translation (juejin.cn)

  2. Extracting visual context using attention mechanism;

    Here, either the attention mechanism is used at each time step to extract visual context from the image:

    Doubly-Attentive Decoder: a classic of multimodal attention (juejin.cn)

    Or the visual features can be directly used as hidden states on the source side.

    Incorporating Global Visual Features into Attention-based NMT (juejin.cn)

  3. Learn multimodal joint representations.

    Graph-based Multimodal Fusion Encoder: when GNNs meet multimodal machine translation (juejin.cn)

    Rereading UNITER: pre-training a general representation-learning model to bridge the semantic gap between image and text

    In Multimodal Machine translation, visual information should not dominate.

However, the attention mechanism lacks semantic interaction between modalities, and the capacity of a single attention step is limited, so it cannot accommodate enough visual information. In the other two approaches, the fixed visual context is ill-suited to modeling the changing visual information needed while generating the translation. This paper proposes a dynamic context-guided capsule network (DCCN) to address these problems.

Capsule Network: What is a capsule network? (bilibili)

Method

At each decoding time step, the source context vector for that step is first generated with standard source-target attention. DCCN then takes this context vector as input and uses it to guide the iterative extraction of the relevant visual context during dynamic routing, after which the multimodal context is updated. Two parallel DCCNs are introduced to extract visual features of different granularities (global and regional visual features), yielding two multimodal context vectors that are fused and fed into the decoder to predict target words.
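To make the dataflow concrete, here is a minimal PyTorch sketch of one decoding step as described above. The module names, the gated fusion, and the vocabulary size are illustrative assumptions rather than the paper's exact implementation, and the two DCCN modules are left as placeholders (a routing sketch appears in the DCCN section below).

```python
import torch
import torch.nn as nn

class MultimodalDecoderStep(nn.Module):
    def __init__(self, d_model=256, vocab_size=32000):
        super().__init__()
        self.src_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.dccn_global = nn.Identity()    # placeholder: DCCN over global visual features
        self.dccn_regional = nn.Identity()  # placeholder: DCCN over regional visual features
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, dec_state, enc_out):
        # 1) standard source-target attention -> source context for this time step
        src_ctx, _ = self.src_attn(dec_state, enc_out, enc_out)
        # 2) each DCCN uses src_ctx to guide extraction of a multimodal context
        #    (a real DCCN would also consume the corresponding visual features)
        m_glb = self.dccn_global(src_ctx)
        m_reg = self.dccn_regional(src_ctx)
        # 3) fuse the two multimodal contexts with a gate and predict the target word
        g = torch.sigmoid(self.gate(torch.cat([m_glb, m_reg], dim=-1)))
        multimodal_ctx = g * m_glb + (1 - g) * m_reg
        return self.out(dec_state + multimodal_ctx)

# Usage with a batch of 2, a 10-token source, and one decoder position.
step = MultimodalDecoderStep()
logits = step(torch.randn(2, 1, 256), torch.randn(2, 10, 256))
```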

Model

Overview

The model is mainly based on the Transformer, with two DCCNs added. The model structure is as follows:

The encoder has the same structure as the normal Transformer encoder and will not be described here.

The decoder is an extension of the Transformer decoder. The first $L_d - 1$ layers are the same; the only difference is the last layer, which adds two DCCNs to learn multimodal representations. The first DCCN receives global visual features, extracted by ResNet-50 and projected onto a 196×256 matrix. The second DCCN receives regional visual features. Notably, the regional visual features here are not image embeddings: an R-CNN detector identifies image regions and predicts, for each region, a probability distribution over more than 1600 Visual Genome categories, and each region is then represented as the probability-weighted sum of the corresponding class word embeddings.
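As a rough illustration of this feature preparation, the snippet below shows one way the two inputs could be shaped. The 14×14×1024 ResNet-50 feature map, the 36 detected regions, and the embedding table for the ~1600 Visual Genome classes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

d_model, n_classes = 256, 1600

# Global features: a ResNet-50 feature map (assumed 14x14x1024) flattened to
# 196 spatial positions and projected to the model dimension -> (196, 256).
resnet_map = torch.randn(14 * 14, 1024)          # stand-in for ResNet-50 output
global_proj = nn.Linear(1024, d_model)
global_feats = global_proj(resnet_map)           # (196, 256)

# Regional features: for each detected region, the R-CNN class distribution over
# ~1600 Visual Genome categories weights the corresponding class word embeddings.
class_embed = nn.Embedding(n_classes, d_model)   # word embeddings of class labels
probs = torch.softmax(torch.randn(36, n_classes), dim=-1)  # 36 regions (assumed)
regional_feats = probs @ class_embed.weight      # (36, 256), weighted sum per region
```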

DCCN

A DCCN consists of low-level capsules, high-level capsules, and multimodal context capsules. The low-level capsules first extract visual features from the image input; the high-level capsules then extract the corresponding visual context from those features under the guidance of the multimodal context capsules (via Pearson correlation coefficients). The overall structure is shown as follows:

The specific process is as follows:

  1. Input: the source context of the current time step and the image input;
  2. Low-level capsule initialization: initialize each low-level capsule u_i with the image input;
  3. Multimodal context capsule initialization: initialize each multimodal context capsule m_j with the source context;
  4. Map each u_i with a weight matrix W_ij (matching the dimensions of the low-level and high-level capsules) to obtain the intermediate prediction vector û_{j|i};
  5. ρ_ij measures the correlation between m_j and u_i and is used to update b_ij; the coupling coefficients c_ij are then obtained from b_ij via softmax and determine how the intermediate layer is aggregated into the high-level capsule layer;
  6. The high-level capsule layer is then used to recompute the multimodal context capsules m_j, which in turn update ρ_ij;
  7. Output: the multimodal context (a routing sketch follows this list).
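Below is a minimal, self-contained sketch of context-guided dynamic routing following the steps above. It is an illustrative approximation, not the authors' implementation: the capsule counts, dimensions, the shared projection matrix, the form of the context-capsule update, and the final pooling are all assumptions, and the Pearson correlation is computed between m_j and the prediction vectors û_{j|i} so that the dimensions match.

```python
import torch

def pearson(a, b, eps=1e-8):
    """Pearson correlation coefficient between two equally sized vectors."""
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def squash(v, eps=1e-8):
    """Standard capsule squashing non-linearity (applied along the last dim)."""
    n2 = (v * v).sum(-1, keepdim=True)
    return (n2 / (1 + n2)) * v / (n2.sqrt() + eps)

def dccn_routing(image_feats, src_ctx, n_high=8, d_high=256, iters=3):
    """Context-guided dynamic routing sketch.

    image_feats: (n_low, d_low) visual inputs for the low-level capsules u_i
    src_ctx:     (d_high,) source context vector of the current time step
    """
    n_low, d_low = image_feats.shape
    # W_ij; shared across low-level capsules here to keep the sketch small
    W = torch.randn(n_high, d_low, d_high) * 0.01

    u = image_feats                                 # low-level capsules u_i
    m = src_ctx.expand(n_high, d_high).clone()      # multimodal context capsules m_j
    u_hat = torch.einsum('id,jdk->ijk', u, W)       # prediction vectors u_hat_{j|i}
    b = torch.zeros(n_low, n_high)                  # routing logits b_ij

    for _ in range(iters):
        # rho_ij: correlation between context capsule m_j and u_hat_{j|i}
        rho = torch.stack([torch.stack([pearson(u_hat[i, j], m[j])
                                        for j in range(n_high)])
                           for i in range(n_low)])
        b = b + rho
        c = torch.softmax(b, dim=-1)                      # coupling coefficients c_ij
        v = squash(torch.einsum('ij,ijk->jk', c, u_hat))  # high-level capsules
        m = squash(m + v)                                 # update context capsules (assumed form)

    return m.mean(dim=0)                                  # pooled multimodal context (assumed)

# Usage: 196 global visual positions guided by a 256-d source context vector.
multimodal_ctx = dccn_routing(torch.randn(196, 256), torch.randn(256))
```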

In summary, the authors compute the correlation between the source-sentence context and the image features to drive a dynamic routing mechanism that guides the iterative extraction of visual features. This effectively extracts the visual information needed for translation and integrates it into the multimodal context used to decode the target sentence.

Experiments

Results

The authors selected several widely used MMT models as baselines:

  • Transformer: Plain-text machine translation based on the Transformer;
  • Encoder-attention: Adds an encoder-side visual attention mechanism to the Transformer;
  • Doubly-attention: Introduces an additional visual attention sublayer to exploit the visual features;
  • Stochastic attention: A stochastic, sampling-based attention mechanism that attends to only one spatial position of the image at each time step;
  • Imagination: A multi-task learning model that combines translation with visually grounded representation prediction;
  • Fusion-conv: Uses a single feedforward network to compute attention alignment between the visual features and the target-side hidden state at each time step, considering all spatial positions of the image when deriving the context vector;
  • Trg-mul: Uses element-wise multiplication to modulate each target word's embedding with the visual features;
  • Latent Variable MMT: Designs a latent variable for MMT, which can be viewed as a multimodal stochastic embedding of the image and its target-language description;
  • ____ Network: Based on the translate-and-refine strategy, where the visual features are used by the decoder only in the second stage.

The results of the English-German translation task are as follows:

At the cost of 1M extra parameters, DCCN outperforms almost all other models on the three datasets.

  • Better than Encoder-attention: Encoder-attention uses static source hidden states to extract the visual context, whereas DCCN uses the source-side context of the specific time step. Moreover, the dynamic routing mechanism of the capsule network can produce a higher-quality multimodal context by computing correlations between modalities.
  • Better than Doubly-attention: Doubly-attention also uses the source-side context of a specific time step, but DCCN still achieves better results, which demonstrates the effectiveness of the dynamic routing mechanism.

Ablation Study

The authors carried out ablation experiments on the main innovations of the paper.

The results show that both regional and global visual features are useful, and that the dynamic routing mechanism extracts both kinds of visual features better than the traditional attention mechanism does. In row 7, the authors remove the context guidance from dynamic routing and use a standard capsule network to extract visual features; performance drops sharply, indicating that dynamically extracting visual features at different time steps is also effective.

Summary

This paper applies a capsule network to MMT and uses the source-sentence context to dynamically guide visual feature extraction at different time steps. Experimental results show that the proposed scheme makes full use of the semantic interaction between modalities to extract effective visual information and achieves state-of-the-art results on the MMT task.