
This paper was uploaded to arXiv in November 2018, and the authors are from Charles University: Input Combination Strategies for Multi-source Transformer Decoder.

Motivation

There has been plenty of work on incorporating visual information into RNN-based machine translation, i.e. RNN-based MMT, but no one had studied how to incorporate additional inputs into a pure Transformer architecture. The authors explore this and also test the effect on a multi-source machine translation task (translating source sentences with the same meaning in several languages into one target-language sentence).

Related Works

In previous work, inputs from different modalities are projected into a shared space, and the resulting vectors are either fed directly into the RNN or combined with a hierarchical attention layer, e.g. HIER (see the paper note on multimodal translation introduced in the previous post on Nuggets / juejin.cn). Others have combined the Transformer with multimodal context vectors through a gating mechanism. There are also other methods that do not carry over to the Transformer, which will not be listed here.

Method

The authors design a series of strategies for fusing multiple inputs into the Transformer architecture and apply them to multi-modal machine translation (MMT) and multi-source machine translation (MSMT) tasks.

Strategies

The authors propose four input-combination strategies, all realized by modifying the Transformer decoder. The first two (Serial and Parallel) treat the independently encoded inputs separately, while the last two (Flat and Hierarchical) model a joint attention distribution over all inputs.

1. Serial

Each input is encoded with its own encoder, and the decoder computes encoder-decoder (cross-) attention over them one after another. The first cross-attention sub-layer uses the context vectors from the preceding self-attention as its query set; each subsequent cross-attention uses the output of the previous sub-layer as queries, with residual connections between sub-layers.
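
As a concrete illustration, here is a minimal PyTorch sketch of the serial scheme. PyTorch itself and all names (SerialCrossAttention, d_model, n_sources) are my own assumptions; the paper does not prescribe an implementation, and details such as the post-norm residual placement simply follow the standard Transformer.

```python
import torch.nn as nn

class SerialCrossAttention(nn.Module):
    """One cross-attention sub-layer per input, applied one after another."""
    def __init__(self, d_model: int, n_heads: int, n_sources: int):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_sources)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_sources))

    def forward(self, x, encoder_states):
        # x: output of the decoder self-attention, shape (batch, tgt_len, d_model)
        # encoder_states: list of encoder outputs, each (batch, src_len_i, d_model)
        for attn, norm, memory in zip(self.attns, self.norms, encoder_states):
            context, _ = attn(query=x, key=memory, value=memory)
            x = norm(x + context)  # residual connection between sub-layers
        return x
```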

2. Parallel

This is similar to Serial, except that instead of stacking the cross-attention sub-layers, the context vectors produced by the individual cross-attentions are summed. All cross-attentions take the output of the self-attention as their query set.
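
Under the same assumptions as the sketch above, the parallel scheme sums the context vectors instead of chaining the sub-layers:

```python
import torch.nn as nn

class ParallelCrossAttention(nn.Module):
    """All cross-attentions query the self-attention output; their contexts are summed."""
    def __init__(self, d_model: int, n_heads: int, n_sources: int):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_sources)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_states):
        # Every cross-attention uses the same queries x, unlike the serial chain.
        contexts = [attn(x, mem, mem)[0] for attn, mem in zip(self.attns, encoder_states)]
        return self.norm(x + sum(contexts))  # one residual over the summed contexts
```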

3. Flat

The hidden states of all encoders are concatenated and used as the keys K and values V of a single cross-attention. Unlike HIER, which is RNN-based and has to project the encoder hidden states into a shared space, the Transformer architecture used here can take these hidden states directly as K and V.
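
A hedged sketch of the flat scheme under the same assumptions (a single cross-attention over the concatenated encoder states; names are illustrative):

```python
import torch
import torch.nn as nn

class FlatCrossAttention(nn.Module):
    """Concatenate all encoder states along the time axis and attend to them once."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_states):
        # All inputs share one key/value set; no projection into a shared space
        # is needed because every encoder already outputs d_model-sized states.
        memory = torch.cat(encoder_states, dim=1)  # (batch, sum(src_len_i), d_model)
        context, _ = self.attn(x, memory, memory)
        return self.norm(x + context)
```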

4. Hierarchical

First, cross-attention is computed over each input independently, producing one context vector per input; then a second attention is computed over these per-input contexts to combine them.
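
A hedged sketch of the hierarchical scheme under the same assumptions; here the combination step is expressed as another MultiheadAttention whose keys and values are the per-input context vectors (the paper's exact parameterization of this second attention may differ):

```python
import torch
import torch.nn as nn

class HierarchicalCrossAttention(nn.Module):
    """Attend to each input independently, then attend over the per-input contexts."""
    def __init__(self, d_model: int, n_heads: int, n_sources: int):
        super().__init__()
        self.input_attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_sources)
        )
        self.combine_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_states):
        # First level: one context vector per input and per target position.
        contexts = [attn(x, mem, mem)[0] for attn, mem in zip(self.input_attns, encoder_states)]
        stacked = torch.stack(contexts, dim=2)  # (batch, tgt_len, n_sources, d_model)
        b, t, s, d = stacked.shape
        # Second level: each target position attends over its n_sources contexts.
        queries = x.reshape(b * t, 1, d)
        keys_values = stacked.reshape(b * t, s, d)
        combined, _ = self.combine_attn(queries, keys_values, keys_values)
        return self.norm(x + combined.reshape(b, t, d))
```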


So what are the results? See the next post~ (●'◡'●)