It took me two weeks to write my last post on the Transformer. The reading went slowly, mostly because of gaps in my own understanding, and I had no time to look at the source code, which I'm a bit ashamed of. If I got anything wrong, I hope readers will point it out.

Having carefully studied Attention and the Transformer beforehand, the Universal Transformer can be understood in a single read. If you're missing that background, please start with these two posts:

  1. [NLP] Attention principle and source code analysis
  2. [NLP] Transformer

The Universal Transformer was proposed to address two shortcomings of the Transformer, one practical and one theoretical (see the previous article). "Universal" here means computationally universal: the model is Turing-complete (see Section 3 of the Transformer post for details). The main change is the addition of recurrence, not recurrence over time, but recurrence over depth. Note that the original Transformer uses a fixed depth of six layers, whereas the Universal Transformer uses a mechanism to control the number of recurrent steps.

1. Model structure


The overall structure of the model is still very similar to the original Transformer, so I won't repeat that interpretation here and will instead focus on the following changes in the Universal Transformer:

1.1 Recurrent mechanism

In the Transformer, the input goes through multi-head self-attention and then enters a fully connected layer. In the Universal Transformer, that fully connected layer is generalized into a Transition function, and the block of self-attention + Transition is applied repeatedly with shared weights:


In the figure, the vertical axis "position" indexes the symbols in a sequence (what would be the time steps of an RNN), while the horizontal axis "time" indexes the order of computation over depth. For example, a sequence $(x_1, \dots, x_m)$ is first represented by its embedding $H^0$, then refined step by step: $H^1 = \text{Transition}(\text{Attention}(H^0))$, $H^2 = \text{Transition}(\text{Attention}(H^1))$, and so on. An RNN must compute the state of one position before it can compute the next, whereas the self-attention here computes all positions of step $t$ in parallel and only then moves on to step $t+1$.

So the output of step $t$ of self-attention + Transition can be written as:

$$A^t = \text{LayerNorm}\big((H^{t-1} + P^t) + \text{MultiHeadSelfAttention}(H^{t-1} + P^t)\big)$$
$$H^t = \text{LayerNorm}\big(A^t + \text{Transition}(A^t)\big)$$

where $P^t$ is the coordinate embedding described in Section 1.2.
Here the Transition function can be a fully connected (position-wise feed-forward) layer as before, or a convolution layer.
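As a concrete illustration, here is a minimal sketch of the recurrent block in PyTorch, assuming a fixed number of steps and a feed-forward Transition. The class name `UTBlock` and the hyperparameters are my own placeholders, not the official tensor2tensor implementation:

```python
import torch
import torch.nn as nn

class UTBlock(nn.Module):
    """One recurrent step: self-attention followed by a Transition, with residuals."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transition: a position-wise feed-forward layer (could also be a conv).
        self.transition = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h, coord_emb):
        # Step t: add the coordinate embedding, then attention, then transition.
        x = h + coord_emb
        a = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(a + self.transition(a))

block = UTBlock()
h = torch.randn(2, 10, 512)            # (batch, seq_len, d_model) after input embedding
for t in range(6):                     # the SAME block (shared weights) is applied T times
    h = block(h, torch.zeros_like(h))  # zero coordinate embedding here for brevity
```

The key difference from the standard Transformer is that the loop reuses one block instead of stacking six blocks with separate weights.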

1.2 Coordinate embeddings

The Transformer only needs to encode each symbol's position for its positional embedding. Here there is an additional time (depth) dimension, so the coordinate embedding is added again at every recurrent step; this is not shown in the figure, and I'd need to check the source code to confirm the details. The embedding formula is:

$$P^t_{i,2j} = \sin\!\big(i / 10000^{2j/d}\big) + \sin\!\big(t / 10000^{2j/d}\big)$$
$$P^t_{i,2j+1} = \cos\!\big(i / 10000^{2j/d}\big) + \cos\!\big(t / 10000^{2j/d}\big)$$

I haven't fully worked out why this particular formulation is used; I'll write more once I have.
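Here is a small sketch of how the formula above could be computed. The function name `coordinate_embedding` is my own, and I assume an even `d_model`:

```python
import torch

def coordinate_embedding(seq_len, t, d_model):
    """Build P^t: sin/cos signals over both position i and timestep t, summed."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)    # (m, 1), position i
    dims = torch.arange(0, d_model, 2, dtype=torch.float)          # even dims, i.e. 2j
    div = torch.pow(10000.0, dims / d_model)                       # 10000^(2j/d)
    emb = torch.zeros(seq_len, d_model)
    emb[:, 0::2] = torch.sin(pos / div) + torch.sin(t / div)       # position + timestep
    emb[:, 1::2] = torch.cos(pos / div) + torch.cos(t / div)
    return emb                                                     # added to H^{t-1} at step t
```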

1.3 Adaptive Computation Time (ACT)

ACT (originally proposed for RNNs) dynamically adjusts the number of computation steps spent on each symbol. A Universal Transformer with the ACT mechanism is called an Adaptive Universal Transformer. The detail to note is that ACT halts each position independently: if position $a$ halts at step $t$, its state is simply copied forward to later steps until the last position halts. A maximum number of steps is also set to avoid an infinite loop.
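Below is a rough sketch of the copy-forward halting behaviour described above, reusing the hypothetical `block` and `coordinate_embedding` from the earlier snippets. The full ACT mechanism in the paper additionally weights states by halting probabilities and adds a ponder cost, which this sketch omits:

```python
import torch
import torch.nn as nn

halt_layer = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())  # per-position halting probability
max_steps, threshold = 8, 0.99

h = torch.randn(2, 10, 512)                        # (batch, seq_len, d_model)
halting_prob = torch.zeros(2, 10)                  # accumulated halting probability
for t in range(max_steps):
    still_running = (halting_prob < threshold).float().unsqueeze(-1)   # 1 = not halted yet
    p = halt_layer(h).squeeze(-1)                  # halting probability of this step
    halting_prob = halting_prob + p * still_running.squeeze(-1)
    new_h = block(h, coordinate_embedding(10, t, 512))
    # Halted positions keep their old state (copied forward); the rest update.
    h = still_running * new_h + (1.0 - still_running) * h
    if (halting_prob >= threshold).all():          # stop early once every position has halted
        break
```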

2. Summary

The Universal Transformer improves on the shortcomings of the Transformer and achieves better results on question answering, language modeling, translation, and other tasks, becoming the new seq2seq state of the art. It has two key features:

  • Weight sharing: an inductive bias is an assumption the model makes about the target function. CNNs and RNNs assume spatial translation invariance and time translation invariance respectively, which is reflected in sharing the convolution kernel's weights across space and the RNN cell's weights across time. The Universal Transformer adds the same kind of assumption: the weight sharing in its recurrent mechanism gives it an inductive bias closer to an RNN's while increasing the model's expressive power.
  • Conditional computation: adding ACT to control the number of computation steps gives better results than a Universal Transformer with fixed depth.

Reading on, there are still many details worth digging into. I've only dug shallowly; if you're interested, look further into topics such as:

computational universality, inductive bias, coordinate embeddings.

That's all.


References:

  1. Universal Transformers
  2. Academia | Google's machine translation model Transformer can now be used for anything