Stephan Gouws, Research Scientist on the Google Brain team, and Mostafa Dehghani, PhD student at the University of Amsterdam and Google Research intern

Source | Google Developers

Last year, we released the Transformer, a new machine learning model that outperformed existing approaches to machine translation and other language understanding tasks. Before the Transformer, most neural-network-based machine translation methods relied on recurrent neural networks (RNNs), which process a sequence in order using a loop (that is, the output of each step is fed into the next step), for example translating the words of a sentence one after another. Although RNNs are very powerful at modeling sequences, their sequential nature means they are quite slow to train, since longer sentences require more processing steps, and their recurrent structure also makes training more difficult.

In contrast to the RNN-based approach, the Transformer requires no recurrence: it processes all words or symbols in the sequence in parallel while using a self-attention mechanism to incorporate context from distant words. Because it processes all words in parallel and lets each word attend to the other words in the sentence over multiple processing steps, the Transformer trains much faster than recurrent models. Notably, its translation results are also considerably better than those of RNNs. However, on smaller, more structured language understanding tasks, or on simple algorithmic tasks such as string copying (for example, turning the input “abc” into “abcabc”), the Transformer struggles. In contrast, models that do well on these tasks, such as Neural GPUs and Neural Turing Machines, fall short on large-scale language understanding tasks such as translation.
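To make the contrast concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the mechanism the Transformer uses in place of recurrence. The function and variable names are illustrative rather than taken from the actual Tensor2Tensor implementation, and multi-head attention, masking, and position encodings are omitted.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x:   [seq_len, d_model] -- one vector per symbol in the sequence
    w_*: [d_model, d_model] -- learned projection matrices (random here)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project every position in parallel
    scores = q @ k.T / np.sqrt(x.shape[-1])       # all-pairs compatibilities, no recurrence
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # each symbol becomes a mixture of all symbols

# Every position attends to every other position in one parallel step,
# which is why no left-to-right loop over the sequence is needed.
rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)     # (8, 16)
```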

In the Universal Transformer, we extend the standard Transformer to a computationally universal (Turing-complete) model using a novel and efficient time-parallel recurrence, which yields stronger results on a wider range of tasks. We build on the parallel structure of the Transformer to keep its fast training speed, but we replace the Transformer's fixed stack of distinct transformation functions with several applications of a single, time-parallel recurrent transformation function (that is, the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next). Crucially, where an RNN processes a sequence symbol by symbol (left to right), the Universal Transformer processes all symbols at the same time (like the Transformer) and then refines its interpretation of every symbol in parallel, using self-attention, over a variable number of recurrent steps. This time-parallel recurrence is both faster than the sequential recurrence used in RNNs and makes the Universal Transformer more powerful than the standard feedforward Transformer.
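The core difference can be sketched in a few lines of Python. The snippet below is illustrative toy code only (self-attention and the per-step position/timestep signals of the real model are omitted); it contrasts a fixed stack of distinct learned blocks with a single weight-shared block applied a variable number of times.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
h = rng.normal(size=(seq_len, d))   # one state vector per symbol

# Standard Transformer: a fixed stack of *distinct* learned blocks, each applied once.
distinct_blocks = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
out_standard = h
for w in distinct_blocks:
    out_standard = np.tanh(out_standard @ w)      # every block has its own weights

# Universal Transformer: ONE shared block applied repeatedly in a time-parallel loop;
# every step refines all symbols in parallel and feeds its output into the next step.
shared_block = rng.normal(size=(d, d)) * 0.1
num_steps = 4                                     # can vary with the input, unlike a fixed stack
out_universal = h
for _ in range(num_steps):
    out_universal = np.tanh(out_universal @ shared_block)

print(out_standard.shape, out_universal.shape)    # (6, 8) (6, 8)
```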

In each step, the self-attention mechanism is used to exchange information between each symbol (for example, a word in the sentence) and the other symbols, just as in the original Transformer. However, the number of such refinements (that is, the number of recurrent steps) can now either be set manually ahead of time (for example, to a fixed number or to the input length) or decided dynamically by the Universal Transformer itself. To achieve the latter, we added an adaptive computation mechanism at each position that can allocate more processing steps to symbols that are more ambiguous or require more computation.

As an intuitive example of how this helps, consider the sentence “I arrived at the bank after crossing the river”. Here, inferring the most precise meaning of the word “bank” requires more contextual information than the unambiguous “I” or “river”. When we encode this sentence with the standard Transformer, the same amount of computation is applied unconditionally to every word. The Universal Transformer's adaptive mechanism, however, allows the model to spend extra computation only on the more ambiguous words, for example using more steps to integrate the additional contextual information needed to disambiguate “bank”, and fewer steps on words whose meaning is already clear.
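Below is a simplified, illustrative Python sketch of this kind of per-position dynamic halting, in the spirit of Adaptive Computation Time: each symbol accumulates a halting probability and stops being refined once it crosses a threshold, so harder symbols receive more steps. The names and the toy step function are assumptions for illustration; the actual model additionally combines intermediate states using halting weights and a remainder term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_refine(h, step_fn, halt_w, max_steps=8, threshold=0.99):
    """Per-position dynamic halting (simplified ACT-style sketch).

    Each position accumulates a halting probability; once a position crosses the
    threshold it is no longer updated, so more ambiguous symbols get more steps.
    """
    seq_len = h.shape[0]
    halting_prob = np.zeros(seq_len)           # accumulated halting probability per symbol
    steps_taken = np.zeros(seq_len, dtype=int)
    for _ in range(max_steps):
        still_running = halting_prob < threshold
        if not still_running.any():
            break
        p = sigmoid(h @ halt_w)                # per-position halting score for this step
        halting_prob = np.where(still_running, halting_prob + p, halting_prob)
        h = np.where(still_running[:, None], step_fn(h), h)   # refine only unhalted symbols
        steps_taken += still_running
    return h, steps_taken

# Illustrative usage with a toy shared step and random halting weights.
rng = np.random.default_rng(0)
seq_len, d = 5, 8
h = rng.normal(size=(seq_len, d))
w_step = rng.normal(size=(d, d)) * 0.1
halt_w = rng.normal(size=d)
h_out, steps = adaptive_refine(h, lambda x: np.tanh(x @ w_step), halt_w)
print(steps)   # positions that halt later have received more refinement steps
```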

At first glance, having the Universal Transformer repeatedly apply only a single learned function to its input may seem restrictive, especially compared to the standard Transformer, which learns a fixed sequence of distinct functions. But learning how to apply a single function repeatedly means the number of applications (processing steps) can now vary, and this is the crucial difference. Beyond allowing the Universal Transformer to apply more computation to more ambiguous symbols, as discussed above, it also lets the model scale the number of function applications with the overall size of the input (more steps for longer sequences), or decide dynamically how often to apply the function to any given part of the input based on other properties learned during training. This makes the Universal Transformer more powerful in a theoretical sense, because it can effectively learn to apply different transformations to different parts of the input. This is something the standard Transformer cannot do, since it consists of a fixed stack of learned transformation blocks that are each applied only once.

Although the Universal Transformer is more powerful in theory, we also care about empirical performance. Our experiments confirm that Universal Transformers can indeed learn from examples how to copy and reverse strings and how to perform integer addition better than the Transformer or RNNs (though not as well as Neural GPUs). Furthermore, on a diverse set of challenging language understanding tasks, the Universal Transformer generalizes significantly better and achieves new state-of-the-art results on the bAbI linguistic reasoning task and the very challenging LAMBADA language modeling task. But perhaps most interestingly, the Universal Transformer also improves translation quality by 0.9 BLEU over a base Transformer with the same number of parameters, trained in the same way on the same training data. To put this in perspective, when the Transformer was released last year it improved on previous models by about 2 BLEU, so the Universal Transformer's gain amounts to a relative improvement of roughly 50% over that.

The Universal Transformer thus bridges the gap between practical, competitive sequence models for large-scale language understanding tasks (such as machine translation) and computationally universal models (such as Neural Turing Machines or Neural GPUs) that can be trained with gradient descent to perform arbitrary algorithmic tasks. We are excited by the recent progress on time-parallel sequence models, as well as by the addition of computational capacity and recurrence in processing depth, and we hope that further improvements to the basic Universal Transformer presented here will help us build learning algorithms that are more powerful, more data efficient, and that generalize beyond the current state of the art.

If you want to try it yourself, the code used to train and evaluate Universal Transformers can be found in the open-source Tensor2Tensor repository. Note: code link github.com/tensorflow/…

This research was carried out by Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Additional thanks go to Ashish Vaswani, Douglas Eck, and David Dohan for their fruitful comments and inspiration.