Compiled from OpenAI by the Heart of Machine editorial department.

Transformer is a powerful sequence model, but the time and memory it requires grow quadratically with sequence length. OpenAI researchers have developed Sparse Transformer, a deep neural network that sets a new record for predicting long sequences, whether text, images, or sound. Using an improved algorithm in the attention mechanism, the network can extract patterns from sequences roughly 30 times longer than was previously possible.

One of the existing challenges in AI research is modeling long-range, fine-grained dependencies in complex data such as images, video, or sound. Sparse Transformer incorporates an O(N√N) reformulation of the Transformer's O(N^2) self-attention mechanism, among other improvements, so that it can be applied directly to these rich data types. Previously, models used on such data were either tailored to a specific domain or difficult to scale beyond sequences of a few thousand elements.

In contrast, OpenAI's model can model sequences of tens of thousands of elements using hundreds of layers, achieving state-of-the-art performance in many domains. OpenAI researchers are using this model to help build AI systems that better understand the world.

Deep attention

In the Transformer, every output element is connected to every input element, and the weights between them are computed dynamically based on context, a process known as the attention mechanism. While this is believed to make the Transformer more flexible than models with fixed connectivity patterns, in practice it requires creating an N×N attention matrix for every layer and attention head, which can consume enormous amounts of memory when applied to data types with many elements, such as images or raw audio.
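
The scale of the problem can be seen with a quick back-of-the-envelope calculation. The sketch below uses illustrative sizes (a flattened 64×64 RGB image and a 64-layer, 4-head model); the exact figures in OpenAI's chart may differ.

```python
# Memory needed just to store the N x N attention weights of a deep Transformer.
# Sizes here are illustrative, not OpenAI's exact configuration.
n_layers, n_heads = 64, 4
seq_len = 64 * 64 * 3            # a 64x64 RGB image flattened to 12,288 bytes
bytes_per_weight = 4             # float32

per_matrix = seq_len ** 2 * bytes_per_weight              # one N x N matrix
total_gb = n_layers * n_heads * per_matrix / 1e9
print(f"{total_gb:.0f} GB")      # ~155 GB, far beyond a typical 12-32 GB GPU
```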

Attention memory usage of a deep Transformer (64 layers, 4 heads) when the matrices are stored in memory versus recomputed during backpropagation.
As a reference, standard GPU memory for deep learning is typically 12-32 GB.

One way to reduce memory consumption is to recompute the attention matrices from checkpoints during backpropagation, a well-established technique in deep learning that trades extra computation for lower memory usage.
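
As a rough illustration of this technique (gradient checkpointing in general, not OpenAI's specific implementation), the sketch below wraps a hypothetical attention block in PyTorch's torch.utils.checkpoint so its activations are recomputed during the backward pass rather than stored.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A hypothetical attention block; its intermediate activations are what we
# want to avoid storing for the backward pass.
class AttentionBlock(torch.nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + out)

block = AttentionBlock(dim=256, heads=4)
x = torch.randn(1, 1024, 256, requires_grad=True)

# checkpoint() stores only the block's inputs during the forward pass and
# re-runs the block during backpropagation, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```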

When attention in the Transformer is recomputed this way, the maximum memory cost becomes independent of the number of layers, allowing the researchers to train networks much deeper than before. In practice, they found that 128-layer Transformers outperformed shallower networks on benchmark tasks such as CIFAR-10.

To train these deeper models, the researchers made several changes to the Transformer's order of operations and modified the initialization scheme; see the paper for details.

Sparse attention

However, even computing a single attention matrix is impractical for very large inputs. OpenAI therefore uses sparse attention patterns, in which each output position computes weights over only a subset of the input positions. When that subset is small relative to the full input (for example, √N elements instead of N), the attention computation remains tractable even for very long sequences, and the overall complexity is O(N√N) instead of O(N^2).
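
A minimal sketch of the idea follows. The particular subset used here (the most recent √N positions for each query) is only illustrative and is not the pattern OpenAI uses; it simply shows why attending to roughly √N keys per query gives O(N√N) total work.

```python
import math
import torch

# Each of the N query positions attends to only ~sqrt(N) key positions, so the
# total work is O(N * sqrt(N)) rather than O(N^2). The subset chosen here (the
# most recent sqrt(N) positions) is illustrative only, not OpenAI's pattern.
def subset_attention(q, k, v, block):
    n, d = q.shape
    out = torch.empty_like(q)
    for i in range(n):
        lo = max(0, i + 1 - block)                      # last `block` positions
        scores = q[i] @ k[lo:i + 1].T / math.sqrt(d)
        out[i] = torch.softmax(scores, dim=-1) @ v[lo:i + 1]
    return out

n, d = 1024, 64
block = int(math.sqrt(n))                               # ~32 keys per query
q, k, v = (torch.randn(n, d) for _ in range(3))
print(subset_attention(q, k, v, block).shape)           # torch.Size([1024, 64])
```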

To assess the feasibility of the approach, the researchers first visualized the attention patterns learned by deep Transformers on images and found that many of them exhibited interpretable, structured sparsity. Each image below shows which input pixels (highlighted in white) a given attention head processes in order to predict the next value in the image. When the attended inputs are concentrated on a small subset and show high regularity, the layer is a good candidate for sparsification. Below is a sample from a 128-layer model on CIFAR-10 images:

Learned attention patterns (highlighted in white) for several layers of a 128-layer CIFAR-10 network. Left: Layer 19; right: Layer 20. These layers learned to split attention across two dimensions: Layer 19 aggregates the information in each row and Layer 20 aggregates it by column, together factorizing the full attention computation.

Several layers learned to access a positional memory (left: Layer 6; right: Layer 36), typically attending to similar positions regardless of the input data or time step (Layer 6). Other layers learned highly data-dependent access patterns (Layer 36).

While many layers show sparse structure, some clearly exhibit dynamic attention that spreads across the entire image. To preserve the network's ability to learn such patterns, the researchers implemented a two-dimensional factorization of the attention matrix, in which the network can attend to all positions through two steps of sparse attention.

The first version, strided attention, is roughly equivalent to each position attending to its own row and its own column, similar to the attention patterns learned by the network above. (Note that column attention is equivalent to attending to rows of the transposed matrix.) The second version, fixed attention, attends to a fixed column and to the elements after the latest column element, a pattern the researchers found useful when the data does not fit a two-dimensional structure, such as text.
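
For concreteness, the sketch below builds boolean masks that loosely follow these two patterns; details such as head assignment and the extra hyperparameters in the paper are omitted, so treat it as an approximation rather than the exact factorization.

```python
import torch

def strided_mask(n, stride):
    i = torch.arange(n).unsqueeze(1)                 # query positions
    j = torch.arange(n).unsqueeze(0)                 # key positions
    causal = j <= i
    row = (i - j) < stride                           # the position's own "row"
    col = (i - j) % stride == 0                      # every stride-th earlier position
    return causal & (row | col)

def fixed_mask(n, stride):
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    causal = j <= i
    same_block = (i // stride) == (j // stride)      # elements after the latest block boundary
    summary = (j % stride) == (stride - 1)           # a fixed set of "summary" columns
    return causal & (same_block | summary)

print(strided_mask(16, 4).int())
print(fixed_mask(16, 4).int())
```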

Experimental results

Sparse Transformer sets new state-of-the-art density estimation scores on the CIFAR-10, Enwik8, and ImageNet 64 datasets.

Density estimation performance (bits per byte/dim) on the CIFAR-10, Enwik8, and ImageNet 64 datasets.
M denotes the number of parameters (in millions), W the network width, L the number of layers, and H the number of heads.

The researchers also found that sparse attention achieved lower loss and converged faster than full attention. This may point to a useful inductive bias from the sparsity patterns, or to an underlying optimization problem with dense attention.

Generating images

Transformers with sparse attention appear to grasp a notion of global structure, which can be assessed qualitatively by examining image completions. The figure below visualizes a model trained on 64×64 ImageNet:

Corrupted originals

Model completions

Real images

The researchers also generated fully unconditional samples with an unadjusted softmax temperature of 1.0. These models are trained with a maximum likelihood objective, which covers all modes of the data, including potentially nonexistent ones, rather than increasing the fidelity of a smaller subset of the data. Sampling from the model with an unadjusted temperature reveals the full distribution of images the model believes exist in the world; as a result, some samples look odd.
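
For reference, sampling with an unadjusted temperature simply means drawing from the model's predicted distribution as-is; the sketch below (with hypothetical logits over 256 byte values) shows that step, where a temperature of 1.0 leaves the learned distribution unchanged.

```python
import torch

# Illustrative only: drawing the next byte from a model's output distribution.
# At temperature 1.0 the learned distribution is used unchanged, so samples
# cover every mode the model assigns probability to.
def sample_next(logits, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(256)        # hypothetical logits over 256 byte values
print(sample_next(logits).item())
```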

Model samples

Generating raw audio waveforms

Sparse Transformer can also be adapted to generate raw audio rather than images simply by changing the position embeddings. As deep learning extends to new data types, it is easy to specify inductive biases with this class of network.
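
As a loose illustration of that swap (the sizes and exact embedding scheme below are assumptions, not the paper's configuration), an image model might sum row and column embeddings while an audio model only needs a single time-step embedding:

```python
import torch

d_model = 256

# Image model: sum a row embedding and a column embedding (channel embedding
# omitted for brevity). Sizes are illustrative.
rows, cols = 64, 64
row_emb = torch.nn.Embedding(rows, d_model)
col_emb = torch.nn.Embedding(cols, d_model)

# Audio model: a single time-step embedding over raw waveform positions.
audio_len = 65536
time_emb = torch.nn.Embedding(audio_len, d_model)

pos = torch.arange(16)                                   # a few example positions
image_pe = row_emb(pos // cols) + col_emb(pos % cols)
audio_pe = time_emb(pos)
print(image_pe.shape, audio_pe.shape)                    # both (16, 256)
```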

The model was trained on raw clips of classical music and used sparse attention to generate sequences of length 65,000. This corresponds to roughly five seconds of raw audio, and several samples are concatenated in each clip below.

Code released

Implementing sparse attention typically requires partitioning the query and key matrices into blocks, so to simplify experimentation, OpenAI implemented a set of block-sparse kernels that perform these operations efficiently on the GPU. OpenAI has open-sourced these kernels and provides example sparse attention functions: github.com/openai/spar…
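
To show what "block-sparse" means here without relying on those GPU kernels, the sketch below computes attention scores only for block pairs allowed by a block-level layout; it is a plain-PyTorch illustration of the idea, not OpenAI's kernel API.

```python
import math
import torch

# Scores are computed only for (query block, key block) pairs allowed by a
# block-level layout mask; everything else stays at -inf and is never touched.
def block_sparse_scores(q, k, layout, block):
    n, d = q.shape
    scores = torch.full((n, n), float("-inf"))
    for qi in range(n // block):
        for ki in range(n // block):
            if layout[qi, ki]:                            # skip disallowed blocks
                qs = slice(qi * block, (qi + 1) * block)
                ks = slice(ki * block, (ki + 1) * block)
                scores[qs, ks] = q[qs] @ k[ks].T / math.sqrt(d)
    return scores

n, d, block = 256, 64, 32
layout = torch.tril(torch.ones(n // block, n // block)).bool()   # causal block layout
q, k = torch.randn(n, d), torch.randn(n, d)
print(block_sparse_scores(q, k, layout, block).shape)            # torch.Size([256, 256])
```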

Future development and limitations

The sparse attention patterns introduced in this paper are only a preliminary attempt at efficiently modeling long sequences. The researchers believe it is worth exploring different patterns and combinations of sparse attention, and that learning sparsity patterns is a particularly promising direction for the next generation of neural network architectures.

Even with the improvements described above, autoregressive sequence generation remains impractical for very high-resolution images and audio. However, the optimized attention operations described by the researchers may be useful when combined with other methods, such as multi-scale approaches, for modeling high-dimensional data.

Paper: Generating Long Sequences with Sparse Transformers

Paper link: d4mucfpksywv.cloudfront.net/Sparse_Tran…

Abstract: Transformer is a powerful sequence model, but the time and memory it requires grow quadratically with sequence length. This paper introduces sparse factorizations of the attention matrix that reduce this cost to O(N√N). The study also proposes a) architecture and initialization variants for training deeper networks, b) recomputation of the attention matrices to save memory, and c) fast attention kernels for training. The researchers call networks with these changes Sparse Transformers and demonstrate that they can model sequences of tens of thousands of time steps using hundreds of layers.

The same architecture is used to model images, audio, and text from raw bytes, achieving state-of-the-art density estimation performance on the Enwik8, CIFAR-10, and ImageNet-64 datasets. The unconditional samples generated by the researchers demonstrate global coherence and great diversity, and show that in principle self-attention can be used to model sequences of length one million or more.

Reference link:
openai.com/blog/sparse…