
Reformer is an efficiency-focused improvement to the Transformer, with three main changes:

  1. An LSH-based attention module reduces the attention complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the sequence length
  2. Reversible residual layers replace the standard residual layers, trading extra computation for lower memory use
  3. The input of the feed-forward layer is split into chunks that are processed sequentially, which saves memory (a sketch follows this list)
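
As a concrete illustration of change 3, here is a minimal numpy sketch of chunked feed-forward processing (the function names, shapes, and chunk size are my own assumptions, not the paper's code). Because the feed-forward layer acts on each position independently, the sequence can be processed chunk by chunk, so the full intermediate activation never has to be materialized at once.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward: every position is transformed independently.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def chunked_feed_forward(x, W1, b1, W2, b2, chunk_size=128):
    # Process the sequence in chunks so that the large (chunk_size x d_ff)
    # intermediate activation exists for only one chunk at a time.
    outputs = []
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        outputs.append(feed_forward(chunk, W1, b1, W2, b2))
    return np.concatenate(outputs, axis=0)
```

The result is identical to running the whole sequence at once; only the peak memory changes.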

Locality-sensitive Hashing (LSH)

A matrix R is generated randomly, and the hash value is $h(x) = \arg\max([xR; -xR])$. In this way, nearby points x are hashed into the same bucket with high probability (see the sketch after the list below).

  • This is an existing method called cross-polytope LSH (the "Practical and Optimal LSH for Angular Distance" algorithm cited in the original paper)
  • Its distance metric is angular distance, with all the x's lying on a sphere
  • Since only the argmax matters, the magnitude of R is irrelevant, and R can be regarded as a random rotation
  • The argmax partitions the space into 2b regions (if R has b columns, $[xR; -xR]$ has 2b entries, one region per entry)
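
A minimal numpy sketch of this hash (the function name, bucket count, and random seed are illustrative assumptions):

```python
import numpy as np

def lsh_hash(x, n_buckets, seed=0):
    """Angular LSH: h(x) = argmax([xR; -xR]).

    x is an (n, d) array of row vectors (only their directions matter).
    n_buckets must be even: R has n_buckets // 2 columns, so the
    concatenation [xR; -xR] has n_buckets entries per row.
    """
    d = x.shape[-1]
    R = np.random.default_rng(seed).normal(size=(d, n_buckets // 2))
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

# Nearby directions tend to share a bucket; opposite directions never do.
x = np.array([[1.0, 0.0], [0.99, 0.1], [-1.0, 0.0]])
print(lsh_hash(x, n_buckets=8))
```

In the paper, R has shape $[d_k, b/2]$, matching the n_buckets // 2 columns used here, and a fresh R is drawn for each round of hashing.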

So, intuitively, distant points x and y (top) are likely to be rotated into different regions, while nearby points (bottom) are more likely to end up in the same region.

LSH Attention

The authors apply LSH to the attention mechanism to obtain LSH attention.

Sparsity: the softmax output is dominated by its largest elements, so each query's attention weights depend mainly on the keys most relevant to it. In the attention matrix this shows up as (soft) sparsity; in Figure A, only the highly correlated pairs carry weight.
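
A quick numeric illustration of this dominance (the scores are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([9.0, 8.5, 2.0, 1.0, 0.0])  # attention logits for one query
print(softmax(scores).round(4))               # [0.622  0.3772 0.0006 0.0002 0.0001]
# Almost all of the weight sits on the two largest scores.
```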

Hash buckets: to avoid scanning every key, LSH is used to find the keys k most relevant to each query q, and the softmax over just those keys approximates the full attention weights. The problem is that queries and keys are not evenly distributed across buckets, which makes batched computation awkward (otherwise we are back to a full scan); a bucket may even contain queries but no keys, so attention cannot be computed, or keys but no queries, so information is wasted. Figure B shows the attention matrix after sorting Q and K by bucket.

Shared QK: to solve the uneven-distribution problem, set $k = q / \|q\|$; since the LSH above depends only on angle, this gives $h(k) = h(q)$. In Figure C, the rows and columns both become Q, and each bucket corresponds to an upper triangle.

  • Experiments show that setting K = Q in a conventional Transformer does not hurt performance
  • Another issue with K = Q is that each position tends to attend mostly to itself, so a mask is added to prevent this (see the sketch after this list)
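
A minimal sketch of these two points, shared QK plus a self-attention mask (the helper names are my own):

```python
import numpy as np

def shared_qk_keys(q):
    # k = q / ||q||: keys point in the same directions as the queries,
    # so the angular hash gives h(k) = h(q).
    return q / np.linalg.norm(q, axis=-1, keepdims=True)

def mask_self_attention(scores):
    # With shared QK, q_i . k_i is trivially large, so each position is
    # forbidden from attending to itself.
    L = scores.shape[-1]
    return np.where(np.eye(L, dtype=bool), -1e9, scores)
```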

Chunked computation: let the chunk size be $m = 2L/b$, where $L$ is the sequence length (5 in the figure) and $b$ is the number of buckets (3 in the figure). Attention weights are computed within each chunk; to smooth over bucket boundaries, each chunk actually attends to the union of itself and the previous chunk. In Figure D the matrix is accordingly divided into blocks with a 2:1 width-to-height ratio.
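
Putting the pieces together, here is a simplified, self-contained sketch of single-round, single-head LSH attention (my own simplifications: no causal masking, no multi-round hashing, and the sorted queries double as values). It is an illustration, not the paper's implementation.

```python
import numpy as np

def lsh_attention(q, n_buckets, chunk_size, seed=0):
    """Simplified single-round LSH attention (illustration only)."""
    L, d = q.shape
    k = q / np.linalg.norm(q, axis=-1, keepdims=True)            # shared QK
    R = np.random.default_rng(seed).normal(size=(d, n_buckets // 2))
    qR = q @ R
    buckets = np.argmax(np.concatenate([qR, -qR], axis=-1), axis=-1)

    # Sort positions by bucket so bucket members are adjacent (Figure C),
    # then process fixed-size chunks (chunk_size plays the role of m = 2L/b).
    order = np.argsort(buckets, kind="stable")
    q_s, k_s = q[order], k[order]

    out = np.zeros_like(q)
    for start in range(0, L, chunk_size):
        rows = np.arange(start, min(start + chunk_size, L))
        # Each chunk attends to itself and the previous chunk (Figure D:
        # blocks twice as wide as they are tall).
        cols = np.arange(max(0, start - chunk_size), rows[-1] + 1)
        scores = q_s[rows] @ k_s[cols].T / np.sqrt(d)
        scores = np.where(rows[:, None] == cols[None, :], -1e9, scores)  # mask self
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[rows] = w @ q_s[cols]                                 # q_s used as values
    result = np.empty_like(out)
    result[order] = out                                           # undo the bucket sort
    return result
```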

The above process can also be reviewed from another perspective.

Other

I only cared about LSH, so I did not read the rest carefully. I also don't quite see why RevNet reduces memory; my impression was that backpropagation needs to store gradients rather than function values. For a more detailed interpretation, please refer to the Reformer paper.
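
For reference, here is a minimal sketch of the reversible residual idea (my own illustration, not the paper's code). Because each block's inputs can be reconstructed exactly from its outputs, the forward activations do not need to be kept for the backward pass; they can be recomputed on the fly, which is where the memory saving comes from.

```python
import numpy as np

def rev_block_forward(x1, x2, f, g):
    # Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1).
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_block_inverse(y1, y2, f, g):
    # The inputs are reconstructed exactly from the outputs, so activations
    # need not be stored during the forward pass.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Tiny check with arbitrary functions standing in for attention / feed-forward.
f = lambda x: np.tanh(x)
g = lambda x: 0.5 * x
x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = rev_block_forward(x1, x2, f, g)
r1, r2 = rev_block_inverse(y1, y2, f, g)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```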