
Reformer is an efficiency-focused improvement to the Transformer, with three main changes:

  1. An LSH-based attention module reduces the attention complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the sequence length
  2. Reversible residual layers replace the standard residual layers, trading extra computation for lower memory use
  3. The input of the feed-forward layer is split into chunks that are processed sequentially, which saves memory (a sketch follows this list)
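
As a concrete illustration of change 3, here is a minimal numpy sketch of chunked feed-forward processing (the function names, shapes, and chunk size are my own assumptions, not the paper's code). Because the feed-forward layer acts on each position independently, the sequence can be processed chunk by chunk, so the full intermediate activation never has to be materialized at once.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward: every position is transformed independently.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def chunked_feed_forward(x, W1, b1, W2, b2, chunk_size=128):
    # Process the sequence in chunks so that the large (chunk_size x d_ff)
    # intermediate activation exists for only one chunk at a time.
    outputs = []
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        outputs.append(feed_forward(chunk, W1, b1, W2, b2))
    return np.concatenate(outputs, axis=0)
```

The result is identical to running the whole sequence at once; only the peak memory changes.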

Locality-sensitive Hashing (LSH)

A matrix R is generated randomly, and the hash value is $h(x) = \arg\max([xR; -xR])$. In this way, nearby points x are hashed into the same bucket with high probability (see the sketch after the list below).

  • This is an existing method called cross-polytope LSH (the "Practical and Optimal LSH for Angular Distance" algorithm cited in the original paper)
  • Its distance metric is angular distance, with all the x's lying on a sphere
  • Since only the argmax matters, the magnitude of R is irrelevant, and R can be regarded as a random rotation
  • The argmax partitions the space into 2b regions (if R has b columns, $[xR; -xR]$ has 2b entries, one region per entry)
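
A minimal numpy sketch of this hash (the function name, bucket count, and random seed are illustrative assumptions):

```python
import numpy as np

def lsh_hash(x, n_buckets, seed=0):
    """Angular LSH: h(x) = argmax([xR; -xR]).

    x is an (n, d) array of row vectors (only their directions matter).
    n_buckets must be even: R has n_buckets // 2 columns, so the
    concatenation [xR; -xR] has n_buckets entries per row.
    """
    d = x.shape[-1]
    R = np.random.default_rng(seed).normal(size=(d, n_buckets // 2))
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

# Nearby directions tend to share a bucket; opposite directions never do.
x = np.array([[1.0, 0.0], [0.99, 0.1], [-1.0, 0.0]])
print(lsh_hash(x, n_buckets=8))
```

In the paper, R has shape $[d_k, b/2]$, matching the n_buckets // 2 columns used here, and a fresh R is drawn for each round of hashing.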

So, intuitively, distant points x and y (top) are likely to be rotated into different regions, while nearby points (bottom) are more likely to end up in the same region.

LSH Attention

The authors apply LSH to the attention mechanism to obtain LSH attention.

Sparsity: the softmax output is dominated by its largest elements, so each query's attention weights depend mainly on the keys most relevant to it. In the attention matrix this shows up as (soft) sparsity; in Figure A, only the highly correlated pairs carry weight.
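
A quick numeric illustration of this dominance (the scores are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([9.0, 8.5, 2.0, 1.0, 0.0])  # attention logits for one query
print(softmax(scores).round(4))               # [0.622  0.3772 0.0006 0.0002 0.0001]
# Almost all of the weight sits on the two largest scores.
```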

Hash buckets: to avoid scanning every key, LSH is used to find the keys k most relevant to each query q, and the softmax over just those keys approximates the full attention weights. The problem is that queries and keys are not evenly distributed across buckets, which makes batched computation awkward (otherwise we are back to a full scan); a bucket may even contain queries but no keys, so attention cannot be computed, or keys but no queries, so information is wasted. Figure B shows the attention matrix after sorting Q and K by bucket.

Shared QK: to solve the uneven-distribution problem, set $k = q / \|q\|$; since the LSH above depends only on angle, this gives $h(k) = h(q)$. In Figure C, the rows and columns both become Q, and each bucket corresponds to an upper triangle.

  • Experiments show that setting K = Q in a conventional Transformer does not hurt performance
  • Another issue with K = Q is that each position tends to attend mostly to itself, so a mask is added to prevent this (see the sketch after this list)
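
A minimal sketch of these two points, shared QK plus a self-attention mask (the helper names are my own):

```python
import numpy as np

def shared_qk_keys(q):
    # k = q / ||q||: keys point in the same directions as the queries,
    # so the angular hash gives h(k) = h(q).
    return q / np.linalg.norm(q, axis=-1, keepdims=True)

def mask_self_attention(scores):
    # With shared QK, q_i . k_i is trivially large, so each position is
    # forbidden from attending to itself.
    L = scores.shape[-1]
    return np.where(np.eye(L, dtype=bool), -1e9, scores)
```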

Chunked computation: let the chunk size be $m = 2L/b$, where $L$ is the sequence length (5 in the figure) and $b$ is the number of buckets (3 in the figure). Attention weights are computed within each chunk; to smooth over bucket boundaries, each chunk actually attends to the union of itself and the previous chunk. In Figure D the matrix is accordingly divided into blocks with a 2:1 width-to-height ratio.
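
Putting the pieces together, here is a simplified, self-contained sketch of single-round, single-head LSH attention (my own simplifications: no causal masking, no multi-round hashing, and the sorted queries double as values). It is an illustration, not the paper's implementation.

```python
import numpy as np

def lsh_attention(q, n_buckets, chunk_size, seed=0):
    """Simplified single-round LSH attention (illustration only)."""
    L, d = q.shape
    k = q / np.linalg.norm(q, axis=-1, keepdims=True)            # shared QK
    R = np.random.default_rng(seed).normal(size=(d, n_buckets // 2))
    qR = q @ R
    buckets = np.argmax(np.concatenate([qR, -qR], axis=-1), axis=-1)

    # Sort positions by bucket so bucket members are adjacent (Figure C),
    # then process fixed-size chunks (chunk_size plays the role of m = 2L/b).
    order = np.argsort(buckets, kind="stable")
    q_s, k_s = q[order], k[order]

    out = np.zeros_like(q)
    for start in range(0, L, chunk_size):
        rows = np.arange(start, min(start + chunk_size, L))
        # Each chunk attends to itself and the previous chunk (Figure D:
        # blocks twice as wide as they are tall).
        cols = np.arange(max(0, start - chunk_size), rows[-1] + 1)
        scores = q_s[rows] @ k_s[cols].T / np.sqrt(d)
        scores = np.where(rows[:, None] == cols[None, :], -1e9, scores)  # mask self
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[rows] = w @ q_s[cols]                                 # q_s used as values
    result = np.empty_like(out)
    result[order] = out                                           # undo the bucket sort
    return result
```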

The above process can also be reviewed from another perspective.

Other

I only cared about LSH, so I did not read the rest carefully. I also don't quite see why RevNet reduces memory; my impression was that backpropagation needs to store gradients rather than function values. For a more detailed interpretation, please refer to the Reformer paper.
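
For reference, here is a minimal sketch of the reversible residual idea (my own illustration, not the paper's code). Because each block's inputs can be reconstructed exactly from its outputs, the forward activations do not need to be kept for the backward pass; they can be recomputed on the fly, which is where the memory saving comes from.

```python
import numpy as np

def rev_block_forward(x1, x2, f, g):
    # Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1).
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_block_inverse(y1, y2, f, g):
    # The inputs are reconstructed exactly from the outputs, so activations
    # need not be stored during the forward pass.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Tiny check with arbitrary functions standing in for attention / feed-forward.
f = lambda x: np.tanh(x)
g = lambda x: 0.5 * x
x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = rev_block_forward(x1, x2, f, g)
r1, r2 = rev_block_inverse(y1, y2, f, g)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```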