Summary: When we develop mid- and back-end applications or middleware, we often keep some data in memory to speed up access. As the volume of that data grows, besides simply placing it on the heap, we can also relieve memory pressure by compressing it in real time. This article introduces a way to compress an integer array.

Author: XuanYin (Alibaba) | Source: Alibaba Technology official account


### 1. Data compression

Arrays here means long[] or int[], both widely used in Java. Their memory footprint becomes a problem at scale: a long takes 8 bytes and an int 4 bytes, so tens of millions of numbers occupy hundreds of megabytes.
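As a rough check on those figures, here is a back-of-envelope sketch (payload bytes only, ignoring the array object header; the class and method names are my own):

```java
// Back-of-envelope footprint of raw primitive arrays (payload only).
public class Footprint {
    static long bytesForLongs(long count) {
        return count * 8; // a long is 8 bytes
    }

    static long bytesForInts(long count) {
        return count * 4; // an int is 4 bytes
    }

    public static void main(String[] args) {
        // 50 million longs -> 400,000,000 bytes, i.e. roughly 400 MB
        System.out.println(bytesForLongs(50_000_000L));
    }
}
```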

#### 1. Remove redundancy

The first idea is to shrink the space each number occupies. For positive numbers, the low 3 bytes of an int already cover 2^24, roughly 16.7 million distinct values, so most real values never touch the fourth byte at all. We cannot simply truncate the high bytes statically, though, because some values do need them. The feasible approach is to strip the high bytes dynamically: store each number in only as many bytes as it actually needs, together with a record of how many bytes that is (see figure: fundamentals of number compression).

Figure: Fundamentals of number compression

There are two common ways to record how many bytes a number occupies. The first is to store the byte count in the number itself: for a long we only need to borrow 3 bits to cover every possible length (2^3 = 8), and since values above 2^60 are vanishingly rare in practice, sacrificing those bits has essentially no downside. The second is to use the highest bit of each byte as a continuation flag indicating whether more bytes follow (see Figure 2); this is the varint scheme Facebook uses in Thrift to compress data on the wire. Either way, we can compress a long[] or int[] into a byte[], and later restore each number from the length information stored alongside it.

Method of identifying data size when decompressing
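To make the first scheme concrete, here is a minimal codec sketch. The layout (top 3 bits of the first byte hold the length, bit 4 the sign, the low nibble the highest bits of the magnitude) follows what the article's later inflate code implies; the class and method names are my own, and magnitudes are assumed to fit in 60 bits.

```java
public class NumberCodec {
    // First byte: bits 7-5 = total length in bytes, bit 4 = sign,
    // bits 3-0 = highest nibble of the magnitude.
    static byte[] encode(long v) {
        boolean neg = v < 0;
        long mag = neg ? -v : v;
        int len = 1;
        while ((mag >>> (4 + 8 * (len - 1))) != 0) {
            len++; // grow until the remaining high bits fit the first nibble
        }
        byte[] out = new byte[len];
        long rest = mag;
        for (int i = len - 1; i >= 1; i--) {
            out[i] = (byte) rest;
            rest >>>= 8;
        }
        out[0] = (byte) ((len << 5) | (neg ? 0x10 : 0) | (int) rest);
        return out;
    }

    static long decode(byte[] b) {
        int len = (b[0] & 0xFF) >>> 5;   // length lives in the top 3 bits
        long mag = b[0] & 0x0F;          // low nibble of the first byte
        for (int i = 1; i < len; i++) {
            mag = (mag << 8) | (b[i] & 0xFF);
        }
        return (b[0] & 0x10) != 0 ? -mag : mag;
    }
}
```

With this layout a value like 5 occupies a single byte instead of 8, while large values still round-trip intact.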

The compression idea above works well in transmission scenarios, where data is consumed sequentially from front to back. But what if we want the compressed structure to still support array-style access by subscript?

The difficulty is this: when every number had a fixed length, the memory address of element N was simply `N * bytesPerElement`, an O(1) computation. After compression every number occupies a different amount of space, so that formula no longer works and we cannot locate the Nth number directly. Must we scan through 200 numbers linearly just to reach index 200? That is clearly inefficient, collapsing lookup from O(1) to O(n). Is there a better way? Of course: we can build an index (as shown below), that is:

- Split the numbers into buckets; the bucket size is tunable (for example 1KB or 4KB per bucket)
- Keep a second array, one slot per bucket, storing the index of the first number in each bucket
- On lookup, first binary-search that index array to find the right bucket, then scan linearly within the bucket for the target number

Figure: An index speeds up retrieval from the compressed structure
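A sketch of the bucket-index lookup (names are illustrative): `firstIndex[b]` holds the global index of the first number stored in bucket b, so a binary search tells us which bucket holds any target index.

```java
import java.util.Arrays;

public class BucketLookup {
    // firstIndex[b] = global index of the first number in bucket b.
    static int findBucket(int[] firstIndex, int target) {
        int pos = Arrays.binarySearch(firstIndex, target);
        // Exact hit: target is the first number of bucket pos. Otherwise
        // binarySearch returns -(insertionPoint) - 1, and the target lives
        // in the bucket just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

After `findBucket`, only the linear scan inside that single bucket remains.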

Since the linear scan is confined to one bucket, it never covers much data; 1KB or 4KB per bucket works well (buckets should not be too small either, since each bucket's array object alone costs 16B of overhead). The initial binary search over the index does most of the work, so lookup speed improves markedly, approaching O(logN). In tests with 400 million random numbers, this approach reduced the space used by around 30%, and a simple benchmark reached roughly 1,000,000 TPS, plenty for single lookups.

Figure: Effect of compression

#### 2. Offset storage

Since lookup within a bucket is sequential anyway, we can store only the bucket's first element as a full number and store every later element as an offset from it. When the data is not wildly discrete (that is, not jumping between a dozen and billions), this shrinks things nicely: if two numbers each take 3 bytes, the second may need only 1 or 2 bytes once stored as an offset. And if the order of the array itself does not matter, sorting it first makes the offsets smaller still. In most cases the size can be reduced by more than 70%.
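A sketch of the offset idea, under the assumption (consistent with the get() code shown later, which adds the bucket's first value back on) that offsets are taken relative to the bucket's first element; names are illustrative:

```java
public class OffsetBucket {
    // Keep the first element verbatim; every later element becomes an
    // offset from that first element.
    static long[] toOffsets(long[] bucket) {
        long[] out = bucket.clone();
        for (int i = 1; i < out.length; i++) {
            out[i] = bucket[i] - bucket[0];
        }
        return out;
    }

    // Bytes needed for a magnitude, for eyeballing the savings.
    static int bytesNeeded(long v) {
        long mag = Math.abs(v);
        return Math.max(1, (64 - Long.numberOfLeadingZeros(mag) + 7) / 8);
    }
}
```

For a bucket like {1000000, 1000123, 1000456}, the later values shrink from 3 bytes each to 2 bytes as offsets.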

### 2. Access optimization

When the scheme above was trialed in a production application, single random gets were unaffected, but the batch sequential-get interface dropped by as much as 97%, from over 2,800 TPS to 72. The cause is clearly the TPS ceiling of random gets: as the figure "Effect of compression" shows, random gets top out around 1,000,000 TPS, yet each batch operation performs 10,000 to 30,000 gets, each going through the random-access interface, so the batch interface can only manage a two-digit TPS of 72. We therefore needed to optimize access to the compressed data much more aggressively, and adopted the following measures.

#### 1. Use variable-length buckets

The binary search was one limiting factor. Previously we used fixed-length buckets (every bucket holds the same number of bytes) with a variable count of numbers per bucket. If we instead use variable-length buckets that each hold exactly N numbers, we can hit the bucket containing any number directly with a division and a remainder, finding the bucket in O(1) rather than the O(logN) of binary search. After testing, the batch interface's TPS rose to about 320, a more-than-4x improvement, a clear win.

Figure: Variable-length buckets fix the element count per index block, enabling direct lookup
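The division-plus-remainder lookup is tiny; a sketch with an illustrative bucket size N:

```java
public class FixedCountIndex {
    static final int N = 1024; // numbers per bucket (tunable, illustrative)

    static int bucketOf(int ix) {
        return ix / N; // which bucket holds element ix: O(1), no search
    }

    static int posInBucket(int ix) {
        return ix % N; // how many numbers to skip inside that bucket
    }
}
```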

#### 2. Write a dedicated iterator

A batch get is really a traversal, but it used to be done one element at a time through the get interface, which recomputes the bucket position and offset and rescans the bucket from its start on every call, a heavy cost under a large volume of requests. Since the access pattern is sequential, we can design a dedicated iterator instead: on initialization it records the current bucket position, and each subsequent call simply advances by the current number's length to reach the next value, an O(1) step. With the iterator, the batch interface's TPS improved to about 680.
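A minimal sketch of such an iterator, assuming the first-byte layout described earlier (top 3 bits = length in bytes, bit 4 = sign, low nibble = high bits of the magnitude); for simplicity it omits the per-bucket offset bookkeeping the article's real iterator carries, and the names are my own:

```java
public class SequentialIterator {
    private final byte[][] shards; // compressed buckets
    private int shardIndex = 0;
    private int numPos = 0;        // byte position inside the current bucket

    SequentialIterator(byte[][] shards) {
        this.shards = shards;
    }

    long next() {
        if (numPos >= shards[shardIndex].length) { // step into the next bucket
            shardIndex++;
            numPos = 0;
        }
        byte[] shard = shards[shardIndex];
        int len = (shard[numPos] & 0xFF) >>> 5;
        long value = decode(shard, numPos, len);
        numPos += len; // O(1) advance: skip exactly len bytes
        return value;
    }

    private static long decode(byte[] shard, int pos, int len) {
        long mag = shard[pos] & 0x0F;
        for (int i = 1; i < len; i++) {
            mag = (mag << 8) | (shard[pos + i] & 0xFF);
        }
        return (shard[pos] & 0x10) != 0 ? -mag : mag;
    }
}
```

No bucket math or rescanning happens per element; the cursor just slides forward.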

#### 3. Reduce intermediate data by passing references on the stack

In the original decompression path, we copied data out of the bucket and handed the copy to the decoder, creating a large amount of intermediate data on the heap; we also performed repeated ByteBuffer wrap operations, each allocating a new ByteBuffer object, which is quite costly. Since this is a read-only path and data deletion is not currently supported, we can simply pass a reference to the bucket's data, together with an offset and length on the stack, straight to the decompression function, which is much faster.

The code before the change is shown below; its main logic is:

- Compute the number's bucket and its offset within the bucket, and wrap the bucket as a ByteBuffer
- Linearly scan the byte array through the wrapped ByteBuffer, using the stored lengths to step to the target number
- Copy the number's bytes into a temporary array, based on the length held in its top three bits
- Pass the temporary array to the inflate method to decode the value

```java
public long get(int ix) {
    int i = ix / SHARD_SIZE;
    ix %= SHARD_SIZE;
    ByteBuffer buf = ByteBuffer.wrap(shards[i]);
    long offset = 0;
    if (ix > 0) {
        // The bucket's first number is the base; later numbers are offsets
        int len = (Byte.toUnsignedInt(buf.get(0)) >>> 5);
        byte[] bag = new byte[len];
        buf.get(bag, 0, len);
        offset = inflate(bag);
    }
    // Reset position and skip ix numbers to reach the target
    buf.position(0);
    int numPos = 0;
    while (ix > 0) {
        int len = (Byte.toUnsignedInt(buf.get(numPos)) >>> 5);
        numPos += len;
        ix -= 1;
    }
    buf.position(numPos);
    int len = (Byte.toUnsignedInt(buf.get(numPos)) >>> 5);
    byte[] bag = new byte[len];
    buf.get(bag, 0, len);
    return offset + inflate(bag);
}

private static long inflate(byte[] bag) {
    byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};
    int n = bag.length - 1;
    int i;
    for (i = 7; n >= 0; i--) {
        num[i] = bag[n--];
    }
    int negative = num[i + 1] & 0x10;
    num[i + 1] &= 0x0f; // clear the length and sign bits
    return negative > 0 ? -ByteBuffer.wrap(num).getLong()
                        : ByteBuffer.wrap(num).getLong();
}
```

The modified code:

```java
public long get(int ix) {
    int i = ix / SHARD_SIZE;
    ix %= SHARD_SIZE;
    byte[] shard = shards[i];
    long offset = 0;
    if (ix > 0) {
        int len = (Byte.toUnsignedInt(shard[0]) >>> 5);
        offset = inflate(shard, 0, len);
    }
    int numPos = 0;
    while (ix > 0) {
        int len = (Byte.toUnsignedInt(shard[numPos]) >>> 5);
        numPos += len;
        ix -= 1;
    }
    int len = (Byte.toUnsignedInt(shard[numPos]) >>> 5);
    return offset + inflate(shard, numPos, len);
}

private static long inflate(byte[] shard, int numPos, int len) {
    byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};
    // Copy the number's bytes straight out of the shard; no ByteBuffer
    System.arraycopy(shard, numPos, num, num.length - len, len);
    int i = num.length - len;
    int negative = num[i] & 0x10;
    num[i] &= 0x0f; // clear the length and sign bits
    return negative > 0 ? -longFrom8Bytes(num) : longFrom8Bytes(num);
}
```

Comparing the two, the main change is removing the intermediate bag array: the bytes are read by reference directly from the original shard. Where we previously went through a ByteBuffer to fetch bytes from the bucket, we now index shard[i] directly, which is much more efficient. In tests, this optimization improved performance by around 45%, pushing TPS past 910.

#### 4. Change heap data to stack data

This change is a little trickier. Some intermediate variables can be avoided with the approach above, but others cannot: when we finally decode a number, we need somewhere to hold the value being built, which is what the `byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};` statement provided. That array lives on the heap; if we use a plain long instead, the same 8 bytes live on the stack. The only thing left to figure out is how to operate on an individual byte of a long. For example, to set the byte at index 2 of a long to 0x02, we just need:

```java
longData = (longData & ~(0xFFL << 2 * 8)) | (0x02L << 2 * 8);
```

One more detail: values pulled directly out of a byte[] are signed in Java, so shifting them as-is is affected by the sign bit; we must first convert to an unsigned value with `0xff & byteAry[i]`. The optimized inflate method ends up as follows:

```diff
 private static long inflate(byte[] shard, int numPos, int len) {
-    byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};
-    System.arraycopy(shard, numPos, num, num.length - len, len);
-    int i = num.length - len;
-    int negative = num[i] & 0x10;
-    num[i] &= 0x0f;
-    return negative > 0 ? -longFrom8Bytes(num) : longFrom8Bytes(num);
+    long data = 0;
+    for (int i = 0; i < len; i++) {
+        data |= (long) Byte.toUnsignedInt(shard[numPos + i]) << (len - i - 1) * 8;
+    }
+    long negative = data & (0x10L << (len - 1) * 8);
+    // Clear the length and sign bits that share the highest data byte
+    data &= ~(0xf0L << (len - 1) * 8);
+    return negative > 0 ? -data : data;
 }
```

All heap allocations are now gone, and there is a side benefit: the old code had to convert its temporary array into a long via the longFrom8Bytes method, whereas now the value can be returned directly, optimizing further. After testing, performance improved by another 35%, bringing TPS to about 1,250.
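The signedness point above is worth one more illustration (a sketch; the class and method names are my own):

```java
public class ByteWidening {
    static long widenSigned(byte b) {
        return b;        // sign-extends: the high bit of the byte is copied up
    }

    static long widenUnsigned(byte b) {
        return 0xff & b; // masks to 0..255, as the inflate code requires
    }
}
```

A byte like `(byte) 0x90` widens to -112 without the mask but to 144 with it, which is why every shard read in the decoder goes through `0xff &` (or `Byte.toUnsignedInt`).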

#### 5. Inline short functions

Every function call pushes and pops a stack frame, which also costs time. The loss is negligible in ordinary code but gets magnified in batch operations. Looking at the earlier code, the get path calls an updateOffset function that is so simple it can be inlined directly, at the cost of one extra line of code. The function looks like this:

```java
private void updateOffset() {
    byte[] shard = shards[shardIndex];
    int len = (0xff & shard[0]) >>> 5;
    curOffset = inflate(shard, 0, len);
}
```

Inlined, it becomes:

```diff
 if (numPos >= shard.length) {
     shardIndex++;
     numPos = 0;
-    updateOffset();
+    shard = shards[shardIndex];
+    curOffset = inflate(shard, 0, (0xff & shard[0]) >>> 5);
 }
```

The same applies to other one-liners such as Byte.toUnsignedInt(int): simply copy the line of logic out and eliminate the method call.

### 3. Performance results

In the end, the batch interface's TPS rose to about 1,380, nearly 20 times the initial 72. There is still a performance gap versus the uncompressed version, but it is now the same order of magnitude. Counting the roughly 50,000-fold amplification of each batch, sequential single gets reach an effective 65 million TPS, and random single gets reach 2.6 million TPS, which fully meets production needs.

### 4. Optimization summary

From the optimizations above, we can draw a few lessons:

- Java's primitive data structures are much faster than object structures; the more low-level the representation, the faster it runs
- Allocating data on the heap is slow, and at high call frequencies it also triggers frequent Young GC, which hurts execution speed considerably; prefer the stack wherever the heap can be avoided
- Method calls are slower than direct operations because of stack-frame push and pop, so copying a few lines of logic inline, as with Byte.toUnsignedInt(), is much faster; of course, only do this when chasing extreme performance
- Reduce intermediate data and temporary variables
- Any small performance loss is multiplied enormously in bulk operations, so be especially careful with batch interfaces

This article is the original content of Aliyun, shall not be reproduced without permission.