Introduction: When we develop backend applications or middleware, we often keep data in memory to speed up access. As the volume of data grows, the memory pressure can be relieved by real-time compression, in addition to moving data off-heap. Today we introduce a way to compress an integer array.

Author: XuanYin | Source: Ali Technology official account

I. Data Compression

Arrays of type long[] or int[] are widely used in Java. When the amount of data is large, their memory footprint becomes a problem: a long occupies 8 bytes and an int 4 bytes, so tens of millions of numbers occupy hundreds of megabytes (for example, 50 million longs take 50,000,000 × 8 B ≈ 400 MB).

1. Removing redundancy

The first idea that comes to mind is reducing the space each number occupies. For non-negative values, 3 bytes already cover every int below 2^24 (about 16 million). Beyond that the fourth byte is needed, but most values never use it; we cannot simply cut it off statically, in case some value does need it. So the feasible idea is to strip the unused high-order bytes dynamically: store each number with only as many bytes as it actually needs, and record how many that is (see figure: Basic principles of number compression).

Figure: Basic principles of number compression

There are two ways to record how many bytes a number occupies. The first is to store the byte count in the number itself: we only need to borrow 3 bits (since 2^3 = 8) from the high end of the long. Values above 2^60 are so large that they are almost never used, so borrowing these bits causes essentially no harm. The second is to use the highest bit of each byte as a continuation flag indicating that more data follows (see Figure 2); this is the variable-length encoding Facebook uses in Thrift to compress transmitted data. In short, we want to compress a long or int array into a byte array, and then restore each number from the length information stored in that byte array.

Figure 2: How to identify a number's size when decompressing
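To make the first scheme concrete, here is a minimal encoder sketch consistent with the inflate decoder shown later in this article. The method name compress and the stated capacity limit are our assumptions, not code from the article: the top 3 bits of the first byte record the total byte count, bit 0x10 records the sign, and the remaining bits hold the magnitude, high byte first.

    // Sketch only: encodes one value in the header format described above.
    // Magnitudes over 52 bits (6 payload bytes + 4 header bits) and
    // Long.MIN_VALUE are out of scope for this sketch.
    static byte[] compress(long value) {
        long mag = value < 0 ? -value : value;
        int len = 1;                                  // total bytes, header included
        while ((mag >>> (4 + (len - 1) * 8)) != 0) {
            len++;                                    // grow until the magnitude fits
        }
        byte[] out = new byte[len];
        long rest = mag;
        for (int i = len - 1; i > 0; i--) {           // low-order magnitude bytes
            out[i] = (byte) rest;
            rest >>>= 8;
        }
        // Header byte: 3-bit length | sign bit | top 4 magnitude bits
        out[0] = (byte) ((len << 5) | (value < 0 ? 0x10 : 0) | (int) (rest & 0x0f));
        return out;
    }

For example, compress(1_000_000) yields the 3 bytes {0x6F, 0x42, 0x40}, and compress(-5) yields the single byte 0x35.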

The compression ideas above work well for access in transport scenarios, because data there is consumed sequentially, first in, first out. But what if we need the compressed structure to still support access by array subscript?

The difficulty is this: when every number had a fixed length, the memory address of the Nth number was simply N × (bytes per number), an O(1) computation. After compression every number occupies a different amount of space, so that formula fails, and we no longer know where the Nth number lives. To read the number at subscript 200, must we really scan linearly past the first 200 numbers? Obviously such efficiency is quite low; the time complexity degrades from O(1) to O(n). Is there a better way? Of course. We can build an index (as shown below; a short code sketch follows the figure), that is:

  • Divide the numbers into buckets of adjustable size (e.g. 1 KB per bucket, 4 KB per bucket, etc.)

  • Use another array, with one entry per bucket, to store the subscript of the first number in each bucket

  • To look up a subscript, first binary-search this index array to find the bucket, then scan linearly within the bucket to find the data

Figure: An index speeds up subscript retrieval on compressed data
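A minimal sketch of the index lookup described above, with assumed names (firstIndexOfBucket holds the subscript of each bucket's first number; it is not the article's code):

    // Binary search: find the last bucket whose first subscript is <= ix.
    static int findBucket(int[] firstIndexOfBucket, int ix) {
        int lo = 0, hi = firstIndexOfBucket.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;     // bias upward to guarantee progress
            if (firstIndexOfBucket[mid] <= ix) {
                lo = mid;                      // ix is in this bucket or a later one
            } else {
                hi = mid - 1;                  // ix is in an earlier bucket
            }
        }
        return lo;                             // then scan linearly inside this bucket
    }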

Since linear lookup happens only within a single bucket, which is usually not large, 1 KB or 4 KB (it should not be too small either, since each bucket's array pointer itself occupies 16 B), and the initial binary search over the index solves most of the problem, search speed improves significantly, approaching O(log N). (In what follows, W is the Chinese unit 万, i.e. 10,000, so 100W TPS means roughly one million operations per second.) Testing showed that the space occupied by 400 million random numbers can be reduced by about 30%, and a simple test reached TPS on the 100W level, more than enough for single-value access.

Figure: Effect of compression

2. Offset storage

Since lookup within a bucket is sequential by nature, we can store the full value only for the first element of each bucket, and store every element after it as an offset (delta) from that first element. When the data is not wildly discrete (that is, not jumping between a dozen and billions), this is a good way to shrink the data: where two numbers each occupied 3 bytes, after offset encoding the second may need only 1 or 2 bytes. And if the order of the array itself does not matter, you can sort it first, which lets offset encoding reach its full effect; in most cases the size can be cut by more than 70%. A sketch of this scheme follows.
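For illustration, here is a sketch of offset storage within one bucket, reusing the hypothetical compress() encoder from earlier (names are ours, not the article's code):

    import java.util.ArrayList;
    import java.util.List;

    // The first number is stored in full; every later number is stored as its
    // delta from that first number, which usually needs fewer bytes.
    static List<byte[]> encodeBucket(long[] numbers) {
        List<byte[]> entries = new ArrayList<>();
        entries.add(compress(numbers[0]));                  // base value
        for (int i = 1; i < numbers.length; i++) {
            entries.add(compress(numbers[i] - numbers[0])); // offset from the base
        }
        return entries;
    }

For example, with sorted values {1_000_000, 1_000_050, 1_000_200}, the base needs 3 bytes while the deltas 50 and 200 each fit in 2 bytes under the header scheme above.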

II. Access Optimization

We found that single random gets were unaffected, but the batch sequential-get interface dropped by up to 97%, from more than 2800 TPS to 72. Investigation showed the decline comes from hitting the ceiling of random-get TPS: as the figure "Effect of compression" shows, the limit for random gets is about 100W TPS, but in the batch sequential scenario each batch operation performs 1W to 3W individual fetches, and each fetch goes through the random-access interface, so the batch interface can only manage a two-digit TPS of 72. We therefore needed to optimize access to the compressed data in depth, and adopted the following measures.

1. Change fixed-length buckets to variable-length buckets

Binary search was needed before because we used fixed-length buckets (each bucket stores the same number of bytes), so the count of numbers stored in each bucket varied. If we instead use variable-length buckets and let each bucket hold exactly N numbers, we can locate the bucket containing a given subscript directly by division and remainder. Bucket lookup then runs in O(1), much faster than the previous O(log n). After testing, the TPS of the batch interface rose to about 320, an improvement of more than 4x, a clear effect.

Figure: Variable-length buckets fix the per-bucket element count, enabling fast lookup
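The bucket lookup then reduces to two arithmetic operations (SHARD_SIZE is the fixed per-bucket element count, as in the code shown later in this article):

    int bucket = ix / SHARD_SIZE;   // which bucket holds subscript ix
    int within = ix % SHARD_SIZE;   // position of the number inside that bucket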

2. Write a specialized iterator

A batch operation is, in essence, the same as repeated single operations. Previously each element was fetched separately through the get interface, which recomputes the bucket location and the offset every time, then scans the bucket from its start; under a large number of requests this overhead is high. Given that batch access is sequential, we can design a specialized iterator: on initialization it records the current bucket position, and each subsequent call simply advances by the byte length of the current number to reach the next one, with O(1) time complexity. After testing, the TPS of the batch interface rose to about 680. A sketch follows.
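Here is a sketch of such an iterator, consistent with the storage format above but with assumed structure; it reuses the three-argument inflate(byte[], int, int) shown in the optimized code later in this article, and is not the article's own iterator (whose updateOffset helper appears in subsection 5):

    // Sequential-access iterator sketch: remembers the current bucket and byte
    // position so each next() advances in O(1) instead of rescanning.
    class CompressedLongIterator {
        private final byte[][] shards;
        private int shardIndex = 0;
        private int numPos = 0;     // byte position inside the current shard
        private long curOffset;     // base value (first entry) of the current bucket

        CompressedLongIterator(byte[][] shards) {
            this.shards = shards;
            byte[] shard = shards[0];
            curOffset = inflate(shard, 0, (0xff & shard[0]) >>> 5);
        }

        long next() {   // hasNext() omitted for brevity
            byte[] shard = shards[shardIndex];
            int len = (0xff & shard[numPos]) >>> 5;
            // The first entry of a bucket is the base itself; later entries
            // are offsets from it.
            long value = numPos == 0 ? curOffset
                                     : curOffset + inflate(shard, numPos, len);
            numPos += len;
            if (numPos >= shard.length) {       // step into the next bucket
                shardIndex++;
                numPos = 0;
                if (shardIndex < shards.length) {
                    shard = shards[shardIndex];
                    curOffset = inflate(shard, 0, (0xff & shard[0]) >>> 5);
                }
            }
            return value;
        }
    }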

3. Reduce intermediate data by sharing references passed via the stack

In the original decompression flow, we copied data out of the bucket and passed the copy to the inflate function, which generated a large amount of intermediate data on the heap; it also performed many ByteBuffer wrap operations, and each wrap creates a new ByteBuffer object, which is quite time-consuming. Since these are read-only operations, and the structure currently does not support deletion, we can instead reference the data inside the bucket directly and pass the reference (plus offsets) to the decompression function through the stack, which is much faster.

The code before the change is as follows; its main logic is:

  1. Compute the number's bucket and its offset, and wrap the bucket's byte array as a ByteBuffer

  2. Use the wrapped ByteBuffer to parse the byte array linearly, locating the target number within the bucket

  3. Copy the corresponding bytes into a temporary array based on the number's length information (the top three bits of its first byte)

  4. Pass the temporary array into the inflate method to decompress it

    public long get(int ix) {
        int i = ix / SHARD_SIZE;   // locate the bucket (shard)
        ix %= SHARD_SIZE;          // position within the bucket
        ByteBuffer buf = ByteBuffer.wrap(shards[i]);
        // Find the corresponding data offset: the bucket's first entry stores
        // the base value, and every later entry is an offset from it
        long offset = 0;
        if (ix > 0) {
            int len = (Byte.toUnsignedInt(buf.get(0)) >>> 5);
            byte[] bag = new byte[len];
            buf.get(bag, 0, len);
            offset = inflate(bag);
        }
        // Reset position, then skip entry by entry to the requested number
        buf.position(0);
        int numPos = 0;
        while (ix > 0) {
            int len = (Byte.toUnsignedInt(buf.get(numPos)) >>> 5);
            numPos += len;
            ix -= 1;
        }
        buf.position(numPos);
        int len = (Byte.toUnsignedInt(buf.get(numPos)) >>> 5);
        byte[] bag = new byte[len];
        buf.get(bag, 0, len);
        return offset + inflate(bag);
    }

    private static long inflate(byte[] bag) {
        byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};
        // Right-align the payload bytes in an 8-byte buffer
        int n = bag.length - 1;
        int i;
        for (i = 7; n >= 0; i--) {
            num[i] = bag[n--];
        }
        int negative = num[i + 1] & 0x10;   // bit 0x10 of the first byte is the sign
        num[i + 1] &= 0x0f;                 // clear the length and sign header bits
        num[i + 1] |= negative << 63;
        return negative > 0 ? -ByteBuffer.wrap(num).getLong() : ByteBuffer.wrap(num).getLong();
    }

Modified code:

    public long get(int ix) {
        int i = ix / SHARD_SIZE;   // locate the bucket
        ix %= SHARD_SIZE;          // position within the bucket
        byte[] shard = shards[i];
        // Find the corresponding data offset (the bucket's base value)
        long offset = 0;
        if (ix > 0) {
            int len = (Byte.toUnsignedInt(shard[0]) >>> 5);
            offset = inflate(shards[i], 0, len);
        }
        int numPos = 0;
        while (ix > 0) {
            int len = (Byte.toUnsignedInt(shard[numPos]) >>> 5);
            numPos += len;
            ix -= 1;
        }
        int len = (Byte.toUnsignedInt(shard[numPos]) >>> 5);
        return offset + inflate(shards[i], numPos, len);
    }

    private static long inflate(byte[] shard, int numPos, int len) {
        byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};
        // Copy the payload straight out of the shard, right-aligned
        System.arraycopy(shard, numPos, num, num.length - len, len);
        int i = num.length - len;
        int negative = num[i] & 0x10;   // sign bit of the header byte
        num[i] &= 0x0f;
        num[i] |= negative << 63;
        return negative > 0 ? -longFrom8Bytes(num) : longFrom8Bytes(num);
    }

By comparison you can see that the intermediate bag array is gone: the bytes of each number are obtained by referencing the original shard data directly. Where bytes in the bucket were previously fetched through a ByteBuffer, we now index straight into shard[i], which is much more efficient. In tests, this optimization improved performance by around 45%, pushing TPS directly to over 910.

4. Change heap data to stack data

Some intermediate variables can be avoided, as in the approach above, but some cannot. For example, during final decompression we definitely need somewhere to hold the number that will be returned; that is why the first line of inflate is the statement byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};. But consider that this array only ever stores the 8 bytes of a long. If we use a long directly, we effectively get an 8-byte "array" on the stack. All we need to figure out is how to manipulate a specific byte of a long, and that is simple: shift the byte into the corresponding position. For example, to set byte 2 of a long (counting from the lowest byte as byte 0) to 0x02, we just do the following:

    longData = (longData & ~(0xFFL << (2 * 8))) | (0x02L << (2 * 8));

Another detail is that a value fetched directly from byte[] data is a signed byte, so the shift above would be affected by sign extension; we must first convert it to an unsigned value with 0xff & byteAry[i]. A small demonstration of the pitfall (variable names are ours):
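    byte b = (byte) 0x90;                   // a byte with its high bit set
    long wrong = ((long) b) << 8;           // sign-extended: 0xFFFFFFFFFFFF9000
    long right = ((long) (0xff & b)) << 8;  // 0x0000000000009000, as intended

The final optimized inflate method is as follows: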

    private static long inflate(byte[] shard, int numPos, int len) {
    -   byte[] num = {0, 0, 0, 0, 0, 0, 0, 0};
    +   long data = 0;
    -   System.arraycopy(shard, numPos, num, num.length - len, len);
    +   for (int i = 0; i < len; i++) {
    +       data |= (long) Byte.toUnsignedInt(shard[numPos + i]) << (len - i - 1) * 8;
    +   }
    -   int i = num.length - len;
    -   int negative = num[i] & 0x10;
    +   long negative = data & (0x10L << (len - 1) * 8);
    -   num[i] &= 0x0f;
    -   num[i] |= negative << 63;
    +   // Clear the header bits that share the number's first byte
    +   data &= ~(0xf0L << (len - 1) * 8);
    -   return negative > 0 ? -longFrom8Bytes(num) : longFrom8Bytes(num);
    +   return negative > 0 ? -data : data;
    }

Now all heap data declarations are gone. There is also a side benefit: we previously needed to convert the byte array back into a long, which was the job of the longFrom8Bytes method, and now we can return the long directly, optimizing further. Testing showed another 35% performance improvement, bringing TPS to about 1250.

5. Inline short functions

Every function call requires pushing and popping a stack frame, which also takes time. Such losses are negligible in ordinary code but are magnified in the batch case. From the preceding code we can see that the iterator's get path calls an updateOffset function. The function is very simple and can be inlined directly; it is only a few lines of code, as follows:

    private void updateOffset() {
        byte[] shard = shards[shardIndex];
        int len = (0xff & shard[0]) >>> 5;
        curOffset = inflate(shard, 0, len);
    }

Inlined, it becomes:

    if (numPos >= shard.length) {
        shardIndex++;
        numPos = 0;
    -   updateOffset();
    +   shard = shards[shardIndex];
    +   curOffset = inflate(shard, 0, (0xff & shard[0]) >>> 5);
    }

Other one-line helpers such as Byte.toUnsignedInt() can be handled the same way: copy the logic in directly to eliminate the method call.

III. Performance

In the end, the TPS of our batch interface climbed to about 1380, nearly 20 times the initial 72. There is still a performance gap compared with the raw array, but it is on the same order of magnitude. Calculated at a batch amplification of 5W fetches per batch, sequential single gets reach 6500W TPS and random single gets reach 260W TPS, fully sufficient for production needs.

IV. Optimization Summary

From the above optimizations, we can conclude:

  1. Java's primitive data types are much faster than object structures, and the lower-level the representation, the faster it is
  2. Allocating data on the heap is slow, and frequent allocation triggers young GC frequently, which has a considerable impact on execution speed, so prefer the stack over the heap wherever possible
  3. Method calls are slower than direct operations because they require stack-frame pushes and pops, so for simple calls such as Byte.toUnsignedInt() it is much faster to copy the logic inline (when chasing extreme performance, of course)
  4. Reduce intermediate data and temporary variables
  5. Any small performance loss is multiplied under a large amplification factor, so be careful with batch interfaces

This article is the original content of Aliyun and shall not be reproduced without permission.