Self-developed string compression algorithm

An overview of the

In development, there are often scenarios where an online stack is reported to analyze and deal with an online problem, so compression and encryption of the stack are also essential. Encryption: Can use AES symmetric encryption algorithm, compression: can use protobuf’s inherent compressibility to compress strings during upload.

However, in order to save traffic and improve transmission efficiency, this can be ensured by compressing the data before stack uploading. The following to introduce a kind of author explored his own compression string algorithm, and its own encryption effect.

Algorithm is introduced

This algorithm uses scenarios: string compression with a finite character set.

For example, the compression of fully qualified Java method names, for fully qualified methods, consists of uppercase and lowercase letters, numbers, and special characters. In the development process, a standard and qualified class name, method name needs to be known by name, according to valid statistics, the method is restricted to more than 99% of upper and lower English letters.

Algorithm implementation

Brief introduction of compression Principle

The free bits of a char character are used to store valid data. For example, by mapping a to z to numbers 1 to 26 and grouping Char types into 5 bits, one digit is stored (this digit represents a character).

Create a string header structure:Head

In Java code, the proportion of uppercase letters in a fully qualified string is relatively small, so you record the uppercase letters in a fully qualified string by using a precomplement character. If a string is finite and immutable, the relative positions of the characters they are composed of are fixed. The implementation algorithm is as follows:

public char[] build(String s) {
            ...
    for (int i = 0; i < len; i++) {
        c = s.charAt(i);
        b = Character.isUpperCase(c);
        if (b || c == FILL) {
            if (i - lastIndex >= maxDistance) {
                maxDistance = i - lastIndex;
            }
            upCharIndex.add(i - lastIndex);
            lastIndex = i;
       }
    if(b) upCharCount++; }...return handleHead(type);
}

Copy the code

The first step before compression: At the beginning of the string, save and record the position of the capital letters and the distance between each capital letter. (The decimal point is considered to be a capital letter).


private char[] handleHead(int type) {
        ...
    int k, j;
    // Record the position of the uppercase letter in char
    for (int i = 0; i < chars.length; i++) {
        if (i == 0) {
            for (j = 0, k = 1; j < ch1; j++, k++) {
                ch = bitToLeft(ch, upCharIndex.get(j), 12 - (k * stepDistance));
            }
            chars[i] = ch;
        } else {
            char emptyCh = FILL;
            emptyCh &= 0;
            int start = (i - 1) * sizeOfChar + ch1;
            for (j = start, k = 1; j < start + sizeOfChar; j++, k++) {
                if (j == upCharIndex.size())
                    break;
                emptyCh = bitToLeft(emptyCh, upCharIndex.get(j), 16- (k * stepDistance)); } chars[i] = emptyCh; }}return chars;
}

Copy the code

The minimum length of the Head is 1 Char (16 bits). Storage step size is 2 bits high at 16 bits. The next two bits record the actual size of the Head.

Head length: The minimum length of the head is 1 Char, which records information about the step size and head length. At present, the longest filling length is 3+1, and the Head length can be extended through the step size algorithm. Extension methods: getTypeBySize, getSizeByType

When storing the location of uppercase letters, fill it in step size. For example, if the step size is 3, that means one capital letter position is stored for every three bits.
HeadThe length depends on how many steps are filled. For example, to fill 10 positions of step size 3, it would take 16% for 3 to equal 5, so it would take twoChar.

Step size: Step size is a variable quantity. In algorithm design, the following step size types are provided: (According to statistics, the longest English word: 45 characters)

STEP_0: indicates that there are no uppercase letters

STEP_3: indicates the uppercase letter distance (0,8), and the step length is 3

STEP_15: indicates the distance between uppercase letters [8,16], and the step length is 4

STEP_OVER_15: indicates the distance between uppercase letters [16,63], and the step length is 6

Create compressed string content:Content

Content compression is done by storing one character in each of the high, middle, and low parts of a Char. The implementation of specific FormatUtil. ContentBuilder:

Padding: Because strings are not all multiples of 3. To ensure the integrity of the original string, fill the original string with a certain number of characters before splitting the string to ensure that the string can be divided into groups of three characters.


public String handleString(String s) {
    int f;
    if ((f = s.length() % 3) != 0) {
        StringBuilder sBuilder = new StringBuilder(s);
        for (f = 3 - f; f > 0; f--)
            sBuilder.append(FILL);
        s = sBuilder.toString();
    }
    return s.toLowerCase();
}

Copy the code

Split replacement: After filling, split the original string into groups of three. For numbers or special characters in strings, there is no mapping in the Mapping file, so once they appear, “MASK” is used instead.

public short buildShort(char high, char mid, char low) {
    short b = 0;

    b |= getShortFromMapping(high) << 10;
    b |= getShortFromMapping(mid) << 5;
    b |= getShortFromMapping(low);
    return b;
}

public short getShortFromMapping(char ch) {
    if (mapping.containsKey(ch))
        return mapping.get(ch);
    return mapping.get(MASK);
}
Copy the code

Create a compressed string

Head + Content = Compressed string.

conclusion

In the early days of the algorithm’s conception, the theoretical compression efficiency was 66% : three ChArs were stored in one Char, but the compression rate should be only about 50% based on the total compression rate of the final package size. The reasons for this are as follows:

Not all strings are multiples of 3, with extra padding
The compressed character is not a correct ASCII code. In the process of encoding and decoding the character set in Java, it is considered as a Chinese character, and one character at a time will be decoded into two characters.

Complete code welcome to comment on the message, guide learning!

Self-developed string compression algorithm

An overview of the

Algorithm is introduced

Algorithm implementation

Brief introduction of compression Principle

Create a string header structure:Head

Create compressed string content:Content

Create a compressed string

conclusion

Related Posts

ASM has been strongly inserted into Android from a precision test -Gradle plugin

Weex line trampling pits record

Silicon Valley-based Internet unicorn Evernote is recruiting for ios/ Android