Zero version

Lucene-Core version 8.8.2

An introduction to the

Lucene’s Index design basically relies on disk storage, and inverted Index relies on a large number of redundant data to complete word segmentation search technology, so Lucene used a lot of time-space data compression technology in the design, in order to ensure that the minimum disk resources can be stored in the most data. Vint is one of the interesting designs.

II Technical Principles

1 profile

A normal int in Java occupies four bytes.

However, when the value of int is between -128 and 127, only one byte is needed to fit the space. The other three bytes are meaningless redundant (the other several bytes can represent the interval and so on), and there are not many cases where the four bytes can actually be used. Variant Int means Variant Int. Its essence is to distribute according to need, reducing this redundancy.

2 byte indicates a bit

A normal byte has eight data-significant bits, whereas a VInt has only seven, and the highest bit becomes the indicator bit of the next byte.

  • The highest byte is 1, which means that the next byte is still the current data
  • The highest bit is 0, which means there’s no more data

3 side effects of VInt

  • For positive numbers, there are only seven data bits, so when the int value is large, 5 bytes may be needed to represent the current data (this problem cannot be solved, and VInt feels it needs to be solved, since this is not often the case in real production).
  • For negative numbers, the highest bit is 1 and cannot be compressed (Zigzag encoding is introduced)

4 zigzag encoding

Moves the first sign bit to the last bit of data using the shift and XOR operations.

Three Demo

If you need to serialize the three int numbers 1/200 / -1 respectively, then the steps of VInt algorithm are as follows:

1 Binary

  • The binary number of 1 is 00000000 00000000 00000001
  • The binary number of 200 is 00000000 00000000 00000000 11001000
  • The binary number of -1 is 1111111111 1111111111 11111110

We move the 2 one place forward, we add the 0

  • 1 The binary number after processing is 00000000 00000000 00000010
  • The binary number processed by 200 is 00000000 0000000000001 10010000
  • The binary number processed by -1 is 1111111111 1111111111 11111100

XOR operation

The essence of the XOR operation is that difference is 0 and similarity is 1.

  • For positive numbers, exclusivity or a 1111111111 11111111 11111111

    • The processing expression for 1 is 00000000 00000000 00000010 ^ 1111111111 11111111 1111111111 = 00000000 00000000 00000010;
    • The processing expression for 200 is 00000000 0000000000001 10010000 ^ 1111111111 11111111 11111111 11111111 = 00000000 0000000000001 10010000
  • For negative numbers, exclusivity or a 00000000 00000000 00000000 00000000

    • The processing expression for -1 is 1111111111 1111111111 1111111111 11111100 ^ 00000000 00000000 00000000 = 00000000 00000000 00000000 00000011

Octaves process numbers as a unit

The data is read in eight bits. When eight bits are read, the first bit is regarded as the marker bit. If there is any other data, the first eight bits are read.

  • For the number 1

    • Serialization process:

      • First read seven bits 0000010, before all 0, there is no data, then the first fill 0, 00000010
    • Read procedure:

      • Read the serialized data 00000010. The first byte is 0, which means there is only one byte, and there is no other data after it, so the data is 00000010
      • The last digit is 0, which represents a positive number. The XOR operation with 11111111 results in 00000010
      • Move the data back one bit, and the front end is filled with 0, and the final value is 00000001
  • For the number 200

    • Serialization process:

      • First read seven bits 0010000, not all of the previous 0, then first add 1, 10010000
      • Then read seven bits 0000011, before all 0, then the front is added 0, 00000011
      • The combined data is 10010000 00000011
    • Read procedure:

      • Read the serialized data 10010000, the first is 1, represents more than one byte, followed by other data, so the data is 0010000
      • Read 00000011 again, the first is 0, means no other data to come, the data is 0000011
      • The combined data is 00000001 10010000
      • The last digit is 0, which represents a positive number. The XOR operation with 11111111 11111111 yields 00000001 10010000
      • Move the data back one place, and the front end is filled with 0, and the final value is 00000000 11001000
  • For the number minus 1

    • Serialization process:

      • First read seven bits 0000011, before all 0, there is no data, then the first fill 0, 00000011
    • Read procedure:

      • Read the serialized data 00000011. The first byte is 0, which means there is only one byte, and there is no other data after it, so the data is 00000011
      • The last digit is 1, which represents a negative number, and the XOR operation with 00000000 results in 11111100
      • Move the data back one place, and the front end is filled with 1, so the final value is 11111110

Four source

1 writeZInt

writeZInt(…) Methods in the org. Apache. Lucene. Store. DataOutput:

Public final void writeZInt(int I) throws IOException {BitUtil. ZigzagenCode (I) throws IOException {BitUtil Zigzag encoding writeVInt (BitUtil zigZagEncode (I)); } public final void writevInt (int I) throws IOException {while ((I & ~0x7F)! = 0) { // writeByte(...) Methods are used to byte persisted to the file, temporarily do not need to pay close attention to writeByte ((byte) ((I & 0 x7f) | 0 x80)); i >>>= 7; } writeByte((byte)i); }

2 zigZagEncode

zigZagEncode(…) Methods in the org. Apache. Lucene. Util. BitUtil:

// I >> 31 for positive numbers returns a full 0 barrier // I >> 31 for negative numbers, Public static int zigzagenCode (int I) {return (I >> 31) ^ (I << 1); }

3 readZInt

readZInt(…) Methods in the org. Apache. Lucene. Store. DataOutput:

public int readZInt() throws IOException { return BitUtil.zigZagDecode(readVInt()); } public Int ReadVInt () throws IOException {// ReadByte b = ReadByte (); If (b BBB = 0) return b; if (b BBB = 0) return b; int i = b & 0x7F; // Continue reading a byte b = ReadByte (); i |= (b & 0x7F) << 7; if (b >= 0) return i; // Continue reading a byte b = ReadByte (); i |= (b & 0x7F) << 14; if (b >= 0) return i; // Continue reading a byte b = ReadByte (); i |= (b & 0x7F) << 21; if (b >= 0) return i; // ReadByte (); // ReadByte (); i |= (b & 0x0F) << 28; if ((b & 0xF0) == 0) return i; throw new IOException("Invalid vInt detected (too many bits)"); }

4 zigZagDecode

zigZagDecode(…) Methods in the org. Apache. Lucene. Util. BitUtil:

public static int zigZagDecode(int i) {
  return ((i >>> 1) ^ -(i & 1));
}