What is a StoredField?

As mentioned earlier, Lucene as a search engine stores seven kinds of data: PostingList, TermDict, StoredField, DocValue, TermVector, Norms, and PointValue. The first four are data that any search engine keeps; the last three are specific to Lucene. StoredField is what we usually call the forward data: it is row storage, similar to a row in MySQL, and its job is to keep the most original form of the document. Talking about StoredField also means covering a lot of Lucene fundamentals; once we thoroughly understand it, we basically understand the indexing process and several encoding techniques (vint, packedInt, delta storage, and so on).

This chapter focuses on two questions: 1. How is each data type encoded and stored? 2. How is the whole StoredField compressed when it is flushed to disk?

Demo example

A minimal demo that we can use to step through Lucene with a debugger:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;


public class demo {

    public static void main(String[] args) throws IOException {
        // Specify the analyzer
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Specify a directory
        Directory directory = FSDirectory.open(Paths.get("tempPath"));
        // Specify the config
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        // Create the IndexWriter
        IndexWriter w = new IndexWriter(directory, config);
        // Create the document
        Document doc = new Document();
        doc.add(new TextField("title", "Lucene in action", Field.Store.YES));
        doc.add(new StringField("isbn", "193398817", Field.Store.YES));
        doc.add(new StoredField("visit", 5));
        doc.add(new SortedNumericDocValuesField("visit", 5));
        // Add the document to the IndexWriter
        w.addDocument(doc);
        // Closing the writer triggers the flush
        w.close();
    }
}

The scheduling process for storing data

The call logic for storing data is as follows

A note on open source versus in-house code: open source projects tend to make each class do one thing, so the call hierarchy is very deep (generally seven or eight layers), which fits the philosophy of software engineering, while code written inside companies usually has a call hierarchy of four or five layers and often squashes several steps together. I remember ZhangMiao, an excellent engineer at Baidu, saying, "We think we can write object-oriented code in an object-oriented language, but most of us are loyal process-oriented fans." Reading open source is therefore a good way to learn how to organize complex engineering code.

The first four steps are the logical call layer. The IndexWriter class exposes addDocument, taking the entire Document as an argument. The DocumentsWriter under IndexWriter calls the corresponding updateDocument to update the document, and finally a DocumentsWriterPerThread object is pulled from the pool to execute the final updateDocument logic. Nothing substantial actually happens at this level.

DefaultIndexingChain is a core class responsible for indexing the current document. It defines when to write the posting lists (the inverted index), when to write DocValues, when to write StoredFields, and so on. processDocument is the entry method of the whole indexing chain; it breaks the document apart by field and calls processField for each one:

DefaultIndexingChain processDocument source code

  @Override
  public void processDocument() throws IOException {

    // How many indexed field names we've seen (collapses
    // multiple field instances by the same name):
    // i.e. how many fields this document has; fields with the same name count only once
    int fieldCount = 0;
    // The field generation of this doc, incremented for every document
    long fieldGen = nextFieldGen++;

    termsHash.startDocument();

    startStoredFields(docState.docID);
    try {
      for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount);
      }
    } finally {
      if (docWriter.hasHitAbortingException() == false) {
        // Finish each indexed field name seen in the document:
        // Process all fields in turn
        for (int i = 0; i < fieldCount; i++) {
          fields[i].finish();
        }
        finishStoredFields();
      }
    }

    try {
      termsHash.finishDocument();
    } catch (Throwable th) {
      // Must abort, on the possibility that on-disk term
      // vectors are now corrupt:
      docWriter.onAbortingException(th);
      throw th;
    }
  }

The real core execution logic of the indexing chain is in processField:

DefaultIndexingChain processField source code

  private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException {
    String fieldName = field.name();
    IndexableFieldType fieldType = field.fieldType();

    PerField fp = null;

    if (fieldType.indexOptions() == null) {
      throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
    }

    // Invert indexed fields:
    // Build the inverted index for this field
    if (fieldType.indexOptions() != IndexOptions.NONE) {
      fp = getOrAddField(fieldName, fieldType, true);
      boolean first = fp.fieldGen != fieldGen;
      fp.invert(field, first);
      if (first) {
        fields[fieldCount++] = fp;
        fp.fieldGen = fieldGen;
      }
    } else {
      verifyUnIndexedFieldType(fieldName, fieldType);
    }

    // Add stored fields: write the StoredField for this field
    if (fieldType.stored()) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      if (fieldType.stored()) {
        String value = field.stringValue();
        if (value != null && value.length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
          throw new IllegalArgumentException("stored field \"" + field.name() + "\" is too large (" + value.length() + " characters) to store");
        }
        try {
          storedFieldsConsumer.writeField(fp.fieldInfo, field);
        } catch (Throwable th) {
          docWriter.onAbortingException(th);
          throw th;
        }
      }
    }

    // Build the DocValues
    DocValuesType dvType = fieldType.docValuesType();
    if (dvType == null) {
      throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")");
    }
    if (dvType != DocValuesType.NONE) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      indexDocValue(fp, dvType, field);
    }

    // Index point values
    if (fieldType.pointDataDimensionCount() != 0) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      indexPoint(fp, field);
    }

    return fieldCount;
  }

In this chapter we only care about the middle part, the one that stores the raw data. Notice that in the indexing chain, whether you build the inverted index, write DocValues, or write StoredFields, the work is delegated to a corresponding consumer rather than done by the class itself. The class is called DefaultIndexingChain precisely because the author wants to express that users could define their own indexing chain to control the indexing process; the chain is not responsible for the indexing details, which keeps the two nicely decoupled.

In DefaultIndexingChain, the term vectors are actually written by the TermsHash class, and StoredFields are written by the StoredFieldsConsumer. The StoredFieldsConsumer in turn wraps a StoredFieldsWriter, and the final write is done by that class. Why wrap another writer inside the consumer? The DocumentsWriterPerThread that owns the consumer decides which writer is used. Why do it this way? Lucene has gone through many versions, and the logic that finally writes StoredFields differs between them; Lucene does not want to couple the final write logic to the scheduling logic, so the codec of the DocumentsWriterPerThread decides which writer is used. In the IndexWriterConfig at the call layer, the user can choose whether the final encoding uses the new version of the StoredFieldsWriter or an older one, which is very flexible. In Lucene 7 the default writer is CompressingStoredFieldsWriter, which can be found in the codecs compressing package.
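For example, a minimal sketch of choosing the stored-fields compression mode through the codec, assuming Lucene 7.x (treat the exact class and method names as my recollection, not an authoritative reference):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.codecs.lucene70.Lucene70Codec;
import org.apache.lucene.index.IndexWriterConfig;

public class CodecConfigSketch {
    // Sketch: pick the BEST_COMPRESSION stored-fields mode (format name
    // Lucene50StoredFieldsHigh) instead of the default BEST_SPEED.
    static IndexWriterConfig highCompressionConfig() {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setCodec(new Lucene70Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
        return config;
    }
}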

The above is only scheduling logic, not the real write logic; the real write logic lives in CompressingStoredFieldsWriter.

The coding structure of each type of storedField

CompressingStoredFieldsWriter contains an object named bufferedDocs, which is where the byte stream actually ends up. It is a GrowableByteArrayDataOutput, a class that extends the abstract class DataOutput, and its methods define exactly how each primitive type is written. Remember, nothing hits the disk at this stage; everything stays in memory until a flush happens.

The int type

The method called is GrowableByteArrayDataOutput's writeZInt, where Z stands for zigzag encoding. In other words, given an int value, it is zigzag encoded first and then vint encoded:

Vint encoding

Before we talk about zigzag encoding, we need to understand vint encoding first. The V in vint stands for variable: the value is encoded with just as many bytes as its magnitude requires.

Consider: what if we want to write the int value 20? A plain int takes four bytes (32 bits), three bytes of which are wasted.

With vint, 20 needs only one byte. In every byte the first bit is a marker: 0 means no more bytes are needed to encode the number, and 1 means the next byte continues the encoding. Take the integer 200 as an example:

Each round encodes the lowest 7 bits; if the remaining bits to the left are not all 0, the marker bit is set to 1. So the first byte of vint(200) is 11001000. After the first round, the bits still to be encoded become 00000000 00000000 00000000 00000001. Taking the next 7 bits leaves nothing but zeros, so encoding stops, those 7 bits go into the vint, and the marker bit is set to 0: the second byte is 00000001. In the end only two bytes are needed.

The code is as follows:

  public final void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte) ((i & 0x7F) | 0x80));
      i >>>= 7;
    }
    writeByte((byte) i);
  }

Zint coding

The disadvantage of vint encoding is that it still wastes bytes on negative numbers, because negative numbers are stored in two's complement. Take -5: the sign-magnitude form is 10000000 00000000 00000000 00000101, the one's complement is 11111111 11111111 11111111 11111010, and the two's complement is 11111111 11111111 11111111 11111011. Two's complement is used so that arithmetic on positive and negative numbers produces correct results.

Because the sign bit keeps the high bits from ever becoming all zeros, vint encoding a negative int always takes five rounds, i.e. five bytes, as shown in the diagram below (see the encoding of int -5):

Encoding a negative number therefore costs one byte more than the plain 4-byte int, which is not worth it. The solution to this problem is zigzag encoding:

  1. Generate a mask from the sign bit: all zeros for a positive number, all ones for a negative one
  2. Shift the number left by one bit
  3. XOR the mask with the shifted value (matching bits give 0, differing bits give 1); for -5 the result is 00000000 00000000 00000000 00001001

If it’s a positive number, zigZag coding just moves the sign bit to the end.

The code is very simple:

int zigZagEncode(int i) {
    return (i << 1) ^ (i >> 31);
}

The result is then encoded with vint, which for -5 takes only one byte.

In real workloads most values are small enough to fit in one byte. Zint encoding keeps the byte count low for those values while still handling the occasional large number, at the cost of just one extra flag bit per byte. Google Protobuf uses the same technique to compress transmitted data.
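Putting the two steps together, here is a minimal self-contained sketch of what a writeZInt-style method does (the class and helper names are mine, not Lucene's; the real implementation lives in GrowableByteArrayDataOutput):

import java.io.ByteArrayOutputStream;

public class ZIntSketch {
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);   // low 7 bits with the continuation bit set
            i >>>= 7;
        }
        out.write(i);                       // last byte, continuation bit clear
    }

    static void writeZInt(ByteArrayOutputStream out, int i) {
        writeVInt(out, (i << 1) ^ (i >> 31));  // zigzag first, then vint
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeZInt(out, -5);                 // zigzag(-5) = 9, which fits in one byte
        System.out.println(out.size());     // prints 1
    }
}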

long

In some languages long and dateTime are effectively interchangeable, so a long is more often used to hold a time than a huge integer. Lucene therefore encodes it with writeTLong, where the T stands for timestamp. Let's see how Lucene encodes long values:

For the input value, four cases are distinguished, each with a different header. If the value is divisible by 1000*60*60*24, the timestamp has day precision; if it is divisible by 1000*60*60, hour precision; if it is divisible by 1000, second precision; otherwise it is not compressed. The first two bits of the header mark the case: 11 for day, 10 for hour, 01 for second, 00 for everything else. The value itself is divided by its precision before encoding, so the coarser the precision, the smaller the value to encode.

The remaining six bits of the header are not wasted either. The value is first zigzag encoded (as described earlier, so that the sign bit does not hurt the subsequent vlong encoding), and its lowest five bits are placed in the last five bits of the header. If stripping those five bits still leaves non-zero bits, the third bit of the header is set to 1 and the encoding continues in the following bytes; if nothing is left, that bit stays 0 and the header alone is enough. (This matches the vint/vlong convention where 1 means more bytes follow.)

The following bytes are encoded just like vint/vlong: 7 bits per byte, with the first bit of each byte indicating whether another byte follows.
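A simplified sketch of the logic described above, based on my reading of CompressingStoredFieldsWriter.writeTLong in Lucene 7 (constants inlined and names simplified; not the verbatim source):

import java.io.IOException;
import org.apache.lucene.store.DataOutput;

public class TLongSketch {
    static final long SECOND = 1000L;
    static final long HOUR = 60 * 60 * SECOND;
    static final long DAY = 24 * HOUR;

    // Sketch of the writeTLong logic described above; simplified for illustration.
    static void writeTLong(DataOutput out, long l) throws IOException {
        int header;
        if (l % SECOND != 0) {
            header = 0x00;              // 00xxxxxx: no compression
        } else if (l % DAY == 0) {
            header = 0xC0; l /= DAY;    // 11xxxxxx: day precision
        } else if (l % HOUR == 0) {
            header = 0x80; l /= HOUR;   // 10xxxxxx: hour precision
        } else {
            header = 0x40; l /= SECOND; // 01xxxxxx: second precision
        }
        long zigZagL = (l << 1) ^ (l >> 63);   // zigzag so the sign bit doesn't hurt vlong
        header |= (int) (zigZagL & 0x1F);      // lowest 5 bits ride in the header
        long upperBits = zigZagL >>> 5;
        if (upperBits != 0) {
            header |= 0x20;                    // continuation bit: more bytes follow
        }
        out.writeByte((byte) header);
        if (upperBits != 0) {
            out.writeVLong(upperBits);         // remaining bits, vlong-encoded
        }
    }
}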

To sum up, because we have prior knowledge of the content being encoded, namely that most of it is dates and times, we can exploit that knowledge to encode it as cheaply as possible.

A Float

Before we get to float, let's review how a float is laid out under IEEE 754:

The first bit is the sign bit, the next eight bits are the exponent, and the last 23 bits are the mantissa. Take 12.25f as an example:

  1. First convert it to binary. The integer part 12 is the familiar 1100; the .25 part is .01, because 2^3 + 2^2 + 2^-2 = 12.25, so the binary representation is 1100.01
  2. Rewriting it in scientific notation, the mantissa is 1.10001 and the exponent is 3 (the point moves three places to the left). Because the integer part of the mantissa is always 1, it does not need to be encoded
  3. So far the sign is 0 and the exponent is 3, but what if the exponent is negative? Say the binary representation is 0.000101: instead of moving the point to the left we move it 4 places to the right to get 1.01, so the exponent is -4. You might suggest giving the exponent its own sign bit, but then you would end up with both a +0 and a -0, so instead the exponent is biased into [0, 255] by adding 127. For 12.25 that gives 3 + 127 = 130, whose binary representation is 10000010. The mantissa, since the leading 1 is not encoded, is just the part after the point: 1000100 00000000 00000000
  4. Finally the sign bit, the exponent bits, and the mantissa bits are concatenated to form the final encoding: 0 10000010 1000100 00000000 00000000

Now we can talk about lucene’s rules for writing a Float. The related method is writeZFloat:

  1. First convert the float to an int, e.g. 12.25 becomes 12
  2. Check whether that int still equals the original float. If it does, and the value lies in [-1, 125], it can be encoded in a single byte: set the high bit to 1 and put the int value plus 1 in the low 7 bits. For example, the float 12.0 is encoded as 10001101
  3. If the first case does not apply but the value is still positive, we are in the second case. 12.25, for instance, is simply written in its IEEE format: 0 10000010 1000100 00000000 00000000, i.e. no compression at all
  4. Otherwise we are in the third case, e.g. -12.25, which needs an all-ones byte in front: 11111111 1 10000010 1000100 00000000 00000000. Why spend a whole extra byte? Wouldn't writing -12.25 directly as 1 10000010 1000100 00000000 00000000 be good enough? It would, except that this encoding clashes with the first format: when decoding you could not tell whether the first byte belongs to case 1 or case 3. Since in the first case the first byte is at most 11111110, an all-ones first byte cleanly marks the third case.


The idea of this compression is: try to lower the precision, and if the value is unchanged at the lower precision, store it at that precision. The scheme assumes most inputs are small numbers with no fractional part, which can then be encoded like an ordinary small int. But once a fractional value shows up, Lucene does no further compression; for a negative fractional value it even wastes one extra byte compared with writing the raw bits. Prior knowledge is therefore crucial when designing a compression scheme: the distribution and characteristics of the input largely determine the final compression ratio. As a general-purpose search engine, Lucene cannot assume that user input follows any particular pattern, so it settles for a compromise. If we were building a search engine tailored to a specific business, it would be worth studying the actual input and picking a more suitable compression scheme.

The source code is as follows:

  static void writeZFloat(DataOutput out, float f) throws IOException {
    int intVal = (int) f;
    // Reinterpret the float's IEEE bits as an int: we only have writeInt, not writeFloat,
    // and Java has no equivalent of C++'s memcpy
    final int floatBits = Float.floatToIntBits(f);

    if (f == intVal
        && intVal >= -1
        && intVal <= 0x7D
        && floatBits != NEGATIVE_ZERO_FLOAT) {
      // Case 1: no fractional part and the value is in [-1, 125]: 1 byte
      out.writeByte((byte) (0x80 | (1 + intVal)));
    } else if ((floatBits >>> 31) == 0) {
      // Case 2: other positive floats: 4 bytes
      out.writeInt(floatBits);
    } else {
      // Case 3: negative floats: 5 bytes
      out.writeByte((byte) 0xFF);
      out.writeInt(floatBits);
    }
  }

Type Double

Double is very similar to float, with one extra case: if converting the double to a float loses no precision, the value is encoded as a float preceded by the marker byte 0b11111110. The idea is the same: try to lower the precision, check whether the value is unchanged, and if so store it at the lower precision.
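A sketch of the corresponding writeZDouble logic, based on my reading of the Lucene 7 source (simplified, not verbatim; the exact constants and bounds may differ):

import java.io.IOException;
import org.apache.lucene.store.DataOutput;

public class ZDoubleSketch {
    static final long NEGATIVE_ZERO_DOUBLE = Double.doubleToLongBits(-0d);

    // Sketch of writeZDouble as described above; simplified for illustration.
    static void writeZDouble(DataOutput out, double d) throws IOException {
        int intVal = (int) d;
        final long doubleBits = Double.doubleToLongBits(d);

        if (d == intVal && intVal >= -1 && intVal <= 0x7C && doubleBits != NEGATIVE_ZERO_DOUBLE) {
            // Small integer with no fractional part: 1 byte
            // (upper bound is one less than float's, since 0xFE is also reserved as a marker)
            out.writeByte((byte) (0x80 | (intVal + 1)));
        } else if (d == (float) d) {
            // The double survives a round trip through float: marker 0xFE + 4 bytes
            out.writeByte((byte) 0xFE);
            out.writeInt(Float.floatToIntBits((float) d));
        } else if ((doubleBits >>> 63) == 0) {
            // Other positive doubles: 8 raw bytes
            out.writeLong(doubleBits);
        } else {
            // Negative doubles: marker 0xFF + 8 raw bytes
            out.writeByte((byte) 0xFF);
            out.writeLong(doubleBits);
        }
    }
}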

Type String

In Java a char is not a single byte as in C++ but two bytes, and strings are UTF-16, meaning each character takes two or four bytes. GCC defaults to UTF-8. So should the stored value use UTF-8 or UTF-16? UTF-8, because it usually takes less space: in UTF-8 a letter or digit takes 1 byte and a Chinese character takes 3 bytes, while in UTF-16 a letter or digit takes 2 bytes and a Chinese character also takes 2 bytes. But remember that mainstream open-source software targets Western users and most of their text is English, so UTF-8 it is. If we were building a general-purpose Chinese search engine, GBK would be worth considering (I suspect Baidu uses GBK heavily for exactly this cost reason; it does not index Korean, Japanese, or Arabic pages anyway).

          Chinese                   English letters/digits
utf8      3 bytes (a few need 4)    1 byte
utf16     2 bytes (a few need 4)    2 bytes
gbk       2 bytes                   1 byte

Lucene uses UTF-8, but Java strings are UTF-16, so we cannot write the chars of the input string out as bytes directly; they have to be transcoded into a new byte array first.

OK, so now the question is: when you allocate that byte array, how big should it be?

A UTF-16 character may become 1, 2, or 3 bytes in UTF-8 (4 for surrogate pairs). If you always allocate for the worst case you waste memory; if you precisely compute how many bytes the UTF-16 string needs in UTF-8 you spend extra CPU.

Lucene's strategy: if the string is small, just pay the worst case and allocate 3 bytes per character. Small means the maximum possible size is at most 65536 bytes (64KB), which is not that big, and the bytes are first written into a reusable scratch array (scratchBytes, a temporary buffer) and then copied into the destination array, so later strings reuse the same scratch buffer. If the string is large, i.e. the maximum exceeds 65536, Lucene first computes exactly how many bytes the transcoded string needs, grows the destination array on demand, and then writes into it. That is essentially two passes over the string: the first pass computes the exact transcoded length, the second pass writes the transcoded bytes.

Before the UTF-8 bytes themselves, the byte count is written as a vint; otherwise the reader would not know where the string ends.

The source code

  public void writeString(String string) throws IOException {
    // Maximum number of UTF-8 bytes needed, essentially string.length() * 3
    int maxLen = UnicodeUtil.maxUTF8Length(string.length());
    if (maxLen <= MIN_UTF8_SIZE_TO_ENABLE_DOUBLE_PASS_ENCODING)  {
      // string is small enough that we don't need to save memory by falling back
      // to the double-pass approach; this is just an optimized writeString() that
      // re-uses scratchBytes to avoid allocating a new buffer every time
      if (scratchBytes == null) {
        scratchBytes = new byte[ArrayUtil.oversize(maxLen, Character.BYTES)];
      } else {
        scratchBytes = ArrayUtil.grow(scratchBytes, maxLen);
      }
      int len = UnicodeUtil.UTF16toUTF8(string, 0, string.length(), scratchBytes);
      writeVInt(len);
      writeBytes(scratchBytes, len);
    } else  {
      // use a double pass approach to avoid allocating a large intermediate buffer
      // for string encoding
      int numBytes = UnicodeUtil.calcUTF16toUTF8Length(string, 0, string.length());
      writeVInt(numBytes);
      // bytes.length is the capacity of the bytes array; part of it may be unused
      bytes = ArrayUtil.grow(bytes, length + numBytes);
      length = UnicodeUtil.UTF16toUTF8(string, 0, string.length(), bytes, length);
    }
  }

Bytes type

If you want to write binary content, this is the method to use. Binary data needs no extra compression: the bytes are simply appended to the array. If the array is too small it is grown first. Unlike a C++ vector, which typically doubles its capacity, Lucene uses the conservative strategy of growing by roughly 1/8 at a time, spending a bit more CPU to avoid wasting memory (even so, Lucene is still a memory hog).

  @Override
  public void writeBytes(byte[] b, int off, int len) {
    final int newLength = length + len;
    if (newLength > bytes.length) {
      bytes = ArrayUtil.grow(bytes, newLength);
    }
    System.arraycopy(b, off, bytes, length, len);
    length = newLength;
  }


StoredField meta information

What finally lands on disk is just a byte array, so given such an array, how do we restore the document? We need some metadata telling us which part is field1, which part is field2, and what type each field has.

In fact, before each stored field is written, a piece of meta information is written first: a 64-bit value recording the field's number and type. The type takes only 3 bits; the field number gets the remaining 61 bits. The value is written as a vlong.

// bits is the field type, info.number is the field number
final long infoAndBits = (((long) info.number) << TYPE_BITS) | bits;
bufferedDocs.writeVLong(infoAndBits);

The field type is defined as follows:

field type    code
int           0x02
long          0x04
float         0x03
double        0x05
bytes         0x01
string        0x00

info.number is the field's ordinal. For example, with only three fields title, isbn, and visit, their numbers are 0, 1, and 2. Since the value is written as a vlong, the meta information normally takes just one byte.

So the stored-field data of a document ends up in this format:

meta info, then the value, then the next field's meta info, then its value, and so on.
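As a worked illustration using the demo document above (the constant names here are mine; the value of TYPE_BITS follows from the 3-bit type field):

public class MetaInfoExample {
    public static void main(String[] args) {
        final int TYPE_BITS = 3;          // the field type occupies the low 3 bits
        final int NUMERIC_INT = 0x02;     // type code for int
        int fieldNumber = 2;              // "visit" is the third field: numbers are 0, 1, 2
        long infoAndBits = ((long) fieldNumber << TYPE_BITS) | NUMERIC_INT;
        System.out.println(infoAndBits);  // 18, i.e. 0x12, which vlong-encodes to a single byte
    }
}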

Overall on-disk compression format

When does the flush happen?

Flushing to disk happens for StoredField in two cases:

  1. In finishDocument, after each document is processed, Lucene checks whether the buffered document count has reached 128 or the in-memory bufferedDocs has reached 16384 bytes (16KB); if either is true, the data is flushed to disk.
private boolean triggerFlush() {
    return bufferedDocs.getPosition() >= chunkSize || // chunks of at least chunkSize bytes
        numBufferedDocs >= maxDocsPerChunk;
}

  2. The second case is when the IndexWriter is finally closed: a commit has to happen, and without a flush at this point everything buffered so far would have been written for nothing.

The on-disk format

For StoredField, Lucene 7 writes two files: a .fdt file (field data, I guess), which holds the flushed data, and a .fdx file (field index), which is an index into the .fdt file. Similar on-disk formats will show up again later when we cover the inverted index and the column storage of DocValues. Before introducing .fdt and .fdx, a few concepts:

A chunk is a collection of documents: every 128 docs form a chunk, and a chunk is also cut as soon as the buffered data exceeds 16KB.

A block is a collection of 1024 chunks.

A slice is also a collection of documents; it shows up when a single chunk grows too large (more on this below).

Many people reading the source for the first time wonder: why store data in chunks and blocks at all? Because the compression algorithms used get good compression ratios on small batches of documents; read on.

.fdt data file

First, the overall on-disk storage structure:

Remember, one such file is produced per segment, not one file for all the data. A segment can be thought of as a small self-contained index.

1. First the HEAD part is written, consisting of Header, SegmentID, and suffix. The Header contains a hard-coded magic number, the codec name string Lucene50StoredFieldsFastData (this is the default configuration; you can configure the codec for maximum compression ratio instead, in which case several parameters change, but we won't go into that here), and a Version field. The SegmentID is a 16-byte random value; the suffix can be ignored for now, it is usually empty (probably designed to make later extensions easier).

This part is written during the initialization of CompressingStoredFieldsWriter:

public static void writeIndexHeader(DataOutput out, String codec, int version, byte[] id, String suffix) throws IOException {
    if (id.length != StringHelper.ID_LENGTH) {
      throw new IllegalArgumentException("Invalid id: " + StringHelper.idToString(id));
    }
    writeHeader(out, codec, version);
    out.writeBytes(id, 0, id.length);
    BytesRef suffixBytes = new BytesRef(suffix);
    if (suffixBytes.length != suffix.length() || suffixBytes.length >= 256) {
      throw new IllegalArgumentException("suffix must be simple ASCII, less than 256 characters in length [got " + suffix + "]");
    }
    out.writeByte((byte) suffixBytes.length);
    out.writeBytes(suffixBytes.bytes, suffixBytes.offset, suffixBytes.length);
  }

  2. Next the chunk meta information is written. chunkSize is the number of bytes in a chunk, a fixed value of 16384; if the format is configured for maximum compression ratio (Lucene50StoredFieldsHigh), the value is 61440. PackedIntsVersion records the PackedInts version, currently 2. These two values are logically part of the header, because they rarely change.

  3. Then the chunk data is written. A chunk is formed every 128 docs, or as soon as the buffered data reaches the 16384-byte limit. For a single chunk, the meta information is written first:

DocBase is the smallest docNo in the current chunk. It is 0 for the first chunk and 128 for the second (assuming the 128-doc limit is what triggered the flush).

NumBufferedDocs is the number of docs in this chunk. The value is shifted left by one bit, and the freed low bit marks whether the chunk is sliced:

final boolean sliced = bufferedDocs.getPosition() >= 2 * chunkSize;

If the buffered data exceeds twice the chunkSize, sliced is true. Wait, doesn't exceeding chunkSize already trigger a flush and start a new chunk? How can a chunk hold more than twice that threshold? Imagine a single document that is extremely long, say an entire book stored as one doc: bufferedDocs then grows very large in one step, which not only triggers the flush but also triggers slicing.

DocFieldCount is an int array recording how many fields each document in the current chunk has; DocLength is an int array recording how many bytes each document in the chunk occupies.

So Lucene has to store int arrays; what good ways are there to compress them?

First, if all values in the array are equal, write the flag 0 followed by the single value. (This is usually the case for DocFieldCount, because the schema is fixed and the field count does not change from doc to doc. It is also a hint not to write documents with wildly varying structures into ES.)

Second, pack the values. For example, storing the array [4, 2, 8, 10] naively takes 4 * 4 = 16 bytes, but the values have very few significant bits: 4 is 0100, 2 is 0010, 8 is 1000, 10 is 1010, so 16 bits, i.e. 2 bytes, are enough. To store such an array, first compute how many bits the largest value needs, then represent every value with that bitsRequired, and finally pack them tightly. This is another benefit of chunking: if we had 100,000 documents, most of them around 400 bytes of stored fields but a single one taking 1,000,000 bytes, the doc-length array would be forced to use the bitsRequired of that maximum everywhere, wasting a lot of space; chunking limits the damage from such outliers.

Related source code:

private static void saveInts(int[] values, int length, DataOutput out) throws IOException {
    assert length > 0;
    if (length == 1) {
      out.writeVInt(values[0]);
    } else {
      boolean allEqual = true;
      for (int i = 1; i < length; ++i) {
        if (values[i] != values[0]) {
          allEqual = false;
          break;
        }
      }
      if (allEqual) {
        out.writeVInt(0);
        out.writeVInt(values[0]);
      } else {
        long max = 0;
        // Compute the number of bits required by the largest value
        for (int i = 0; i < length; ++i) {
          max |= values[i];
        }
        final int bitsRequired = PackedInts.bitsRequired(max);
        out.writeVInt(bitsRequired);
        final PackedInts.Writer w = PackedInts.getWriterNoHeader(out, PackedInts.Format.PACKED, length, bitsRequired, 1);
        for (int i = 0; i < length; ++i) {
          w.add(values[i]);
        }
        w.finish();
      }
    }
  }

Then comes the doc data itself, which we described earlier: meta info (field number and type) followed by the value, over and over. Note that at the end this doc data is compressed with the LZ4 algorithm (TODO: to be covered in a later article), so the field values, already compactly encoded, get compressed once more. In addition, if sliced is true, the data is compressed slice by slice, 16384 bytes at a time.

The overall code logic for this section is as follows:

private void flush() throws IOException {
    indexWriter.writeIndex(numBufferedDocs, fieldsStream.getFilePointer());

    // transform end offsets into lengths
    final int[] lengths = endOffsets;
    for (int i = numBufferedDocs - 1; i > 0; --i) {
      lengths[i] = endOffsets[i] - endOffsets[i - 1];
      assert lengths[i] >= 0;
    }
    final boolean sliced = bufferedDocs.getPosition() >= 2 * chunkSize;
    writeHeader(docBase, numBufferedDocs, numStoredFields, lengths, sliced);

    // compress stored fields to fieldsStream
    if (sliced) {
      // big chunk, slice it
      for (int compressed = 0; compressed < bufferedDocs.getPosition(); compressed += chunkSize) {
        compressor.compress(bufferedDocs.getBytes(), compressed, Math.min(chunkSize, bufferedDocs.getPosition() - compressed), fieldsStream);
      }
    } else {
      compressor.compress(bufferedDocs.getBytes(), 0, bufferedDocs.getPosition(), fieldsStream);
    }

    // reset
    docBase += numBufferedDocs;
    numBufferedDocs = 0;
    bufferedDocs.reset();
    numChunks++;
  }

Finally numChunks and numDirtyChunks are written; numDirtyChunks is the number of chunks that were flushed before they were full, a number that becomes useful during segment merges. Then a footer is written containing the magic number, a 0, and a checksum, used to detect whether the file has been tampered with or damaged.

.fdx Index file

Again, the storage structure diagram first:

  1. Just like .fdt, the Header part is written first; we won't repeat the details here and will look at the chunk index part instead
  2. When the number of chunks reaches 1024, a block is written out; the writing happens inside CompressingStoredFieldsIndexWriter's writeBlock method. A block essentially records the meta information of 1024 chunks. Inside a block, BlockChunks, the number of chunks in the block, is written first, followed by two very important arrays: DocBasesDeltas and StartPointersDeltas.

DocBasesDeltas records the number of docs in each chunk. StartPointersDeltas records, for each chunk, the difference between its start offset in the data and the previous chunk's start offset; these offsets point into the data written out from bufferedDocs, the array holding the actual document values mentioned earlier.

A difference algorithm is used to encode the two arrays:

For example, we have the docBaseDelta array :[128, 76, 75, 102, 100, 120]

First compute the average and round it: avg = 100.

Calculate an array of deltas using the following rules (python code for clarity):

base = 0
delta_array = []
for i in range(len(array)):
    delta = base - avg * i
    delta_array.append(delta)
    base += array[i]

The resulting delta array is [0, 28, 4, -21, -19, -19].

3. The delta array is then compressed with the packedInt technique introduced earlier: find the bitsRequired of the largest value, encode every value with that many bits, and join them tightly into a new byte sequence. Note that the negative numbers must first go through the zigzagEncode step described earlier, otherwise their leading 1 bits would ruin the packing, as illustrated below.
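A small self-contained illustration of this step applied to the delta array above (plain Java rather than Lucene's PackedInts API):

public class DeltaPackDemo {
    public static void main(String[] args) {
        long[] deltas = {0, 28, 4, -21, -19, -19};
        long max = 0;
        for (long d : deltas) {
            long z = (d << 1) ^ (d >> 63);   // zigzag: 0, 56, 8, 41, 37, 37
            max |= z;
        }
        int bitsRequired = 64 - Long.numberOfLeadingZeros(max);
        System.out.println(bitsRequired);    // 6, since the largest zigzagged value is 56
        // 6 values * 6 bits = 36 bits, i.e. 5 bytes after packing, versus 24 bytes
        // if each delta were stored as a raw 4-byte int; without zigzag, the
        // negative deltas would force the packer to use the maximum bit width.
    }
}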

Let me ask you a question:

First, why compute the deltas this way instead of simply [(array[i] - avg) for i in range(len(array))]? Wouldn't that be easier to understand?

Second, the benefit of this encoding is obvious: the smaller the variance of the array, the smaller the deltas and the more space saved. But then why aren't DocFieldCount and DocLength in .fdt encoded this way, instead of being packed directly with packedInt?

You are welcome to discuss with me the answers to these two questions.

  3. Finally, just like .fdt, the file ends with maxPointer (the end offset of the .fdt data), followed by the footer: magicNum, 0, and checksum.

conclusion

  1. When a StoredField is written, the data first goes into the bufferedDocs array, and each data type has its own space-saving encoding: vint, zigzagEncode, and so on
  2. .fdt holds the data; .fdx is the index over .fdt. A doc is a collection of fields, a chunk is a collection of docs, and a block is a collection of chunks. There are two thresholds for cutting a chunk: 128 docs, or 16384 bytes in bufferedDocs. The threshold for cutting a block is 1024 chunks.
  3. For plain int arrays, compute the deltas and then use packedInt encoding.