
Apache Solr is an efficient text search engine built on Apache Lucene (Note 1). It is fault tolerant, highly available, scalable, and distributed, and is widely used by well-known large-scale applications and websites such as eBay, Instagram, and Netflix.

GPText is part of the Greenplum ecosystem. It seamlessly integrates Greenplum's parallel processing of massive data with Apache Solr's enterprise text retrieval capabilities, providing users with an easy-to-use, fully featured text retrieval and analysis solution.

During the tuning of GPText, the author found that Solr 7 had an unusually slow response time when returning a large number of search results. Testing located the performance bottleneck in Solr 7's reading of DocValues, and further analysis showed that it was caused by an inefficiency in Solr 7's DocValues reading mechanism.

This article will introduce Solr DocValues, explain the DocValues reading mechanisms of Solr 7 and Solr 8, and show how to optimize them. In the author's test environment, the Solr 7 response speed was improved by more than 10 times!

Solr search

A typical use case for GPText

Before formally examining Solr's search process, let's take a quick look at a typical GPText search case.

Suppose we have a table named test with two fields, id and content, as follows:

We have created an index on this table called "demo.public.test" (Note 2). To search for records containing the keyword "VMware", you can call the following statement (Note 3):

select id from gptext.search(table(select 1 scatter by 1), 
'demo.public.test', 'VMware', null);

The results are as follows:

To search for records containing both “VMware” and “Your”, you can call the following statement:

select id from gptext.search(table(select 1 scatter by 1), 
'demo.public.test', 'Your AND VMware', null);

The results are as follows:

In a practical use case, you might join the above results with the original table “test” for further analysis.

Note 1: lucene.apache.org/solr/

Note 2: For GPText installation, index creation, and data import, please refer to gptext.docs.pivotal.io/340/welcome…

Note 3: For the meaning of each parameter of the gptext.search function, please refer to gptext.docs.pivotal.io/340/…

Solr’s search process

Before we begin the Solr search process, we need to explain two concepts: doc ID and inverted index.

The doc ID is a continuous and unique internal identifier that Solr creates for each document (a Solr document corresponds to a database record). If there are 10 documents in Solr, their doc IDs are 0 to 9. The doc ID functions like a primary key in a database, but does not necessarily have the same value as the database's primary key.

For example, the previous test table is stored in Solr:

An inverted index is the most commonly used index structure in document retrieval systems: it stores a mapping from each word to its locations in a document or a group of documents. Most storage methods look up attributes from records; an inverted index, by contrast, lets us look up records by attribute.

Taking the previous test table as an example, we create an inverted index in Solr for the content field of the table, then the index is as follows:

Where, “VMware: [1, 2]” indicates that the word “VMware” appears in two documents with doc ID 1 and doc ID 2.
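To make this concrete, here is a minimal Python sketch (an illustration, not GPText/Solr code) of building such an inverted index; the document contents are hypothetical stand-ins for the test table's content field, chosen so that "VMware" maps to doc IDs 1 and 2 as in the example above:

```python
from collections import defaultdict

# Hypothetical documents keyed by doc ID; the strings stand in for
# the "content" field of the test table.
docs = {
    0: "Greenplum database",
    1: "VMware Greenplum",
    2: "Your VMware platform",
}

# Build the inverted index: term -> list of doc IDs containing it.
inverted = defaultdict(list)
for doc_id, content in sorted(docs.items()):
    for term in set(content.split()):
        inverted[term].append(doc_id)

print(inverted["VMware"])  # [1, 2]
```

A query for "VMware" then becomes a single dictionary lookup instead of a scan over every record.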

When the following SQL statement is executed, GPText sends this request to Solr: find all records whose content contains the word "VMware" and return their ids.

select id from gptext.search(table(select 1 scatter by 1), 
'demo.public.test', 'VMware', null);

Solr receives the request and looks for documents containing the word "VMware" in the inverted index it maintains. It finds that the two documents with doc IDs 1 and 2 qualify. Solr then looks up the ids of these two documents. If Solr is configured with column storage, the id is read from the column store (called DocValues in Solr); otherwise it is read from the row store (called Stored Fields in Solr). (Row and column storage can be configured at the same time in Solr.)

If column storage is configured in Solr as in the table above, the ids of the documents with doc IDs 1 and 2, namely 10 and 11, are read from the column store.

After obtaining the ID from the column or row store, Solr returns the following result to GPText:

Solr's DocValues

DocValues introduction

As mentioned above, DocValues are Solr's column-oriented storage for fields: they store the mapping from each document to the values of the corresponding column.

Why use column storage when Solr already has row storage? This is the same as using column stores in relational databases:

  1. All values in each column have the same data type, so a more efficient compression algorithm can be used.

  2. A more efficient compression algorithm means a smaller amount of data, which significantly reduces I/O cost.

  3. Lower I/O cost greatly speeds up operations that read values by column, such as sorting, grouping, and faceting.

Solr 7 reads DocValues

Solr 7 stores every 65536 DocValues as a block, and each block's header stores the length of that block. To know the position of a block, we need the position and length of the previous block; in other words, to locate a block we must traverse block by block, starting from block 0.

This is shown in the diagram below.

The number before each block indicates the starting position of the block, and the green part of each block indicates the length of that block. If we want to locate Block2, we need to know the starting position and length of Block1. To know the start position of Block1, we need the start position and length of Block0. Since Block0 starts at position 0 and is 1000 in length, Block1 starts at position 1001 (0 + 1000 + 1). Knowing the start position of Block1 (1001), we can read the length of Block1 (2999) and calculate the start position of Block2 as 4001 (1001 + 2999 + 1).
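The chained position computation above can be sketched in Python (a simplified illustration: the block lengths are the values from the diagram, with a made-up length for Block2, and real Solr 7 reads each length from the block header on disk):

```python
# Block lengths as in the diagram above; Block2's length (1500) is a
# hypothetical value added for the example.
block_lengths = [1000, 2999, 1500]

def block_start(index, lengths):
    """Locate a block's start by chaining through all previous blocks:
    next start = current start + current length + 1."""
    position = 0
    for i in range(index):
        position += lengths[i] + 1
    return position

print(block_start(1, block_lengths))  # 1001
print(block_start(2, block_lengths))  # 4001
```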

The code is as follows:

for docID in docList {
    desiredBlock = docID / 65536
    blockIndex = 0
    blockPosition = 0
    while (blockIndex <= desiredBlock) {
        blockLength = file.readInt(blockPosition)
        # file.getBlock does not generate actual I/O
        block = file.getBlock(blockPosition, blockLength)
        blockPosition += (blockLength + 1)
        blockIndex++
    }
    result.append(block.readValue(docID % 65536))
}
return result

Every time Solr 7 reads a DocValue, it repeats this whole process from block 0 to read the next DocValue, even if the next DocValue is in the same block.

Assuming we use Solr 7 to read the ids corresponding to doc IDs [400000, 3, 5, 500000] (each id is a DocValue), we need a total of 21 I/Os.
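The count of 21 can be reproduced with a small Python cost model (an illustration, not Solr code; it assumes every block-length read and every value read costs exactly one I/O, with no cache credit, since each restart from block 0 invalidates the disk cache):

```python
BLOCK_SIZE = 65536

def solr7_naive_io_count(doc_ids):
    """Model Solr 7's naive reads: for each doc ID, walk every block
    from block 0 up to its block (one length-read I/O per block),
    then spend one more I/O to read the DocValue itself."""
    total = 0
    for doc_id in doc_ids:
        desired_block = doc_id // BLOCK_SIZE
        total += desired_block + 1  # length reads for blocks 0..desired
        total += 1                  # read the DocValue itself
    return total

print(solr7_naive_io_count([400000, 3, 5, 500000]))  # 21
```

Doc ID 400000 lands in block 6 and 500000 in block 7, so those two reads alone traverse 7 and 8 block headers respectively.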

Solr 8 reads DocValues

Similar to Solr 7, Solr 8 still stores every 65536 DocValues as a block, and each block header still stores the length of that block. However, Solr 8 also stores metadata in the file header that records the starting location of each block, so we can read a specified block's location directly from the metadata.

This is shown in the diagram below.

To locate Block2, we read its location from the metadata instead of traversing from Block0 as in Solr 7.

The code is as follows:

for docID in docList {
    desiredBlock = docID / 65536
    # read the block's start position from the metadata
    blockPosition = file.readLong(desiredBlock * 8)
    blockLength = file.readInt(blockPosition)
    # file.getBlock does not generate actual I/O
    block = file.getBlock(blockPosition, blockLength)
    result.append(block.readValue(docID % 65536))
}
return result

Like Solr 7, Solr 8 repeats the same process from scratch to read the next DocValue after reading one, even if the two DocValues are in the same block.

If we use Solr 8 to read the ids corresponding to doc IDs [400000, 3, 5, 500000], we need a total of 12 I/Os.
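Under the same style of cost model (an illustration, not Solr code), each DocValue read in naive Solr 8 costs exactly three I/Os: one metadata read for the block position, one block-length read, and one value read:

```python
def solr8_naive_io_count(doc_ids):
    """Model Solr 8's naive reads: per doc ID, one metadata read
    (block start position), one block-length read, and one value read."""
    return 3 * len(doc_ids)

print(solr8_naive_io_count([400000, 3, 5, 500000]))  # 12
```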

Optimize the reading of DocValues

Optimization of I/O counts

Looking at how Solr 7 and Solr 8 read DocValues, we can see a very big problem: after reading one DocValue, reading starts over from the beginning for the next DocValue. This not only dramatically increases the number of I/Os, but also defeats the disk cache. (Solr's disk cache unit is 1KB: each file read fetches 1KB of data, and if the data needed by the next read is already within that 1KB, no disk access is required.)

The difficulty is that the order of doc IDs in the result set does not match the order of the data in the file, which makes it impossible to simply continue reading from the current position.

To solve this problem, we can first sort the results by doc ID, then read the corresponding ids (DocValues) in sorted doc ID order, continuing from the current position instead of starting from scratch after each read, and finally return the results in the original doc ID order. For Solr 7, the optimized code looks like this:

# DocIdPositionMap is a map<int, int>: the key is a doc ID and the
# value is the position (subscript) of that doc ID in docList
sortedDocList, DocIdPositionMap = sort(docList)
blockIndex = 0
blockPosition = 0
for docID in sortedDocList {
    desiredBlock = docID / 65536
    # after sorting, we can continue from the current block
    while (blockIndex <= desiredBlock) {
        blockLength = file.readInt(blockPosition)
        # file.getBlock does not generate actual I/O
        block = file.getBlock(blockPosition, blockLength)
        blockPosition += (blockLength + 1)
        blockIndex++
    }
    id = block.readValue(docID % 65536)
    position = DocIdPositionMap[docID]
    result[position] = id
}
return result

With the optimized algorithm, reading the ids corresponding to doc IDs [400000, 3, 5, 500000] in Solr 7 (each id is a DocValue) requires 11 I/Os. (The sorted doc ID order is [3, 5, 400000, 500000].)
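The 11 I/Os can be reproduced with the same kind of simplified cost model (an illustration, not Solr code; block-length reads cost one I/O per newly visited block, and a value read is assumed to be free when the previous value came from the same block, on the assumption that it hits the 1KB disk cache):

```python
BLOCK_SIZE = 65536

def solr7_optimized_io_count(doc_ids):
    """Model the optimized Solr 7 reads: visit doc IDs in sorted order,
    advance block by block from the current position (one length-read
    I/O per newly visited block), and count a value read as cached
    (0 I/O) when it falls in the same block as the previous value."""
    total = 0
    block_index = 0
    prev_block = None
    for doc_id in sorted(doc_ids):
        desired_block = doc_id // BLOCK_SIZE
        while block_index <= desired_block:
            total += 1          # read this block's length
            block_index += 1
        total += 0 if desired_block == prev_block else 1  # value read
        prev_block = desired_block
    return total

print(solr7_optimized_io_count([400000, 3, 5, 500000]))  # 11
```

In sorted order, docs 3 and 5 share block 0, so the second value read is a cache hit; the block headers of blocks 0 through 7 are each read exactly once.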

For Solr 8, the optimized code looks like this:

# DocIdPositionMap is a map<int, int>: the key is a doc ID and the
# value is the position (subscript) of that doc ID in docList
sortedDocList, DocIdPositionMap = sort(docList)
blockIndex = -1
for docID in sortedDocList {
    desiredBlock = docID / 65536
    if (blockIndex != desiredBlock) {
        blockPosition = file.readLong(desiredBlock * 8)
        blockLength = file.readInt(blockPosition)
        blockIndex = desiredBlock
        # file.getBlock does not generate actual I/O
        block = file.getBlock(blockPosition, blockLength)
    }
    id = block.readValue(docID % 65536)
    position = DocIdPositionMap[docID]
    result[position] = id
}
return result

With the optimized algorithm, reading the ids corresponding to doc IDs [400000, 3, 5, 500000] in Solr 8 requires 9 I/Os. (The sorted doc ID order is [3, 5, 400000, 500000].)
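The 9 I/Os likewise follow from a simplified cost model (an illustration, not Solr code): each newly entered block costs a metadata read, a block-length read, and one value read, while further values in the same block are assumed to hit the 1KB disk cache:

```python
BLOCK_SIZE = 65536

def solr8_optimized_io_count(doc_ids):
    """Model the optimized Solr 8 reads: visit doc IDs in sorted order;
    entering a new block costs one metadata read, one block-length
    read, and one value read, and subsequent values in the same block
    are counted as disk-cache hits (0 I/O)."""
    total = 0
    current_block = None
    for doc_id in sorted(doc_ids):
        desired_block = doc_id // BLOCK_SIZE
        if desired_block != current_block:
            total += 3  # metadata read + length read + first value read
            current_block = desired_block
    return total

print(solr8_optimized_io_count([400000, 3, 5, 500000]))  # 9
```

The sorted doc IDs touch three distinct blocks (0, 6, and 7), giving 3 × 3 = 9 I/Os.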

So far, we can conclude the advantages of the optimization algorithm:

  • Reduced number of I/Os. For Solr 7, the reduction is significant.

  • Improve cache hit ratio.

However, the optimization algorithm also has disadvantages:

  • Sorting and holding some intermediate data slightly increases the memory footprint.

  • A Map object (the "DocIdPositionMap" in the code) holds the mapping from each doc ID to its original location. When many records are returned, this generates a large number of objects, and GC (garbage collection) can even become a bottleneck. (Each key and each value is an object, so returning 2 million records generates more than 4 million objects.)

Optimization of GC

As mentioned above, because our optimization algorithm uses a Map object to hold the mapping from each doc ID to its original location, a large number of objects are generated when many records are returned, which can even incur considerable GC time. We therefore need to eliminate the Map object and record each doc ID's original location in another way.

The maximum number of documents in a Solr replica is 2147483519 (Integer.MAX_VALUE - 128), so the doc ID and the return position of each document can each be represented by an int without overflow. This leads to the following solution:

  1. Use a value of type long to hold a doc ID and its original position, with the doc ID in the high 4 bytes and the original position in the low 4 bytes.

  2. Sort these long values. Clearly, sorting the long values gives the same order as sorting the doc IDs directly.

  3. Read the ids in the sorted order, unpacking each long back into its doc ID and original position before reading.

  4. After reading the ids, return them in their original order.
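The packing scheme in steps 1-3 can be sketched in Python (an illustration of the bit layout only: Python ints are arbitrary precision, whereas in Java the packed values would live in a single long[] array):

```python
def pack(doc_id, position):
    """Pack a doc ID (high 32 bits) and its original position in the
    result list (low 32 bits) into a single 64-bit value."""
    return (doc_id << 32) | position

def unpack(combined):
    """Recover (doc ID, original position) from a packed value."""
    return combined >> 32, combined & 0xFFFFFFFF

doc_list = [400000, 3, 5, 500000]
combined = sorted(pack(d, i) for i, d in enumerate(doc_list))

# Sorting the packed longs orders them by doc ID (the high bits).
print([unpack(c)[0] for c in combined])  # [3, 5, 400000, 500000]
print([unpack(c)[1] for c in combined])  # [1, 2, 0, 3]
```

Because the doc ID occupies the high bits, comparing packed values compares doc IDs first, so a plain sort of the long array yields the desired read order while carrying each value's original position along for free.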

As a recap, the code before GC optimization looks like this:

# each key/value in DocIdPositionMap is an object
sortedDocList, DocIdPositionMap = sort(docList)
for docID in sortedDocList {
    id = docValuesReader.read(docID)
    position = DocIdPositionMap[docID]
    result[position] = id
}
return result

The GC-optimized code looks like this:

for docID in docList {
    combinedValue = ((long) docID) << 32 | docList.currentPosition()
    combinedList[docList.currentPosition()] = combinedValue
}
# since an int array or a long array is a single object in Java,
# the entire combinedList is just one object
sort_in_place(combinedList)
for combinedValue in combinedList {
    docID = (int) (combinedValue >> 32)
    position = (int) (combinedValue & 0xFFFFFFFFL)
    id = docValuesReader.read(docID)
    result[position] = id
}
return result

In principle, GC optimization significantly reduces the number of generated objects. With far fewer objects, the memory used for object headers, klass pointers, padding, and so on is saved, so memory usage is also reduced to a certain extent.

The experimental results

The following experiments are based on a real wiki dataset of 93,069,618 entries. The hardware configuration of the experiment is as follows:

The experimental Solr cluster configuration is as follows:

Query information used in the experiment:

The optimization effect

The time in the chart is the time taken by gptext.search to run. The two experiments show that as the number of returned records increases, the time required by GPText based on Solr 8, GPText based on optimized Solr 7, and GPText based on optimized Solr 8 grows essentially linearly (returning 4x the records takes about 4x the time), while the time required by Solr 7-based GPText rises dramatically. This matches the DocValues reading mechanisms described above. The experimental data show that the optimization proposed in this article greatly improves the performance of Solr 7 when a large number of documents are returned, and also significantly improves the performance of Solr 8.

The optimized Solr 7 outperforms the optimized Solr 8 because the large number of returned documents covers every block. In Solr 7, the next block is read directly from the current position, whereas in Solr 8 reading the next block requires going back to the file header, reading the next block's location from the metadata, and then jumping to that block, which not only adds I/O but also causes wide-ranging physical movement of the disk head.

In addition, it should be noted that for the query "q = test", the number of matched documents is small, so all matching documents were already returned in the "return 500,000 records" experiment (Note 5). The result for "return 2 million records" is therefore the same as for "return 500,000 records".

Note 4: Solr, as a full-text search engine, does not return this much data in a typical use case (the default number of documents returned is 10), so getting DocValues is usually not a bottleneck. But the typical use case for GPText is different, and returning hundreds of thousands of pieces of data is common.

Note 5: In the experiments, "return 500,000 / 2 million records" means that each shard returns 500,000 / 2 million documents. gptext.search sends the request to all Solr shards simultaneously, asking each shard to return 500,000 / 2 million matching documents. Since there are 24 shards, the total number of documents returned is 12 million / 48 million.

Bottleneck analysis

The previous experiments fetched data through GPText. This experiment compares the time required to fetch data through GPText with fetching it directly from Solr.

GPText streams data from Solr and converts the streams into output on the GPDB side; that is, fetching data and converting it happen in parallel. The two figures above show that, thanks to the Solr 7 optimization, when a large amount of data is returned the bottleneck of GPText shifts from Solr to data conversion on the GPDB side: Solr now returns data faster than GPDB can convert it. For the query "q = test", far fewer documents are returned in a similar amount of time, so the conversion work on the GPDB side is small and data conversion is not a bottleneck.

GC optimization (Note 6)

To prevent GPDB from becoming a bottleneck and affecting the results, this experiment used the optimized Solr 7 to fetch data directly (without GPText), with the query "q = *". From this experiment it is clear that GC optimization can greatly improve performance when the JVM is short of memory.

Note 6: Except for this experiment, "optimized" in all experiments means that both the I/O and GC optimizations are applied.

Cache effect

As mentioned above, our I/O optimizations improve the disk cache hit ratio, and the cache size also affects the hit ratio. We therefore designed an experiment to verify whether cache size affects performance. To prevent GPDB from becoming a bottleneck, this experiment fetches data directly with Solr 8 (without GPText).

According to the experimental results, when the cache is smaller than 1KB, performance drops sharply; when the cache is larger than 1KB, increasing it has no significant impact. Solr's choice of 1KB as the cache size is therefore reasonable.

It should be noted that for the query "q = test", the matched documents are sparsely distributed, so the cache hit ratio is almost 0 and the impact of cache size on performance is negligible.

Conclusion

This article introduced the concept of Solr DocValues and proposed a method to optimize DocValues reading from two angles: reducing I/O and reducing GC. The method greatly improves the efficiency of reading DocValues in Solr 7 and greatly reduces GPText's response time in typical application scenarios. GPText based on Solr 8 also sees significant performance gains (Note 7).

Note 7: As of this writing (October 2020), the latest release of GPText uses Solr 7.4.0. In this article, "GPText based on Solr 8", "GPText based on optimized Solr 7", and "GPText based on optimized Solr 8" are all development versions compiled and installed by the author. The optimizations described here will be officially included in an upcoming GPText release, so stay tuned.