Based on Lucene, trillion-level multi-dimensional retrieval and real-time analysis are realized

On May 29th, Zheng Qihua, the director of RXSS, shared the keynote speech of “Realizing Trillion Multidimensional Search and Real-time Analysis Based on Lucene” in QCON. The audience was full and the activity was successfully concluded in the strong atmosphere of technical discussion. Below, we share the full text of the speech.

One of the trillion challenges: data storage

The first one is about data storage. Usually we save the data is very simple, to the hard disk inside write on the line. Massive data is not so simple and presents many problems. Cost, for example. It can make a huge difference in cost whether you use SSDs or mechanical disks, 100 large disks or 10,000 small disks.

Second, data security is also an issue. If the disk is damaged or deleted by mistake, the data will be lost. Data migration, capacity expansion will also be more troublesome. In addition, there is the issue of read-write balance. Unevenly written data can cause some disks to be extremely busy and others to be idle. Or, if an individual disk has a problem and reads and writes slow down, all queries will get stuck on that disk’s IO, slowing overall performance.

In view of these storage problems, we use index technology based on HDFS to solve.

What problems can HDFS solve? HDFS is a highly fault-tolerant system for unbalanced reads and writes. If a disk dies or slows down, it will automatically switch to a faster copy for reading. Also, the disk data reads and writes are automatically balanced to avoid data skew problems. For data security issues, HDFS has data snapshots, redundant copies and other functions, can reduce the loss of data due to disk damage, or accidental deletion of the operation. As for storage cost, HDFS supports heterogeneous storage, which can be mixed with various storage media to reduce hardware cost. Moreover, HDFS can support large clusters and is relatively cheap to use and manage.

In addition, in order to further reduce the storage cost, we developed the function of column clustering. Clustering is not supported in native Lucene. What are the benefits of clustering?

We can assign data columns to different column clusters, mix different disks by column cluster, and set different lifetime for different column clusters. For example, a document may contain some structured data, such as title, author, abstract, etc., which is usually small and often searchable. We can then define these data columns as a column cluster and place them on the SSD. There are also some similar attachments, pictures, videos and other unstructured data, these data is relatively large, and generally will not be queryable, can be defined as another column cluster, placed on the SATA disk. This reduces the SSD usage. In addition, combined with the heterogeneous strategy of HDFS, we can also achieve the separation of hot and cold data. For example, some businesses often query data within the latest 1 month. Then, we can keep the latest 1 month on SSD, after 1 month, the data will be moved to SATA disk, so as to further reduce the SSD usage.

Now, let’s look at another problem. One basic application of big data is query retrieval. For example, a “full-text search” feature shown on this page looks for data containing user input keywords from a huge amount of data. Such a search function is common and not difficult to implement. The hard part is performance. For trillion-scale data, does it take seconds or hours to respond?

2. Trillion Challenge II, Retrieval Performance

To achieve trillion-second search performance. We optimized Lucene’s inversion list.

In the field of full-text retrieval, the usual practice is to cut the word, and then record the keyword in which document it appeared. At the same time, some other relevant information will be saved. For example, the frequency of the keyword, the location of the keyword in the document, and so on, there are about a dozen elements that are stored in the inversion list. Lucene uses row storage for these elements. This means that all dozen elements need to be read out for each query. When we’re searching, we might only use two or three of these elements. Using line storage can cause a lot of unnecessary IO. So what we’re doing here is we’re replacing the metadata in the inverted list with a column store.

The nice thing about columns is that when I look up which element I’m looking for, I just read the contents of that element and I can skip everything else. This change may seem small, but in a scenario with large amounts of data, the impact on performance can be significant. For example, the keywords we query might hit hundreds of millions of pieces of data. You need to read hundreds of millions of metadata information for inversion lists, or if you use row storage, you read hundreds of millions of metadata times a dozen or so elements. In the case of column storage, only two or three elements need to be read, and the difference on disk IO is several times. So, an optimization that reduces invalid IO through column storage of inverted list metadata provides a multifold performance boost.

And then, the second optimization, we’re going toInversion lists are stored in chronological order.

Because what we find in real scenarios is that a lot of data has a temporal nature. That is, data is generated over time. When querying these data, it is also generally done in conjunction with a time frame. For example, check the car exhaust emissions of the last day or the last few hours and so on. The native Lucene index data is unorganized when stored. Same day data may exist in different locations on disk. So when you read it, it’s a random read. So, we’ve improved it a little bit here as well. To store data in the order in which it is stored. So, when we query the data for a certain period of time, we only need to read this whole block of data from the disk. We know that the random read performance of a mechanical hard disk is much worse than the continuous read performance. Moreover, the disk has a read-ahead function when reading continuous data. For the same amount of data, the performance of continuous reads may be an order of magnitude better than that of random reads.

After the index optimization, the basic trillions of data retrieval performance can achieve second level response.

3. The third trillion challenges, multi-dimensional statistics

Now, another common application. Statistical analysis.

This is a typical data cube. There will be multiple dimensions involved, time dimension, region dimension and so on. Each dimension may also have a hierarchical relationship. For example, we first query the annual car ownership data, and then drill down to the quarterly data or the monthly data in the time dimension. However, native Lucene only has single-layer relationship on the index. For a column, it can only build single-column index, so the performance will be relatively poor in multi-dimensional or multi-level retrieval statistics.

For multi-dimensional statistics, the usual practice in the industry is to use DocValues for statistical analysis. DocValues, on the other hand, introduces random IO issues. So, we changed Lucene’s inverted index table. The term in the inverted list is modified this time.

The original term can store only one column, but it is improved to store more than one column. For example, the first column stores the year, the second the quarter, and the third the month. By intervening in a sorted distribution of data, quarterly data are consecutive for the same year, such as 2018. In this way, when analyzing each month’s data for 2018, you just need to find the starting position of 2018 and read the next block of data directly.

In addition, on top of this multi-column joint index, we have added two levels of skip lists. Each skip list contains the maximum and minimum values of a multi-column joint index. The purpose is to be able to quickly locate the specified keyword when searching. This improves the performance of single or multi-column statistical analysis.

4. Trillion Challenge 4: Regional Search

Then, let’s look at a more specific application scenario — region retrieval. Regional retrieval is based on geographic location information, which is generally based on latitude and longitude. Commonly seen in public security industry.

What are the problems with regional retrieval? Native Lucene handles region retrieval by using Geohash to select a square and then using docValues for secondary validation. For example, to retrieve a circular area, you need to crop out a few extra corners. Using DocValues, as in the previous statistical analysis, leads to random reads and poor performance.

What we have changed is that we have changed the docValues which were scattered randomly in various locations on the disk to store them by location nearby. In other words, data in the same geographic location are stored next to each other on disk. When searching, because the data is close together, the reads become one sequential read. This is a significant performance improvement over the original random read.

5. The Trillion Challenge: Computing Frameworks

In addition to the optimization of the storage layer and index layer, we have also made corresponding changes to the upper computing framework. Although Spark is a general-purpose computing engine, it basically meets our needs functionally. But on a trillion-size scale, there are still some problems. For example, Spark’s underlying fetching of data is a series of brute force reads. When the data size exceeds trillion, performance is poor. Want to improve performance, you need to increase hardware investment, the cost is relatively high. And in the actual production application has appeared all kinds of problems.

So, we made changes to Spark. Change the underlying data store to a distributed index based on our own development. When Spark queries, it doesn’t need to do a brusque scan of the data. Instead, it quickly locates the hit parts of the data through the index first. Thus improve the data query, statistical time. At the same time, we fixed a number of issues with the open source Spark to ensure it would run stably in production.

6. Product architecture

Through the integration and optimization of the above components, we finally formed the following product architecture.

At the bottom is the basic services of Hadoop, and at the top is the SQL computation layer, which interacts with the application layer through SQL statements. The core is the middle storage engine layer, including Lucene’s full-text search function, HBase’s KV model, and the OLAP analysis engine realized by multi-column joint index. In the future, it can also be extended in this layer to realize graph database and some industry customized functions.

In addition, ETL tools can be used to import and export data with common external data sources.

7. Application scenarios

In what scenarios can such a system be applied? Can it support the retrieval and analysis of trillions of data?

The answer is yes. We have already had many practical projects of this system in the public security and military industry. In the public security industry, data from various dimensions of the whole network are collected, and projects with a scale of more than one trillion are very common. This system can be used to provide intelligent research and judgment network through real-time retrieval, association and collision and other functions, and help the public security to detect various cases.

Another scenario is the auto industry. With the rapid development of the Internet of Vehicles, the scale of data generated by cars is also increasing. For example, each on-board terminal T-box will produce tens of thousands of pieces of data every day, and hundreds of thousands of vehicles will produce more than a trillion pieces of data every year. In view of the implementation of the Sixth National Standard, the vehicle exhaust data, fuel consumption data need to be supervised. The data storage and retrieval analysis solutions we provide in cooperation with China Automotive Research Institute have been used in major OEMs.

Another scenario that we are currently developing and studying is spatiotemporal trajectory analysis for aviation and ships. Through multi-dimensional retrieval and visual analysis of a large number of aviation and ship data, and by means of the optimization of time and space data of our products, collision analysis and accompanying analysis of similar trajectories can be realized.

The amount of data in the above several scenarios has reached the scale of trillions. From the actual use effect, our framework can meet the retrieval and analysis requirements supporting trillions of data.

After the event in Beijing, we will bring a new share at the ArchSummit Global Architect Summit at Sheraton Shenzhen Greater China Hotel on July 23-24. The topic has not been prepared yet, so stay tuned!

Next stop, I’ll see you in Shenzhen!

Based on Lucene, trillion-level multi-dimensional retrieval and real-time analysis are realized

One of the trillion challenges: data storage

2. Trillion Challenge II, Retrieval Performance

3. The third trillion challenges, multi-dimensional statistics

4. Trillion Challenge 4: Regional Search

5. The Trillion Challenge: Computing Frameworks

6. Product architecture

7. Application scenarios

Related Posts

Solr (full) – Lucene-based full-text search server

– Check Elasticsearch’s Cache footprint (Qbit)

Learn about Lucene’s full-text index in 5 minutes