On May 29, Zheng Qihua, director of Software technology of Record Data, shared his keynote speech “Achieving trillion-level Multi-dimensional Retrieval and Real-time Analysis based on Lucene” at QCon, which was fully attended. The event was successfully concluded in a strong technical discussion atmosphere. Next, we will share the full text of the speech and the courseware.

I. Full text of the speech

1. Trillion-dollar Challenge: Data storage

The first one is about data storage. Normally we save the data is very simple, write to the hard drive. Large amounts of data are not so simple and present many problems. Cost, for example. Whether you use SSDS or mechanical drives, whether you use 100 large drives or 10,000 small drives, can make a huge difference in cost.

Second, the security of the data is also an issue. If the disk is damaged or mistakenly deleted, the data will be lost. Data migration and capacity expansion are also troublesome. In addition, there is the issue of reading and writing balance. Uneven data writes may cause some disks to be extremely busy and others to be idle. Or, if there is a problem with an individual disk, the read and write speed becomes slow, resulting in all queries getting stuck on the DISK’s IO, reducing overall performance.

Aiming at these problems of storage, we adopted the index technology based on HDFS to solve.

What problems can HDFS solve? The HDFS is a highly fault-tolerant system for unequal reads and writes. If a disk fails or slows down, it automatically switches to a faster copy for reading. In addition, disk read and write data is automatically balanced to avoid data skew. To ensure data security, the HDFS provides functions such as data snapshot and redundant copy to reduce data loss caused by disk damage or incorrect deletion. To reduce the storage cost, HDFS supports heterogeneous storage and can use a variety of storage media to reduce the hardware cost. In addition, HDFS can support large-scale clusters with low cost to use and manage.

In addition, to further reduce the cost of storage, we have developed the capability of column clustering. Native Lucene does not support column clustering. What are the benefits of column clustering?

We can specify data columns as different column clusters, mix different disks by column clusters, and set different life cycles for different column clusters. For example, a document may contain structured data such as title, author, summary, etc., which is usually small and often retrieves. We can then define these data columns as a column cluster and place them on SSDS. Some unstructured data, such as attachments, pictures, and videos, are large and generally cannot be queried. You can define them as another column cluster and place them on the SATA disk. This reduces the use of SSDS. In addition, the combination of column cluster and HDFS heterogeneous strategy, we can also achieve the separation of hot and cold data. For example, some businesses often query data within the latest 1 month. So, we can keep the latest month on SSDS, and after a month, we can move the data to SATA disks, further reducing SSD usage.

Now, let’s look at another problem. One basic application of big data is query retrieval. For example, the page shows a “full-text search” feature that looks for data that contains the keywords entered by the user from a mass of data. Such a search feature is common and not difficult to implement. The hard part is performance. For trillions of pieces of data, do you respond in seconds or hours?

2. Trillionchallenge Ii, Retrieval performance

In order to achieve trillionsec performance. We optimized Lucene’s inversion list.

In the field of full-text retrieval, it is common practice to cut words and record which documents the keywords appear in. Other relevant information is also stored. For example, the frequency of the keyword, the location of the keyword in the document, and so on, there are about a dozen elements, which are stored in the inverted list. Lucene uses row storage for these elements. This means that all dozen elements need to be read in each query. We might only use two or three of them when we retrieve them. Using line storage creates a lot of unnecessary IO. So, what we’ve done here is we’ve changed the metadata in the inversion list to column storage.

The nice thing about columns is that I can only read the contents of the element I’m querying for which element to use, and I can skip the rest of the content. This may seem like a small change, but in large data scenarios, the impact on performance can be significant. For example, the keyword we query may hit hundreds of millions of pieces of data. You need to read hundreds of millions of inversion list metadata, and if you use row storage, the amount of data read is hundreds of millions of metadata multiplied by a dozen elements. With column storage, you only need to read two or three elements, and the difference on disk IO is several times that. Therefore, the optimization, which reduces invalid IO by storing the inversion table metadata, provides several times the performance improvement.

And then, the second optimization, we stored the inversion list in order.

Because we found in the actual scenario, a lot of data will have a temporal characteristic. The data is generated over time. Queries on these data are also generally performed in conjunction with time ranges. Such as querying the last day or the last few hours of car emissions and so on. While the index data of native Lucene is stored in a disorderly way. Data on the same day may be stored in different locations on disks. So when you read it, it’s a random read. So, ** we’ve made a little improvement here as well. Store data in the order in which they are stored. ** So, when we query data for a certain period of time, we only need to read the whole block of data on the disk. We know that the random read performance of mechanical hard disks is much worse than the continuous read performance. In addition, the disk has a read-ahead function when reading continuous data. For the same amount of data, the performance of a continuous read may be an order of magnitude better than that of a random read.

After the optimization of the index, the retrieval performance of the basic trillion data can be achieved second level response.

3. Trillion-dollar Challenge # 3: Multi-dimensional statistics

Now, another common application. Statistical analysis.

This is a typical data cube. There are many dimensions involved, such as time dimension, regional dimension and so on. Each dimension may also have a hierarchical relationship. For example, we first query the annual car ownership data, and then we need to drill down to the quarterly data in the time dimension, or to the monthly data. In contrast, native Lucene has only a single-layer relationship in the index. For a column, only a single-column index can be built. Therefore, the performance of multi-dimensional or multi-level retrieval statistics is relatively poor.

** For multi-dimensional statistics, the industry’s common practice is to use DocValues for statistical analysis. DocValues, on the other hand, creates the issue of random IO. ** So, we changed Lucene’s inverted index table again. The term in the inversion list of this modification.

** There was only one column in the term. After the improvement, the value of multiple columns can be stored. ** For example, the first column holds the year, the second column holds the quarter, and the third column holds the month. Through the sequential distribution of the intervention data, each quarter of the data in the same year, such as 2018, is continuous. In this way, when analyzing the data of each month in 2018, you just need to find the starting position of 2018 and directly read the next piece of data.

In addition, we have added a two-level skip list to this multi-column joint index. Each skip table holds the maximum and minimum values of the multi-column union index. ** The purpose is to search, can quickly locate to the specified keyword above. To improve the performance of single or multi-column statistical analysis.

4. Trillion-dollar Challenge Number four, Regional Search

Then, let’s look at a more specific application scenario – regional retrieval. Area retrieval is based on geographic location information, generally longitude and latitude for query and matching. Common in public security and security industry.

What’s the problem with regional retrieval? Native Lucene handles area retrieval by GeoHash to select a square and then secondary validation using DocValues. For example, to retrieve a circular area, you need to cut out the extra corners. Using DocValues, like the previous statistical analysis, results in random reads and poor performance.

The change we have made is to store the DocValues that were scattered randomly on the disk instead of adjacent to them. ** That is, data with the same geographical location are stored next to each other on disk. When searching, since the data is next to each other, the read becomes a sequential read. Compared to the original random read, the performance has a significant improvement.

5. Trillion-dollar Challenge Number five: The Computing Framework

In addition to the optimization of the storage and index layers, we have also made corresponding changes to the upper computing framework. As a general computing engine, Spark basically meets our requirements. But with trillions of data, there are still some problems. For example, spark violently reads data one by one. When the data size exceeds trillion, the performance is poor. If you want to improve the performance, you need to increase the hardware investment, which is relatively high. And in the actual production application has appeared a variety of problems.

So, ** we changed Spark. Change the underlying data store to be based on our own distributed index. ** When querying data, Spark does not need to perform violent scanning on data. Instead, Spark quickly locates part of the data hit through the index first. Thus improving the data query, statistics time. At the same time, we fixed a number of issues with Open source Spark to ensure that it works well in production systems.

6. Product architecture

Through the integration and optimization of the above components, we finally formed the following product architecture.

At the bottom is the basic services of Hadoop, and at the top is the SQL computing layer, which interacts with the application layer through SQL statements. The core is the middle storage engine layer, which includes lucene’s full-text retrieval function, HBase’s KV model, and OLAP analysis engine implemented by multi-column joint index. This layer can be extended to implement graph database and some industry customized functions.

In addition, data can be imported and exported with common external data sources through ETL tools.

7. Application scenario

In what scenarios can such a system be applied? Can it support the retrieval analysis of trillions of data?

The answer is yes. This system we have many practical projects in the public security and military industry. The public security industry gathers data from all dimensions of the whole network, and projects with a scale of more than one trillion yuan are very common. This system can be used to provide intelligent research and judgment relationship network through real-time retrieval, related collision and other functions, and help the public security to detect various cases.

Another scenario is the auto industry. With the rapid development of the Internet of vehicles, the scale of data generated by cars is getting bigger and bigger. For example, every on-board terminal T-box generates tens of thousands of pieces of data every day, and hundreds of thousands of vehicles generate more than a trillion pieces of data every year. In view of the implementation of NATIONAL standard 6, the vehicle exhaust data and fuel consumption data need to be monitored. We cooperate with China Steam Research and development Co., LTD to provide data storage and retrieval analysis solutions, which have been applied to major oEMS.

** Another scenario that we are currently developing and studying is space-time trajectory analysis for aviation and ships. ** Through multi-dimensional retrieval and visual analysis of a large number of aviation and ship data, with the help of the optimization of time and space data of our products, collision analysis and adjoint analysis of similar trajectory can be realized.

The data volume of the above several scenarios has reached the scale of one trillion. From the actual use effect, our architecture can meet the retrieval and analysis requirements of supporting one trillion data.

2. Courseware sharing

After the activities in Beijing, we will bring new sharing at ArchSummit Global Architects Summit in Sheraton Shenzhen Greater China hotel from July 23 to 24. The agenda has not been worked out yet, so stay tuned!

Next stop, see you in Shenzhen!