About the author:

Lianjia engineer, member of the big data architecture team, mainly responsible for OLAP platform construction and big data application expansion.

Preface

With the expansion of Lianjia's business lines and the growth of its data ecosystem, data volume has increased rapidly. Since the big data department was established in 2015, the cluster's storage has reached 9 PB across 200+ servers. Data demands have grown along with the business: statistical analysis, metric APIs, operational reports, and so on. Requirements differ greatly across business lines, dimensions keep multiplying, and customized development is constantly needed. Against billions of rows of data, the services must deliver low-latency responses, stable operation, and accurate results. Lianjia's data analysis engine has gone through the following stages of development.

Early ROLAP architecture

At first, data volumes were small and grew slowly, and requirements were fragmented and exploratory. The following ROLAP pipeline supported data analysis:

Data sources are ingested into HDFS and loaded into Hive. Data engineers develop ETL scripts against business requirements and configure Oozie to schedule and execute them, generating data layer by layer according to the data warehouse's hierarchical model. The results are pushed to MySQL, where data is filtered by dimension and aggregated for display.

As data volume and demand grew, bottlenecks gradually appeared. A dedicated script had to be developed for every requirement; each added dimension lengthened the development cycle, and ever more manpower was needed to produce data quickly and respond to changing requirements.

OLAP platform and Kylin

As shown above, Lianjia's OLAP platform was built at the end of 2016. Kylin is deployed as a cluster of 6 machines: 3 for distributed Cube builds and 3, behind a load balancer, for queries. Each query machine's available memory is capped at 80 GB. Because HBase performance degrades when the compute cluster runs large jobs under heavy memory pressure, the Kylin cluster relies on an independent HBase cluster to avoid interference with compute workloads. That HBase cluster is further optimized with read/write separation, an SSD_FIRST policy for reading remote SSDs, and HDFS tuning.

Because Kylin focuses on precomputation and does not keep detailed data, ad-hoc and detail queries are served by a self-developed QE engine. Underneath, QE relies on Spark, Presto, and Hive, routing each query to the appropriate engine according to specific rules. Multidimensional analysis queries are served by the Kylin cluster, which can also handle simple real-time aggregation.

At present, Kylin's main consumer is the metric API platform, which caches results according to the characteristics of the query SQL. The metric API serves as the unified data export and feeds other business products. Current statistics: 350+ Cubes covering 12 of the company's business lines; 200+ TB of total Cube storage holding trillions of rows, with the largest single Cube exceeding 4 billion rows; 270,000+ queries per day. On cache misses, 70% of queries return in under 500 ms and 90% in under 1 s; a small number of complex SQL queries take about 10 s.

Kylin application scenarios and usage specifications

Application scenarios: large data volumes; non-real-time workloads (hour-level latency is currently supported); dimension combinations and query conditions within a predictable range; query conditions that do not scan overly large ranges. Kylin is not suited to scenarios requiring large-scale fuzzy search and ranking (search-engine-like workloads). Using Kylin in a standardized way is very important. In the early stage of the rollout we stepped into many pits, not because of bugs in the software but because we had not understood Kylin's usage process and conventions in detail. We gradually accumulated experience and recorded it on the company wiki for colleagues to consult. Roughly as follows:

1. Dimension optimization. Precomputed results are stored in HBase and must be queryable in real time, so both storage and query cost must be considered when configuring dimensions. Dimension encoding: choose a storage type appropriate to the dimension's value type to save space and speed up HBase scans. Dimensions can be sharded according to business needs, increasing query concurrency and shortening query time. Use dictionary encoding wherever the cardinality permits. For the partition column, the raw format is usually yyyy-MM-dd HH:mm:ss; if only day granularity is needed, store it as the number yyyyMMdd, which greatly reduces the dimension's cardinality.
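As a toy illustration of the partition-column point (a hypothetical sketch, not Kylin code): keeping second-granularity timestamp strings gives the dimension a huge cardinality, while truncating to an integer day key collapses it to one value per day.

```python
from datetime import datetime, timedelta

def day_key(ts: str) -> int:
    """Truncate a 'yyyy-MM-dd HH:mm:ss' timestamp to an integer yyyyMMdd day key."""
    return int(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d"))

# One day of second-granularity timestamps: 86,400 distinct raw values,
# but a single value once encoded as a day key.
start = datetime(2018, 3, 1)
raw = [(start + timedelta(seconds=s)).strftime("%Y-%m-%d %H:%M:%S")
       for s in range(86_400)]
encoded = {day_key(ts) for ts in raw}

print(len(set(raw)))  # 86400 distinct dimension values
print(encoded)        # {20180301} -- one value per day
```

A dimension whose cardinality shrinks by four orders of magnitude like this also shrinks the dictionary and the RowKey space it occupies.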

2. RowKey ordering. In HBase, the RowKey is composed of the combined dimension values, so query patterns must be considered: place frequently queried dimensions at the front of the RowKey.
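A toy model of why ordering matters (illustrative only, not HBase code): HBase scans contiguous, lexicographically sorted key ranges, so a filter on the leading dimension becomes a short prefix scan, while the same filter on a trailing dimension forces a scan across every leading value.

```python
from itertools import product

cities = [f"city{i:02d}" for i in range(50)]
dates = [f"201803{d:02d}" for d in range(1, 31)]

# RowKey = city + date: filtering on one city is one contiguous prefix range.
keys_city_first = sorted("|".join(k) for k in product(cities, dates))
# RowKey = date + city: the same filter hits one small range per date.
keys_date_first = sorted("|".join(k) for k in product(dates, cities))

def contiguous_rows_scanned(keys, predicate):
    """Rows between first and last match, i.e. the range a scan must cover."""
    matches = [i for i, k in enumerate(keys) if predicate(k)]
    return matches[-1] - matches[0] + 1

want = lambda k: "city07" in k
print(contiguous_rows_scanned(keys_city_first, want))  # 30: one tight range
print(contiguous_rows_scanned(keys_date_first, want))  # spans most of the table
```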

3. Dimension combination optimization. The number of dimension combinations determines the final data volume, so Cube configuration should aim to reduce combinations. Based on business needs and the features Kylin supports: use derived dimensions, materializing only the dimension table's primary key and trading some runtime join/aggregation cost for far fewer combinations; use aggregation groups to cluster related dimensions, configuring mandatory, hierarchy, and joint dimensions within each group according to the dimensions' characteristics. Aggregation groups can be designed very flexibly; for example, a high-cardinality dimension can be placed in a group of its own.
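The effect of these settings on cuboid count can be sketched with a back-of-the-envelope calculation (illustrative only; Kylin's actual pruning is more involved): n unconstrained dimensions yield 2^n cuboids, a mandatory dimension appears in every cuboid and no longer doubles the count, a hierarchy of h levels contributes h + 1 choices instead of 2^h, and a joint group of j dimensions contributes 2 choices instead of 2^j.

```python
def cuboid_count(plain=0, mandatory=0, hierarchies=(), joints=()):
    """Approximate cuboid count for a single aggregation group.

    plain       -- unconstrained dimensions: factor 2 each
    mandatory   -- dimensions present in every cuboid: factor 1 each
    hierarchies -- tuple of level counts; an h-level hierarchy: factor h + 1
    joints      -- tuple of group sizes; each joint group: factor 2
    """
    count = 2 ** plain
    for h in hierarchies:
        count *= h + 1
    count *= 2 ** len(joints)
    return count

# 10 unconstrained dimensions explode into 1024 cuboids ...
print(cuboid_count(plain=10))  # 1024
# ... while 3 plain + 1 mandatory + a 3-level hierarchy + a 3-dim joint
# group stays at 8 * 4 * 2 = 64 cuboids for the same 10 dimensions.
print(cuboid_count(plain=3, mandatory=1, hierarchies=(3,), joints=(3,)))  # 64
```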

4. Clean up invalid data in a timely manner. Errors during builds or cluster faults can leave garbage files behind, and unused segments accumulate over time, occupying storage, increasing NameNode memory pressure, and bloating HBase, Hive, and Kylin metadata. Clean them up periodically to keep the storage environment tidy.

5. Monitor cluster status in real time, keep Cube build and query latency low, continually optimize the data model, Cube design, and storage, and find the best balance among storage, build, and query performance according to users' real needs.

Lianjia's extensions to Kylin

We currently run Kylin 1.6; the latest release is 2.3. Version 2.0 introduced new features and changed some configuration files and properties. Given the large volume of Cube data and the many business consumers involved, and with no obvious bottleneck at present, we have not rushed to upgrade; instead, we have introduced some important 2.0+ features, such as distributed builds and distributed locks.

We maintain our own branch of the Kylin code, optimized for our specific scenarios, including:

1. Distributed builds. Native Kylin schedules all builds on a single machine. As the number of Cubes grows, one machine clearly cannot keep up: beyond the limit on concurrent tasks, tasks interfere with each other and slow builds down. By modifying Kylin's job scheduling policy, multiple machines can build data simultaneously, letting Kylin's build capacity scale horizontally and keeping data builds on schedule.
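A minimal sketch of the idea (all names hypothetical; the real change hooks into Kylin's job scheduler): each build node tries to claim a pending job by winning a per-job lock, similar in spirit to the ZooKeeper-based distributed scheduler of Kylin 2.x, so work spreads across whichever nodes win and adding nodes adds build capacity.

```python
import threading

class JobLockTable:
    """Stand-in for a distributed lock service (e.g. ZooKeeper)."""
    def __init__(self):
        self._owners = {}
        self._mutex = threading.Lock()

    def try_acquire(self, job_id, node):
        # First claimant becomes the owner; later claimants are refused.
        with self._mutex:
            return self._owners.setdefault(job_id, node) == node

def claim_jobs(pending_jobs, node, locks):
    """Each scheduler node runs only the jobs whose lock it wins."""
    return [j for j in pending_jobs if locks.try_acquire(j, node)]

locks = JobLockTable()
jobs = [f"cube-{i}-build" for i in range(6)]
claimed_a = claim_jobs(jobs[:3], "node-a", locks)  # node-a sees 3 jobs first
claimed_b = claim_jobs(jobs, "node-b", locks)      # node-b gets the rest
print(claimed_a)  # ['cube-0-build', 'cube-1-build', 'cube-2-build']
print(claimed_b)  # ['cube-3-build', 'cube-4-build', 'cube-5-build']
```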

2. Optimized dictionary downloads at build time. When building cuboid data, native Kylin downloads all of a column's dictionaries to the node. When a column has many dictionaries or the dictionary files are very large, this wastes a lot of time on file transfer. We modified the code so a task downloads only the dictionary files it actually needs, reducing transfer time and speeding up builds.
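The fix can be sketched as a simple filter (paths and layout are illustrative, not Kylin's actual resource store): rather than pulling every dictionary resource of the cube to the worker, pull only those belonging to the columns the current task builds.

```python
def dicts_to_download(all_dict_paths, needed_columns):
    """Keep only dictionary files whose column this task actually uses."""
    return [p for p in all_dict_paths
            if p.rsplit("/", 2)[-2] in needed_columns]

all_paths = [
    "dict/FACT/CITY_ID/v1.dict",
    "dict/FACT/CITY_ID/v2.dict",
    "dict/FACT/USER_ID/v1.dict",   # huge high-cardinality dictionary
    "dict/FACT/CHANNEL/v1.dict",
]
# This task only aggregates CITY_ID and CHANNEL, so USER_ID's file is skipped.
print(dicts_to_download(all_paths, {"CITY_ID", "CHANNEL"}))
```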

3. Global dictionary lock. Build jobs for the same Cube share the global dictionary lock, so when one job failed abnormally, the others could not acquire the lock. This bug has been fixed and submitted upstream (issues.apache.org/jira/browse…).

4. Metadata synchronization across multiple query instances. RestClient's use of BasicClientConnManager hit a concurrency bottleneck and threw exceptions; the fix replaces it with PoolingClientConnectionManager and has been submitted upstream (issues.apache.org/jira/browse…).

5. Fixed a bug in which a Cube's last_build_time was not set correctly; the fix has been submitted upstream (issues.apache.org/jira/browse…).

6. Support forcing a Cube to join its dimension tables, filtering invalid dimension values out of the fact table. The temporary flat table Kylin creates serves as the build's data source. If a fact-table column that joins to a dimension table is itself used as a dimension, Kylin by default does not join the dimension table and takes the fact-table values directly; in the Build Cube step, the dimension table's dictionary is then used to encode those values, which causes problems when the fact table contains values absent from the dimension table. We added a configuration item that lets Kylin force the join with the dimension table, filtering dirty data out of the fact table.

7. Kylin query machines load large amounts of data into memory during queries and aggregations, occupying substantial heap and even triggering frequent Full GCs. CMS garbage collection performed poorly in this scenario, so we switched to the G1 collector, aiming for controllable stop-the-world (STW) pause times, and tune it as needed.

Besides these changes to Kylin itself, we developed Kylin middleware to provide task scheduling, status monitoring, permission management, and other functions.

Kylin middleware

The middleware handles Cube management and task scheduling, shielding the Kylin cluster from external callers. The architecture is shown below.

It adds the following capabilities:

  1. A queue of theoretically unlimited capacity; in practice the task volume never grows that large, nor do tasks pile up indefinitely.
  2. Priority scheduling for specific Cubes, ensuring that important data is produced on time.
  3. Permission control via the metadata management platform: SQL queries are executed through the middleware, and the metric API platform requires API query interfaces to be configured on the metadata platform in advance, so users only see the data their permissions allow.
  4. A limited number of retries for failed tasks; if the retries are exhausted, an alarm is raised.
  5. Concurrency control: since the Kylin cluster's carrying capacity is limited, too many simultaneous tasks would cause widespread failures. At present, at most 50 build tasks may run concurrently.
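The scheduling rules above (per-Cube priority, a cap of 50 concurrent builds, bounded retries ending in an alarm) can be sketched roughly as follows; all class and method names are hypothetical, not the middleware's actual API:

```python
import heapq

MAX_CONCURRENT = 50   # the Kylin cluster's carrying capacity
MAX_RETRIES = 3

class MiddlewareScheduler:
    def __init__(self, max_concurrent=MAX_CONCURRENT):
        self.queue = []            # unbounded priority queue of pending builds
        self.running = set()
        self.max_concurrent = max_concurrent
        self.alarms = []

    def submit(self, cube, priority=10):
        heapq.heappush(self.queue, (priority, cube))  # lower = more urgent

    def dispatch(self):
        """Start queued jobs without exceeding the concurrency cap."""
        started = []
        while self.queue and len(self.running) < self.max_concurrent:
            _, cube = heapq.heappop(self.queue)
            self.running.add(cube)
            started.append(cube)
        return started

    def on_failure(self, cube, attempt):
        self.running.discard(cube)
        if attempt < MAX_RETRIES:
            self.submit(cube, priority=0)   # retry ahead of normal jobs
        else:
            self.alarms.append(f"build of {cube} failed {MAX_RETRIES} times")

sched = MiddlewareScheduler(max_concurrent=2)
sched.submit("ordinary_cube")
sched.submit("important_cube", priority=1)  # runs first despite arriving later
sched.submit("another_cube")
print(sched.dispatch())                     # cap of 2: one cube stays queued
sched.on_failure("important_cube", attempt=3)
print(sched.alarms)                         # retries exhausted -> alarm raised
```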

Conclusion

Kylin's core components are extensible, it supports very large data volumes, and its ANSI SQL interface is easy to use. As the key component of our OLAP platform, Kylin carries essentially all multidimensional analysis workloads and has improved data delivery efficiency and query performance. Compared with the earlier ROLAP architecture, the team now focuses only on basic data construction and data exploration, saving significant manpower and improving overall maintainability.

Kyligence helped us a great deal during the construction of the OLAP platform, and we maintain technical exchanges with other companies. The Kylin community is very active, and the core development team is enthusiastic and efficient. As a top-level Apache open-source project led by Chinese developers, we hope Kylin and its community continue to develop and thrive.

Going forward, we will continue to track business requirements and optimize cluster performance to improve cluster stability and usability. We are also focusing on large-result-set query performance, the Spark build engine, task resource isolation, and more.

About Lianjia's big data architecture team

Lianjia's big data architecture team is responsible for the architecture, performance optimization, and development of the company's big data storage, computing, and real-time data flow platforms. It develops an efficient big data OLAP engine and big data tool-chain components, providing the company with stable, efficient, and open big data foundation components and platforms.