Background

Before diving into the case study, here is a brief introduction to 4399 and its big data team:

  • 4399 is China’s earliest and leading online casual game platform, with more than 20 million daily active users
  • 4399 Game Box is 4399’s mobile game distribution platform, with more than 3.5 million active users
  • The 4399 big data team has about 15 people, working mainly on game recommendation, game search, bidding advertising, multidimensional analysis, and the big data platform itself

4399 has used Kylin since v1.5 and has upgraded along with the official releases. The production system currently runs two versions side by side, Kylin v2.0.0 and Kylin v2.3.0, with a total of 20 cubes providing analysis services (such as funnel analysis) for our big data platform. The largest cube builds 250 million rows of data, with 18 dimensions and 9 measures, every weekend, taking about 80 minutes per build. Introducing Kylin helped us solve three major problems:

  • It provides an ANSI SQL interface, simplifying statistical analysis.
  • It solves the problem of inconsistent metric definitions. Previously, the statistical logic had to be rewritten for each new requirement; different authors produced inconsistent definitions, large data discrepancies, and heavy reconciliation work. Now a single fact table is maintained, and related requirements query that same table via SQL, so definitions stay consistent and reconciliation is easy.
  • It dramatically reduces the work required when adding dimensions or measures.
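The "one fact table, consistent definitions" idea can be sketched with a tiny ANSI SQL example. This is illustration only: the table and column names below are made up, and SQLite stands in for Kylin, which would receive the same kind of SQL in production.

```python
import sqlite3

# Hypothetical fact table of game events; in production this SQL
# would be sent to Kylin rather than an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_game_events (
    dt TEXT, game_id INTEGER, event TEXT, cnt INTEGER
);
INSERT INTO fact_game_events VALUES
    ('2018-05-01', 1, 'impression', 1000),
    ('2018-05-01', 1, 'click',       120),
    ('2018-05-01', 1, 'download',     30),
    ('2018-05-01', 2, 'impression',  500),
    ('2018-05-01', 2, 'click',        80);
""")

# Every requirement queries the same fact table, so the metric
# definitions stay consistent across reports.
funnel_sql = """
SELECT game_id,
       SUM(CASE WHEN event = 'impression' THEN cnt ELSE 0 END) AS impressions,
       SUM(CASE WHEN event = 'click'      THEN cnt ELSE 0 END) AS clicks,
       SUM(CASE WHEN event = 'download'   THEN cnt ELSE 0 END) AS downloads
FROM fact_game_events
GROUP BY game_id
ORDER BY game_id
"""
rows = conn.execute(funnel_sql).fetchall()
print(rows)  # [(1, 1000, 120, 30), (2, 500, 80, 0)]
```

Because every report aggregates from the same table with the same expressions, two analysts answering the same question get the same numbers.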

While solving these three problems, Kylin also guarantees fast response times. A well-designed dimension combination not only shortens cube build time but also yields reasonable query latency. Our largest production fact table has 18 dimensions and 9 measures, and 95% of SQL queries against it return results in under 3 seconds.

4399 big data platform introduction

As the business grows, 4399’s data volume is exploding. To collect data completely and mine business value from it, introducing a big data platform was imperative. The company began building one several years ago, using popular open-source components to assemble a platform matching its business needs. To guarantee exactly-once data landing (idempotent writes and consumption), we also developed some small in-house tools.

Up to now, the company’s big data platform has more than 50 nodes with three major responsibilities:

  • Collecting raw logs, adding about 5 TB of logs every day
  • OLAP: multidimensional analysis of logs, the part where Kylin is used
  • User profiling and machine learning, to discover user value

Apache Kylin online application

In the 4399 big data platform, Hadoop provides data management, but existing business analysis tools (such as Tableau and MicroStrategy) have serious limitations: they are hard to scale horizontally, cannot handle large-scale data, and lack Hadoop support. Hive also provides an SQL query interface, but its response time is poor. In this context, Kylin, with its ability to query huge Hive tables at sub-second latency and support high concurrency, arrived at just the right time.

A) Kylin platform architecture

Figure 2-1 shows the technical architecture of Kylin at 4399, including the query and build servers.

Figure 2-1 Kylin platform architecture diagram

I. Deployment

To ensure the stability of the query service, we use Nginx to configure load balancing.

  • Production environment: three query servers and one build server; the HBase cluster has 23 nodes.
  • Test environment: one query server and one build server.
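The Nginx load balancing in front of the query servers can be sketched roughly as below. The hostnames and ports are placeholders (7070 is Kylin’s default); Nginx round-robins requests across the upstream servers by default.

```nginx
# Sketch only: placeholder hostnames; adjust to your environment.
upstream kylin_query {
    server kylin-query-1:7070;
    server kylin-query-2:7070;
    server kylin-query-3:7070;
}

server {
    listen 80;
    location /kylin {
        # Spread query load across the three Kylin query servers.
        proxy_pass http://kylin_query;
    }
}
```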

If multiple Kylin versions share the same ZooKeeper, modify the configuration so that each version uses a separate ZNode path and a separate metadata table (kylin_metadata).
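A sketch of that separation in each instance’s kylin.properties follows. The table and path names are examples; the property names are per the Kylin 2.x docs and should be verified against your version.

```properties
# -- Kylin v2.0.0 instance --
kylin.metadata.url=kylin_metadata_v200@hbase
kylin.env.zookeeper-base-path=/kylin_v200

# -- Kylin v2.3.0 instance --
kylin.metadata.url=kylin_metadata_v230@hbase
kylin.env.zookeeper-base-path=/kylin_v230
```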

II. Data flow

Take the funnel analysis (from impression to click, download, and retention) of our distribution platform, 4399 Game Box, as an example to trace the data flow.

Figure 2-2 Data flow diagram of funnel model

As shown in Figure 2-2, first sort out the fact table and dimension tables to form a star schema (or snowflake schema), analyze the required dimensions and measures, configure the Kylin model, and then configure the corresponding cube. After the cube is built, Kylin can be queried with SQL statements.

B) Display page

To make it easy for operations staff to view data, and to showcase the strengths of Apache Kylin’s multidimensional analysis engine, we developed a set of multidimensional analysis pages. (The charts below use simulated data.)

Figure 2-3 shows the analysis model for downloads. At the top are dimension-based conditional filters and group expansion. In the lower left is the statistical tree of behavior paths, showing the total downloads and proportion of each path at a glance. In the lower right are the metric trend chart and the drill-down list by dimension. All dimensions support conditional filtering and group expansion, and comparing metrics across dimensions makes it more intuitive to see the causes behind data trends.

Figure 2-3 Multidimensional analysis interface

The effect of grouping by game dimension is shown in Figure 2-4.

Figure 2-4 Grouping by the game dimension

C) Apache Kylin optimization suggestions

As is well known, the core idea of Kylin is precomputation: the aggregate values that may be used in multidimensional analysis are computed in advance, saved in the cube, and served at query time. Two performance concerns are therefore involved:

  • Query response time.
  • The time and space that precomputation takes.

I. Query time optimization

Kylin’s query process consists of four steps: parsing the SQL, fetching data from HBase, performing secondary aggregation, and returning the results. Obviously, optimization focuses on speeding up HBase data retrieval and reducing the secondary aggregation work.

  • Improve HBase response time: tune the configuration, adjust the cache policy, and increase the Block Cache size.
  • Reduce secondary aggregation: design dimensions sensibly so that queries hit a cuboid exactly as often as possible; use approximate (lossy) algorithms for count-distinct.
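Increasing the Block Cache is an hbase-site.xml change on the RegionServers; a sketch is below. The value 0.5 is an example (the default is around 0.4), and it must leave enough heap for the MemStore.

```xml
<!-- Sketch: raise the Block Cache's share of RegionServer heap
     so more HFile blocks stay cached for Kylin queries. -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.5</value>
</property>
```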

D) Precomputation optimization

Precomputation optimization is mainly about shortening the build time and reducing the space occupied by intermediate and final results. We create a separate cube for each business to avoid a single all-encompassing cube and to cut unnecessary computation.

I. Cube optimization

As the number of dimensions increases, the number of cuboids grows exponentially. To ease the pressure of cube building, Kylin provides advanced cube settings: Aggregation Groups, Joint Dimensions, Hierarchy Dimensions, and Mandatory Dimensions.

By tuning the dimension configuration sensibly, pruning the cuboids to be built, and selecting only the cuboids that are really needed, we optimized build performance, shortened build time, and greatly improved cluster resource utilization. Figure 2-5 compares the effect before and after optimization:

Figure 2-5 Effect comparison before and after optimization

1. Mandatory dimensions

Dimensions that appear in almost every query, as well as low-cardinality dimensions. If a dimension’s cardinality is less than 10, it can be treated as a mandatory dimension.

2. Hierarchical dimensions

Hierarchy dimensions can be used when the dimensions have a hierarchical relationship and their cardinalities go from small to large (for example, country, province, city).

3. Joint dimensions

When dimensions always appear together, and most queries use them at the same time, joint dimensions can be used.

4. Aggregation groups

Group the dimensions so that dimensions from different groups never appear in the same query.
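The pruning effect of these settings can be estimated with some simple counting. This is a rough sketch of the standard combinatorics, not Kylin’s exact planner: with d independent dimensions there are 2^d cuboids; a mandatory dimension stops doubling the count, a joint group of k dimensions behaves like one combined dimension, and a hierarchy of k dimensions allows only its k+1 prefixes instead of 2^k subsets.

```python
def cuboids(total_dims):
    # Without pruning, every subset of dimensions is a cuboid.
    return 2 ** total_dims

def with_mandatory(total_dims, mandatory):
    # Mandatory dims appear in every cuboid, so they no longer double the count.
    return 2 ** (total_dims - mandatory)

def with_joint(total_dims, joint_size):
    # A joint group of k dims is either all present or all absent:
    # it counts as a single combined dimension.
    return 2 ** (total_dims - joint_size + 1)

def with_hierarchy(total_dims, hierarchy_size):
    # k hierarchical dims allow only k+1 valid prefixes (including none)
    # instead of 2**k arbitrary combinations.
    return (hierarchy_size + 1) * 2 ** (total_dims - hierarchy_size)

print(cuboids(18))            # 262144
print(with_mandatory(18, 1))  # 131072
print(with_joint(18, 3))      # 65536
print(with_hierarchy(18, 3))  # 131072
```

Even one well-chosen joint group over an 18-dimension cube, as the numbers above suggest, can cut the cuboid space by a factor of four before aggregation groups are even considered.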

II. Configuration optimization

Configuration optimization includes configuring Kylin’s own resources and adjusting the Hadoop cluster configuration.

1. Build resources

Each cube requires different resources when built, so the settings should be adjusted per cube.

2. Adjust replication

The default replication factor for files in the cluster is 3. During cube builds we set it to 2, and even 1 for some intermediate jobs, to reduce cluster I/O during build tasks. To guarantee query stability, the HBase replication factor stays at 3.
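A sketch of that change as it might appear in Kylin’s job-level Hadoop configuration (kylin_job_conf.xml, which overrides the cluster default for build jobs only) is below; verify the file name against your Kylin version.

```xml
<!-- Sketch: lower HDFS replication for Kylin build-job output.
     HBase data itself keeps the cluster default of 3. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```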

3. Enable compression

Snappy compression is enabled for the Hadoop cluster, and Snappy compression is enabled for HBase. The HFiles generated compress by up to about 70%, greatly reducing the cluster’s I/O load.
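On the Kylin side, the HBase compression codec is a single property; a sketch is below. The property name is per the Kylin 2.x docs (older 1.x releases used a different key), so verify it against your version, and make sure the Snappy native libraries are installed on the cluster first.

```properties
# Sketch: kylin.properties -- compress the HFiles Kylin writes into HBase.
kylin.storage.hbase.compression-codec=snappy
```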

Afterword

Several problems remain on the 4399 big data platform: the HBase cluster is not stable, query response time fluctuates, some statements respond too slowly, and the old HBase table is not deleted automatically when a cube segment is rebuilt. We plan the following optimizations around these issues:

  • Make the HBase cluster independent, to avoid interference from other cluster tasks; adjust the configuration and optimize queries, increasing the memory ratio reserved for the query cache.
  • Move the results of some near-real-time build tasks into HBase in-memory tables, reducing response time and greatly improving its stability.
  • Modify the source code to automatically drop expired HBase tables after a rebuild, reducing HBase index pressure.

About the author

Lin Xingcai graduated from Xiamen University, majoring in computer science and technology. He has years of experience in embedded development and system operations, and now works as a big data development engineer at 4399 Network Co., Ltd., mainly responsible for planning and building the big data platform.

Thanks to Cai Fangfang for proofreading this article.