In the era of refined product operations, product growth questions come up constantly: why a metric rose or fell, how a version iteration performed, how an operational campaign performed. Such analyses are frequent and time-sensitive, and with analysts in short supply the traditional data analysis model struggles to keep up. This article describes building a lightweight big data analysis system, MVP, from 0 to 1 to address these pain points.

Article author: Data Xiong, Tencent Cloud big data analysis engineer.

I. Background and problems

In a product-matrix business, a dashboard can surface growth problems quickly. Quickly understanding the reasons behind a problem, however, is a high-frequency and complex data analysis demand.

When a data analyst runs the analysis manually, finding the cause usually takes 0.5 to 1 day. Manual analysis therefore consumes a lot of manpower and is inefficient.

In addition, product version iterations and operational campaigns require rapid analysis of new versions, new features and new activities to verify their effect. Refined operation of a product matrix therefore generates a large volume of data analysis demands that need to be completed quickly.

In the traditional data analysis model, each requirement typically takes 3-5 days to resolve, and the model needs a large number of data analysts to keep up. When analysts are in short supply, this model cannot meet the data analysis demands of product growth.

II. Solutions

With the traditional data analysis model falling short, a new model is urgently needed to meet the data analysis demands of product growth quickly.

The author and a small project team therefore built a lightweight big data analysis system, MVP, from zero to one. The name reflects the goal: data analysis that drives a product from “Minimum Viable Product” to “Most Valuable Product”.

Through the MVP data analysis system, we hope both to improve the efficiency of data analysis and to save data analysis manpower.

The MVP data analysis system is divided into four modules. The product operations and indicators module analyzes product growth indicators based on the AARRR model, including the product's North Star metric. The indicator anomaly and root-cause alert module monitors abnormal growth indicators and provides root-cause clues. The analysis tools and growth analysis module supports in-depth analysis of user behavior to gain insight into it. The A/B test experiment evaluation module tests business decision options to evaluate whether they are sound. Through these four modules, data analysis drives refined product operation.

III. Technical implementation

A lightweight big data analysis system needs to be addressed from at least three aspects: data modeling, technology selection and page interaction. Data modeling runs through the entire data analysis system like flowing water; technology selection is the infrastructure that keeps the whole system running efficiently; page interaction is the user-facing layer, which lets the data speak and puts data to work for business growth.

1. Data modeling

Before MVP was developed, for historical reasons the existing product matrix suffered from scattered data construction, duplicated data development and data isolated between products, so a single user could have multiple information records.

This data pattern not only wastes computing, storage and human resources; more seriously, it greatly reduces the efficiency of upper-layer data applications. The old data model therefore no longer works, and a new one is needed.

The MVP data analysis system addresses this in two ways. First, based on the idea of “User + Event ID + Config”, it abstracts and integrates product data at a high level, converging the business data of the product matrix. Second, based on a key-value model, it generates a large, wide user table in which each User_Id has exactly one record.
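To make the "one record per User_Id" idea concrete, here is a minimal Python sketch. The record layout, field names and the `build_wide_table` helper are illustrative assumptions, not the system's actual schema: it simply shows how per-product `(user, event ID, config)` records could be folded into a single key-value row per user.

```python
from collections import defaultdict

# Hypothetical raw records from different products: (user_id, event_id, config).
# The field names and values are illustrative, not the system's real schema.
raw_records = [
    ("u1", "login", {"product": "app_a", "channel": "store"}),
    ("u1", "purchase", {"product": "app_b", "amount": "12"}),
    ("u2", "login", {"product": "app_a", "channel": "web"}),
]


def build_wide_table(records):
    """Collapse many per-product records into one key-value row per user."""
    wide = defaultdict(dict)
    for user_id, event_id, config in records:
        row = wide[user_id]
        for key, value in config.items():
            # Namespace each attribute by event ID so keys do not collide.
            row[f"{event_id}.{key}"] = value
    return dict(wide)


wide_table = build_wide_table(raw_records)
for user_id, row in sorted(wide_table.items()):
    print(user_id, row)
```

Because every attribute lives under a key in the user's single row, adding a new product or event only adds keys; it never adds a second record for the same user.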

2. Technical selection

For everyday product data visualization, MySQL is the usual first choice for interactive, page-based data analysis. However, a MySQL database handles data on the scale of millions of rows, which suits analysis of pre-aggregated result data; it cannot cope with hundreds of millions of rows.

In complex data analysis scenarios, it is usually necessary to perform OLAP analysis that freely cross-combines multiple dimensions of user profiles and user behavior. For product businesses at the million-plus scale, MySQL therefore cannot support real-time OLAP analysis, and a different technology choice is needed.
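The kind of free cross-combination described above can be sketched in a few lines of Python. This is only an in-memory illustration under assumed data: the `users` rows, the dimension names and the `cross_count` helper are hypothetical, and a real system would push this work down to the OLAP engine.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user rows mixing profile attributes and a behavior flag.
users = [
    {"user_id": "u1", "region": "north", "device": "ios", "active": True},
    {"user_id": "u2", "region": "north", "device": "android", "active": True},
    {"user_id": "u3", "region": "south", "device": "ios", "active": False},
    {"user_id": "u4", "region": "north", "device": "ios", "active": True},
]


def cross_count(rows, dims, predicate=lambda row: True):
    """Count rows for every observed combination of the given dimensions."""
    counts = Counter()
    for row in rows:
        if predicate(row):
            counts[tuple(row[d] for d in dims)] += 1
    return counts


dimensions = ("region", "device")
# Try every non-empty combination of dimensions, as an analyst might on a page.
for size in range(1, len(dimensions) + 1):
    for dims in combinations(dimensions, size):
        print(dims, dict(cross_count(users, dims, lambda row: row["active"])))
```

The number of dimension combinations grows quickly, and each one scans the user rows again, which is why this workload favors a columnar engine over a row-oriented store at hundreds of millions of rows.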

To achieve real-time OLAP analysis, we investigated and compared the big data analysis platform solutions used in the industry. HDFS and HBase are the main storage engines; Impala, Druid, ClickHouse and Spark are the most commonly used computing engines. Of these, Druid has high maintenance costs, no Join capability, and complex syntax.

In terms of computing performance, ClickHouse is more than 2 times faster than Presto, more than 3 times faster than Impala, and about 4 times faster than Spark SQL.

In our measurements on 220 million+ records (1.79GB), a single-table aggregation took 0.095s, an analysis speed of 18.95GB/s.
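As a quick sanity check on the measured figures, the throughput can be recomputed from the published inputs. The sketch below assumes the rounded values quoted in the text; dividing them gives roughly 18.8 GB/s, slightly under the reported 18.95GB/s, a gap that is presumably due to rounding of the published size and time (that explanation is an assumption, not something the source states).

```python
# Published (rounded) measurement inputs from the text.
rows = 220_000_000   # "220 million+" records, taken at the lower bound
size_gb = 1.79       # data size in GB
elapsed_s = 0.095    # single-table aggregation time in seconds

throughput_gb_s = size_gb / elapsed_s
rows_per_s = rows / elapsed_s

print(f"{throughput_gb_s:.2f} GB/s")
print(f"{rows_per_s / 1e9:.2f} billion rows/s")
```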

Compared with Impala, ClickHouse can import data directly through JDBC, so the data import cost is low, and its system maintenance cost is also relatively low. ClickHouse also has simple syntax, is easy to use, and is friendly to page development, making it possible to build visual pages quickly.

Based on the above factors, we adopted the HDFS+ClickHouse+Spark solution. Here Spark makes up for ClickHouse's inability to perform large-scale Join operations, handling tasks such as large, complex association analysis.

In addition, Spark can seamlessly access Hive table data in HDFS without re-exporting the data, which improves application efficiency. HDFS stores the full historical label and behavior data (about 80%), and ClickHouse stores the recent label and behavior data (about 20%).
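The division of labor described above can be pictured as a small routing rule. The following Python sketch is an assumption-laden illustration: the 30-day hot window, the `AnalysisRequest` type and the `choose_engine` function are hypothetical (the article only says ClickHouse holds "recent" data), and the engine names are plain labels rather than real client calls.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: the article only says "recent"; 30 days is an illustrative cutoff.
HOT_WINDOW_DAYS = 30


@dataclass
class AnalysisRequest:
    start: date                      # first day of data the analysis needs
    end: date                        # last day of data the analysis needs
    needs_large_join: bool = False   # large, complex association analysis


def choose_engine(request, today):
    """Route a request to ClickHouse (recent data) or Spark on HDFS."""
    hot_start = today - timedelta(days=HOT_WINDOW_DAYS)
    if request.needs_large_join or request.start < hot_start:
        # Historical data and large joins go to Spark reading Hive tables on HDFS.
        return "spark_on_hdfs"
    return "clickhouse"


today = date(2024, 6, 30)  # fixed date so the example is reproducible
recent = AnalysisRequest(date(2024, 6, 20), date(2024, 6, 29))
historical = AnalysisRequest(date(2023, 1, 1), date(2024, 6, 29))
big_join = AnalysisRequest(date(2024, 6, 20), date(2024, 6, 29), needs_large_join=True)

for name, request in [("recent", recent), ("historical", historical), ("big_join", big_join)]:
    print(name, "->", choose_engine(request, today))
```

Keeping only the recent slice in ClickHouse keeps the interactive path small and fast, while Spark absorbs the occasional heavy job over the full history.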

3. Page interaction

In terms of page interaction, MVP lets 80% of data analysis demands be completed directly through real-time analysis on the page; the remaining 20% of complex analysis tasks are completed by submitting analysis tasks.

Real-time page analysis returns results within seconds, while task-submitted analysis returns results in 5-15 minutes. Business indicator system analysis, event model analysis, funnel model analysis and retention model analysis are completed through real-time page analysis; user group profile insight and user interest preference insight are completed through task-submitted analysis.
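As an illustration of one of the real-time analyses listed above, here is a minimal funnel computation in Python. It is a sketch under assumptions: the event tuples, step names and `funnel_counts` helper are hypothetical, and it uses a simple ordered funnel with no time window, which may differ from how MVP defines its funnel model.

```python
def funnel_counts(events, steps):
    """Count users reaching each funnel step in order.

    events: iterable of (user_id, timestamp, event_id)
    steps:  ordered list of event IDs that define the funnel
    """
    by_user = {}
    for user_id, timestamp, event_id in events:
        by_user.setdefault(user_id, []).append((timestamp, event_id))

    reached = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()  # process each user's events in time order
        next_step = 0
        for _, event_id in user_events:
            if next_step < len(steps) and event_id == steps[next_step]:
                reached[next_step] += 1
                next_step += 1
    return reached


# Hypothetical event log: u3 adds to cart before viewing, so only "view" counts.
events = [
    ("u1", 1, "view"), ("u1", 2, "cart"), ("u1", 3, "pay"),
    ("u2", 1, "view"), ("u2", 2, "cart"),
    ("u3", 1, "cart"), ("u3", 2, "view"),
]
steps = ["view", "cart", "pay"]

counts = funnel_counts(events, steps)
for step, count in zip(steps, counts):
    print(f"{step}: {count} users ({count / counts[0]:.0%} of first step)")
```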

4. Application effect

In the traditional data analysis model, following the standard process of “raise requirement -> requirement review -> write requirement -> data analysis -> output results”, a data request takes 3-5 days to resolve. The MVP system completes data analysis requests quickly, greatly shortening turnaround and significantly improving analysis efficiency. The MVP data analysis system is currently in internal use; recently the number of data analysis tasks run on MVP has exceeded 1,500, with a peak of over 2,000.

The shift from “manual data analysis” to “tool-based data analysis” has significantly improved the efficiency of data analysis and better supports data-driven, refined product operation.

5. Summary

This article introduced MVP, a lightweight big data analysis system built from 0 to 1. MVP is currently in internal use, where it has significantly improved the efficiency of data analysis, supported data-driven business growth, and reduced the manpower required from data analysts. Going forward, based on the product matrix business, we will improve the existing modules and further polish each growth tool to improve the MVP experience.

MVP goes external: combined with the data platform to serve industry customers

As an internal system, MVP currently saves the department a great deal of time and cost in mobile data analysis, and has accumulated a wealth of Internet analysis templates and tools. While serving industry customers, we found that mobile data analysis solutions such as MVP are also a necessary tool for the digital transformation of traditional industries.

To this end, we use the lightweight data platform Xix as the data foundation to solve the underlying platform problem of deploying MVP externally, and have developed a toB version of MVP that can be delivered privately to industry customers, helping them optimize their operation strategies through real-time user behavior analysis and profile insight.

The data platform is a lightweight big data platform product that is cost-effective to deploy, easy to operate and maintain, and suited to private deployment, meeting the big data needs of small and medium-sized projects in a “small but beautiful” way. In project practice, the data platform and MVP have formed a complementary combination, and we have started to provide “out-of-the-box” mobile analytics services for industry customers.

Brief introduction of functions:

First, high-performance big data components with integrated batch and stream processing make it possible to deploy a private data platform quickly, without deploying a range of complicated open source components.
Second, a visual task flow serves as the data development platform; combining Spark SQL with the SPL we provide, a data application can be developed quickly in a graphical interface.
Third, a visual site can be built quickly to show your data metrics to colleagues, customers and leaders.

For platform consultation or business cooperation:

[email protected]

