Abstract: This article is shared by Mao Xiangyi, senior data platform engineer at Good Future. It mainly introduces the practice of batch-stream fusion in the education industry. The content consists of two parts: the first part covers our thinking on how Good Future built its real-time platform; the second part shares data analysis scenarios unique to the education industry. The outline is as follows:

1. Background introduction
2. Good Future T-Streaming real-time platform
3. Typical analysis scenarios in K12 education
4. Outlook and planning

1. Background

Good Future introduction

Good Future is an education technology company founded in 2003 under the brand Xueersi. Xueersi Peiyou and Xueersi Online School, which you may have heard of, are derivatives of this brand. The company was listed on NASDAQ in 2010 and changed its name to Good Future in 2013. By 2016, the company's business had covered users from -1 to 24 years old (from before birth to age 24). At present, the company's main business units include smart education, an open platform for the education sector, K12 education, and overseas study.

Panorama of Good Future's data middle platform

The picture above shows the panorama of Good Future's data middle platform, which is mainly divided into three layers:

● The first layer is the data enablement layer
● The second layer is the global data layer
● The third layer is the data development layer

First, the data enablement layer. This layer mainly serves business intelligence and intelligent decision-making applications, including data tools, data capabilities, and thematic analysis systems. The data tools mainly include a buried-point (event tracking) data analysis tool, an A/B testing tool, and a large-screen tool. The data capabilities mainly include the Future Portrait service, Future Growth service, Future User service, and a new-campus site selection service. The thematic analysis systems mainly include enterprise management analysis and so on.

Second, the global data layer. We hope to integrate the data of all business divisions of the group and connect the user pools of different business lines and product lines, so as to activate the data of the whole group. The specific method is ID-Mapping: we mine the mapping relationships among device IDs, natural persons, and families, and associate the user data of different products. In this way we form a connected user pool covering the whole group, which makes it easier for us to empower the business with user data.

Finally, the data development layer. It carries all the data development projects of the group through a series of platforms, mainly covering data integration, data development, data quality, data services, data governance and other services. The real-time platform we are going to share today belongs to data development.

2. Good Future T-Streaming real-time platform

Requirements before building the real-time platform

At the beginning of the real-time platform construction, we sorted out four important requirements.

● The first requirement is a unified cluster: by providing multi-tenancy and resource isolation, it improves resource utilization and solves the problem of each business unit maintaining its own cluster.
● The second requirement is to lower the barrier to real-time data development through a platform approach, so as to reach more developers.
● The third requirement is to provide solutions for common scenarios and improve project reusability, avoiding each business unit developing its own analysis tools for the same scenario.
● The fourth requirement is full life-cycle management of jobs, including metadata and lineage, so that when a job fails we can quickly analyze and locate the scope of impact.

Real-time platform features overview

Now our platform has become a one-stop real-time data analysis platform, including data integration, data development, operation assurance, resource management, data security and other functions.

● In terms of data integration, we support the integration of database data, buried-point data, and server-side log data. To improve integration efficiency, we provide many common template jobs; users only need a quick configuration to complete data integration.
● In terms of data development, we support two ways of developing jobs: Flink SQL development and hosting of Flink JAR packages. For Flink SQL development, we have built in many UDFs, for example a UDF that implements dimension-table joins (see the sketch below); user-defined UDFs are also supported, with hot loading. In addition, we record the user's metadata during job development to facilitate building the lineage system.
● In terms of job assurance, we support job status monitoring, exception alarms, and automatic restart after a job fails; when restarting, we automatically select an available checkpoint version. We also support switching jobs between multiple clusters.
● In terms of resource management, the platform is multi-tenant. Each tenant is isolated by namespace, which isolates different business units, different users, and different versions of the Flink client, and also isolates computing resources.
● In terms of data security, we support role-based permission management, table-level permission management, audit logs of operations, and other functions.
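As an illustration of the dimension-table-join UDF mentioned above, here is a minimal sketch of what such a function might look like, assuming a MySQL dimension store and a Guava cache. The class name DimLookup, the JDBC URL, and the table/column names are hypothetical, not the platform's actual implementation.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.TimeUnit;

// Illustrative scalar UDF that joins a dimension value from MySQL,
// with a short-lived local cache so that not every record hits the database.
public class DimLookup extends ScalarFunction {

    private transient Cache<String, String> cache;
    private transient Connection conn;

    @Override
    public void open(FunctionContext context) throws Exception {
        cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build();
        // Connection settings are hypothetical.
        conn = DriverManager.getConnection("jdbc:mysql://dim-host:3306/dim", "user", "pwd");
    }

    // Usage in SQL: dim_lookup('dim_branch_school', CAST(user_id AS STRING))
    public String eval(String dimTable, String key) throws Exception {
        String cacheKey = dimTable + ":" + key;
        String cached = cache.getIfPresent(cacheKey);
        if (cached != null) {
            return cached;
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT dim_value FROM " + dimTable + " WHERE dim_key = ?")) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                String value = rs.next() ? rs.getString(1) : null;
                if (value != null) {
                    cache.put(cacheKey, value);
                }
                return value;
            }
        }
    }

    @Override
    public void close() throws Exception {
        if (conn != null) {
            conn.close();
        }
    }
}
```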

These are the functions of our platform. While enabling the business, we are also iterating rapidly, and we hope the platform stays simple, easy to use, stable and reliable.

Batch stream fusion for real-time platforms

Next, let's talk about some practices in platform construction. The first one is batch-stream fusion.

Let's first be clear about what batch-stream fusion is.

Batch-stream fusion can be understood as two concepts. One is the batch-stream fusion proposed by Flink: a single Flink SQL statement can act on both stream data and batch data, and keeping the computing engine consistent reduces the differences between the two result sets. This is batch-stream fusion at the technical level. The other concept is one we proposed internally: batch-stream fusion at the architecture level. Concretely, we use Flink jobs to keep the ODS layer of the data warehouse real-time, and then provide hour-level and minute-level scheduling on top of it, so as to improve the timeliness of data analysis.
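To make the engine-level idea concrete, here is a minimal sketch, assuming Flink's Table API: the same SQL text is executed once in streaming mode and once in batch mode, so the results differ only in how the `orders` table is registered (the DDL is omitted). The table and column names are illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BatchStreamSameSql {
    public static void main(String[] args) {
        // The SQL text is identical; only the execution mode of the engine differs.
        String sql = "SELECT class_id, COUNT(*) AS order_cnt FROM orders GROUP BY class_id";

        TableEnvironment streamEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        TableEnvironment batchEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // 'orders' would be registered as a Kafka table in the streaming environment
        // and as a Hive/file table in the batch environment (the DDL is omitted here).
        // streamEnv.executeSql(sql);
        // batchEnv.executeSql(sql);
    }
}
```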

We proposed batch-stream fusion at the architecture level because we see two major trends in the industry.

● The first trend is that data integration is becoming real-time and componentized. For example, the Flink-Hive integration and the continuously improving Flink CDC make data integration very simple.
● The second trend is the growing maturity of real-time OLAP engines, such as Kudu + Impala, Alibaba Cloud's Hologres, and lakehouse solutions.

These two trends make real-time data development easier and easier for users, letting them focus only on the SQL itself.

As you can see in the figure above, we have three types of real-time data warehouse: one based on Hive, one based on a real-time OLAP engine, and one based on Kafka. The blue line shows the concrete implementation of the real-time ODS layer. We provide a unified tool for writing data in real time to Hive, to real-time OLAP engines, and of course to Kafka. The tool is simple to use: for MySQL data synchronization, the user only needs to enter the database name and table name.

With the real-time ODS layer tool, we can build a real-time data warehouse on Hive, on a real-time OLAP engine, or on Kafka.

● For Hive, we use Flink to write real-time incremental data into the ODS layer, and then provide a scheduled merge script that merges the incremental data with the historical data, so that the ODS layer data stays up to date (a sketch of such a merge follows this list). With Airflow's hour-level scheduling, users get an hour-level data warehouse.
● For a real-time OLAP engine such as Kudu/Hologres, we first import the offline data from Hive into the real-time OLAP engine, and then use Flink to write the real-time incremental data into the ODS layer. Writing with upsert-like features is recommended, so that users get a purely real-time data warehouse. With Airflow's minute-level scheduling, users get a minute-level data warehouse.
● Building a real-time data warehouse on Kafka is the classic architecture, but its development cost is relatively high; unless the analysis scenario requires second-level updates, we do not recommend it. Of course, in 2021 we will also work on a Flink batch-stream unified solution, giving users more choices while making the whole real-time data warehouse simpler.
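The article does not spell out the merge script for the Hive line, but a common pattern is to union the previous full snapshot with the new increments and keep, per primary key, only the latest record. Below is a sketch of such a merge submitted through Flink's batch TableEnvironment, shown at day granularity for brevity (the platform schedules this hourly via Airflow); the table names, columns, partition values, and Hive catalog settings are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class OdsHourlyMerge {
    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Hive catalog; name, database and conf dir are hypothetical.
        env.registerCatalog("hive", new HiveCatalog("hive", "ods", "/etc/hive/conf"));
        env.useCatalog("hive");
        env.useDatabase("ods");

        // Merge the previous full snapshot with the new increments and keep,
        // for every primary key, only the most recent record.
        env.executeSql(
            "INSERT OVERWRITE ods_order_full PARTITION (dt = '2021-01-02') " +
            "SELECT order_id, user_id, amount, status, update_time FROM ( " +
            "  SELECT *, ROW_NUMBER() OVER ( " +
            "    PARTITION BY order_id ORDER BY update_time DESC) AS rn " +
            "  FROM ( " +
            "    SELECT order_id, user_id, amount, status, update_time " +
            "    FROM ods_order_full WHERE dt = '2021-01-01' " +
            "    UNION ALL " +
            "    SELECT order_id, user_id, amount, status, update_time " +
            "    FROM ods_order_incr WHERE dt = '2021-01-02' " +
            "  ) unioned " +
            ") deduped WHERE rn = 1");
    }
}
```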

The above is our thinking and practice on batch-stream fusion. With this kind of batch-stream fusion at the architecture level, real-time requirements that used to take a month to develop can now be completed in about two days. It greatly lowers the threshold for developing real-time data and improves the efficiency of data analysis.

Real-time platform: making the ODS layer real-time

Now let's talk about how we made the ODS layer real-time.

To make the ODS layer data real-time, we need to solve two problems: first, the initialization of offline data, and second, how to write the incremental data. Importing offline data is relatively easy: if the data source is MySQL, we can use DataX or Spark to import the full MySQL data into Hive. Writing real-time incremental data requires two steps: first, collect the MySQL binlog into Kafka; second, use a Flink job to import the Kafka data into Hive. So making the ODS layer real-time requires one offline initialization job, one incremental data collection job, and one incremental data write job, which means three jobs in total.
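As a rough sketch of the incremental write job described above, the following Flink program reads normalized binlog records from Kafka and streams them into a partitioned Hive table. It assumes Flink's Hive connector (1.11+) with partition commits driven by checkpoints; the topic, table, and connection settings are placeholders rather than the platform's actual configuration.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class BinlogKafkaToHive {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);  // Hive partition files are committed on checkpoints.
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hive catalog; name, database and conf dir are hypothetical.
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "ods", "/etc/hive/conf"));
        tEnv.useCatalog("hive");
        tEnv.useDatabase("ods");

        // Kafka topic carrying the normalized binlog records as JSON (names are hypothetical).
        tEnv.executeSql(
            "CREATE TEMPORARY TABLE kafka_order_incr ( " +
            "  order_id BIGINT, user_id BIGINT, amount DECIMAL(10,2), update_time TIMESTAMP(3) " +
            ") WITH ( 'connector' = 'kafka', 'topic' = 'ods_order_incr', " +
            "  'properties.bootstrap.servers' = 'kafka:9092', " +
            "  'properties.group.id' = 'ods-order-incr', " +
            "  'format' = 'json', 'scan.startup.mode' = 'latest-offset')");

        // Hour-partitioned Hive table for the increments.
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS ods_order_incr ( " +
            "  order_id BIGINT, user_id BIGINT, amount DECIMAL(10,2), update_time TIMESTAMP " +
            ") PARTITIONED BY (dt STRING, hr STRING) STORED AS ORC " +
            "TBLPROPERTIES ('sink.partition-commit.policy.kind' = 'metastore,success-file')");
        tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);

        // Continuously append the increments into the Hive partitions.
        tEnv.executeSql(
            "INSERT INTO ods_order_incr " +
            "SELECT order_id, user_id, amount, update_time, " +
            "       DATE_FORMAT(update_time, 'yyyy-MM-dd'), DATE_FORMAT(update_time, 'HH') " +
            "FROM kafka_order_incr");
    }
}
```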

On our platform, these three ODS-layer jobs are encapsulated and scheduled uniformly. Users only need to input a database name and table name to make the ODS layer for that table real-time.

The above is how the real-time ODS layer is implemented under batch-stream fusion.

Real-time platform Flink SQL development process

Another of our practices is the job encapsulation of Flink SQL. Let's take a look at the overall Flink SQL development flow on our platform.

From left to right, data in the data sources is collected into Kafka by tools such as Maxwell and Canal. The format of the raw data collected into Kafka is not uniform, so we first normalize the data in Kafka. By default we support the buried-point data format, the Canal format, and the Maxwell format, and users can also upload JAR packages for custom parsing. The normalized, parsed data is then written back to Kafka.
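A minimal sketch of what this normalization step might look like for Maxwell-formatted messages: read the raw topic, map each message into a unified envelope, and write it back to another Kafka topic. The topic names and the unified field layout are illustrative, not the platform's actual format.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class NormalizeBinlogJob {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "normalize-binlog");

        env.addSource(new FlinkKafkaConsumer<>("maxwell_raw", new SimpleStringSchema(), props))
           // Map each Maxwell message into a unified envelope {db, table, type, ts, data};
           // the envelope layout is illustrative.
           .map(raw -> {
               JsonNode in = MAPPER.readTree(raw);
               ObjectNode out = MAPPER.createObjectNode();
               out.put("db", in.path("database").asText());
               out.put("table", in.path("table").asText());
               out.put("type", in.path("type").asText());   // insert / update / delete
               out.put("ts", in.path("ts").asLong());
               out.set("data", in.path("data"));
               return MAPPER.writeValueAsString(out);
           })
           .addSink(new FlinkKafkaProducer<>("binlog_normalized", new SimpleStringSchema(), props));

        env.execute("normalize-binlog");
    }
}
```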

Then we use Flink SQL jobs to consume the Kafka data and develop SQL scripts. The SQL script development here differs a little from native Flink SQL development: with native Flink SQL, users have to write the Source and Sink information themselves; on our platform, users only need to write the specific SQL logic.

After writing the SQL, the user submits the SQL job information to our encapsulated Flink SQL execution job, which finally runs on the Flink cluster through our encapsulated SQL engine. How we encapsulate this will be described later.

That is the Flink SQL development flow on our platform. Beyond the development and submission of the Flink job itself, the platform also keeps the input and output schema information related to the job, such as the schema of the business database table, the schema after unified processing, and the schema of the output table. With these records, we can quickly sort out the context of a job and its scope of influence when troubleshooting later.

Real-time platform Flink SQL development steps

To develop Flink SQL jobs on our platform, there are only three steps:

● The first step is to check whether the Kafka topic has been registered. If not, the user needs to register the topic manually; after registration, we parse the topic's data and save its field information.
● The second step is for the user to write the SQL. As mentioned earlier, users only need to write the specific SQL logic and do not need to write the Source and Sink information.
● The third step is for the user to specify where the data should be output. The platform can now specify multiple sinks for the same job, for example writing the computed data to Hive, Hologres and other storage at the same time.

After the above three steps are configured, the user can submit the job.

Now let's talk about how we implemented this. I will break it down into two phases and ten steps: the first phase is job preparation, and the second phase is SQL execution.

■ Job preparation stage

● The first step: the user writes the SQL on the page and specifies the Sink information.
● The second step: SQL parsing and validation. When the user submits the SQL, we parse it and check whether the Source tables and UDFs used in the SQL have been registered on the platform.
● The third step: schema inference and table creation. We first run the user's SQL to obtain its result schema, generate CREATE TABLE statements from that schema, and then automatically create the table in the target Sink storage (see the sketch after this list).
● The fourth step: assemble the Flink SQL script file, producing a script that contains the three elements Source, SQL, and Sink.
● The fifth step: job submission, where the Flink SQL file is submitted to our own execution engine.
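For the third step, here is a minimal sketch of how the result schema of the user's SQL can be turned into a CREATE TABLE statement for the sink, assuming the referenced source tables are already registered. The type rendering and the connector options are simplified placeholders, not the platform's actual generator.

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.types.DataType;

public class SinkDdlGenerator {

    // Run the user's SQL through the Table API (the referenced source tables must
    // already be registered), read its result schema, and render a CREATE TABLE
    // statement for the target sink. Connector options are left as a placeholder.
    public static String generateSinkDdl(TableEnvironment tEnv, String userSql, String sinkTable) {
        Table result = tEnv.sqlQuery(userSql);
        TableSchema schema = result.getSchema();

        String[] names = schema.getFieldNames();
        DataType[] types = schema.getFieldDataTypes();

        StringBuilder columns = new StringBuilder();
        for (int i = 0; i < names.length; i++) {
            if (i > 0) {
                columns.append(", ");
            }
            columns.append('`').append(names[i]).append("` ").append(types[i].toString());
        }
        return "CREATE TABLE " + sinkTable + " (" + columns + ") WITH (/* sink connector options */)";
    }
}
```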

■ SQL execution phase

● The first step is to initialize the StreamTable API and register the Kafka Source through connect. The Kafka Source registration needs to specify the data parsing rules and the field schema, which are generated automatically from the metadata.
● The second step is to use the StreamTable API to register the dimension tables and the UDFs used in the SQL, including UDFs uploaded by users themselves.
● The third step is to use the StreamTable API to execute the SQL statements, and the views if there are any.
● The fourth step, which is the key one, is to convert the StreamTable API result into a DataStream.
● The fifth step is to addSink on the DataStream according to the Sink information.

These two phases are how the user's SQL job actually runs.
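Putting the five execution steps together, here is a compact, hand-written sketch of what such an encapsulated execution engine might do with a user's SQL. The description above uses connect(); plain DDL is used here for brevity, and the Kafka settings, the UDF registration (reusing the illustrative DimLookup from earlier, assumed to be on the classpath), the sample query, and the print sink are all stand-ins for what the platform generates from metadata.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SqlExecutionEngine {
    public static void main(String[] args) throws Exception {
        // Step 1: initialize the Stream Table API.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register the Kafka source; names and options are illustrative.
        tEnv.executeSql(
            "CREATE TABLE ods_order ( order_id BIGINT, user_id BIGINT, amount DECIMAL(10,2) ) " +
            "WITH ('connector' = 'kafka', 'topic' = 'ods_order', " +
            "      'properties.bootstrap.servers' = 'kafka:9092', " +
            "      'properties.group.id' = 'sql-engine', " +
            "      'format' = 'json', 'scan.startup.mode' = 'latest-offset')");

        // Step 2: register dimension tables and UDFs (including user-uploaded ones).
        tEnv.createTemporarySystemFunction("dim_lookup", DimLookup.class);

        // Step 3: execute the user's SQL (views, if any, would be registered first).
        Table result = tEnv.sqlQuery(
            "SELECT user_id, SUM(amount) AS pay_amount FROM ods_order GROUP BY user_id");

        // Step 4: convert the Table back into a DataStream (a retract stream,
        // since the query is an aggregation).
        DataStream<Tuple2<Boolean, Row>> stream = tEnv.toRetractStream(result, Row.class);

        // Step 5: add the sink(s) chosen by the user; a print sink stands in here.
        stream.print();

        env.execute("user-sql-job");
    }
}
```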

Real-time platform native jobs and template jobs

Having shared how our Flink SQL jobs are developed and run, let's talk about the platform's support for JAR-package jobs.

On our platform, users can upload JAR packages and manage them on the platform. At the same time, to improve reuse for common scenarios, we developed many template jobs, for example templates that use Maxwell to write binlogs directly into Hive, Kudu, Hologres and other storage, and templates that write Alibaba Cloud SLS logs into various OLAP engines.

Real-time platform hybrid cloud deployment solution

Now let's talk about the hybrid cloud deployment solution and the platform's technical architecture.

Our platform now supports submitting jobs to both the Alibaba Cloud machine room and our self-built machine room, and jobs can be switched back and forth between the two. Why do we need this capability?

Earlier this year, with the outbreak of the epidemic, online education received a huge surge of traffic. To cope with this surge during the Spring Festival, we purchased thousands of machines for emergency deployment and rollout. Later, as the epidemic stabilized, the utilization of these machines dropped. To solve this problem, the platform supports a hybrid cloud deployment plan: during peak periods jobs can be transferred to Alibaba Cloud, while normally they run on our own clusters, which both saves resources and ensures elastic capacity expansion.

Real-time platform technology architecture

Next, the technical architecture of the platform.

The project has separate front and back ends: the front end uses Vue + ElementUI and the back end uses Spring Boot. We deploy an instance of the back-end service in each machine room, and tasks are routed to the right machine room through an Nginx + Lua forwarding layer. Jobs are submitted, paused, and taken offline through the driver layer, which consists of shell scripts. Finally, on the client side, we have isolated namespace / user / Flink version.

3. Typical analysis scenarios of K12 education

Renewal business introduction

Let's talk about a specific case, a typical analysis scenario in the K12 education industry: the renewal reporting business. First of all, what is renewal? Renewal is repeat purchase: a user buys a one-year course, and we hope the user will then buy the next year's course. Course renewals are concentrated in fixed periods, each lasting about a week, four times a year.

Because each renewal period is concentrated and relatively short, teachers have an urgent demand for real-time renewal data every time a renewal campaign runs.

To this end, we built a universal renewal solution to support the renewal campaigns of every business unit. Making renewal data real-time involves several challenges.

● The first challenge: whether an order counts as a renewal depends on all of the user's historical orders, so historical data must participate in the calculation.
● The second challenge: a change in one order can ripple into other orders. For example, a user has five orders and orders 3, 4 and 5 are in the renewed state; if the user cancels order 3, the renewal state of orders 4 and 5 must be recalculated.
● The third challenge: dimensions change frequently. For example, a user's branch school may be Beijing in the morning and Shanghai in the afternoon, and the tutor may be Zhang San in the morning and Li Si in the afternoon.

The dependence on historical data, the ripple effect of order changes, and the frequently changing dimensions may not seem like much in isolation, but combined they make the problem much trickier.

Real-time renewal solution

First, the overall architecture. We use the batch-stream fusion approach, split into two lines: one line computes minute-level real-time renewal data, and the other computes second-level real-time renewal data. The computed data is stored in MySQL to serve large screens and BI dashboards.

Looking at the blue line: we import the offline data from Hive into Kudu. The offline data is the pre-computed order wide table. A Flink job then turns new orders into wide-table rows and writes them into Kudu, so Kudu always has the latest and most complete data. With a 4-minute schedule, we provide minute-level real-time renewal data.

Now look at the first orange line. There are two Flink jobs on this line: an ETL job and an Update job.

The ETL job is responsible for joining static dimensions and computing the renewal state. Static dimensions are read directly from MySQL and cached in the JVM. Computing the renewal state relies on historical data, so the ETL job loads all the order data into the JVM: we customize a partitionCustom to shard all the historical data, and each downstream task caches one shard. By loading the data into memory, we greatly speed up Flink's real-time calculation.
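A minimal sketch of the partitionCustom idea described above: route all orders of a user to the same subtask, which keeps that user's order history in JVM memory and derives the renewal state from it. In the real job the history is bulk-loaded first and the renewal rule is business-specific; here the history is only accumulated from the stream, the rule is simplified, and the Order fields are illustrative.

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative order record; the real wide table has many more fields.
class Order {
    long userId;
    long orderId;
    String status;
    boolean renewed;
}

public class RenewalEtl {

    // Route orders so that all orders of the same user land on the same subtask,
    // which then keeps that user's order history in JVM memory.
    public static DataStream<Order> withRenewalState(DataStream<Order> orders) {
        return orders
            .partitionCustom(
                // Simple modulo sharding; assumes non-negative user ids.
                (Partitioner<Long>) (key, numPartitions) -> (int) (key % numPartitions),
                order -> order.userId)
            .flatMap(new RichFlatMapFunction<Order, Order>() {
                // One shard of the historical orders, keyed by user id.
                private final Map<Long, List<Order>> history = new HashMap<>();

                @Override
                public void flatMap(Order order, Collector<Order> out) {
                    List<Order> userOrders =
                        history.computeIfAbsent(order.userId, k -> new ArrayList<>());
                    // Simplified rule: an order counts as renewed if an earlier order
                    // already exists; the real rule is business-specific.
                    order.renewed = !userOrders.isEmpty();
                    userOrders.add(order);
                    out.collect(order);
                }
            });
    }
}
```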

The data computed by the ETL job has two outputs: one goes to Kudu, to keep the data in Kudu the latest and most complete; the other goes to Kafka, where a topic records which orders or dimensions are affected by the change of the current order.

The Update job behind Kafka handles the affected orders or dimensions and directly modifies the statistics in MySQL.
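A rough sketch of what such an Update job could look like, assuming each message on the affected-orders topic carries an order id and using Flink's JDBC sink to touch the MySQL statistics table. The topic name, table, SQL statement, and connection settings are placeholders rather than the actual implementation, and the real job recomputes the affected statistics instead of merely flagging them.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class RenewalUpdateJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "renewal-update");

        // Each message on the "affected" topic carries the id of an order whose
        // renewal statistics need to be refreshed (message layout is illustrative).
        env.addSource(new FlinkKafkaConsumer<>("renewal_affected", new SimpleStringSchema(), props))
           .addSink(JdbcSink.sink(
               // Placeholder statement touching the MySQL statistics table.
               "UPDATE renewal_stats SET renewed = 0 WHERE order_id = ?",
               (statement, orderId) -> statement.setLong(1, Long.parseLong(orderId)),
               JdbcExecutionOptions.builder().withBatchSize(100).withBatchIntervalMs(1000).build(),
               new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                   .withUrl("jdbc:mysql://mysql:3306/renewal")
                   .withDriverName("com.mysql.cj.jdbc.Driver")
                   .withUsername("user").withPassword("pwd")
                   .build()));

        env.execute("renewal-update-job");
    }
}
```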

In this way, we implement the real-time renewal calculation with two Flink jobs.

The bottom line handles real-time dimension changes. Dimension change data is sent to Kafka and processed by Flink to determine which statistics are affected by the change. The affected orders are then sent to the affected-orders topic to be recalculated by the Update job.

The above is our overall real-time renewal solution; friends in the education industry hearing this share may find it a useful reference.

Real-time renewal stability assurance

Let’s take a look at what guarantees this generic solution has once it goes live.

● The first guarantee is cross-site active-active deployment. We deploy a full set of the renewal programs in both Alibaba Cloud and our self-built machine room; if one of them fails, we can switch over at the front-end entry. Even if the programs in both rooms go down, resetting the program only takes about 10 minutes.
● The second guarantee is fault tolerance. The two Flink jobs can be stopped and restarted at will without affecting data accuracy. Also, since we cache all order data in the JVM, if the data volume explodes we only need to increase the parallelism of the ETL job without worrying about running out of JVM memory.
● The third guarantee is job monitoring. We support exception alarms on jobs, automatic restart after failure, and alarms for consumption lag.

With the above safeguards, the real-time renewal program has been quite stable through several renewal periods, which is very reassuring.

4. Outlook and planning

The above has introduced Good Future's current business and technical solutions in detail. To sum up: we achieve resource isolation for each business unit through multi-tenancy; we address real-time analysis with the architecture-level batch-stream fusion solution; we solve data integration from the data sources to the OLAP engines by making the ODS layer real-time; we lower the threshold of real-time data development with the Flink SQL encapsulation; we provide solutions for common scenarios through template jobs; we provide elastic resource expansion through the hybrid cloud deployment solution; and we cover data analysis for the same scenario across business units through the real-time renewal solution.

Finally, let's look at our vision and planning. Next, we will continue to deepen batch-stream fusion, enhance the hybrid cloud deployment, and improve the timeliness and stability of data analysis. We will also support a real-time algorithm platform and real-time data applications to improve the timeliness of data-driven decisions.