Guest: Zhang Yang, Senior Development Engineer at Station B (Bilibili)

The machine learning pipeline, from data reporting to feature computation, model training, online deployment, and final effect evaluation, is very long. At Station B, multiple teams used to build their own machine learning pipelines to meet their own requirements, which made engineering efficiency and data quality hard to guarantee. We therefore built a standardized machine learning workflow platform on top of the Flink community's AIFlow project, to speed up the construction of machine learning pipelines and improve data timeliness and accuracy across many scenarios. This share introduces how Ultron, Station B's machine learning workflow platform, is applied in several of Station B's machine learning scenarios.

Outline:

1. Real-time machine learning

2. The use of Flink for machine learning at Station B

3. Machine learning workflow platform construction

4. Future planning

1. Real-time machine learning

First, let me talk about real-time machine learning, which is mainly divided into three parts:

  • The first is real-time samples. In traditional machine learning, samples are all T+1: today's model uses yesterday's data, and every morning the model is trained on the full previous day's data.
  • The second is real-time features. Features used to be basically T+1 as well, which led to stale recommendations. For example, even if I watched many new videos today, I would still be recommended content based on what I watched yesterday or earlier.
  • The third is real-time model training. Once we have real-time samples and real-time features, model training can also move online and run in real time, which makes the recommendation effect much more timely.

Traditional offline pipeline

The figure above shows a traditional offline pipeline. First, logs are generated by the app or by the server and land on HDFS through the data pipeline. Then, every day at T+1, feature generation and model training are performed, and the resulting model is pushed to the online inference service at the top.

Shortcomings of the traditional offline pipeline

So what’s wrong with it?

  • First, with T+1 data, both features and models have low timeliness, and it is difficult to update them frequently.
  • Second, model training and feature production both consume a full day's data at a time, so each run takes a very long time and demands a lot of computing power from the cluster.

Real-time pipeline

The figure above shows the pipeline after our real-time optimization; the parts crossed out in red have been removed. After the data is reported, it goes through the pipeline directly into real-time Kafka, followed by real-time feature generation and real-time sample generation. The feature results are written to the feature store, and sample generation also reads some features from the feature store.

After samples are generated, we train in real time directly. The whole long pipeline on the right has been removed, but the offline feature part is kept, because some special features still need offline computation, for example features that are too complex to compute in real time or that have no real-time requirement.

2. The use of Flink for machine learning at Station B

Let's talk about how we do real-time samples, real-time features, and real-time effect evaluation.

  • The first is real-time samples. Flink currently manages the sample data production process for all of Station B's recommendation businesses.
  • The second is real-time features. Quite a few features are now computed in real time by Flink, with very good timeliness. Many features combine offline and real-time computation: historical data is computed offline, recent data is computed in real time by Flink, and the two are spliced together when the feature is read.

    However, the two sets of computing logic often cannot be reused, so we are working toward unified batch and stream processing with Flink: all features are defined once in Flink and, depending on business needs, computed either in real time or offline, with Flink as the single underlying engine. A minimal sketch of this idea follows after this list.

  • The third is real-time effect evaluation. We use Flink plus an OLAP engine to connect real-time computation with real-time analysis and carry out the final evaluation of model effectiveness.
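To make the unified batch and stream idea concrete, here is a minimal PyFlink sketch of a feature defined once and executed in either streaming or batch mode. The table and column names (user_plays, play_cnt) are assumptions for illustration, not our production definitions.

```python
# Minimal sketch: one feature definition, two execution modes.
# Table/column names are hypothetical; source DDL is omitted.
from pyflink.table import EnvironmentSettings, TableEnvironment

def build_env(streaming: bool) -> TableEnvironment:
    # The same SQL runs on either runtime; only the mode differs.
    settings = (EnvironmentSettings.in_streaming_mode()
                if streaming else EnvironmentSettings.in_batch_mode())
    return TableEnvironment.create(settings)

FEATURE_SQL = """
    SELECT user_id, COUNT(*) AS play_cnt
    FROM user_plays            -- Kafka source in streaming mode, Hive/HDFS in batch mode
    GROUP BY user_id
"""

t_env = build_env(streaming=True)   # flip to False for an offline run
# In a real job, user_plays would first be registered with CREATE TABLE and the
# appropriate connector (kafka / filesystem), then:
# t_env.execute_sql(FEATURE_SQL)
```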

Real-time sample generation

The figure above shows how real-time samples are currently generated for the whole recommendation pipeline. After log data lands in Kafka, a Flink job first performs a Label-Join that joins clicks and impressions together. The result is written back to Kafka, and another Flink job then performs a feature join, joining multiple features: some are public-domain features, others are private-domain features of the business side, and they come from both offline and real-time sources. Once all features are joined, an instance (sample) record is produced and sent to Kafka for the downstream training model.
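As a rough sketch of what such a Label-Join can look like in Flink SQL (the topic and field names, such as impressions, clicks, and request_id, are made up for illustration; the real job also handles deduplication and late labels):

```python
# Hypothetical PyFlink sketch of a Label-Join: each impression is joined with a
# click that arrives within a bounded time window, producing a labeled sample.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source/sink DDL omitted; both tables are assumed to be Kafka-backed with
# event-time attributes imp_time / click_time.
LABEL_JOIN_SQL = """
    SELECT
        i.request_id,
        i.user_id,
        i.item_id,
        CASE WHEN c.request_id IS NULL THEN 0 ELSE 1 END AS label
    FROM impressions AS i
    LEFT JOIN clicks AS c
      ON i.request_id = c.request_id
     AND c.click_time BETWEEN i.imp_time AND i.imp_time + INTERVAL '20' MINUTE
"""
# t_env.execute_sql("INSERT INTO labeled_samples " + LABEL_JOIN_SQL)
```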

Real-time feature generation

The figure above shows real-time feature generation, using a relatively complex feature as an example. The whole computation involves five jobs: the first is an offline job, followed by four Flink jobs. The feature produced by this series of computations lands in Kafka and is then written into the feature store, where it is used for online prediction or real-time training.
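As a much-simplified illustration of a single real-time feature (names like user_plays and play_duration are made up; the real features involve the multi-job pipeline described above), a sliding-window aggregate written to the feature store might look like this in Flink SQL:

```python
# Hypothetical sketch: per-user total play duration over the last hour,
# refreshed every 5 minutes and written to a feature-store sink table
# (e.g. a Redis or HBase connector). DDL for both tables is omitted.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

FEATURE_SQL = """
    INSERT INTO feature_store_sink
    SELECT
        user_id,
        HOP_END(event_time, INTERVAL '5' MINUTE, INTERVAL '1' HOUR) AS feature_time,
        SUM(play_duration) AS play_duration_1h
    FROM user_plays
    GROUP BY user_id,
             HOP(event_time, INTERVAL '5' MINUTE, INTERVAL '1' HOUR)
"""
# t_env.execute_sql(FEATURE_SQL)
```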

Real-time effect evaluation

The figure above shows real-time effect evaluation. A core metric for recommendation algorithms is CTR (click-through rate). Once the Label-Join is done, CTR can be computed, and after hooking the results up to the reporting system we can see the effect in near real time. The data itself carries experiment tags, so in ClickHouse we can break the results down by experiment and see the effect of each one.
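For illustration only, a query along the following lines is roughly what the per-experiment CTR breakdown looks like on the ClickHouse side; the table and column names (labeled_samples, exp_tag, label) are assumptions, not our actual schema.

```python
# Hypothetical sketch of the real-time CTR breakdown by experiment tag.
# Assumes labeled samples land in a ClickHouse table `labeled_samples`
# with columns exp_tag, label and event_time; names are made up.
from clickhouse_driver import Client

CTR_SQL = """
    SELECT
        exp_tag,
        toStartOfMinute(event_time) AS minute,
        sum(label) / count()        AS ctr
    FROM labeled_samples
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY exp_tag, minute
    ORDER BY exp_tag, minute
"""

client = Client(host="clickhouse-host")   # placeholder host
for exp_tag, minute, ctr in client.execute(CTR_SQL):
    print(exp_tag, minute, round(ctr, 4))
```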

3. Machine learning workflow platform construction

Pain points

  • The whole machine learning pipeline includes sample generation, feature generation, training, prediction, and effect evaluation. Each part requires configuring and developing many jobs, so bringing a single model online ends up spanning many jobs, and the pipeline is very long.
  • It is hard for new algorithm engineers to grasp the whole picture of such a complex pipeline, and the learning cost is extremely high.
  • A change anywhere in the pipeline can ripple through the whole thing, so it breaks easily.
  • The computing layer uses multiple engines with a mix of batch and streaming, so it is hard to keep semantics consistent; the same logic has to be developed twice, and keeping the two implementations free of gaps is also very difficult.
  • The barrier to going real-time is also relatively high: it requires strong real-time and offline engineering capabilities, and many small business teams cannot manage it without platform support.

The figure above shows the general process of taking a model from data preparation to training, which involves seven or eight nodes. Can we complete all of these operations on one platform? And why Flink? Because our team's real-time computing platform is built on Flink, and we also see Flink's potential for unified batch and stream processing, as well as a future path toward real-time model training and deployment.

Introducing AIFlow

AIFlow is a machine learning workflow platform open-sourced by Alibaba's Flink ecosystem team, focused on standardizing the process and the entire machine learning pipeline. Around last August and September we got in touch with them, introduced the system, co-developed it, and started rolling it out at Station B. As shown in the figure, it abstracts machine learning into the stages example, transform, train, validation, and inference. In its architecture, the core scheduling capability supports mixed stream and batch dependencies, and the metadata layer supports model management, which makes iterative model updates very convenient. We built our machine learning workflow platform on top of it.

Platform features

Next, let’s talk about platform features:

  • The first is defining workflows in Python. In the AI field people mostly use Python, and we also looked at external references; Netflix, for example, also uses Python to define this kind of machine learning workflow.
  • The second is support for mixed dependencies between batch and streaming jobs. All the real-time and offline processes in a full pipeline can be added to the workflow, and streaming and batch jobs can depend on each other via signals.
  • The third is support for one-click cloning of an entire experiment. From the original logs all the way to spinning up training, we want to be able to bring up a brand-new experimental pipeline quickly by cloning the whole pipeline with one click.
  • The fourth is some performance optimizations, including support for resource sharing.
  • The fifth is support for feature backfill with unified batch and stream processing. Cold-starting many features requires computing over a long history of data. Writing a separate set of offline feature computation logic just for the cold start is very expensive, and it is hard to keep its results aligned with the real-time feature computation. So we support backfilling offline data directly through the real-time pipeline.

Basic architecture

The figure above shows the basic architecture, with the business layer at the top and the engine layer at the bottom. Many engines are supported, both compute and storage: Flink, Spark, Hive, Kafka, HBase, and Redis. The whole workflow platform is designed with AIFlow as the workflow management layer in the middle and Flink as the core computing engine.

Workflow Description

The entire workflow is described in Python: the user only needs to define compute nodes and resource nodes, plus the dependencies between them, similar to the scheduling framework Airflow.
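To give a feel for this style of definition, here is a purely illustrative sketch; Workflow and Node below are hypothetical helpers, not the actual AIFlow or Ultron API.

```python
# Purely illustrative sketch of a Python-defined workflow; Workflow/Node are
# hypothetical stand-ins, not the real AIFlow/Ultron classes.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                              # "flink_stream", "flink_batch", "resource", ...
    upstream: list = field(default_factory=list)

@dataclass
class Workflow:
    name: str
    nodes: list = field(default_factory=list)

    def node(self, name, kind, upstream=()):
        n = Node(name, kind, list(upstream))
        self.nodes.append(n)
        return n

# Compute nodes and their dependencies, mirroring the sample/feature pipeline above.
wf = Workflow("recall_model_v1")
label_join   = wf.node("label_join",   "flink_stream")
feature_join = wf.node("feature_join", "flink_stream", upstream=[label_join])
daily_stats  = wf.node("daily_stats",  "flink_batch",  upstream=[feature_join])
training     = wf.node("online_train", "flink_stream", upstream=[feature_join, daily_stats])
```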

Dependency definition

There are four main types of dependencies between streaming and batch jobs: stream-to-batch, stream-to-stream, batch-to-stream, and batch-to-batch. These basically cover all of our current business needs.
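As a toy illustration of the signal idea behind, say, a batch-to-stream dependency (hypothetical, not the platform's real mechanism): the batch job emits an event when its data is ready, and the streaming side waits for that event before advancing.

```python
# Toy sketch of a batch-to-stream dependency expressed as an event signal.
import queue
import threading
import time

events = queue.Queue()

def batch_feature_job():
    time.sleep(1)                                # stand-in for the real batch computation
    events.put(("features_ready", "dt=2021-06-01"))

def stream_training_job():
    kind, partition = events.get()               # block until the upstream signal arrives
    print(f"received {kind} for {partition}, resuming online training")

threading.Thread(target=batch_feature_job).start()
stream_training_job()
```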

Resource sharing

Resource sharing is mainly a performance optimization, because a machine learning pipeline is often very long. In a graph like the one just shown, I may frequently change only five or six nodes. When I bring up a new experiment by cloning the whole graph, only the middle or downstream nodes actually need to change, so the upstream nodes can share their data.

In the technical implementation, after cloning, the new workflow keeps tracking the status of the shared upstream nodes instead of re-running them.
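One way to picture this, as a rough sketch under our own assumptions rather than the actual implementation: each node gets a fingerprint derived from its configuration and its upstream fingerprints, and a cloned workflow reuses any node whose fingerprint already exists in a finished state.

```python
# Illustrative sketch of fingerprint-based reuse of shared upstream nodes.
# `finished_nodes` stands in for the platform's metadata store; not the real code.
import hashlib
import json

finished_nodes = {}   # fingerprint -> output location of an already-run node

def fingerprint(config: dict, upstream_fps: list) -> str:
    payload = json.dumps({"config": config, "upstream": sorted(upstream_fps)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def resolve(node_config: dict, upstream_fps: list) -> str:
    fp = fingerprint(node_config, upstream_fps)
    if fp in finished_nodes:
        # Shared node: track its status and reuse its output instead of re-running.
        return finished_nodes[fp]
    output = f"run://{fp[:12]}"          # placeholder for actually running the node
    finished_nodes[fp] = output
    return output
```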

Real-time training

The figure above shows the real-time training process. Feature crossing is a very common problem; it occurs when the processing progress of multiple computing jobs is inconsistent. On the workflow platform we can define dependencies between nodes, and once nodes depend on each other their processing progress is synchronized, so that, roughly speaking, the fast waits for the slow and feature crossing is avoided. In Flink we use watermarks to measure processing progress.
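Conceptually, the synchronization works like the following simplified sketch (plain Python for illustration; in the real jobs this is done with Flink watermarks and state): samples are held back until the feature pipeline's watermark has passed their event time.

```python
# Simplified sketch: samples are buffered until the feature pipeline's watermark
# has caught up, so no sample is joined against features "from the future".
import heapq
from itertools import count

class SampleBuffer:
    def __init__(self):
        self._heap = []              # (event_time, seq, sample), ordered by time
        self._seq = count()          # tie-breaker so samples themselves are never compared

    def add(self, event_time, sample):
        heapq.heappush(self._heap, (event_time, next(self._seq), sample))

    def release(self, feature_watermark):
        """Emit every buffered sample whose event time <= the feature watermark."""
        ready = []
        while self._heap and self._heap[0][0] <= feature_watermark:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

buf = SampleBuffer()
buf.add(100, {"user": "u1"})
buf.add(180, {"user": "u2"})
print(buf.release(feature_watermark=120))   # only the t=100 sample is released
```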

Feature backfill

The figure above shows the feature backfill process. We run the historical data directly through the real-time pipeline. Offline data and real-time data are different, and there are many issues to address, which is why this part currently uses Spark; we will switch it to Flink later.

Problems with feature backfill

There are several big problems with feature backfill:

  • The first is how to preserve data ordering. The implicit semantics of real-time data is that records arrive roughly in order and are processed as soon as they are produced, so there is a certain sequential nature to it. Offline data on HDFS is not like that: HDFS data is partitioned, and the data inside a partition is completely unordered, while much of the actual business computation depends on time ordering. Handling the out-of-order offline data is a big problem.
  • The second is how to keep feature versions and sample versions consistent. For example, if there are two pipelines, one producing features and one producing samples, and sample production depends on feature production, how do we keep their versions consistent and avoid crossing?
  • The third is how to keep the computation logic of the real-time pipeline and the backfill pipeline consistent. This one we do not actually need to worry about, because we backfill the offline data directly through the real-time pipeline.
  • The fourth is performance: how to compute a large amount of historical data quickly.

The solution

Here are the solutions to the first and second questions:

  • For the first question: to preserve data ordering, we "Kafka-ize" the offline data on HDFS. We do not actually pour it into Kafka; instead we simulate Kafka's data layout, which is partitioned and ordered within each partition. We process the HDFS data into a similar structure with logical partitions that are internally ordered, and we extended Flink's HDFS source to read this simulated layout. The simulation is currently done with Spark and will be switched to Flink later; a minimal sketch of the layout is given below.
  • The second question has two parts:

    • For the real-time feature part, the solution relies on HBase, which supports queries by version. After a feature is computed, it is written into HBase under its version. When a sample is generated, we simply look up the corresponding version in HBase; the version is usually the data time.
    • For the offline feature part, there is no need to recompute: the data already exists in offline storage on HDFS. However, HDFS does not support point lookups, so we process this data into a KV form, and for performance we do asynchronous preloading.

The process of asynchronous preloading is shown in the figure.
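Returning to the first question, here is a minimal PySpark sketch of the logical-partition layout; the paths and column names (user_id, event_time) are assumptions for illustration, not our production job.

```python
# Minimal PySpark sketch: reshape HDFS data into Kafka-like logical partitions,
# hash-partitioned by user_id and time-ordered within each partition.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

NUM_LOGICAL_PARTITIONS = 128   # mirrors the Kafka topic's partition count

spark = SparkSession.builder.appName("feature-backfill-prep").getOrCreate()

df = spark.read.parquet("hdfs:///warehouse/ods/user_event/dt=2021-06-01")

simulated = (
    df.withColumn("logical_partition",
                  F.abs(F.hash("user_id")) % NUM_LOGICAL_PARTITIONS)
      .repartition(NUM_LOGICAL_PARTITIONS, "logical_partition")  # one task per "partition"
      .sortWithinPartitions("event_time")                        # ordered within each partition
)

# The customized Flink HDFS source then reads each logical partition in order,
# just like consuming an ordered Kafka partition.
simulated.write.mode("overwrite").partitionBy("logical_partition") \
         .parquet("hdfs:///warehouse/backfill/user_event_kafka_like/dt=2021-06-01")
```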

4. Future planning

Next, I will introduce our future planning.

  • The first is data quality assurance. The pipeline is getting longer and longer, perhaps 10 or 20 nodes, so when something goes wrong we need to locate the problem quickly. Here we want to add DQC (data quality checks) to sets of nodes: for each node we can define custom data quality rules, and the data is also routed to a unified DQC center that runs the rules and raises alerts.

  • The second is end-to-end exactly-once semantics for the whole pipeline. We have not yet worked out how to guarantee exact consistency across workflow nodes.

  • The third is adding model training and deployment nodes to the workflow. Training and deployment can connect to other platforms, or use model training and model serving capabilities supported by Flink itself.

Guest introduction: Zhang Yang joined Station B in 2017 and works on big data.