Abstract: This article is compiled from a talk delivered at Flink Forward Asia 2021 by Lin Jia, technical director of the real-time billing platform and SDK at the NetEase Interactive Entertainment Technology Center. It is divided into three parts:

  1. Starting with an in-app purchase and payment
  2. Two-track development of the real-time SDK and platform
  3. Moving towards real-time full association


When it comes to NetEase Interactive Entertainment, games naturally come to mind first. As one of NetEase's core business lines, the stable and reliable operation of the game business is a top priority, and within it the reliability of the in-app purchase service matters most. This article therefore starts from a single in-app purchase.

1. Starting with an in-app purchase and payment

When a player purchases an item in the game, the client first communicates with the channel provider and the billing center to complete the order and payment. The billing center also interacts with the channel provider to verify the validity of the client's order and the payment status; the game server is notified to ship the item only if the order is legitimate. Every participant in this process generates logs, monitoring data points, and so on, whose sources, data structures, and time pacing can differ greatly. Communication networks, databases, and monitoring systems are also involved, which makes the whole process very complicated.

The continuous, massive generation of data, the session associations between records, the heterogeneity of data sources and data structures, and the inconsistent time pacing are all reasons why we chose real-time processing.

Before 2017, our processing methods were relatively outdated: network drives, Rsync, T+1 offline tasks, and the like.

A sprawl of components, a fragmented technology stack, and coarse-grained resource usage meant that resources could not be used evenly, which was one cause of the poor timeliness, and it kept the efficiency of our code, data, and resources relatively low.

The figure above shows resource usage while our offline computing business was running, calculating the previous day's data reports in the early morning. Before stream computing became common, this was a typical pattern at scale: a large number of machines running Spark offline tasks in the small hours to compute the previous day's results. For reports to be delivered on time, the offline cluster needed a great deal of computing power, stacking up machine resources that sat idle much of the rest of the day, which was very inefficient.

If this computation could be done in real time instead, the required computing power would be spread across every time slice, avoiding the heavy skew in resource usage during the small hours. Hosted on a resource management platform, that capacity could also serve other businesses, improving overall efficiency.

So how should a real-time framework be chosen? After in-depth research and experimentation, we finally chose Flink, whose features fit our scenario well. The following figure illustrates some of our considerations in selecting the technical architecture.

2. Two-track development of the real-time SDK and platform

Since 2018, NetEase Interactive Entertainment has pursued a two-track development plan around JFlink to comprehensively push its data center toward real-time processing.

After several iterations, we arrived at a one-stop operations and maintenance platform plus an SDK supporting configuration-based development, completing the journey from "available" to "practical"; the next step is to make it something users love to use.

How to improve the efficiency of both manpower and code has been our focus since we began designing JFlink. We want to maximize output with limited manpower, so making the SDK configurable and modular is particularly important, allowing every real-time job to be described by a set of configuration semantics.

Commonly used connector handlers and data-flow objects are encapsulated in the JFlink SDK so that they can be assembled and used in a configurable form. The SDK also provides a unified configuration grammar: after a job is described as configuration, the SDK dynamically organizes and constructs the Flink DAG, so that a single SDK package covers all kinds of data services and improves code reuse and efficiency.
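To make this concrete, here is a minimal, hypothetical sketch of configuration-driven DAG assembly using Flink's Java API. The stage types, the `StageConf` shape, and the `buildStage` dispatch are inventions for illustration; JFlink's actual configuration grammar and classes are internal:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.List;
import java.util.Map;

public class ConfigDrivenJob {

    /** One node of the job description, e.g. parsed from a config file. */
    static final class StageConf {
        final String type;
        final Map<String, String> props;
        StageConf(String type, Map<String, String> props) {
            this.type = type;
            this.props = props;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real SDK this list would come from the unified configuration
        // grammar; it is hard-coded here purely for illustration.
        List<StageConf> stages = List.of(
                new StageConf("socket-source", Map.of("host", "localhost", "port", "9999")),
                new StageConf("upper-udf", Map.of()),
                new StageConf("print-sink", Map.of()));

        DataStream<String> stream = null;
        for (StageConf stage : stages) {
            stream = buildStage(env, stream, stage); // grow the DAG stage by stage
        }
        env.execute("config-driven-job");
    }

    /** Maps each config "type" onto a pre-packaged connector or UDF wrapper. */
    static DataStream<String> buildStage(StreamExecutionEnvironment env,
                                         DataStream<String> in, StageConf s) {
        switch (s.type) {
            case "socket-source":
                return env.socketTextStream(s.props.get("host"),
                        Integer.parseInt(s.props.get("port")));
            case "upper-udf":
                return in.map(String::toUpperCase);
            case "print-sink":
                in.print();
                return in;
            default:
                throw new IllegalArgumentException("unknown stage: " + s.type);
        }
    }
}
```

The real SDK would resolve stage types to its packaged connectors (Kafka, TiDB, and so on) and UDF wrappers rather than the toy stages above.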

With the SDK, you can configure (or, for custom logic, write) UDFs for a Kafka source, a TiDB sink, and intermediate aggregation windows, and bring up a real-time service without any additional development, as sketched below.
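Under the same caveat, the sketch below shows the kind of pipeline such a configuration describes: a Kafka source, a one-minute aggregation window, and a JDBC sink pointed at TiDB (which speaks the MySQL protocol). The broker address, topic, table, and parsing logic are assumptions of this example, not NetEase's actual setup:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BillingStatsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")          // assumed address
                .setTopics("billing-logs")                  // assumed topic
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "billing-logs")
                // count events per module over one-minute windows
                .map(line -> Tuple2.of(extractModule(line), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .addSink(JdbcSink.sink(
                        "INSERT INTO billing_stats (module, cnt) VALUES (?, ?)",
                        (stmt, t) -> {
                            stmt.setString(1, t.f0);
                            stmt.setLong(2, t.f1);
                        },
                        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                                .withUrl("jdbc:mysql://tidb:4000/billing") // TiDB via MySQL protocol
                                .withDriverName("com.mysql.cj.jdbc.Driver")
                                .build()));

        env.execute("billing-stats");
    }

    /** Crude stand-in for the SDK's unified log parser. */
    private static String extractModule(String line) {
        int i = line.indexOf('|');
        return i > 0 ? line.substring(0, i) : "unknown";
    }
}
```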

To complement the SDK's unified grammar, we also built a one-stop processing platform, so that data operations staff can construct their own data businesses in a one-stop, convenient, and visual way.

Even an intricate DAG can still be generated through the parsing grammar.

The SDK-oriented strategy delivers modular functions, configurable jobs, unified data views, and stream-batch unification. It makes module reuse an everyday practice, helps everyone understand one another's work, lets every written UDF module be applied to heterogeneous data, and, more importantly, smooths the transition of historical jobs to Flink.

SDK-ification also lets us follow the community closely and upgrade Flink versions quickly. The SDK isolates business logic from the Stream API, and most extensions are made on the SDK side as extensions of Flink's native classes. When performing a major Flink version upgrade, the business side can upgrade with almost zero changes, and we avoid the huge cost of repeatedly merging internal Flink extensions from each version's internal branch into the new branch.

The other track of the two-track development plan is NetEase Interactive Entertainment's one-stop platform, which runs jobs as independent clusters on Kubernetes.

The figure above is the platform's technical architecture diagram. Supporting big-data components such as Nexus and HDFS serve as the infrastructure for a versioned software repository that hosts the SDK and other business JAR packages. At the execution level, Flink follows a per-job independent cluster model on K8s: each job runs in its own Kubernetes namespace with its own resource quota and dependency set, achieving complete isolation between business jobs and fine-grained resource allocation.

The JFlink platform also encapsulates various O&M interfaces for business iteration tracking, job execution, log collection and analysis, and other platform functions, exposed externally through stateless REST service nodes. The platform further lets operations staff create real-time jobs visually, which is what we achieved by combining the platform with the SDK.

On the one-stop platform, users can monitor the real-time status of their jobs, check running logs, roll back to historical versions, and even review historical exceptions, records and statistics, and risk control, with detailed lifecycle management.

In addition to the capabilities mentioned above, there are quite a few other features on our one-stop platform, all of which work together with the SDK to make up our real-time computing infrastructure.

3. Moving towards real-time full association

Next, from the perspective of the data business itself, this article analyzes NetEase Interactive Entertainment's experience and practice with real-time workloads in the key field of billing.

Our earliest practice was statistical analysis of the data logs generated on billing nodes. Logs from different sources come in all kinds of forms, especially the callbacks from external channels, whose formats are hard to standardize. How to take these chaotic formats and turn them into unified data processing was our first objective.

To this end, the SDK packages a unified message-parsing UDF (Message Unified Parse) that handles semi-structured data by defining an abstract syntax tree, along with UDFs for Group By and aggregate functions. With these, such requirements become statistical services expressed in the configuration grammar, whose results are written through an encapsulated sink into our self-developed TSDB.
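The real parser and its syntax-tree grammar are internal, but this simplified stand-in conveys the idea: a single UDF maps log lines of several shapes onto one flat, unified view. The regex patterns and field names are invented for the example:

```java
import org.apache.flink.api.common.functions.MapFunction;

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnifiedParseFn implements MapFunction<String, Map<String, String>> {

    // One pattern per source shape; the real parser derives these from config.
    private static final Pattern CHANNEL_CB =
            Pattern.compile("cb\\s+order=(?<order>\\S+)\\s+status=(?<status>\\w+)");
    private static final Pattern BILLING_LOG =
            Pattern.compile("\\[(?<status>\\w+)]\\s+orderId:(?<order>\\S+)");

    @Override
    public Map<String, String> map(String line) {
        for (Pattern p : new Pattern[]{CHANNEL_CB, BILLING_LOG}) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                Map<String, String> row = new HashMap<>();
                row.put("order", m.group("order"));
                row.put("status", m.group("status"));
                return row;   // unified view regardless of source format
            }
        }
        return Map.of();      // unparseable lines yield an empty record
    }
}
```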

From the log analysis and monitoring perspective, we monitor the traffic, status, and latency patterns of billing business interfaces and modules. This achieves non-intrusive real-time monitoring of the business, reduces both the latency previously introduced by micro-batching and the CPU overhead of monitoring scripts on business servers, and improves the detection rate of monitoring, making the business more reliable.

Next, we set our sights on creating a generic ETL framework.

Heterogeneous data is converted by a parser into a unified view and unified stream objects, after which it can be processed and transformed by built-in UDFs that conform to the protocol. We also implemented a JavaScript UDF, which makes data transformation easy and flexible by embedding JS scripts.
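As a hedged sketch of how a JavaScript UDF can embed scripts, the function below uses the standard JSR-223 scripting API; it assumes a JavaScript engine is available at runtime (Nashorn on JDK 8 to 14, or GraalVM's engine added as a dependency), and the `transform` function name is merely this example's convention:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class JsTransformFn extends RichMapFunction<String, String> {

    // e.g. "function transform(s) { return s.trim().toUpperCase(); }"
    private final String script;
    private transient Invocable invocable;

    public JsTransformFn(String script) { this.script = script; }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Script engines are not serializable, so compile once per task here.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        engine.eval(script);
        invocable = (Invocable) engine;
    }

    @Override
    public String map(String value) throws Exception {
        return (String) invocable.invokeFunction("transform", value);
    }
}
```

A job can then apply `new JsTransformFn(script)` in a `map` operator, so operations staff change the transformation by editing configuration rather than redeploying code.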

The data processed by Flink flows into our in-house heterogeneous data warehouse, where the business side can use it easily, even querying and aggregating the logs generated in real time directly with SQL. These services consume, in real time, the data produced by interface modules across the payment environment, at a daily volume of roughly 30 billion records, providing a strong guarantee for further real-time data business.

Around 2019, we started thinking about how to connect these dots into organic lines: every payment occurring in the payment environment would undergo full-link association analysis from start to finish, so we could tell whether any of the services involved had a problem.

The logs of these services come from many sources, including client logs, billing logs, and gateway logs. For context-dependent link analysis over such logs, Flink provides a very convenient API: the keyed stream plus session window, as sketched below.
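A minimal sketch of that pattern: parsed events are keyed by order ID, and a session window closes each payment link after a quiet gap. The `LogEvent` shape and the ten-minute gap are assumptions, and the input stream is expected to already carry event-time timestamps and watermarks:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class LinkAssembly {

    /** Minimal event shape: one parsed log line from any source. */
    public static class LogEvent {
        public String orderId;
        public String source;    // client / billing / gateway ...
        public long   timestamp;
    }

    public static DataStream<List<LogEvent>> assemble(DataStream<LogEvent> events) {
        return events
                .keyBy(e -> e.orderId)
                // a payment link is "done" after 10 quiet minutes (assumed gap)
                .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
                .process(new ProcessWindowFunction<LogEvent, List<LogEvent>, String, TimeWindow>() {
                    @Override
                    public void process(String orderId, Context ctx,
                                        Iterable<LogEvent> in, Collector<List<LogEvent>> out) {
                        List<LogEvent> link = new ArrayList<>();
                        in.forEach(link::add);
                        link.sort((a, b) -> Long.compare(a.timestamp, b.timestamp));
                        out.collect(link);   // the full, time-ordered payment link
                    }
                });
    }
}
```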

The diagram above shows the architecture of full-link monitoring. Link-analysis knowledge is encapsulated into a model and loaded into Flink's real-time analysis program, which stitches together the data of each payment link in real time and writes it to our graph database for further downstream use. In addition, a daily Spark job handles abnormal links to fulfill link-completion requirements, which is also a practice of the Lambda architecture.

The figure above shows the effect of stitching a full link together: a payment order can be traced end to end and displayed, which greatly helps DBAs and product staff locate payment issues.

Around 2020, NetEase Interactive Entertainment began exploring real-time data warehousing, one of whose most important applications is the user portrait system.

Previously, data reports were delivered T+1, with low timeliness. After upgrading the reports and making them real-time, queries can now be served in real time through an interface. The improved timeliness lets the product team run refined operations and respond to marketing needs more promptly, thereby increasing profit.

These various computations are also implemented in the form of configuration + SDK.

In particular, for stream data widening, Flink's Async I/O and lookup joins against external tables are indispensable tools for real-time data processing.
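As a sketch of the Async I/O half, the hypothetical lookup below enriches each record with a dimension value fetched asynchronously; a real job would call a non-blocking database or HTTP client where this stub completes immediately:

```java
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class WidenJob {

    public static class AsyncDimLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String userId, ResultFuture<String> result) {
            // Hypothetical non-blocking lookup; swap in a real async client here.
            CompletableFuture
                    .supplyAsync(() -> userId + "|" + lookupLevel(userId))
                    .thenAccept(v -> result.complete(Collections.singleton(v)));
        }

        private String lookupLevel(String userId) { return "VIP1"; } // stub
    }

    public static DataStream<String> widen(DataStream<String> userIds) {
        // unordered: each result is emitted as soon as its lookup completes
        return AsyncDataStream.unorderedWait(
                userIds, new AsyncDimLookup(), 5, TimeUnit.SECONDS, 100);
    }
}
```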

The real-time user warehouse and real-time warehouse metrics provide the product team with player-level micro queries and report-level macro queries. This user data can be fed into visualization tools, where intuitive displays let product operations spot patterns that the raw numbers alone would not reveal, further mining the value of the data.

After the practice above, we began to consider whether all kinds of data across the whole payment environment could be associated at the level of a single link and a single user, to achieve macro-level monitoring of the payment environment.

Heterogeneous data items in payment-environment sessions, such as the TiDB payment database and the various logs generated by payment middleware, are associated and analyzed using Flink's interval join.

For example, TiDB stores the order and payment flow records, while the logs record each step of the payment process, from the user's client-side order through the channel callback. Associating the two lets us analyze the corresponding service modules, as sketched below.
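A sketch of that association using Flink's interval join, assuming both streams carry event-time timestamps; the record shapes and the ±5-minute bounds are illustrative:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class PaymentAssociation {

    public static class PayRecord { public String orderId; public long amount; }
    public static class LogEvent  { public String orderId; public String step; }

    public static DataStream<String> associate(DataStream<PayRecord> dbRecords,
                                               DataStream<LogEvent> payLogs) {
        return dbRecords
                .keyBy(r -> r.orderId)
                .intervalJoin(payLogs.keyBy(l -> l.orderId))
                // match log events within ±5 minutes of each DB record (assumed bounds)
                .between(Time.minutes(-5), Time.minutes(5))
                .process(new ProcessJoinFunction<PayRecord, LogEvent, String>() {
                    @Override
                    public void processElement(PayRecord r, LogEvent l,
                                               Context ctx, Collector<String> out) {
                        // one associated (order, payment step) pair per match
                        out.collect(r.orderId + " -> " + l.step);
                    }
                });
    }
}
```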

Going further, the links produced by each module can be associated and merged, finally yielding association-analysis results for the entire payment environment.

For example, if the number of shipping logs drops sharply or many error codes appear after shipment, O&M staff can quickly discover that the shipping service is abnormal. The figure above shows such an association analysis. In some complex scenarios of the production environment, this full-association analysis framework has processed data from nearly ten heterogeneous sources and, through association, analyzed business-scenario sessions covering dozens of cases. Many real-time reports on the payment environment are built on this association-analysis capability, helping operations fix problems, guiding product strategy, and ultimately increasing revenue.

The improvements in resource and data efficiency brought by real-time data services are plain to see, and the high timeliness inspires new ways of using data; this is the new big-data future that Flink brings.
