Abstract: Most modern applications are developed with front-end/back-end separation. In the new-generation system of China Construction Bank (CCB), each transaction corresponds to three message packets: the front-end buried-point (tracking) data, the HTTP request sent, and the HTTP response returned. CCB has a large number of branches and staff all over the world and generates a huge volume of financial transactions every day, producing massive message data across hundreds of application scenarios such as operation distribution, cash distribution, and credit card approval. Financial business is complex and demands high stability, especially in the banking industry.

This article is based on a presentation given by Zhou Yao, a fintech development engineer at CCB, at Flink Forward Asia 2021. It shares the experience of CCB's intensive operation service team with the stream computing framework Flink: how stateful stream computing was introduced to process, join, and deliver the three messages (buried point, request, and response) to consuming applications stably, promptly, and efficiently, ultimately generating business value. It presents the evolution of the stream computing architecture in the bank's operational big data, including the solutions adopted and the detours taken, in the hope of providing a reference for financial enterprises using stream computing. At the same time, CCB hopes to turn the products and scenarios it has polished into offerings that can be promoted to more peers and industries.

This article covers four parts: an introduction to the company, the business background and challenges, the evolution of the solution and its business effects, and future prospects.


I. Introduction to the Company

CCB Fintech is the financial technology subsidiary of China Construction Bank, formed by converting the bank's former software development center into a company as a whole. Since the conversion, the company has been committed to being the technology driver and ecological connector of the new financial system, powering the digital transformation of the China Construction Bank Group, enabling the construction of “Digital China”, and doing all it can so that technology makes society better. It is also engaged in To-B digital transformation consulting and digital construction business.

The intensive operation service team of CCB mainly provides four intelligent operation products: process operation, delivery management, operation risk control, and channel management. This article mainly shares the practical application of Flink real-time computing in process operation.

II. Business Background and Challenges

Process operation is customer-centered. Driven by process, data, and technology, it achieves digital control of the operation journey and operation resources, and builds a group-wide intelligent operation system that is horizontally connected and vertically integrated.

2.1 Introduction to Process Operation

Take the credit card process as an example. A user can apply for a CCB credit card through the bank’s mobile app, WeChat, or other channels. After the user submits the application, it is routed to the credit card approval department, which determines the credit limit based on the user’s overall situation. The information then goes to the card printing department and finally to the distribution department, which mails the card to the user; the user can start using the card after activating it. Each key business node, such as application, approval, card issuance, card printing, and activation, can be called a business action, and multiple business actions linked together form a business process from a business perspective.

For such a credit card application process, different roles have different perspectives and want to obtain different information from this process.

  • As an applicant, I want to know whether my card application has been approved and when the card will be mailed out.
  • As a staff member recommending cards offline, I want to know whether any of the applications whose customer information I collected today were returned because of incomplete materials.
  • As a bank executive, I may want to know in real time how many credit card applications my branch received that day, and whether the average processing and approval time is slower than before.

For critical, high-frequency applications such as credit cards, there is a dedicated process system to meet user needs. But some scenarios are relatively low-frequency, their data is scattered, and the data of each system is isolated from the others. Examples include door-to-door collection, ATM cash replenishment, the cash supply chain process, and corporate account opening; there are hundreds of similar process applications. If each of these process applications were developed and built separately, it would not only be too costly but would also require invasive modifications to each component, introducing more uncertainty.

So we hope to build a system that can ingest all log information, connect the data of the various systems, and restore the fragments scattered across business systems into business processes from a business perspective. If business people can view the data from a global perspective, the data can create greater value, which also fits the trend of digital transformation in banking in recent years. Our intelligent operation product meets this demand well.

2.2 Process Operation Objectives

In order to meet the needs of various business users and roles for process analysis, the process operation is mainly responsible for four things:

  • The first is a complete presentation of the current state of the business process;

  • The second is to diagnose the problems of the process;

  • The third is to monitor and analyze the process;

  • The fourth is to optimize business processes.

So the first thing we need to do is restore the process. Who, then, defines the process? On reflection, we concluded that this should be delegated to the business users. In the past, business processes were written into the code by developers; business personnel only saw the result of a credit card approval, passed or not, while in fact the approval may depend on internal steps such as step A, step B, and step C.

If process definition is delegated to business users, they may not know the internal flow of the system at first, for example that approval completes only after steps A, B, and C have all finished. So we need to provide a tool for business users to experiment with. Business personnel first define process parameters based on intuition; the process operation system then runs real data through the process they defined and restores it into real process instances, so they can verify whether the process fits the business scenario. If it does, the process parameters go live; if not, business users can modify and iterate for the scenario in time, improving the business process little by little with each iteration.

Then, with these running processes, we can build process applications on top of them, such as indicators, monitoring, and alerting. Taking the credit card application business as an example, we compute indicators such as the approval rate of applications and the activation rate after issuance, and then carry out a series of process monitoring and operations activities.

Finally, with these indicators, we can use the data to guide and enable the business, improve operational efficiency, and improve service satisfaction.


2.3 Technical Challenges

To achieve this goal, we also face some technical challenges:

  • Service data comes from multiple systems. CCB is also building a data lake, but the data in the lake is just a simple pile of records; if a global view cannot be formed, data islands easily result.
  • Business flexibility is high. We hope to provide a mechanism for the business to configure processes autonomously and improve them iteratively.
  • Real-time requirements are high. Processes must be processed in real time, as soon as the business happens.
  • Data comes from multiple streams. Data arrives from multiple systems, such as approval, card issuance, and activation. Most of these systems are deployed with front-end/back-end separation, so there are both front-end buried points and back-end requests and responses. We need to collect the data of all these systems and link the corresponding buried points, requests, and responses together in real time to obtain complete business actions.
  • Business data is huge: tens of billions of records a day, arriving non-stop 7×24 hours.

In order to solve the above pain points, we have taken a series of measures:

  • Using message queues: Kafka isolates the systems that produce business logs from the data processing systems, minimizing invasive changes to applications.

  • We define parameters and manage their configuration through sites and processes.

  • We use Flink to do real-time processing and scale horizontally.

  • We use unified stream-batch processing to save resources.

Since all of our stream computing applications run on the big data cloud platform that is under construction, let’s first introduce that platform.

The figure above shows the big data cloud platform. The data processing flow is as follows: data is ingested from network capture, buried points, logs, and database CDC; Flink processes the data in real time; finally, the results are written to storage systems such as HBase, GP (Greenplum), and Oracle.

III. Solution Evolution and Business Effects

In CCB, data usually comes from three channels, namely customer channel, employee channel and outreach channel.

Customer-channel transactions are mainly initiated from the mobile banking app. Employee-channel transactions mainly refer to operations performed by CCB tellers at the bank’s many outlets; for example, when a customer goes to a branch to make a deposit, the teller initiates the deposit transaction in the employee channel. The outreach channel refers to external systems calling CCB’s interfaces to conduct transactions.

3.1 Process Analysis Scenario

Each business action initiated by a channel corresponds to three log packets in the data processing system, namely request, response, and buried point, which share a globally unique tracking number. The difficulty lies in extracting the unique identifier from the three continuous data streams and joining the three records by that identifier to form a complete business action. In addition, there are the issues of out-of-order arrival, intermediate state storage, and late messages.

In order to solve these problems, the solution also underwent three evolutions.

  1. The first scheme used sliding windows, but efficiency problems soon appeared;
  2. The second scheme therefore adopted Flink’s interval join, but the program kept hitting OOM and could not run stably;
  3. We then moved to a third scheme, implementing a KeyedProcessFunction ourselves to manage the intermediate state manually, which solved both the efficiency and the stability problems.

Before sharing the details, a bit of background. For each business action, the three corresponding records arrive within 5 seconds 80% of the time, but due to network jitter or collection delays we need to tolerate a delay of about an hour. A global tracking number corresponds to exactly one request, one response, and one buried point; that is, the three messages of a business action will join successfully exactly once.
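To make the sketches below concrete, here is a minimal, hypothetical model of these three message types; the names (TraceMessage, traceId, MsgType) are ours for illustration and are not CCB's actual schema.

```java
// Hypothetical data model: one business action = three records sharing a traceId.
public class TraceMessage {
    public enum MsgType { REQUEST, RESPONSE, BURIED_POINT }

    public String traceId;   // globally unique tracking number, the join key
    public MsgType type;     // which of the three packets this record is
    public long eventTime;   // millisecond timestamp of when the record was produced
    public String payload;   // raw message body

    public TraceMessage() {} // no-arg constructor so Flink treats this as a POJO

    public TraceMessage(String traceId, MsgType type, long eventTime, String payload) {
        this.traceId = traceId;
        this.type = type;
        this.eventTime = eventTime;
        this.payload = payload;
    }
}
```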

3.1.1 Sliding Windows (Version 1.0)

To meet this requirement, we quickly rolled out version 1.0 using a sliding window. When a request or response arrived, we parsed it, extracted the unique business identifier, and keyed the stream by it. Because of ordering problems (either the request or the response may arrive first), we used a 10-second sliding window with a 5-second slide. If a request arrives and its response arrives within five seconds, the two are joined inside the window and the business action is output directly. If the response does not arrive within 5 seconds, the request’s state is extracted and stored in Redis to wait. When the response arrives later, it first checks Redis by the business identifier for a matching request; if one exists, the pair is taken out and processed as a business action.

That is, we first join the request and the response, and then join the result with the buried point, which amounts to two real-time joins, with unjoined messages stored in Redis as wait state.
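A minimal sketch of this 1.0 approach for the request/response half, assuming the hypothetical TraceMessage model above and the Jedis client for Redis; the host, port, and TTL are placeholders, and de-duplication of output across the overlapping sliding windows is omitted:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import redis.clients.jedis.Jedis;

public class WindowJoinV1 {
    public static DataStream<String> joinRequestResponse(DataStream<TraceMessage> messages) {
        return messages
            .keyBy(m -> m.traceId)  // group by tracking number
            .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
            .process(new ProcessWindowFunction<TraceMessage, String, String, TimeWindow>() {
                @Override
                public void process(String traceId, Context ctx,
                                    Iterable<TraceMessage> elements,
                                    Collector<String> out) throws Exception {
                    TraceMessage req = null, resp = null;
                    for (TraceMessage m : elements) {
                        if (m.type == TraceMessage.MsgType.REQUEST)  req = m;
                        if (m.type == TraceMessage.MsgType.RESPONSE) resp = m;
                    }
                    if (req != null && resp != null) {
                        out.collect(traceId + "|" + req.payload + "|" + resp.payload);
                        return; // joined inside the window
                    }
                    // Only one side arrived: fall back to Redis as external wait state.
                    // A real job would pool connections instead of opening one per firing.
                    try (Jedis jedis = new Jedis("redis-host", 6379)) { // placeholder address
                        TraceMessage present = (req != null) ? req : resp;
                        String other = jedis.get(traceId);
                        if (other != null) { // counterpart was waiting in Redis
                            out.collect(traceId + "|" + other + "|" + present.payload);
                            jedis.del(traceId);
                        } else {
                            jedis.setex(traceId, 3600, present.payload); // wait up to ~1 hour
                        }
                    }
                }
            });
    }
}
```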

But this approach has some disadvantages:

  • The first is low throughput. As more and more messages were ingested, Flink’s parallelism kept growing, and with it the number of Redis connections and requests. Constrained by Redis throughput and connection limits, overall throughput plateaued once a threshold was reached.

  • Second, the operation and maintenance pressure on Redis is high. When data volume is large, more and more unjoined data accumulates and Redis fills up quickly, requiring manual cleanup to keep things stable;

  • Third, extra code has to be written by hand in Flink to interact with Redis.

  • Fourth, as the state backlog in Redis grows, parameters or data in Redis may expire or be evicted.

So we evolved a second version, the Interval Join version.

3.1.2 Interval Join (Version 2.0)

Interval join is a built-in feature of the Flink framework. It uses RocksDB for state storage, which effectively replaces Redis with RocksDB.

We chose this scheme partly to reduce the operation and maintenance pressure, and partly because, as data volume grows, it can easily be scaled out horizontally.

The first optimization is to filter the data on arrival according to the configuration, dropping unneeded data in advance so that the volume to be processed falls considerably. Second, an interval join connects the request and the response, and the joined result is then joined with the buried point by another interval join. The logic in this scheme remains the same as in version 1.0. To tolerate a data delay of about an hour, we set the lower and upper bounds to 30 minutes each.
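A minimal sketch of the 2.0 approach using Flink's built-in interval join with the ±30-minute bounds mentioned above; watermark assignment is omitted, although an event-time interval join requires it:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinV2 {
    public static DataStream<String> joinRequestResponse(DataStream<TraceMessage> requests,
                                                         DataStream<TraceMessage> responses) {
        return requests
            .keyBy(m -> m.traceId)
            .intervalJoin(responses.keyBy(m -> m.traceId))
            .between(Time.minutes(-30), Time.minutes(30)) // tolerate ~1 hour of total skew
            .process(new ProcessJoinFunction<TraceMessage, TraceMessage, String>() {
                @Override
                public void processElement(TraceMessage req, TraceMessage resp,
                                           Context ctx, Collector<String> out) {
                    out.collect(req.traceId + "|" + req.payload + "|" + resp.payload);
                }
            });
    }
}
```

The joined result would then be interval-joined with the buried-point stream in the same way. Note that the framework keeps every record from both sides in state for the full 30 minutes, whether or not it has already joined.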

However, after running it for a few days we found that OOM occurred frequently, which was complicated to analyze because we run Flink on Kubernetes. Later, by reading the source code, we found that Flink’s interval join keeps all data within its time bounds in state and does not delete records until they expire, which caused some new problems.

First, checkpoints are very large. Reading the source of Flink’s interval join implementation, we found that it keeps the full 30 minutes of data from both sides in the RocksDB state backend, because it must retain everything to handle one-to-many and many-to-many joins.

Second, the job is unstable. It uses RocksDB as the state store; RocksDB itself is written in C++ and Flink calls it from Java, which easily leads to OOM. Due to some constraints, we could only give RocksDB enough space through configuration parameters to keep it from OOM. For applications in our industry, a real-time business interruption caused by OOM is absolutely unacceptable.

Third, after analyzing our join scenario, we found that request, response, and buried point are strictly one-to-one, not one-to-many as in a database join. Given this, much of the data kept in the state backend during the one-hour window is unnecessary, so we wanted to manage the state ourselves and delete data from the state backend as soon as it is no longer needed. Thus the third version evolved.

3.1.3 Manually Managed State (Version 3.0)

Since the join happens exactly once per tracking number, in 3.0 we implemented a KeyedProcessFunction ourselves to manage this state.

After data arrives it is filtered and normalized, whether it is a request, a response, or a buried point, and then keyed by the extracted unique identifier, so that messages with the same identifier land in the same slot. When each message arrives, the function checks whether the other two messages have arrived and the join condition is met. If so, it outputs the joined business action and deletes the corresponding data from the state backend; if not, the message is stored in the RocksDB state backend to wait. Managing state manually in this way reduced state storage by 90 percent and delivered significant benefits.
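A minimal sketch of this 3.0 logic under the same hypothetical TraceMessage model: each key keeps at most one record per message type in state, emits and clears as soon as all three sides are present, and uses a timer to evict stragglers after the one-hour tolerance window.

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ThreeWayJoinV3 extends KeyedProcessFunction<String, TraceMessage, String> {
    private static final long TTL_MS = 60 * 60 * 1000L; // ~1 hour tolerance for late messages

    // One slot per message type; the join is strictly one-to-one-to-one.
    private transient MapState<String, TraceMessage> pending;

    @Override
    public void open(Configuration parameters) {
        pending = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("pending", String.class, TraceMessage.class));
    }

    @Override
    public void processElement(TraceMessage msg, Context ctx, Collector<String> out)
            throws Exception {
        pending.put(msg.type.name(), msg);

        TraceMessage req   = pending.get("REQUEST");
        TraceMessage resp  = pending.get("RESPONSE");
        TraceMessage point = pending.get("BURIED_POINT");

        if (req != null && resp != null && point != null) {
            // All three sides present: emit the business action and delete the state
            // immediately instead of holding it for the full window, which is where
            // the ~90% state reduction over the built-in interval join comes from.
            out.collect(ctx.getCurrentKey() + "|" + req.payload + "|"
                        + resp.payload + "|" + point.payload);
            pending.clear();
        } else {
            // Still waiting: schedule cleanup in case a counterpart never arrives.
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + TTL_MS);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        pending.clear(); // evict incomplete actions after the tolerance window
    }
}
```

It would be applied after keying by the tracking number, e.g. `messages.keyBy(m -> m.traceId).process(new ThreeWayJoinV3())`.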

First, with RocksDB as the state backend, throughput is much better than in version 1.0.

Second, it reduces development and operations difficulty. Third, real-time processing capacity is improved: when data volume grows further, more nodes can be added to scale out horizontally. In addition, Flink ships with many join implementations and provides good interfaces that made it convenient for us to implement our own logic.
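For reference, switching a job to the RocksDB state backend is a small configuration change in Flink 1.13+; the sketch below is generic boilerplate, and the checkpoint path is a placeholder.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);                            // checkpoint every minute
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // incremental checkpoints
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints"); // placeholder
        // ... build the pipeline, e.g. keyBy(traceId).process(new ThreeWayJoinV3()) ...
    }
}
```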

3.2 Process Indicator Scenario

With the basic process data in place, we built indicator calculations on top of it; the real-time process indicator calculation went through two scheme iterations.

3.2.1 Real-Time Indicators (Version 1.0)

In version 1.0, limited by our proficiency with the technology stack and tools, real-time indicators were handled with a mix of streaming and offline computing and only reached quasi-real-time, minute-level latency. The data source is Kafka; Flink pre-aggregates the data and sinks it back into Kafka; Spark then periodically loads the data from Kafka into the GP library. With the GP library as the core, the indicators are computed in SQL and the results are written back to Oracle, where applications finally consume them. That was our version 1.0.
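As an illustration of the pre-aggregation step in this pipeline, here is a minimal sketch that rolls business actions up into minute-level counts per node before they are handed back to Kafka; the (nodeName, count) tuple shape is our assumption, not the actual schema.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MinuteAggregation {
    // Roll (nodeName, 1L) events up into per-minute counts per business node;
    // the result is what Spark later loads from Kafka into the GP library.
    public static DataStream<Tuple2<String, Long>> countPerNode(
            DataStream<Tuple2<String, Long>> actions) {
        return actions
            .keyBy(t -> t.f0)                                       // key by business node name
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .sum(1);                                                // minute-level partial counts
    }
}
```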

3.2.2 Real-Time Indicators (Version 2.0)

As we became more familiar with Flink and the surrounding tools, the team considered how to achieve second-level real time. In version 2.0, data is consumed directly from Kafka, Flink computes the real-time indicators, and the results are written directly to Oracle, with second-level end-to-end latency. These are truly real-time indicators.
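A minimal sketch of what this 2.0 link can look like with Flink SQL, wiring a Kafka source to a JDBC sink; all topic, table, and field names and connection parameters are placeholders, and the JDBC connector must have a dialect for the target database:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeMetricsV2 {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: business actions produced by the process-analysis job.
        tEnv.executeSql(
            "CREATE TABLE business_actions (" +
            "  branch_id STRING," +
            "  action_name STRING," +
            "  action_time TIMESTAMP(3)," +
            "  WATERMARK FOR action_time AS action_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'business-actions'," +                      // placeholder topic
            "  'properties.bootstrap.servers' = 'kafka:9092'," +     // placeholder brokers
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Sink: indicator table consumed by applications, written via JDBC.
        tEnv.executeSql(
            "CREATE TABLE branch_metrics (" +
            "  branch_id STRING," +
            "  window_end TIMESTAMP(3)," +
            "  action_cnt BIGINT," +
            "  PRIMARY KEY (branch_id, window_end) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:oracle:thin:@//oracle-host:1521/ORCL'," + // placeholder URL
            "  'table-name' = 'BRANCH_METRICS'" +
            ")");

        // Second-level indicator: per-branch action counts over 10-second windows.
        tEnv.executeSql(
            "INSERT INTO branch_metrics " +
            "SELECT branch_id, TUMBLE_END(action_time, INTERVAL '10' SECOND), COUNT(*) " +
            "FROM business_actions " +
            "GROUP BY branch_id, TUMBLE(action_time, INTERVAL '10' SECOND)");
    }
}
```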

In banks, channel, product, and institution are three very important dimensions. For the business processes in production, we compute statistics on their distribution across channels, products, and institutions; for example, the proportion of business processes handled online versus by staff in each branch, whether materials were incomplete somewhere in the middle of a process and caused a backlog from start to finish, and the average processing time of each step.

Therefore, in general, Flink does bring relatively large benefits to the project.

  • First, Flink SQL makes processing easier. Since Flink 1.11, the SQL features have been improved step by step, making them more convenient for developers to use;

  • Second, it provides end-to-end, true second-level real-time;

  • Third, Flink can reduce the interaction of data components, shorten the entire data link, and improve the stability of data.

3.3 Service Effect

The middle of the figure shows a business process of cash reservation and door-to-door collection on CCB’s mobile app.

First comes cash entry, then cash counting, then cash handover, and finally the business acceptance is completed. The processes and indicators we analyze can be viewed not only in the mobile app but also in the employee channel. For example, each green dot on the left represents a site; once completed, the sites can be strung together to form a complete process.

So for business people, the first value gained is process reshaping. From indicator access to indicator visualization to data mining, indicators are finally derived from the process and used to optimize it, forming a complete business closed loop.

With this basic data in hand, we can intervene against the risks in business processes. For example, if a customer wants to withdraw a large amount of cash, the system informs the branch manager in real time so that the manager can engage and retain the customer.

Second, process analysis brings optimal allocation of resources. Process-based applications can monitor the use of business resources. For example, if the number of applications for a certain process suddenly surges, we consider whether processing time is lengthening because of a staff shortage; the system detects this in time so that more staff can be allocated. In this way, resource allocation improves resource utilization efficiency and service satisfaction. Similar process scenarios have been promoted to many business lines within the bank and have been widely praised.

Flink’s real-time data processing provides strong data support for the digital transformation of CCB.

IV. Future Prospects

At present, process operation is implemented only within CCB. In the future, we hope to productize and platformize this intelligent process operation methodology and promote it to more industries, so that more industries can benefit from financial-grade process operation practice.
