Component system design of Ant Financial facing hundred-million-level concurrent scenario

Author: Lv Dan (Ning Zhi), joined Alipay in 2011, responsible for alipay Wap, AliPass card voucher, SYNC data synchronization and other projects successively, and participated in the Double 11, Double 12, Spring Festival red envelope promotion activities, and has certain project practical experience and accumulation in client basic services. Currently, I am responsible for the system optimization and architecture design of mPaaS server components of Ant Financial mobile development platform.

Activity Recommendation: How does Ant Financial build an intelligent engine for user behavior prediction and summary analysis, targeted delivery, ABTest and so on? How to improve the construction of online operation and maintenance system, monitor App anomalies in real time and form online quality assurance closed-loop through dynamic hot repair by means of automatic log collection and analysis?

Application: immediately tech.antfin.com/activities/…

On May 6, InfoQ hosted the QCon 2019 Global Software Development Conference in Beijing. Lyu Dan (Ning Zhi), a technical expert of Ant Financial, shared the component System Design of Ant Financial in the face of Hundred-million-level Concurrent Scene at the conference. We arranged the speech as follows:

Today, I want to share with you about the mobile base component system, the content can be roughly divided into four block, the first block is the standard research and development of mobile service system, the basis of the second block is the core of support million level concurrent component evolution process, the structure of “mobile access of the third piece is a double tenth, 12-12, New Year red envelopes of the big deal with method that promote activities, The last piece is the basic service products that have been exported.

0. Mobile R&D basic service system

First introduce the evolution process of Alipay client. Before, the main functions of Alipay client were transfer, order payment, transaction inquiry and so on. It was more like a tool APP, which would only be taken out when payment was needed and put back after use. In 2013, Ant Financial added many services after All In Wireless, such as Yu ‘ebao, card coupon, Explore discovery, etc. Basically, all functions of the Alipay Website were transferred to the client, and Alipay gradually evolved into a platform-level client. Later, with the rapid development of mobile Internet, the company’s internal hatched more APP, other industries within the circle of the mobile Internet, rolled out a lot of business in order to improve users, user viscosity, also began to a lot of business integration between APP, the APP is born because of this, the APP started toward the ecological pattern of development.

Up to now, the annual active users of Alipay client have exceeded 800 million. Under the promotion scenario, the simultaneous online volume has exceeded 300 million, the concurrent requests have exceeded 100 million, and the simultaneous online users have exceeded 1 million per second.

Behind these data must be a huge, complex and complete support system to support the operation of Alipay, mobile RESEARCH and development basic service system is an important part of it.

According to the research and development process, we divided the basic service system of mobile RESEARCH and development into four parts: APP research and development stage, which mainly includes APP framework, basic components, cloud services and research and development tools; The App testing stage mainly includes the R&D collaboration platform and the real machine testing platform, of which the r&d collaboration platform includes version management, iteration management, installation package compilation, construction and packaging capabilities, while the real machine testing is mainly to replace manual services, reduce labor consumption and improve test efficiency; App operation and maintenance phase, including intelligent release, log backtracking, emergency management and dynamic configuration; App operation stage mainly includes public opinion feedback, real-time analysis, offline computing and intelligent marketing.

1. Evolution of ant mobile access architecture

Today’s topic is to support the basic services under 100 million levels of concurrency, and mobile access is the most core and important part of the 100 million levels of concurrency. Mobile access is not a single system but a set of components, including Spanner+ connection management, API gateway, PUSH notification, and SYNC data synchronization. It is the traffic entrance of all mobile services. It maintains client status, supports unified management and control, and processes some service data.

In fact, there was no such term as mobile access at the beginning. Similar to the evolution of Alipay client, the back-end mobile access also evolved gradually and iteratively. At the beginning, each business service provided its own API or directly exposed its capabilities to the client, without unified architecture, model or control.

To solve this problem, in the all in phase, we extended an API gateway, which did centralized management and added the ability of PUSH. Because there are many apps in the company, we hope that these capabilities can be reused. Therefore, in terms of architecture, we support the isomorphism of multiple Apps. The client will provide multiple SDKS, which can be integrated at any time.

[Gateway Architecture]

The diagram above shows the architecture of a mobile API gateway. As you can see from the diagram, we define the API life cycle as the following phases: API definition, API development, API release, API configuration, API live, API operational, and API offline. The mobile gateway divides the API life cycle into three parts: r&d support stage, runtime stage and service governance stage.

There are four main capabilities in the r&d support stage, namely code-Gen, API-man, API-test and API-Mock. In order to improve the transmission efficiency of API data on the network, at present all ant API models adopt Protobuf for serialization. Therefore, in order to facilitate business development, API gateway provides a unified code generation tool based on Proto file, and in order to reduce the client file size and method number limit, We modified the official generated code, effectively reducing redundant methods and greatly reducing the client file size.

At runtime, the core functions include API traffic control, data check, user authentication and interface system routing.

The daily operation and maintenance of API is handled by the service governance system, which is mainly capable of API monitoring, API quality model evaluation based on real-time data, and some emergency management measures.

The core architectural design of API gateway is Pipeline. As we all know, the gateway looks like a simple API control and routing, but there are many nodes involved, and the functions of each node are independent from each other, and with the development of business, functional nodes will gradually increase. In some scenarios, You also need to do different node combinations. With traditional chained calls, the code execution string is very long and difficult to extend and maintain. Therefore, we refer to netty’s Pipeline design and complete our own Pipeline link. Each handler in Pipeline is independent of each other, and can be tied freely according to needs and configurations, which also provides a good architectural support for subsequent function extension.

[Code change]

From the code, we can clearly feel that the previous call process is a remote call, which needs to be aware of the path, parameters and so on. However, after unifying the whole data interaction, for the business system, this call process is more like a local call, which directly calls functions and encapsulates the model. In this way, business r & D students can pay more attention to the code and logic writing of their own business systems, without paying attention to the realization of the underlying communication, which greatly improves the efficiency of research and development.

Mobile networks are very different from wired networks. The mobile network is complicated, and the user status is also complicated. The user may be in the basement, elevator or other weak network environment, and the user has very high requirements for experience in mobile scenarios. For example, when making payment, the user needs to get the payment result immediately. Previously, we focused on the server side, not the client side. In order to solve user problems, performance problems and improve user experience, we made an upgrade to build a unified access gateway and put it on top of the basic components, and developed performance data synchronization and enhanced IP scheduling capabilities.

Unified Access Gateway

Unified Access Gateway (ACCGW), which can be understood as a pre-nginx, is a set of components developed by ants based on Nginx. Internally, we call it Spanner. It is mainly responsible for the non-business logic processing in access architecture, including SSL uninstallation, MMTP protocol parsing. Data compression and decompression, client TCP long connection maintenance, access traffic control, packet routing, and client log access. API gateways, PUSH PUSH, data synchronization, and other components are all under its umbrella.

[Network Protocol optimization]

The full name of MMTP is Ant Mobile Transmission Protocol. It is based on TLV data structure. The advantage of this data structure is that the efficiency of subcontracting and unpacking is very high, and it is based on binary, and the storage cost is relatively low. At the same time, it also meets the link reuse of multiple components of the client. Of course, MMTP with the client also has its own features, but we also added many new features, such as intelligent connection strategy. Because the network status of users in the mobile environment is not very reliable, the traditional connection mode may not be able to meet all RPC requests, so we made policy improvement. If the long connection is available, try to use the long connection. If the long connection fails or is intermittently disconnected, we try to use the short connection, which can meet the urgent RPC data sending at that time. At the same time, we will also use some concurrent connection strategies. The carrier network usually uses the connection whichever is connected first. After the connection, we will use the intelligent heartbeat strategy to capture the difference of heartbeat time for maintaining the connection between different carriers and different regions.

Build the process will often appear in concurrent client multiterminal long connection phenomenon exist at the same time, do transmission packets may be in the middle, if immediately broken, packets are lost, is likely to impact on business, so we joined the flexible break even, to ensure that may be in the process of transmission packets can be delivered safely. In addition, after multiple connections are made, the client may have a condition that the server does not perceive in time to know whether the connection is good or bad. Therefore, we added false connection detection. Packets are sent with a serial number, and when the client reports back, if the serial number is returned, it proves that the connection is available. Otherwise, we assume that the connection is in a state of suspended animation and can be broken at the appropriate time.

MTLS is ant mobile security transport protocol, based on TLS1.3. TLS1.3 hadn’t been officially released when we were doing this, but we learned about some of its features and incorporated some of them into the design. For example, the 1RTT ECDHE handshake is used. 1RTT ECDHE is based on ECC encryption suite. The biggest characteristic of ECC is that the key string is relatively small. Smaller data has great advantages in mobile, such as improving transmission efficiency and saving storage cost. Packet size is a particular concern in the mobile world during storage or transmission. For this reason, we chose the ZSTD compression algorithm. ZSTD has a very large compression ratio, and the efficiency of compression and decompression is good under this ratio. In addition, in some replayable business scenarios, we have added the 0RTT policy to send data from the client to the server in the first place. Through the above optimization, the average response efficiency of RPC was improved 5-6 times.

[SYNC] Data synchronization

SYNC Data synchronization may sound unfamiliar, but it can be interpreted as an evolutionary version of PUSH. It is based on TCP, two-way transmission. While traditional RPC solves most problems, it is flawed in some scenarios. For example, after the client is started, RPC requests are made to determine whether the server has data. In fact, 90% of the time the query interface does not change at all or the returned data client already exists, so this process is very redundant. In addition to data redundancy, requests are redundant because no changes have taken place and invocations can in principle be saved.

At first, after all in, we made some optimization on experience, such as preloading ability. When the client started, data preloading was triggered. Although the client did not enter the module, in order to improve user experience, the client sent a lot of RPC requests, resulting in a large number of redundant concurrent requests.

Another disadvantage is that the client cannot actively perceive the data changes of the server. For example, in chat scenarios, users are waiting for interaction. If RCP timed pull is used, the cost of the client and the server will be very high and the overall response time will be slow. In SYNC push mode, when the server generates data, the data can be pushed to the client based on TCP, and the client can get the data for business rendering in the first time. For example, alipay scan code payment and face-to-face payment are synchronized with the result data through SYNC service.

At the heart of SYNC is oplog, which is similar to mysql’s binlog and is a snapshot of each incremental data. SYNC generates a unique and increasing version number for each oplog, calculates the difference between the two ends by recording the current data version number of the client, and synchronizes only the difference data. Because SYNC is based on TCP, two-way active transmission can be achieved to achieve real-time, orderly, reliable and incremental data transmission. At the same time, SYNC is not based on business scenarios, but on events, such as connection establishment, login, from the background to the foreground, etc. Therefore, a single event can trigger the incremental calculation of multiple services, and the client does not need to make any other RPC requests when there is no incremental data. Thus greatly reduce the number of customer requests and redundant data transmission, not only improve the efficiency, real-time, but also indirectly reduce the system pressure.

[Mobile Dispatching Center]

The most important thing for client requests is to find the correct IP and send the request in the first place. However, traditional DNS has some problems, such as DNS hijacking, DNS resolution failure, DNS resolution efficiency of different carriers, and so on. DNS resolution requires extra RTT.

In response to these situations, we set up a mobile dispatch center, which is based on HTTPDNS and adds user partition information on top of it. What is user partition? In the face of hundreds of millions of concurrent tasks, the server must not be in one machine room, but may be in multiple machine rooms, and there are logical partitions inside the machine room. Which logical area the user belongs to is known only by the server, but not by the client itself. If a user in a zone is not in the correct zone and needs to be transferred to another zone for service processing, the traditional DNS cannot be used because the IP address list of the user cannot be resolved. HTTPDNS+ partition data model, can let the client quickly get the most accurate IP address, at the same time the client can also do a quality check and validity check for this IP address, before the request to determine the best IP address. In addition, HTTPDNS can support deployment of overseas nodes. HTTPDNS is not worse than DNS, because it has DNS to backstop, if HTTPDNS problems, then switch to DNS to do the resolution.

The above evolution process meets most of the daily needs, but Alipay has many big promotion scenes, each big promotion play is different, and the peak is concentrated in a moment. For this scenario, we incubated new patterns: one is the decentralization of the API gateway, two is the sync-pull mechanism, and three is the Sync-bucket computing pattern.

[Gateway decentralization]

One of the core issues that gateway decentralization addresses is cost. In the promotion scenario, the service volume keeps rising and the peak value may be very high. But the peak is only for a moment, the rest of the time the machine is idle, which is a huge waste of resources. And in order to ensure that the rush does not crash, the machine can not be reduced, so the pressure on the application is very large.

If it is a single point, then there will be a problem of stability. If something goes wrong with the release of the gateway, it can affect the processing of all business processes. In addition, if a single interface failure, client problems such as endless loop, will also affect the flow of other systems. In the face of the above situation, we can offset these risks by decentralizing the gateway. If only a single system has problems, there will be no other business problems due to network problems.

【 Sync-pull 】

Why is there a need for sync-pull read diffusion? Because alipay has a lot of big V merchants inside, each merchant has a lot of users, it needs to do some operation information on a regular or irregular basis. If you follow the SYNC scenario and dump one piece of data to each concern through write proliferation, it is not only a waste of storage, but also inefficient. Assuming that a merchant has 500 million followers, it will take dozens of minutes for all 500 million data to be stored. In addition, due to the large number of our merchants, everyone will scramble for this resource, which may lead to queuing. For the merchant, the activity cannot reach the client immediately, and for the server, the message may be identical, resulting in storage waste. Even with caching, the entire index needs to be stored for each user, which still requires a high TPS for the data.

In order to solve the above problems, we upgraded to read diffusion mode, abstracted it into concern relation, each merchant abstracted into Topic, and then put all data under Topic. Big V is relatively less, because the user attention and big V production data of frequency is not high, the effective data set is not much more special, so you can put the super big V data in the cache first, and then through the binary indexed addressing quickly under the user attention relationship, and through the original SYNC mechanism to push the incremental data to the client. In this way, the original hundred million level of storage becomes a single line of storage, the original tens of minutes of response time becomes seconds, efficiency and experience are greatly improved.

Early on, when we did SYNC, we wanted each business to be relatively independent, relatively isolated, and computing to operate independently. At that time, there were only a few business scenarios, but more and more services were added later. There were nearly 80 business scenarios, and each service was independently calculated. The client is event-based. After the connection is built, it needs to conduct 80 independent business calculations, which can easily reach 100 million levels under the condition of a large amount of one million per second, which puts great pressure on the database, application programs, cache, etc. At the same time, this way can not meet the sustainable development in the future.

To solve these problems, we abstract several categories based on the original computing characteristics. For example, several categories of data based on user dimension, device dimension, one-time, multi-end synchronization, and global user configuration are abstracted into several abstract models. Let’s say there are five buckets, all calculations are done in bucket fashion, and if a new business comes in, it is grouped into a bucket based on its characteristics.

In addition, traffic limiting policies must be implemented for high-priority services. In this case, the bucket can be dynamically increased or decreased, and the incoming and outgoing services can be switched at any time. When bucket went live, our computing volume dropped by more than 80%, and over the next two or three years we grew from 80 to 300, with no additional servers. Overall, the performance improvement is significant.

2. Promote coping strategies for activity scenes

The previous content has focused on how the components of mobile access can be designed to support a multi-billion scenario. Now let’s talk about how we can actually deal with the big push.

[The way to deal with the scene of promoting activities: footwork]

Through several years of experience in promotion, we have extracted several steps to deal with promotion technically: first, business students set business goals and determine business gameplay; After receiving the promotion, the technical students began to decompose the technical indicators and determine the corresponding technical solutions according to the capabilities, processes and characteristics of their respective systems. The steps to determine the technical solutions are mainly link analysis, capacity evaluation, performance optimization, flow control scheme, pre-plan strategy and determination of elastic flow rules. After determining and completing the technical response plan, the most important thing is to carry out the full-link pressure test of the production environment through the shadow user and shadow table. The pressure test period of each system can be as short as several days or as long as several months. To find problems and bottlenecks in continuous pressure testing, and to conduct pressure testing again after optimization until technical objectives are achieved; After the pressure measurement index is completed in the whole link, several rounds of activity drills are carried out to simulate real business scenarios and verify the accuracy of technical solutions. Since then, according to actual needs, choose the time to enter the stage of great promotion. At this stage, the main work of the R & D students is to cooperate with the operation and maintenance to implement the plan, observe the changes of various indicators during the promotion period, and confirm the need for emergency according to the monitoring. Of course, contingency plans also need to be validated in previous drills. After the end of the promotion activity, it is necessary to roll back and verify the pre-plan & emergency strategy, so that the promotion activity is really over. At the same time, and more importantly, we need to review the annual major promotion so as to identify deficiencies and improve them in subsequent activities.

[Flow control is the solution for promoting activity scenes]

Flow control is the most critical factor in the implementation process. For technical students, keeping the system alive in the activities is the most powerful support for the great Progress, and flow control is the most powerful barrier of the system. Due to the different roles that systems play, the urgency of services, and the size of clusters, features such as stream logging, compression thresholds, message ordering, and so on are sacrificed to make room for the performance of major links.

Traffic control is also divided into multiple levels. At the top layer, LVS controls billions of levels on VIP; at the access gateway layer, hundreds of millions of levels are controlled according to the number of connections and packets; and at the API gateway layer, tens of millions of levels are controlled. At these levels, simple counts are generally sufficient. In the business layer, especially in the medium and low traffic layer, token buckets and distributed traffic limiting are generally adopted. Then, on the API gateway, you can also do some custom scripts that mock back the results to fend off some requests for the business system.

[Automatic real machine test]

In addition to the core link, we also need some logistics services. For example, in the testing process, automation, especially real computer simulation test to offset part of the human labor. Thousands of mobile phones are deployed in our machine room, and some automatic operation and maintenance tests are usually carried out, including installation and unloading of installation packages, performance loss, and functional tests. In addition to automated testing, it also plays the role of automatic approval and service inspection to detect small programs and find problems in time, respectively. More than 60% of repetitive physical labor consumption can be saved through the automatic test platform.

[Intelligent release of client — Make sure the client is safe]

If we want to ensure that the client is safe, then the core is the grayscale process, after the grayscale process is finished, we can release to production environment.

Intelligent release supports various client release packages, including installation packages, offline packages, and small program packages. Over the years, we have also precipitated some templates, such as grayscale users, grayscale coverage, and so on. Gray scale, we can choose a certain template, according to the established logic, the use of automatic process instead of manual processing.

[Public opinion analysis — Get user feedback in time]

After the release of the client, the business students must be very concerned about the voice of users and market reaction, while the technical students hope to collect the real feedback from users as soon as possible. Public opinion analysis system is used to meet these needs, public opinion system can be divided into four parts: data collection, the main collection channels for major media center, application market comments, meeting feedback function and customer satisfaction center data; Data content can include various hot topics, hot events and major production issues; Data storage is mainly supported by four blocks at present: metadata can generally use relational database, document data uses MongoDB, crawler collected entries are transmitted through MQ, and all data will eventually fall into ES for retrieval and basic analysis; Data calculation is more about analyzing data in ES through file algorithm, and finally producing various trends and ranking of various events and topics. Meanwhile, relevant person in charge can be informed in real time according to the feedback of each user.

Here we are talking about mobile analytics mainly based on the client log buried point data analysis capabilities. The client needs to have standard BURIAL point SDK to collect various framework and container burial points of Native, H5 and small programs, and also needs to support customized business burial points. At the same time, in order to promote the scene in a big can effectively improve the server performance, buried point of writing and the report also needs to have some measures to carry out the dynamic control, buried in the point of the client, after the completion of quote service at the right time is the gateway, mobile log log (mobile gateway has gradually been incorporated into mobile access in). After the client logs are reported to the server, they can be output to the server log file by the log gateway or delivered to the message component for consumption computing by other platforms, including real-time computing platforms such as Jstorm, Kepler and Flink. It can also be sent to offline big data computing platforms such as Spark and ODPS for further analysis. As basic components, mobile analysis in addition to logging acquisition and synchronization, also conducted some basic framework output log data analysis, behavioral analysis (like live, new, flow exists), page analysis (stay length, participation), flash back, caton analysis, and provides the ability to log back in and log pull for research students for screening and analysis. Of course, this data can be used for all kinds of business analysis, and the form of analysis depends entirely on how the business wants to use it.

3. Basic service products exported abroad

[Technical component product service output: Mature one, open one]

Our basic capacity through the efforts of the past few years, also precipitated a lot of technical products can be exported. These technical products cover every stage from APP development, testing, operation and maintenance to operation, including client framework, client basic components, cloud basic services (such as API gateway, SYNC data synchronization and PUSH notification), open tools and plug-ins. There are r&d collaboration platforms for iterative management, compilation, construction and packaging, real machine testing platforms, intelligent release, log management and emergency management required in APP operation and maintenance stage, as well as various data analysis and marketing products for APP operation. These capabilities have been exported to several partner apps of Ant International (PayTM, Malaysia, Indonesia, Philippines, etc.), hundreds of enterprise-class apps in the public cloud, and dozens of financial apps in the private cloud.

Our aim is to mature one, open one, to build the ecology of mobile Internet with partners.

[One-stop R&D platform — mPaaS]

In order to help partners build their own apps quickly and effectively, we also launched a one-stop mobile RESEARCH and development platform — mPaaS. MPaaS includes all the basic capabilities mentioned above. At the same time, it supports both public and private clouds. MPaaS is not only the output of technology, but also the output of production experience and operation philosophy.

Recommendation | activity

How does Ant Financial build an intelligent engine for user behavior prediction and aggregation analysis, targeted placement, ABTest and so on? How to improve the construction of online operation and maintenance system, monitor App anomalies in real time and form online quality assurance closed-loop through dynamic hot repair by means of automatic log collection and analysis?

May 18th mPaaS salon CodeDay#2

Past reading

The opening | ant gold clothes mPaaS core components of a service system overview”

Summary of Ant Financial mPaaS Server Core Component System: Mobile API Gateway MGS

Core Components of Ant Financial mPaaS Server: Analysis of Mobile End-to-end Network Access Architecture under Hundred-million-level Concurrency

Core Components of mPaaS Server: Architecture and Process Design of MPS for Message Push

MPaaS Core Components: How does Alipay build public Opinion Analysis System for Mobile Terminal Products?

“MPaaS Server Core Components: Mobile Analysis Service MAS Architecture Analysis”

Follow our official account for first-hand mPaaS technology practices

Nail Group: Search group number “23124039” by nail group

Looking forward to your joining us