The earlier article “mPaaS Server Core Component Architecture Overview: Mobile API Gateway MGS” gave a first look at the architecture of the Mobile API Gateway (MGS), one of the many components of the mPaaS server side.

This article is adapted from the talk “Ant Financial’s End-to-End Network Access Architecture under Hundred-Million-Level Concurrency”, given at the event “Inside Ant Financial: The Ant Financial Technical Support Behind Double 11”. It focuses on how the network access architecture evolved within Ant Financial, how it handles the hundred-million-level concurrency of the Chinese New Year red envelope campaign, and how the resulting architecture practices and optimization ideas have been distilled into mPaaS.

1. Introduction

Alipay’s mobile architecture has evolved step by step from a tool app to a platform app and then to a super app.

This talk focuses on the evolution of Alipay’s mobile network access architecture and on the solutions used for hundred-million-level concurrency scenarios such as the Chinese New Year red envelope campaign. It also covers how Ant Financial’s mobile network technology has been commercialized and offered externally.

2. Evolution of Ant Financial’s mobile network access architecture

The Alipay wireless team was established in 2008. At that time the overall structure of the Alipay app could simply be called a monolithic architecture: a single application made up of a client app and a server, communicating over HTTPS.

As wireless services grew, many businesses needed to move from PC to mobile, and more and more development effort went into wireless. The architecture of the time, however, could not support parallel development by multiple businesses and multiple teams. Each business feature needed its own branch, so N businesses meant N branches open at once, merging code was painful, and the architecture as a whole became a major bottleneck.

In 2013 we upgraded the app architecture by introducing an API gateway. The back end is exposed externally through a single unified interface, while internally it can be split into many separate services. Each system is developed and released independently of the others, and multiple client apps can connect, such as the Koubei (“Word-of-mouth”) app and the main Alipay app.

Most importantly, we introduced the mobile RPC development model. RPC interfaces are defined in an intermediate DSL, from which code for multiple client platforms is generated. The RPC framework takes care of all the communication details in between, so the client only needs to care about business logic.

The API gateway architecture covers the complete API service life cycle: API development, configuration, release, operation, and finally retirement. To support development we built many tools, such as code generation and API testing tools. For services in production we have a complete monitoring system that scores each API on metrics such as response time and data transfer size. For example, when an API’s error rate exceeds a threshold, we send an email alert. We also provide extensive client-side and server-side diagnostics, and support applications on all platforms.
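The threshold-based alert described above can be sketched as follows. This is a minimal illustration, not the actual monitoring system: the `ApiStats` class, the `should_alert` helper, the API name and the 5% default threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ApiStats:
    """Per-API counters for one monitoring window (hypothetical structure)."""
    name: str
    requests: int = 0
    errors: int = 0
    total_latency_ms: float = 0.0

    def record(self, latency_ms: float, ok: bool) -> None:
        """Account for one finished call."""
        self.requests += 1
        self.total_latency_ms += latency_ms
        if not ok:
            self.errors += 1

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def avg_latency_ms(self) -> float:
        return self.total_latency_ms / self.requests if self.requests else 0.0


def should_alert(stats: ApiStats, max_error_rate: float = 0.05) -> bool:
    """Return True when the window's error rate exceeds the threshold."""
    return stats.error_rate > max_error_rate


# Simulate one window in which 10 of 100 calls fail.
stats = ApiStats("hypothetical.bill.query")
for i in range(100):
    stats.record(latency_ms=20.0, ok=(i % 10 != 0))
```

In a real system the alert would feed an email or paging channel; here `should_alert(stats)` only returns the decision.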

In addition, we introduced a mobile RPC mechanism.

During development, the server side exposes an interface, and the service is automatically pulled in and registered in the gateway console. Business developers can then generate RPC code for each client platform and hand it to the client team for integration. The client uses the generated RPC code to send requests to the gateway, which forwards them to the business server. The whole process is simple, and overall R&D efficiency improves greatly.

In 2015 Alipay began to experiment with social networking. Optimizing the platform architecture became urgent, and the new business scenarios also placed greater demands on the stability of the app, so the third-generation mobile access architecture was born.

  • First, we optimized the network protocol: communication between client and server moved to long-lived connections, and we defined the long-connection protocol MMTP.
  • Second, we introduced the SYNC mechanism, through which the server can actively push synchronization data to the client.
  • Third, we introduced mobile scheduling, which supports a variety of customized scheduling such as data center disaster recovery and whitelist scheduling.

Let’s look at network protocol optimization first.

The lowest layer of our network transport stack is SSL/TLS, where Ant Financial developed its own MTLS based on TLS 1.3. Above it is the session layer, which was originally based on HTTP and is now based on our self-developed communication protocol MMTP. The top layer consists of the application-layer protocols RPC, SYNC and PUSH.

RPC handles the “request-response” communication mode. SYNC handles the mode in which “the server pushes data directly to the client”. PUSH handles “traditional pop-up push notifications”.

In addition, we extended HTTP/2 with a private frame protocol, H2+, to support custom two-way communication. HTTP/2 is essentially the next-generation communication protocol, and mainstream browsers already support it. We also adopted it on mobile because it has many features that suit mobile networks, such as multiplexing and HPACK header compression.

Let’s talk about the SYNC mechanism.

SYNC is essentially a synchronization protocol based on a SyncKey. Take the bill page as an example. Traditionally, when a client wants to show a user’s bills, it sends an RPC request and the server returns all of the data at once, which consumes a lot of traffic. With SYNC, only the differential data (what the client has not yet received) is synchronized, which saves traffic. Because each transfer is small, communication is more efficient and the client is more likely to receive the data successfully.

In addition, with SYNC the client does not need to be online in real time. If the user is offline, the SYNC server stores the differential data in a database, and the next time the client connects, that data is synchronized to the user. Within Alipay, we use SYNC for chat, configuration synchronization, data push and other scenarios.
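To make the idea of SyncKey-based differential synchronization concrete, here is a minimal sketch. It is not the real SYNC protocol: the `SyncServer` and `SyncClient` classes, the per-user append-only log, and the use of an increasing integer as the sync key are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SyncServer:
    """Keeps an append-only change log per user, keyed by an increasing sync key."""
    logs: dict = field(default_factory=dict)  # user_id -> list of (sync_key, payload)

    def publish(self, user_id: str, payload: str) -> int:
        """Store a change; it stays available even if the client is offline."""
        log = self.logs.setdefault(user_id, [])
        sync_key = len(log) + 1
        log.append((sync_key, payload))
        return sync_key

    def pull(self, user_id: str, client_sync_key: int) -> list:
        """Return only the entries newer than the client's sync key."""
        return [e for e in self.logs.get(user_id, []) if e[0] > client_sync_key]


@dataclass
class SyncClient:
    user_id: str
    sync_key: int = 0  # 0 means nothing has been received yet
    bills: list = field(default_factory=list)

    def sync(self, server: SyncServer) -> int:
        """Apply the delta and advance the local sync key; return the delta size."""
        delta = server.pull(self.user_id, self.sync_key)
        for key, payload in delta:
            self.bills.append(payload)
            self.sync_key = key
        return len(delta)


server = SyncServer()
client = SyncClient("user-1")
server.publish("user-1", "bill A")
server.publish("user-1", "bill B")
first = client.sync(server)          # receives both bills
server.publish("user-1", "bill C")   # published while the client is "offline"
second = client.sync(server)         # receives only the new bill
```

The second `sync` call transfers one entry instead of the full bill list, which is the traffic saving the text describes.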

Mobile scheduling is built on HTTPDNS rather than on traditional LocalDNS.

Traditional DNS is vulnerable to DNS hijacking, and the uneven quality of carrier DNS affects request and response quality. It also cannot support complex customized scheduling requirements such as LDC multi-center scheduling. We therefore built our own mobile scheduling service, AMDC, which supports disaster recovery, policy-based scheduling, channel optimization, and LDC whitelist scheduling.
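The kind of decision a scheduling service makes can be sketched as below. This does not describe how AMDC is implemented: the `pick_endpoints` function, the data center names and the policy order (whitelist first, then disaster recovery) are hypothetical, and the addresses come from the reserved documentation ranges. With HTTPDNS, a client would obtain such an answer from the scheduling service over HTTP(S) instead of asking the carrier’s LocalDNS.

```python
from typing import Dict, List, Set


def pick_endpoints(user_id: str,
                   idc_ips: Dict[str, List[str]],
                   default_idc: str,
                   whitelist: Dict[str, str],
                   down_idcs: Set[str]) -> List[str]:
    """Return the IP list a client should connect to."""
    # Whitelist scheduling: pinned users go to their assigned data center.
    idc = whitelist.get(user_id, default_idc)
    # Disaster recovery: if that data center is down, use the first healthy one.
    if idc in down_idcs:
        healthy = [name for name in idc_ips if name not in down_idcs]
        if not healthy:
            return []
        idc = healthy[0]
    return idc_ips[idc]


# Hypothetical topology with two data centers.
IDC_IPS = {"idc-a": ["192.0.2.10"], "idc-b": ["198.51.100.10"]}
WHITELIST = {"user-7": "idc-b"}

normal = pick_endpoints("user-1", IDC_IPS, "idc-a", WHITELIST, set())
pinned = pick_endpoints("user-7", IDC_IPS, "idc-a", WHITELIST, set())
failover = pick_endpoints("user-1", IDC_IPS, "idc-a", WHITELIST, {"idc-a"})
```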

In the fourth generation of the Alipay mobile architecture we did two main things: first, we unified the network library; second, we decentralized the gateway.

On the one hand, our client platforms need to cover iOS and Android today, and IoT RTOS and others in the future, and each platform we support needs its own network library. On the other hand, our client network library has rich and complex policies, and we often found that the policies were implemented differently on each platform, and these differences caused many unexpected problems.

For these two reasons, we decided to write a unified network library in C that runs on different platforms, so that all client network policies and scheduling are unified. This greatly reduces R&D cost: each requirement needs only one engineer, and the unified library can then be upgraded on every platform.

For some services we upgraded the gateway to a decentralized architecture. A centralized gateway has two problems. The first is capacity planning. Almost all Alipay interfaces now sit on the gateway platform, and before every campaign we have to estimate the request volume of the interfaces, but peak request volume is hard to predict, so each capacity figure is only a rough estimate. The cost of gateway servers also keeps rising, because every campaign brings a large volume of traffic and requires a massive expansion. The second is stability. The API gateway sits close to the business and changes relatively often, and sometimes a problem introduced by one business change brings down the whole gateway cluster and affects every business, so business-level isolation is impossible. We therefore decentralized the gateway: we eliminated the gateway as a separate tier, moved its API routing capability forward to the top-level access gateway, and extracted its core functions (such as verification, session handling and rate limiting) into a Jar that is integrated into each business system.

This has two advantages:

  • First, better performance: what used to be a remote call from the gateway to the service becomes a local call inside the JVM.
  • Second, better stability: each business system integrates a stable version of the gateway Jar, so when one system upgrades its gateway Jar, the others are not affected.

The drawbacks of gateway decentralization are also obvious. One is fragmentation: each business system integrates a different version of the gateway Jar. If, for example, a security hole is found in the gateway Jar, getting every business system to upgrade is painful, because each one has to integrate the new Jar, run regression tests, and go through an online release.

There are also Jar conflicts, and heterogeneous systems are hard to integrate. The emergence of ServiceMesh gave us a new idea: we put the gateway logic into the ServiceMesh network proxy and deploy it alongside the business system as a Sidecar, in the form of an independent process. This supports lossless, smooth upgrades as well as heterogeneous systems, and it solved the decentralized integration problem for Alipay’s internal Node.js and C systems.

3. How to handle the concurrency challenge of Chinese New Year red envelopes

Alipay has run red envelope campaigns since the Spring Festival of 2015. In 2016 Alipay partnered with the Spring Festival Gala (Chunwan), and the “Xiu Yi Xiu” red envelope feature peaked at 17.7 billion per minute, or nearly 300 million requests per second. How did we cope with that level of concurrency?

| How we respond

The process for a large Alipay campaign is as follows. First, the product manager settles the campaign mechanics several months in advance, and the engineers start the technical evaluation once they have them. After core indicators such as peak online users and core business request volume have been estimated, the technical solution is evaluated.

The technical solution starts from an analysis of the core link, followed by a capacity assessment of every system involved. After the capacity assessment we draw up the traffic limiting plan, and finally decide which systems or nodes along the link need optimization.

The last key question is whether dependencies on non-core businesses and non-core functions should be degraded. Once the technical solution is ready we run stress tests, and when the stress tests meet the target we run campaign drills. Problems found during the drills are fixed promptly. After that we prepare for the real event, and handle any problem that occurs as an emergency. When the campaign is over, we roll back the degradation strategies, data center ejection and other operations applied earlier.

How does our network access layer safeguard these campaigns? The following sections focus on traffic limiting and performance optimization at the access layer.

| Access-layer traffic limiting

Facing hundreds of millions of requests, the back-end business systems certainly cannot withstand the load, so the entry layer must protect them by limiting traffic.

The core idea is lossy service: keep the experience of core businesses within an acceptable range by degrading non-core functions and businesses. First, we lower the compression threshold to reduce performance overhead. In addition, we degrade all non-core and non-important interfaces, because limiting them does not affect the client experience.

We have implemented multi-layer traffic limiting mechanisms, including LVS traffic limiting, access traffic limiting, API gateway traffic limiting, and service traffic limiting:

LVS: a single-VIP LVS cluster usually has four machines, and one LVS cluster cannot carry the traffic on its own. We therefore assign multiple VIPs to each IDC, so that multiple LVS clusters share the traffic and resistance to DDoS attacks improves.

Access layer: provides TCP traffic limiting and core RPC traffic limiting.

API gateway: we built a tiered traffic limiting algorithm with a different strategy for interfaces with different request volumes. High-QPS interfaces use a simple counting algorithm: requests beyond the limit are rejected outright. Medium-QPS interfaces use a token bucket algorithm, which tolerates a certain amount of burst traffic. Low-QPS interfaces use distributed traffic limiting to keep the limit accurate.
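The two single-node strategies named above, a simple counter and a token bucket, can be sketched as follows. This is an illustrative sketch rather than the gateway’s code: the class names, the fixed one-second window and the explicit `now` argument (passed in so the behavior is deterministic) are assumptions.

```python
class CounterLimiter:
    """Fixed-window counter: reject once the count in the window reaches the limit."""

    def __init__(self, limit: int, window_s: float = 1.0):
        self.limit = limit
        self.window_s = window_s
        self.window_start = 0.0
        self.count = 0

    def allow(self, now: float) -> bool:
        if now - self.window_start >= self.window_s:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full so an initial burst is accepted
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


counter = CounterLimiter(limit=3)
counter_results = [counter.allow(0.1 * i) for i in range(5)]  # five calls in one window

bucket = TokenBucket(rate=2.0, burst=3)
burst_results = [bucket.allow(0.0) for _ in range(4)]  # four calls at the same instant
after_refill = bucket.allow(0.5)  # half a second later one token has been refilled
```

The counter is cheap, which suits very high QPS, while the bucket lets a short burst through before settling at the refill rate.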

| TLS performance optimization

At the access gateway, handling such a massive number of requests requires squeezing out as much performance as possible, so we did a lot of optimization work to reduce overhead.

First, the TLS optimizations. TLS makes an order-of-magnitude difference in performance (HTTP vs. HTTPS). Anyone familiar with encryption algorithms knows that the most expensive part of TLS is RSA encryption and decryption during the handshake. To reduce that cost on the server, our strategy a few years ago was hardware acceleration: RSA operations were offloaded to a dedicated hardware acceleration card. As TLS has evolved, RSA has largely been abandoned in favor of the newer ECDSA, whose underlying algorithm costs much less than RSA, by a factor of 5 to 6. In addition, we used the Session Ticket mechanism to cut the TLS handshake from 2 RTT to 1 RTT, which greatly improved performance.

| Compression algorithm optimization

The most commonly used compression algorithm is GZIP, and the two key indicators for compression are compression ratio and compression/decompression speed. We tried open source algorithms such as GZIP, LZ4, Brotli and ZSTD, and found that ZSTD, Facebook’s compression algorithm, came out ahead on both metrics. ZSTD depends heavily on its dictionary, however, so we built dictionaries suited to our own traffic by cleaning massive amounts of online data, which greatly improved both compression ratio and compression performance.
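Why a dictionary helps with small payloads can be shown with a short sketch. To stay within the Python standard library it uses zlib’s preset-dictionary feature as a stand-in for ZSTD dictionaries, so it illustrates the principle rather than the setup described above. The `DICTIONARY` bytes and the sample payload are made up.

```python
import zlib

# Hypothetical "trained" dictionary: byte sequences that recur in typical responses.
DICTIONARY = (b'{"resultStatus":1000,"memo":"","result":'
              b'{"userId":"","amount":"","currency":"CNY"}}')


def compress(payload: bytes, zdict: bytes = b"") -> bytes:
    """Deflate the payload, optionally priming the compressor with a dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(payload) + comp.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inflate the blob; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


payload = (b'{"resultStatus":1000,"memo":"","result":'
           b'{"userId":"2088","amount":"8.88","currency":"CNY"}}')
plain = compress(payload)                   # no dictionary
with_dict = compress(payload, DICTIONARY)   # shared dictionary
```

A small response has little internal repetition for the compressor to exploit, so most of the gain comes from matches against the shared dictionary. Both sides must hold the same dictionary, which is why its quality matters.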

4. Commercial application and output of Ant Financial’s mobile network technology

| mPaaS one-stop mobile development platform

The commercialization of Ant Financial’s mobile network technology relies on mPaaS, Ant Financial’s mobile development platform.

mPaaS grew out of the Alipay app’s mobile technology thinking and practice over the past 10 years. It provides a one-stop, cloud-to-device platform for mobile development, testing, operations, and maintenance. It lowers the technical barrier, reduces R&D cost, improves development efficiency, and helps ecosystem partners quickly build stable, high-quality mobile apps. The mobile network services in mPaaS include the MGS gateway service, the MSS data synchronization service, the MPS push service, the MDC scheduling service, and other network solutions.

| Integrating Ant Financial’s technical capabilities

On the server side, MGS (gateway service), MPS (push service) and MSS (synchronization service) are our core services. Together they cover the three modes of request-response, push and incremental update, and can meet most business scenarios. The open version of the gateway service supports HTTP, Dubbo, ZDAS, SOFA-RPC and other protocols, and its functions can be extended through plug-ins. MSS works in incremental update mode and can push in order: in chat, for example, messages must arrive one by one and never out of order. It can also deliver within seconds. For MPS, we build our own push channel in China, and when the self-built channel is unavailable we fall back to the push channels of Xiaomi, Huawei and other manufacturers to ensure high availability and a high delivery rate.
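The channel fallback described for MPS can be sketched as an ordered list of senders tried in turn. This is a hypothetical illustration, not the MPS API: `push_with_fallback`, the channel names and the stub senders are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

# A channel is a name plus a send function returning True on success.
Channel = Tuple[str, Callable[[str, str], bool]]


def push_with_fallback(device_id: str, message: str,
                       channels: List[Channel]) -> Optional[str]:
    """Try each channel in order; return the name of the one that delivered."""
    for name, send in channels:
        try:
            if send(device_id, message):
                return name
        except ConnectionError:
            # Channel unavailable: move on to the next one.
            continue
    return None


def self_built_channel(device_id: str, message: str) -> bool:
    """Stub that simulates the self-built channel being unavailable."""
    raise ConnectionError("self-built channel unavailable")


def vendor_channel(device_id: str, message: str) -> bool:
    """Stub that simulates a manufacturer channel accepting the message."""
    return True


used = push_with_fallback("device-1", "hello",
                          [("self-built", self_built_channel),
                           ("vendor", vendor_channel)])
```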

5. Conclusion

We hope this article has given you a preliminary understanding of how Ant Financial built its network access architecture to cope with hundred-million-level concurrency, and of MGS, the mPaaS mobile API gateway service.

For details about gateway functions, see the official mPaaS Mobile Gateway documentation: t.cn/EUL8D6o

Future articles will cover the design and optimization of other mPaaS service components.

Past reading

“Alipay App Build Optimization: Optimizing Android Startup Performance through Package Rearrangement”

“Alipay App Build Optimization: Extreme Compression of Android Package Size”

“A Brief Analysis of the Alipay Mini Program Framework and Its Deep Integration in mPaaS”

“Opening: Overview of the Ant Financial mPaaS Server Core Component System”

“Ant Financial mPaaS Server Core Component System Overview: Mobile API Gateway MGS”
