This article is based on Li Qingxin's talk at the 2021 vivo Developer Conference. Reply **[2021VDC]** to the official account to obtain the materials for the Internet technology track topics.

I. Introduction to the vivo push platform

1.1 Understanding the push platform from the product and technology perspectives

What does a push platform do?

Some of you may already know it; for others this may be your first encounter. Either way, we hope today's sharing gives you a new understanding of us. Next, I will introduce the vivo push platform from two perspectives: product and technology.

First, from a product perspective: through deep integration with the operating system, the vivo push platform provides a stable, reliable, secure, and controllable message push service that supports a push speed of one million messages per second and hundreds of millions of users online simultaneously, helping developers in different industries tap more operational value. The core capability of the push platform is to use long-connection technology to deliver content and services to users' smart devices and mobile phones in real time and in both directions.

So if you are an operations person, you can consider using our push platform to operate your app on vivo phones and improve its activity and retention. And what is a push platform in essence?

From a technical point of view, we are a platform that sends messages to users over TCP long connections. In essence, the push platform delivers messages to users' devices through a network channel.

We all receive package-delivery notifications every day: when the courier puts a package into a parcel locker, the delivery backend automatically pushes a message to notify you. I'm sure that if you are an operations person, you will also like this efficient way of delivering messages automatically. If you are interested, after this talk you can learn more about us through the vivo open platform entrance and choose Message Push.

1.2 Content, service, and device interconnection

In this era of the Internet of Everything, our platform also has the ability to connect more things. Through long connections we link content, services, and users together, distribute content to users, and provide real-time, two-way communication capabilities for terminal devices.

We have mentioned the concept of a long connection, so what is a long connection? A long connection is a network connection that the client and server keep open for a relatively long period of time and over which they can communicate in both directions (for example, a TCP long connection).
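
For illustration, here is a minimal client-side sketch of what a TCP long connection looks like: the socket is opened once and kept open, and the server can push frames at any moment without the client asking for them. The endpoint and the length-prefixed framing are assumptions, not the platform's actual protocol.

```java
// Minimal long-connection client sketch (hypothetical endpoint and framing).
import java.io.DataInputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LongConnectionClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("push.example.com", 443)) { // hypothetical gateway address
            socket.setKeepAlive(true); // TCP-level keepalive as a safety net; app-level heartbeats come later
            DataInputStream in = new DataInputStream(socket.getInputStream());
            while (true) {
                int length = in.readInt();           // simple length-prefixed framing (assumed)
                byte[] payload = new byte[length];
                in.readFully(payload);               // block until the server pushes the next message
                System.out.println("pushed message: " + new String(payload, StandardCharsets.UTF_8));
            }
        }
    }
}
```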

Why do we use long connections instead of short connections as the underlying network communication of the platform?

Let's look at message delivery over short connections first. With short connections, polling is used: the client periodically asks the backend whether there is a message for device A, and the backend returns the message when there is one. Many of these requests are useless and waste traffic, and when the backend needs to send a message to device A, it cannot be delivered until device A polls again. With a long connection, when there is a message for device A, the backend sends it to device A directly without waiting for the device to pull it. Long connections therefore make data interaction more natural and efficient. In addition, our platform has the following technical advantages:

  1. More than 100 million devices online at the same time;

  2. Push speeds of millions of messages per second;

  3. More than 10 billion messages of throughput per day;

  4. Real-time analysis of push effectiveness;

  5. Real-time auditing of all push messages.

These capabilities guarantee the timeliness of messages, and they have been evolving constantly. Next, I will share how the architecture of the vivo push platform has changed in recent years.

II. Evolution of the vivo push platform architecture

2.1 Embracing the business

Architecture in the IT field is not static; it changes at different stages, and the main driving force behind architecture evolution is business requirements. Let's review the business history of our platform.

Since the project was established in 2015, as business volume has grown we have continuously added capabilities to the system to meet the needs of different business scenarios, such as full content auditing, IM, IoT, and WebSocket communication.

As can be seen from the figure, our business volume grows by billions every year, which puts pressure on the system, and problems in the original architecture gradually surfaced, such as latency and performance bottlenecks. Architecture serves the business: before 2018, all of our platform's services ran on the cloud, while the internal services we depended on were deployed in our own data center.

2.2 Daring to change

As business volume grew, data transfer between the cloud and the self-built data center developed a latency problem that kept getting worse and hindered the expansion of platform features. So in the second half of 2018 we changed the deployment architecture and moved all core logic modules into our own data center. After this optimization the data latency problem was completely solved, and it laid the foundation for further evolution of the architecture. As you can see from the figure above, our access gateways were also optimized into a three-region deployment.

Why a three-region deployment rather than more regions? Mainly for the following three reasons:

  • The first is user distribution and cost;

  • The second is to give users nearby access;

  • The third is to give the access gateways a certain degree of disaster tolerance.

You can imagine that without multi-region deployment, a failure in the access gateway data center would paralyze our platform.

As the platform's business scale expanded further and daily throughput reached 1 billion messages, users' requirements for timeliness and concurrency kept rising. The logical-service architecture we had in 2018 could no longer meet the demand for high concurrency, or could only do so at a higher server cost. Therefore, from the perspective of platform features and cost optimization, we refactored the system in 2019 to provide users with richer product features and a more stable, higher-performance platform.

2.3 Empowering services with long-connection capability

As the company's large-scale long-connection service platform, we have accumulated very rich long-connection experience, and we have been thinking about how to empower more businesses with this capability. On the server side, our platform's modules communicate with each other through RPC calls, a very efficient development model that frees developers from caring about the underlying network packets.

It would be a great development experience if the client could also call the backend through RPC. In the future we will provide the VRPC communication framework to solve the communication and development-efficiency problems between client and backend, giving both sides a consistent development experience so that more developers no longer need to care about network communication and can concentrate on business logic.

III. System stability, high performance, and security

For a push platform with a throughput of more than 10 billion messages, stability, high performance, and security are crucial. Next I would like to share our practical experience in these three areas.

As can be seen from the domain model in the figure above, our push platform takes the communication service as its core capability. On top of that core, we also provide big data services and an operations system, exposing different functions and services through different interfaces. The stability and performance of the communication service at the core of the push platform directly affect message timeliness. Message timeliness refers to the time from when the business side initiates a message to when the device receives it. How do you measure message timeliness?

3.1 Monitoring and quality measurement

The traditional way to measure message timeliness is shown on the left of the figure: the sender and the receiver are two different devices; the sending time T1 is recorded on one and the receiving time T2 on the other, and T2 minus T1 gives the message latency. But this method is not rigorous. Why? Because the clocks of the two devices are very likely not synchronized. Our solution, shown on the right, puts the sender and the receiver on the same device, which eliminates the clock-baseline problem. Based on this scheme, we built a dial-testing system to proactively monitor the distribution of message delivery times.
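
A minimal sketch of the dial-test idea: because the probe message is both sent from and received on the same device, both timestamps come from one clock and no cross-device synchronization is needed. The class and callback names below are hypothetical.

```java
// Same-device dial test: record T1 when the probe push is requested and T2 when it arrives.
import java.util.concurrent.ConcurrentHashMap;

public class DialTester {
    // messageId -> send timestamp T1, recorded when the push request is issued from this device
    private final ConcurrentHashMap<String, Long> sendTimes = new ConcurrentHashMap<>();

    /** Called right before invoking the push API for a probe message addressed to this same device. */
    public void onProbeSent(String messageId) {
        sendTimes.put(messageId, System.currentTimeMillis()); // T1
    }

    /** Called by the push SDK callback when the probe message arrives over the long connection. */
    public void onProbeReceived(String messageId) {
        Long t1 = sendTimes.remove(messageId);
        if (t1 != null) {
            long latencyMs = System.currentTimeMillis() - t1; // T2 - T1, measured on one clock
            report(messageId, latencyMs);
        }
    }

    private void report(String messageId, long latencyMs) {
        // In the real system this would feed the latency distribution into monitoring.
        System.out.printf("probe %s delivered in %d ms%n", messageId, latencyMs);
    }
}
```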

3.2 High-performance and stable long-connection gateway

Over the past ten years, discussions of single-machine long-connection performance have centered on the problem of ten thousand connections per machine. As a platform with hundreds of millions of devices online at the same time, we have to face the problem of one million connections per machine.

The long-connection gateway is responsible for maintaining TCP connections with devices and forwarding data packets. We want the gateway to be as lightweight as possible, so we refactored and optimized across the whole stack, from architecture design and coding down to operating system configuration and hardware features, for example:

  • Adjust the system-wide and per-process maximum number of file handles;

  • Tune NIC soft-interrupt load balancing, or enable NIC multi-queue and RPS/RFS;

  • Tune TCP parameters such as keepalive (adjusted according to session hold times) and disable TIME_WAIT recycling;

  • Use the AES-NI instruction set to accelerate data encryption and decryption.

After these optimizations, a single online server with 8 cores and 32 GB of RAM can stably support 1.7 million long connections.
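
The article does not name the networking framework behind the gateway. Purely as an illustration of a lightweight long-connection gateway, here is a minimal sketch using Netty (an assumption), with TCP keepalive, an idle timeout for silent connections, and a placeholder handler where decoding, authentication, and forwarding would go.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.ByteBuf;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class LongConnectionGateway {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // accepts connections
        EventLoopGroup workers = new NioEventLoopGroup(); // handles I/O for established connections
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)         // swap for the native epoll transport on Linux
                .option(ChannelOption.SO_BACKLOG, 4096)        // pending-accept queue length
                .childOption(ChannelOption.SO_KEEPALIVE, true) // TCP keepalive as a safety net
                .childOption(ChannelOption.TCP_NODELAY, true)  // push frames are small and latency-sensitive
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline()
                          // Fire an idle event after 5 minutes of silence (illustrative value).
                          .addLast(new IdleStateHandler(300, 0, 0))
                          .addLast(new ChannelInboundHandlerAdapter() {
                              @Override
                              public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                  // Real gateway: decode the frame, verify auth, answer heartbeats,
                                  // and forward payloads to backend services. Omitted here.
                                  ((ByteBuf) msg).release();
                              }

                              @Override
                              public void userEventTriggered(ChannelHandlerContext ctx, Object evt) {
                                  if (evt instanceof IdleStateEvent) {
                                      ctx.close(); // no heartbeat within the window: reclaim the connection
                                  }
                              }
                          });
                    }
                });
            bootstrap.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```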

Another difficulty is keeping connections alive. An end-to-end TCP connection passes through many routers and gateways, each with limited resources, so they cannot keep the state of every TCP connection for long. To avoid being disconnected when an intermediate device reclaims TCP state, we have to send heartbeat requests periodically to keep the connection active.

How often should heartbeats be sent? Sending them too frequently wastes power and traffic; sending them too slowly fails to keep the connection alive. Therefore, to reduce unnecessary heartbeats and improve connection stability, we adopt an intelligent heartbeat that uses different intervals for different network environments.
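
The article does not spell out the heartbeat algorithm. One common "smart heartbeat" approach, sketched below under that assumption, is to probe for the longest interval the current network keeps alive and back off when a heartbeat goes unanswered; the constants are purely illustrative.

```java
// Adaptive heartbeat interval: lengthen after repeated successes, shorten after a timeout.
public class SmartHeartbeat {
    private static final long MIN_INTERVAL_MS = 60_000;       // 1 minute lower bound
    private static final long MAX_INTERVAL_MS = 15 * 60_000;  // 15 minute upper bound
    private static final long STEP_MS = 30_000;

    private long currentIntervalMs = 4 * 60_000; // conservative starting point
    private int successesAtCurrent = 0;

    /** Called when a heartbeat was acknowledged by the gateway in time. */
    public synchronized void onHeartbeatSuccess() {
        successesAtCurrent++;
        // After several successes, try a longer interval to save power and traffic.
        if (successesAtCurrent >= 5 && currentIntervalMs < MAX_INTERVAL_MS) {
            currentIntervalMs = Math.min(currentIntervalMs + STEP_MS, MAX_INTERVAL_MS);
            successesAtCurrent = 0;
        }
    }

    /** Called when a heartbeat timed out, i.e. the connection was probably dropped by a NAT. */
    public synchronized void onHeartbeatTimeout() {
        currentIntervalMs = Math.max(currentIntervalMs - 2 * STEP_MS, MIN_INTERVAL_MS);
        successesAtCurrent = 0;
    }

    /** Interval to use when scheduling the next heartbeat; could also be keyed by network type. */
    public synchronized long nextIntervalMs() {
        return currentIntervalMs;
    }
}
```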

3.3 Load balancing for hundreds of millions of devices

More than 100 million devices are online on our platform at the same time. When a device connects to the long-connection gateway, the traffic scheduling system performs load balancing: when a client requests an IP address, the scheduling system returns the IP addresses of several nearby access gateways.

So how does the scheduling system ensure that the IP addresses it returns are usable? We use four strategies: proximity, public-network probing, machine load, and interface success rate. Why these strategies? You can think about these two questions:

  • If the internal network is healthy, is the public network necessarily reachable?

  • Is a server with fewer connections necessarily available?

The answer is no. The long-connection gateway and the traffic scheduling system keep each other alive over the internal network, so a gateway may look healthy to the scheduling system while its public-network connectivity is actually broken, for example because public access has not been opened. We therefore need to combine multiple strategies to evaluate node availability, keep the system load balanced, and guarantee system stability.
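
As an illustration of combining the four signals, here is a hypothetical scheduler that filters out nodes failing the public-network probe or with a poor interface success rate, then ranks the rest by proximity and load. It is a sketch of the idea, not vivo's actual scheduling logic; the thresholds and field names are assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class GatewayScheduler {

    /** A candidate access-gateway node with the health signals the scheduler tracks. */
    public record GatewayNode(String ip, String region, boolean publicProbeOk,
                              double loadRatio, double apiSuccessRate) {}

    /** Return up to {@code count} usable gateway IPs, preferring the client's own region. */
    public List<String> pickGateways(List<GatewayNode> nodes, String clientRegion, int count) {
        return nodes.stream()
            // Hard filters: the public-network probe must pass and the success rate must be acceptable.
            .filter(GatewayNode::publicProbeOk)
            .filter(n -> n.apiSuccessRate() >= 0.95)
            // Soft ranking: prefer nearby nodes, then lightly loaded ones.
            .sorted(Comparator
                .comparing((GatewayNode n) -> n.region().equals(clientRegion) ? 0 : 1)
                .thenComparingDouble(GatewayNode::loadRatio))
            .limit(count)
            .map(GatewayNode::ip)
            .collect(Collectors.toList());
    }
}
```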

3.4 How to Meet High Concurrency Requirements

Consider a scenario in which a piece of news is pushed to hundreds of millions of users at a rate of 1,000 messages per second: some users would only receive it several days later, which badly hurts the user experience. High concurrency is therefore essential for message timeliness.

Looking at the push flow, would you expect TiDB, which might at first be regarded as centralized storage, to become a performance bottleneck? In fact it does not, because we use a distributed cache that, according to certain policies, caches the centrally stored data on each business node, making full use of server resources and improving system performance and throughput. Our online distributed cache hit rate is 99.9%, so it shields the central storage from most requests; even if TiDB fails for a short time, we are not affected.
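
The article does not detail the caching policy. The sketch below shows one common cache-aside arrangement, in which each business node keeps a bounded local cache (Caffeine here, as an assumption) in front of the central store, so that only misses reach TiDB; the data shape and sizes are illustrative.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public class DeviceInfoCache {

    /** Placeholder for whatever per-device data the push flow needs (token, status, and so on). */
    public record DeviceInfo(String clientId, String pushToken, boolean online) {}

    private final LoadingCache<String, DeviceInfo> cache = Caffeine.newBuilder()
            .maximumSize(1_000_000)                   // bound memory per business node
            .expireAfterWrite(Duration.ofMinutes(10)) // tolerate slightly stale data
            .build(this::loadFromCentralStore);       // fall through to the central store only on a miss

    public DeviceInfo get(String clientId) {
        return cache.get(clientId); // the vast majority of requests are expected to be served here
    }

    private DeviceInfo loadFromCentralStore(String clientId) {
        // Hypothetical lookup against the central store; replace with a real TiDB (MySQL protocol) query.
        return new DeviceInfo(clientId, "token-" + clientId, true);
    }
}
```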

3.5 How to Ensure System Stability

As a push platform, our traffic consists mainly of external calls and internal calls between upstream and downstream modules. Large fluctuations in either will affect system stability, so we need rate limiting and speed control to keep the system running stably.

3.5.1 Push gateway rate limiting

As the traffic entrance, the stability of the push gateway is very important. To keep it stable, we first have to solve traffic balance, that is, avoid traffic skew: once traffic skews, it is likely to cause an avalanche.

We use a round-robin mechanism to balance traffic and avoid skew. There is a prerequisite, however: all push gateway nodes must have the same server configuration, otherwise nodes with less processing capacity may be overloaded. Second, we need to control the amount of concurrency entering the system to prevent traffic peaks from punching through the push gateway and overloading backend services. We use the token bucket algorithm to control the delivery speed of each push gateway and protect the downstream nodes.

What is the appropriate token rate? If it is set too low, downstream node resources are not fully utilized; if it is set too high, the downstream nodes may be overwhelmed. We therefore adopt an active + passive dynamic adjustment strategy (sketched in code after the list):

1) When traffic exceeds the processing capacity of the downstream cluster, the downstream notifies the upstream to slow down;

2) When calls to a downstream interface time out and the timeouts reach a certain proportion, the upstream limits its own rate.
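
The sketch below shows a minimal per-gateway token bucket whose rate can be adjusted both "actively" (the downstream asks for a new rate) and "passively" (the gateway shrinks the rate itself when timeouts pile up). It illustrates the idea only; the thresholds and method names are assumptions, not vivo's actual code.

```java
public class AdjustableTokenBucket {
    private final long capacity;          // burst size
    private volatile double refillPerSec; // current delivery speed, tokens per second
    private double tokens;
    private long lastRefillNanos;

    public AdjustableTokenBucket(long capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Try to take one token; returns false if the caller should hold the message back. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * refillPerSec);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }

    /** "Active" path: a downstream cluster asks us to slow down (or speed up). */
    public void updateRate(double newRefillPerSec) {
        this.refillPerSec = newRefillPerSec;
    }

    /** "Passive" path: shrink the rate ourselves when the downstream timeout ratio gets too high. */
    public void onTimeoutRatio(double timeoutRatio) {
        if (timeoutRatio > 0.05) {                  // illustrative threshold
            updateRate(Math.max(100, refillPerSec * 0.8));
        }
    }
}
```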

3.5.2 Internal rate limiting: smoothing label push

Since the push gateway already limits traffic, why do internal nodes also need rate limiting? This is determined by a business characteristic of our platform: it supports full-user and label-based push, and we must prevent high-performance modules from exhausting the resources of downstream nodes. Our label push module (which provides full-user and label-based push) is a high-performance service, so to limit its impact on the downstream we implemented a smooth-push feature based on Redis and the token bucket algorithm, controlling the push speed of each label task to protect downstream nodes.

In addition, our platform allows an application to create multiple label push tasks whose push speeds add up, so smoothing a single label task is not enough: the delivery module must also limit at application granularity to prevent an over-fast push from pressuring the business side's own backend.
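
Because several push nodes may work on the same label task, the speed cap has to be shared across nodes, which the article says is done with Redis and the token bucket algorithm. Below is a minimal sketch of such a Redis-backed bucket; the key layout, Lua script, and Jedis usage are illustrative assumptions. The same class can be keyed by task ID or by application ID to cover both granularities.

```java
import redis.clients.jedis.Jedis;
import java.util.Arrays;
import java.util.List;

public class RedisTokenBucket {

    // KEYS[1] = bucket key, ARGV[1] = rate/sec, ARGV[2] = capacity, ARGV[3] = now (ms), ARGV[4] = tokens wanted
    private static final String SCRIPT = String.join("\n",
        "local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'ts')",
        "local tokens = tonumber(bucket[1]) or tonumber(ARGV[2])",
        "local ts = tonumber(bucket[2]) or tonumber(ARGV[3])",
        "local refill = (tonumber(ARGV[3]) - ts) / 1000 * tonumber(ARGV[1])",
        "tokens = math.min(tonumber(ARGV[2]), tokens + refill)",
        "local allowed = 0",
        "if tokens >= tonumber(ARGV[4]) then tokens = tokens - tonumber(ARGV[4]) allowed = 1 end",
        "redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])",
        "redis.call('PEXPIRE', KEYS[1], 60000)",
        "return allowed");

    private final Jedis jedis;

    public RedisTokenBucket(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true if {@code permits} messages of the given task may be delivered now, cluster-wide. */
    public boolean tryAcquire(String taskId, int ratePerSec, int capacity, int permits) {
        List<String> keys = Arrays.asList("push:bucket:" + taskId);
        List<String> args = Arrays.asList(String.valueOf(ratePerSec), String.valueOf(capacity),
                String.valueOf(System.currentTimeMillis()), String.valueOf(permits));
        return Long.valueOf(1L).equals(jedis.eval(SCRIPT, keys, args));
    }
}
```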

3.5.3 Internal rate limiting: limiting the message delivery rate

Therefore, to achieve application-level rate limiting, we implemented distributed leaky-bucket rate limiting with Redis; the specific scheme is shown in the figure above. Why do we use the clientId (the device's unique identifier) rather than the application ID for consistent hashing? Mainly for load balancing: compared with the application ID, the clientId spreads requests far more evenly. Since this feature went live, business teams no longer need to worry that pushing too fast will put pressure on their own servers.

Will rate-limited messages be lost? Of course not: we keep them in a local cache and then write them to Redis in a scattered fashion, mainly to avoid storage hotspots.
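
One simple way to "scatter" the buffered messages, sketched below as an assumption about the layout rather than the actual scheme, is to shard them across many Redis keys by hashing the clientId so that no single key becomes a hotspot.

```java
import redis.clients.jedis.Jedis;

public class DeferredMessageStore {
    private static final int SHARDS = 64; // illustrative shard count

    private final Jedis jedis;

    public DeferredMessageStore(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Buffer a rate-limited message for later delivery, spreading writes across shard keys. */
    public void defer(String appId, String clientId, String payload) {
        int shard = Math.floorMod(clientId.hashCode(), SHARDS);
        String key = "push:deferred:" + appId + ":" + shard;
        jedis.rpush(key, payload); // a background job later drains these lists at the allowed rate
    }
}
```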

3.5.4 Circuit breaking and degradation

On a push platform, emergencies and hot news can bring large bursts of traffic to the system. How do we deal with sudden traffic?

As shown on the left of the figure, the traditional way to absorb burst traffic is to deploy a large amount of redundant hardware, which is costly and wastes resources; and when a burst does arrive, capacity still cannot be expanded in time, lowering the push success rate. So what did we do? We added a buffer channel built on message queues and containers, which required only small changes to the system. When there is no burst traffic we deploy a small number of machines; when a burst arrives, no manual intervention is needed, and the system scales out and back in automatically according to load.
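
The article does not name the message queue; the sketch below uses Kafka purely as an example of the buffer-channel idea: the front end only enqueues push tasks, and a separately scaled pool of container-based workers drains the topic at its own pace, so bursts are absorbed by the queue. Topic and class names are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class PushTaskBuffer {
    private final KafkaProducer<String, String> producer;

    public PushTaskBuffer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Accept a push task immediately; actual delivery happens when a worker consumes it. */
    public void submit(String taskId, String taskJson) {
        // Consumers of this topic run in containers that the platform scales up or down with load.
        producer.send(new ProducerRecord<>("push-task-buffer", taskId, taskJson));
    }
}
```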

3.6 Automated testing system based on Karate

In daily development, in order to deliver requirements quickly, we often neglect interface boundary testing, which poses a big quality risk to online services. In addition, you may have noticed that when different roles on a team communicate through different media such as Word, Excel, and XMind, information is lost to varying degrees. To improve this, we built an automated testing platform to raise testing efficiency and interface test-case coverage, and we adopted a unified language to reduce the information loss between roles. Test cases can also be managed centrally, which makes iteration and maintenance easier.

3.7 Content Security

As a push platform, we need to guarantee content security, so we provide content auditing capabilities. We use automatic auditing as the primary mechanism and manual auditing as a supplement to improve audit efficiency, and we audit content according to impact level and application classification to safeguard content security. As the figure shows, a business request is forwarded through the access gateway to the content audit system for the first layer of local-rule auditing; if no local rule matches, our anti-spam system is called for a further content audit.
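
The audit flow lends itself to a simple chain: cheap local rules run first, and only content they cannot judge is escalated to the remote anti-spam service. The sketch below is illustrative (class names, rules, and the interface are hypothetical), not the platform's actual implementation.

```java
import java.util.List;
import java.util.regex.Pattern;

public class ContentAuditor {

    public enum Verdict { PASS, REJECT, UNDECIDED }

    /** First layer: local rules, e.g. keyword/regex blocklists kept in memory. */
    static class LocalRuleAuditor {
        private final List<Pattern> blocked = List.of(Pattern.compile("(?i)forbidden-word"));

        Verdict audit(String content) {
            for (Pattern p : blocked) {
                if (p.matcher(content).find()) return Verdict.REJECT;
            }
            return Verdict.UNDECIDED; // local rules cannot decide; escalate
        }
    }

    /** Second layer: remote anti-spam service (a stub standing in for the real RPC call). */
    interface AntiSpamClient {
        Verdict audit(String content);
    }

    private final LocalRuleAuditor local = new LocalRuleAuditor();
    private final AntiSpamClient antiSpam;

    public ContentAuditor(AntiSpamClient antiSpam) {
        this.antiSpam = antiSpam;
    }

    public Verdict audit(String content) {
        Verdict v = local.audit(content);
        return v == Verdict.UNDECIDED ? antiSpam.audit(content) : v;
    }
}
```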

IV. Future plans for the platform

So far we have mainly covered the architecture evolution of our push platform over the past few years and our practices around stability, high performance, and security during that evolution. Next, I will introduce our key work for the future.

In order to provide users with a more usable, stable, and secure push platform, we will continue to invest in the following four areas:

  • First, on the basis of data consistency within individual modules, achieve data consistency across the whole system;

  • Second, although our platform already has some disaster-recovery and degradation capability, it is not yet enough, so we will continue to improve the circuit-breaking and degradation capability of every subsystem;

  • Third, we will continue to improve the usability of the platform to provide more convenient platform services for users;

  • Fourth, the ability to identify abnormal traffic is also important for a push platform, as it prevents abnormal traffic from affecting system stability, so we will build this capability in the future.

We also hope that as we continue to improve our platform capabilities, we can provide better services for you in the future.

Author: Vivo Internet Server Team - Li Qingxin