Author: Xianyu Technology, You Yu

I. Background:

IM messaging is an important tool for Xianyu users to discuss transactions, and it has two core objectives: first, users' messages must not be lost; second, they must reach the receiver in a timely manner. Depending on whether the receiver's device is online, IM messages are delivered through either the online channel or the offline channel. More than half of Xianyu's daily IM messages currently go through the online channel, whose arrival rate and timeliness directly affect the user experience. This article focuses on optimizing the stability of the online channel so that users' messages arrive in time.

II. The problems we face

1. The client-side long connection is interrupted

In IM scenarios, the user and the cloud communicate frequently, and messages are usually delivered by having the cloud push them down to the user. An online device therefore keeps a long-lived TCP connection to the cloud, over which the server can interact with it in a lightweight way. Downlink messages in modern IM systems are generally delivered over such a long connection. Xianyu messaging uses ACCS long connections; ACCS is a full-duplex, low-latency, high-security channel service provided by Taobao Wireless. However, because the network status of a user's device is unpredictable, many kinds of network anomalies can interrupt the long connection. Once the long connection is interrupted unexpectedly, the user cannot receive online messages in time. We therefore need to detect the interruption promptly and try to reconnect.

2. Pushed-down messages are not received

Detecting long-connection interruptions and reconnecting only keeps the long connection valid most of the time. When the long connection is broken or unstable, the client may not receive a pushed message at all. Simply put, the reconnection mechanism alone cannot guarantee that a downlink message is received. Typical cases are:
1) The long connection breaks while the server is sending the message, so the message is lost in transit and the client never receives it.
2) The device's online status is updated with a delay: the server believes the device is online when it has actually gone offline, so the message cannot be received.
3) The client receives the downlink message but fails in subsequent processing, such as a failed database write, so the message is not shown to the user.

According to our event-tracking statistics, the ACCS downlink success rate is about 97%.

An anxious reader may ask: is the remaining 3% of messages lost? No. Those messages are not lost, but they are not guaranteed to reach users in time. Our message synchronization model combines push and pull. When the client pulls, it fetches all messages between the device's current site (its sync position) and the server's latest site, so messages whose ACCS downlink failed are still obtained through an active pull. A pull is triggered in the following cases: 1) the user cold-starts the app, which synchronizes messages actively; 2) the user pulls down to refresh; 3) the app switches from background to foreground; 4) after receiving a pushed message, the client finds a gap between the new message's site and the latest local site, which triggers synchronization.
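The fourth trigger, gap detection on a pushed message, can be sketched as follows. This is only an illustration: the class and callback names are hypothetical, and it assumes that sites are consecutive integers, which the article does not state.

```python
from dataclasses import dataclass, field


@dataclass
class SyncClient:
    """Client-side push/pull state. All names here are hypothetical."""
    local_site: int = 0                        # latest site stored on the device
    inbox: list = field(default_factory=list)  # (site, body) pairs shown to the user

    def on_push(self, site, body, pull):
        """Handle one pushed message; pull(after) returns the messages after a site."""
        if site <= self.local_site:
            return "duplicate"                 # already stored, nothing to do
        if site == self.local_site + 1:
            self.inbox.append((site, body))    # contiguous: apply the push directly
            self.local_site = site
            return "applied"
        # Gap detected: pull everything after the local site instead.
        for pulled_site, pulled_body in pull(self.local_site):
            self.inbox.append((pulled_site, pulled_body))
            self.local_site = pulled_site
        return "synced"
```

If the push for site 2 is lost and the push for site 3 arrives, the client pulls sites 2 and 3 together, so the lost message is recovered only when a later push happens to arrive.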

These triggers depend largely on user behavior or on whether new messages arrive, so they can hardly guarantee timely arrival. For an IM app that users open frequently, this would not be a big problem. However, Xianyu's app activity is relatively low, and sometimes the app even relies on IM messages to bring users back. A delayed message may cause a user to miss a transaction, so Xianyu messaging cannot tolerate such delays. Based on this analysis, we first define a metric that reflects the current situation. As described above, not every message sent through ACCS arrives by push; some are obtained by active pull. A pushed message arrives in time, while a pulled message is limited by user behavior. We call the pulled portion ACCS message compensation arrival and measure the ACCS message compensation arrival time, restricted to messages that the server sent down through ACCS successfully but that the client obtained through an active pull. In the previous version this figure was about 60 minutes. Note that this figure is not the time a message takes to reach the user: if the device has gone from online to offline, the pull time depends on user behavior (when the user opens the app). It does, however, roughly reflect the arrival delay of online messages.

In the rest of this article, we describe how we optimized online channel stability from two angles: reconnecting long connections and retransmitting messages that were not received.

III. Long connection reconnection

1. Why is the long connection interrupted?

The possible causes are as follows: 1) the user's device disconnects from the network; 2) the device switches networks; 3) the device is in a weak, unstable network environment; 4) the device is working normally, but the carrier drops the TCP connection because of a NAT timeout.

If the network status changes because of a user operation, a network status change event is raised. In that case the client can listen for the event and proactively reconnect. In reality, however, most interruptions are unexpected. So how can anomalies be detected effectively?

2. Heartbeat detection

As in most detection scenarios, the most effective means is heartbeat detection. The client detects a connection interruption by sending heartbeat packets periodically. For timeliness, the shorter the heartbeat interval the better, but frequent heartbeats inevitably cost the user data traffic and battery. Our goal is therefore to detect unexpected long-connection interruptions as quickly as possible with as few heartbeats as possible.

State machine + Message heartbeat queue:

In heartbeat protocol design, keep in mind that the core purpose of a heartbeat packet is to test whether the long channel is open: if the client sends a heartbeat uplink and receives the server's response packet, the channel is considered healthy. Both the uplink packet and the response packet should therefore be as small as possible. In general, a flag in the protocol header is enough to identify heartbeat requests and responses.
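As an illustration of how small such packets can be, the sketch below encodes a heartbeat as a one-byte type flag plus a four-byte sequence number. The frame layout and constant names are assumptions made for this example; the article does not describe the actual ACCS wire format.

```python
import struct

# Hypothetical frame types; not the real ACCS protocol.
HEARTBEAT_PING = 0x01
HEARTBEAT_PONG = 0x02
HEADER = struct.Struct("!BI")  # network byte order: 1-byte type flag + 4-byte sequence


def encode_ping(seq):
    """Build the 5-byte uplink heartbeat for sequence number seq."""
    return HEADER.pack(HEARTBEAT_PING, seq)


def encode_pong(ping_frame):
    """Server side: echo the sequence number back with the response flag."""
    frame_type, seq = HEADER.unpack(ping_frame)
    if frame_type != HEARTBEAT_PING:
        raise ValueError("not a heartbeat ping")
    return HEADER.pack(HEARTBEAT_PONG, seq)


def is_matching_pong(frame, seq):
    """Client side: the channel is healthy if the response echoes our sequence."""
    frame_type, echoed = HEADER.unpack(frame)
    return frame_type == HEARTBEAT_PONG and echoed == seq
```

Echoing the sequence number lets the client match a response to the heartbeat it sent, so a late response to an earlier heartbeat is not mistaken for the current one.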

3. Heartbeat policy

The heartbeat strategy is the core mechanism for achieving the goal above, but its detailed design could fill a separate article. This article only briefly lists several heartbeat strategies; interested readers can continue with the articles recommended at the end.

  • Short heartbeat: used for detection in the initial state; after ACK packets are received three consecutive times, the connection is considered stable
  • Regular fixed-interval heartbeat: the frequency (Mid+, Mid-, Long) is adjusted according to the app state
  • Adaptive heartbeat: the heartbeat interval adapts automatically to changes in the device's network status
  • Redundant heartbeat: when the app switches from background to foreground, it sends one heartbeat proactively
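The first two strategies can be read as a small state machine. The sketch below is one possible reading, not the actual client logic: the interval values, the app-state names, and the mapping of app states to the Mid+/Mid-/Long frequencies are all assumptions, and the adaptive and redundant heartbeats are left out.

```python
class HeartbeatScheduler:
    """Chooses the next heartbeat interval; every constant here is hypothetical."""

    SHORT_INTERVAL = 5                  # seconds, used while probing
    STABLE_INTERVALS = {                # assumed mapping of app state to frequency
        "foreground": 45,               # "Mid+"
        "background": 120,              # "Mid-"
        "idle": 270,                    # "Long"
    }
    ACKS_TO_STABILIZE = 3               # three consecutive ACKs mean "stable"

    def __init__(self):
        self.state = "probing"
        self.consecutive_acks = 0

    def on_ack(self):
        """A heartbeat response arrived in time."""
        self.consecutive_acks += 1
        if self.consecutive_acks >= self.ACKS_TO_STABILIZE:
            self.state = "stable"

    def on_timeout(self):
        """A heartbeat response was missed: fall back to short heartbeats."""
        self.state = "probing"
        self.consecutive_acks = 0

    def next_interval(self, app_state):
        """Seconds to wait before sending the next heartbeat."""
        if self.state == "probing":
            return self.SHORT_INTERVAL
        return self.STABLE_INTERVALS[app_state]
```

Short heartbeats are sent only while the connection is unproven, which keeps the traffic and battery cost low once the channel is known to be stable.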

IV. Message ACK and retransmission

To solve the problem of pushed messages not being received, we introduced a message ACK and retransmission mechanism. The overall idea is that the client sends an ACK to the server after it receives an ACCS message and processes it successfully. On the server side, every message sent through ACCS is added to a retry queue; when the ACK arrives, the server updates the message's arrival status and stops retrying.

Overall design flow chart:

The difficulty of this scheme lies in the design of the retry processor, which we describe in detail next.

1. Retry queue storage design

We store the arrival status of downlink messages using the Timeline model of Alibaba Cloud Tablestore. Tablestore is a multi-model structured data store developed by Alibaba Cloud that provides massive structured data storage with fast query and analysis. Its distributed storage and indexing engine support petabyte-scale storage, on the order of ten million TPS, and millisecond-level latency. The Timeline model is a data model designed for message data. It meets the particular requirements of such scenarios, namely ordered messages, massive message storage, and real-time synchronization, and is widely used in IM and feed-stream scenarios.

We define one Timeline for each user device. The timeline ID is defined as userId_deviceId, and the sequenceId is defined as the message site (its sync position). The storage structure is as follows:

Each message sent down successfully through ACCS is inserted into the Timeline of the receiving user device. When the ACK is received, the message's arrival status is updated by message ID. Because retries only happen within a short period after the downlink, we set a relatively short global expiration time to keep the data from growing without bound.
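A minimal in-memory sketch of this per-device timeline is shown below. It only mirrors the description above (timeline ID, sequenceId, arrival status, global expiration); it does not use the Tablestore SDK, and the field names and the TTL value are assumptions.

```python
import time


class ArrivalTimeline:
    """In-memory stand-in for the per-device timeline; not the Tablestore API."""

    def __init__(self, ttl_seconds=600, clock=time.time):
        self.ttl = ttl_seconds   # hypothetical global expiration time
        self.clock = clock
        self.rows = {}           # (timeline_id, sequence_id) -> row dict

    @staticmethod
    def timeline_id(user_id, device_id):
        return f"{user_id}_{device_id}"

    def insert(self, user_id, device_id, sequence_id, message_id):
        """Record a message sent down through ACCS; it starts as not arrived."""
        key = (self.timeline_id(user_id, device_id), sequence_id)
        self.rows[key] = {
            "message_id": message_id,
            "arrived": False,
            "expires_at": self.clock() + self.ttl,
        }

    def mark_arrived(self, user_id, device_id, message_id):
        """Update the arrival status by message ID when the ACK comes in."""
        tid = self.timeline_id(user_id, device_id)
        for (row_tid, _), row in self.rows.items():
            if row_tid == tid and row["message_id"] == message_id:
                row["arrived"] = True
                return True
        return False

    def purge_expired(self):
        """Drop rows past their expiration time; returns how many were removed."""
        now = self.clock()
        expired = [key for key, row in self.rows.items() if row["expires_at"] <= now]
        for key in expired:
            del self.rows[key]
        return len(expired)
```

The linear scan in `mark_arrived` is acceptable only because the short expiration keeps each device's timeline small; a real store would index rows by message ID.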

2. Delay retry design

1) Every time a message is sent down through ACCS, it is first inserted into the Timeline with the initial state "not arrived", and a delay message with a delay of N seconds is produced.
2) Each time a delay message is consumed, the arrival state of the message is read from Tablestore. If the message has arrived, retrying stops; otherwise processing continues.
3) If the device is not online, the message is forwarded to the offline channel and retrying stops. If the device is online, the message is pushed again and delayed for another N seconds.
4) Within a message's retry life cycle, the delay message is consumed at most M times. Beyond that, the message is no longer retried and the event is logged, so that the situation can be monitored and the strategy optimized based on the data.
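The four steps above can be sketched as a loop over a delay queue. The sketch replaces the real delay-message middleware and Tablestore with an in-memory heap and dictionary, and every name in it is hypothetical; the caller passes in the current time so the behavior is easy to follow.

```python
import heapq


class RetryProcessor:
    """Delay-retry loop over an in-memory heap; a stand-in for a real delay queue."""

    def __init__(self, push, push_offline, is_online, delay=10, max_retries=6):
        self.push = push                  # push(device_id, message_id): online channel
        self.push_offline = push_offline  # push_offline(device_id, message_id)
        self.is_online = is_online        # is_online(device_id) -> bool
        self.delay = delay                # N seconds between retries
        self.max_retries = max_retries    # M retries at most
        self.arrived = {}                 # message_id -> arrival state
        self.queue = []                   # heap of (due, message_id, device_id, attempt)
        self.exhausted = []               # messages that ran out of retries (for logging)

    def send(self, device_id, message_id, now):
        """Step 1: record 'not arrived', push, and schedule the first check."""
        self.arrived[message_id] = False
        self.push(device_id, message_id)
        heapq.heappush(self.queue, (now + self.delay, message_id, device_id, 1))

    def on_ack(self, message_id):
        if message_id in self.arrived:
            self.arrived[message_id] = True

    def run_due(self, now):
        """Consume every delay message whose due time has passed."""
        while self.queue and self.queue[0][0] <= now:
            due, message_id, device_id, attempt = heapq.heappop(self.queue)
            if self.arrived[message_id]:
                continue                  # step 2: ACK received, stop retrying
            if not self.is_online(device_id):
                self.push_offline(device_id, message_id)
                continue                  # step 3: offline channel takes over
            if attempt > self.max_retries:
                self.exhausted.append(message_id)
                continue                  # step 4: give up and log
            self.push(device_id, message_id)
            heapq.heappush(
                self.queue, (due + self.delay, message_id, device_id, attempt + 1)
            )
```

In this sketch `attempt` counts the retry about to be made, so a message is pushed once initially and retried at most `max_retries` times before it is recorded in `exhausted`.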

3. Delay retransmission policy

The delay retransmission strategy determines how to choose the delay between retransmissions so as to maximize retransmission efficiency. Users' network environments differ greatly across times and places, and so does the time their networks need to return to a stable state, so the delay strategy has to be chosen carefully. The optimal strategy delivers messages successfully with the fewest retransmissions in the shortest time.

3.1 Fixed delay time

To find the optimal delay strategy, we have to look at the data, because intuition is often far from reality. We first analyzed a batch of data collected with a fixed delay of 10s and a maximum of 6 retries.

This set of data shows that about 85% of the messages are resent successfully within 40s, that is, within the first four retries, while 12% still have no ACK after the maximum number of retries. Beyond the fourth retry, only 2.03% of the messages succeed on the fifth retry and only 0.92% on the sixth, so the benefit of further resending is already very low. For the messages that still have no ACK after six retries, a fixed delay is not cost-effective, and frequent retransmission wastes system resources. So we kept improving the strategy.

3.2 Fixed delay + increasing fixed step size

Some users' networks cannot recover within a short period, so frequent short-interval retransmission has little value for them. We therefore use a fixed short delay of N seconds for the first four retries; after that, each delay is the previous delay plus a fixed step of M seconds. This continues until an ACK is received, the user device goes offline, or the maximum delay MAX(N) is reached. This strategy mitigates the problem of the fixed-delay strategy to some extent, but if the user's network cannot recover in a short time, every retransmission cycle has to ramp the delay up from the beginning again, so it is still not optimal.
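The delay sequence this strategy produces can be written as a small generator. The parameter names and the values in the usage note are illustrative only; the article does not give the actual N, M, or MAX(N).

```python
def stepped_delays(base, step, max_delay, fixed_count=4):
    """Yield retry delays: fixed_count short delays, then grow by step up to max_delay."""
    delay = base
    for _ in range(fixed_count):
        yield delay                              # fixed short interval of N seconds
    while delay < max_delay:
        delay = min(delay + step, max_delay)     # previous delay plus step M, capped
        yield delay
```

For example, `list(stepped_delays(10, 20, 60))` gives `[10, 10, 10, 10, 30, 50, 60]`. The caller is expected to stop consuming the sequence as soon as an ACK arrives or the device goes offline.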

3.3 Adaptive delay

Design flow chart:

We eventually arrived at an adaptive delay strategy. Adaptive delay means adjusting the delay automatically according to the user's network status in order to achieve the highest retransmission efficiency. A new message first goes through four fixed short delays of N seconds to probe the network status of the device. Once the network has recovered, we clear the device's N value. The device N value is the minimum time the device's network currently needs to respond with an ACK, based on previous retransmission experience. It is empty by default, which indicates that the device's network is normal. If no ACK is received after four retransmissions, we read the device's N value; if it is empty, the initial value is used. From then on, each delay increases by a fixed step M, and the device's N value is updated after every retransmission, until the message receives an ACK or the maximum delay MAX(N) is reached.
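One possible reading of this strategy is sketched below. The class, its parameters, and the default values are hypothetical, and the choice to clear the device N value only when the ACK arrives during the probe phase is an interpretation of the description rather than something the article states.

```python
class AdaptiveDelay:
    """Per-device adaptive retry delay; one interpretation, with made-up defaults."""

    def __init__(self, probe_delay=10, initial=30, step=20, max_delay=120, probes=4):
        self.probe_delay = probe_delay  # fixed short delay N for the first probes
        self.initial = initial          # starting delay when the device has no N value
        self.step = step                # fixed step M
        self.max_delay = max_delay      # MAX(N)
        self.probes = probes
        self.device_n = {}              # device_id -> remembered N; absent = network normal

    def next_delay(self, device_id, attempt, last_delay=None):
        """Delay before retry number attempt (1-based), or None to stop retrying."""
        if attempt <= self.probes:
            return self.probe_delay
        if attempt == self.probes + 1:
            # First adaptive retry: start from what we learned about this device.
            delay = self.device_n.get(device_id, self.initial)
        else:
            delay = last_delay + self.step
        if delay > self.max_delay:
            return None
        self.device_n[device_id] = delay  # remember the delay for later messages
        return delay

    def on_ack(self, device_id, attempt):
        """An ACK during the probe phase means the network has recovered."""
        if attempt <= self.probes:
            self.device_n.pop(device_id, None)
```

With this reading, a later message to a device whose network is still poor starts its adaptive phase from the remembered delay instead of ramping up from the initial value again.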

4. Compatibility with older versions

Note that older versions of the app do not send back an ACK. If messages sent to devices running an old version were added to the retry queue, they would be retried until the maximum number of retries was reached, consuming resources for nothing. We therefore have the client proactively report its device information, including the app version, after the ACCS long connection is established, and the server stores this information for a certain period of time. Before adding a message to the retry queue, the server checks the receiver's app version and adds the message only if that version supports ACK.
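The version check can be sketched as follows. The minimum version, the dotted version format, and the registry class are assumptions made for this example, and the expiry of stored device information is omitted for brevity.

```python
MIN_ACK_VERSION = (7, 0, 0)  # hypothetical first app version that sends ACKs


def parse_version(text):
    """Turn a dotted version string such as '7.1.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


class DeviceRegistry:
    """Stores the app version each device reports after connecting."""

    def __init__(self):
        self.versions = {}  # device_id -> version tuple

    def report(self, device_id, app_version):
        self.versions[device_id] = parse_version(app_version)

    def supports_ack(self, device_id):
        """Only devices known to run an ACK-capable version enter the retry queue."""
        version = self.versions.get(device_id)
        return version is not None and version >= MIN_ACK_VERSION
```

Treating an unknown device as not supporting ACK is the conservative choice here: such messages skip the retry queue and fall back to the existing pull-based compensation.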

V. Results

After the reconnection and retransmission scheme went live, the ACCS compensation arrival time defined above dropped sharply from 60 minutes to 15 minutes, a decrease of about 75%, which confirms our technical analysis. At the same time, user feedback about delayed messages fell significantly, showing that the retransmission mechanism is effective in making sure messages reach users.

VI. Future prospects

The stability optimization of the online message channel has reached a milestone. Going forward, we will keep improving the Xianyu messaging experience, in both basic functions and basic experience. For basic functions, recent versions already support message recall and drafts, and we will gradually add location sharing, conversation grouping, remarks, message search, and other functions. For basic experience, we have upgraded the UI style of Xianyu messaging and optimized the CPU and memory usage of the app's message tab. We will continue to optimize the messaging experience in terms of data traffic, power consumption, and performance.

  • Message System Architecture in Modern IM Systems: Architecture
  • Message System Architecture in Modern IM Systems: Model
  • Optimization Practice of High-Concurrency IM System Architecture