Author: Idle Fish Technology — Jing Song

1. The background

At the beginning of 2020, I took over Xianyu's messaging system. At the time it had a range of problems, and user complaints kept coming in: “Xianyu messages often get lost”, “the avatars of message users are mixed up”, and “the order status is wrong” (I suspect you are still complaining about Xianyu messages as you read this article). Stabilizing Xianyu messaging was therefore an urgent problem. We investigated existing solutions within the group, such as DingTalk's IMPaaS, but the cost and risk of migrating directly were high: server-side data would need to be double-written, and old and new versions would have to remain compatible.

Given Xianyu's existing messaging architecture and system, how can we ensure its stability? Where should governance begin? How stable is Xianyu messaging today, and how should stability be measured? I hope this article shows you a different side of Xianyu messaging.

2. Industry solutions

Message delivery can be roughly divided into three steps: the sender sends the message, the server receives and persists it, and the server notifies the receiver. Mobile networks make this harder: the network may drop just as you send a message, or it may recover while a message is still being sent, so the message has to be resent.

How can messages be delivered stably and reliably in such a complex network environment? The sender does not know whether its message has arrived, so guaranteeing delivery requires an acknowledgement mechanism along the following lines:

  1. The sender sends the message “Hello” and enters a waiting state.
  2. The receiver receives “Hello” and sends the sender an acknowledgement that the message has arrived.
  3. Once the sender receives the acknowledgement, the process is complete; otherwise the sender retries.
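
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Receiver`, `LossyLink`, and `send_with_retry` are hypothetical names, and the "network" is simulated by a link that loses a configurable number of acknowledgements.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical receiver that records every copy of a message it sees."""
    received: list = field(default_factory=list)

    def on_message(self, text: str) -> bool:
        self.received.append(text)
        return True  # the acknowledgement


@dataclass
class LossyLink:
    """Hypothetical link that loses the acknowledgement for the first `lost_acks` sends."""
    receiver: Receiver
    lost_acks: int = 0

    def send(self, text: str) -> bool:
        ack = self.receiver.on_message(text)
        if self.lost_acks > 0:
            self.lost_acks -= 1
            return False  # the message arrived, but the ack never came back
        return ack


def send_with_retry(link: LossyLink, text: str, max_attempts: int = 3) -> int:
    """Send `text` until an ack arrives. Returns the attempts used, or 0 on failure."""
    for attempt in range(1, max_attempts + 1):
        if link.send(text):
            return attempt
    return 0


receiver = Receiver()
link = LossyLink(receiver, lost_acks=1)
attempts = send_with_retry(link, "Hello")
print(attempts, receiver.received)
```

Note that when only the acknowledgement is lost, the retry delivers the same message twice, so any retry scheme of this kind also needs deduplication on the receiving side.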

The process looks simple, but in practice a server forwards the message in the middle, so the question is who sends the acknowledgement back, and when. A “must-reach” (guaranteed delivery) model found online is shown in the figure below:

[Sending process]

  • A sends a message request packet to IM-server, i.e. msg:R1
  • After processing it successfully, IM-server replies to A with a message response packet, i.e. msg:A1
  • If B is online at this point, IM-server actively sends B a message notification packet, i.e. msg:N1 (if B is not online, the message is stored offline)

[Receiving process]

  • B sends an ACK request packet to IM-server, i.e. ack:R2
  • After processing it successfully, IM-server replies to B with an ACK response packet, i.e. ack:A2
  • IM-server then actively sends A an ACK notification packet, i.e. ack:N2

Reliable message delivery is thus guaranteed by six packets. This model makes delivery deterministic: an error at any intermediate step can be detected through the request-ACK mechanism and then retried. Section 4.2 also builds on this model. Because the client sends messages directly over HTTP, no retry is needed on that side for now; retry logic was added mainly where the server pushes to the client.
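
As a rough illustration of the six-packet model, the sketch below lists the packets exchanged for one message and shows what sender A can conclude from the packets it has seen so far. The function names and the `offline-store` entry are assumptions made for this example; only the packet names (msg:R1 through ack:N2) come from the model above.

```python
def packet_trace(b_online: bool) -> list:
    """Ordered (from, to, packet) tuples for one message sent from A to B."""
    trace = [
        ("A", "IM-server", "msg:R1"),  # message request
        ("IM-server", "A", "msg:A1"),  # message response
    ]
    if not b_online:
        # B is offline: the server stores the message instead of notifying B.
        trace.append(("IM-server", "offline-store", "stored"))
        return trace
    trace += [
        ("IM-server", "B", "msg:N1"),  # message notification
        ("B", "IM-server", "ack:R2"),  # ACK request
        ("IM-server", "B", "ack:A2"),  # ACK response
        ("IM-server", "A", "ack:N2"),  # ACK notification
    ]
    return trace


def sender_view(packets_seen_by_a: set) -> str:
    """What A can conclude from the packets it has received so far."""
    if "ack:N2" in packets_seen_by_a:
        return "delivered to B"
    if "msg:A1" in packets_seen_by_a:
        return "accepted by server, not yet confirmed by B"
    return "unknown, retry msg:R1"


for packet in packet_trace(b_online=True):
    print(packet)
print(sender_view({"msg:A1"}))
```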

3. Problems with Xianyu messaging

When I first took over Xianyu messaging, there was no data on its stability, so the first step was a systematic investigation, starting with adding tracking points along the entire message link.

Based on the full message link, we identified several key indicators: send success rate, message arrival rate, and client persistence rate (the rate at which messages are written to the client's local database). All of these statistics are computed from tracking points. While adding them we found a major problem: Xianyu messages had no globally unique ID, so a message's life cycle could not be uniquely traced through the full link.

3.1 Message uniqueness problem

Previously, a Xianyu message was uniquely identified by three variables:

  • SessionID: the ID of the current session
  • SeqID: the sequence number of the message as sent locally by the user. The server does not care about this value and simply passes it through
  • Version: this is the important one. It is the sequence number of the message within the current session. The server's value is authoritative, but the client may also generate a provisional Version locally

Take the example in the figure above. When A and B send messages at the same time, each client generates the key fields described above locally. Suppose A's message (yellow) reaches the server first. Because no other version precedes it, the server returns the original data to A; A's client merges it with the local copy and keeps a single message. The server also pushes this message to B. Because B already has a local message with version=1, the message from the server is filtered out, and the message is lost.

When B's message then reaches the server, the server increments its version to 2, because a message with version=1 already exists. This message is pushed to A and merges cleanly with A's local messages. When it is returned to B and merged with B's local messages, however, two identical messages appear, and the message is duplicated. This is the main reason Xianyu used to lose and duplicate messages so often.
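
The scenario can be reproduced with a small sketch. It assumes, purely for illustration, that the old client merged messages into a dictionary keyed by (SessionID, Version); the `merge` function and the sample data are hypothetical and are not Xianyu's actual client code.

```python
def merge(local: dict, session_id: str, version: int, text: str) -> bool:
    """Old-style merge: a message is identified only by (session_id, version)."""
    key = (session_id, version)
    if key in local:
        return False  # treated as already present and filtered out
    local[key] = text
    return True


# B has sent "B: hi" and assigned it a locally generated version of 1.
b_local = {("s1", 1): "B: hi"}

# A's message reached the server first, so the server gave it version 1.
a_message_kept = merge(b_local, "s1", 1, "A: hello")

# The server renumbered B's own message to version 2 and echoed it back.
merge(b_local, "s1", 2, "B: hi")

print(a_message_kept)            # A's message was dropped on B's side
print(sorted(b_local.values()))  # B's own message now appears twice
```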

3.2 Message push logic problems

Xianyu's earlier message push logic also had a major problem. The sender submits message content over an HTTP request, which is basically fine; the problem arises when the server pushes to the other end, as shown in the figure below.

Before pushing a message to the client, the server checks whether the client is online. It pushes over the long connection only if the client is online; otherwise it sends an offline push. This is simple and crude: if the long connection is unstable, the client's real state diverges from the state stored on the server, and as a result messages are never pushed to the client.

3.3 Client Logic Problems

Besides the server-related problems above, another class of problems comes from the design of the client itself. They can be summarized as follows:

  1. Multithreading problems

Users reported that the message list page layout becomes garbled: the interface starts rendering before the local data has been fully initialized.

  2. Inaccurate unread counts and red dots

The data displayed locally is inconsistent with what is stored in the database.

  3. Message merging problems

Messages are merged locally in segments, which cannot guarantee message continuity or uniqueness.

To address cases like these, we first sorted out and refactored the client code. The resulting architecture is shown in the figure below:

4. Our solution – Engine upgrade

The first step in governance was to solve the uniqueness problem of Xianyu messages. We also looked at DingTalk's solution, in which the server maintains globally unique message IDs. Given the historical baggage of Xianyu messaging, we instead adopted a UUID as the unique message ID, which greatly improves both message-link tracking and deduplication.

4.1 Message uniqueness

In newer versions of the app, the client generates the UUID; for older versions that cannot, the server fills it in.

A message ID looks like a1a3ffa118834033ac7a8b8353b7c6d9. After receiving messages, the client first deduplicates them by MessageID and then sorts them by Timestamp. Clocks on different clients may differ, but the probability of two messages carrying the same timestamp is still fairly small.

- (void)combileMessages:(NSArray<PMessage *> *)messages {
    ...

    // 1. Deduplicate by messageId: a message with an existing ID replaces the stored copy
    NSMutableDictionary *messageMaps = [self containerMessageMap];
    for (PMessage *message in messages) {
        [messageMaps setObject:message forKey:message.messageId];
    }

    // 2. Sort the messages after merging
    NSMutableArray *tempMsgs = [NSMutableArray array];
    [tempMsgs addObjectsFromArray:messageMaps.allValues];
    [tempMsgs sortUsingComparator:^NSComparisonResult(PMessage * _Nonnull obj1, PMessage * _Nonnull obj2) {
        // Ascending order by timestamp; return a proper NSComparisonResult rather than a BOOL
        if (obj1.timestamp < obj2.timestamp) return NSOrderedAscending;
        if (obj1.timestamp > obj2.timestamp) return NSOrderedDescending;
        return NSOrderedSame;
    }];
    ...
}
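
The ID assignment rule itself (newer clients generate the UUID, the server fills it in for older ones) might look like the following server-side sketch. The field name `message_id` and the function `ensure_message_id` are assumptions for illustration; `uuid.uuid4().hex` is used because it yields a 32-character hexadecimal string in the same shape as the example ID above.

```python
import uuid


def ensure_message_id(message: dict) -> dict:
    """Return a copy of `message` that is guaranteed to carry a `message_id`.

    Newer clients are assumed to send their own ID, which is kept as-is;
    for older clients that send none, a random UUID is generated here.
    """
    if message.get("message_id"):
        return dict(message)
    return {**message, "message_id": uuid.uuid4().hex}


from_new_client = ensure_message_id(
    {"message_id": "a1a3ffa118834033ac7a8b8353b7c6d9", "text": "hi"}
)
from_old_client = ensure_message_id({"text": "hi"})
print(from_new_client["message_id"])
print(len(from_old_client["message_id"]))
```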

4.2 Resending and reconnecting

Building on the retransmission model in Section 2, Xianyu improved the retransmission logic on the server side and the reconnection logic on the client side.

  1. The client periodically checks whether the ACCS long connection is still connected
  2. The server checks whether the device is online. If it is, the server pushes the message and waits, with a timeout, for an Ack
  3. When the client receives the message, it returns an Ack
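
A minimal sketch of the server-side half of this flow follows. It makes several assumptions that the article does not spell out: the retry count, the idea that `push` returns `True` only if an Ack arrives before the timeout, and the fallback to offline storage once the retries are exhausted. All names are hypothetical.

```python
from typing import Callable


def push_message(
    is_online: Callable[[], bool],
    push: Callable[[str], bool],
    store_offline: Callable[[str], None],
    message: str,
    max_retries: int = 3,
) -> str:
    """Push `message` to a device and report how it was finally handled."""
    if not is_online():
        store_offline(message)
        return "offline"
    for _ in range(max_retries):
        # `push` is assumed to block until an Ack arrives or the timeout expires.
        if push(message):
            return "acked"
    # The recorded online status may be stale, so fall back to offline storage.
    store_offline(message)
    return "offline"


offline_box = []
acks = iter([False, False, True])  # the first two pushes time out, the third is acked
result = push_message(lambda: True, lambda m: next(acks), offline_box.append, "hello")
print(result, offline_box)
```

The fallback branch is what distinguishes this from the older logic in 3.2, which trusted the stored online status and never learned that a push had failed.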

We have already published an article, “Bye-bye to Message Delay: Solutions for the Timely Arrival of Xianyu Messages (Detailed)”, which explains the problems that network instability causes for Xianyu messages, so we will not go into detail here.

4.3 Data Synchronization

Retransmission and reconnection address problems at the basic network layer. Next we turn to the business layer, where many complex situations had been handled by piling compatibility code into business logic. Before reworking the data synchronization logic, we also looked at DingTalk's data synchronization scheme, which is guaranteed mainly by the server and backed by a stable long connection. The general process is as follows:

The Xianyu server does not have this capability for now (see the server storage model in 4.5 for details), so Xianyu can only control the data synchronization logic from the client. The synchronization methods include pulling sessions, pulling messages, and receiving pushed messages. Because the scenarios are complex, there used to be a case where a push triggered an incremental synchronization; when many pushes arrived, multiple network requests were triggered at the same time. To solve this, we isolated the push and pull queues.

The client-side strategy is this: if a pull is in progress, pushed messages are first added to a cache queue, and when the pull returns, its result is merged with the locally cached messages, which avoids issuing multiple network requests. A colleague has written an article on the push-pull flow control logic, “How to Effectively Shorten Xianyu Message Processing Time”, so we will not go into detail here.
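
The buffering strategy can be sketched as follows. `SyncController` and its fields are hypothetical names, and the merge is assumed to be keyed by message ID as in 4.1; the real client is more involved.

```python
class SyncController:
    """Buffers pushed messages while a pull is in flight, then merges everything once."""

    def __init__(self):
        self.pulling = False
        self.pending = []       # pushes that arrived during a pull
        self.store = {}         # message_id -> message
        self.pull_requests = 0  # number of network pulls actually issued

    def on_push(self, message: dict) -> None:
        if self.pulling:
            self.pending.append(message)  # do not trigger another request
            return
        self._merge([message])

    def begin_pull(self) -> bool:
        if self.pulling:
            return False  # a pull is already running; reuse its result
        self.pulling = True
        self.pull_requests += 1
        return True

    def end_pull(self, pulled: list) -> None:
        self._merge(pulled + self.pending)
        self.pending = []
        self.pulling = False

    def _merge(self, messages: list) -> None:
        for message in messages:
            self.store[message["message_id"]] = message

    def ordered(self) -> list:
        return sorted(self.store.values(), key=lambda m: m["timestamp"])


sync = SyncController()
sync.begin_pull()
sync.on_push({"message_id": "m2", "timestamp": 2})  # arrives mid-pull
sync.begin_pull()                                   # ignored: already pulling
sync.end_pull([{"message_id": "m1", "timestamp": 1},
               {"message_id": "m2", "timestamp": 2}])
print(sync.pull_requests, [m["message_id"] for m in sync.ordered()])
```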

4.4 Client Model

The client organizes its data into two main types: sessions and messages. Sessions are further divided into virtual nodes, session nodes, and folder nodes.

On the client side, a tree like the one shown above is built. The tree mainly stores information for displaying sessions, such as unread counts, red dots, and the latest message summary. An update to a child node automatically propagates to its parent, and building the tree is also the process of updating read and unread state. A more complex case is the “Xianyu Intelligence Agency” entry. It is actually a folder node containing many child sessions, which makes its message ordering, red-dot counting, and summary update logic more complex: the server tells the client the list of child sessions, and the client then assembles the data model from them.
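
The upward propagation of unread counts can be illustrated with a small tree. This is a simplified sketch with hypothetical names (`SessionNode`, `set_unread`); it tracks only the unread count, and `set_unread` is meant to be called on leaf sessions only.

```python
class SessionNode:
    """A node in the session tree; `unread` on a parent is the sum over its subtree."""

    def __init__(self, name: str, parent: "SessionNode" = None):
        self.name = name
        self.parent = parent
        self.children = []
        self.unread = 0
        if parent is not None:
            parent.children.append(self)

    def set_unread(self, count: int) -> None:
        """Set the unread count of a leaf session and propagate the change upward."""
        delta = count - self.unread
        node = self
        while node is not None:
            node.unread += delta
            node = node.parent


root = SessionNode("root")                     # virtual node
folder = SessionNode("folder", parent=root)    # folder node with child sessions
chat_a = SessionNode("chat_a", parent=folder)
chat_b = SessionNode("chat_b", parent=folder)
chat_c = SessionNode("chat_c", parent=root)    # ordinary session node

chat_a.set_unread(2)
chat_b.set_unread(3)
chat_c.set_unread(1)
print(folder.unread, root.unread)

chat_a.set_unread(0)  # the user reads chat_a
print(folder.unread, root.unread)
```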

4.5 Server Storage Model

The client request logic was outlined in 4.3, where history messages are synchronized either incrementally or as a full domain. The domain is a server-side concept: in essence, it is a cache layer for user messages, which are held there temporarily to speed up reads. This design has a flaw, however. The domain ring has a fixed length and stores at most 256 messages; once a user has more than 256 messages, the older ones can only be read from the database.

As for server-side storage, we also looked at DingTalk's approach, which is write diffusion. Its advantage is that messages can be customized for each user, as DingTalk's own logic does; its disadvantage is that it requires a very large amount of storage. Xianyu's approach sits somewhere between read diffusion and write diffusion. This design not only complicates the client logic but also slows down reads on the server side, and it is something we can optimize later.
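
The cache-then-database behaviour described above can be approximated with a bounded ring. The sketch below is an assumption-heavy illustration: `DomainCache` is a hypothetical name, the database is a plain list, and messages are addressed by their position in the user's history.

```python
from collections import deque


class DomainCache:
    """A bounded cache of the newest messages in front of a complete database."""

    def __init__(self, capacity: int = 256):
        self.ring = deque(maxlen=capacity)  # the oldest entries fall out automatically
        self.database = []                  # stands in for the real message store

    def append(self, message) -> None:
        self.database.append(message)
        self.ring.append(message)

    def read_since(self, index: int):
        """Return (messages from position `index` onward, the source that served them)."""
        oldest_cached = len(self.database) - len(self.ring)
        if index >= oldest_cached:
            return list(self.ring)[index - oldest_cached:], "cache"
        return self.database[index:], "database"


domain = DomainCache()
for i in range(300):
    domain.append(i)

recent, recent_source = domain.read_since(100)  # still inside the newest 256
old, old_source = domain.read_since(10)         # older than the ring can hold
print(recent_source, len(recent), old_source, len(old))
```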

5. Our solution – Quality control

While reworking the full link on both the client and the server, we also built monitoring and troubleshooting for online message behavior.

5.1 Link Troubleshooting

Full-link troubleshooting is based on real-time user behavior logs. Client tracking points are cleaned into SLS by Flink, the group's real-time processing engine. User behavior includes the message engine's handling of messages, page clicks and visits, and network requests. On the server side, logs of long-connection pushes and retries are also cleaned into SLS, which gives a full-link troubleshooting path from server to client. For details, see the article series “Message Quality Platform | Full-Link Troubleshooting”.

5.2 Reconciliation system

To verify message accuracy, we also built a reconciliation system.

When a user leaves a session, the client collects a certain number of messages from that session, generates an MD5 checksum, and reports it to the server. On receiving the checksum, the server determines whether the messages are correct. According to sampled verification data, message accuracy is roughly 99.99%.
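
One way such a check could work is sketched below. The article does not say which fields go into the checksum or how many messages are sampled, so hashing the last 20 message IDs is purely an assumption, as are the function names.

```python
import hashlib


def session_checksum(message_ids: list, sample_size: int = 20) -> str:
    """MD5 over the most recent `sample_size` message IDs, in order."""
    sample = message_ids[-sample_size:]
    return hashlib.md5(",".join(sample).encode("utf-8")).hexdigest()


def reconcile(client_ids: list, server_ids: list, sample_size: int = 20) -> bool:
    """True if the client's view of the session matches the server's."""
    return session_checksum(client_ids, sample_size) == session_checksum(server_ids, sample_size)


server_ids = [f"m{i}" for i in range(30)]
consistent_client = list(server_ids)
lossy_client = [mid for mid in server_ids if mid != "m25"]  # one message is missing

print(reconcile(consistent_client, server_ids))
print(reconcile(lossy_client, server_ids))
```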

6. Core data indicators

We ran into problems when computing the key message indicators. Previously we computed them from user tracking points and found a 3%~5% discrepancy in the data, so we later switched to sampled, real-time reported data.

Message arrival rate = number of messages the client actually received / number of messages the client should have received

A message counts as actually received once it has been stored in the client's database. This indicator makes no distinction between offline and online delivery. Taking the user's last device update time of the day as the cutoff, all messages delivered on that day before that time should, in theory, have been received. The arrival rate of the latest version has basically reached 99.9%, and judging from user feedback, complaints about lost messages have indeed dropped considerably.
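
As a small worked example of the formula, the sketch below compares the set of message IDs the server delivered before the cutoff with the set found in the client's database. The function name and the sample IDs are made up for illustration.

```python
def arrival_rate(expected_ids: set, stored_ids: set) -> float:
    """Share of expected messages that are present in the client's database."""
    if not expected_ids:
        return 1.0  # nothing was expected, so nothing is missing
    return len(expected_ids & stored_ids) / len(expected_ids)


expected = {"m1", "m2", "m3", "m4"}  # delivered by the server before the cutoff
stored = {"m1", "m2", "m3", "m9"}    # found in the client's local database
print(arrival_rate(expected, stored))
```

Messages that the client stored but the server did not expect (here `m9`) do not raise the rate, because only the intersection is counted.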

7. Plans for the future

Overall, after a year of governance, Xianyu messaging is steadily improving, but some aspects still need work:

  • Message security is currently insufficient, so messaging is easily exploited by black-market operators to send illegal content.
  • Message extensibility is weak: adding a new card type or capability requires an app release, and dynamic, extensible capabilities are lacking.
  • The underlying protocol is hard to extend and needs to be standardized.
  • From a business perspective, messaging should be a horizontal supporting tool or platform product that second-party and third-party teams can integrate with quickly.

In 2021, we will keep paying attention to user feedback about messaging, and we hope Xianyu messaging can help users complete second-hand transactions more smoothly.

[Reference]

  1. www.52im.net/thread-464-…