Parts of the text and figures in this article are quoted from “Talk about the immediacy and reliability of IM systems”; thanks to the original author.

1. Introduction

The previous article, “Introduction to Zero-base IM Development (II) : What is real-time IM System?”, discussed the “foothold” of an IM system: its real-time nature. This article explains the topic of “reliability” in IM systems. It tries to cover only the principles and avoids deep technical discussion, so that it stays easy to understand.

This series is intended for developers and product managers with no IM background. The goal is to answer the question “What is an IM system?” without delving into specific technical implementations, so that it remains accessible to everyone.

If you want to study IM technology systematically and start your own IM development (that is, to answer the question “How is an IM system built?”), start with this article: “Just One for Beginners: Developing Mobile IM from Scratch”.

For learning and discussion: open-source IM framework source code at github.com/JackJiang20…

(This article is simultaneously published at: www.52im.net/thread-3182…)

2. Series of articles

Introduction to Zero-base IM Development (I) : What is IM System?

Introduction to Zero-base IM Development (II) : What is real-time IM System?

Introduction to Zero-base IM Development (III) : What is IM System Reliability? (this article)

Introduction to Zero-base IM Development (IV) : What is Message Timing Consistency in IM Systems?

Introduction to Zero-base IM Development (V) : What is IM System Security? (to be released later)

Introduction to Zero-base IM Development (VI) : What is the Heartbeat Mechanism of IM Systems? (to be released later)

Introduction to Zero-base IM Development (VII) : How to Understand and Implement Unread Messages in IM Systems? (to be released later)

Introduction to Zero-base IM Development (VIII) : How to Understand and Implement Multi-terminal Message Roaming in IM Systems? (to be released later)

3. Overview of the text

Generally speaking, the “reliability” of an IM system refers to the reliability of chat message delivery. (Strictly speaking, “message” is used here in a broad sense, because it also covers the various instructions that are invisible to the user. For simplicity, all of them are referred to as “messages”.)

In terms of user behavior, message “reliability” should fall into two categories:

  • 1) Reliability of online messages: that is, when sending messages, the receiver is currently online;
  • 2) Reliability of offline messages: When sending messages, the receiver is in offline state.

In technical terms, message reliability has two meanings:

  • 1) No message loss: this one is straightforward. A message must not vanish as if into a black hole, leaving the user confused;
  • 2) No duplicate messages: this is the opposite of message loss. Repeated messages cannot be tolerated either.

Looking more closely, “no message loss” itself has two meanings:

  • 1) The message is known to have been received by the other party;
  • 2) The message is known not to have been received by the other party.

Meaning 1) is easy to understand. Meaning 2) says that when the other party fails to receive a message, your IM system must be aware of it. Otherwise, the message also counts as “lost”.

In short, a well-designed IM system must implement both aspects of message “reliability” in order to be usable.

Message reliability (no loss, no duplication) is undoubtedly an important metric of an IM system, and also one of the difficulties in implementing one. The rest of this article discusses reliability for online messages and for offline messages in turn.

4. Typical online message sending and receiving process

Here’s a typical IM message flow:

![](https://img2020.cnblogs.com/blog/1834368/202010/1834368-20201029140303292-1266975469.png)

Yes, this is a typical server-relay IM architecture.

**The so-called “server-relay IM architecture” means:** a message sent from client A is forwarded by the IM server, which then pushes it to client B. This is currently the most common message distribution architecture for IM systems.

You might ask: can’t IM be P2P? It can, but today’s mainstream IM systems almost all use server relay, and P2P mode is rarely used.

P2P mode has two obvious drawbacks:

  • 1) In P2P mode, the IM operator is easily bypassed by users (it cannot monitor user behavior; what if users distribute pornographic content?);
  • 2) In P2P mode, group chat is hard to implement (if I want to send a message to a thousand people, I cannot realistically distribute it 1000 times myself).

**We digressed a little; back to the topic:** in the figure above, client A sends a message to the server, and the server relays it to client B. Assuming both links use TCP, does TCP, the so-called reliable transport protocol, really guarantee the reliability of IM chat messages?

The answer is no. Let’s move on to the next section.

5. TCP does not guarantee “reliability” of online messages

In a typical server-relay IM architecture, even the “reliable transport protocol” TCP does not guarantee the reliability of chat messages. Why?

Many articles on the Internet answer this question from the server’s perspective, citing an operating system crash during message sending, network interruption, storage failure, and so on. In short, those explanations are abstract and not easy to follow.

This time we will look at it from the client’s point of view to understand why IM chat messages are not reliable even over the reliable transport protocol TCP.

**Specifically:** ensuring the reliability of IM messages is a fairly complex topic. A message travels from the sending client to the server, then from the server to the target client, and is finally displayed in the UI, so many links are involved. Here we take only one of them, “how the receiver ensures that a message is not lost”, and roughly discuss the two design approaches I have come across.

**Speaking of reliable delivery:** the first thing that comes to mind is TCP’s reliability. Reliable delivery of data is a universal problem: whether for binary stream data on the network or for business data in the upper layers, reliability has to be guaranteed. TCP is a fundamental network protocol, and the soundness of its reliability design is beyond doubt, so we start from TCP’s reliability.

At the TCP layer: every byte sent by the Sender has a Sequence Number, and the Receiver acknowledges the data it has received by returning an Ack Number. The Ack Number is the Sequence Number of the last byte received plus 1, that is, the Sequence Number of the next byte the Receiver expects. Put simply, if the Sender sends a 100-byte packet with Seq = 1, the Receiver returns a packet with Ack = 101. If the Sender receives this Ack packet, the data really has been received by the Receiver. Otherwise, the Sender uses some retransmission strategy to send the packet again.
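The arithmetic in the example above can be checked with a small sketch. For a segment whose first byte has Sequence Number `seq` and which carries `payload_len` bytes, the acknowledgement is `seq + payload_len`, the number of the next byte expected. The function names are made up for illustration, and the sketch deliberately ignores sequence-number wraparound and the SYN/FIN flags, which real TCP also has to handle.

```python
def ack_for_segment(seq: int, payload_len: int) -> int:
    """Return the Ack Number a receiver sends after one in-order segment.

    `seq` is the Sequence Number of the first byte in the segment, so the
    segment carries bytes seq .. seq + payload_len - 1 and the next byte
    the receiver expects is seq + payload_len.
    """
    if payload_len <= 0:
        raise ValueError("payload_len must be positive")
    return seq + payload_len


def is_acknowledged(seq: int, payload_len: int, ack: int) -> bool:
    """True if a cumulative Ack Number covers every byte of the segment."""
    return ack >= seq + payload_len


if __name__ == "__main__":
    ack = ack_for_segment(seq=1, payload_len=100)
    print(ack)                            # 101, as in the example in the text
    print(is_acknowledged(1, 100, ack))   # True: bytes 1..100 are covered
    print(is_acknowledged(101, 50, ack))  # False: bytes 101..150 are not
```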

**The first question is:** since TCP itself is reliable, why can the Receiver still lose messages?

The figure below makes it clear at a glance:

![](https://img2020.cnblogs.com/blog/1834368/202010/1834368-20201029140348192-273482632.png)

(The figure above is from “Message Reliability and Delivery Mechanism of Mobile IM from the Client’s Perspective”)

**Summary of the figure above:** network-layer reliability is not the same as business-layer reliability.

After data reliably arrives at the network layer, it still has to be passed up through the layers one at a time. The processing along the way may include:

  • 1) Security verification;
  • 2) Binary parsing;
  • 3) Model creation;
  • 4) Writing to the db;
  • 5) Caching;
  • 6) UI display;
  • 7) Various edge cases, such as a network outage, the user suddenly logging out, a full disk, memory overflow, an app crash, a sudden shutdown, and so on.

The more features a project has, the more likely it is that something goes wrong in the processing above the network layer.

**Take the simplest scenario as an example:** after a message reliably arrives at the network layer, the IM app crashes before writing it to the db (app crashes are not rare). Although the data reliably reached the network layer, it was never saved to the db, so the next time the user opens the app the message is simply gone. For the receiver, the message is lost forever, so there is no “reliability” to speak of.

To learn more about how messages can be lost, and how to handle it, from the client’s point of view, read “Message Reliability and Delivery Mechanism of Mobile IM from the Client’s Perspective” in detail. This section quotes text from section 4 of that article.

6. Add “reliability” for online messages

So how do you add reliability assurance at the application layer?

**There is an existing mechanism we can learn from:** TCP’s timeout, retransmission, and acknowledgement mechanism.

To be specific:

  • 1) Construct an ACK message at the application layer. When the receiver has processed a message correctly, it sends an ACK back to the sender;
  • 2) If the sender does not receive the ACK within the timeout period, it considers the message delivery failed and either retransmits it or handles the failure in some other way.
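As a minimal sketch of the two steps above, the following Python example retries a send until an application-layer ACK arrives or a retry limit is reached. The names `send_with_ack` and `LossyChannel`, the `id` and `ack_id` fields, and the limit of three attempts are assumptions made for illustration; they do not come from any particular IM framework. The timeout itself is abstracted away: the hypothetical `send` callable simply returns `None` when no ACK arrived in time.

```python
from typing import Callable, Optional


def send_with_ack(
    send: Callable[[dict], Optional[dict]],
    message: dict,
    max_attempts: int = 3,
) -> bool:
    """Send `message` and wait for an application-layer ACK, retrying on timeout.

    `send` is a hypothetical transport callable. It returns the ACK dict, or
    None when no ACK arrived within the timeout period.
    """
    for _ in range(max_attempts):
        ack = send(message)
        if ack is not None and ack.get("ack_id") == message["id"]:
            return True   # the receiver confirmed it processed the message
    return False          # give up; the caller reports the failure


class LossyChannel:
    """Test double that drops the first `drops` transmissions (hypothetical)."""

    def __init__(self, drops: int) -> None:
        self.drops = drops
        self.attempts = 0

    def __call__(self, message: dict) -> Optional[dict]:
        self.attempts += 1
        if self.attempts <= self.drops:
            return None   # simulated timeout: the message or its ACK was lost
        return {"ack_id": message["id"]}


if __name__ == "__main__":
    channel = LossyChannel(drops=2)
    delivered = send_with_ack(channel, {"id": "m1", "body": "hello"})
    print(delivered, channel.attempts)  # True 3
```

Note that the sender cannot tell whether the message or only the ACK was lost, so a retry may deliver the same message twice. The receiver therefore has to tolerate duplicates.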

The process of sending and receiving messages with the confirmation mechanism added is as follows:

![](https://img2020.cnblogs.com/blog/1834368/202010/1834368-20201029140457292-750952359.png)

We can divide the whole process into two stages.

**Stage 1: clientA -> Server**
  • 1-1: clientA sends a message to the server (msg-req);
  • 1-2: The server receives the message and replies with an ACK (msg-ack) to clientA;
  • 1-3: Once clientA receives the ACK, the message is considered successfully delivered, and the first stage ends.

If either msg-A or ack-A (the message or its ACK in this stage) is lost, clientA will not receive an ACK within the timeout period. In that case, it can tell the user that sending failed and let the user resend the message manually.

**Stage 2: Server -> clientB**
  • 2-1: The server sends the message to clientB (notify-req);
  • 2-2: clientB receives the message and replies with an ACK (notify-ack) to the server;
  • 2-3: After receiving the ACK, the server marks the message as sent, and the second stage ends.

If either msg-B or ack-B (the message or its ACK in this stage) is lost, the server will not receive an ACK within the timeout period. In that case, the server must resend msg-B until clientB returns an ACK.
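The second stage can be sketched as a table of unacknowledged pushes on the server. This is one possible shape, not the implementation of any specific IM server: `RelayServer`, `PendingNotify`, the `push` callable, the `tick` method, and the 5-second timeout are all hypothetical. Time is passed in explicitly so the behavior is deterministic, whereas a real server would use timers or a scheduler.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class PendingNotify:
    message: dict
    deadline: float  # time after which the message is pushed again


@dataclass
class RelayServer:
    push: Callable[[dict], None]  # hypothetical: sends notify-req to clientB
    ack_timeout: float = 5.0      # assumed value, for illustration only
    pending: Dict[str, PendingNotify] = field(default_factory=dict)

    def notify(self, message: dict, now: float) -> None:
        """Step 2-1: push the message and remember that it is unacknowledged."""
        self.push(message)
        self.pending[message["id"]] = PendingNotify(message, now + self.ack_timeout)

    def on_notify_ack(self, message_id: str) -> None:
        """Step 2-3: the ACK arrived, so stop tracking the message."""
        self.pending.pop(message_id, None)

    def tick(self, now: float) -> None:
        """Resend every message whose ACK did not arrive before its deadline."""
        for entry in self.pending.values():
            if now >= entry.deadline:
                self.push(entry.message)
                entry.deadline = now + self.ack_timeout


if __name__ == "__main__":
    pushed = []
    server = RelayServer(push=pushed.append)
    server.notify({"id": "m1", "body": "hello"}, now=0.0)
    server.tick(now=5.0)        # no ACK yet, so the message is pushed again
    server.on_notify_ack("m1")
    server.tick(now=60.0)       # acknowledged, so nothing more is pushed
    print(len(pushed))          # 2
```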

For an in-depth discussion of reliability guarantees for IM chat messages, read “Implementation of IM Message Delivery Guarantee Mechanism (I): Ensuring Reliable Delivery of Online Real-time Messages”, which covers this topic in more detail.

7. Typical offline message sending and receiving process

Leaving the “reliability” of online messaging behind, it’s time to consider offline messaging.

7.1 Unreliable Sending and Receiving Offline Messages

Here is a typical IM offline message flow diagram:

![](https://img2020.cnblogs.com/blog/1834368/202010/1834368-20201029140534659-669937918.png)

As the figure above shows, the flow is similar to the online message flow.

The offline message flow can also be divided into two stages:

**Stage 1: clientA -> Server**
  • 1-1: clientA sends a message to the server (msg-req);
  • 1-2: The server finds that clientB is offline and stores the message in offline-db.

**Stage 2: Server -> clientB**
  • 2-1: After clientB comes online, it pulls its offline messages from the server (pull-req);
  • 2-2: The server retrieves the offline messages from offline-db, pushes them to clientB (pull-res), and deletes them from offline-db.

**Obviously:** messages can also be lost in this offline flow.

**For example:** if pull-res is not successfully delivered to clientB but the messages have already been deleted from offline-db, those offline messages are lost for good.

7.2 Reliability of Offline Messages

Similar to the online message sending and receiving process, we also need to add a reliability guarantee mechanism at the application layer.

The following figure shows the offline message sending and receiving process with added reliability guarantee:

![](https://img2020.cnblogs.com/blog/1834368/202010/1834368-20201029140606209-711649734.png)

Compared with the initial offline message flow, the figure above adds steps 1-3, 2-4 and 2-5:

  • 1-3: After storing the message in offline-db, the server sends an ACK (msg-ack) to clientA. clientA considers the message delivered once it receives this ACK;
  • 2-4: After receiving the pushed offline messages, clientB replies with an ACK (res-ack) to the server;
  • 2-5: The server deletes the offline messages from offline-db only after it receives res-ack, which confirms that clientB has successfully received them.
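The key change, deleting only after res-ack, can be sketched as follows. `OfflineStore` is a hypothetical in-memory stand-in for offline-db, and the method names and message fields are assumptions for illustration. A real system would use a database and would also need the duplicate handling discussed later, because a message that was received but not yet acknowledged is returned again on the next pull.

```python
from collections import defaultdict
from typing import Dict, List


class OfflineStore:
    """In-memory stand-in for offline-db (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # user -> {message id -> message}; dicts keep insertion order
        self._by_user: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def store(self, user: str, message: dict) -> dict:
        """Persist the message first, then return msg-ack for the sender (step 1-3)."""
        self._by_user[user][message["id"]] = message
        return {"type": "msg-ack", "ack_id": message["id"]}

    def pull(self, user: str) -> List[dict]:
        """Build pull-res: return the offline messages WITHOUT deleting them."""
        return list(self._by_user[user].values())

    def on_res_ack(self, user: str, acked_ids: List[str]) -> None:
        """Step 2-5: delete only the messages the receiver has confirmed."""
        for message_id in acked_ids:
            self._by_user[user].pop(message_id, None)


if __name__ == "__main__":
    db = OfflineStore()
    db.store("clientB", {"id": "m1", "body": "hi"})
    db.store("clientB", {"id": "m2", "body": "are you there?"})

    db.pull("clientB")                 # suppose this pull-res is lost in transit
    retry = db.pull("clientB")         # nothing was deleted, so a retry still works
    print([m["id"] for m in retry])    # ['m1', 'm2']

    db.on_res_ack("clientB", ["m1", "m2"])
    print(db.pull("clientB"))          # []
```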

Of course, the guarantee mechanism above still leaves room for performance optimization.

**When the volume of offline messages is large:** replying with an ACK for every single message greatly increases the number of round trips between the client and the server. In this case, a batch ACK is usually used, so that a single ACK acknowledges multiple messages. In one IM implementation, all offline messages are grouped by session, and one ACK is sent per group. If an ACK is lost, only the offline messages of that session need to be retransmitted.
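A rough sketch of batch ACKs grouped by session is shown below. The `session_id`, `ack_ids` and `type` fields and the three function names are assumptions for illustration. The source only states that messages are grouped by session and acknowledged once per group.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_session(messages: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group pulled offline messages by their session (conversation) ID."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for message in messages:
        groups[message["session_id"]].append(message)
    return dict(groups)


def build_batch_acks(groups: Dict[str, List[dict]]) -> List[dict]:
    """Client side: one res-ack per session instead of one per message."""
    return [
        {"type": "res-ack", "session_id": sid, "ack_ids": [m["id"] for m in msgs]}
        for sid, msgs in groups.items()
    ]


def sessions_to_resend(
    groups: Dict[str, List[dict]], received_acks: List[dict]
) -> List[str]:
    """Server side: sessions whose batch ACK never arrived are resent whole."""
    acked = {ack["session_id"] for ack in received_acks}
    return [sid for sid in groups if sid not in acked]


if __name__ == "__main__":
    pulled = [
        {"id": "m1", "session_id": "alice"},
        {"id": "m2", "session_id": "bob"},
        {"id": "m3", "session_id": "alice"},
    ]
    groups = group_by_session(pulled)
    acks = build_batch_acks(groups)
    print(len(pulled), "messages,", len(acks), "ACKs")  # 3 messages, 2 ACKs
    # Pretend the ACK for the second session was lost on its way to the server.
    print(sessions_to_resend(groups, acks[:1]))         # ['bob']
```

The trade-off is granularity: fewer ACKs mean fewer round trips, but a single lost ACK causes every message in that session to be resent.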

For a detailed discussion of reliability guarantees for offline messages, read “Implementation of IM Message Delivery Guarantee Mechanism (II): Ensuring Reliable Delivery of Offline Messages” and “IM Development Practical Sharing: How to Gracefully Achieve Reliable Delivery of a Large Number of Offline Messages”. These two articles give more specific answers.

8. The problem of repeated chat messages

In the previous sections, we eliminated the possibility of message loss by adding retransmission and acknowledgement mechanisms at the application layer.

But the retry mechanism introduces a new problem: the same message may be delivered twice.

**The simplest example:** suppose a client successfully receives a message pushed by the server, but its ACK is lost. After the timeout, the server pushes the message again. If the business layer does not handle the duplicate, the user sees two identical messages.

Message deduplication is actually quite simple: messages are generally filtered by their unique identifier (ID).

The handling differs slightly between the client and the server:

  • 1) Client: maintain a map of the IDs of received messages, and discard any message whose ID is already present;
  • 2) Server: on receiving a message, look up its ID in the database. If the message already exists, do not process it again, but still reply with an ACK to the client (the duplicate most likely comes from a manual resend by the user).
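Both sides of the deduplication described above can be sketched in a few lines. `ClientInbox` and `ServerInbox` are hypothetical names, a `set` and a `dict` stand in for the client’s ID map and the server’s database, and the ID set is allowed to grow without bound, which a real client would need to cap or persist. Having the client re-send its ACK for a duplicate is an assumption on my part, based on the fact that the duplicate exists because an earlier ACK was lost.

```python
from typing import Dict, List, Set


class ClientInbox:
    """Client side: show each message once, however often it is pushed."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.displayed: List[dict] = []

    def on_notify(self, message: dict) -> dict:
        if message["id"] not in self.seen_ids:
            self.seen_ids.add(message["id"])
            self.displayed.append(message)
        # ACK duplicates too (assumption): otherwise the server keeps resending.
        return {"type": "notify-ack", "ack_id": message["id"]}


class ServerInbox:
    """Server side: store each message once, but always reply with an ACK."""

    def __init__(self) -> None:
        self.stored: Dict[str, dict] = {}  # stands in for the database

    def on_msg_req(self, message: dict) -> dict:
        if message["id"] not in self.stored:
            self.stored[message["id"]] = message
        # Reply even for a duplicate: it may be the user's manual resend.
        return {"type": "msg-ack", "ack_id": message["id"]}


if __name__ == "__main__":
    client = ClientInbox()
    msg = {"id": "m1", "body": "hello"}
    client.on_notify(msg)
    client.on_notify(msg)            # server resend after a lost ACK
    print(len(client.displayed))     # 1

    server = ServerInbox()
    server.on_msg_req(msg)
    ack = server.on_msg_req(msg)     # manual resend by the user
    print(len(server.stored), ack["ack_id"])  # 1 m1
```

This is what makes retransmission safe: handling a message a second time has the same effect as handling it once.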

In one-to-one chat this logic is not complicated, but in group chat it becomes much more involved. For a detailed discussion of how to avoid losing and duplicating group chat messages, read “IM Group Chat Messages Are So Complex: How to Ensure They Are Neither Lost Nor Duplicated?”.

9. Summary of this paper

Ensuring message reliability is an important part of IM system design. Whether messages are lost or duplicated has a great impact on the user experience.

TCP, the so-called reliable transport protocol, cannot guarantee the reliability of messages at the application layer.

The reliability of IM messages is guaranteed by ACK and retransmission mechanisms at the application layer. These in turn create the message duplication problem, and the simplest way to handle it is idempotent deduplication based on the message ID.

This is the theoretical basis for message reliability in IM systems. If you have any questions, please leave a comment at the end of this article.

10. Reference materials

[1] Implementation of IM Message Delivery Guarantee Mechanism (I): Ensuring Reliable Delivery of Online Real-time Messages

[2] Implementation of IM Message Delivery Guarantee Mechanism (II): Ensuring Reliable Delivery of Offline Messages

[3] IM Development Practical Sharing: How to Gracefully Achieve Reliable Delivery of a Large Number of Offline Messages

[4] Message Reliability and Delivery Mechanism of Mobile IM from the Client’s Perspective

[5] Talk about the Immediacy and Reliability of IM Systems

[6] Study Note 4: How Does an IM System Ensure Message Reliability

[7] IM Group Chat Messages Are So Complex: How to Ensure They Are Neither Lost Nor Duplicated?
