This article’s content and approach are based on two outlines by Deng Yunze, “large-scale concurrent IM service architecture design” and “IM weak network scene optimization”. Thanks to Deng Yunze for generously sharing them.

1. Introduction

Following the previous article, “A set of IM Architecture Technology for 100 million Users (Part I): Overall Architecture, Service Separation, etc.”, this article focuses on some of the more detailed but very important hot issues in an IM architecture for 100 million users, such as message reliability, message ordering, data security, and weak networks on mobile terminals.

Each of these hot IM issues could fill an article of its own. Given the length limit, this article does not discuss each one in depth. It aims to help readers grasp the key points of each problem as a starting point, and it names specific research articles for each problem so that readers can study selectively in more depth. I hope this article is of some benefit to your IM development.

2. Series of articles

To present the content better, this article is split into two parts.

This is the second of the two:

“A set of IM Architecture Technology for 100 million Users (Part I): Overall Architecture, Service Separation, etc.”

“A set of IM Architecture Technology for 100 million Users (Part II): Reliability, Ordering, Weak Network Optimization” (this article)

This article focuses on some of the more detailed but important hot issues of the 100-million-user IM architecture.

3. Message reliability problem

Message reliability is a typical technical indicator of an IM system. For users, whether messages are delivered reliably (without being lost) is the precondition for trusting the IM system.

In other words, if an IM system cannot guarantee that messages are not lost, every message sent has some probability of being lost, and users will not trust the IM enough to use it.

From a product manager’s point of view, with such a technical flaw, no matter how hard the product is promoted, its end users will soon be lost. An IM that does not guarantee message reliability therefore has a serious problem.

**PS:** If you do not yet have an intuitive picture of the IM message reliability problem, see “Introduction to Zero-based IM Development (3): What is IM System Reliability?”, which is easy to follow.

As shown in the figure above, message reliability is mainly guaranteed by two pieces of logic:

  • 1) Uplink message reliability;
  • 2) Downlink message reliability.

1) Uplink message reliability can be handled as follows:

The user sends a message (let’s say the protocol message is PIMSendReq), and the client assigns it a local ID. The client then waits for the server to finish processing and return a PIMSendAck carrying the same local ID, which tells the user that the message was sent successfully.

If no ACK arrives within a certain period, the message is treated as not sent successfully, and the client SDK needs to retry.
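The uplink flow above can be sketched as follows. This is a minimal illustration, not the author’s SDK: the field names, the `send_with_retry` helper, and the `transport` callback (which returns the ACK, or `None` when its wait times out) are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PIMSendReq:
    """Uplink request; the client generates `local_id` (field names are assumed)."""
    text: str
    local_id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class PIMSendAck:
    """Server acknowledgement that echoes the client's local ID."""
    local_id: str


def send_with_retry(
    req: PIMSendReq,
    transport: Callable[[PIMSendReq], Optional[PIMSendAck]],
    max_attempts: int = 3,
) -> bool:
    """Send `req` until a matching ACK arrives or the attempts run out.

    `transport` is a hypothetical function that sends the request and
    returns the ACK, or None if no ACK arrived before its timeout.
    """
    for _ in range(max_attempts):
        ack = transport(req)
        if ack is not None and ack.local_id == req.local_id:
            return True   # the server confirmed this exact message
    return False          # caller should now show "send failed"


# Usage: a fake transport that loses the first attempt.
attempts = []

def flaky_transport(req: PIMSendReq) -> Optional[PIMSendAck]:
    attempts.append(req.local_id)
    return None if len(attempts) == 1 else PIMSendAck(req.local_id)

ok = send_with_retry(PIMSendReq("hello"), flaky_transport)
```

Because every retry reuses the same local ID, a server could also use that ID to discard duplicates when the original request did arrive and only the ACK was lost.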

2) Downlink message reliability can be handled as follows:

The server receives a message from user A and pushes it to B, C, and D. If B is temporarily offline, the online push to B is likely to fail.

**So the core of ensuring downlink reliability is:** cache the message before pushing it.

This cache is backed by the storage system. The MsgWriter maintains an offline message list for each user, and a message from user A is written to the offline message lists of B, C, and D at the same time.
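A minimal sketch of this “store first, then push” idea is shown below. The in-memory `MsgWriter` stands in for the real storage system, and the `push` callback (returning `True` only when the recipient confirmed receipt) is an assumption for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MsgWriter:
    """In-memory stand-in for the storage-backed offline message lists."""

    def __init__(self) -> None:
        self.offline: Dict[str, List[str]] = defaultdict(list)

    def append(self, user: str, msg: str) -> None:
        self.offline[user].append(msg)

    def ack(self, user: str, msg: str) -> None:
        # Drop a message once the recipient has confirmed receipt.
        if msg in self.offline[user]:
            self.offline[user].remove(msg)


def deliver(
    writer: MsgWriter,
    recipients: List[str],
    msg: str,
    push: Callable[[str, str], bool],
) -> None:
    # 1) Persist for every recipient first, so a failed push cannot lose the message.
    for user in recipients:
        writer.append(user, msg)
    # 2) Then try the online push; messages that fail simply stay in the list.
    for user in recipients:
        if push(user, msg):
            writer.ack(user, msg)


# Usage: B is offline, C and D are online.
online = {"C", "D"}
writer = MsgWriter()
deliver(writer, ["B", "C", "D"], "hi from A", lambda user, msg: user in online)
```

After this run, B’s offline list still holds the message, so B can pull it after reconnecting, while the lists for C and D are empty.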

The message reliability problem can also be approached from another dimension: real-time message reliability and offline message reliability.

These two articles cover them in more detail:

“IM Message Delivery Guarantee Mechanism (1): Ensure the Reliable Delivery of Online Real-time Messages”

“IM Message Delivery Guarantee Mechanism (2): Ensure the Reliable Delivery of Offline Messages”

When it comes to offline message reliability, single chat and group chat differ greatly. For reliable delivery of offline messages in group chat, read “IM Development Dry Goods Sharing: How to Gracefully Deliver a Large Number of Offline Messages Reliably”.

4. Message ordering problem

Message ordering is another technical “hard nut” in distributed IM systems.

In a distributed system, the client and server clocks may be out of sync. If ordering relies simply on one party’s clock, a large number of messages will end up out of order.

For example, suppose only the client clock is used and A’s clock is 30 minutes ahead of B’s. A sends a message to B, and B replies to A.

The order of sending is:

Client A: “XXX”

Client B: “YYY”

The order seen by the receiver becomes:

Client B: “YYY”

Client A: “XXX”

Because A’s clock is 30 minutes ahead, all of A’s messages are sorted later in the list.

A similar problem arises if ordering relies only on the server clock, because two servers may also have inconsistent times. Even if client A and client B have the same clock, A’s messages may be processed by server S1 and B’s messages by server S2, which again produces out-of-order messages.

To solve this problem, I take the following series of measures.

1) Server time alignment:

This part is the responsibility of back-end operations. System administrators keep the server clocks aligned as best they can, and there is no other way.

2) The client aligns its time with the server:

For example, after the client logs in, it calculates the difference between the client time and the server time, and takes this difference into account when sending messages.

In my IM architecture, this aligns the time to within about 100 ms. Doing better is difficult because the RTT between client and server is unstable (there is a risk of uncontrollable delays).
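One common way to compute such a difference is sketched below. It is not necessarily what the author’s architecture does: it assumes the client records its own clock when a request leaves and when the reply arrives, that the reply carries a server timestamp, and that network delay is roughly symmetric. The function names are hypothetical.

```python
from typing import Iterable, Tuple

Sample = Tuple[int, int, int]  # (client send ms, server ms, client receive ms)


def estimate_offset(t_send: int, t_server: int, t_recv: int) -> float:
    """Estimate (server clock - client clock) in milliseconds.

    Assumes the server stamped its time halfway through the round trip,
    so the estimate can be wrong by up to half the RTT.
    """
    rtt = t_recv - t_send
    return t_server - (t_send + rtt / 2)


def best_offset(samples: Iterable[Sample]) -> float:
    """Use the sample with the smallest RTT, which has the tightest error bound."""
    t_send, t_server, t_recv = min(samples, key=lambda s: s[2] - s[0])
    return estimate_offset(t_send, t_server, t_recv)


# Usage: two probes; the second has the shorter round trip (100 ms).
samples = [
    (100_000, 1_900_300, 100_400),
    (101_000, 1_901_050, 101_100),
]
offset_ms = best_offset(samples)   # the client adds this to its local time
```

Since the error of each estimate is bounded by half of that probe’s RTT, an unstable RTT directly limits how precisely the clocks can be aligned.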

3) Each message carries both a local time and a server time:

When sorting, messages from the same person are ordered by their local time, and messages from different people are ordered by server time. This is the interpolation sorting algorithm.
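The article does not spell out the algorithm, so the following is only one possible reading of it: order everything by server time, then, within the positions occupied by each sender, rearrange that sender’s messages by local time. The `Msg` fields and `order_messages` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    sender: str
    local_ts: int    # sender's client time in ms (after offset correction)
    server_ts: int   # time stamped by the server in ms
    text: str


def order_messages(msgs: List[Msg]) -> List[Msg]:
    # Across senders: the server timestamp decides the overall order.
    by_server = sorted(msgs, key=lambda m: m.server_ts)
    result = list(by_server)
    # Within one sender: reuse that sender's positions, but fill them
    # in local-time order so a delayed resend cannot jump behind a later message.
    for sender in {m.sender for m in msgs}:
        slots = [i for i, m in enumerate(by_server) if m.sender == sender]
        own = sorted((by_server[i] for i in slots), key=lambda m: m.local_ts)
        for i, m in zip(slots, own):
            result[i] = m
    return result


# Usage: A's first message was retried and reached the server last.
msgs = [
    Msg("A", local_ts=1000, server_ts=5000, text="a1"),
    Msg("A", local_ts=2000, server_ts=3000, text="a2"),
    Msg("B", local_ts=9999, server_ts=4000, text="b1"),
]
ordered = [m.text for m in order_messages(msgs)]
```

In this example A’s two messages keep the order in which A wrote them, while B’s message is placed relative to them by server time alone.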

**PS:** For a more accessible explanation, read “Introduction to Zero-based IM Development (4): What is Message Timing Consistency for IM Systems?”.

**In addition:** For approaches that are feasible in practice, the ideas in “A Low-cost Method to Ensure IM Message Timing” and “How to Ensure the Timing and Consistency of IM Real-time Messages?” can be used for reference.

In fact, message ordering can also be handled through message IDs: if the algorithm generates IDs that are themselves ordered, sorting by message ID sorts the messages.

On algorithms for ordered message IDs, these two articles are well worth reading: “IM Message ID Technology Topic (1): WeChat’s Practice of Generating Sequence Numbers for Massive IM Chat Messages (Algorithm Principles)” and “IM Message ID Technology Topic (3): Decrypting the Chat Message ID Generation Strategy of RongCloud IM Products”. I will not go over them again here.
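As a rough illustration of ordered IDs, here is a Snowflake-style sketch for a single node: a millisecond timestamp in the high bits and a sequence counter in the low bits. This is a generic technique swapped in for illustration, not the WeChat or RongCloud scheme from the articles above, and the class name and bit width are assumptions.

```python
import threading
import time
from typing import Callable


class MsgIdGenerator:
    """Single-node generator of strictly increasing message IDs."""

    SEQ_BITS = 12  # assumed width: up to 4096 IDs per millisecond

    def __init__(self, clock: Callable[[], int] = lambda: int(time.time() * 1000)):
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now <= self._last_ms:
                # Same millisecond, or the clock stepped back: stay on the
                # last timestamp and bump the sequence instead.
                now = self._last_ms
                self._seq += 1
                if self._seq >= (1 << self.SEQ_BITS):
                    now += 1          # sequence exhausted: borrow the next ms
                    self._seq = 0
            else:
                self._seq = 0
            self._last_ms = now
            return (now << self.SEQ_BITS) | self._seq


# Usage: even with a frozen clock the IDs keep increasing.
gen = MsgIdGenerator(clock=lambda: 1000)
ids = [gen.next_id() for _ in range(5000)]
```

Sorting messages by such an ID then reproduces the order in which the IDs were issued. A multi-node deployment would need an extra mechanism, such as node bits or a central sequence service, which this sketch leaves out.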

5. Message read synchronization problem

The read/unread function of the message is shown in the figure below:

The figure above shows read/unread receipts in DingTalk. This is very useful in enterprise IM scenarios (because managers love it, you know).

For one-to-one chat, the read/unread function is easy to understand: it adds a corresponding receipt message, which is sent back when the user reads the message.

For group chat, showing how many people have and have not read a message is considerably more troublesome. I will not expand on the implementation logic here; if you are interested, read “How to Implement the Read Receipt Function for IM Group Chat Messages?”.

Back to the topic of this section, “read synchronization”, which is one step harder. Read/unread receipts no longer apply just per account. They must handle the case where the same account is logged in on several devices, which makes the synchronization logic for read receipts more complicated.

Here are some ideas based on my experience with IM architecture.

Specifically, a user may be logged in to the same account on several devices at once (for example, the Web or PC client and the mobile client). With a read/unread function, read status then has to be synchronized. Otherwise a message read on device 1 still shows as unread on device 2, which hurts the user experience from a product point of view.

In my IM architecture, read synchronization relies mainly on two pieces of logic:

  • 1) Maintain the synchronization state: keep a timestamp for each of the user’s Sessions that records the time of the last read message;
  • 2) If a user opens a Session while several devices are online, send a PIMSyncRead message to inform the other devices.
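The two pieces of logic can be sketched together as below. Only the PIMSyncRead name comes from the article; the `ReadSync` class, the payload fields, and the `outbox` list standing in for real network pushes are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ReadSync:
    """Tracks the last-read timestamp per (user, session) and fans out updates."""

    def __init__(self) -> None:
        self.last_read: Dict[Tuple[str, str], int] = defaultdict(int)
        self.devices: Dict[str, Set[str]] = defaultdict(set)   # online devices per user
        self.outbox: List[Tuple[str, dict]] = []               # (device, payload) "sent"

    def mark_read(self, user: str, session: str, read_ts: int, from_device: str) -> None:
        key = (user, session)
        if read_ts <= self.last_read[key]:
            return  # stale or duplicate report, nothing to synchronize
        self.last_read[key] = read_ts
        # Notify every other online device of the same account.
        for dev in sorted(self.devices[user] - {from_device}):
            self.outbox.append(
                (dev, {"type": "PIMSyncRead", "session": session, "read_ts": read_ts})
            )


# Usage: alice reads session s1 on her phone while the web client is online.
sync = ReadSync()
sync.devices["alice"] = {"phone", "web"}
sync.mark_read("alice", "s1", read_ts=100, from_device="phone")
sync.mark_read("alice", "s1", read_ts=90, from_device="web")   # older, ignored
```

Keeping a single timestamp per Session means a device only has to treat every message at or before that time as read, rather than track receipts message by message.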

6. Data security issues

6.1 Basics

Data security in an IM system architecture is more complicated than in a general system. From the communication point of view, it involves the security of both socket long connections and HTTP short connections. With the popularity of IM on mobile terminals, trade-offs also have to be made among security, performance, data traffic, and user experience. Achieving a sound IM security architecture therefore poses many challenges.

In an IM system architecture, data security mainly means communication security and content security.

6.2 Communication Security

To understand communication security, you first need to understand which communication services an IM system consists of.

Currently, a typical IM system consists of two types of communication services:

  • 1) Socket long-connection services: technically, this is the network communication layer most people are familiar with, or more specifically the TCP and UDP protocols;
  • 2) HTTP short-connection services: the most commonly used HTTP REST interfaces.

For more on improving the security of long connections, read “Easy to Understand: Understanding the Principles of Instant Messaging Security”. In addition, the WeChat team’s “WeChat Next-generation Communication Security Solution: A Detailed Explanation of MMTLS Based on TLS 1.3” is also a very useful reference.

For scenarios with higher communication security requirements, see “Instant Messaging Security (2): Discussing the Application of Combined Encryption Algorithms in IM”, which has good ideas on how to use combined encryption algorithms.

As for short-connection security, you are probably familiar with it: enabling HTTPS does the job in most cases. If you do not know much about HTTPS, start with these articles: “Understanding HTTPS Security Principles, Digital Certificates, One-way Authentication, and Two-way Authentication” and “Instant Messaging Security (7): If You Understand HTTPS This Way, One Article Is Enough”.

6.3 Content Security

This may not be easy to understand: since communication security is already in place, why bother with “content security”?

Let’s look at the so-called three functions of cryptography: Encryption, Authentication, and Identification.

In detail:

Encryption: prevent bad guys from getting your data.

Authentication: prevent bad guys from modifying your data without your knowledge.

Identification: prevent bad guys from impersonating you.

Building on the previous section: if a malicious attacker circumvents or breaks through authentication and identification at the communication layer, the encryption that depends on them may be cracked.

To address this, the content itself needs to be encrypted more securely and independently, and this is called “end-to-end encryption” (E2E).

For example, Telegram, an IM reputed to be uncrackable, uses end-to-end encryption.

Instead of going into the depth of end-to-end encryption here, here are two interesting articles to read in depth:

“A Sharp Tool for Secure Mobile Communication — End-to-end Encryption (E2EE) Technology Details”

How End-to-end Encryption (E2EE) works in Real-time Audio and Video Chat

7. Avalanche effect

A distributed IM architecture is subject to an avalanche effect.

In a distributed IM architecture, a load-balancing algorithm assigns users to different servers each time they log in, for the sake of high availability. This is where the problem arises.

**For example:** Suppose there are 5 machine rooms and room A fails, so its users move to room B. Room B is overwhelmed and crashes. The users of A and B then move to room C, and the chain reaction brings down all services.

Preventing the avalanche effect requires coordinated measures in both the server architecture and the client connection strategy. On the server side, rate-limiting capability is the foundation: it mainly limits the total number of users served and the number of users connecting within a short time.
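The two server-side limits can be sketched with a simple fixed-window counter, as below. The class name, parameters, and the fixed-window choice are assumptions for illustration; a production system might use a token bucket or a shared limiter instead.

```python
class ConnectionLimiter:
    """Admission control: caps total users and new connections per time window."""

    def __init__(self, max_total: int, max_new_per_window: int, window_s: float = 1.0):
        self.max_total = max_total
        self.max_new_per_window = max_new_per_window
        self.window_s = window_s
        self.total = 0
        self.window_start = 0.0
        self.new_in_window = 0

    def try_accept(self, now: float) -> bool:
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.new_in_window = 0
        if self.total >= self.max_total or self.new_in_window >= self.max_new_per_window:
            return False   # reject: the client should back off or go elsewhere
        self.total += 1
        self.new_in_window += 1
        return True

    def release(self) -> None:
        # Call when a user disconnects.
        self.total = max(0, self.total - 1)


# Usage: at most 3 users in total and 2 new connections per second.
limiter = ConnectionLimiter(max_total=3, max_new_per_window=2)
results = [limiter.try_accept(t) for t in (0.0, 0.1, 0.2, 1.5, 1.6)]
```

Rejecting the excess connections lets a room that is absorbing another room’s users stay alive instead of crashing in turn.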

At the client level, there should be a policy that prevents a large number of users from connecting to the same server at the same time when clients discover that a service is disconnected.

There are usually two options:

  • 1) Backoff: Set a random interval between reconnections;
  • 2) LBS: When reconnecting, request a new server IP from the LBS service, which reduces the number of users allocated to the same server within a short time.

The two schemes do not conflict and can be used together.
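Both client-side options can be sketched briefly. The article only asks for a random interval; the exponential growth and the cap used here are common additions and are assumptions, as are the function names, the `lbs_lookup` callback, and the example IP addresses.

```python
import random
from typing import Callable, List


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Random delay in seconds, drawn uniformly from [0, min(cap, base * 2**attempt)].

    The randomness spreads clients out so they do not all reconnect at once.
    """
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def pick_server(lbs_lookup: Callable[[], List[str]], failed_ip: str) -> str:
    """Ask a hypothetical LBS lookup for candidates and avoid the server that just failed."""
    candidates = [ip for ip in lbs_lookup() if ip != failed_ip]
    return candidates[0] if candidates else failed_ip


# Usage: delays for ten consecutive attempts, and a new server choice.
rng = random.Random(0)
delays = [reconnect_delay(a, rng=rng) for a in range(10)]
new_ip = pick_server(lambda: ["10.0.0.1", "10.0.0.2"], failed_ip="10.0.0.1")
```

The client would sleep for the computed delay and then connect to the address returned by the LBS, which combines the two schemes.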

8. Weak network problem

8.1 Causes of Weak Network Problems

Given how widespread mobile IM is today, weak networks are a common problem. Elevators, trains, cars, subways, and similar settings all suffer from obvious weak network problems.

So why do weak network problems occur?

To answer this question, we need to start from the principles of wireless communication.

The quality of wireless communication is subject to many factors, such as rapidly changing signal strength, signal interference, unevenly distributed base stations, and high movement speed. It would take three days and three nights to explain all of this.

Interested readers should read the following articles carefully. Cross-disciplinary introductory articles like these are rare:

Introduction to Zero-based Communication technology for IM Developers (xi) : Why WiFi Signal Is bad?

Introduction to Zero-based Communication technology for IM Developers (12) : Netload? Network drop?

Introduction to Zero-based Communication technology for IM Developers: Why Mobile Signal Is bad?

Introduction to Zero-based Communications technology for IM Developers: How Hard is it to Get Wireless Internet access on High-speed Trains?

Handling weak networks is a required course for mobile apps. The following summaries are also worth learning from:

A Must-read for Mobile IM Developers (part 1) : Easy to Understand how the Mobile Web is weak and Slow

A Must Read for Mobile IM Developers (part 2) : A Summary of the most Fully mobile Weak Network Optimization Approach

Summary of Optimization methods for Modern Mobile Network Short Connections: Request Speed, Weak Network Adaptation, and Security

Baidu APP Mobile Network Depth Optimization Practice Sharing (III) : Mobile Weak Network Optimization

8.2 Handling Weak Network Problems in IM

For IM, the weak network problem is not very complicated. The core is to handle message resending, ordering, and receive retries well.

To solve the IM problems caused by weak networks, the following areas can usually be improved:

  • 1) Automatic message resending;
  • 2) Offline message reception;
  • 3) Ordering of resent messages;
  • 4) Offline command processing.

Each of them will be discussed below.

8.3 Automatically Resending Messages

On the subway, I often find that the network drops once the train starts moving and a message fails to send.

At this point, the product can behave in one of two ways:

  • A. Tell the user immediately that sending failed;
  • B. Keep the “sending” status and automatically retry 3-5 times (3 minutes) before telling the user that sending failed.

**Obviously:** It is much better to report the failure only after the automatic retries fail. Especially when the network drops intermittently, retries succeed at a high rate, and the user will likely never notice the failure.

**Technically:** The IM SDK client tracks the status of each message. Sending a message is not a single network call but a state machine that manages several states: initial, sending, send failed, and send timed out. The retry mechanism is triggered for the failed and timed-out states.
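A minimal sketch of such a per-message state machine follows. The four states named in the article are kept; the `SENT` success state, the transition table, and the class and method names are assumptions added to make the example complete.

```python
from enum import Enum


class SendState(Enum):
    INIT = "init"
    SENDING = "sending"
    SENT = "sent"          # assumed terminal success state
    FAILED = "failed"
    TIMEOUT = "timeout"


# Allowed transitions; FAILED and TIMEOUT may go back to SENDING on retry.
TRANSITIONS = {
    SendState.INIT: {SendState.SENDING},
    SendState.SENDING: {SendState.SENT, SendState.FAILED, SendState.TIMEOUT},
    SendState.FAILED: {SendState.SENDING},
    SendState.TIMEOUT: {SendState.SENDING},
    SendState.SENT: set(),
}


class OutgoingMessage:
    def __init__(self, local_id: str, max_retries: int = 3):
        self.local_id = local_id
        self.state = SendState.INIT
        self.retries_left = max_retries

    def _move(self, new_state: SendState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def start(self) -> None:
        self._move(SendState.SENDING)

    def on_ack(self) -> None:
        self._move(SendState.SENT)

    def on_error(self, timed_out: bool = False) -> bool:
        """Record a failure; return True if the caller should resend now."""
        self._move(SendState.TIMEOUT if timed_out else SendState.FAILED)
        if self.retries_left > 0:
            self.retries_left -= 1
            self._move(SendState.SENDING)   # automatic retry, UI still shows "sending"
            return True
        return False                        # retries exhausted: show "send failed"


# Usage: two automatic retries, then the failure is surfaced.
msg = OutgoingMessage("m1", max_retries=2)
msg.start()
outcomes = [msg.on_error(timed_out=True), msg.on_error(), msg.on_error()]
```

Only when `on_error` returns `False` does the UI need to tell the user that sending failed, which matches option B above.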

There is also a discussion thread on retry mechanism design: How to Design a “Failure Retry” Mechanism for a Fully Self-developed IM?

The article “IM Message Delivery Guarantee Mechanism (1): Ensure the Reliable Delivery of Online Real-time Messages” also describes how to implement message timeout and retransmission, and is worth consulting.

8.4 Receiving Offline Messages

Modern IM products have no “online” status, and there is no need to show users this information. Technically, however, the client still needs to detect correctly when the user has gone offline.

There are several ways to detect this:

  • A. Signaling long-connection status: if no heartbeat response from the server arrives for a long time, the connection is considered disconnected;
  • B. Number of failed network requests: if several network requests fail, the client “may be” offline;
  • C. Device network status detection: check the network interface status directly. Android, iOS, Windows, and Mac generally provide corresponding system APIs.

Once the network status is detected correctly and the network is found to have switched from disconnected to recovered, the client proactively pulls the messages from the offline period (from the server’s offline message list). This way no messages are lost under weak network conditions.
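Detection method A combined with the pull on recovery can be sketched as follows. The `ConnectionMonitor` class, its timeout parameter, and the `pull_offline` callback are hypothetical names, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from typing import Callable


class ConnectionMonitor:
    """Tracks heartbeats and pulls offline messages when the link recovers."""

    def __init__(self, heartbeat_timeout_s: float, pull_offline: Callable[[], None]):
        self.timeout = heartbeat_timeout_s
        self.pull_offline = pull_offline   # hypothetical: fetches the offline message list
        self.last_heartbeat = 0.0
        self.online = True

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now
        if not self.online:
            # Transition "disconnected -> recovered": catch up on missed messages.
            self.online = True
            self.pull_offline()

    def tick(self, now: float) -> None:
        # Called periodically; marks the link as down after a silent period.
        if self.online and now - self.last_heartbeat > self.timeout:
            self.online = False


# Usage: heartbeats stop for a while, then resume.
pulls = []
monitor = ConnectionMonitor(30.0, lambda: pulls.append("pull"))
monitor.on_heartbeat(0.0)
monitor.tick(20.0)          # still within the timeout
monitor.tick(45.0)          # no heartbeat for 45 s: considered offline
was_offline = not monitor.online
monitor.on_heartbeat(60.0)  # recovered: triggers exactly one pull
monitor.on_heartbeat(70.0)  # already online: no extra pull
```

Methods B and C could feed the same `online` flag, so that whichever signal notices the recovery first triggers the pull.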

The network status detection mentioned above involves the connection check and keep-alive mechanism in IM, which is a headache in IM.

This leads straight into the pitfall of IM network keep-alive. I will not expand on it here; if you are interested, be sure to read the following articles:

“Why Does a TCP-based Mobile IM Still Need a Heartbeat Keep-alive Mechanism?”

“Understanding the Network Heartbeat Packet Mechanism in Instant Messaging Applications: Functions, Principles, Implementation Ideas, etc.”

“WeChat Team Original Share: Android WeChat Background Keep-alive in Practice (Network Keep-alive)”

“Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat”

“Mobile IM Practice: Analysis of the Heartbeat Strategies of WhatsApp, Line and WeChat”

8.5 Sorting of Resending Messages

Another pitfall in weak network logic is message ordering.

Suppose messages A, B, and C are sent in turn, but B hits an intermittent network drop while being sent and triggers an automatic retry.

So should the receiver see the order A, B, C or A, C, B? I have looked at different IM products, and their processing logic differs. This is a good one to think over.

The solution relies on the interpolation sorting mentioned in the previous service architecture article: messages sent by the same person are sorted by the local time attached to each message, and messages from different people are sorted by server time.

I will not repeat the details here; see Section 4 of this article.

8.6 Offline instruction processing

When the user performs certain operations, the network may be down. After the network is restored, these commands must be automatically synchronized to the server.

For example, set your phone to airplane mode, delete a contact in WeChat, and see whether the deletion works. Then turn the network back on and see whether the change is synchronized to the server.

The same logic applies to read synchronization: read state recorded while offline must be synchronized to the server.
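One way to picture offline command handling is a local queue that is flushed in order once the network returns. The sketch below assumes a `send` callback that returns `True` when the server accepted a command; the class name and the command dictionaries are made up for illustration.

```python
from collections import deque
from typing import Callable, Deque


class OfflineCommandQueue:
    """Queues user commands locally and replays them in order when possible."""

    def __init__(self, send: Callable[[dict], bool]):
        self.send = send                  # hypothetical: True if the server accepted it
        self.pending: Deque[dict] = deque()

    def submit(self, command: dict) -> None:
        # Apply locally right away (not shown), then try to sync.
        self.pending.append(command)
        self.flush()

    def flush(self) -> None:
        # Call again whenever the network is detected as restored.
        while self.pending:
            if not self.send(self.pending[0]):
                break                     # still offline: keep order, retry later
            self.pending.popleft()


# Usage: two commands issued in airplane mode, synced after reconnecting.
network = {"up": False}
synced = []

def send(command: dict) -> bool:
    if not network["up"]:
        return False
    synced.append(command["op"])
    return True

queue = OfflineCommandQueue(send)
queue.submit({"op": "delete_contact", "id": "u42"})
queue.submit({"op": "mark_read", "session": "s1"})
queued_while_offline = len(queue.pending)
network["up"] = True
queue.flush()
```

A real client would also persist the queue so that commands survive an app restart, which this sketch omits.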

8.7 Summary

Weak network handling in IM is actually relatively simple: automatic retry plus message state solves most of the problems.

The details are not complicated either, mainly because IM message volume is relatively small, so normal operation resumes quickly once the network recovers.

Weak network logic for video conferencing is much more complicated, especially in environments with high packet loss, where audio and video must be kept as smooth as possible.

9. Summary of this paper

With this, both parts of “A set of IM Architecture Technology for 100 million Users” are finished. Part I covered IM architecture issues, while Part II wandered into the many hot-issue “pits” of IM. IM development is a long story…

If you want to learn mobile IM development systematically, I suggest reading my “from getting started to giving up” guide to IM development (ha ha), “One Entry for Beginners Is Enough: Developing Mobile IM from Zero”. I will not elaborate further, or this article would never end… (This article was published simultaneously at www.52im.net/thread-3445…)

10. References

[1] Design of large-scale concurrent IM service architecture

[2] IM weak network scenario optimization

[3] Introduction to Zero-based IM Development (3): What is IM System Reliability?

[4] IM Message Delivery Guarantee Mechanism (1): Ensure the Reliable Delivery of Online Real-time Messages

[5] IM Development Dry Goods Sharing: How to Gracefully Deliver a Large Number of Offline Messages Reliably

[6] [C] / / Proceedings of the National Academy of Sciences

[7] WeChat Next-generation Communication Security Solution: A Detailed Explanation of MMTLS Based on TLS 1.3