Preface

This article will not provide a one-size-fits-all IM solution, nor will it judge whether any particular architecture is good or bad. Instead, it discusses common challenges in IM system design and the solutions the industry has adopted. There is no universal solution: every design has its own trade-offs, and the system that best meets the needs of the business is the good system. Moreover, with limited people, resources, and time, there are many trade-offs to make; in that situation, a system that can be iterated rapidly and extended easily is a good system.

Core Concepts of IM

User: User of the system

Message: the content communicated between users. IM messages generally fall into the following categories: text, emoticons, images, videos, and files

Conversation: usually refers to the relationship established between two users as a result of chatting

Group: usually refers to a relationship formed among multiple users who chat together

Terminal: the client on which the user accesses the IM system, usually Android, iOS, Web, and so on

Unread: Indicates the number of messages that the user has not read

User status: the status of the user, such as online, offline, or away

Relationship chain: the relationships between users, usually one-way friend relationships, two-way friend relationships, follow relationships, and so on. Note the difference from a conversation: a conversation is created only when a user initiates a chat, while a relationship can exist without any chat. Relationship chains can be stored in a graph database (Neo4j, etc.), which naturally expresses real-world relationships and is easy to model

Single chat: one-on-one chat

Group chat: chat among multiple users in a group

Customer service: in e-commerce, users usually need pre-sales and after-sales consulting, so customer service agents are introduced to handle user inquiries

Message routing: in e-commerce, a store usually has multiple customer service agents, so deciding which agent handles a user's inquiry is message routing. Routing generally follows a series of rules to determine which agent a message goes to, such as whether the agent is online (if not, the message must be re-routed to another agent), whether the inquiry is pre-sales or after-sales, and how busy each agent currently is

Mailbox: in this article, a mailbox refers to a Timeline, i.e., a queue used for sending and receiving messages

Read diffusion vs write diffusion

Read diffusion

Let's look at read diffusion first. As shown in the figure above, A has a mailbox (some posts call it a Timeline) for each person and each group he chats with. When checking for chat messages, A needs to read every mailbox that has new messages. Read diffusion here differs from read diffusion in Feeds systems: in a Feeds system, everyone has their own write mailbox, a write only goes to the writer's own mailbox once, and a read must pull from everyone else's mailboxes. In IM systems, read diffusion usually means one mailbox per pair of related users, or one mailbox per group.

Advantages of read diffusion:

  • The write operation (sending a message) is very light; whether single chat or group chat, you only need to write to the corresponding mailbox once
  • Each mailbox naturally holds the chat history of the two parties, making it convenient to view and search chat records

Disadvantages of read diffusion:

  • Read operations (reading messages) are heavy

Write diffusion

Now let’s look at write diffusion.

In write diffusion, each person only reads messages from their own mailbox, but writing (sending) a message is handled as follows for single chat and group chat:

  • Single chat: write the message to both your own mailbox and the other party's mailbox, and write a third copy if you need a combined chat history of the two of you (you could instead reconstruct the full history from the personal mailboxes, but that would be inefficient).
  • Group chat: write the message to every group member's mailbox, and write another copy if you need the group's chat history. As you can see, write diffusion greatly amplifies write operations for group chats.

Advantages of write diffusion:

  • The read operation is very lightweight
  • It is very convenient to do multi-terminal synchronization of messages

Disadvantages of write diffusion:

  • Write operations are heavy, especially for group chats

Note that in the Feeds system:

  • Write diffusion is also called Push, fan-out, or write-fanout
  • Read diffusion is also called Pull, fan-in, or read-fanout
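The contrast between the two models can be sketched with in-memory mailboxes (a toy Python sketch; the class and variable names are my own, not from any real IM system):

```python
from collections import defaultdict

class ReadDiffusion:
    """One shared mailbox per conversation/group; a send is a single write."""
    def __init__(self):
        self.mailboxes = defaultdict(list)  # conversation_id -> messages

    def send(self, conversation_id, message):
        self.mailboxes[conversation_id].append(message)
        return 1  # number of mailbox writes

class WriteDiffusion:
    """One mailbox per user; a send fans out to every recipient's mailbox."""
    def __init__(self):
        self.mailboxes = defaultdict(list)  # user_id -> messages

    def send(self, sender, recipients, message):
        writes = 0
        for user in {sender, *recipients}:  # including the sender's own mailbox
            self.mailboxes[user].append(message)
            writes += 1
        return writes

# A 500-member group chat: read diffusion writes once,
# write diffusion writes once per member.
rd, wd = ReadDiffusion(), WriteDiffusion()
members = [f"u{i}" for i in range(500)]
print(rd.send("group42", "hello"))                # 1
print(wd.send(members[0], members[1:], "hello"))  # 500
```

This makes the trade-off concrete: group size directly multiplies write cost under write diffusion, which is why systems using it tend to cap group membership.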

Unique ID design

In general, ID designs fall into the following categories:

  • UUID
  • Snowflake-based generation
  • Generation based on an application-level DB step size (segment allocation)
  • Generation based on Redis or DB auto-increment
  • Unique IDs generated by special rules

For implementation details and pros and cons, see my previous post on distributed unique ID solutions.

Unique IDs are needed in IM systems for:

  • Session IDs
  • Message IDs

Message ID

Let's look at three considerations when designing message IDs.

Is it OK if message IDs are not incrementing?

Let's see what happens if they are not incrementing:

  • If strings are used, storage space is wasted, and the storage engine cannot keep adjacent messages physically together, which hurts both write and read performance
  • If numbers are used but they are random, the storage engine still cannot keep adjacent messages together, which increases random I/O and reduces performance; and random IDs cannot guarantee uniqueness

Therefore, message IDs are best made incrementing.

Global increment vs. user-level increment vs. session-level increment

Global increment: message IDs increase over time across the whole IM system. Snowflake is generally used for global increment (strictly speaking, Snowflake only increments per worker). With global increment, if your system uses read diffusion, then to avoid losing messages each message needs to carry the ID of the previous message; the front end uses that previous ID to detect whether anything was lost and, if so, pulls again.
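For reference, a minimal Snowflake-style generator might look like this (a hedged sketch: the bit layout follows the common 41/10/12 split, and it assumes the clock never moves backwards):

```python
import threading
import time

class Snowflake:
    """Minimal Snowflake-style generator: 41-bit timestamp | 10-bit worker | 12-bit
    sequence. IDs are monotonically increasing per worker, not globally continuous."""
    def __init__(self, worker_id, epoch_ms=1288834974657):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self.last_ts = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            ts = int(time.time() * 1000) - self.epoch_ms
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:          # sequence exhausted in this ms
                    while ts <= self.last_ts:   # spin to the next millisecond
                        ts = int(time.time() * 1000) - self.epoch_ms
            else:
                self.sequence = 0
            self.last_ts = ts
            return (ts << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=1)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # True: later IDs are always larger on the same worker
```

Because the high bits are the timestamp, IDs sort by creation time, which is exactly the property the storage-engine argument above relies on.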

User-level increment: message IDs are guaranteed to increase only for a single user; IDs for different users are independent and may repeat. Typical example: WeChat. If a write diffusion system uses this, the mailbox timeline ID and the message ID need to be designed separately: the mailbox timeline ID increases per user, while the message ID increases globally. For a read diffusion system, there is little need for user-level increment.

Session-level increment: message IDs are guaranteed to increase only within a single session; different sessions are independent and IDs may repeat. Typical example: QQ.

Continuously increasing vs. monotonically increasing

Continuously increasing means IDs are generated as 1, 2, 3, …, n; monotonically increasing means a later ID is always larger than an earlier one, but need not be continuous.

As far as I know, QQ's message IDs use session-level continuous increment. The advantage is that if a message is lost, the next message reveals a non-continuous ID, so loss can be detected. One might ask: can't the client just pull periodically to check for missing messages? Certainly not: since message IDs are only continuous within a session, a user with thousands of sessions would require that many pulls each time, and the server could not withstand it.

For read diffusion, continuously incrementing message IDs is a good approach. If monotonic increment is used instead, each message must carry the ID of the previous message (the chat messages form a linked list) so that loss can be detected.
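The linked-list idea can be sketched as follows (hypothetical message format; `prev_id` is my assumed field name for the carried previous-message ID):

```python
def find_missing(messages):
    """Each message carries the ID of the previous message (prev_id), forming a
    linked list. Walk the list in arrival order and report positions after which
    something was lost and must be re-pulled. The first message has prev_id None."""
    missing_after = []
    last_id = None
    for msg in messages:
        if msg["prev_id"] != last_id:
            missing_after.append(last_id)  # a gap between last_id and this message
        last_id = msg["id"]
    return missing_after

# IDs are monotonically (not continuously) increasing, e.g. Snowflake-style.
msgs = [
    {"id": 101, "prev_id": None},
    {"id": 205, "prev_id": 101},
    {"id": 391, "prev_id": 310},  # gap: message 310 never arrived
]
print(find_missing(msgs))  # [205]
```

With continuous session-level IDs the same check degenerates to `id != last_id + 1`, with no need to carry `prev_id` at all.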

To sum up:

  • Write diffusion: the mailbox timeline ID increments at the user level, and the message ID increments globally
  • Read diffusion: message IDs can increment at the session level, preferably continuously

Session ID

Let's look at some issues to address when designing session IDs.

There is a simple way to generate session IDs (the "unique IDs by special rules" approach): concatenate from_user_id with to_user_id:

  1. If both from_user_id and to_user_id are 32-bit integers, a 64-bit session ID is easily formed with bit operations: conversation_id = (from_user_id << 32) | to_user_id (before concatenating, make sure the smaller user ID is used as from_user_id, so that either of the two users can easily derive the same session ID).
  2. If from_user_id and to_user_id are both 64-bit integers, they can only be concatenated into a string; concatenating into a string wastes storage space and hurts performance.
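The first method takes only a few lines (an illustrative sketch; it assumes user IDs fit in 32 bits, as stated above):

```python
def conversation_id(user_a, user_b):
    """Concatenate two 32-bit user IDs into one 64-bit conversation ID.
    The smaller ID goes in the high 32 bits so either party derives the same ID."""
    assert 0 <= user_a < 2**32 and 0 <= user_b < 2**32
    lo, hi = sorted((user_a, user_b))
    return (lo << 32) | hi

# Both participants compute the same ID regardless of who initiates.
print(conversation_id(42, 7) == conversation_id(7, 42))  # True
print(hex(conversation_id(7, 42)))                       # 0x70000002a
```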

My former company used the first method above, but it has a flaw: as the business expands globally, if the 32-bit user ID becomes insufficient and must grow to 64 bits, drastic changes are required. 32-bit integer IDs can seemingly hold 2.1 billion users, but to keep real user counts private, non-consecutive IDs are usually used, and then 32 bits are far from enough. A design that depends entirely on the user ID is therefore not desirable.

Therefore, session IDs can instead be generated with global increment, plus a mapping table that stores the relationship between from_user_id, to_user_id, and conversation_id.

Push vs. pull vs. push-pull

In IM systems, there are three possible ways to get new messages:

  • Push mode: when there is a new message, the server actively pushes it to all of the user's terminals (iOS, Android, PC, etc.)
  • Pull mode: the front end initiates requests to pull messages. To keep messages real-time, push mode is generally used for new messages; pull mode is generally used for fetching historical messages
  • Push-pull mode: when there is a new message, the server pushes a new-message notification to the front end, and the front end then pulls the messages from the server

The simplified figure of push mode is as follows:

As shown in the figure above, a message sent by a user is stored by the server and then pushed to all of the receiver's terminals. But the push may be lost. The most common case is that the user is only pseudo-online: if the push service is based on long connections, a long connection may already be broken (the user has dropped offline), but the server usually only notices after a heartbeat cycle, and until then it wrongly assumes the user is still online. ("Pseudo-online" is a term I coined myself; I could not think of a better word for it.) Therefore, push mode alone can lose messages.

The simplified figure of push-pull combination mode is as follows:

Push-pull mode can solve the message-loss problem of pure push mode: when a user sends a new message, the server pushes a notification, and the front end then requests the latest message list. To further guard against loss, the front end can also actively pull once every period of time. As you can see, push-pull mode works best with write diffusion, because write diffusion only needs to pull one personal-mailbox timeline, while read diffusion has N timelines (one per mailbox); pulling all of them periodically would perform poorly.

Industry Solutions

Now that you've seen the common IM system design issues, let's look at how the industry designs IM systems. Studying mainstream industry solutions helps us understand IM system design. The research below is based on publicly available information and may not be accurate; please treat it as reference only.

WeChat

Although many of WeChat's underlying frameworks are self-developed, that does not prevent us from understanding its architecture. From WeChat's article "From 0 to 1: The Evolution of the WeChat Backend System", we can see that WeChat mainly uses write diffusion combined with push-pull mode. Since group chats also use write diffusion, which consumes resources as groups grow, WeChat caps group size (currently 500). This is an obvious disadvantage of write diffusion: supporting groups of tens of thousands is difficult.

The article also shows that WeChat adopts a multi-data-center architecture:

Each WeChat data center is autonomous and holds a full copy of the data; data is synchronized between data centers through a self-developed message queue. To ensure consistency, each user belongs to exactly one data center and can only read and write data in that data center; if a user connects to a different data center, they are automatically redirected to their own. Accessing another user's data likewise only requires accessing your own data center. Meanwhile, WeChat uses a three-campus disaster recovery architecture and Paxos to ensure data consistency.

From WeChat's article "A Trillion-Call System: The Architecture Design and Evolution of WeChat's Sequence Number Generator", we can see that WeChat's ID design uses DB-step-size-based generation plus user-level increment. As shown below:

WeChat's sequence number generator uses an arbitration service to generate a routing table (which holds the full mapping of UID number segments to AllocSvr instances); the routing table is synchronized to both AllocSvr and Client. If an AllocSvr fails, the arbitration service reschedules its UID number segments to other AllocSvr instances.

DingTalk

There is not much public information about DingTalk. From the article "Alibaba DingTalk Technology Sharing: The Outstanding Aspects of the Backend Architecture of the Enterprise IM King, DingTalk", we can only learn that DingTalk initially used the write diffusion model and later, in order to support very large groups, appears to have optimized toward read diffusion.

But when it comes to Alibaba's IM systems, we have to mention Alibaba's Tablestore. Normally, an IM system has a separate auto-increment ID generation system, but Tablestore creatively introduced auto-increment primary key columns, integrating ID generation into the DB layer and supporting user-level increment (traditional MySQL can only do table-level auto-increment, i.e., global increment). For details, see "How to Optimize a High-Concurrency IM System Architecture".

Twitter

What? Isn't Twitter a Feeds system? Isn't this article about IM? Yes, Twitter is a Feeds system, but Feeds systems and IM systems actually share a lot of design, and studying Feeds systems provides useful references for designing IM systems. Besides, there is no harm in studying Feeds systems to broaden your technical horizons.

Twitter's incrementing ID design is probably well known: the famous Snowflake. Its IDs are therefore globally increasing.

As you can see from the talk "How We Learned to Stop Worrying and Love Fan-In at Twitter", Twitter started out with a pure write diffusion model: the Fanout Service writes to the Timelines Cache (using Redis), the Timeline Service reads the timeline data, and the API Services return it to the user.

However, since write diffusion is too expensive for big-V accounts (users with huge followings), Twitter later adopted a combination of write diffusion and read diffusion. As shown below:

Users with few followers still use the write diffusion model when tweeting; the Timeline Mixer service then merges the user's own timeline, the timelines of the big Vs they follow, and system recommendations, and the API Services return the result to the user.

58 Daojia

58 Daojia has implemented a general-purpose real-time messaging platform:

msg-server stores the mapping between applications and MQ topics; according to this configuration, msg-server pushes messages to different MQ queues, which the specific applications then consume. Adding a new application therefore only requires a configuration change.

To ensure reliable message delivery, 58 Daojia also introduced an acknowledgment mechanism: the message platform first persists the message to the database, and only deletes it after receiving an application-layer ACK from the receiver. The acknowledgment mechanism works best with single-device login; if multiple devices can be logged in simultaneously, things get more complicated, because all devices must acknowledge the message before it can be deleted.

As you’ve probably seen, designing an IM system can be challenging. Let’s move on to the considerations of designing an IM system.

Problems an IM system needs to solve

How to ensure real-time message delivery

For the communication protocol, we mainly have the following choices:

  1. TCP socket communication with a self-designed protocol: 58 Daojia and others
  2. UDP socket communication: QQ and others
  3. HTTP long polling: WeChat Web and others

Whichever way we choose, we can achieve real-time message notification. What may actually hurt message timeliness is how we process messages. For example, if we push via MQ and a message must reach ten thousand people, and a single push takes 2 ms, then pushing serially to ten thousand people takes 20 seconds, and subsequent messages are blocked for those 20 seconds. If we need to finish the push within 10 ms, the required push concurrency is: 10,000 recipients / (10 ms total budget / 2 ms per push) = 2,000.

Therefore, when choosing a concrete implementation, we must evaluate the system's throughput, and every link in the chain must be load-tested. Only when the throughput of every link has been evaluated can real-time message pushing be guaranteed.
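The arithmetic above generalizes to a small helper (a sketch; the function name and parameters are my own):

```python
import math

def required_concurrency(recipients, per_push_ms, deadline_ms):
    """How many parallel push workers are needed to notify all recipients
    within the deadline, given the latency of a single push."""
    pushes_per_worker = deadline_ms / per_push_ms  # pushes one worker can do in the budget
    return math.ceil(recipients / pushes_per_worker)

# The example from the text: 10,000 recipients, 2 ms per push, 10 ms budget.
print(required_concurrency(10_000, per_push_ms=2, deadline_ms=10))  # 2000
```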

How to ensure message ordering

Messages may be out of order in the following cases:

  • If HTTP is used instead of a long connection, message sending can be out of order. The backend is usually a cluster, so HTTP requests may land on different servers; due to network latency or differing processing speeds, a message sent later may be processed first, causing disorder. Solutions:
      • The front end sends messages strictly one at a time, waiting for each to complete before sending the next. This hurts user experience and is not recommended.
      • Attach a front-end-generated order ID and let the receiver sort by that ID. This makes front-end processing more cumbersome, and during a chat a message may be inserted into the middle of the receiver's history list, which looks strange and may cause the user to miss it. This can be mitigated by re-sorting when the user switches windows, with the receiver simply appending each incoming message to the end.
  • To optimize the experience, some IM systems use an asynchronous send-confirmation mechanism (e.g., QQ): the message reaches the server, the server writes it to MQ, and if sending fails (due to permissions or other issues) the backend pushes a failure notification down. In this case, MQ needs an appropriate sharding strategy:
      • Sharding by to_user_id: with multi-device synchronization, the sender's own devices may see messages out of order, because different queues are processed at different speeds. For example, the sender sends M1 then M2, but the server may process M2 first; the sender's other device then receives M2 before M1, and its session list becomes disordered.
      • Sharding by conversation_id: this likewise leads to out-of-order multi-device synchronization.
      • Sharding by from_user_id: in this case, this strategy is the better choice.
      • For performance, pushing to MQ is often done before pushing to users; in that situation, sharding by to_user_id is the better choice.
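A minimal sketch of key-based queue sharding (illustrative only; real MQ clients such as Kafka expose their own partitioners, this just shows why a deterministic key preserves per-sender order):

```python
import hashlib

def shard_for(key, num_queues):
    """Route a message to an MQ queue by hashing a key. Hashing by from_user_id
    keeps one sender's messages in a single queue, so they are consumed in
    send order; a different key (to_user_id, conversation_id) splits a
    sender's messages across queues that drain at different speeds."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_queues

# All of one sender's messages land in the same queue every time.
q1 = shard_for("sender-42", 8)
q2 = shard_for("sender-42", 8)
print(q1 == q2)  # True: deterministic routing
```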

How to handle user online status

Many IM systems need to show user status: online, busy, away, etc. User online status can be stored using Redis or distributed consistent hashing.

  1. Storing user online status in Redis

Looking at the figure above, some may wonder why Redis needs to be updated on every heartbeat; with a TCP long connection, isn't that unnecessary? Indeed, normally the server only needs to update Redis when a connection is created or dropped. However, the server may crash, or the network between the server and Redis may fail, and then an event-based update is lost and the user's status becomes wrong. So if accuracy matters, it is best to refresh the online status on every heartbeat.

Since a single Redis instance is standalone storage, Redis Cluster or Codis can be used to improve reliability and performance.
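The heartbeat-refresh idea maps naturally to a Redis `SET user:<id> online EX <ttl>` on every heartbeat: if heartbeats stop, the key expires and the user goes offline automatically. A self-contained stand-in (no real Redis; timestamps are passed explicitly so the expiry behavior is visible):

```python
class PresenceStore:
    """In-memory stand-in for Redis SET-with-expiry: each heartbeat refreshes
    a key whose TTL spans a few heartbeat periods, so a crashed connection
    goes offline automatically when heartbeats stop."""
    def __init__(self, heartbeat_s=30, grace_periods=2):
        self.ttl = heartbeat_s * grace_periods
        self.expires_at = {}  # user_id -> expiry timestamp

    def heartbeat(self, user_id, now):
        self.expires_at[user_id] = now + self.ttl  # same effect as SET ... EX ttl

    def is_online(self, user_id, now):
        return self.expires_at.get(user_id, 0) > now

store = PresenceStore(heartbeat_s=30)
store.heartbeat("alice", now=1000)
print(store.is_online("alice", now=1030))  # True: within TTL
print(store.is_online("alice", now=1100))  # False: heartbeats stopped
```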

  2. Storing user online status with distributed consistent hashing

With distributed consistent hashing, user status must be migrated when the status server cluster scales up or down; otherwise user status will be inconsistent immediately after the operation. In addition, virtual nodes are needed to avoid data skew.
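A minimal consistent hash ring with virtual nodes (a sketch; the node names and vnode count are arbitrary):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes: users spread evenly across
    status servers, and adding/removing a server only remaps ~1/N of the keys."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth out data skew
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(str(key))
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)  # clockwise successor
        return self.ring[idx][1]

ring = HashRing(["status-1", "status-2", "status-3"])
print(ring.node_for("user-1001"))  # deterministic: same user, same server
```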

How to do multi-device synchronization

Read diffusion

As mentioned above, for read diffusion, message synchronization mainly uses push mode, with the message IDs of a single session increasing continuously. If the front end finds a gap in message IDs, it asks the backend for the missing messages. But the last message of a session can still be lost. To improve reliability, the historical session list can store each session's last message ID. When the front end receives a new message, it first pulls the latest session list and checks whether the new message is that session's last message; if not, messages may have been lost and the front end re-pulls that session's message list. If the session's last message ID matches the last ID in the message list, the front end does nothing further. The performance bottleneck of this approach is pulling the historical session list, since every new message triggers one such pull; at WeChat's scale, that alone could be 200,000 QPS, which a traditional DB like MySQL cannot withstand. So the historical session list is best kept in a Redis cluster with AOF enabled (RDB may lose data). One can only lament that performance and simplicity rarely coexist.

Write diffusion

For write diffusion, multi-device synchronization is simpler: each device only records its last synchronization position, sends it along when syncing, and the server returns all data after that position; the device then advances its position.
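This position-based synchronization can be sketched as follows (names are my own; `sync_after` plays the role of the server-side sync API):

```python
class UserTimeline:
    """Per-user mailbox for write diffusion. Each device keeps only its last
    synced position and asks the server for everything after it."""
    def __init__(self):
        self.entries = []  # (seq, message), seq is user-level increasing

    def append(self, message):
        seq = len(self.entries) + 1
        self.entries.append((seq, message))
        return seq

    def sync_after(self, position):
        """Return all entries after the client's checkpoint, plus the new checkpoint."""
        new = [(s, m) for s, m in self.entries if s > position]
        new_position = new[-1][0] if new else position
        return new, new_position

tl = UserTimeline()
for m in ["m1", "m2", "m3"]:
    tl.append(m)
msgs, pos = tl.sync_after(1)      # this device last synced up to seq 1
print([m for _, m in msgs], pos)  # ['m2', 'm3'] 3
```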

How to handle unread counts

In IM systems, unread handling is very important. Unread counts are generally divided into per-session unread and total unread. If handled improperly, the two can become inconsistent, which seriously degrades the user experience.

Read diffusion

For read diffusion, we can keep both the session unread and the total unread on the backend, but the backend must guarantee the atomicity and consistency of the two updates, which can be achieved in the following two ways:

  1. Use Redis's MULTI transaction feature, retrying on transaction failure. Note that Codis clusters do not support transactions.
  2. Use embedded Lua scripts. This requires the session unread and total unread to live on the same Redis node (with Codis, use hashtags). This approach scatters the implementation logic and increases maintenance cost.
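The atomicity requirement can be illustrated in-process (a sketch: a lock stands in for what a Redis Lua script or MULTI/EXEC would provide server-side; all names are my own):

```python
import threading

class UnreadCounters:
    """Keep per-session unread and the total unread consistent by updating
    them in one atomic step. In Redis that single step would be a Lua script
    (or MULTI/EXEC); here a lock plays that role in-process."""
    def __init__(self):
        self.session_unread = {}  # conversation_id -> count
        self.total_unread = 0
        self.lock = threading.Lock()

    def incr(self, conversation_id):
        with self.lock:  # both counters move together or not at all
            self.session_unread[conversation_id] = \
                self.session_unread.get(conversation_id, 0) + 1
            self.total_unread += 1

    def mark_read(self, conversation_id):
        with self.lock:
            n = self.session_unread.pop(conversation_id, 0)
            self.total_unread -= n

c = UnreadCounters()
c.incr("conv1"); c.incr("conv1"); c.incr("conv2")
print(c.total_unread)                    # 3
c.mark_read("conv1")
print(c.total_unread, c.session_unread)  # 1 {'conv2': 1}
```

The invariant to preserve is that the total always equals the sum of the session counts; any code path that updates one without the other breaks it.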

Write diffusion

For write diffusion, the server usually downplays the concept of a session, i.e., it does not store a historical session list. Unread calculation can then be left to the front end: mark-read and mark-unread just write an event into the mailbox, and each device computes session unreads by replaying those events. This can make unread counts inconsistent across devices; at least WeChat exhibits this problem.

If write diffusion instead stored unreads via a historical session list, the user timeline service would be tightly coupled to the session service, and the required atomicity and consistency would force distributed transactions, greatly reducing system performance.

How to store history messages

Read diffusion

For read diffusion, only one copy needs to be stored, sharded by session ID.

Write diffusion

For write diffusion, two message lists need to be stored: one per user timeline and one per session timeline. The user-timeline list is sharded by user ID; the session-timeline list is sharded by session ID.

Hot-cold data separation

For IM, historical message storage is strongly time-ordered: the older a message, the less likely it is to be accessed and the less valuable it is.

If we need to store history messages for several years or even forever (common in e-commerce IM), separating hot and cold history data becomes essential. Hot-cold separation typically uses a Hot-Warm-Cold (HWC) architecture: a newly sent message goes into the Hot store (Redis can be used here) as well as the Warm store, and a Store Scheduler periodically migrates cold data into the Cold store according to certain rules. To fetch messages, the Hot, Warm, and Cold stores are queried in order, and the Store Service merges the data and returns it to the IM Service.
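The HWC read path can be sketched as follows (a toy in-memory version; in practice the tiers would be something like Redis, a warm DB, and cold object storage, and migration rules would be age- or access-based):

```python
class TieredMessageStore:
    """Hot-Warm-Cold sketch: reads check hot first, then warm, then cold;
    a scheduler would periodically migrate aged messages downward."""
    def __init__(self):
        self.hot, self.warm, self.cold = {}, {}, {}

    def get(self, msg_id):
        for tier in (self.hot, self.warm, self.cold):  # cheapest tier first
            if msg_id in tier:
                return tier[msg_id]
        return None

    def age(self, msg_id):
        """What the Store Scheduler does on a timer: push data one tier down."""
        if msg_id in self.hot:
            self.warm[msg_id] = self.hot.pop(msg_id)
        elif msg_id in self.warm:
            self.cold[msg_id] = self.warm.pop(msg_id)

store = TieredMessageStore()
store.hot["m1"] = "hello"
store.age("m1"); store.age("m1")            # hot -> warm -> cold
print(store.get("m1"), "m1" in store.cold)  # hello True
```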

What should the access layer do?

Load balancing at the access layer can be implemented in the following ways:

  1. Hardware load balancers, e.g., F5, A10. Hardware load balancing is high-performance and very stable, but expensive; only deep-pocketed companies need consider it.
  2. DNS-based load balancing: simple to implement, but switching or scaling takes effect slowly through DNS, the number of supported IPs is limited, and the supported policies are simple.
  3. DNS + layer-4 load balancing + layer-7 load balancing, e.g., DNS + DPVS + Nginx or DNS + LVS + Nginx. Why add layer-4 load balancing? Layer-7 load balancers are CPU-intensive and often need scaling; a large site may need many layer-7 load balancers but only a small number of layer-4 ones, so this architecture suits large short-connection applications such as HTTP. Of course, with modest traffic, DNS + layer-7 load balancing is enough. For long connections, however, using Nginx as a layer-7 load balancer is not so good: Nginx configuration changes require reloads, and TCP connections break during a reload, causing massive disconnects.
  4. DNS + layer-4 load balancing: the layer-4 load balancer is stable and rarely changed, which suits long connections.

For a long-connection access layer, if we need a more flexible load balancing strategy or need to do gray releases, we can introduce a scheduling service, as shown in the figure below:

The Access Schedule Service can assign Access Service addresses based on various policies, for example:

  • Assign according to a gray-release strategy
  • Assign according to geographic proximity
  • Assign according to least connections

Architectural thoughts

Finally, let's share some architectural tips for building large applications:

  1. Gray release! Gray release! Gray release!
  2. Monitoring! Monitoring! Monitoring!
  3. Alerting! Alerting! Alerting!
  4. Caching! Caching! Caching!
  5. Rate limiting! Circuit breaking! Degradation!
  6. Low coupling, high cohesion!
  7. Avoid single points of failure, embrace statelessness!
  8. Evaluate! Evaluate! Evaluate!
  9. Load test! Load test! Load test!

Original link: xie.infoq.cn/article/19e…

InfoQ By Chank