This article was originally shared by the RongCloud technical team, and is republished here with revisions and changes.

1. Introduction

In live video streaming, interactions such as bullet-screen comments, chatting with the host, and various business commands make up the interplay between ordinary viewers and hosts.

From a technical point of view, the underlying logic of these real-time interactions is the real-time distribution of chat messages or commands, and the technical architecture is analogous to that of IM applications: the IM chat room function.

The previous article in this series, "Practice of Real-time Chat Message Distribution Technology in a Live Room with One Million People Online," focused on message distribution and discard strategies. This article shares practical experience with the technical difficulties in designing an architecture for massive chat messages in live rooms, from the perspectives of high availability, elastic scaling, user management, message distribution, and client optimization.

2. Series of articles

This is the seventh in a series of articles:

  1. Chat Technology of Live Broadcast Systems (I): The Practical Road of Real-time Push Technology for the Meipai Live Barrage System with Millions Online
  2. Chat Technology of Live Broadcast Systems (II): Technical Practice of Alibaba's E-commerce IM Messaging Platform in Group Chat and Live Broadcast
  3. Chat Technology of Live Broadcast Systems (III): Evolution of the Message Architecture for a Single WeChat Live Chat Room with 15 Million Online
  4. Chat Technology of Live Broadcast Systems (IV): Evolution and Practice of Baidu Live's Real-time Message System Architecture for Massive Users
  5. Chat Technology of Live Broadcast Systems (V): Cross-process Rendering and Stream Pushing Practice of WeChat Mini Game Live Streaming on Android
  6. Chat Technology of Live Broadcast Systems (VI): Practice of Real-time Chat Message Distribution Technology in a Live Room with One Million People Online
  7. Chat Technology of Live Broadcast Systems (VII): Difficulties and Practice in the Design of Massive Chat Messages in Live Broadcast (* this article)

3. Main functions and technical features of live chat rooms

Today's video live room is not just a video streaming problem; it also includes user-perceivable tasks such as sending and managing multiple message types and managing users. Nowadays everything can be live-streamed, super-large live scenes are common, and some scenes have no limit on audience size. Facing the concurrency challenge of such massive real-time messages and commands, the resulting technical difficulties cannot be solved by conventional means.

Let's first summarize the main functional and technical features of today's typical video live rooms compared with traditional live rooms.

Rich message types and advanced features:

  • 1) Traditional chat capabilities: sending text, voice, pictures, etc.;

  • 2) Non-traditional chat message types such as "like" and "gift";

  • 3) Content-safety management, including setting sensitive words, anti-spam processing of chat content, etc.

Chat room management functions:

  • 1) User management: including creating, joining, destroying, banning, querying, and blocking (kicking) users, etc.;

  • 2) Whitelisted users: whitelisted users are in a protected state and will not be automatically kicked out, and their messages have the highest sending priority;

  • 3) Message management: including message priority, message distribution control, etc.;

  • 4) Real-time statistics and message routing.

Audience scale and behavior characteristics:

  • 1) No upper limit on the number of viewers: for some large live scenes, such as the Spring Festival Gala or the National Day military parade, the cumulative number of viewers in a room is often in the tens of millions, and the number of simultaneous viewers can reach millions;

  • 2) User behavior: users enter and leave live rooms very frequently; in a hot room, entries and exits may reach tens of thousands per second, which poses a great challenge to the service's ability to handle users going online and offline and to user management.

Massive message concurrency:

  • 1) Large message concurrency: since there is no hard upper limit on room size, live chat rooms face massive concurrent messages (for a room with a million members, the message uplink is already huge, and the distribution volume is amplified by orders of magnitude);

  • 2) High real-time requirements: if the server only performs peak shaving on messages, the backlog of peak messages increases overall message latency.

Regarding point 2 above, the cumulative effect of latency causes messages to drift away from the live video stream on the timeline, which hurts the real-time interactivity of watching the broadcast. Fast distribution of massive messages is therefore essential.

4. Live chat room architecture design

A highly available system needs to support automatic failover, precise circuit breaking and degradation, service governance, rate limiting, rollback, and automatic scaling out and in.

The live chat room system architecture is as follows:

As shown in the figure above, the system architecture is mainly divided into three layers:

  • 1) Connection layer: mainly manages the long-lived connections between the service and clients;

  • 2) Storage layer: currently Redis, used as a second-level cache, mainly storing chat room information (such as member lists, black/white lists, ban lists, etc.); when the service is updated or restarted, the backed-up chat room information can be reloaded from Redis;

  • 3) Business layer: the core of the whole chat room. To achieve cross-machine-room disaster recovery, the service is deployed in multiple availability zones and, by capability and responsibility, is split into a chat room service and a message service.

Specific responsibilities of the chat room service and the message service:

  • 1) Chat room service: mainly handles management requests, such as member entry and exit, blocking/banning, and uplink message processing and review;

  • 2) Message service: mainly caches the user information and message queue information handled by its node, and distributes messages within chat rooms.

In scenarios with massive users and high concurrency, message distribution capability determines system performance. In a live chat room with a million users, for example, one uplink message corresponds to a million distributions; at this scale, distributing massive messages from a single server is simply not feasible.

Our optimization idea is to split a chat room's members across different message services: after the chat room service receives a message, it fans the message out to the message services, and each message service then distributes it to its users.

Take a live chat room with one million people online as an example: assuming 200 message services in total, each message service manages about 5,000 people on average and only needs to distribute messages to the users that fall on its own node.

Node selection logic:

  • 1) In the chat room service: uplink chat room signaling selects a node by consistent hashing on the chat room ID;

  • 2) In the message service: consistent hashing on the user ID determines which message service a user falls on.

Consistent hashing yields relatively stable node selection, which converges a chat room's activity onto a single node and greatly improves the service's cache hit ratio.
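The node-selection logic above can be sketched with a standard consistent hash ring. The Python below is a minimal illustration, not the production implementation; the node names, replica count, and MD5 hash function are assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._keys = []   # sorted ring positions
        self._ring = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        # Each node owns several virtual points to smooth the distribution.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._ring[h] = node

    def get(self, key: str) -> str:
        # First ring point clockwise from the key's hash position.
        h = self._hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._ring[self._keys[idx]]

# Chat room signaling: pick a chat room service node by room ID.
room_ring = ConsistentHashRing([f"chatroom-svc-{i}" for i in range(4)])
# Message distribution: pick a message service node by user ID.
msg_ring = ConsistentHashRing([f"msg-svc-{i}" for i in range(200)])

# The same key always maps to the same node, so a room's (or user's)
# state converges on one node and its cache stays hot.
assert room_ring.get("room-42") == room_ring.get("room-42")
assert msg_ring.get("user-1001") == msg_ring.get("user-1001")
```

The virtual-node trick also limits the blast radius of scaling: adding or removing one node only remaps the keys adjacent to its ring points, which matters for the migration behavior described in section 5.3.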

Member entry and exit, black/white list settings, and message-sending checks can then be handled directly in memory, without hitting a third-party cache on every request, which improves the chat room's response and distribution speed.

Finally, Zookeeper is used mainly for service discovery in this architecture; all service instances are registered with Zookeeper.

5. Live chat room elastic scaling

5.1 Overview

As live streaming becomes more and more widely accepted, rooms with ever-growing audiences are increasingly common, and server pressure rises accordingly. It is therefore very important to scale capacity out and in smoothly as service pressure gradually increases or decreases.

For automatic scaling, the solutions in the industry are broadly the same: find the bottleneck of a single server through load testing → judge whether to scale out or in by monitoring business data → when the configured conditions are triggered, raise an alarm and scale automatically.

Given how business-critical the live chat room is, scaling out or in must not affect the chat room business as a whole.

5.2 Scaling the chat room service

When the chat room service scales in, member lists, ban lists, blacklists, and so on are loaded from Redis.

Note: before self-destructing a chat room, check whether the room belongs to the current node. If not, skip the destruction logic, so that the destruction does not cause data loss in Redis.

The details of the chat room service scaling scheme are as follows:

5.3 Scaling the message service

When the message service scales out, most members need to be re-routed to new message service nodes according to consistent hashing. This process disturbs the current member placement and triggers an overall member migration.

1) During scale-out: we migrate members gradually, in order of chat room activity level.

2) When a message arrives: the message service traverses all users cached on its node to push pull notifications, and during this traversal judges whether each user still belongs to this node (if not, the user is synchronized to the node they now belong to).

3) When pulling messages: if a user who pulls messages is not in the local cache list, the message service asks the chat room service to confirm whether the user is in the room (if yes, the user is synchronized into this message service; if not, the request is discarded).

4) During scale-in: the message service obtains all members from the shared Redis, filters out the users belonging to this node by the placement calculation, and puts them into its user management list.

6. Managing massive users online

Chat room service: manages everyone's entry and exit; member list changes are also asynchronously persisted to Redis.

Message service: maintains its own chat room members; when a user actively joins or exits a room, the target node is calculated by consistent hashing and the change is synchronized to the corresponding message service.

After the chat room receives a message: the chat room service broadcasts it to all chat room message services, which then push pull notifications. The message service monitors each user's message pulling. In an active chat room, if a user has not pulled within 30 seconds, or has accumulated 30 unpulled messages, the message service judges the user to be offline, kicks the user out, and synchronizes with the chat room service to mark the member offline.
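The offline judgment described above (no pull for 30 seconds, or 30 accumulated unpulled messages, in an active room) can be sketched as follows. Class and method names are illustrative; the real service would also notify the chat room service when kicking a member.

```python
import time

OFFLINE_SECONDS = 30   # thresholds taken from the text
OFFLINE_BACKLOG = 30

class MemberState:
    def __init__(self):
        self.last_pull = time.time()
        self.unpulled = 0

class OfflineDetector:
    """Sketch of the message service's offline check for one room."""

    def __init__(self):
        self.members = {}

    def join(self, uid):
        self.members[uid] = MemberState()

    def on_message(self):
        # A new message arrived: every cached member has one more
        # message they have not pulled yet.
        for st in self.members.values():
            st.unpulled += 1

    def on_pull(self, uid, now=None):
        # The member actually pulled: reset the backlog and timer.
        st = self.members[uid]
        st.unpulled = 0
        st.last_pull = now if now is not None else time.time()

    def sweep(self, now=None):
        """Return members judged offline and remove them from the node."""
        now = now if now is not None else time.time()
        gone = [uid for uid, st in self.members.items()
                if now - st.last_pull > OFFLINE_SECONDS
                or st.unpulled >= OFFLINE_BACKLOG]
        for uid in gone:
            del self.members[uid]   # would also sync offline to chat room svc
        return gone
```

A periodic `sweep()` per room keeps the member cache from filling up with silently-disconnected viewers, which matters given how frequently users enter and leave hot rooms.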

7. Distribution strategy of massive chat messages

The message distribution and pull scheme of the live chat room service is as follows:

7.1 Notification pull

In the figure above, user A sends a message in the chat room, which is first processed by the chat room service. The chat room service synchronizes the message to each message service node, and each message service pushes a pull notification to all members cached on its node (in the figure, the server notifies user B and user Z).

During message distribution, the server merges notifications.

The detailed process of notification pulling is as follows:

  • 1) When a message arrives, all members are added to the pending-notification queue (if a member is already in the queue, only the notification message time is updated);

  • 2) The delivery thread polls the pending-notification queue;

  • 3) A pull notification is sent to each user in the queue.

This process ensures that within one round the delivery thread sends at most one pull notification to the same user (multiple messages merge into one pull notification), effectively improving server performance and reducing network traffic between client and server.
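The per-round notification merge can be illustrated with a small pending-notification map. This is a sketch only; the actual queue structure and timestamp semantics are assumptions.

```python
class NotifyQueue:
    """Sketch of merged notification pulls: per delivery round, each member
    gets at most one 'pull' notification, however many messages arrived."""

    def __init__(self):
        self._pending = {}   # uid -> latest message timestamp

    def enqueue(self, uid, msg_ts):
        # If the member is already queued, only the timestamp is refreshed:
        # multiple messages merge into a single pending notification.
        self._pending[uid] = msg_ts

    def drain(self):
        """One round of the delivery thread: one notification per member."""
        batch = list(self._pending.items())
        self._pending.clear()
        return batch

q = NotifyQueue()
for ts in (100, 101, 102):     # three messages arrive for user B
    q.enqueue("user-B", ts)
q.enqueue("user-Z", 102)
# One merged notification per member, not one per message.
assert q.drain() == [("user-B", 102), ("user-Z", 102)]
```

Because the client pulls by timestamp anyway (section 7.2), collapsing three notifications into one loses nothing: the single pull still returns all three messages.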

7.2 Message pull

The user’s message pulling process is shown as follows:

As shown in the figure above, user B receives the notification and sends a pull request to the server. The request is ultimately handled by message node 1, which returns a list of messages from its message queue based on the timestamp of the last message passed by the client (see the figure below).

An example of a client message pull:

The client's maximum local timestamp is 1585224100000; the two messages with larger timestamps are pulled from the server.
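The timestamp-based incremental pull can be sketched as follows, reusing the example timestamp above; the queue structure is an assumption.

```python
import bisect

class MessageQueue:
    """Sketch of incremental pull: the server keeps a timestamp-ordered
    queue and returns only messages newer than the client's local maximum."""

    def __init__(self):
        self._ts = []     # sorted message timestamps
        self._msgs = []   # messages aligned with self._ts

    def append(self, ts, msg):
        # Messages arrive in timestamp order in this sketch.
        self._ts.append(ts)
        self._msgs.append(msg)

    def pull_after(self, client_max_ts):
        # Binary-search for the first message strictly newer than the
        # client's maximum local timestamp.
        idx = bisect.bisect_right(self._ts, client_max_ts)
        return self._msgs[idx:]

mq = MessageQueue()
mq.append(1585224100000, "hello")
mq.append(1585224101000, "gift x1")
mq.append(1585224102000, "like")
# Client's local maximum is 1585224100000: the two newer messages
# are returned, matching the example above.
assert mq.pull_after(1585224100000) == ["gift x1", "like"]
```

The same call also covers the first-join case described in section 9.1: passing 0 as the client timestamp would return everything still held in the queue.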

7.3 Message rate control

When handling massive messages, the server needs to control the message rate.

This is because in a live chat room, large numbers of users send large numbers of messages at the same time, and the content is mostly similar. If all of them were distributed to clients, the clients would likely suffer lag and message delay, seriously hurting the user experience.

So the server rate-limits messages on both the uplink and the downlink.

Principle of message speed control:

Specific rate limiting policies are as follows:

  • 1) Server uplink rate-limiting (discard) policy: by default, messages in a single chat room are limited to 200 per second, adjustable to business needs. Messages sent after the limit is reached are discarded in the chat room service and are not synchronized to the message service nodes;

  • 2) Server downlink rate-limiting (discard) policy: downlink limiting is controlled mainly by the length of the message ring queue; once the maximum length is reached, the oldest messages are discarded.
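A minimal sketch of the two policies, assuming a sliding one-second window for the uplink limit and a fixed-capacity ring queue for the downlink. The 200/s default comes from the text; the ring capacity of 1000 is an assumption.

```python
from collections import deque

UPLINK_LIMIT = 200     # messages per second per room (default from the text)
RING_CAPACITY = 1000   # assumed downlink ring-queue length

class RoomRateControl:
    """Sketch of per-room rate control: uplink messages beyond 200/s are
    dropped before fan-out; the downlink ring queue silently discards
    its oldest entries once full."""

    def __init__(self):
        self._window = deque()                     # send times, last second
        self._ring = deque(maxlen=RING_CAPACITY)   # oldest auto-discarded

    def on_uplink(self, now, msg):
        # Slide the one-second window forward.
        while self._window and now - self._window[0] >= 1.0:
            self._window.popleft()
        if len(self._window) >= UPLINK_LIMIT:
            return False            # dropped at the chat room service
        self._window.append(now)
        self._ring.append(msg)      # enqueue for downlink distribution
        return True

rc = RoomRateControl()
# 250 messages arrive within the same second: only 200 are accepted.
accepted = sum(rc.on_uplink(0.0, f"m{i}") for i in range(250))
assert accepted == 200
```

`deque(maxlen=...)` gives the downlink ring-queue behavior for free: appending to a full deque drops the oldest element, which is exactly the "discard the oldest messages" policy.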

After each pull notification, the server marks the user as pulling. When the user actually pulls the messages, the mark is removed.

If the user still carries a pulling mark when a new message is generated:

  • 1) If less than 2 seconds have passed since the mark was set, no notification is issued (reducing client pressure by discarding the notification, not the message);

  • 2) After 2 seconds, notifications resume (if notifications go unpulled several times in a row, the user kick-out policy is triggered, which is not described here).

Therefore, whether a message is discarded depends on the client's pull speed (affected by client performance and the network). If the client pulls messages in time, no message is discarded.
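The pulling-mark suppression can be sketched as follows, using the 2-second window from the text. Names are illustrative, and the consecutive-miss kick-out policy is omitted.

```python
SUPPRESS_SECONDS = 2   # from the text: no new notification within 2 s

class PullFlag:
    """Sketch of notification suppression: after notifying a member, new
    messages within 2 s produce no further notification (the messages
    are kept; only the redundant notification is skipped)."""

    def __init__(self):
        self._notified_at = {}   # uid -> time the pulling mark was set

    def should_notify(self, uid, now):
        t = self._notified_at.get(uid)
        if t is not None and now - t < SUPPRESS_SECONDS:
            return False         # still marked as pulling: suppress
        self._notified_at[uid] = now
        return True

    def on_pulled(self, uid):
        # The client actually pulled: clear the mark so the next
        # message notifies immediately.
        self._notified_at.pop(uid, None)

f = PullFlag()
assert f.should_notify("u1", now=0.0) is True
assert f.should_notify("u1", now=1.0) is False   # within 2 s: suppressed
assert f.should_notify("u1", now=2.5) is True    # window elapsed: notify
```

Note that only notifications are dropped here; because pulls are timestamp-based, the suppressed messages are still delivered by the next pull.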

8. Message priorities in live chat rooms

The core of message rate control is choosing which messages to keep, which requires dividing messages by priority.

The partitioning logic is roughly as follows:

  • 1) Whitelist messages: the most important messages, with the highest level. System notifications and management messages are generally set as whitelist messages;

  • 2) High-priority messages: next to whitelist messages; messages without special settings are high priority;

  • 3) Low-priority messages: the lowest priority, mostly text chat messages.

Exactly how messages are divided should be configurable through a convenient interface.

The server applies different rate-limiting policies to the three types of messages; under high concurrency, lower-priority messages have the highest probability of being discarded.

The server stores the messages in three buckets, and the client pulls them in the order whitelist messages > high-priority messages > low-priority messages.
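The three buckets and the pull order can be sketched like this. The bucket capacities are assumptions, chosen so that lower-priority buckets overflow (and thus discard) first under pressure.

```python
from collections import deque

WHITELIST, HIGH, LOW = 0, 1, 2   # pull order: whitelist > high > low

class PriorityBuckets:
    """Sketch of the three per-room message buckets; capacities are
    assumptions, not values from the text."""

    def __init__(self, caps=(1000, 500, 200)):
        self._buckets = [deque(maxlen=c) for c in caps]

    def put(self, priority, msg):
        # deque(maxlen=...) silently drops the oldest message when full,
        # so the smaller low-priority bucket discards most aggressively.
        self._buckets[priority].append(msg)

    def pull(self, n):
        """Drain whitelist first, then high, then low priority."""
        out = []
        for bucket in self._buckets:
            while bucket and len(out) < n:
                out.append(bucket.popleft())
        return out

pb = PriorityBuckets()
pb.put(LOW, "chat: hi")
pb.put(HIGH, "gift x10")
pb.put(WHITELIST, "system notice")
assert pb.pull(3) == ["system notice", "gift x10", "chat: hi"]
```

Giving each priority its own bounded queue is what makes "low priority gets discarded first" automatic: under load, the low bucket wraps around long before the whitelist bucket does.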

9. Client optimization for receiving and rendering large numbers of messages

9.1 Message Receiving Optimization

In terms of the message synchronization mechanism, if every message received by the chat room were pushed directly to the client, it would undoubtedly put great performance pressure on the client. Especially at thousands or tens of thousands of messages per second, continuous message processing would consume the client's limited resources and affect other user interactions.

Considering these problems, a separate notification-pull mechanism is designed for chat rooms: after a series of controls on the server side, the client is notified to pull.

Specifically divided into the following steps:

  • 1) The client successfully joins the chat room;

  • 2) The server issues a pull-notification signal;

  • 3) The client pulls messages from the server based on the maximum timestamp of the messages stored locally.

Note that the first time a user joins a live chat room there is no valid local timestamp, so 0 is passed to the service to pull the latest 50 messages and store them in the local database. On subsequent pulls, the maximum timestamp of the stored messages is passed for incremental pulling.

After pulling messages, the client deduplicates and reorders them, then hands the sorted data to the business layer, avoiding duplicate display.

In addition, chat room messages are highly time-sensitive: after the broadcast ends or the user exits the room, most previously pulled messages no longer need to be viewed. So when the user exits the chat room, all messages for that room are cleared from the local database to save storage space.

9.2 Rendering optimization of messages

In terms of message rendering, the client also applies a series of optimizations to keep performance good even when a flood of messages arrives in a live chat room.

Taking the Android client as an example, the specific measures are as follows:

  • 1) Adopt the MVVM pattern: strictly separate business processing from UI refresh. For each message, the ViewModel's worker thread finishes all business processing and prepares the data needed for the page refresh before the page is notified to refresh;

  • 2) Reduce the main-thread load: use LiveData's setValue() and postValue() precisely; events already on the main thread notify the View via setValue(), avoiding overloading the main thread with excessive postValue() calls;

  • 3) Reduce unnecessary refreshes: for example, while the message list is being scrolled, newly received messages need not refresh the list, only a prompt is shown;

  • 4) Detect data updates: use Google's data-diffing tool DiffUtil to identify whether data has changed, and update only the changed parts;

  • 5) Control global refreshes: update the UI through partial refreshes wherever possible.

With the above mechanisms in place, load testing showed that on a mid-range phone the message list still rendered smoothly with no lag at 400 messages per second in a live chat room.

10. Custom attributes beyond traditional chat messages

10.1 Overview

In live chat room scenarios, beyond sending and receiving traditional chat messages, the business layer often needs business attributes of its own, such as mic-position information and role management in voice live rooms, or recording users' roles and game state in card games such as Werewolf.

Compared with traditional chat messages, custom attributes have guaranteed-delivery and timeliness requirements. For example, mic positions and roles must be synchronized to all members of a chat room in real time, and clients then refresh their local business state based on the custom attributes.

10.2 Storage of Custom Attributes

Custom attributes are passed and stored as key/value pairs. There are two main operations on custom attributes: setting and deleting.

The server stores custom attributes in two parts:

  • 1) the full set of custom attributes;

  • 2) a change record of the custom attribute set.

The custom attribute storage structure is shown below:

For these two pieces of data, two query interfaces should be provided: full-data query and incremental-data query. Combined, these two interfaces greatly improve the chat room service's attribute-query responsiveness and the distribution of custom attributes.

10.3 Pulling custom attributes

The full data set in memory mainly serves members who have never pulled custom attributes: members who have just entered the chat room can directly pull the full custom-attribute data and render it.

For a member who has already pulled the full data, pulling the full data again on every change would force a comparison between the client-side and server-side full attribute sets to find what changed; no matter which side performs the comparison, it adds computational pressure.

Therefore, to synchronize incremental data, a collection of attribute change records is needed. In this way, most members pull only incremental data when they are notified of custom-attribute changes.

The attribute change record uses an ordered map: the key is the change timestamp, the value contains the change type and the custom-attribute content, and the ordered map provides all custom-attribute operations within a time range.

The distribution logic of custom attributes is the same as for messages: notification plus pull. That is, when the client receives a notification of a custom-attribute change, it pulls with its own maximum local custom-attribute timestamp. For example, if the client passes a timestamp of 4, the two records with timestamps 5 and 6 are pulled. The client replays the pulled increments locally, then modifies and renders its own custom attributes.
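The ordered change log and incremental pull can be sketched as follows, reproducing the timestamp-4 example above; the exact data layout is an assumption.

```python
import bisect

class RoomAttributes:
    """Sketch of custom-attribute sync: a full key/value map plus an
    ordered change log keyed by timestamp; clients pull changes newer
    than their local maximum timestamp and replay them."""

    def __init__(self):
        self.full = {}   # full attribute set (for first-time pulls)
        self._ts = []    # ordered change timestamps
        self._log = []   # (op, key, value) aligned with self._ts

    def set(self, ts, key, value):
        self.full[key] = value
        self._ts.append(ts)
        self._log.append(("set", key, value))

    def delete(self, ts, key):
        self.full.pop(key, None)
        self._ts.append(ts)
        self._log.append(("del", key, None))

    def pull_changes(self, client_max_ts):
        # Incremental query: everything strictly newer than the client's
        # maximum local attribute timestamp.
        idx = bisect.bisect_right(self._ts, client_max_ts)
        return list(zip(self._ts[idx:], self._log[idx:]))

attrs = RoomAttributes()
attrs.set(1, "mic_1", "userA")
attrs.set(4, "mic_2", "userB")
attrs.set(5, "mic_2", "userC")
attrs.delete(6, "mic_1")
# A client whose local max timestamp is 4 pulls the records at 5 and 6,
# matching the example in the text; replaying them locally yields the
# same state as the server's full map.
changes = attrs.pull_changes(4)
assert [ts for ts, _ in changes] == [5, 6]
```

The full map and the change log back the two query interfaces from section 10.2: new members read `full` once, everyone else replays `pull_changes()`.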


(This article has been simultaneously published at: www.52im.net/thread-3835…)