1. Background

The instant messaging (IM) system is an important part of a live broadcasting system. A stable, fault-tolerant, flexible message module that supports high concurrency is a key factor in the user experience, and the IM long connection service plays a central role in it.

Focusing on live shows, this article briefly describes the message model and explains its architecture. It also summarizes how we have upgraded and adjusted that architecture over the past year while dealing with various online business problems.

In live broadcasting, pushing and pulling streams is the most basic technical building block, while messaging is the key technology that lets all viewers interact with the anchor. Through the live IM module we can build core show features such as public-screen interaction, colored bullet comments, site-wide gift broadcasts, private messages, and PK. As the “information bridge” between users, and between users and anchors, the message module must remain stable and reliable under high concurrency, and ensuring this is an important topic in the evolution of a live broadcasting system.

2. Live broadcast message service

In the live broadcast business, there are several core concepts about the message model. Let’s briefly introduce them first, so that we can have an overall understanding of the message model related to live broadcast.

2.1 Anchors and users

For the IM system, anchors and viewers are both ordinary users. Each has a unique user ID, which is the key identifier the IM system uses to deliver point-to-point messages.

2.2 Room number

Each anchor corresponds to one room number (Roomid). Before going live, the anchor is bound to a unique room number after identity verification. The room number is the key identifier the IM system uses to distribute messages within a broadcast room.

2.3 Message type partitioning

According to the characteristics of live broadcast service, IM messages can be divided in many ways, such as by the dimensions of the receiver, by the type of broadcast room message service, by the priority of the message, and by the storage mode.

In general, we have the following types of messages according to the recipient dimension:

Point-to-point messaging (unicast messaging)
Broadcast room messages (multicast messages)
Broadcast messages

Depending on the specific business scenario, there are several types of messages:

Gift message
Public screen message
PK message
Business notification class messages

Messages must be delivered to the right group or individual user terminal accurately and in real time. A better IM messaging model can also give the business new capabilities, such as the ability to:

Count the number of people online in each broadcast room in real time
Capture user enter and leave events for each broadcast room
Track in real time how long each user has been in the broadcast room

2.4 Message priority

Unlike IM chat products such as WeChat and QQ, live messages must be prioritized.

In chat products such as WeChat, every message, whether private or group, has basically the same priority. No one’s message ranks above anyone else’s, and all of them must be delivered to every terminal accurately and in real time. In live broadcasting, however, different business scenarios call for different distribution priorities.

For example, suppose a broadcast room can render only 15 to 20 messages per second. If a popular room produces more than 20 messages per second and we distribute them all in real time without any priority control, the client’s public screen will stutter, gift pop-ups will flash by too quickly, and the viewing experience will drop sharply. So we need to assign different priorities to messages of different business types.

For example, gift messages have higher priority than public-screen messages. Within the same business type, a large gift has higher priority than a small gift, and a public-screen message from a high-level user has higher priority than one from a low-level or anonymous user. When distributing business messages, we select which messages to deliver according to their actual priority.
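The ordering rules above can be sketched as a sort key. This is only an illustration: the Message fields, the TYPE_RANK table, and the limit of 20 are assumptions made for the example, not the production data model.

```python
from dataclasses import dataclass

# Hypothetical ranking of business types; a larger number means higher priority.
TYPE_RANK = {"gift": 2, "chat": 1}


@dataclass(frozen=True)
class Message:
    seq: int              # arrival order within the room
    kind: str             # "gift" or "chat" (public screen)
    gift_value: int = 0   # only meaningful for gifts
    user_level: int = 0   # 0 stands for an anonymous user


def priority_key(msg: Message) -> tuple:
    """Gifts outrank public-screen messages; ties break on gift value, then user level."""
    return (TYPE_RANK.get(msg.kind, 0), msg.gift_value, msg.user_level)


def select_for_render(messages: list, limit: int = 20) -> list:
    """Keep the `limit` highest-priority messages, returned in arrival order."""
    if len(messages) <= limit:
        return list(messages)
    kept = sorted(messages, key=priority_key, reverse=True)[:limit]
    return sorted(kept, key=lambda m: m.seq)
```

With a limit of 3 and four candidate messages, the public-screen message from the lowest-level user is the one left out, while the survivors keep their original order.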

3. Message technology

3.1 Message architecture model

3.2 Short polling vs. long connection

3.2.1 Short polling

3.2.1.1 Short polling business model

First, we briefly describe the short polling process and its basic design:

The client polls the server interface every 2 s with the Roomid and Timestamp parameters. On the first request, Timestamp is 0 or null.
The server uses Roomid and Timestamp to query the message events generated in that room after the timestamp and returns a limited number of them (for example, 10 to 15). The number of messages generated after the timestamp may be far greater than 15, but the client’s rendering capacity is limited and displaying too many messages hurts the user experience, so the number returned is capped. The server also returns the timestamp of the last message in the batch, which the client uses as the base timestamp for its next request.
This cycle repeats, so each terminal receives the latest messages of its broadcast room every 2 s.
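The cursor-style exchange described in these steps can be sketched as follows. It is a simplified model under stated assumptions: ROOM_LOG, poll, and client_rounds are hypothetical names, the log lives in memory rather than in Redis, and the 2 s sleep between rounds is left out.

```python
from typing import List, Tuple

# In-memory stand-in for one room's message log: (timestamp_ms, payload), sorted by time.
ROOM_LOG: List[Tuple[int, str]] = [(1000 + i * 100, f"msg-{i}") for i in range(25)]


def poll(room_log, timestamp: int, limit: int = 10):
    """Server side: return up to `limit` messages newer than `timestamp`,
    plus the timestamp the client should send on its next request."""
    newer = [m for m in room_log if m[0] > timestamp]
    batch = newer[:limit]
    # An empty poll leaves the cursor where it was.
    next_ts = batch[-1][0] if batch else timestamp
    return batch, next_ts


def client_rounds(room_log, rounds: int, limit: int = 10):
    """Client side: the first request passes 0, later requests pass the returned cursor.
    A real client would wait about 2 s between rounds."""
    ts, received = 0, []
    for _ in range(rounds):
        batch, ts = poll(room_log, ts, limit)
        received.extend(payload for _, payload in batch)
    return received, ts
```

Note that a strict "newer than" comparison can skip messages that share the cursor's exact timestamp; a real implementation would need a tie-breaker such as a message ID.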

The overall idea is shown in the figure above. The specific timing can be refined, as explained in detail later.

3.2.1.2 Short-polling storage model

Short-polling message storage is somewhat different from the normal long-connection message storage, and there is no problem of message diffusion. The message storage we need to do needs to achieve the following business objectives:

Message insertion time complexity is relatively low;
The complexity of the message query is relatively low;
The storage structure of the message should be relatively small and should not take up too much memory or disk space;
Historical messages can be stored persistently on disk according to business needs;

In combination with the technical requirements of the four points mentioned above, after discussion among team members, we decided to use the SortedSet data structure of Redis for storage. Specific implementation idea: according to the product business types of broadcast room, business messages are divided into the following four types: gift, public screen, PK and notice.

The messages of one broadcast room are stored in four Redis SortedSet structures. The SortedSet keys are “live::roomId::gift”, “live::roomId::chat”, “live::roomId::notify”, and “live::roomId::pk”. The score is the timestamp at which the message was actually generated, and the value is the serialized JSON string of the message, as shown in the figure below:

When the client polls, the logic of the server query is as follows:

Many readers will ask: why not use the Redis List data structure? The following figure gives a detailed explanation:

Finally, we compare the time complexity of the Redis SortedSet and Redis List data structures when used for live message storage.

The above covers our basic design thinking for storing messages in Redis SortedSet. The following sections also cover the points to watch when implementing polling.
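To make the storage layout concrete, here is a small in-memory model of the four per-room sorted sets. FakeSortedSets is a hypothetical stand-in written for this example, not a Redis client. Its range_after method only approximates a score-range query with a result limit, and it assumes integer millisecond timestamps as scores.

```python
import bisect
import json
from collections import defaultdict

KINDS = ("gift", "chat", "notify", "pk")


class FakeSortedSets:
    """Tiny in-memory stand-in for Redis sorted sets (illustration only)."""

    def __init__(self):
        self._sets = defaultdict(list)  # key -> sorted list of (score, member)

    def zadd(self, key: str, score: int, member: str) -> None:
        bisect.insort(self._sets[key], (score, member))

    def range_after(self, key: str, min_score: int, count: int) -> list:
        """Entries with score > min_score, ascending, at most `count`.
        Relies on integer scores: the first entry >= (min_score + 1,) is the answer."""
        items = self._sets[key]
        start = bisect.bisect_left(items, (min_score + 1,))
        return items[start:start + count]


def room_key(room_id: str, kind: str) -> str:
    return f"live::{room_id}::{kind}"


def save_message(store, room_id: str, kind: str, ts_ms: int, body: dict) -> None:
    """Score is the message timestamp, value is the serialized JSON string."""
    store.zadd(room_key(room_id, kind), ts_ms, json.dumps(body, sort_keys=True))


def query_room(store, room_id: str, after_ts: int, per_kind: int = 5) -> dict:
    """Read each of the four sets for messages newer than `after_ts`."""
    result = {}
    for kind in KINDS:
        rows = store.range_after(room_key(room_id, kind), after_ts, per_kind)
        result[kind] = [(score, json.loads(member)) for score, member in rows]
    return result
```

Keeping one set per business type lets the query apply a separate limit to gifts, public-screen messages, notices, and PK messages.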

3.2.1.3 Time control of short polling

Time control of short polling is extremely important. We need to find a good balance between the QoE experience of the live audience and the pressure on the server.

If the polling interval is too long, the viewing experience deteriorates badly and messages feel choppy, arriving in bursts. If polling is too frequent, the server comes under too much pressure and many polls are “empty”. An empty poll is an invalid poll: after a valid poll has returned messages, no new message is generated in the broadcast room during the interval, so the next poll returns nothing.

At present, Vivo’s live broadcast handles more than 1 billion polling requests per day. During the evening viewing peak, the CPU load of the servers and Redis rises, and the thread pool of the Dubbo service provider stays at a high water mark. Based on the real-time load of the machines and Redis, we need to scale the servers horizontally and add nodes to the Redis Cluster. Broadcast rooms with extremely high popularity can even be routed to a designated Redis Cluster, so that they are physically isolated and enjoy “VIP” service, and the messages of different broadcast rooms do not affect each other.

The polling interval can also be configured according to the number of people in the broadcast room. For example, rooms with fewer than 100 people can poll relatively frequently, about every 1.5 s; rooms with more than 300 but fewer than 1,000 people about every 2 s; and rooms with 10,000 people about every 2.5 s. These settings can be pushed in real time through the configuration center, and the client updates its polling interval accordingly. The interval can then be tuned against the observed user experience in the room and the server load to find the best balance.
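A tiered lookup is one simple way to express such a configuration. The sketch below reuses the example values from the text. The boundaries it picks for the ranges the text leaves open (100 to 300 people, and everything from 1,000 up) are assumptions, as are the names DEFAULT_TIERS and polling_interval.

```python
# (exclusive upper bound on online users, polling interval in seconds)
# Assumption: rooms of 100-300 people share the 2.0 s tier, and every room
# with 1,000 or more people uses the 2.5 s fallback.
DEFAULT_TIERS = [(100, 1.5), (1000, 2.0)]
DEFAULT_FALLBACK = 2.5


def polling_interval(online_users: int, tiers=DEFAULT_TIERS, fallback=DEFAULT_FALLBACK) -> float:
    """Return the polling interval for a room of the given size.
    `tiers` must be sorted by upper bound; it could be replaced at runtime
    when the configuration center pushes new values."""
    for upper_bound, seconds in tiers:
        if online_users < upper_bound:
            return seconds
    return fallback
```

Because the tiers are plain data, a new table pushed by the configuration center takes effect without a client release.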

3.2.1.4 Points to note for short polling

1) The server needs to validate the timestamp passed by the client. This is very important. Suppose a viewer sends the live app to the background while watching, so the client’s polling is paused. When the user returns to the live screen, the timestamp the client passes will be very old or even expired, and the server’s Redis query for it will be slow. If many such slow queries occur, connections between the server and Redis cannot be released quickly and the whole server slows down. A large number of polling requests then time out at once, and both quality of service and QoE drop sharply.

2) The client needs to filter duplicate messages. In extreme cases the client may receive the same message twice. For example, the client sends a request with Roomid=888888&Timestamp=T1, but because of network instability or a server GC pause the request takes more than 2 s to process. The client then sends Roomid=888888&Timestamp=T1 again. The server returns the same data for both requests, and the client renders the same messages twice, which hurts the user experience. Each client therefore needs to check for duplicate messages.

3) Massive volumes of messages cannot be returned and rendered in real time. Imagine a very popular broadcast room that produces thousands or tens of thousands of messages every second. The storage and query approach above then falls short: because each response is limited to 10 to 20 messages, it takes a long time to return everything that was generated in a single second, and the latest messages cannot be returned promptly. The messages returned by polling can therefore also be selectively discarded according to message priority.
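The first two points can be sketched in a few lines. This is an illustration under assumptions: the 10-second lookback window, the idea that each message carries a unique ID, and the names sanitize_timestamp and RecentMessageFilter do not come from the original system.

```python
from collections import OrderedDict

MAX_LOOKBACK_MS = 10_000  # hypothetical: never query further back than 10 s


def sanitize_timestamp(client_ts, now_ms: int, max_lookback_ms: int = MAX_LOOKBACK_MS) -> int:
    """Server side: clamp missing, stale, or future timestamps into a safe window,
    so a client returning from the background cannot trigger a huge range query."""
    oldest_allowed = now_ms - max_lookback_ms
    if not client_ts or client_ts < oldest_allowed:
        return oldest_allowed
    return min(client_ts, now_ms)


class RecentMessageFilter:
    """Client side: remember the last `capacity` message IDs and drop repeats."""

    def __init__(self, capacity: int = 500):
        self.capacity = capacity
        self._seen = OrderedDict()

    def accept(self, message_id) -> bool:
        """Return True if the message is new and should be rendered."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True
```

The bounded filter keeps client memory constant. The trade-off is that a duplicate arriving after its ID has been evicted would be rendered again, so the capacity should comfortably cover a few polling rounds.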

The benefits of having the client poll the server for broadcast room messages are obvious: distribution is timely and accurate, and messages rarely fail to arrive because of network jitter. The disadvantages are just as obvious: the server load at business peaks is very high, and if all broadcast room messages are distributed by polling, over time it becomes hard to grow server capacity linearly through horizontal scaling.

3.2.2 Long connection

3.2.2.1 Architectural model for long connections

In terms of process, as shown in the figure above, the overall process of live long connection is as follows:

The mobile client first sends an HTTP request to the long connection server to obtain the IP addresses for the TCP long connection. The long connection server returns a list of the best available IPs according to its routing and load policies.

Using the returned IP list, the mobile client requests a connection to the long connection server, which accepts the request and establishes the connection.

The mobile client sends authentication information to authenticate the communication and confirm its identity, and the long connection is then established. The long connection server manages the connection, monitors heartbeats, and handles reconnection after a disconnect.

The basic architecture of the long-connection server cluster is shown in the figure below. The service is divided according to the region, and terminal machines in different regions are connected on demand.

3.2.2.2 Long connection establishment and management

In order to make messages reach users instantly, efficiently and safely, the live broadcast client and the IM system establish an encrypted full-duplex data channel, which is used for both sending and receiving messages. When a large number of users are online, a large amount of memory and CPU resources are needed to maintain these connections and maintain the session.

The IM access layer keeps its functions as simple as possible and pushes business logic down to the logic services behind it. A process restart during a release would force a large number of external devices to reconnect and hurt the user experience, so the access layer provides a hot-update release scheme: basic logic that rarely changes, such as connection maintenance and account management, lives in the main program, while business logic is embedded as SO (shared object) plug-ins. When business logic changes, only the plug-in needs to be reloaded, and the long connections to devices are not affected.

3.2.2.3 Long connection maintenance

After a long connection is established, if the network in between is disconnected, neither the server nor the client notices, which produces a falsely online state. A key problem in maintaining a long connection is therefore to make sure both ends are notified quickly when an intermediate link fails, so that a new usable connection can be established by reconnecting and the long connection stays highly available. The IM system enables keepalive on the server side and an intelligent heartbeat on the client side.

With keepalive detection, the server can detect unexpected situations such as a client crash, a break in the intermediate network, or an intermediate device deleting the connection’s entry from its connection table after a timeout, so that it can release half-open TCP connections when they occur.
With the intelligent heartbeat, the client reports its liveness to the server while consuming very little power and network traffic, periodically refreshes the NAT mapping between internal and external IPs, and automatically re-establishes the long connection when the network changes.
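As a rough illustration of how a server can use client heartbeats to clear falsely online connections, the sketch below tracks the last heartbeat time for each connection and expires the silent ones. HeartbeatMonitor and the 90-second timeout are hypothetical. This application-level check complements TCP keepalive rather than replacing it, and the caller passes in the clock value so the logic stays testable.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per connection and reports the ones that went silent."""

    def __init__(self, timeout_s: float = 90.0):
        self.timeout_s = timeout_s
        self._last_seen = {}  # connection id -> time of last heartbeat (seconds)

    def on_heartbeat(self, conn_id: str, now: float) -> None:
        """Record a heartbeat (or any other traffic) from the connection."""
        self._last_seen[conn_id] = now

    def expire(self, now: float) -> list:
        """Return and forget every connection silent for longer than the timeout.
        The caller would close these sockets and mark the users offline."""
        dead = [c for c, seen in self._last_seen.items() if now - seen > self.timeout_s]
        for conn_id in dead:
            del self._last_seen[conn_id]
        return sorted(dead)

    def online_count(self) -> int:
        return len(self._last_seen)
```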

3.2.3 IM message distribution in the broadcast room

Overall flow chart of IM long connection message distribution

When integrating the three modules of client, IM long connection server module and live broadcasting service server module, the overall message distribution logic follows the following basic principles:

All messages are distributed by having the live broadcast business server call the interface of the IM long connection server, which delivers them to the relevant broadcast rooms.
The business server handles events generated in the broadcast room according to their business type, such as deducting virtual currency for gifts or running content moderation on public-screen text.
The client accepts signaling control from the live business server. Whether a message is distributed over the long connection channel or by HTTP short polling is controlled entirely by the live business server. The client hides these underlying details: its upper layer receives a unified message format and processes each message according to its business type.

3.2.3.1 Broadcast room member management and message distribution

Broadcast room members are the most important basic metadata of a broadcast room, and the number of users in a single room has no real upper limit. Room sizes vary widely: large broadcasts have more than 300,000 users online at the same time, medium ones tens of thousands, and small ones a few hundred. How to manage broadcast room members is therefore one of the core functions of the broadcast room architecture. There are two common approaches:

1. Assign a fixed number of shards to the broadcast room. Each user maps to a specific shard, and users are distributed across shards more or less randomly.

Fixed sharding is simple to implement. However, in a room with few users each shard may carry very few users, while in a room with many users each shard may carry a very large number. Fixed sharding is inherently hard to scale.

2. Dynamic sharding, which fixes the number of users per shard. When the number of users exceeds the threshold, a new shard is added, so the number of shards grows with the number of users.

Dynamic sharding creates shards automatically according to the number of people in the broadcast room and opens a new shard when the existing ones are full. Each shard is filled up to the threshold as far as possible, but the number of users in existing shards changes as users enter and leave the room, so maintenance is relatively complex.
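A minimal sketch of dynamic sharding, assuming in-memory sets and a fixed per-shard capacity. DynamicShardedRoom and its method names are invented for the example, and a real deployment would keep this state in shared storage rather than in one process.

```python
class DynamicShardedRoom:
    """Room membership split into shards of at most `shard_capacity` users."""

    def __init__(self, shard_capacity: int = 1000):
        self.shard_capacity = shard_capacity
        self.shards = []        # list of sets of user IDs
        self._user_shard = {}   # user ID -> shard index

    def join(self, user_id: str) -> int:
        """Place the user in the first shard with free space, opening a new one if all are full."""
        if user_id in self._user_shard:
            return self._user_shard[user_id]
        for index, shard in enumerate(self.shards):
            if len(shard) < self.shard_capacity:
                break
        else:
            # No shard had room (or none exist yet): open a new shard.
            self.shards.append(set())
            index = len(self.shards) - 1
        self.shards[index].add(user_id)
        self._user_shard[user_id] = index
        return index

    def leave(self, user_id: str) -> None:
        index = self._user_shard.pop(user_id, None)
        if index is not None:
            self.shards[index].discard(user_id)

    def shard_sizes(self) -> list:
        return [len(shard) for shard in self.shards]
```

Filling the gaps left by departing users first keeps shards near the threshold, which is exactly the bookkeeping that makes dynamic sharding harder to maintain than fixed sharding.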

3.2.3.2 Message distribution in the broadcast room

In the broadcast room, there are various messages such as entry and exit message, text message, gift message and public screen message. The importance of the message is different, and the corresponding priority can be set for each message.

Messages of different priorities are placed in different message queues. High-priority messages are sent to the client first, and when the backlog exceeds the limit, the earliest low-priority messages are discarded. In addition, broadcast room messages are real-time messages, and historical or offline messages are of little value to users, so messages are stored and organized by read diffusion. When a message is broadcast, the message service for each shard of room members is notified, and that service sends the message to each user in its shard. To deliver broadcast room messages to users efficiently and in real time, when a user has several undelivered messages, the delivery service sends them to the user in a single batch.
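The queueing rule in the first sentences can be sketched like this. PriorityOutbox, the number of levels, and the capacity are assumptions for illustration. A larger level number means higher priority here.

```python
from collections import deque


class PriorityOutbox:
    """One FIFO queue per priority level; a larger index means higher priority."""

    def __init__(self, levels: int = 3, capacity: int = 100):
        self.queues = [deque() for _ in range(levels)]
        self.capacity = capacity

    def __len__(self) -> int:
        return sum(len(queue) for queue in self.queues)

    def push(self, priority: int, message):
        """Enqueue a message. If the backlog exceeds capacity, drop and return
        the earliest message of the lowest non-empty priority level."""
        self.queues[priority].append(message)
        if len(self) > self.capacity:
            for queue in self.queues:  # lowest priority first
                if queue:
                    return queue.popleft()
        return None

    def pop_batch(self, max_items: int) -> list:
        """Take up to `max_items` messages, highest priority first."""
        batch = []
        for queue in reversed(self.queues):
            while queue and len(batch) < max_items:
                batch.append(queue.popleft())
        return batch
```

Returning the dropped message from push makes the discard visible to the caller, which can be useful for metrics on how often flow control kicks in.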

3.2.3.3 Message compression for long connections

When distributing broadcast room messages over TCP long connections, we also need to watch the size of the message body. If a large number of messages are distributed at a given moment, or a single multicast message goes to many users, the egress bandwidth of the data center hosting the IM connection layer becomes the bottleneck. How to control and compress the size of each message is therefore a problem we also need to consider. At present, we optimize the message body structure in two ways:

Use the Protobuf data exchange format

Messages of the same type are sent together

In our online tests, the Protobuf data exchange format saved an average of 43% of the bytes per message, which greatly reduces data center egress bandwidth.

3.2.3.4 Block messages

The so-called block message is a technical solution we borrowed from other live broadcasting platforms: multiple messages are sent together. When a message is generated, the live broadcast business server does not immediately call the IM long connection server cluster to distribute it. Instead, taking the broadcast room as the dimension, it distributes the messages generated by the business system during each interval at a uniform interval of 1 s or 2 s.

Each second, 10 to 20 messages are distributed. If the business server has accumulated more than 10 to 20 messages, the excess is discarded according to message priority. If all of those 10 to 20 messages have high priority, for example they are all gift messages, the remaining messages are carried over to the next block. Block messages have the following three benefits:

Merging messages reduces redundant message headers. When multiple messages are sent together, they can share one message header in the custom TCP transport protocol, which further reduces the number of bytes.

It prevents message storms. The live broadcast business server can easily control the rate of message distribution and does not push messages to the client without limit, which the client could not handle.

It gives a friendly user experience. Because messages flow at a steady rate and rendering proceeds at an even rhythm, the live viewing experience is good and the whole broadcast feels smooth.
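The block message idea can be sketched as a per-room buffer that is flushed on a timer. RoomBatcher, the cap of 20, and the use of a larger number for higher priority are assumptions. For brevity the sketch simply drops the overflow and does not carry messages over to the next block.

```python
from collections import defaultdict


class RoomBatcher:
    """Buffers messages per room and emits one block per room on each flush."""

    def __init__(self, max_per_block: int = 20):
        self.max_per_block = max_per_block
        self._pending = defaultdict(list)  # room id -> [(priority, seq, payload)]
        self._seq = 0

    def add(self, room_id: str, priority: int, payload: str) -> None:
        """Buffer a message; a larger `priority` means more important."""
        self._seq += 1
        self._pending[room_id].append((priority, self._seq, payload))

    def flush(self) -> dict:
        """Called once per interval (for example every 1 s): build one block per room."""
        blocks = {}
        for room_id, items in self._pending.items():
            # Keep the most important messages, then restore arrival order.
            kept = sorted(items, key=lambda it: (-it[0], it[1]))[: self.max_per_block]
            kept.sort(key=lambda it: it[1])
            blocks[room_id] = [payload for _, _, payload in kept]
        self._pending.clear()
        return blocks
```

Each returned block would then go to the IM long connection cluster in a single call, so all messages in the block share one header and the per-room send rate is bounded by the flush interval.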

3.3 Message discard

Whether HTTP short polling or long connections are used, messages must be discarded in very popular broadcast rooms. For example, at an exciting moment in a game broadcast, the number of public-screen comments surges, and so does the number of low-value gift messages that viewers send to support a player’s good play. The server then distributes thousands or tens of thousands of messages per second through the IM long connection or HTTP short polling. Such a surge causes the following problems for the client:

The client receives a sudden increase of messages through a long connection, resulting in a sudden increase of downward bandwidth pressure, and other services may be affected (for example, SVGA of gifts cannot be downloaded and played in time);

The client can’t process and render so many gifts and public screen messages quickly, which increases the CPU pressure and affects the audio and video processing.

User experience (QoE) metrics decrease due to the backlog of messages, leading to the possibility of displaying messages that are long overdue.

For these reasons, discarding messages is necessary. As a simple example, gifts must have higher priority than public-screen messages, PK progress bar messages higher priority than site-wide broadcast messages, and high-value gift messages higher priority than low-value ones.

According to these business theories, we can do the following controls in real code development:

According to the specific service characteristics, the message of each service type is divided into different levels. When the message distribution triggers flow control, the message of low priority is selectively discarded according to the message priority.

The creation time and send time fields are added to the message structure. When the actual long connection channel is called, it is necessary to judge whether the interval between the current time and the creation time of the message is too large. If it is too large, the message will be discarded directly.

Gain messages (correction messages): when designing messages during business development, design them as gain messages wherever possible. A gain message is a later message that carries the content of the messages before it. For example, if at 9:10 the PK values of anchor A and anchor B are 20:10, the PK message distributed at 9:11 should carry the full value 22:10 rather than the increment 2:0, which would require the client to do the summation itself (20+2 : 10+0). Messages can be lost because of network jitter or because an earlier message was discarded, so distributing gain or correction messages helps the business recover to a correct state.
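Two of these controls are easy to show in code: the staleness check based on creation time, and the difference between incremental and full-value PK messages. The 3-second threshold and the names is_stale and PkScore are assumptions made for this sketch.

```python
from dataclasses import dataclass

MAX_AGE_MS = 3_000  # hypothetical staleness threshold


def is_stale(created_ms: int, now_ms: int, max_age_ms: int = MAX_AGE_MS) -> bool:
    """True if the message has waited too long and should be dropped before sending."""
    return now_ms - created_ms > max_age_ms


@dataclass
class PkScore:
    """Client-side view of the PK values of anchor A and anchor B."""
    a: int = 0
    b: int = 0

    def apply_delta(self, da: int, db: int) -> None:
        """Incremental message: only correct if no earlier message was lost."""
        self.a += da
        self.b += db

    def apply_full(self, a: int, b: int) -> None:
        """Gain (full-value) message: overwrites local state, so it self-corrects."""
        self.a, self.b = a, b


# Scenario from the text: the 9:10 update (20:10) is lost, the 9:11 update arrives.
incremental_client = PkScore()
incremental_client.apply_delta(2, 0)    # ends at 2:0, which is wrong

full_value_client = PkScore()
full_value_client.apply_full(22, 10)    # ends at 22:10, which is correct
```

Because a full-value message overwrites local state, dropping any earlier PK message under flow control is harmless as long as a later one gets through.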

4. Closing remarks

For any live broadcast system, as the business grows and broadcast rooms become more popular, the message system runs into new problems and challenges. Whether it is a message storm on long connections or a flood of HTTP short polling requests, the pressure on the server increases dramatically, and we need to keep solving and optimizing for it. We should keep upgrading live messaging according to the business characteristics of each period and build a message module that can evolve, so that message distribution capacity keeps pace with the business.

The Vivo live broadcast message module has also evolved step by step, driven mainly by business growth. As business forms diversify and the number of viewers grows, system functions increase and various performance bottlenecks appear. To solve them, we analyze the code and the interface performance one by one and then propose optimization or decoupling schemes, and the message module is no exception. We hope this article offers some inspiration for designing live broadcast message modules.

Authors: Vivo Internet Technology – Lindu, Li Guolin


Play Live Series: Message Module Evolution (3)