On September 7, 2018, Chen Hao, a system development engineer at Sina Weibo, delivered a speech titled “The Interactive System for Millions of Concurrent Messages in Weibo Live Broadcast” at the RTC2018 Real-time Internet Conference. As the exclusive video partner, IT Dashuo releases this content with the authorization of the organizers and the speaker.

Word count: 2,867 | Reading time: 8 minutes

To watch the video playback and the speaker’s slides, please visit:
t.cn/Ew1WrzR.

Abstract

This talk introduces the technical scheme behind the interactive quiz system for the Weibo live-broadcast scenario.

Introduction to Live Quiz

A traditional mobile live-broadcast page generally consists of a live audio/video stream with message interaction below it. The only difference in a live quiz is the introduction of question answering as an interaction: the host’s cue controls client behavior, meaning the host decides when a question is issued, and users who answer correctly share a bonus. Real-time data, such as the number of users choosing each option, is displayed immediately after each round of answering ends.

Technical Challenge Analysis

The core requirement of a live quiz is that, with a massive number of users online simultaneously, every user sees the question screen, can participate smoothly during the broadcast, and finally shares the bonus.

High concurrency

In a live quiz, a large number of users join the room in the few minutes before the first question is issued, and a large number of users submit their answers simultaneously during the 10-second countdown; throughout the quiz, a large volume of messages is sent. All three scenarios show a clear peak, so the technical challenge for us is processing massive amounts of data under high concurrency.

Reliability

Successfully displaying the question screen is the precondition for a user to participate. Each round may eliminate a number of participants; under normal circumstances, only users who answer incorrectly are eliminated. But if a user is eliminated because the question screen failed to appear, that is a technical failure, and our hard requirement is that no user be eliminated for technical reasons. The technical difficulty here is therefore the delivery success rate of the massive volume of question messages and the successful display of the question screen.

Real-time performance

Each round of questions has only a 10-second countdown, so the question screen must be displayed to the user within that window, and the statistical results must be displayed in real time as soon as answering ends. The technical difficulty lies in delivering and statistically analyzing massive data within a very short time.

Answer Design Scheme

The technical solution we designed is based on the existing architecture of Weibo live broadcast. The figure above shows the architecture of Weibo live-broadcast interaction, whose core is the short-connection and long-connection services. The short connection uses the private Wesync protocol and supports SSL; it is the core service that supports interaction at the scale of millions of messages. The long connection maintains the message channel, scales dynamically, and supports user grouping.

In fact, the core of the whole design is choosing the signaling channel: how to deliver the question message so that users receive it within a short time.

Scheme 1

The first solution that comes to mind is polling: the client constantly issues query requests, and the server controls whether results are returned. However, a large number of useless requests consumes bandwidth and keeps constant pressure on the server. Moreover, because polling does not share a channel with the audio/video stream, it is hard to make the question arrive at the same time as the stream.

Scheme 2

The second scheme reuses the audio/video stream channel and embeds the question information directly in the stream. The question is then delivered at the same moment as the host’s cue, so users perceive no time difference and the experience is much better. Its drawback is that if the network jitters or the broadcast is interrupted, the question information is lost.

Scheme 3



The third solution reuses the interactive channel instead of relying on the live channel, so the question channel is independent and no longer affected by the live stream. Its drawback is that the question’s arrival time can no longer be kept identical to that of the audio and video.

A simple comparison of the three options shows that the interactive channel scores well on access difficulty and scalability, while the live channel scales poorly because it depends on the audio/video stream.

After comprehensive consideration, we finally decided to use an independent channel to ensure that quiz signaling is not affected by the live stream. Since the existing interactive channel of Weibo live broadcast can already support tens of millions of messages, reusing the interactive message channel was the better choice.



The picture above shows the whole flow of the live quiz. Both question answering and commenting are supported by the short-connection service, which publishes the question-issuing and result-display messages; broadcast messages are then delivered to users through the long connection.

Typical problem solving

Solving typical real-time problems

The live stream is pushed to the client through capture and encoding equipment, which introduces a delay, so the question received over the interactive channel arrives at a different time than the video stream. As shown in the figure above, the host asks the question at T0, the question arrives at T1, but users only hear the audio at T2.

The solution is to add a server timestamp to the extended field of the video stream, and the same server timestamp to the question message body on the interactive channel. The client then displays the question message whose timestamp matches the timestamp carried in the live stream’s extended field.
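The timestamp-matching idea can be sketched as follows. This is a minimal illustration, not the actual client code: the class and method names are hypothetical, and the real Wesync message format is not public. The client buffers questions that arrive early over the interactive channel and releases each one only when the playback position (the server timestamp in the stream's extended field) catches up.

```python
import heapq

class QuestionScheduler:
    """Buffers question messages until the video stream's server
    timestamp reaches each question's timestamp (illustrative sketch)."""

    def __init__(self):
        self._pending = []  # min-heap of (server_ts, question)

    def on_question_message(self, server_ts, question):
        # The question usually arrives before the matching video frame:
        # buffer it instead of showing it immediately.
        heapq.heappush(self._pending, (server_ts, question))

    def on_stream_tick(self, stream_server_ts):
        # Called with the server timestamp carried in the video stream's
        # extended field; release every question whose time has come.
        due = []
        while self._pending and self._pending[0][0] <= stream_server_ts:
            due.append(heapq.heappop(self._pending)[1])
        return due

sched = QuestionScheduler()
sched.on_question_message(1000, "Q1")
early = sched.on_stream_tick(900)    # stream not there yet -> []
ready = sched.on_stream_tick(1005)   # stream caught up -> ["Q1"]
```

This keeps the question display synchronized with what the user hears, regardless of how early the interactive channel delivered it.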

Another typical real-time problem is real-time statistics over massive data. First, each round of questions requires real-time computation of massive user state and statistics, including whether a user answered correctly, whether the user needs to be revived, and whether the user holds a revival card. Second, each round has only two instructions, issuing the question and showing the answer, so there is no separate instruction telling the server when to process data, and the massive data must be computed quickly.

Pulling all the data at once is clearly impractical, so our solution is to split it into shards and process them in parallel. When the server receives the question-issuing instruction, it first partitions the users into fine-grained shards according to certain rules. At the same time, based on the stream delay, it computes the data-processing time from the start and end times of the answering window. The user shards are then encapsulated into tasks carrying that execution time and placed into a delay queue; when the execution time arrives, the processor cluster pulls the tasks.
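The sharding step described above might look roughly like this. All names and the sharding rule (fixed-size slices) are illustrative assumptions; the point is only that each shard becomes a task whose execution time accounts for the stream delay before going into a delay queue.

```python
def build_shard_tasks(user_ids, shard_size, answer_end_ts, stream_delay):
    """Partition users into fixed-size shards and wrap each shard in a
    task scheduled after the on-screen answer window has closed for
    everyone (i.e. answer end time plus the stream delay)."""
    execute_at = answer_end_ts + stream_delay
    tasks = []
    for i in range(0, len(user_ids), shard_size):
        tasks.append({
            "execute_at": execute_at,          # when the delay queue releases it
            "users": user_ids[i:i + shard_size],
        })
    return tasks

# 10 users, shards of 4 -> three tasks (4, 4, 2 users), all due at t=1005
tasks = build_shard_tasks(list(range(10)), shard_size=4,
                          answer_end_ts=1000, stream_delay=5)
```

Because each task is independent, the processor cluster can pull and compute the shards in parallel as soon as the delay queue releases them.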

After receiving a task, the processor aggregates user messages by each user’s chosen option, status, and long-connection address, merging many small messages into a single message body.

The message body is finally sent to the long-connection service, which splits it and delivers each message to the corresponding user by UID.
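The merge-then-split flow can be sketched like this (a simplified model with hypothetical field names, not the actual wire format): the processor groups users who share the same payload and long-connection address into one body, and the long-connection service later fans the body back out per UID.

```python
from collections import defaultdict

def merge_messages(results):
    """Processor side: results is a list of (uid, conn_addr, payload).
    Users with the same connection address and payload are merged into
    one message body carrying many UIDs."""
    merged = defaultdict(list)
    for uid, conn_addr, payload in results:
        merged[(conn_addr, payload)].append(uid)
    return [{"addr": addr, "payload": payload, "uids": uids}
            for (addr, payload), uids in merged.items()]

def split_message(body):
    """Long-connection side: fan one merged body out into per-user
    deliveries keyed by UID."""
    return [(uid, body["payload"]) for uid in body["uids"]]
```

Merging shrinks the number of messages crossing the processor-to-long-connection hop, while splitting at the edge preserves per-user delivery.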

Solving typical reliability problems

As mentioned above, users in a weak network environment are prone to packet loss and disconnection, and may fail to receive the question and thus be eliminated. Since the user’s network environment cannot be guaranteed, we can only implement a more stable and faster automatic reconnection mechanism. At the same time, the server unconditionally retransmits the question message within a certain period, and the client deduplicates by message ID. In addition, the message carries the latest display time, so the question screen can be displayed even if the live stream is cut off.
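The retransmit-plus-deduplicate pattern is standard at-least-once delivery, and the client side reduces to a few lines. This sketch uses hypothetical names; a real client would also bound the seen-ID set, which is omitted here.

```python
class DedupReceiver:
    """Client side of at-least-once delivery: the server resends the
    question unconditionally for a while, and the client shows only the
    first copy of each message ID."""

    def __init__(self):
        self._seen = set()

    def receive(self, msg_id, payload):
        if msg_id in self._seen:
            return None          # duplicate retransmission: ignore
        self._seen.add(msg_id)
        return payload           # first copy: display it
```

Unconditional server-side retransmission trades a little bandwidth for delivery success in weak networks, and idempotent handling on the client makes the duplicates harmless.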

Solving typical high-concurrency problems

High concurrency involves several problems. The first is how to handle answer submission under high concurrency. With millions of users online, answers are submitted within 10 seconds, and the submissions are concentrated in a 3-6 second window; the estimated peak is 300,000 QPS. In addition, a user’s answer must be submitted within a short time and cannot be withdrawn.

The solution is also very simple: logical separation and request merging. The user’s request returns quickly, heavy logic is deferred, and the upstream logic stays light. On the resource side, user requests are consolidated and handed to a separate thread pool for batch submission. When traffic reaches the designed load threshold, the system automatically retries requests with random delays to spread the load over time, ensuring that every user’s answer can be submitted.
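The "return fast, defer heavy logic" split might be structured as below. This is a sketch under assumptions: the class name, the `store` callback, and the single-threaded `drain_once` driver stand in for the real thread pool and storage layer.

```python
import queue

class BatchSubmitter:
    """Fast path enqueues the answer and returns immediately; a worker
    drains the queue and writes answers to storage in batches."""

    def __init__(self, store, batch_size=100):
        self._q = queue.Queue()
        self._store = store          # callable taking a list of (uid, option)
        self._batch_size = batch_size

    def submit(self, uid, option):
        # Fast path: enqueue and return to the user right away.
        self._q.put((uid, option))

    def drain_once(self):
        # Worker path: pull up to batch_size answers and submit together.
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._store(batch)
        return len(batch)
```

Merging many small writes into one bulk submission is what lets the storage side survive the 300,000-QPS spike described above.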

The second problem is massive message delivery. For a system delivering tens of millions of messages in real time, the bandwidth pressure on the subscriber side is also huge. To cope, messages can be compressed to reduce transmission redundancy on the one hand, and reduced in number on the other, by grouping users and merging small messages by answer option. Both measures improve message throughput.

The third problem is launch assurance. A live quiz has no gray-release (canary) process; it goes live at full volume, so quantitative data is needed to evaluate the carrying capacity of the system’s services.

Therefore, we carried out several rounds of load testing and continuous performance optimization before launch. During feature development, testers designed test plans and ran single-interface tests while bugs were being fixed, to find interface performance bottlenecks; at the same time, the overall load-testing environment was built. After the performance-optimization stage, testers ran integrated single-machine load tests and full-link load tests.

That concludes this sharing. Thank you!