After a boom period in 2016, mobile live video streaming has entered its second phase, and the industry's focus has shifted from the extensive-growth question of how to build a complete live streaming platform to the stage of refined operation. How to continuously optimize the user experience under huge traffic, complex application scenarios, and varied network conditions is a topic of great concern in the industry.

Kuaishou has 500 million registered users, and the peak concurrent audience of a single live room has exceeded 1.8 million. The team has done extensive exploration and practice in first-screen and smoothness optimization based on big data technology for this massive user base. How does Kuaishou Live design its full-link quality monitoring scheme, how does it build its big data processing pipeline, and how does it solve problems such as frame skipping at playback start and first-screen stutter? This article is packed with practical detail and fully walks through the technical architecture and optimization practice behind Kuaishou Live's big data.

Note: This article is a transcript of Kuaishou software engineer Luo Zhe's speech at ArchSummit Shenzhen 2017, originally titled "Kuaishou's Optimization of the Live Streaming Experience Driven by Big Data". Interested readers can visit v.qq.com/x/page/e053… to watch the video of Luo Zhe's talk.

Introduction:

Good afternoon, everyone. My name is Luo Zhe, and I am from Kuaishou. For the past year I have been working on live streaming experience optimization at Kuaishou. The topic I want to share with you today is how Kuaishou optimizes live streaming quality driven by big data.

In the year or so since I joined, the company's registered user count and daily live streaming peaks have kept setting new records. As of now, Kuaishou has more than 500 million registered users, the number of short videos has exceeded 2.1 billion (the ceiling of what an int32 can store), and daily active users have reached 65 million and are still growing rapidly.

Kuaishou's live streaming service was launched at the beginning of 2016. It differs greatly from other live streaming platforms in that it serves ordinary people, not just internet celebrities. The content of Kuaishou's live broadcasts consists mostly of everyday life scenes, which are very diverse. This model also means that the business scenarios Kuaishou Live has to consider are more complex.

At present, Kuaishou's live streaming business is growing rapidly, with the number of viewers in a single live room peaking at over one million. (On August 7, the maximum number of simultaneous online viewers in a single Kuaishou live room exceeded 1.8 million, during a broadcast by user "MC Tianyou".) So how do we ensure smooth live streaming for such a large user base? I will analyze this from four aspects.

Challenges and solutions faced by Kuaishou Live

Kuaishou Live's features and challenges

Kuaishou Live has four distinct characteristics that bring Kuaishou opportunities but also pose great challenges:

  • First, as the live streaming business develops, users' expectations for the experience keep rising, which requires refined optimization for different groups of users.
  • Second, Kuaishou's live broadcasts mainly come from ordinary people, with rich and authentic scenes. This also brings problems; for example, users' network conditions are very complicated.
  • Third, the user base is large and live streaming traffic is huge. To keep the service stable, CDNs from multiple vendors must be used, which adds management and business complexity.
  • Fourth, different scenarios place different demands on live streaming. We face conflicting trade-offs such as clarity versus smoothness, or instant first-screen startup versus low latency. These business characteristics lead to diverse experience problems, long coordination cycles across different CDNs, and a complex, ever-changing network environment.

Data-driven optimization methodology

Facing these complicated online experience problems, the Kuaishou video team has distilled a data-driven optimization methodology from its practice, which can be summarized as follows:

  • The first step is to identify pain points and set priorities. Problems fall into two categories: tolerable and intolerable. Intolerable problems, such as playback failure, green screens, and black screens, affect the basic usability of the feature and should be given high priority and handled quickly. Tolerable problems include stuttering, clarity, latency, and other issues where playback still works but the user experience is poor.
  • The second step is to formulate an optimization plan. Once a problem is identified, a reasonable plan must be drawn up. It may involve multiple parties, with some issues to be solved by Kuaishou and others by the CDN vendors, so reasonable coordination is needed to ensure problems are solved in an orderly manner.
  • The third step is grayscale verification or A/B testing. After the fix is implemented, it is verified in a grayscale rollout by observing whole-network data, to make sure the solution really works before the full launch.

Kuaishou live full link quality monitoring

This methodology is built on data. So what data does Kuaishou Live use, and how do we judge whether a user's viewing experience is acceptable? Here is a brief overview of the end-to-end processing flow of Kuaishou's live streaming system: audio and video signals are captured in real time, pre-processed, encoded, packaged, and sent to the CDN origin; the player pulls data from the CDN edge, decodes it, synchronizes audio and video, and presents the result to the viewer.

We have built a thorough quality monitoring system on both the push (streamer) side and the playback side.

  1. On the push side, the data sources are the picture captured by the camera and the sound captured by the microphone. Camera and microphone capabilities vary completely across device models and systems, so we first collect key information such as camera resolution, frame rate, and device model.
  2. Next comes video pre-processing, such as denoising, beautification, and special effects, all of which consume CPU and memory, so CPU and memory usage is reported in detail at this stage.
  3. After pre-processing, the video is encoded. Encoding quality affects the entire viewing experience; for the video encoder, we mainly report the objective quality of the encoded video and the encoder's output frame rate.
  4. The capture and encoding process for audio data is similar to that for video.
  5. Once encoding is complete, the data is packaged according to the protocol and enters the bitrate-adaptive module. The main function of this module is to adjust the output bitrate to match the user's network bandwidth: when the user's network deteriorates, the adaptive module actively discards some data to reduce pressure on the network. Most push-side stuttering also happens here, so the adaptive module's output bitrate, number of dropped frames, and number of stutters are all analyzed in detail.
  6. The data finally reaches the CDN vendor's origin server. Whether the origin node assigned by the CDN vendor is reasonable, that is, in the same region and on the same carrier as the user, directly affects the user's connection quality, so the geographic location and carrier information of the origin node are also very important for quality evaluation.
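The per-link metrics above can be pictured as one report record uploaded by the push client. The sketch below is purely illustrative; the field names are my assumptions, not Kuaishou's actual reporting schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class PushStreamReport:
    """One quality report from the push (streamer) side.

    Field names are illustrative assumptions, not Kuaishou's schema.
    """
    device_model: str          # capture: capability varies by device
    camera_resolution: str
    capture_fps: float
    cpu_percent: float         # pre-processing (denoise/beauty) cost
    mem_mb: float
    encoder_output_fps: float  # encoding
    output_bitrate_kbps: int   # rate-adaptive module
    dropped_frames: int
    stutter_count: int
    origin_region: str         # CDN origin node assignment
    origin_isp: str

report = PushStreamReport("PhoneX", "1280x720", 30.0, 42.5, 310.0,
                          29.8, 1800, 12, 1, "North China", "Unicom")
payload = asdict(report)  # plain dict, ready to serialize and upload
```

Keeping one flat record per link makes it trivial to aggregate any single field (e.g. dropped frames per origin region) in the back-end pipeline described later.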

Now let's look at the pull (playback) side; its overall flow is the reverse of the push side.

  1. The client first resolves the domain via DNS and connects to a CDN edge node. Similar to the push side, it collects key information such as the DNS resolution time and the carrier, geographic location, and RTT of the edge node.
  2. The HTTP-FLV data obtained from the CDN edge node is unpacked and placed in the receive buffer. At this stage we can measure the time to the CDN node's first packet and the end-to-end delay from sending to receiving.
  3. The length of the receive buffer determines the pull side's resilience to network jitter. Here we collect the number and duration of stalls; the buffer length itself is also a point to monitor.
  4. When data leaves the buffer, audio and video are decoded, synchronized, and played. The time from the start of pulling the stream to rendering the first frame is the first-frame time.

After a user taps into a live broadcast, this complex process usually completes within a few hundred milliseconds. We have further broken the first-frame time down by stage for in-depth analysis and optimization.
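The first-frame decomposition can be modeled as a simple per-stage timing record. The stage names below are an assumption based on the pull-side steps just described, not an exact copy of Kuaishou's breakdown.

```python
from dataclasses import dataclass, astuple

@dataclass
class FirstFrameTiming:
    """Per-stage timing (ms) of the pull-side startup path.

    Stage names are illustrative, following the steps described above.
    """
    dns_ms: int           # DNS resolution
    connect_ms: int       # TCP connect to the CDN edge
    first_packet_ms: int  # wait for the first HTTP-FLV packet
    demux_ms: int         # unpack into the receive buffer
    decode_ms: int        # decode the first audio/video frame
    render_ms: int        # render the first frame

    def total_ms(self) -> int:
        return sum(astuple(self))

t = FirstFrameTiming(40, 60, 120, 10, 50, 20)
print(t.total_ms())  # 300 -- typically a few hundred milliseconds
```

Reporting each stage separately is what makes the later first-screen optimizations measurable: a change to DNS strategy should move `dns_ms` without touching the others.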

Live broadcast quality data processing Pipeline

With the detailed quality data collected, the next step is back-end processing. I will analyze how Kuaishou discovers and solves live streaming problems from three angles: the massive-data processing pipeline, QoE and QoS data monitoring, and the data visualization process.

Live broadcast quality data processing process

The figure below shows the live data processing pipeline Kuaishou uses today. The overall flow is clear: data collection → data caching → categorized processing → data indexing/display.

Looking at the details of this process: data is collected from the Kuaishou app, lightly processed by the report server, and then stored in a Kafka topic. Kafka is a very reliable data queue service; as long as the Kafka cluster is large enough, data will not be lost even if some machines go down. From Kafka, the data splits into two processing paths:

  • The first is the real-time path, where data is cleaned by Flink and written to an Elasticsearch cluster, with Kibana used for visualization. This path mainly serves real-time data analysis, report presentation, and monitoring alerts; its latency is on the order of minutes, and the data is kept for only a few weeks.
  • The other is the traditional batch path. Data is periodically processed by the Hadoop cluster and loaded into Hive. This is a typical non-real-time processing system, with latency at the hour level; there is no way to reach minute- or second-level real-time. This path mainly handles the massive data of a full day or month and produces non-real-time reports, for example stutter-rate curves over a day, a month, or several months. This data is retained in full for several years.
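The "cleaning" step on the real-time path can be thought of as a pure function applied to each raw Kafka record. The sketch below is a minimal illustration under assumed field names; real cleaning logic in Flink would be considerably richer.

```python
import json

# Illustrative required fields; Kuaishou's actual schema is not public.
REQUIRED = ("device_id", "event", "timestamp")

def clean_report(raw: bytes):
    """Parse one raw quality report from the Kafka topic.

    Returns a normalized dict ready for indexing into Elasticsearch,
    or None to drop malformed records during cleaning.
    """
    try:
        rec = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(rec, dict) or not all(k in rec for k in REQUIRED):
        return None
    return {k: rec[k] for k in REQUIRED}

print(clean_report(b'{"device_id":"d1","event":"stutter","timestamp":1}'))
print(clean_report(b'not json'))  # None: dropped by the cleaner
```

Making the cleaner a pure, deterministic function is what allows the same logic to be reused on both the Flink real-time path and the Hadoop batch path without divergence.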

Every day, Kuaishou processes live-streaming-related data on the order of ten billion records through this system, and all data display and monitoring related to live streaming depend on this pipeline. Supporting the various business query requirements while keeping the system running smoothly under minute-level latency requirements is no easy task.

User experience quality & Service quality

After the data is collected, cleaned, and stored, how do we analyze it? We divide the visualized monitoring data into two categories:

  • The first is QoE (Quality of Experience) data, defined as data related to users' subjective perception, such as the number of concurrent live rooms, the number of concurrent live viewers, and the live-room exit rate. Fluctuations in these metrics may be caused by technical problems, or by broadcast content that fails to meet the audience's expectations; together they reflect users' subjective perception of the live experience. However, these experience metrics only show overall trends; when a technical problem occurs, they alone cannot trace its root cause. For example, if we see the number of concurrent live viewers drop, that only tells us something is wrong; to locate the specific problem and its cause, we need the second category of metrics.
  • The second category is QoS (Quality of Service) data, used for further analysis. QoS data is purely objective and reflects technical indicators; for example, the diagram shows each CDN vendor's stutter rate over time. A QoE fluctuation is not necessarily caused by a QoS fluctuation, and a QoS fluctuation does not necessarily change QoE. For example, the stutter rate may rise, but if it stays within an acceptable range, the number of concurrent online viewers may not change much.

Data visualization monitoring process

The "room entries" and "room exits" analysis in the figure below illustrates how we combine QoE and QoS data for monitoring and analysis.

First, the QoE data. The upper left shows the concurrent-viewer curve for one Kuaishou live room; during the broadcast, the online viewer count suddenly dropped into a "pit". We speculated that something was causing this huge drop in viewership, and it could have been non-technical, for instance the streamer doing something the audience disliked, driving viewers away.

Strangely, the "room entries" curve in the upper right shows a simultaneous peak, indicating that while a large number of users were exiting the room, a large number were entering at the same time.

From the QoE data alone we can already draw some conclusions: the mass exit at this moment was probably not caused by the broadcast content but by a problem with Kuaishou's live streaming service, because a large number of viewers entered at the same time as the mass exit. The viewers presumably left and immediately rejoined because they thought re-entering might fix the problem, not because they no longer wanted to watch the broadcast.

To confirm this judgment, we looked at the QoS data. The two curves below show the number of room entries and exits for each CDN node. At the moment of the mass exit, essentially all CDN nodes show large numbers of exits and entries, rather than only a few nodes showing this behavior, so we can further conclude that this was not a stream problem on individual nodes; it was most likely a problem with the streamer's upload.

After obtaining, together with the CDN, the streamer's video and push-stream curves, we could basically conclude that the streamer's network jittered at that moment, causing a temporary stall that recovered immediately; the stall caused a large number of viewers to exit the room.

This example shows that QoE is a comprehensive, very intuitive metric. Although it cannot be mapped directly onto QoS indicators, we can use it to monitor the overall situation and determine whether an experience problem is caused by technology or content; if it is technical, examining the QoS indicators in detail lets us find the root cause.

Optimization cases for the live streaming system

Next, I will illustrate how to use big data to tune a live streaming system through two examples: frame-skipping optimization at playback start and httpDNS first-screen optimization.

Pull-side startup process

As mentioned above, the startup process on the pull side mainly consists of connecting to a CDN node, pulling data, decoding, and rendering. CDN edge nodes generally cache part of the stream so that the pull side can obtain data immediately whenever playback starts.

To make playback start as smoothly as possible, the CDN tends to send the user as much data as it can, sometimes even more than the player's receive buffer can hold. The obvious problem is that if all this data is played in order at normal speed, the live delay increases and interactivity suffers.

The industry's accepted standard for interactive live streaming delay is under five seconds; beyond that, interactions such as comments and gifts degrade noticeably. Therefore, we need to shorten the first-screen time and improve smoothness while still keeping the delay bounded.

As shown in the figure above, the length of the receive buffer on the pull side generally equals the delay; we set it to 5 s. If the data delivered by the CDN exceeds the receive buffer, say by 4 seconds, then with no special handling the delay of normal playback becomes 5 s + 4 s = 9 s. With a 9-second delay, comments and interaction during the broadcast are almost impossible, and the experience is poor. So we first tried to solve the excess-data problem on the client side alone.
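The arithmetic is worth making explicit: without client-side handling, every second of excess data the CDN sends becomes an extra second of live delay.

```python
def naive_playback_delay(buffer_s: float, excess_s: float) -> float:
    """Delay if the player neither skips nor speeds up: the receive
    buffer length plus whatever excess data the CDN delivered."""
    return buffer_s + excess_s

# 5 s buffer + 4 s excess = 9 s delay, far beyond the ~5 s
# interactivity threshold mentioned above.
print(naive_playback_delay(5, 4))  # 9
```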

We tried two schemes: fast-forwarding and skipping frames outright. The fast-forward scheme simply plays the excess data through quickly, speeding up both audio and video. Soon after launch, we received user complaints doubting whether Kuaishou's broadcasts were really live: how could a real live stream "fast-forward"?

We then changed the scheme to skip part of the excess data without playing it. But this brought new problems: skipping frames causes abrupt jumps in audio and picture, and the streamer might suddenly leap from the left side of the frame to the right, which is also a poor experience. In short, optimizing on the client side alone cannot fix the experience.

Frame-skipping optimization at playback start

The real cause of the problem is the excess data delivered by the CDN, so achieving the best experience requires joint optimization with the CDNs. Here, Kuaishou's multi-CDN strategy raises a new problem: each CDN has a completely different strategy for releasing data at playback start, and the amount of data delivered differs, making it hard to quantitatively evaluate which CDN does better.

Therefore, establishing a unified evaluation standard became the first problem to solve. Kuaishou uses the "frame-skip time in the first 10 seconds of playback" as the standard for measuring how much data a CDN delivers at start; concretely, it is the total duration of data discarded by the pull side within the first 10 seconds of playback.
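As a sketch, this metric could be computed from the player's frame-drop events; the event format here is an assumption for illustration.

```python
def jump_time_first_10s(drop_events):
    """Total duration (ms) of data discarded within the first 10 s of
    playback. Each event is (playback_position_s, dropped_ms)."""
    return sum(dropped_ms
               for position_s, dropped_ms in drop_events
               if position_s < 10.0)

# Two drops inside the 10 s window, one after it (excluded).
events = [(0.5, 800), (3.0, 700), (42.0, 500)]
print(jump_time_first_10s(events))  # 1500
```

Averaging this value across all playback sessions of a CDN gives a single comparable number per vendor, which is what made the cross-CDN comparison below possible.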

With the unified standard in place, we observed each CDN's frame-skip metric from online data and had each CDN optimize toward the best performer. However, because each CDN's playback-start strategy differs greatly and their configurable parameters are completely different, it was hard to achieve full consistency across CDNs. Moreover, even the best-performing CDN could not reduce the first-10-seconds skip time to Kuaishou's satisfaction.

Therefore, unifying the playback-start data delivery strategy of all CDNs became the second key problem to solve.

We designed a unified playback-start data delivery strategy and had all CDNs implement it. In general, the scheme follows three principles: 1. the length of the delivered data must not exceed the length of the Kuaishou pull side's receive buffer; 2. delivery must start from the beginning of a GOP (Group of Pictures); 3. subject to the first two constraints, deliver as much data as possible. The figure above shows two actual cases of determining the delivery strategy for different GOP structures in the server cache, assuming the Kuaishou pull side's receive buffer is 5 seconds long:

  • In the first example, delivering from the first GOP would give a total data length of 6.5 s, which exceeds the receive buffer, so delivery can only start from the second GOP, giving a total of 4.5 s: the maximum amount of data within the buffer limit.
  • In the second example, delivering from the second GOP would give 6 s of data, so delivery can only start from the last GOP, giving 3 s of data, within the receive buffer limit.
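The three delivery principles can be sketched as a small selection function over the cached GOPs. This is an illustrative reconstruction, not actual CDN code; applied to the two examples above, it picks the second GOP (4.5 s) and the last GOP (3 s) respectively.

```python
def choose_start_gop(gop_durations_s, buffer_len_s=5.0):
    """Pick the earliest cached GOP such that delivering from its start
    to the live edge fits within the pull side's receive buffer.

    Starting at a GOP boundary satisfies principle 2; the buffer check
    satisfies principle 1; scanning from the earliest GOP maximizes the
    delivered data (principle 3). Falls back to the newest GOP if even
    that alone exceeds the buffer.
    """
    for i in range(len(gop_durations_s)):
        if sum(gop_durations_s[i:]) <= buffer_len_s:
            return i
    return len(gop_durations_s) - 1

# Example 1: 6.5 s from the first GOP exceeds the 5 s buffer,
# so delivery starts from the second GOP (4.5 s).
print(choose_start_gop([2.0, 4.5]))       # 1
# Example 2: 6 s from the second GOP exceeds the buffer,
# so delivery starts from the last GOP (3 s).
print(choose_start_gop([3.0, 3.0, 3.0]))  # 2
```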

After the unified delivery strategy was finalized, it was rolled out in grayscale on several CDNs, comparing data from covered and uncovered nodes of each CDN, and the grayscale range was gradually expanded until full launch. Compared with the daily average before optimization, the frame-skip time in the first 10 s of playback dropped from 1500 ms to 200 ms.

After this round of CDN-side optimization, whole-network frame-skip data showed all CDNs at the same level (about 200 ms of frame skipping within the first 10 seconds of playback), so CDN-side optimization had basically hit a bottleneck. Could the client solve the last 200 ms? Here Kuaishou uses a gentle fast-forward scheme: play the extra 200 ms slightly faster, imperceptibly to the user, to keep the buffer size under control.

As long as the speed-up is kept within a certain range, the user is basically unaware of it; the 200 ms of extra data is quickly consumed at the higher speed, after which playback returns to normal, ensuring no more frame skipping. The final A/B test data showed the frame-skip duration at playback start dropped to zero, and stuttering also decreased noticeably.
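The gentle fast-forward can be sketched as a playback-rate controller driven by buffer occupancy. The 1.05× cap below is an assumed value chosen to stay below what viewers notice, not a figure from the talk.

```python
def playback_rate(buffer_ms: float, target_ms: float = 5000,
                  max_rate: float = 1.05) -> float:
    """Speed up playback slightly when the buffer holds more than the
    target, so excess data is consumed without a perceptible jump."""
    if buffer_ms <= target_ms:
        return 1.0
    excess_ratio = (buffer_ms - target_ms) / target_ms
    # Ramp from 1.0 toward max_rate as the excess grows.
    return min(max_rate, 1.0 + excess_ratio * (max_rate - 1.0))

print(playback_rate(5000))        # 1.0: on target, normal speed
print(playback_rate(5200) > 1.0)  # True: 200 ms excess, slightly faster
```

The design choice here is continuity: the rate rises smoothly with buffer occupancy instead of toggling, so there is no audible pitch or cadence step when the controller kicks in.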

Having solved the playback-start frame-skipping problem, let's review the whole optimization process:

  • Unify the evaluation standard. We use CDN services from many vendors; to fairly measure the amount of data each delivers at playback start and observe subsequent optimization effects, Kuaishou designed a unified quantitative metric: the frame-skip time in the first 10 seconds of playback.
  • Unify the CDN-side strategy. With the standard unified, Kuaishou's plan was to unify each CDN's data delivery strategy: designed by Kuaishou, implemented by each CDN, then grayscale-tested and compared, analyzing each CDN's implementation through data. This step reduced the 1500 ms of skipping to 200 ms.
  • Optimize further on the client. The previous step did not completely eliminate frame skipping at playback start, and CDN-side means were exhausted, so we tackled the last 200 ms on the client. We chose the gentle fast-forward scheme, letting users consume the 200 ms quickly without noticing the speed change, validated the results through A/B testing, and then launched in full.

Throughout this process, the reliance on Kuaishou's data platform for testing, monitoring, and statistics is clear. Whether evaluating CDN quality, comparing CDN optimizations, or validating the client's A/B tests, whole-network data comparison was needed to confirm that the optimizations truly solved the frame-skipping problem.

httpDNS first-screen optimization

The second optimization I want to share is first-screen optimization. The whole first-screen process can be roughly divided into the six steps shown in the figure below; Kuaishou has analyzed the time spent in each step in detail.

The traditional DNS resolution process (known in the industry as the localDNS scheme) is simple, as shown in the figure below. The app sends a domain resolution request to the carrier's DNS server; the carrier's DNS server issues a recursive query to the CDN's GSLB system; the GSLB determines which carrier and geographic location the query comes from based on the carrier DNS server's IP address, and returns several suitable CDN edge node IPs.

In this process, the carrier's DNS server sits between the app and the CDN, so the CDN has no information about the app when assigning nodes. To address the traditional localDNS scheme's shortcomings, inaccurate scheduling, susceptibility to hijacking, and so on, another scheme, httpDNS, has emerged in recent years. Its principle is that the app directly calls an httpDNS API provided by the CDN over HTTP to obtain a set of suitable edge node IPs. The two schemes each have advantages and disadvantages:

  • 1. The advantage of localDNS is that carriers have a small number of DNS servers, so it is relatively easy for the CDN's GSLB to locate the carrier DNS server's information.
  • 2. Traditional localDNS resolution depends on the system's DNS mechanism, which generally has an internal cache, making resolution time unpredictable. The httpDNS scheme bypasses the system's DNS mechanism entirely: node IPs can be resolved before the user watches a broadcast, completely saving the DNS resolution time at actual playback. Of course, localDNS can also achieve domain pre-resolution by calling the underlying resolver interface.
  • 3. In the httpDNS scheme, the app calls the CDN's API directly, without going through the carrier, so requests bypass DNS hijacking devices and carrier domain hijacking can be effectively prevented.
  • 4. Another advantage of the app calling the CDN's httpDNS API directly is that the CDN obtains the app's egress IP, so node assignment can be more accurate and reasonable. The drawback is that for small multi-egress carriers, the egress toward the CDN's httpDNS server is likely rented from China Telecom or China Unicom, making misjudgment easy, whereas localDNS, thanks to its carrier-region granularity, makes it easier to determine the user's carrier.
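The pre-resolution idea in point 2 can be sketched with the standard resolver interface: resolve the pull-stream domain ahead of playback so the lookup is off the first-frame critical path. The domain below is a placeholder, not a real Kuaishou endpoint.

```python
import socket

def prefetch_ips(domain: str, port: int = 80):
    """Resolve a pull-stream domain ahead of time and cache the result,
    removing DNS latency from the playback-start critical path."""
    infos = socket.getaddrinfo(domain, port, proto=socket.IPPROTO_TCP)
    # getaddrinfo returns (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in infos})

# Called e.g. on app start or a network switch; "localhost" stands in
# for a real pull-stream domain here.
ips = prefetch_ips("localhost")
```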

localDNS and httpDNS each have pros and cons, and each has a certain failure rate. To improve the DNS resolution success rate, Kuaishou uses them together.

To combine the advantages of localDNS and httpDNS, and to optimize over the multiple CDN edge nodes returned by resolution, Kuaishou designed its own DNS resolution scheme:

  • 1. On app startup, TTL expiry, or a network switch, obtain IP address lists from both localDNS and httpDNS;
  • 2. Speed-test each IP address and select the one with the best result as the pull-stream node;
  • 3. When the user starts pulling the stream, connect to the prepared node IP, saving the DNS resolution time. We soon launched an A/B test of this scheme, but the results were not as expected: for users on the new DNS policy, stall counts rose and watch time fell; node selection did not seem to play its intended role. Further data analysis revealed that users on the new policy were all pulling from the few nodes with the best speed-test results, putting those nodes under heavy load and degrading pull quality.

We therefore refined the IP selection scheme: after the speed test, instead of picking the single best result, we pick randomly among nodes within an acceptable range of the best result, avoiding the load imbalance caused by large numbers of users converging on a few nodes.
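The refined selection can be sketched as: speed-test all candidate IPs, then pick uniformly at random among those within a tolerance of the best result. The 50 ms tolerance is an assumed value for illustration.

```python
import random

def pick_node(ip_rtt_ms: dict, tolerance_ms: float = 50.0) -> str:
    """Choose a pull node at random among all IPs whose measured RTT is
    within tolerance_ms of the best, spreading load across good nodes
    instead of piling every user onto the single fastest one."""
    best = min(ip_rtt_ms.values())
    candidates = [ip for ip, rtt in ip_rtt_ms.items()
                  if rtt - best <= tolerance_ms]
    return random.choice(candidates)

rtts = {"1.1.1.1": 20, "2.2.2.2": 45, "3.3.3.3": 180}
# 1.1.1.1 and 2.2.2.2 are both acceptable; 3.3.3.3 is excluded.
node = pick_node(rtts)
```

Randomizing within a quality band trades a few milliseconds of RTT per user for balanced node load, which is exactly the failure mode the A/B test above exposed.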

An A/B test of this refined scheme showed quality data fully meeting expectations:

First-screen time fell by 30%, stuttering improved significantly, and thanks to better node selection the connection failure rate fell by 2%. That a strategy proposed to optimize the first screen had such a clear positive effect on multiple metrics was not something we expected at the outset.

Challenges and Planning

Kuaishou's live streaming business is still developing rapidly, and the streaming media big data analysis platform faces new challenges:

  1. In data scale: user growth and ever-richer business types make the data volume grow exponentially, demanding ever-higher data processing capacity;
  2. In real-time performance: the live streaming business has increasingly strict real-time requirements; quality monitoring and data analysis must be handled as quickly as possible;
  3. In analysis depth: quality monitoring and product/business analysis place higher demands on the dimensions and granularity of data monitoring, with more and more dimensions and ever finer granularity.

To address these challenges, we will continue to invest in improving our core data capabilities.

On the big data platform side, from the perspective of overall live experience optimization, truly end-to-end control of the user experience requires deep monitoring and tuning of every link. Kuaishou will therefore build the following core technical capabilities as the basis for the next round of optimization:

  1. First, Kuaishou will build its own origin servers to replace the CDNs' origins for stream ingest. With self-built origins, Kuaishou's streaming media big data platform can monitor the entire audio/video transmission process in real time and intelligently.
  2. The RTMP push protocol widely used in the industry offers limited room for optimization in weak mobile networks. With self-built origins, Kuaishou will use a private push protocol on the push side, greatly strengthening transmission over weak networks.
  3. CDN scheduling will also become more flexible, with dynamic, intelligent traffic scheduling based on the real-time quality-of-service data of each CDN per carrier.

Conclusion

Kuaishou's live streaming business is still growing at high speed, and the rapid increase in users places higher demands on service quality. The streaming media big data platform is the foundation for all video services and must provide complete, stable data collection, processing, monitoring, and analysis.

To meet these business challenges, we will continue to expand and improve the data infrastructure and grow the related technical team. We welcome talented people with hands-on big data platform experience who are interested in streaming media big data analysis and optimization to join the Kuaishou video technology team. We believe that, going forward, Kuaishou's streaming media big data platform will serve users ever better and help realize Kuaishou's vision of recording the world.