This article is based on a talk given by Wang Hui, R&D Director of the Qzone client, at TOP100 Summit 2016. Editor: Cynthia

Wang Hui: R&D Director of Tencent's SNG Social Platform Department and senior engineer in charge of Tencent Qzone mobile client technology. He has led Qzone technology R&D since 2009, spanning the transition from the Web era to mobile client technology, with solid experience on both Web and mobile.

Takeaway: With the rapid development of the mobile Internet, video applications on social networks grew explosively in 2016. Short video, live video, video filters, animated effects, background music, karaoke, face effects, voice changing, co-hosting (Lianmai) and other features went live in quick succession. Shipping features fast while keeping the experience stable and smooth became a real challenge. The main challenges were:
1. Under complex network conditions, how do we keep the video playback success rate high, keep live streams smooth, and reduce stuttering?
2. For the experience, how do we guarantee fast, smooth playback and achieve instant (sub-second) opening?
3. For performance, with live filters, beautification, and facial animated effects all enabled, how do we protect performance on the anchor side?
4. With hundreds of millions of concurrent users, how do we guarantee quality at scale and design the best flexible bandwidth strategy?
This case focuses on these challenges and shares some of the optimizations the Tencent Qzone team attempted.

1. Case introduction

Qzone is currently the largest SNS community in China, with 500 million photos uploaded and 1 billion video plays daily at peak, and 630 million users sharing their lives and preserving memories. The mainstream user group is young people born after 1995. Young people are the main driving force behind changes in lifestyle, and they are no longer satisfied with traditional photo and video sharing: they want to say "see me now, right here, right now" through live streaming. This is an upgrade in content consumption, and together with better mobile hardware and falling data costs, it makes mobile live streaming feasible. Our live product is positioned around the everyday life of our target users, amplified by the power of social distribution, which differs from the mainstream show + game streaming model. The benefit is that user-generated content is more diverse, closer to everyday life, and resonates and spreads better among friends. The downside is that we must be compatible with a massive range of mobile devices, so performance is our central concern.

With that in mind, we set out to go live. The goal was to build a closed loop of live streaming capability and launch fast, within one month. That meant implementing starting a live stream, watching a live stream (one broadcaster, many viewers), watching replays, in-room interaction (comments, gifts, and so on), and content precipitation (feed precipitation, stream sharing, etc.), as shown in Figure 1. It also had to support three platforms at once (Android, iOS, and HTML5, i.e., a mature solution) and run in both the Qzone app and the Qzone inside mobile QQ.

First of all, we faced a tight schedule and insufficient in-house accumulation of live streaming technology. Even so, we were determined to keep communicating with the major technology providers and select a solution based on our criteria and their recommendations. Our criteria were:
● Strong professional quality (low live latency, full platform coverage, solid basic service infrastructure);
● Open source;
● Strong support, with issues resolvable through direct communication at any time;
● Support for dynamic capacity expansion.

In the end, we selected the ILVB live streaming solution from Tencent Cloud based on these criteria; in particular, its audio/video group has accumulated years of technology in this area, and we could cooperate with that department for a win-win. It is also worth mentioning our closed-loop R&D model, which lets us and our partners continuously improve product quality: first launch fast (finish the product requirements and build out monitoring), then analyze the monitoring data after launch, then apply the findings to optimization work (follow the data, run targeted optimizations), and finally verify in a gray release (validate the optimization on a fraction of users) before deciding whether to roll it out formally (see Figure 2).

We did launch within one month, supporting both Qzone and mobile QQ (the Qzone embedded in mobile QQ), and have since iterated through 12+ versions. Views grew from millions in May to tens of millions in August to hundreds of millions now, making this a popular nationwide live streaming product. Product metrics have been skyrocketing along with user demand, and with that came all kinds of feedback, especially on performance, which is the focus of this article.

2. Live streaming architecture

Before introducing the live streaming architecture, it is worth reviewing H.264 coding. Live video on the market today is essentially all H.264. H.264 has three frame types: a fully encoded frame is called an I frame; a frame that references a preceding I frame and encodes only the differences is called a P frame; and a frame encoded with reference to both preceding and following frames is called a B frame. H.264 compression consists of grouping (dividing frames into GOPs, i.e., frame sequences), determining frame types, predicting frames (using the I frame as the base, predicting P frames from it, and B frames from I and P frames), and data transmission. A simple analogy: if a GOP (Group of Pictures) is a freight train, then the video is a convoy of N trains (Figure 3).
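To make the frame types concrete, below is a minimal sketch, assuming an H.264 Annex-B byte stream, of how a player can classify a NAL unit from the low five bits of its header byte; this becomes relevant later, when playback starts as soon as an I frame is parsed. The class and its names are illustrative, not code from the Qzone client.

```java
// Minimal sketch (assumes an Annex-B byte stream with 00 00 00 01 start codes).
// "nalStart" is the index of the byte right after the start code.
public final class NalSniffer {
    public static String nalTypeName(byte[] stream, int nalStart) {
        int nalType = stream[nalStart] & 0x1F; // low 5 bits = NAL unit type
        switch (nalType) {
            case 5:  return "IDR slice (I frame: fully coded picture)";
            case 1:  return "non-IDR slice (P/B frame: coded as differences)";
            case 7:  return "SPS (sequence parameter set)";
            case 8:  return "PPS (picture parameter set)";
            default: return "other (type " + nalType + ")";
        }
    }
}
```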

A live stream is video data in motion: shooting, transmission, and playback form one continuous data flow. The anchor produces and loads the data, which is shipped over the network (the railway) and unloaded at the audience side for playback. Trains need dispatching, and so does video: that is the job of the streaming protocol, which controls the orderly delivery of video to the audience. Common protocols are shown in Figure 4:

We use QAVSDK, a UDP-based protocol developed by Tencent Cloud.

With protocols covered, let's look at our live streaming model, shown in Figure 5:

The video room (video streaming) and the business room (the related business logic and interactions) have roughly the same structure; the difference is the direction of data flow (note the arrows in Figure 5). In the video room, data flows from the anchor to the video server over the streaming protocol, and the video server delivers it to the audience over the same protocol, where it is decoded and played. Anchors only upload; viewers only download. In the business room, anyone can send business requests to the server (comments, for example; the client does block a few special cases, such as anchors sending gifts to themselves). A more detailed structure is shown in Figure 6:

Note: the iOS mobile QQ audience uses RTMP not because it cannot support QAVSDK, but because mobile QQ is under pressure to keep its package size down, and the SDK required by QAVSDK takes up considerable space.

3. Technical optimization

Now for the focus of this article: technical optimization, in four parts: second-open optimization (reducing time to first frame), performance optimization, lag optimization (problem analysis in practice), and replay optimization (cost optimization in practice). Before optimizing, the necessary groundwork is monitoring and statistics: we instrument the data points we care about and build reports and alarms to support the analysis. Monitoring covers five areas:
● Success rates: start-live success rate, watch-live success rate, and a ranked list of error ratios;
● Latency: time to start a live stream, time to enter a live room;
● Stream quality: stalled-frame rate, zero-frame rate;
● Problem locating: step-by-step flow logs, 2s flow logs, client logs;
● Real-time alarms: SMS, WeChat, and other channels.
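As an illustration only (the field names are assumptions, not Qzone's actual reporting schema), a monitoring data point of the kind this list implies might look like:

```java
// Illustrative sketch: one reporting data point behind the monitoring above. Each client
// step reports a stage name, its duration, and a result code, so the server side can
// aggregate success rates, per-stage timings, and error-ratio lists, and fire alarms.
public final class MonitorEvent {
    public final String stage;   // e.g. "start_live", "enter_room", "first_frame"
    public final long costMs;    // time spent in this stage
    public final int code;       // 0 = success; non-zero codes feed the error-ratio list
    public final long timestamp; // when the event happened, for real-time alarming

    public MonitorEvent(String stage, long costMs, int code) {
        this.stage = stage;
        this.costMs = costMs;
        this.code = code;
        this.timestamp = System.currentTimeMillis();
    }
}
```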

With these in place, we can view, analyze, and locate issues in the data, and get alerted in real time, which makes problems much easier to solve.

4. Second-open optimization

Almost everyone was complaining: "Why does the live stream open so slowly? Competitors are much faster!" We could not stand it ourselves. We wanted second-open: from tapping a live stream to seeing the picture in under one second. Statistics put our average open time on external networks at 4.27 seconds, a clear gap from second-open. So we mapped the timing sequence from tap to first-frame rendering and measured the time spent in each stage. The flow and timings are roughly as shown in Figure 7:

Process and data analysis pointed to two causes of the delay: fetching the first frame took too long, and the core logic was serial. We tackled both.
First frame too slow. The core problem is getting the first GOP to the viewer sooner. Our plan: have the interface machine cache the first-frame data, and modify the player to begin playback as soon as it parses an I frame. Together these shorten the time until the viewer sees the first picture.
Serial core logic. Here we mainly applied the following:
● Preloading: prepare the environment and data in advance, e.g., pre-start the live process from feeds and pre-fetch the interface machine's IP;
● Lazy loading: defer the UI, comments, and other logic so system resources go to the first frame;
● Caching: e.g., cache the interface machine IP and reuse it for a period of time;
● Serial to parallel: pull data concurrently to save time (see the sketch below);
● Single-step optimization: comb through each step's cost and reduce it.
The optimized flow and timings are roughly as shown in Figure 8. Total time dropped to 680 ms: target achieved!
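As a sketch of the serial-to-parallel item above (with fetchInterfaceIp and fetchFirstFrame as hypothetical stand-ins for the real requests), two formerly sequential network steps can be issued concurrently, so the total wait is the slower of the two rather than their sum:

```java
import java.util.concurrent.*;

// Illustrative only, not the production code.
public final class ParallelPrefetch {
    private static final ExecutorService pool = Executors.newFixedThreadPool(2);

    public static void enterRoom(String roomId) throws Exception {
        Future<String> ipTask = pool.submit(() -> fetchInterfaceIp(roomId));   // was step 1
        Future<byte[]> frameTask = pool.submit(() -> fetchFirstFrame(roomId)); // was step 2
        String interfaceIp = ipTask.get(1, TimeUnit.SECONDS);  // both ran in parallel,
        byte[] firstGop = frameTask.get(1, TimeUnit.SECONDS);  // so we wait once, not twice
        // ... hand firstGop to the player, connect to interfaceIp for the live stream
    }

    private static String fetchInterfaceIp(String roomId) { /* network call */ return "10.0.0.1"; }
    private static byte[] fetchFirstFrame(String roomId) { /* network call */ return new byte[0]; }
}
```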

5. Performance optimization

As the product kept iterating, live streaming gameplay grew richer while performance problems kept surfacing. In particular, after we added animated stickers, filters, voice changing, and similar features, large numbers of users reported that broadcasting was very laggy. Statistics confirmed it: anchor-side frame rates were very low, pictures were not continuous, and the subjective experience was stuttery. On top of that, a large share of users were on low-end devices. See Figure 9.

Analysis showed that the main reason for the low frame rate was that per-frame processing took too long, while encoding was a comparatively minor factor. Roughly, total processing time = number of frames × per-frame processing time; at a 15 fps target, for example, each frame has a budget of only about 66 ms for capture, processing, and encoding. So we optimized both factors step by step.

● Match the capture resolution to the processing resolution. We encode at 960×540, but some phones cannot capture at that resolution, so capture is typically 1280×1024. Previously we processed first and scaled afterward; now we scale first and then process, reducing the image size going into filters and animated stickers. See Figure 10.

● Drop frames before processing. Although we request a capture frame rate from the system camera, many models ignore it, so we discard surplus frames by policy to cut the number of frames going into image processing. For example, we request 15 fps but actually receive 25; the extra 10 frames would only be wasted work in processing and encoding, so dropping them up front saves resources (a sketch of this gate follows the list).
● Device tiering: different models use different capture and encoding frame rates according to their hardware capability to stay smooth; the frame rate is also adjusted dynamically on overheating and again on returning to normal.
● Face detection optimization: run face detection every two frames instead of every frame; this causes no face drift and also reduces processing time.
● Capture pipeline rework: removed unnecessary steps, cutting roughly 33% of the overhead, as shown in Figure 11.

● Animated stickers are rendered on separate GL threads: sticker rendering moves to its own OffScreenThread so it does not eat into the main beautification pipeline's time. The effect is shown in Figure 12:

● Animated stickers use OpenGL blending mode.
● Image processing algorithm optimization, such as ShareBuffer (fast data copies between GPU and memory with no CPU involvement, saving the texture-to-RGBA time; elapsed time nearly halves and FPS improves by at least 2-3 frames) and LUT-based filter optimization, as shown in Figure 13.
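The frame gate from the drop-before-processing item above can be as simple as the following sketch; the class is illustrative, assuming frame timestamps in milliseconds, and is not the production implementation:

```java
// Keep only frames that fit the target interval, so filters, stickers, and the
// encoder never see the surplus frames the camera delivers anyway.
public final class FrameGate {
    private final long minIntervalMs;
    private long lastAcceptedMs = Long.MIN_VALUE;

    public FrameGate(int targetFps) {
        this.minIntervalMs = 1000L / targetFps; // ~66 ms at 15 fps
    }

    /** Returns true if this camera frame should be processed, false to discard it. */
    public boolean accept(long frameTimestampMs) {
        if (frameTimestampMs - lastAcceptedMs < minIntervalMs) {
            return false; // too soon after the last kept frame: drop it
        }
        lastAcceptedMs = frameTimestampMs;
        return true;
    }
}
```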

Beyond these two big optimization areas, we are also pushing more devices onto hardware encoding: encoding becomes stable with no frame rate fluctuation, and CPU usage drops. Those are our main optimization points; after they shipped, user complaints about live streaming dropped sharply.

6. Lag optimization

First, some definitions. "Stuck user": a user whose stall time / total viewing time exceeds 5%; stuck rate = stuck users / total users. Anchor-side stall point: a sample where the frame rate after encoding the uplink full picture falls below 5. Audience-side stall point: a sample where the post-decode frame rate falls below 5. Our goal was to push the stuck rate below 50%. Note the asymmetry: an uplink (anchor-side) stall makes every viewer stutter, while a downlink stall affects only that single viewer.
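For concreteness, here is a minimal sketch of the stuck-user metric exactly as defined above (the session fields are assumptions about how stall time might be collected):

```java
import java.util.List;

// A user is "stuck" when stall time exceeds 5% of viewing time;
// stuck rate = stuck users / total users.
public final class StutterStats {
    public static final double STUCK_THRESHOLD = 0.05;

    public static class Session {
        final long stallMs, totalMs;
        Session(long stallMs, long totalMs) { this.stallMs = stallMs; this.totalMs = totalMs; }
        boolean isStuck() { return totalMs > 0 && (double) stallMs / totalMs > STUCK_THRESHOLD; }
    }

    public static double stuckRate(List<Session> sessions) {
        long stuck = sessions.stream().filter(Session::isStuck).count();
        return sessions.isEmpty() ? 0 : (double) stuck / sessions.size();
    }
}
```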

Figure 14 shows where stutter can come from: roughly three modules, the anchor side, the network, and the audience side. Anchor-side performance was largely addressed above, so we turned to the network and the audience side. Statistics showed that network quality accounted for about 50% of the impact, an obvious optimization target. For uplink optimization we did what Figure 15 shows: reduce the data in each frame and reduce the number of frames. In train terms, we lightened the cargo and ran fewer trains.
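A hedged sketch of that uplink idea, lighter cargo and fewer trains: when the send queue backs up, lower the encoder bitrate and frame rate, and restore them when the network recovers. The Encoder interface, setter names, and thresholds here are assumptions for illustration, not the QAVSDK API:

```java
public final class UplinkGovernor {
    interface Encoder { void setBitrateKbps(int kbps); void setFps(int fps); }

    private final Encoder encoder;
    public UplinkGovernor(Encoder encoder) { this.encoder = encoder; }

    /** Called periodically with the number of video frames waiting to be sent. */
    public void onSendQueueSize(int queuedFrames) {
        if (queuedFrames > 30) {            // badly congested: degrade hard
            encoder.setBitrateKbps(300);
            encoder.setFps(10);
        } else if (queuedFrames > 10) {     // mildly congested: degrade gently
            encoder.setBitrateKbps(600);
            encoder.setFps(15);
        } else {                            // healthy: restore full quality
            encoder.setBitrateKbps(1200);
            encoder.setFps(20);
        }
    }
}
```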

Downlink optimization on the client is to buffer, then discard: hold a small stock of frames, and when the backlog grows too large, throw the excess away, as shown in Figure 16.
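A minimal sketch of that buffer-then-discard behavior (the backlog bound of 15 frames is an assumed value, not the production tuning):

```java
import java.util.ArrayDeque;

// Hold a small jitter buffer of frames; if the backlog grows past a bound (e.g. a
// network burst after a stall), drop the oldest frames so playback catches back up.
public final class PlayoutBuffer {
    private static final int MAX_BACKLOG = 15;        // ~1 s at 15 fps (assumed bound)
    private final ArrayDeque<byte[]> frames = new ArrayDeque<>();

    public synchronized void push(byte[] frame) {
        frames.addLast(frame);
        while (frames.size() > MAX_BACKLOG) {
            frames.removeFirst();                      // discard stale frames, keep latency low
        }
    }

    public synchronized byte[] pollForRender() {
        return frames.pollFirst();                     // next frame for the renderer, or null
    }
}
```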

Figure 17 shows the optimized result, with a clear advantage over competing products:

At the same time, the anchor-side stuck rate dropped to 30% and the audience-side stuck rate to 40%, so the goal was met.

7. Playback optimization

As shown in Figure 18, let's first look at the general flow of live stream playback:

● Server cost: besides pushing the private protocol stream, the server must also transcode the private stream into HLS and MP4 for playback.

For scheme selection: MP4 has a mature playback stack, starts fast, and gives a good user experience, while HLS has weaker system support and longer waits. Does that mean we go straight to MP4? Not quite. Native clients could play either HLS or MP4, but HTML5 can only use HLS for live viewing, because the data keeps changing. If native playback used MP4, the server would have to transcode the private protocol stream into both MP4 and HLS, which is clearly uneconomical. That led us to HLS: the server only transcodes each stream once.

Having chosen HLS, we had to fix its problems on Android. HLS has been supported since Android 3.0, but support gradually faded once Google began pushing DASH as a replacement; the official documentation barely mentions HLS. In practice, the Android native player's HLS support only goes as far as "it plays", with no optimization at all. It issues redundant M3U8 requests, and the whole startup sequence after play begins runs serially, which badly hurts time to first frame (about 4.5 s on average).

We solved this with a local download proxy. With the proxy in place, the M3U8 content can be scanned at the proxy layer, triggering parallel download and caching of the TS segments. The player still downloads serially, but since the data has been prepared in advance, it is returned almost immediately, cutting the time to first frame. After experimental rollout, the average time to a visible first frame dropped to about 2 s. Figure 19 shows the flow before and after optimization.
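To illustrate the proxy-layer idea, here is a sketch, under the assumption of a simple single-bitrate playlist, of scanning an M3U8 body and prefetching its TS segments in parallel; downloadToCache is a hypothetical stand-in for the real cache layer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Scan the playlist as it passes through the local proxy and warm the cache,
// so the player's serial segment requests are served immediately.
public final class HlsPrefetcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Extract segment URIs from playlist text (lines not starting with '#'). */
    static List<String> parseSegments(String m3u8Body) {
        List<String> uris = new ArrayList<>();
        for (String line : m3u8Body.split("\n")) {
            String s = line.trim();
            if (!s.isEmpty() && !s.startsWith("#")) uris.add(s);
        }
        return uris;
    }

    public void prefetch(String m3u8Body) {
        for (String uri : parseSegments(m3u8Body)) {
            pool.submit(() -> downloadToCache(uri));   // parallel, ahead of the player
        }
    }

    private void downloadToCache(String uri) { /* HTTP GET, write to local cache */ }
}
```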

On caching strategy, the industry has no mature solution for HLS caching yet. We implemented automatic detection of, and support for, the three modes in Figure 20, so upper layers need not care about the underlying cache and download logic at all.

The result: the server saves 50% of transcoding compute and storage cost, and playback loads faster.

8. Case summary

From the cases and optimization analysis above, we distilled three recurring problem patterns and the corresponding approach to each:
● Speed problems: map the timing sequence, measure each stage, and break them down one by one;
● Performance problems: use traces to pinpoint where performance is lost, and break them down one by one;
● Hard problems: build a model, analyze preliminarily, add statistical reporting, confirm the problem, and break them down one by one.

This is summarized in Figure 21:

The case also illustrates these reference points:
● Iterate fast, in small steps;
● Let monitoring drive optimization;
● Build models to make abstract problems intuitive to analyze;
● Product positioning determines the direction of optimization;
● At massive scale, small savings add up to big ones.

Finally, this article closes with a topology diagram of Qzone live streaming (Figure 22).
