Industries such as live streaming, social networking, and online education have driven the rise of real-time audio and video technology (RTC), and in turn the development of RTC has brought tremendous growth to these industries. As RTC penetrates deeper into application scenarios, business partners are demanding a better experience: lower latency, smoother playback, and higher picture quality. At LiveVideoStackCon 2021 Beijing, Julian, product lead of Volcano Engine Video Cloud RTC, shared how Volcano Engine RTC is applied in Douyin, Xigua Video, Toutiao, and other products, and how it continues to pursue the ultimate experience.

Hi, I’m Julian from the Volcano Engine RTC team, and it’s a pleasure to talk to you today. I would like to share how the RTC technology behind Douyin pursues the ultimate experience.

1. Introduction

First, let me give you a brief introduction to the Volcano Engine RTC team.

We’re not from Douyin; we’re from Volcano Engine, and Douyin is a customer of Volcano Engine. My team is Volcano Engine’s RTC team, which has served Douyin for four years. Over those four years, Douyin has grown to 600 million DAU, and Volcano Engine’s RTC team has grown by leaps and bounds along with it.

Let’s take a look at the application scenarios of RTC on Douyin.

The most classic is the co-hosting PK (lianmai PK). Two Douyin hosts connect via RTC; their two audio and video streams are transcoded into one and pushed to a CDN for live broadcast to the audiences in their respective rooms. During this process, the hosts compete to see who receives more gifts. Some PK scenes also include interaction between a host and the audience, which is likewise carried out through RTC.

There are also some interesting scenarios on Douyin that you may know less about, such as Watch Together and private chat with friends.

Watch Together lets several friends connect on Douyin and watch the same short video at the same time. One of them is the room owner; wherever the owner’s video plays, everyone else’s video automatically follows. The friends communicate in real time by voice. In this scenario, RTC is used both for the voice chat and, via its low-latency messaging, to synchronize the playback progress of the short video.

There is also the private chat with friends. Some heavy users know that Douyin now supports video and voice calls too, and the experience is quite good. When a call with friends on another app gets laggy, I switch to Douyin, where it does not get stuck; if you are interested, give it a try. Video calls on Douyin also come with beauty effects, so the share of video calls is higher than that of voice calls.

Call performance on Douyin is backed by metrics. Through long-term cooperation we have refined a metrics system, and some of the core metrics are excerpted in this figure. On the left are the technical metrics of RTC, including stall rate, end-to-end latency, first-frame time, and clarity. On the right are Douyin’s business metrics related to RTC quality, including user feedback rate, penetration rate, usage duration, and business revenue. RTC optimization is guided by these data: we run a large number of A/B experiments and attribution analyses, improving the business metrics by improving the technical metrics. The criterion for a full rollout is that it achieves the best business metrics.

2. Challenges

So let’s talk about what our challenges are.

The core metrics of RTC can be summarized into three core requirements: clarity, fluency, and real-time performance. But those of you familiar with RTC know that it is sometimes impossible to satisfy all three at once. When the network is good there is no problem; when the network is bad, one or two of them have to be sacrificed.

As a simple example, when the network is bad and video stutters, increasing the buffer is the easiest optimization. But if the buffer delay is too high, the two parties start talking over each other, which seriously hurts the call experience. If you want both fluency and real-time, the only thing left to reduce is clarity, yet Douyin has many good-looking hosts, so keeping their faces clearly visible is a high priority.

Faced with such a trade-off, the business can usually accept a small compromise on metrics, but will always ask for continuous optimization. In other words, “I want it all.” And the business needs are all reasonable. So let me talk about how we respond to these challenges.

3. Best practices

3.1 Fluency

First, a word about fluency. The corresponding metric is the stall rate. Stalls have the biggest impact on communication: even if only one word is dropped, you clearly feel that the conversation is not smooth.

Stalls are caused by weak networks. So what causes a weak network?

Consider the simplest RTC transmission model: from endpoint A to endpoint B, with RTC’s cloud transmission network in the middle. The transmission quality of the cloud network itself is now very good, and we continuously monitor its QoS metrics.

The monitoring shows that packet loss inside the cloud network is essentially nonexistent, domestic cloud transmission latency is within 50 ms, and global transmission latency is within 250 ms.

The weak network mostly lies in the access network, the first mile and last mile where the user’s own client connects to the RTC network. Our statistics show that about 30% of users encounter a weak network: roughly 26.8% are mildly weak, and moderate plus severe account for about 4%.

Based on instantaneous network metrics, users’ network conditions are divided into four grades: good, mildly weak, moderately weak, and severely weak.
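To make this concrete, here is a minimal sketch in Python of how such a grading might look. The specific thresholds are illustrative assumptions for demonstration only, not our actual grading rules.

```python
# Illustrative sketch of grading network quality from instantaneous metrics.
# Thresholds are assumptions, not Volcano Engine's actual values.
from dataclasses import dataclass
from enum import Enum

class NetGrade(Enum):
    GOOD = "good"
    MILD = "mildly weak"
    MODERATE = "moderately weak"
    SEVERE = "severely weak"

@dataclass
class NetSample:
    loss_rate: float   # fraction of packets lost, 0.0-1.0
    rtt_ms: float      # round-trip time in milliseconds
    jitter_ms: float   # inter-arrival jitter in milliseconds

def grade(sample: NetSample) -> NetGrade:
    # Take the worst grade suggested by any single metric.
    if sample.loss_rate > 0.30 or sample.rtt_ms > 800 or sample.jitter_ms > 200:
        return NetGrade.SEVERE
    if sample.loss_rate > 0.10 or sample.rtt_ms > 400 or sample.jitter_ms > 100:
        return NetGrade.MODERATE
    if sample.loss_rate > 0.02 or sample.rtt_ms > 200 or sample.jitter_ms > 50:
        return NetGrade.MILD
    return NetGrade.GOOD

print(grade(NetSample(loss_rate=0.05, rtt_ms=150, jitter_ms=30)))  # mildly weak
```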

Here is an extreme case from our online data that shows the limits of RTC’s capability.

This chart plots the packet loss rate and the latency. As we can see, the connection is stable at first; then a weak network suddenly appears and lasts for a while, with packet loss peaking at 49%. With the anti-packet-loss strategy, latency rises from 88 ms to 700 ms. After optimization, the stall duration under this strategy is basically kept within 1.2 seconds.

To adapt to such sudden, extreme weak networks, our algorithms also adjust automatically in real time.

Different algorithms are used in different scenarios. In one-to-one communication, for example, the sending strategy must adapt to the sender’s uplink and downlink as well as the receiver’s downlink: when the receiver’s downlink is poor, there is no benefit in sending high-quality audio and video. In multi-party communication we use simulcast, sending a large and a small stream. When applying simulcast, people usually focus on the receivers, but the sender may also be under pressure: if its uplink is weak, we also have to consider whether the sizes of the large and small streams are still appropriate.
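As a rough sketch of the simulcast idea, the snippet below shows one way the sender’s uplink and each receiver’s downlink could drive layer selection. The layer parameters and bandwidth thresholds are assumptions for illustration, not the actual Volcano Engine RTC policy.

```python
# Minimal simulcast layer-selection sketch with two layers ("large"/"small").
HIGH_LAYER = {"name": "large", "width": 1280, "height": 720, "bitrate_kbps": 1500}
LOW_LAYER  = {"name": "small", "width": 640,  "height": 360, "bitrate_kbps": 400}

def sender_layers(uplink_kbps: float) -> list[dict]:
    """Decide which layers the sender can afford to publish on its uplink."""
    if uplink_kbps >= HIGH_LAYER["bitrate_kbps"] + LOW_LAYER["bitrate_kbps"]:
        return [HIGH_LAYER, LOW_LAYER]      # publish both layers
    if uplink_kbps >= LOW_LAYER["bitrate_kbps"]:
        return [LOW_LAYER]                   # weak uplink: only the small stream
    return []                                # audio-only fallback

def receiver_layer(downlink_kbps: float, published: list[dict]) -> dict | None:
    """Pick the best published layer that fits the receiver's downlink."""
    for layer in sorted(published, key=lambda l: -l["bitrate_kbps"]):
        if downlink_kbps >= layer["bitrate_kbps"]:
            return layer
    return published[-1] if published else None  # last resort: smallest stream

published = sender_layers(uplink_kbps=900)           # only the small stream fits
print(receiver_layer(downlink_kbps=2000, published=published))
```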

Our algorithms take all of these scenarios into account. Based on insight into user QoS data, we can automatically deliver the appropriate policy for each scenario. The algorithm’s training data comes from the massive network conditions of real online users. The example above is anti-packet-loss, but real weak networks are far more complex: a pure packet-loss scenario almost never exists; it is always compounded with jitter, delay, and other problems. We continuously feed real online conditions into the training database and keep improving the algorithm’s responses.

In addition, the algorithm should react quickly when the network degrades, but recover with some hysteresis. Network fluctuations sometimes appear and disappear very quickly, so we wait an extra 3 or 4 seconds to make sure the network is really stable before restoring the user’s bitrate.
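A minimal sketch of this asymmetric behavior is shown below: degrade immediately, restore only after a hold-off period. The 4-second hold-off and the bitrate values are illustrative assumptions.

```python
# Degrade immediately; restore the bitrate only after the network has stayed
# good for a hold-off period.
import time

class BitrateController:
    def __init__(self, normal_kbps=1500, degraded_kbps=400, holdoff_s=4.0):
        self.normal_kbps = normal_kbps
        self.degraded_kbps = degraded_kbps
        self.holdoff_s = holdoff_s
        self.current_kbps = normal_kbps
        self._good_since = None  # when the network last turned good

    def on_network_report(self, is_good: bool, now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        if not is_good:
            # Degrade right away; reset the recovery timer.
            self.current_kbps = self.degraded_kbps
            self._good_since = None
        else:
            if self._good_since is None:
                self._good_since = now
            # Restore only after the network has been good long enough.
            if now - self._good_since >= self.holdoff_s:
                self.current_kbps = self.normal_kbps
        return self.current_kbps
```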

We have prepared a video to demonstrate how we counter weak networks. It was recorded by team members themselves and involves no user privacy.

The demo simulates weak network conditions by limiting the maximum available bandwidth to the three weak-network levels. As the network goes from good to mildly weak, the bitrate and frame rate drop. Under a moderately weak network, packet loss becomes serious and the resolution is reduced. Under a severely weak network, the bitrate falls below 500 kbps. Under extreme conditions a 1-second stall appears; then the network recovers, suddenly drops back to the extreme condition, and finally recovers.

As you can see, in the extreme case there is a 1-second stall but no words are lost. After adapting to the weak network, the backlogged audio is played at a slightly accelerated speed to catch up, without affecting the content.

3.2 Real-time

Real-time performance has two metrics: end-to-end latency and first-frame rendering speed. For call scenarios, keeping end-to-end latency within 400 ms gives a fine user experience. Of course, some scenarios have much stricter latency requirements, such as cloud gaming, where the round trip from the user triggering a command to the first responding frame must be under 100 ms. Time is limited today, so I will not expand on that.

Here I will mainly talk about first-frame rendering speed.

Let’s think about this for a moment: why does a CDN, whose latency is much higher than RTC’s, deliver a faster and more stable first frame? A CDN caches frequently requested videos at its edge nodes, so users can pull the stream directly from the edge, which is faster. Because of the nature of the business, RTC cannot apply the same caching strategy, but we can borrow the idea. In a multi-party scenario, for example, two people are already talking when a third joins; the ongoing streams of the first two are already flowing through the edge node. Volcano Engine RTC caches the most recent GOP of each audio and video stream at the edge to speed up the first frame for a newly joining participant.

A GOP is the interval between two video keyframes; if you are familiar with video processing, you know the concept. Both 1-second and 2-second GOPs are common in the industry.

We cannot predict when a request for a stream will arrive. Without a cache, unless the request happens to land right at the start of a GOP, the requester has to wait for the next I-frame before it can decode its first frame; the expected waiting time follows from the GOP length. With the caching strategy, the first frame can be fetched immediately, no matter when the request arrives.
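As a back-of-the-envelope model (assuming join requests land uniformly at random within a GOP), the expected wait without caching is about half the GOP length; the sketch below simulates this for 1-second and 2-second GOPs.

```python
# Expected first-frame wait without an edge GOP cache, assuming requests arrive
# uniformly within a GOP. Illustrative only.
import random

def expected_wait_without_cache(gop_s: float, trials: int = 100_000) -> float:
    total = 0.0
    for _ in range(trials):
        offset = random.uniform(0.0, gop_s)  # where in the GOP the request lands
        total += gop_s - offset              # wait until the next I-frame
    return total / trials

for gop in (1.0, 2.0):
    print(f"GOP {gop:.0f}s: expected wait ~{expected_wait_without_cache(gop):.2f}s "
          "without cache, ~0s with the edge GOP cache")
# GOP 1s -> ~0.50s, GOP 2s -> ~1.00s: caching the latest GOP removes this wait.
```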

Here is a demo, mainly showing the loading speed of three streams per room. You can swipe up and down between rooms just like on Douyin, and at the end there is a mic-connection, which demands an even faster first frame. In this demo the first frame opens within 100 ms to 200 ms. We also measure first-frame speed online: it is basically within 700 ms, and in some well-tuned business scenarios it is kept within 400 ms. We call this instant opening.

3.3 Clarity

The third optimization direction is clarity, which raises the upper limit of the user experience. The previous optimizations are perceived very directly by users, while the perception of clarity is subtle: if the video is not quite sharp, you may not notice at first, but after watching for a while you may simply not want to keep watching. So this metric ultimately affects how long users stay.

There is no upper limit to clarity. The problem RTC has to solve is how to deliver the best possible video quality in real time with limited bandwidth.

  • BVC1 – ByteDance’s self-developed video codec

This video compares the encoding efficiency of the BVC1 encoder with H.264 and H.265. The RD plot on the right shows that BVC1 gains about 0.6 dB over the mainstream H.265 encoder. A codec is usually judged by how much bandwidth it saves, but in RTC the user’s bandwidth is what it is and the resolution is decided by the business; there is no need to spend spare bandwidth on a higher resolution. So Volcano Engine RTC chose to spend the coding-efficiency gain on better image quality at the same bandwidth and resolution.

You can see the pink bar in the background.

  • ROI (Region of Interest) coding

ROI (Region of Interest) coding is also used extensively, essentially in all mic-connection scenarios. In plain terms, it concentrates the bits on the face region. At the same frame rate and bitrate, the ROI-encoded result shows clearer facial detail. Earlier, some attendees asked how we evaluate the ROI effect. ISO provides a way to measure image quality by mapping blind-vote ratios to JND (just-noticeable difference) scores. We invited more than 100 colleagues to compare the results in an internal test and scored 2.3, which is a relatively high score.
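To illustrate the idea behind ROI coding (not our actual encoder implementation), the sketch below assigns a lower QP, and therefore more bits, to blocks that overlap a detected face box and a slightly higher QP elsewhere. The block size, QP offsets, and face box are hypothetical.

```python
# Simplified ROI idea: lower QP inside the face region, higher QP outside,
# so facial detail gets more bits at roughly the same average cost.
BLOCK = 16  # macroblock size in pixels

def roi_qp_map(width, height, face_box, base_qp=30, roi_delta=-4, bg_delta=+2):
    """Return a per-block QP map: lower QP inside the face ROI, higher outside."""
    fx, fy, fw, fh = face_box  # (x, y, w, h) from a face detector, assumed given
    cols, rows = width // BLOCK, height // BLOCK
    qp_map = []
    for r in range(rows):
        row = []
        for c in range(cols):
            x, y = c * BLOCK, r * BLOCK
            in_roi = (fx <= x < fx + fw) and (fy <= y < fy + fh)
            row.append(base_qp + (roi_delta if in_roi else bg_delta))
        qp_map.append(row)
    return qp_map

qp = roi_qp_map(640, 360, face_box=(240, 60, 160, 200))
print(qp[5][18], qp[0][0])  # 26 inside the face region, 32 in the background
```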

  • Super Resolution

Finally, we also use a super-resolution algorithm.

Look at the details of the hair: super-resolution raises the effective resolution. Here we upscale a 360p video to 720p. The blind test gave an even higher score of 2.55.

Optimization strategies in different scenarios

Beyond these core algorithmic capabilities, we also apply the most appropriate optimization strategy for each scenario.

In PK scenarios, for example, we apply an optimal-resolution strategy.

Let me explain this strategy. During a PK, each RTC view takes up a quarter of the screen (half the width and half the height). Phones keep getting better: some can handle 1080p audio and video calls, some only 540p, and so on. Suppose you, as a host capturing at 1080p, PK with a host capturing at 720p: what you actually display of the other host’s video is only 540p, so it does not help that the other host captures at 720p, and the reverse is also true. The optimal-resolution strategy means that RTC automatically chooses the most appropriate resolution based on the hosts’ devices and the rendered size, rather than blindly using the highest resolution.
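A minimal sketch of this negotiation follows; the resolution rungs and function names are illustrative assumptions, not the real strategy code.

```python
# Optimal-resolution idea for a PK layout: never encode more pixels than the
# receiver will actually render (a quarter of the screen).
STANDARD_HEIGHTS = [1080, 720, 540, 360]

def negotiated_send_height(sender_capture_h: int, receiver_screen_h: int) -> int:
    rendered_h = receiver_screen_h // 2          # PK view: half height, half width
    target = min(sender_capture_h, rendered_h)   # no point exceeding either limit
    # Snap down to the nearest standard rung the sender can actually produce.
    for h in STANDARD_HEIGHTS:
        if h <= target:
            return h
    return STANDARD_HEIGHTS[-1]

# A 720p capture sent to a host with a 1080p screen: a 540p encode is enough.
print(negotiated_send_height(sender_capture_h=720, receiver_screen_h=1080))  # 540
```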

Another strategy applied in the PK mic-connection scenario concerns pushing to the CDN. Most RTC services transcode (mix) in the cloud and then push to the CDN, which introduces an extra encode/decode and an extra transmission hop. In a two-person PK the remote host’s stream naturally has to come from the remote end, so nothing can be saved there; but the host’s own picture, when relayed to the CDN, suffers some quality loss from the server-side re-encoding. As a result, many business teams ask whether the transcoding can be done on the client instead.

Client-side transcoding, however, has its own problems: it removes one encode/decode and transmission hop, but it increases the performance cost on the device. We therefore proposed an integrated end-cloud CDN scheme: if the host’s device performance and network are sufficient for client-side transcoding, we do it on the client; if not, we demote it to the server. High-end devices thus enjoy better clarity, while low-end devices are guaranteed normal use. This strategy currently covers more than 60% of online users.
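The decision itself can be sketched roughly as below. The thresholds and the device-score scale are illustrative assumptions, not our actual rollout criteria.

```python
# Hedged sketch of the "client or cloud transcoding" decision for PK CDN push.
from dataclasses import dataclass

@dataclass
class HostState:
    device_score: float   # 0-100, e.g. looked up from a device capability database
    cpu_usage: float      # current CPU usage, 0.0-1.0
    uplink_kbps: float    # measured uplink bandwidth
    target_kbps: float    # bitrate of the mixed stream to push to the CDN

def choose_transcode_site(host: HostState) -> str:
    capable = (
        host.device_score >= 70          # strong enough device
        and host.cpu_usage <= 0.6        # headroom left for encoding the mix
        and host.uplink_kbps >= 1.5 * host.target_kbps  # margin on the uplink
    )
    return "client" if capable else "server"

print(choose_transcode_site(HostState(85, 0.45, 6000, 3000)))  # client
print(choose_transcode_site(HostState(55, 0.45, 6000, 3000)))  # server
```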

This raises the question of how to judge the performance of a user’s device.

Volcano Engine RTC maintains a large device database in the back end, covering more than 20,000 device models and still growing. Here are some screenshots. For every model, we make sure there are polished and verified recommended parameters and strategies for each scenario.
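Conceptually, consuming such a table looks like a simple lookup keyed by model and scenario, with conservative defaults for unknown models. The table contents and keys below are invented purely for illustration.

```python
# Toy sketch of a per-model recommendation lookup with a conservative fallback.
RECOMMENDATIONS = {
    ("Phone-A", "pk"):       {"capture": "1080p", "encode": "720p", "client_mix": True},
    ("Phone-B", "pk"):       {"capture": "720p",  "encode": "540p", "client_mix": False},
    ("Phone-A", "1v1_call"): {"capture": "720p",  "encode": "720p", "client_mix": False},
}
DEFAULTS = {"capture": "540p", "encode": "360p", "client_mix": False}

def recommended_params(model: str, scenario: str) -> dict:
    return RECOMMENDATIONS.get((model, scenario), DEFAULTS)

print(recommended_params("Phone-B", "pk"))
print(recommended_params("Unknown-Model", "pk"))  # conservative defaults
```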

3.4 Beauty effects

Finally, a separate word on combining RTC with beauty effects. Beauty effects consume a lot of CPU and memory; with such a large model running alongside, they bring new challenges to RTC’s adaptive algorithms.

During testing we saw beauty effects degrade the efficiency of the encoding algorithm, so we asked ourselves: how can we avoid that as much as possible?

Let’s look at the current mainstream practice: RTC and CV (the effects engine) are separate. The developer captures the video, sends it to the beauty-effects SDK for processing, renders the processed frames locally, then hands them to the RTC SDK for encoding and transmission. The logic is sound, but the drawback is that RTC encoding degrades in response to weak networks and limited device performance. If you only intend to encode 360p, capturing and beautifying at 1080p makes no sense; it is simply wasted work. If RTC’s degradation decisions could also drive capture and beautification, the overall performance cost would be much better.

Volcano Engine RTC therefore unified the beauty-effects SDK with the RTC SDK: capture is handled by the RTC SDK, which then calls the CV-related interfaces itself. This way the resolution that is captured and submitted to the beauty SDK matches the resolution that will actually be encoded; no more capturing and beautifying at 1080p only to transmit at 360p.
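Conceptually, the unified pipeline lets the encode target chosen by RTC’s degradation logic drive capture and effects as well. The sketch below is a rough illustration of that flow; every class and method name is hypothetical, not the real SDK API.

```python
# Conceptual unified pipeline: the RTC degradation decision reconfigures
# capture, beauty processing, and encoding to the same resolution.
class UnifiedPipeline:
    def __init__(self, camera, beauty_engine, encoder):
        self.camera = camera
        self.beauty = beauty_engine
        self.encoder = encoder
        self.target = (1280, 720)  # current encode target

    def on_degradation_decision(self, width: int, height: int) -> None:
        # RTC adaptation picks a new encode resolution; propagate it upstream
        # so capture and effects are reconfigured to the same size.
        self.target = (width, height)
        self.camera.set_capture_size(width, height)
        self.beauty.set_input_size(width, height)
        self.encoder.set_resolution(width, height)

    def process_frame(self):
        frame = self.camera.capture()       # already at the target resolution
        frame = self.beauty.apply(frame)    # no oversized beautification work
        return self.encoder.encode(frame)
```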

4. Conclusion

Although today’s presentation focused on the optimizations we did for Douyin, it is really a methodology for scenario-driven optimization that is not limited to Douyin.

At present, in addition to Douyin, we serve many other customers inside and outside ByteDance, now averaging more than 15 billion call minutes per month. The data brought by this huge base is itself a focus of our optimization.

Our attitude is to pursue the ultimate, and our goal is to help our partners succeed. If you are interested, let’s talk further.

Thank you!