
This article was published by the Tencent Video Cloud terminal team in the Cloud + Community column.

Reprinted with the author's permission. The author, RexChang (Changqing), is technical director of the Tencent Video Cloud terminal team. He joined Tencent upon graduating in 2008 and has worked in client R&D ever since, participating in products such as PC QQ, Mobile QQ, and QQ IoT. He is currently responsible for the optimization and implementation of audio/video terminal solutions in the Tencent Video Cloud team, whose product line covers interactive live streaming, video on demand, short video, real-time video calls, image processing, AI, and more.

For easier digestion, refer to the mind map of this article.

Audio and video for Mini Programs was born on the C7172 train from Shenzhen to Guangzhou in April 2017…

A chance collaboration

Before WeChat began internal testing of Mini Programs in 2016, teams inside Tencent had already started hearing the news. All of us had a hunch that Mini Programs would make a big difference in the mobile application landscape. But I had only just joined the Tencent Video Cloud team at the time, so I followed the news without thinking much further about it.

At the beginning of 2017, a flood of customer inquiries made my team and me realize there was particularly strong demand here. But the WeChat engineering team, famous for winning big with a small team, had limited bandwidth and could hardly cover every scenario. For audio and video, Mini Programs provided only basic capture and playback capabilities; the familiar `<video>` tag, for example, was implemented with the system player, so live streaming could only support high-latency HLS, plus video on demand.

By this time the Tencent Video Cloud SDK, after more than a year of polishing and optimization, was like the Zero fighter at the start of World War II, ready to cut through anything in its path. Although the chances of a collaboration were uncertain, our team still took the shuttle bus from the Shenzhen headquarters to WeChat's T.I.T. campus in Guangzhou.

After many rounds of talks and jianx's efforts, the collaboration, accidental and fraught with uncertainty, was finally achieved.

Technical challenges

In audio and video applications, having two teams work together is a good thing. But WeChat's market position also made this a serious battlefield, so the challenges we faced were severe:

(1) The interface must be simple and easy to use, ideally just one or two tags.

(2) It must cover a variety of application scenarios, supporting both live streaming and real-time video calls.

(3) The functionality must be extensible, so developers can build personalized application scenarios to suit their own needs.

(4) Maintainability must be good: developers should be able to troubleshoot common technical problems themselves, without having to be audio/video experts.

(5) The increment to the installation package must be small enough, otherwise the size of the WeChat package cannot be kept under control.

Beyond the high standards, time was also a major constraint. The whole project left us only two weeks to prove ourselves, during which we had to land a G2C project and pass the product demonstration and solution acceptance.

Simplifying the complex

Facing these challenges, I thought of Kalashnikov's famous Soviet rifle, the AK-47.

Its success stems from a simple and practical design philosophy: the rotating bolt ensures safety and eliminates random accidents; the structure is simple and easy to disassemble, so producing it requires neither special precision machining nor a huge investment in production equipment; even an ordinary small workshop can start making it.

Yes, simplifying the complex and pursuing simplicity and reliability was exactly what we needed to achieve.

Overcoming the technical difficulties

It was not easy, but our team overcame the technical difficulties step by step.

Uplink and downlink

First, we needed to disassemble and abstract Tencent Video Cloud's existing audio/video system, that is, break the whole system into individual building blocks. The two most important blocks are the audio/video uplink (push) and the audio/video downlink (play).

Audio and Video Uplink (PUSH)

The uplink uploads the sound and picture from your phone to the cloud in real time. We implemented this capability with the Video Cloud SDK and encapsulated it into a tag called `<live-pusher>`.

The SDK's internal implementation is shown in the figure above. First we capture the camera picture and the microphone sound. But the raw picture and sound need preprocessing: the directly captured picture may contain a lot of noise, so we apply image noise reduction; the skin in a raw portrait may not look as good as one would hope, so we polish and beautify it; and the directly captured sound may carry a lot of ambient noise, so we separate the foreground from the background sound and then apply noise suppression.

Compared with the raw capture, the preprocessed picture and sound are greatly improved. Because all preprocessing aims to please the human audio-visual experience, this seemingly insignificant stage attracts heavy technical investment from many companies. Take LCD flat-panel TVs as an example: Sony's LCD product line does not make its own LCD panels (they come mainly from Taiwanese and mainland manufacturers), yet it has always led other companies in overall picture quality. The secret is continuous accumulation and investment in image processing (super-resolution based on image databases) and backlighting (the eyes of all animals are most sensitive to brightness).

After the picture and sound have been polished, they are sent to the encoder for compression. The encoder's job is to compress images and sounds into a bitstream of 0s and 1s whose volume is far smaller than before compression. The last step is to send the encoded data through the network module. In the live streaming scenario, TCP is generally used as the transport protocol; in the real-time call scenario, UDP is the main transport protocol.
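The uplink pipeline described above (capture, preprocess, encode, send) can be sketched as a simple chain of stages. Everything below is an illustrative toy, not the SDK's actual API: the stage functions are invented, and the "encoder" is just a run-length encoder standing in for a real codec.

```python
def denoise(frame):
    """Toy noise reduction: clamp each sample down by one step."""
    return [max(v - 1, 0) for v in frame]

def encode(frame):
    """Toy compression: run-length encode the frame into (value, count) pairs."""
    out, i = [], 0
    while i < len(frame):
        j = i
        while j < len(frame) and frame[j] == frame[i]:
            j += 1
        out.append((frame[i], j - i))
        i = j
    return out

def push_pipeline(frames, send):
    """Run each captured frame through preprocess -> encode -> network."""
    for raw in frames:          # capture
        clean = denoise(raw)    # preprocess
        packet = encode(clean)  # encode
        send(packet)            # hand off to the network module

sent = []
push_pipeline([[3, 3, 3, 0, 0]], sent.append)
print(sent)  # [[(2, 3), (0, 2)]]
```

The point of the sketch is the shape of the flow, one direction with a transform at each stage, which is exactly what lets the whole uplink hide behind a single tag.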

Audio-video downlink (PLAY)

The downlink, also called playback, downloads the encoded audio/video data from the cloud in real time and plays it back in real time, so you can see the remote picture and hear the remote sound. Again, we implemented this capability with the Video Cloud SDK and encapsulated it into a tag called `<live-player>`.

The SDK's internal implementation is shown in the figure above. Data from the cloud is delivered straight to the network module. But the network is not perfect: speed always fluctuates, and there can be congestion and brief outages. If the SDK played each piece of data as soon as it arrived from the server, the picture and sound would stutter at the slightest network fluctuation. We solve this with a VideoJitterBuffer, which acts like a small reservoir for incoming network data: audio and video data is held there briefly before being sent to playback, so there is a reserve of "emergency" data to draw on when the network is unstable.
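The reservoir idea can be sketched as a minimal jitter buffer. This toy version only delays start-up until a minimum depth of frames has accumulated; the real VideoJitterBuffer also reorders out-of-order packets and adapts its depth to network conditions.

```python
import collections

class JitterBuffer:
    """Toy jitter buffer: hold incoming frames until a minimum depth
    is reached, then release them one at a time (illustrative only)."""

    def __init__(self, min_depth=3):
        self.min_depth = min_depth
        self.queue = collections.deque()
        self.started = False

    def push(self, frame):
        """Called by the network module whenever data arrives."""
        self.queue.append(frame)

    def pop(self):
        """Called by the playback clock; None means keep buffering."""
        if not self.started:
            if len(self.queue) < self.min_depth:
                return None          # reservoir still filling
            self.started = True
        return self.queue.popleft() if self.queue else None

jb = JitterBuffer(min_depth=2)
jb.push("frame1")
print(jb.pop())    # None: still buffering
jb.push("frame2")
print(jb.pop())    # frame1
```

The cost of the reservoir is a little extra latency; the benefit is that short network stalls drain the queue instead of freezing the picture.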

Once buffered, the data can be sent to the decoder. Decoding restores the compressed audio/video data into images and sound, which are then rendered and played. We use OpenGL to render the picture and the system interfaces of iOS and Android to play the sound.

Signal amplifier

With these two simple tags we can make a first combination and build the first and simplest application scenario: online live streaming.

Online live streaming is a classic one-way audio/video scenario. You only need to combine the two tags: `<live-pusher>` uploads the local picture and sound to Tencent Cloud in real time, and `<live-player>` pulls the audio/video stream from the cloud in real time.

If it were just one uplink stream plus one downlink stream, a single relay server would solve the problem, but that would only deliver a high-quality live service within a very small range. To truly achieve high concurrency with smooth, low-delay playback, we need a powerful video cloud.

The video cloud here acts like a signal amplifier: it takes the single uplink stream and amplifies it out to every part of the country, so that everyone can pull a real-time, smooth audio/video stream from a cloud server close to them. Because the principle is simple, stable, and reliable, and it supports millions of concurrent viewers, everything from online education to sports events, from game streaming to platforms like Huajiao and Inke, is built on this technology.

However, the online live streaming solution only solves one-way audio and video, and it has an obvious problem: the delay is generally between 2 and 5 seconds, which is the best that the `<live-player>` tag plus Tencent Video Cloud can achieve in this mode. For the `<video>` tag the delay is even longer, over 20 seconds, so it is not applicable in scenarios with demanding latency requirements.

Reduce latency

Take security monitoring: home IP cameras generally come with a pan-tilt function, that is, the camera turns to follow the remote control. If the picture delay is large, the viewer has to wait a long time after pressing a control button before seeing the picture move, and the user experience is particularly bad.

Another example is the online claw machine, very popular in 2017. If the delay of the remote player's video is high, remote control of the claw machine becomes impossible, and nobody can actually catch a doll.

To achieve such low latency, ordinary live streaming technology is no longer applicable; we need to introduce two new techniques: delay control and UDP acceleration.

Delay control

Networks are not perfect; they fluctuate. On a fluctuating network, audio/video data from the server does not flow steadily to your phone but arrives in bursts. During slow periods you see stuttering; during fast periods data piles up, and pile-up means increased latency. So we need delay control. The principle is simple: play a little slower when the network is slow and a little faster when the network is fast, so the buffer stays at a steady level. Of course, anyone who actually implements this discovers that audio is a very disobedient "child": handling the sound gracefully is a very hard piece of engineering.
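The play-faster/play-slower rule above can be sketched as a tiny rate controller that watches the buffer depth. The target depth, dead band, and rate steps here are illustrative values, not the SDK's real algorithm, which must also pitch-correct the audio when the rate changes.

```python
def playback_rate(buffered_ms, target_ms=500, band_ms=100):
    """Pick a playback rate from the current buffer depth:
    drain a deep buffer (latency piling up) by playing faster,
    refill a shallow one by playing slower."""
    if buffered_ms > target_ms + band_ms:
        return 1.1   # buffer too deep: speed up to shed latency
    if buffered_ms < target_ms - band_ms:
        return 0.9   # buffer too shallow: slow down to refill
    return 1.0       # inside the comfort band: normal speed

print(playback_rate(900))  # 1.1
print(playback_rate(500))  # 1.0
print(playback_rate(200))  # 0.9
```

The dead band around the target keeps the rate from oscillating on every small network wiggle.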

UDP acceleration

Since the internet is not perfect and is always alternating between fast and slow, can we do better? The classic one-way solution generally uses TCP, because it is simple, reliable, and highly compatible. But TCP's congestion control is designed for fairness and reliable delivery, which is what makes it alternately fast and slow. So we replace it with UDP, on top of which we can build transport that is more stable and faster than TCP.

We built delay control and UDP acceleration into the tags, which brings the end-to-end delay down to around 500 ms. That meets the latency requirements of the control scenarios above.

From one-way to two-way

With one-way low-latency technology in hand, two-way video calls become relatively simple: A and B each push a stream and pull the other's through a low-latency link.

Take car insurance damage assessment. A car owner in trouble calls the insurance company through a Mini Program, and the insurer's in-house assessor can see the state of the car through a single low-delay link. But that is not enough: video, like photos, is easy to forge. So the assessor also needs a video channel back to the car owner. With both audio/video channels connected at once, this constitutes a typical video call scenario, and because the owner and the assessor can talk face to face over video, the risk of fraud is greatly reduced.

While that sounds straightforward, the implementation is not that simple. On the contrary, it was very difficult, because we had to introduce a lot of additional technology:

Noise suppression

Noise suppression removes the background noise in the user's environment. Good noise suppression is a prerequisite for echo cancellation; otherwise the acoustic module cannot tell which part of the captured sound is echo and which should be kept.

Echo cancellation

In a two-way video call, the microphone of a user's phone re-records the sound coming out of its own speaker. If that sound is not erased, it is sent back to the user at the other end, who hears an echo of their own voice.
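The standard way to erase that echo is an adaptive filter that learns the speaker-to-microphone path and subtracts its prediction from the mic signal. Below is a textbook normalized-LMS sketch of the idea, not the engine WeChat actually uses; all parameters are illustrative.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Cancel the speaker (far-end) echo picked up by the microphone
    with a normalized LMS adaptive filter (textbook sketch only)."""
    w = [0.0] * taps                # adaptive estimate of the echo path
    out = list(mic)
    for n in range(taps - 1, len(mic)):
        x = [far_end[n - k] for k in range(taps)]   # recent far-end samples
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est       # residual: ideally near-end speech only
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out[n] = e
    return out

# Simulated check: the mic hears only an echo of the far-end signal,
# so once the filter converges the residual should be near silence.
random.seed(1)
far = [random.gauss(0, 1) for _ in range(2000)]
path = [0.5, 0.3, -0.2]            # toy 3-tap echo path
mic = [sum(h * far[n - k] for k, h in enumerate(path) if n >= k)
       for n in range(len(far))]
residual = nlms_echo_cancel(far, mic)
```

This also shows why noise suppression comes first: the update step blindly treats whatever remains in `e` as error, so uncleaned background noise would corrupt the learned echo path.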

QoS flow control

The internet cannot be perfect all the time, especially in mainland China, where uplink speeds are capped. The job of QoS flow control is to estimate the user's current upstream bandwidth, derive an appropriate target value, and feed it back to the encoder. The encoder then never produces more audio/video data than the current network can carry, which reduces stuttering.
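The estimate-and-feed-back loop can be sketched as follows: smooth the measured uplink throughput with an exponential moving average and hand a conservative fraction of it to the encoder as its target bitrate. The constants and class name are invented for illustration; the SDK's real flow control is more sophisticated.

```python
class UplinkQos:
    """Toy QoS controller: EWMA-smoothed bandwidth estimate with
    headroom, clamped to the encoder's supported bitrate range."""

    def __init__(self, alpha=0.3, headroom=0.8,
                 min_kbps=200, max_kbps=2000):
        self.alpha = alpha          # EWMA smoothing factor
        self.headroom = headroom    # fraction of bandwidth to actually use
        self.min_kbps = min_kbps
        self.max_kbps = max_kbps
        self.estimate = None

    def on_throughput_sample(self, kbps):
        """Feed in a measured throughput sample; get back the
        target bitrate to push to the encoder."""
        if self.estimate is None:
            self.estimate = kbps
        else:
            self.estimate = (self.alpha * kbps
                             + (1 - self.alpha) * self.estimate)
        target = self.headroom * self.estimate
        return round(min(max(target, self.min_kbps), self.max_kbps))

qos = UplinkQos()
print(qos.on_throughput_sample(1000))  # 800
print(qos.on_throughput_sample(400))   # 656: target falls as the link degrades
```

The headroom keeps the encoder slightly below the measured capacity, so queues drain instead of growing when the estimate is optimistic.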

Packet loss recovery

No matter how good the network is, packet loss is inevitable, especially on wireless networks such as Wi-Fi and 4G: because the transmission medium is not exclusive, heavy packet loss occurs as soon as there is interference or high-speed movement. So packet loss recovery techniques must be introduced to restore as much of the lost data as possible.
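One common loss-recovery technique is forward error correction (FEC): send an extra XOR parity packet per group, so any single lost packet in that group can be rebuilt without waiting for a retransmission. The sketch below shows the XOR trick only; real stacks combine FEC with retransmission (ARQ) and larger codes.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(group):
    """Parity packet = XOR of every packet in the group."""
    parity = group[0]
    for pkt in group[1:]:
        parity = xor_bytes(parity, pkt)
    return parity

def recover(group_with_hole, parity):
    """Rebuild the single missing packet (marked None) in a group:
    XORing the parity with all surviving packets cancels them out,
    leaving exactly the lost packet."""
    acc = parity
    for pkt in group_with_hole:
        if pkt is not None:
            acc = xor_bytes(acc, pkt)
    return acc

pkts = [b"\x01\x02", b"\x0f\x00", b"\x10\x21"]
parity = make_parity(pkts)
print(recover([pkts[0], None, pkts[2]], parity))  # b'\x0f\x00'
```

The trade-off is bandwidth: every group carries one extra packet of overhead, which is why the parity group size is tuned against the observed loss rate.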

We gave the tags a new mode, RTC (short for Real-Time Chat), and with it they could truly be used for real-time conversation.

As you can see, it is not easy to deliver all these capabilities without breaking the simple, easy-to-use design style of the tags. In fact, the four technical points above are so difficult that they require years of accumulation; they are not something we could have built on the spot. Standing on the shoulders of giants lets you see further: the acoustic capabilities here are provided by the Teana engine from Tencent's audio and video labs.

Two-way multiplayer

Now that two-person video calls work, can multi-person calls be done the same way? It seems we only need to turn the URL exchange between A and B into a URL exchange among A, B, C, or even more people.

The idea is right, but to make the feature truly useful and mature, a simple URL exchange is far too crude. We need to introduce two additional components:

Room management

Take the multi-person video scenario among A, B, and C shown in the figure above. It is hard for everyone to know the up-to-date status of everyone else (such as their playback URLs and whether they currently have an uplink), and information easily gets out of sync. In more complex cases, say a fourth person D joins, or a fifth person E keeps joining and leaving, this synchronization becomes a nightmare.

The best way is to collect the participants' status and information on the server side and introduce the concept of a room. The room guarantees that all participants get the same information from the server, with no one having to maintain it individually.

Notification system

When a new participant enters the room, or someone leaves, a message must be broadcast to everyone in the room, which requires a good IM system for sending and receiving messages. For example, when D enters, an "I'm here" event is broadcast to the other members, so A, B, and C can each show D's video on their UI.
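Putting the room and the notification system together gives a shape like the server-side sketch below: the room is the single source of truth for member status, and every membership change is broadcast to the remaining members. The event and field names are invented for illustration.

```python
class Room:
    """Toy server-side room: holds each member's status and
    broadcasts join/leave events so clients never track peers
    individually (illustrative only)."""

    def __init__(self, room_id):
        self.room_id = room_id
        self.members = {}            # user_id -> {"play_url": ...}

    def join(self, user_id, play_url, notify):
        event = {"type": "member_join", "user": user_id,
                 "play_url": play_url}
        for peer in self.members:    # tell existing members first
            notify(peer, event)
        self.members[user_id] = {"play_url": play_url}

    def leave(self, user_id, notify):
        self.members.pop(user_id, None)
        for peer in self.members:
            notify(peer, {"type": "member_leave", "user": user_id})

log = []
room = Room("r1")
room.join("A", "rtmp://a", lambda p, e: log.append((p, e["type"])))
room.join("D", "rtmp://d", lambda p, e: log.append((p, e["type"])))
print(log)   # [('A', 'member_join')]
```

In practice `notify` would push the event down a WebSocket or IM channel; here it is just a callback, but the information flow is the same.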

With room management and the notification system added, developers can combine them with WebSocket and the basic capabilities of WeChat Mini Programs to build all kinds of powerful, logically complex applications.

Along the way

Looking back, our work on the Mini Program audio/video technology stack can be outlined by the following technical map:

  • First, we simplified all audio/video solutions into two basic behaviors, uplink and downlink, and realized the most basic online live streaming through a simple combination of the two tags.
  • Then, with acceleration links and delay control, we reduced the audio/video delay to under 500 ms.
  • Next, we introduced acoustic processing modules such as noise suppression and echo cancellation, turning two one-way links into the simplest video call capability.
  • Finally, we added room service and status-synchronization notifications, expanding two-way audio/video into multi-way audio/video.


