This article is compiled from the talk given by Huo Kai (audio and video client architect at Qiniu Cloud) at "ECUG Meetup #1 | 2021 Audio and Video Technology Best Practices, Hangzhou" on June 26, 2021.

To get the lecturer's full slides, please add the ECUG assistant on WeChat (WeChat ID: ECUGCON) with the note "ECUG PPT". The talks from the other lecturers will also be published later, so stay tuned.


The following is the text of the talk:

Hello, everyone. Let me introduce myself first. My name is Huo Kai, and I am an audio and video client architect at Qiniu Cloud. I joined Qiniu Cloud in 2017 and have led and participated in the development of the short video, streaming, player and other audio and video related SDKs. My current work focuses mainly on the design and development of the real-time audio and video SDK, and most of my energy is still on the client side.

Here’s what I want to talk about today:

1) Technologies and challenges of the real-time audio and video SDK
2) Architecture of the Qiniu Cloud real-time audio and video SDK
3) Practical experience with the real-time audio and video SDK

In short, I will share the problems we often encounter while developing the real-time audio and video SDK and how we optimize for them; you can think of it as some of our practical experience.


I. Technologies and challenges of the real-time audio and video SDK

Today is a special session on audio and video. I believe all of you here have some knowledge of or experience with audio and video technology, but I don't know how many of you have ever built or designed an SDK. It actually has its own unique challenges.

1. Customers’ demands for real-time audio and video SDK

As a company providing technical services, Qiniu Cloud hears a lot of feedback from customers. For example, customers often ask what video bit rate should be configured, why a certain interface has no effect after being called, or how to integrate third-party beauty effects.

In addition, some customers run into abnormal phenomena during use, such as corrupted (garbled) frames, black screens, failure to join a room, and so on.

Customers also bring us feature requests; for example, some want to publish the camera feed and the phone screen at the same time, or want access to the video or audio frame data of each track.

After encountering these problems, we solved them one by one, following the steps of identifying, analyzing and solving them. First, let's take a moment to analyze the demands behind these questions. In short, we have summed them up in three points:

1) Users' levels of audio and video expertise vary widely, but without exception they all want to integrate quickly.

We cannot require a certain level of audio and video expertise in order to use the SDK, so we have to demand of ourselves that the SDK be designed to be extremely easy to use, so that users can integrate it as quickly as possible.

2) Users’ usage scenarios and environments are complex and changeable, but they have high requirements for audio and video experience.

For example, video conferencing scenarios have strict latency requirements: beyond 300 milliseconds, the delay in the other person's speech becomes clearly noticeable. In live streaming, on the other hand, users have high requirements for clarity: an anchor may face a large audience, and an unclear picture of the anchor has a big impact on the viewers. As another example, within the same live streaming scenario, outdoor and indoor broadcasts are two completely different environments, yet we still have to give users the best audio and video experience possible under the conditions at hand.

3) Integrating users need a "sense of security", with requirements on service stability, monitoring capabilities and troubleshooting ability.

Users need real visibility into how their online users are doing, including real-time audio and video quality and any errors or exceptions. In addition, once a problem occurs, whether it is caused by the SDK or by the way the user calls it, they should be able to troubleshoot quickly and minimize the impact.

2. Core requirements of the real-time audio and video SDK

Based on these customer demands, we can define what an excellent real-time audio and video SDK should look like:

1) Simple interfaces and clear boundaries.

There must be no ambiguity when users call our API; the interfaces must be simple and clear so that users can call them easily.

2) Strong extensibility and a good ecosystem.

On top of the SDK, it should be easy to add functions such as advanced beauty, face review and speech-to-text. Our approach is to build different plug-ins for different scenarios on the upper layer of the SDK, making it as convenient as possible for users to extend it.

3) Abstract and optimize by scenario, and extend functionality accordingly.

As mentioned earlier, the audio and video quality requirements of video conferencing and live streaming scenarios are different, so we should optimize for each scenario and provide core functionality that covers that scenario as fully as possible.

4) First-frame time, stuttering, latency, echo and other aspects of the audio and video experience are optimized to the extreme.

These are the core metrics that affect the audio and video experience, so we can never do too much in terms of QoS optimization.

5) Stable and reliable service, rich data instrumentation, and visual data monitoring and analysis.

Reliability is the prerequisite of a technical service, and data is the prerequisite for ensuring reliability.

3. Technical difficulties of the real-time audio and video SDK

Having said all this, you can already sense that it is very difficult to build a real-time audio and video SDK well. More specifically, what are its technical difficulties? Let me list a few:

They include audio and video capture, encoding and transmission. There is also video processing and audio processing, such as beauty filters and audio mixing, as well as weak-network optimization, data reporting, crash analysis, the audio 3A algorithms, and so on.

In addition, there are compatibility adaptation, performance optimization and other aspects. These are all technical difficulties we need to face and solve. I will briefly introduce some of our experience later, in the third part on practice.


II. Architecture of the Qiniu Cloud real-time audio and video SDK

Next, I would like to introduce how the Qiniu Cloud real-time audio and video SDK is designed. First, let's take a look at its iteration history.

In 2018 we launched version 1.0, which supported the core audio and video communication functions. We then found that more and more customers wanted to publish more than one stream in a room, for example screen content alongside the camera capture, so version 2.0 added multi-track support, i.e. support for publishing multiple streams per user. With version 3.0 we wanted to make sure that every user gets the best possible audio and video experience under their current network conditions, so we implemented the large/small stream (simulcast) strategy.

Besides the normal iteration of the SDK, we also released some supporting solutions, such as a video conferencing solution, an interactive live streaming solution and an online interview solution. We also provide some plug-ins, such as the beauty plug-in, which makes it easy for customers to integrate a beauty SDK, and the soon-to-be-launched whiteboard plug-in, which is convenient for users in education scenarios.

1. Module division of the Qiniu Cloud real-time audio and video SDK

This is the module division of the Qiniu Cloud real-time audio and video SDK:

At the bottom layer, we rely on some external libraries, such as WebRTC for SDK communication, WebSocket for signaling transmission, and our own HappyDNS, QNBeautyLib, QNEncoderLib, QNAECLib and so on.

On top of that foundation, we have encapsulated a number of business modules, including:

  • CameraManager, responsible for camera capture, rotation, exposure, focus and other functions.
  • MicrophoneManager, responsible for microphone capture.
  • RenderProcesser, responsible for video processing, such as watermarking, cropping, beauty, filters and other functions.
  • AudioProcesser, responsible for audio processing, such as audio resampling, mixing and other functions.
  • RoomManager, responsible for core functions such as joining rooms, publishing and subscribing.
  • CrashReporter, responsible for reporting stack information promptly when a crash occurs.

At the next level up, we provide the core API, plus advanced beauty, whiteboard and other plug-ins.

At the top is the user’s own business layer.
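To make the layering concrete, here is a minimal, hypothetical sketch of how a core-API facade might delegate to the business modules listed above. The module names mirror the list; every class and method here is an illustrative stub, not the actual Qiniu Cloud SDK code.

```java
// A minimal, hypothetical sketch of the layering described above: the core API
// (here a made-up QNRTCClient facade) delegates to the business modules, which
// in turn wrap the external libraries. Methods are illustrative stubs only.
public class LayeringSketch {

    // Stub business modules (the real ones wrap WebRTC, WebSocket, QNBeautyLib, ...).
    static class CameraManager     { void startCapture() { /* camera capture, rotation, exposure, focus */ } }
    static class MicrophoneManager { void startCapture() { /* microphone capture */ } }
    static class RoomManager       { void join(String token) { /* signaling: join, publish, subscribe */ } }
    static class CrashReporter     { void install() { /* report stack traces on crash */ } }

    // Hypothetical core-API facade sitting on top of the business modules.
    static class QNRTCClient {
        private final CameraManager camera = new CameraManager();
        private final MicrophoneManager mic = new MicrophoneManager();
        private final RoomManager room = new RoomManager();
        private final CrashReporter crash = new CrashReporter();

        void joinRoom(String roomToken) {
            crash.install();
            camera.startCapture();
            mic.startCapture();
            room.join(roomToken);
        }
    }

    public static void main(String[] args) {
        // The user's business layer only touches the simple facade.
        new QNRTCClient().joinRoom("<room-token>");
    }
}
```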

2. Data flow of the Qiniu Cloud real-time audio and video SDK

Next, let me briefly introduce how data flows through the Qiniu Cloud real-time audio and video SDK.

1) Data collection: video data is captured from the camera or the screen, and audio data from the microphone. The captured video data includes YUV and texture data; the audio data is PCM.
2) Data processing: after capture, the data is sent to the audio and video processing modules. Video can be processed with beauty effects, watermarks and mirroring; audio can be resampled and mixed.
3) Encoding: the processed data is sent to the video encoder and the audio encoder; either software or hardware encoding can be chosen. The encoded output is H.264 and Opus packets.
4) Encapsulation: the encoded data is sent to the audio and video packaging module for encapsulation.
5) Upload: the encapsulated packets are transmitted to our streaming media server over RTP.
6) Forwarding: the streaming media server forwards the data to the subscribers in the room.

The subscriber's process is simply the publisher's in reverse: decapsulation, decoding, audio and video processing, and finally rendering. A simplified sketch of the publish-side pipeline follows.
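Below is a simplified sketch of that publish-side flow, assuming made-up stage names and types (capture, process, encode, packetize, send); it illustrates the data path only, not the real SDK's classes or threading.

```java
// A simplified, hypothetical illustration of the publish-side data flow:
// capture -> process -> encode -> packetize -> send. Stage names and types
// are assumptions for illustration, not the real SDK implementation.
import java.util.List;

public class PublishPipelineSketch {

    record RawVideoFrame(byte[] yuv, long timestampUs) {}        // from camera/screen capture
    record EncodedPacket(byte[] h264OrOpus, long timestampUs) {} // encoder output
    record RtpPacket(byte[] payload) {}                          // packaged for RTP transport

    interface VideoProcessor { RawVideoFrame process(RawVideoFrame frame); }   // beauty, watermark, mirror
    interface VideoEncoder   { EncodedPacket encode(RawVideoFrame frame); }    // soft or hard H.264 encoder
    interface RtpPacketizer  { List<RtpPacket> packetize(EncodedPacket pkt); } // encapsulation
    interface Transport      { void send(RtpPacket pkt); }                     // to the streaming media server

    static void onFrameCaptured(RawVideoFrame frame,
                                VideoProcessor processor,
                                VideoEncoder encoder,
                                RtpPacketizer packetizer,
                                Transport transport) {
        RawVideoFrame processed = processor.process(frame);      // 2) data processing
        EncodedPacket encoded   = encoder.encode(processed);     // 3) encoding
        for (RtpPacket pkt : packetizer.packetize(encoded)) {    // 4) encapsulation
            transport.send(pkt);                                 // 5) upload over RTP
        }
        // 6) The streaming media server then forwards packets to subscribers,
        //    who run the reverse path: depacketize -> decode -> process -> render.
    }
}
```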


III. Practical experience with the real-time audio and video SDK

Here are some of our hands-on experiences with the real-time audio and video SDK.

1. Extensible beauty plug-in

First, let me introduce our extensible beauty plug-in. Many users find it difficult to combine a beauty SDK with the audio and video pipeline, because preprocessing video frames with OpenGL before handing them to the beauty SDK is hard. So we built a plug-in that sits between the beauty SDK and the RTC SDK, as shown below:

First, the RTC SDK captures frames from the camera and hands the captured data to the beauty plug-in layer. What does the beauty plug-in do? It loads the beauty effect resources, converts OES textures to 2D textures, and rotates the camera-captured texture to the upright orientation. The texture data processed by the plug-in, now matching the beauty SDK's input specification, is fed into the beauty SDK, which applies beauty, makeup, filter, sticker and other effects. After processing, the texture is returned to the RTC SDK, and finally previewed, encoded and transmitted.

Those are the details of our internal implementation. The internal processing is relatively complex, but externally it is very simple: we only expose the simplest interfaces, such as setBeauty, setSticker and setFilter, to reduce the integration cost for users.
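As a rough sketch of this flow, assume the RTC SDK exposes a per-frame texture callback and the beauty SDK accepts an upright 2D texture; the interface and method names below are hypothetical, and the OpenGL conversion pass is only indicated by a placeholder.

```java
// Hypothetical sketch of the beauty plug-in sitting between the RTC SDK and a
// beauty SDK. The interfaces below are assumptions for illustration only.
public class BeautyPluginSketch {

    /** Callback shape an RTC SDK might expose for per-frame GPU processing. */
    interface VideoFrameFilter {
        // Receives the camera OES texture, returns the texture ID to preview/encode.
        int onDrawFrame(int oesTextureId, int width, int height, float[] texMatrix);
    }

    /** Minimal view of what a third-party beauty SDK typically offers. */
    interface BeautySdk {
        void loadResources();                          // beauty/makeup/sticker/filter assets
        int render2DTexture(int texId, int w, int h);  // returns the processed 2D texture
    }

    static class BeautyFilter implements VideoFrameFilter {
        private final BeautySdk beautySdk;
        BeautyFilter(BeautySdk sdk) { this.beautySdk = sdk; sdk.loadResources(); }

        @Override
        public int onDrawFrame(int oesTextureId, int width, int height, float[] texMatrix) {
            // 1) Convert the OES texture to a normal 2D texture and rotate it upright,
            //    so it matches the beauty SDK's input specification (GL code omitted).
            int upright2dTex = convertOesTo2dAndRotate(oesTextureId, width, height, texMatrix);
            // 2) Let the beauty SDK apply beauty / makeup / filter / sticker effects.
            int processedTex = beautySdk.render2DTexture(upright2dTex, width, height);
            // 3) Return the processed texture; the RTC SDK previews, encodes and sends it.
            return processedTex;
        }

        private int convertOesTo2dAndRotate(int oesTex, int w, int h, float[] texMatrix) {
            // Placeholder for an OpenGL ES pass (FBO + samplerExternalOES shader).
            return oesTex;
        }
    }
}
```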

2. Large/small stream strategy

Next, let me introduce the large/small stream (simulcast) strategy.

Why have a large/small stream strategy? We want the audio and video experience to be as good as it can be in each user's network environment. For example, user A publishes a video, which is encoded and then sent to the streaming media server as three streams: high, medium and low resolution.

The streaming media server forwards the streams according to the network bandwidth of users B, C and D. For example, user B has a good network, so he can directly subscribe to the highest resolution; user D is on a weak network, so he subscribes to the lowest resolution. If user B's network changes, say he subscribes to the high resolution at first and his network then deteriorates, he automatically falls back to a lower resolution, so playback stays smooth without stuttering.
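A small sketch of the selection side of this strategy, assuming three example layers and a simple "highest layer that fits the estimated bandwidth" rule; the bit rates and the rule are illustrative, not Qiniu Cloud's actual forwarding logic.

```java
// Hypothetical sketch of the large/small stream strategy: choose the best
// simulcast layer that fits a subscriber's estimated downlink bandwidth.
// Layer bit rates and the selection rule are illustrative assumptions.
import java.util.List;

public class SimulcastSketch {

    record Layer(String name, int width, int height, int bitrateKbps) {}

    // The publisher encodes the same video at three resolutions (example values).
    static final List<Layer> LAYERS = List.of(
            new Layer("high",   1280, 720, 1800),
            new Layer("medium",  640, 360,  800),
            new Layer("low",     320, 180,  300));

    /** Pick the highest layer whose bit rate fits the estimated bandwidth. */
    static Layer selectLayer(int estimatedDownlinkKbps) {
        for (Layer layer : LAYERS) {                  // ordered high -> low
            if (layer.bitrateKbps() <= estimatedDownlinkKbps) {
                return layer;
            }
        }
        return LAYERS.get(LAYERS.size() - 1);         // weak network: fall back to "low"
    }

    public static void main(String[] args) {
        System.out.println(selectLayer(2500).name()); // user B, good network -> high
        System.out.println(selectLayer(400).name());  // user D, weak network -> low
    }
}
```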

3. QoS optimization

The importance of QoS optimization to the overall audio and video experience cannot be overemphasized. Building on WebRTC, we mainly optimize the following aspects:

1) Bandwidth estimation: we mainly use GCC, BBR and other algorithms for bandwidth estimation.
2) Packet-loss resistance: we mainly optimize the intelligent combination of data redundancy (FEC) and packet-loss retransmission (NACK).
3) Jitter smoothing: this includes optimizing NetEQ and the jitter buffer.
4) Bandwidth allocation: this includes policies such as audio priority, video layering and upstream/downstream allocation; allocation policies can also be tailored to the user's scenario and business (see the allocation sketch below).
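As a worked example of the audio-priority idea in point 4, the sketch below splits an estimated uplink budget: audio is funded first, then video layers from low to high with whatever remains. The numbers and the rule are illustrative assumptions, not the SDK's actual policy.

```java
// Hypothetical sketch of audio-priority bandwidth allocation: reserve the audio
// bit rate first, then fund video layers from low to high with what remains.
// All numbers and the allocation rule are illustrative assumptions.
import java.util.LinkedHashMap;
import java.util.Map;

public class BandwidthAllocationSketch {

    static Map<String, Integer> allocate(int estimatedUplinkKbps) {
        Map<String, Integer> plan = new LinkedHashMap<>();
        int remaining = estimatedUplinkKbps;

        int audio = Math.min(64, remaining);          // audio first (e.g. Opus ~64 kbps)
        plan.put("audio", audio);
        remaining -= audio;

        // Video layers funded in order; a layer is enabled only if fully affordable.
        int[] layerKbps = {300, 800, 1800};           // low, medium, high
        String[] layerNames = {"video-low", "video-medium", "video-high"};
        for (int i = 0; i < layerKbps.length; i++) {
            if (remaining >= layerKbps[i]) {
                plan.put(layerNames[i], layerKbps[i]);
                remaining -= layerKbps[i];
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(allocate(1200)); // {audio=64, video-low=300, video-medium=800}
    }
}
```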

4. Echo cancellation optimization

I’m going to show you how we optimized the echo cancellation in two scenarios.

The two spectrograms above, on the left and right, represent two scenarios, each with three rows of data:

  • First row: spectrogram of the original sound.
  • Second row: the signal after filtering with WebRTC's built-in echo cancellation.
  • Third row: the signal after filtering with Qiniu Cloud's self-developed echo cancellation.

The scene on the left is a person talking with very loud music playing in the background, and we want to remove that music. As you can see in the middle row, there is still a very clear continuous line at the bottom, which is the residual trace of the music. In the third row on the left, it is clear that the music has been eliminated without residue, quite completely.

The picture on the right is a voice call, mainly in double-talk mode: user A is talking, and suddenly user B starts talking, and we do not want user A's voice to be erased. However, WebRTC's echo cancellation algorithm filters out some of A's speech; from the comparison in the third row of the right picture, we can see that our self-developed echo cancellation algorithm suffers far less from this "word-eating" phenomenon.

5. Compatibility optimization

Compatibility is one of our biggest headaches. For example, the issues users reported on the first slide, such as corrupted frames or echo on a certain device, can largely be attributed to compatibility problems. Generally speaking, we divide compatibility optimization into two strategies:

The first strategy is dynamic switching: at runtime, the encoder can be switched dynamically.

For example, with the encoding configuration we set, opening the hardware encoder on a certain phone may fail with an exception. In that case we first catch the exception and then switch to the software encoder without the user noticing, so that the feature keeps working normally.

As another example, when a user is on the software encoder and the encoding efficiency on a certain phone is extremely low, with FPS far below our expectations, we open the hardware encoder to compare the efficiency of the two, and then dynamically switch to the more suitable encoder without the user noticing.
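A minimal sketch of the dynamic switching strategy, assuming hypothetical HardwareVideoEncoder and SoftwareVideoEncoder wrappers: if the hardware encoder fails to open we fall back to software, and at runtime an abnormally low measured FPS triggers a switch. On Android the hardware path would typically be MediaCodec, but the wrapper names and the threshold here are assumptions.

```java
// Hypothetical sketch of dynamic encoder switching: try the hardware encoder,
// catch failures or detect abnormally low FPS, and fall back to software
// encoding transparently. Encoder wrappers and thresholds are assumptions.
public class EncoderFallbackSketch {

    interface VideoEncoder {
        void open(int width, int height, int bitrateKbps) throws Exception;
        double measuredFps();   // smoothed output frame rate
    }

    static class HardwareVideoEncoder implements VideoEncoder {
        public void open(int w, int h, int kbps) throws Exception { /* e.g. MediaCodec on Android */ }
        public double measuredFps() { return 0; }
    }

    static class SoftwareVideoEncoder implements VideoEncoder {
        public void open(int w, int h, int kbps) throws Exception { /* e.g. a software H.264 encoder */ }
        public double measuredFps() { return 0; }
    }

    /** Prefer hardware; switch to software if it fails to open. */
    static VideoEncoder openEncoder(int w, int h, int kbps) {
        VideoEncoder hw = new HardwareVideoEncoder();
        try {
            hw.open(w, h, kbps);
            return hw;
        } catch (Exception e) {
            // Caught without surfacing to the user; report it and fall back.
            VideoEncoder sw = new SoftwareVideoEncoder();
            try { sw.open(w, h, kbps); } catch (Exception ignored) { }
            return sw;
        }
    }

    /** Runtime check: if the current encoder is far below the target FPS, try the other one. */
    static boolean shouldSwitch(VideoEncoder current, double targetFps) {
        return current.measuredFps() < targetFps * 0.5;   // e.g. below half the target
    }
}
```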

The second strategy is the whitelist strategy. When the RTC SDK is initialized, it asks the server for the configuration whitelist, which is delivered automatically based on the detected device.

For example, on a certain Android phone, hardware encoding produces corrupted frames if the resolution is not a multiple of 16. We put such devices on our configuration whitelist, and when we detect them we adjust the resolution to a multiple of 16, or switch that phone from the hardware encoder to the software encoder, to solve the problem.

As another example, we found that at a 48 kHz sampling rate the echo cancellation on some phones is very poor. So after detecting such a device, we can lower the sampling rate from 48 kHz to 16 kHz, or switch to our self-developed echo cancellation, to solve the echo compatibility problem.
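A minimal sketch of applying such a whitelist on the client, assuming the server has already delivered an entry for the detected device model. The entry fields mirror the adjustments described above (align the resolution to a multiple of 16, lower the sampling rate, force the software encoder or the self-developed AEC); everything else, including the names, is hypothetical.

```java
// Hypothetical sketch of applying a server-delivered compatibility whitelist:
// align hard-encoder resolution to a multiple of 16, lower the audio sample
// rate, or force software encoding / self-developed AEC for listed devices.
import java.util.Map;

public class DeviceWhitelistSketch {

    record WhitelistEntry(boolean alignResolutionTo16,
                          boolean forceSoftwareEncoder,
                          Integer maxAudioSampleRateHz,
                          boolean useCustomAec) {}

    // In reality this map would be fetched from the server at SDK init time.
    static final Map<String, WhitelistEntry> WHITELIST = Map.of(
            "VendorA ModelX", new WhitelistEntry(true,  false, null,  false),
            "VendorB ModelY", new WhitelistEntry(false, false, 16000, true));

    record MediaConfig(int width, int height, int audioSampleRateHz,
                       boolean softwareEncoder, boolean customAec) {}

    static MediaConfig applyWhitelist(String deviceModel, MediaConfig cfg) {
        WhitelistEntry e = WHITELIST.get(deviceModel);
        if (e == null) return cfg;                              // not listed: keep defaults

        int w = cfg.width(), h = cfg.height();
        if (e.alignResolutionTo16()) {                          // avoid corrupted frames
            w = w / 16 * 16;
            h = h / 16 * 16;
        }
        int sampleRate = cfg.audioSampleRateHz();
        if (e.maxAudioSampleRateHz() != null && sampleRate > e.maxAudioSampleRateHz()) {
            sampleRate = e.maxAudioSampleRateHz();              // e.g. 48 kHz -> 16 kHz
        }
        return new MediaConfig(w, h, sampleRate,
                cfg.softwareEncoder() || e.forceSoftwareEncoder(),
                cfg.customAec() || e.useCustomAec());
    }

    public static void main(String[] args) {
        MediaConfig cfg = new MediaConfig(540, 960, 48000, false, false);
        System.out.println(applyWhitelist("VendorA ModelX", cfg)); // 528x960, rest unchanged
    }
}
```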

6. Data collection and reporting

For the data collection and reporting module, we put forward three requirements:

1) Real-time reporting. In an RTC SDK, data cannot simply be reported after the whole session ends; some data needs to be reported in real time.

2) Action restoration. We need to collect SDK logs so that we can faithfully reconstruct how the user called the SDK. For example, on the first slide, customers asked why calling an interface had no effect or why joining a room failed. By reconstructing the calls from the reported logs, we often find that the interfaces were called in the wrong order or that the parameters passed by the user were wrong, which lets us quickly help the customer troubleshoot.

3) Module isolation. The data collection and reporting module must not affect the normal audio and video communication module, i.e. the main service module. Even if exceptions occur inside it, they should be caught; users must never be unable to use the service because of problems in collection or reporting (a minimal isolation sketch follows below).
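A minimal sketch of the isolation requirement, assuming a hypothetical reporter: events are handed to a dedicated worker thread, and every exception inside the reporting path is swallowed, so a reporting failure can never break the audio and video path.

```java
// Hypothetical sketch of an isolated, real-time reporter: events are handed off
// to a dedicated worker thread, and any exception in the reporting path is
// caught so it can never affect the main audio/video modules.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IsolatedReporterSketch {

    private final ExecutorService worker =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "qos-reporter"); // separate from media threads
                t.setDaemon(true);
                return t;
            });

    /** Called from the SDK; must return quickly and must never throw. */
    public void report(String eventJson) {
        try {
            worker.execute(() -> upload(eventJson));
        } catch (Throwable t) {
            // Swallow everything: reporting problems must not break the call.
        }
    }

    private void upload(String eventJson) {
        try {
            // A real implementation would batch events and POST them to the
            // reporting service; omitted here.
        } catch (Throwable t) {
            // Also swallowed; optionally count failures for later diagnostics.
        }
    }
}
```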

What do we collect? First of all, we will not collect users’ private data, which is absolutely not allowed. Broadly speaking, it consists of two parts: SDK basic information and audio and video quality information.

  • Basic SDK information: SDK call logs, error and crash information, device model information, and SDK internal state information.
  • Audio and video quality information: includes the first frame time, real-time frame rate, real-time bit rate, and real-time packet loss rate.

7. Data monitoring and analysis

Finally, we need to monitor and analyze the collected data visually.

This is an example of the action restoration I mentioned earlier: we can clearly see which interfaces the user calls, in what order, and what the internal state is. From initialization to joining the room, publishing, subscribing, and finally leaving the room, we can see it all in the backend. If the user is calling the SDK incorrectly, we can detect it immediately.

This graph shows the real-time bit rate of a stream.

Because the data is reported in real time, we can see the current bit rate, frame rate, packet loss rate, RTT and a series of other core indicators that affect the audio and video experience. We can set thresholds on these indicators so that when the curves change significantly, an alarm is triggered.
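As a small illustration of this threshold-based alerting, the sketch below checks a handful of reported indicators against configurable limits; the metric fields and threshold values are assumptions for illustration.

```java
// Hypothetical sketch of threshold-based alerting on real-time quality metrics.
// Metric fields and threshold values are illustrative assumptions.
import java.util.ArrayList;
import java.util.List;

public class QualityAlarmSketch {

    record QualitySample(int bitrateKbps, double fps, double packetLossPct, int rttMs) {}

    record Thresholds(int minBitrateKbps, double minFps, double maxPacketLossPct, int maxRttMs) {}

    static List<String> check(QualitySample s, Thresholds t) {
        List<String> alarms = new ArrayList<>();
        if (s.bitrateKbps() < t.minBitrateKbps())     alarms.add("bitrate below " + t.minBitrateKbps() + " kbps");
        if (s.fps() < t.minFps())                     alarms.add("frame rate below " + t.minFps());
        if (s.packetLossPct() > t.maxPacketLossPct()) alarms.add("packet loss above " + t.maxPacketLossPct() + "%");
        if (s.rttMs() > t.maxRttMs())                 alarms.add("RTT above " + t.maxRttMs() + " ms");
        return alarms;   // a non-empty list would trigger the alarm mechanism
    }

    public static void main(String[] args) {
        Thresholds t = new Thresholds(200, 10.0, 10.0, 400);
        System.out.println(check(new QualitySample(150, 24.0, 12.5, 320), t));
        // -> [bitrate below 200 kbps, packet loss above 10.0%]
    }
}
```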


Q & A

Question: Today's topic is audio and video, but what we usually deal with more is face recognition with cameras. Can that be done with this SDK?

Huo Kai: We will later provide, on top of the real-time audio and video SDK, a set of plug-ins that includes face recognition, liveness detection, speech-to-text and other functions. Behind these plug-ins sits our intelligent multimedia service, which makes it easy for users to implement face recognition and related functions during audio and video calls.

Question: Hardware encoding capability differs across phone models, and each manufacturer's underlying technology also differs. You just mentioned some compatibility problems, but in practice hardware encoding depends on the manufacturer's encoder capability, for example in how well it maintains the bit rate during encoding. So does Qiniu Cloud recommend hardware encoding first, or software encoding first?

Huo Kai: We recommend hardware encoding first, because it is more efficient. But there is a precondition: when the hardware encoder has problems, we need to sense it immediately and, through the dynamic switching strategy, automatically switch to the software encoder.


About Qiniu Cloud, ECUG and ECUG Meetup

Qiniu Cloud: Founded in 2011, Qiniu Cloud is a well-known domestic provider of cloud computing and data services. It continues to invest deeply in core technologies for massive file storage, CDN content distribution, live streaming and video on demand, interactive streaming, and large-scale intelligent analysis and processing of heterogeneous data, and is committed to driving the digital future with data technology, enabling all industries to fully enter the data era.

ECUG: Effective Cloud User Group. Founded in 2007 as CN Erlounge II and initiated by Xu Shiwei, ECUG is an indispensable high-end, cutting-edge group in the technology field. As a window onto technological progress in the industry, ECUG brings together many technical people, focuses on current hot technologies and frontier practices, and jointly leads technological change in the industry.

ECUG Meetup: A series of technology-sharing events jointly organized by ECUG and Qiniu Cloud, offered as offline gatherings for developers and technology practitioners. The goal is to build a high-quality learning and networking platform for developers, where participants create, build and influence knowledge together, generating new knowledge that promotes cognitive development and technological progress, advances the industry as a whole, and creates a better exchange platform and development space for developers and technical practitioners.