During a video or audio call, audio compression is needed to reduce the bit rate of the original sound data, and audio processing is needed to improve sound quality. How should these two aspects be handled so that the transmitted sound stays faithful to the original? Drawing on NetEase Yunxin's hands-on experience with audio and video technology, this article discusses audio processing and compression techniques.

Recommended reading

Video Private Cloud: Building an On-demand Private Cloud Platform Based on Docker

Behind High-definition Sound Quality: A Technical Look at the NetEase Yunxin Music Teaching Solution


Audio processing mainly covers noise suppression, automatic gain control, echo cancellation, silence (voice activity) detection, and comfort noise generation, and is applied chiefly to video and audio calls. Audio compression covers the various audio coding standards: the telecom audio compression standards developed by the ITU (the G.7xx series) and the Internet audio compression standards developed by companies such as Microsoft, Google, Apple, and Dolby (iLBC, SILK, Opus, AAC, AC3, etc.).

Basic Concepts of Audio

Here’s what you need to know before going any further into audio processing and compression:

(1) Pitch: generally refers to the frequency content of a sound; the human ear perceives it as how low (bass) or high (treble) the sound is.

(2) Loudness: the perceived strength of a sound.

(3) Sampling rate: how often the analog signal is measured when it is converted to a digital signal. The higher the sampling rate, the more sound information is retained.

(4) Sampling precision: the number of bits used to represent each sample when the analog signal is converted to a digital signal; 16 bits (two bytes) per sample is typical.

(5) Number of channels: the number of sound channels, such as mono, stereo, or 5.1 surround.

(6) Audio frame length: the duration of the audio segment that one processing or compression step operates on, commonly 10 ms, 20 ms, or 30 ms.
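The concepts above fix the size of a raw PCM audio frame. As a minimal sketch (the function name `frame_size_bytes` is ours, not from any library):

```python
def frame_size_bytes(sample_rate_hz, bits_per_sample, channels, frame_ms):
    """Bytes needed to store one frame of uncompressed PCM audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * (bits_per_sample // 8) * channels

# 16 kHz, 16-bit, mono, 10 ms frame -> 160 samples * 2 bytes = 320 bytes
print(frame_size_bytes(16000, 16, 1, 10))
```

The same arithmetic shows why compression matters: a 48 kHz stereo 16-bit stream is 192 KB of raw data per second before any coding.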

Fundamentals of Audio Processing

1. Noise Suppression

The raw sound captured by phones and other devices often contains background noise, which degrades the listener's subjective experience and reduces the efficiency of audio compression. Taking Google's well-known open-source framework WebRTC as an example, we tested its noise suppression algorithm rigorously and found that it suppresses both white noise and colored noise well, meeting the requirements of video and voice calls.

Other common noise suppression algorithms, such as the one included in the open-source project Speex, also perform well. The Speex algorithm is more broadly applicable than WebRTC's and can run at any sampling rate.
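Neither WebRTC's nor Speex's suppressor is reproduced here, but the classic idea underneath many of them, spectral subtraction, can be sketched in a few lines: estimate the noise magnitude spectrum from a silent stretch, then subtract it from each frame's spectrum while keeping a small floor. The function name and parameters below are our own illustration, not either project's API:

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, floor=0.05):
    """Subtract a noise magnitude estimate from one frame's spectrum,
    keeping a small spectral floor to avoid 'musical noise' artifacts."""
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    phase = np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))

# Noise estimate taken from a known-noise-only stretch (assumption for the demo)
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(160)
noise_mag = np.abs(np.fft.rfft(noise))
noisy = np.sin(2 * np.pi * 440 * np.arange(160) / 16000) + noise
cleaned = spectral_subtraction(noisy, noise_mag)
```

Production suppressors (WebRTC, Speex) add per-bin noise tracking and smoothing over time, but the subtract-and-floor step is the core.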

2. Acoustic Echo Cancellation

During a video or audio call, the local voice is transmitted to the peer and played back there, where it is picked up again by the peer's microphone and sent back along with the peer's own voice. As a result, the audio played at the local end contains the voice originally captured locally, and the local user hears an echo of themselves.

The principle of echo generation is shown in the figure below:



Taking WebRTC as an example, its echo suppression module recommends the AECM algorithm for mobile devices because of its lower computational cost. The figure below shows the algorithm's processing flow. Interested readers can study the AECM source code; it is not covered further here.
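To make the idea of adaptive echo cancellation concrete, here is a simplified time-domain NLMS (normalized least mean squares) sketch: an adaptive filter learns the echo path from the far-end signal and subtracts its echo estimate from the microphone signal. This is our own illustration; WebRTC's AECM actually operates in the frequency domain, and the names below are not from its codebase:

```python
import numpy as np

def nlms_echo_cancel(far, mic, taps=32, mu=0.5, eps=1e-6):
    """Time-domain NLMS adaptive filter: estimate the far-end echo
    present in the mic signal and return the residual (echo removed)."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far[n - taps:n][::-1]        # most recent far-end samples
        e = mic[n] - w @ x               # residual after echo estimate
        w += mu * e * x / (x @ x + eps)  # normalized LMS weight update
        out[n] = e
    return out

rng = np.random.default_rng(1)
far = rng.standard_normal(4000)
echo_path = np.array([0.0, 0.5, 0.3, 0.1])    # assumed room impulse response
mic = np.convolve(far, echo_path)[:4000]      # mic picks up only the echo
residual = nlms_echo_cancel(far, mic)
```

After the filter converges, the residual energy drops far below the microphone signal's energy, which is exactly the "echo removed" effect the AEC module provides.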



3. Automatic Gain Control

The loudness of audio captured by phones and other devices is sometimes too high and sometimes too low, so the sound swings between loud and soft and hurts the listener's subjective experience. An automatic gain control (AGC) algorithm applies positive or negative gain to the input according to preconfigured parameters so that the output level suits the human ear.
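The adjustment just described can be sketched as a per-frame gain toward a target level, with the gain capped so near-silence is not amplified into noise. This is a minimal illustration with names of our own choosing, not WebRTC's AGC:

```python
import numpy as np

def simple_agc(frame, target_rms=0.1, max_gain=10.0):
    """Scale one frame toward a target RMS level, capping the gain
    so that near-silent input is not blown up into audible noise."""
    rms = np.sqrt(np.mean(frame ** 2))
    if rms < 1e-8:
        return frame                      # effectively silent: leave as-is
    gain = min(target_rms / rms, max_gain)
    return frame * gain

quiet = 0.02 * np.sin(2 * np.pi * 440 * np.arange(160) / 16000)
boosted = simple_agc(quiet)
```

A real AGC additionally smooths the gain across frames to avoid pumping artifacts; the per-frame version above only shows the core level calculation.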

Taking WebRTC as an example, the basic flow chart of its automatic gain control algorithm is shown below.



4. Voice Activity Detection

The basic principle of silence detection is as follows: compute the power of a segment of audio; if the power falls below a threshold, the segment is classified as silence, otherwise as active sound. Voice activity detection is widely used in audio coding, AGC, AECM, and elsewhere.

5. Comfort Noise Generation

The basic principle of comfort noise generation: noise is artificially constructed from the power spectral density of the background noise. It is widely used in audio codecs. The encoder estimates the noise's power spectral density and encodes the silent-period timing together with that density information; the decoder then reconstructs random white noise from the timing and power spectral density information.

One possible use: add random white noise during audio post-processing so that completely silent passages still feel like a natural, comfortable call rather than a dead line.
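As a minimal sketch of the decoder side, white noise can be generated to match the power level measured during the silent period (the function name and seeding are our own illustration):

```python
import numpy as np

def comfort_noise(power, n_samples, seed=None):
    """Generate white noise whose mean power matches the level
    measured during the silent period."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    noise *= np.sqrt(power / np.mean(noise ** 2))  # scale to target power
    return noise

cn = comfort_noise(1e-4, 160, seed=2)
```

A full CNG implementation shapes the noise spectrum to match the transmitted power spectral density rather than using flat white noise, but the power-matching step is the essential part.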

Fundamentals of Audio Coding

Another area where audio technology is widely applied is audio coding.

Take a look at some of the most widely used audio coding standards today, as shown in the figure below.



In the figure, the horizontal axis is the audio coding bit rate and the vertical axis is the audio frequency band. The figure tells us the following.

(1) Fixed-bit-rate coding standards such as G.711 and G.722 appear as single points in the figure. Others, such as Opus and Speex, are drawn as continuous curves, indicating that they are variable-bit-rate coding standards.

(2) In terms of frequency bands, G.711, G.722, AMR, and iLBC cover narrowband (8 kHz sampling rate) and wideband (16 kHz sampling rate) and suit general voice call scenarios. AAC and MP3 cover up to fullband (48 kHz sampling rate) and suit music scenarios. Opus covers the whole range of bands, supports the widest dynamic adjustment, and has the broadest applicability.

(3) iLBC, Speex, and Opus, designed for Internet transmission, are royalty-free and open source; MP3 and AAC, aimed at music scenarios, require licensing fees and are not open source.

As audio processing and compression technology keeps evolving, algorithms and techniques with better performance and wider applicability will continue to emerge. If you have good techniques or experience to share, feel free to leave a comment and discuss with us.

For more product and technical insights, follow the NetEase Yunxin blog.