In real-time audio interaction, sound quality and experience are shaped not only by the codec (covered in the previous article) but also, on the device side, by the noise suppression, echo cancellation, and automatic gain control modules. In this article we focus on echo cancellation and noise suppression, the technical challenges they face in real-time interactive scenarios, and our solutions and practices.

Optimizing the three major algorithm modules of echo cancellation

Echo cancellation has always been a core algorithm in voice communications. Generally speaking, its effectiveness is affected by many factors, including:

  • The acoustic environment, including reflections, reverberation, and so on;

  • The acoustic design of the communication device itself, including the cavity design and the nonlinear distortion of the hardware;

  • System performance: the processor's computing power and the operating system's ability to schedule threads.

When the Soundnet echo cancellation algorithm was first designed, performance, robustness, and universality were set as the ultimate optimization goals; all three are essential for a good audio and video SDK.

First of all, how does an echo occur? Essentially, the far-end voice comes out of your speaker, gets picked up by your microphone, and is sent back to the far end, where it is heard as an echo. To eliminate the echo, we design an algorithm that removes that played-back sound from the microphone signal.

So how does Acoustic Echo Cancellation (AEC) work? The specific steps are shown in the diagram below:

  • The first step is to find the delay between the reference (speaker) signal (blue curve) and the microphone signal (red curve), i.e. delay = T in the figure.

  • The second step is to estimate the linear echo component in the microphone signal from the reference signal and subtract it from the microphone signal to obtain the residual signal (black curve).

  • The third step is to suppress the remaining echo in the residual signal as completely as possible through nonlinear processing.

Corresponding to the above three steps, echo cancellation consists of three major algorithm modules:

  • Delay Estimation

  • Linear Adaptive Filter

  • Nonlinear Processing

“Delay estimation” determines the lower limit of AEC performance, “linear adaptive filtering” determines its upper limit, and “nonlinear processing” determines the final call experience, especially the balance between echo suppression and double-talk performance.

Note: Double talk refers to an interactive scenario in which two or more parties speak at the same time. One party’s voice may then be suppressed, causing intermittent audio. This happens when the echo cancellation algorithm “over-corrects” and removes parts of the signal that should not have been removed.

Next, let us look at the technical challenges and optimization ideas for each of these three algorithm modules.

1. Delay estimation

Due to how each system is implemented, there is a time delay between the reference signal and the microphone signal by the time their buffers reach the AEC module for processing, i.e., "delay = T" in the figure above. If the device producing the echo is a mobile phone, some of the sound coming from its speaker travels to the microphone through the inside of the device and possibly through the external environment. The delay therefore includes the length of the device’s playback buffer, the time it takes for sound to travel through the air, and the offset between when the playback thread and the capture thread start working. Because so many factors contribute, the delay varies from system to system, from device to device, and from one SDK or underlying implementation to another. It may stay fixed during a call, or it may change mid-call (the so-called overrun and underrun). This is also why an AEC algorithm may work well on device A but perform poorly on another device. Accurate delay estimation is a prerequisite for AEC to operate at all: a large estimation error sharply degrades AEC performance or makes it fail entirely, and the inability to track delay changes quickly is a major cause of occasional echo leakage.
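To make the composition of this delay concrete, here is a rough back-of-the-envelope sketch in Python; the function name, parameters, and the simple additive breakdown are our own illustration, not a measured model of any particular device:

```python
def total_echo_delay_ms(playback_buffer_ms: float, speaker_to_mic_m: float,
                        thread_offset_ms: float, speed_of_sound: float = 343.0) -> float:
    """Rough echo-delay budget: playback buffering + acoustic travel time + thread scheduling."""
    acoustic_ms = speaker_to_mic_m / speed_of_sound * 1000.0
    return playback_buffer_ms + acoustic_ms + thread_offset_ms

# e.g. a 40 ms playback buffer, a 10 cm speaker-to-mic path, and a 20 ms thread offset
print(total_echo_delay_ms(40.0, 0.1, 20.0))   # about 60 ms
```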

Enhancing the robustness of the delay estimation algorithm

Traditional algorithms usually determine the delay by computing the correlation between the reference signal and the microphone signal. The correlation can be computed in the frequency domain; a typical method is the Binary Spectrum. By checking whether the signal energy at each frequency bin exceeds a threshold, the reference signal and the microphone signal are each mapped into a two-dimensional 0/1 array, and the delay is found by shifting one array against the other. The latest WebRTC AEC3 algorithm instead runs multiple NLMS linear filters in parallel to find the delay; this achieves good detection speed and robustness, but the computational cost is very high. When the cross-correlation between the two signals is computed in the time domain, an obvious problem is that speech contains many harmonic components and is time-varying, so the correlation often exhibits multiple peaks, some of which do not correspond to the true delay, and the algorithm is easily disturbed by noise.
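As an illustration of the Binary Spectrum idea, here is a minimal sketch in Python (our own toy version, not the WebRTC implementation); the fixed energy threshold and the frame-level search are simplifying assumptions:

```python
import numpy as np

def binary_spectrum(frames: np.ndarray, threshold: float) -> np.ndarray:
    """Map each frame's per-bin energy to 0/1 (1 = energy above the threshold)."""
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    return (power > threshold).astype(np.uint8)

def estimate_delay_frames(ref_frames: np.ndarray, mic_frames: np.ndarray,
                          max_lag: int, threshold: float = 1e-4) -> int:
    """Return the frame offset at which the two binary spectra agree best."""
    ref_bits = binary_spectrum(ref_frames, threshold)
    mic_bits = binary_spectrum(mic_frames, threshold)
    best_lag, best_score = 0, -1.0
    for lag in range(max_lag):
        n = min(len(mic_bits) - lag, len(ref_bits))
        if n <= 0:
            break
        # the mic picks up the reference `lag` frames later, so compare shifted patterns
        score = np.mean(mic_bits[lag:lag + n] == ref_bits[:n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```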

The Soundnet delay estimation algorithm effectively suppresses these spurious local maxima by de-correlating the signals beforehand, which greatly enhances the robustness of the algorithm. In the comparison figure, the left plot shows the cross-correlation of the original signals and the right plot shows the cross-correlation after the Soundnet SDK's preprocessing; preprocessing the signals clearly makes the delay estimate far more robust.
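The article does not disclose the exact de-correlation method, but a classic example of the same idea is GCC-PHAT, which whitens the cross-spectrum so that only phase information remains and the true peak stands out; a minimal sketch:

```python
import numpy as np

def gcc_phat(mic: np.ndarray, ref: np.ndarray, fs: int, max_delay_s: float = 0.5) -> float:
    """Delay estimate via GCC-PHAT: whitening the cross-spectrum sharpens the true peak."""
    n = len(mic) + len(ref)
    X = np.fft.rfft(mic, n=n)
    Y = np.fft.rfft(ref, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_delay_s)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return lag / fs                           # positive lag: the mic lags the reference
```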

Adapting the algorithm to reduce computation

Generally, to reduce computation, delay estimation algorithms assume in advance that the echo appears in the lower frequency band, so the signals can be downsampled before being sent to the delay estimation module, reducing the computational complexity. However, across the tens of thousands of device models and audio routes on the market, this assumption often does not hold. The figure below shows the spectrogram of a Vivo X20 microphone signal in headphone mode: the echo is concentrated in the band above 4 kHz, and traditional algorithms fail on such cases, causing the echo cancellation module to break down. The Soundnet delay estimation algorithm searches the full band for the region where the echo actually appears and adaptively selects that region to compute the delay, ensuring accurate delay estimates on any device and audio route.
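One simple way to approximate this full-band search is to measure the coherence between the reference and microphone signals in each sub-band and run delay estimation on the most coherent one. The sketch below uses SciPy's magnitude-squared coherence and is only an illustration of the idea, not the Soundnet algorithm:

```python
import numpy as np
from scipy.signal import coherence

def pick_echo_band(mic: np.ndarray, ref: np.ndarray, fs: int,
                   n_bands: int = 8, nperseg: int = 512):
    """Pick the sub-band where mic and reference are most coherent (i.e., where the echo lives)."""
    f, coh = coherence(mic, ref, fs=fs, nperseg=nperseg)
    edges = np.linspace(0, len(f), n_bands + 1, dtype=int)
    band_coh = [coh[edges[i]:edges[i + 1]].mean() for i in range(n_bands)]
    best = int(np.argmax(band_coh))
    return f[edges[best]], f[edges[best + 1] - 1]   # (low, high) edge of the best band in Hz
```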

Dynamically updating the audio algorithm database to improve device coverage

To ensure continuous, iterative improvement of the algorithm, Soundnet maintains an audio algorithm database. We used a large number of different test devices to collect various combinations of reference and microphone signals in different acoustic environments, and the delays between them were all calibrated offline. In addition to real recordings, the database also contains a large amount of simulated data covering different speakers, different reverberation intensities, different noise-floor levels, and different types of nonlinear distortion. To measure the performance of the delay estimation algorithm, the delay between the reference signal and the microphone signal can also be changed at random to observe how the algorithm responds to sudden delay changes.

Therefore, to judge the quality of a delay estimation algorithm, we need to examine whether it can:

  1. Adapt to as many devices and acoustic environments as possible, and match the appropriate algorithm to the device and acoustic environment in the shortest possible time;

  2. Adjust its strategy promptly after a sudden, random delay change (see the sketch below).
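For point 2, one way to construct such a test case is to splice a sudden delay change into the echo path and check how quickly the estimator re-converges; a minimal sketch (our own illustration, with an arbitrary 0.6 echo gain):

```python
import numpy as np

def inject_delay_jump(ref: np.ndarray, fs: int, jump_at_s: float, new_delay_ms: float) -> np.ndarray:
    """Build a mic-side echo whose delay suddenly changes mid-signal, for testing tracking."""
    jump = int(jump_at_s * fs)
    d = int(new_delay_ms * fs / 1000)
    mic = np.copy(ref) * 0.6                              # echo before the jump (no extra delay)
    delayed = np.concatenate((np.zeros(d), ref))[:len(ref)] * 0.6
    mic[jump:] = delayed[jump:]                           # echo after the jump (delayed copy)
    return mic
```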

The following is a comparison of the delay estimation performance of the Soundnet SDK and a competitor's SDK, using a total of 8,640 sets of test data from the database. As the figure shows, the Soundnet SDK finds the initial delay of most test data in a much shorter time: in 96% of the tests, the Soundnet SDK found the correct delay within 1 s, compared with 89% for the competitor.

The second test covers random delay jitter during a call, where the delay estimation algorithm needs to find the new, accurate delay value in the shortest possible time. As shown in the figure, on 71% of the test data the Soundnet SDK found the changed delay within 3 s, compared with 44% for the competitor.

2. Linear adaptive filter

For linear filters, a large body of literature covers their principles and practice. When applied to echo cancellation, the convergence rate, steady-state misadjustment, and tracking capability all need to be considered, and these metrics often conflict: for example, a larger step size improves the convergence rate but leads to larger misadjustment. This is the "no free lunch" theorem of adaptive filters.
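For reference, here is a textbook NLMS echo canceller in Python; the step size mu directly exposes the trade-off described above (larger mu converges faster but leaves more steady-state misadjustment):

```python
import numpy as np

def nlms(mic: np.ndarray, ref: np.ndarray, taps: int = 256,
         mu: float = 0.5, eps: float = 1e-6):
    """Basic NLMS echo canceller returning the residual signal and the filter estimate."""
    w = np.zeros(taps)                     # estimate of the echo path
    buf = np.zeros(taps)                   # most recent reference samples, newest first
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_hat = np.dot(w, buf)          # linear echo estimate
        e = mic[n] - echo_hat              # residual, handed on to nonlinear processing
        w += mu * e * buf / (np.dot(buf, buf) + eps)   # normalized LMS update
        out[n] = e
    return out, w
```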

As for the type of adaptive filter, in addition to the most commonly used NLMS filter (model-independent), an RLS filter (least-squares model) or a Kalman filter (state-space model) can also be used. Beyond the various assumptions, approximations, and optimizations in their theoretical derivations, the performance of these filters ultimately comes down to how the optimal step-size factor is computed (in the Kalman filter the step size is folded into the Kalman gain). When the filter has not converged or the acoustic transfer function changes suddenly, the step size needs to be large enough to track the change; when the filter has converged and the transfer function changes slowly, the step size should be reduced as far as possible to reach the smallest steady-state misadjustment. Computing the step size requires considering the energy ratio between the residual echo and the residual signal after the adaptive filter, that is, modeling the leakage coefficient of the system. This quantity is often equivalent to the difference between the filter coefficients and the true transfer function (called the state-vector error in the Kalman filter), which is the hardest part of the whole estimation problem. In addition, filter divergence during double talk also has to be handled; generally, this can be solved by adjusting the filter structure and using two echo path models, as sketched below.
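The sketch below illustrates the two-echo-path idea in its simplest form: a background filter that always adapts and a foreground filter that is only overwritten when the background clearly cancels more echo, so divergence during double talk never reaches the output. The thresholds and smoothing constants are arbitrary illustration values, not Soundnet's parameters:

```python
import numpy as np

def dual_path_aec(mic: np.ndarray, ref: np.ndarray, taps: int = 256,
                  mu: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Foreground/background (two echo path) linear canceller, robust to double talk."""
    bg = np.zeros(taps)          # background filter: always adapting
    fg = np.zeros(taps)          # foreground filter: only updated by copying
    buf = np.zeros(taps)
    out = np.zeros_like(mic)
    e_bg_pow = e_fg_pow = 1e-6   # smoothed residual powers
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        e_bg = mic[n] - np.dot(bg, buf)
        e_fg = mic[n] - np.dot(fg, buf)
        bg += mu * e_bg * buf / (np.dot(buf, buf) + eps)   # NLMS update on background only
        e_bg_pow = 0.99 * e_bg_pow + 0.01 * e_bg * e_bg
        e_fg_pow = 0.99 * e_fg_pow + 0.01 * e_fg * e_fg
        if e_bg_pow < 0.5 * e_fg_pow:                      # background is clearly better
            fg = bg.copy()
        out[n] = e_fg                                      # output always from the foreground
    return out
```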

The Soundnet adaptive filter does not use a single filter type; it combines the advantages of different filters and uses an adaptive algorithm to compute the optimal step size. In addition, the algorithm estimates the transfer function of the environment in real time from the linear filter coefficients and automatically adjusts the filter length to cover high-reverberation, strong-echo scenes, such as a communication device connected to an HDMI peripheral. Here is an example: in a medium-sized conference room in the Soundnet office (about 20 m², with glass walls on three sides), a MacBook Pro was connected to a Xiaomi TV through HDMI. The figure shows the trend of the linear filter's time-domain coefficients; the algorithm automatically estimates and matches the length of the actual environment's transfer function (the strong reverberation is detected automatically around frame 1400), optimizing the performance of the linear filter.

Similarly, we used a large amount of test data from the database to compare the performance of the Soundnet SDK and the competitor's SDK, including steady-state misadjustment (the degree of echo suppression after the filter converges) and convergence speed (the time it takes the filter to reach the converged state). The first figure shows the steady-state misadjustment of the adaptive filter: the Soundnet SDK achieves more than 20 dB of echo suppression on 47% of the test data, compared with 39% for the competitor.

The figure below shows the convergence speed of the adaptive filter: in 51% of the test samples, the Soundnet SDK converged to steady state within the first 3 s of the call, compared with 13% for the competitor.

3. Nonlinear processing

Nonlinear processing aims to suppress the residual echo that the linear filter could not predict. It usually works by computing the correlations between the reference signal, the microphone signal, the linear echo estimate, and the residual signal, and then either mapping the correlation directly to a suppression gain, or using the correlation to estimate the residual echo power spectrum and suppressing the residual echo with a conventional Wiener-filter style noise reduction.
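A minimal sketch of this correlation-to-gain mapping is shown below: the smoothed coherence between the residual and the estimated linear echo is turned into a per-bin suppression gain. It is our own simplified illustration, not the Soundnet NLP:

```python
import numpy as np

class ResidualEchoSuppressor:
    """Toy nonlinear processor driven by residual/echo coherence."""

    def __init__(self, n_bins: int, alpha: float = 0.9, floor: float = 0.05):
        self.alpha = alpha                            # smoothing factor for spectral statistics
        self.floor = floor                            # minimum gain, keeps some near-end signal
        self.p_res = np.full(n_bins, 1e-6)            # smoothed residual power
        self.p_echo = np.full(n_bins, 1e-6)           # smoothed echo-estimate power
        self.p_x = np.zeros(n_bins, dtype=complex)    # smoothed cross-spectrum

    def process(self, res_fft: np.ndarray, echo_fft: np.ndarray) -> np.ndarray:
        a = self.alpha
        self.p_res = a * self.p_res + (1 - a) * np.abs(res_fft) ** 2
        self.p_echo = a * self.p_echo + (1 - a) * np.abs(echo_fft) ** 2
        self.p_x = a * self.p_x + (1 - a) * res_fft * np.conj(echo_fft)
        coh = np.abs(self.p_x) ** 2 / (self.p_res * self.p_echo + 1e-12)
        gain = np.clip(1.0 - coh, self.floor, 1.0)    # high coherence -> bin is still echo -> suppress
        return gain * res_fft
```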

As the last module in the echo cancellation chain, the nonlinear processing unit is responsible not only for suppressing residual echo but also for monitoring whether the whole system is working properly. For example, has the linear filter stopped working properly because of delay jitter? Is there residual echo that the hardware echo canceller failed to remove before the Soundnet SDK's echo cancellation?

Here is a simple example: internal parameters estimated by the adaptive filter, such as the echo energy, can detect a delay change more quickly and prompt the NLP to take corresponding action:

As the Soundnet SDK covers more and more scenarios, transmitting music signals has become an important use case, and the Soundnet SDK has done a great deal of optimization for echo cancellation with music. A typical case is the improvement of the comfort noise estimation algorithm. The traditional algorithm uses the Minimum Statistics principle to estimate the noise floor in the signal. When this algorithm is applied to music, the noise power is overestimated because music is more stationary than speech. In echo cancellation this makes the processed noise floor (background noise) unstable between echo periods and non-echo periods, which is a very poor experience. Through signal classification and module fusion, the Soundnet SDK completely eliminates the noise-floor fluctuation caused by CNG estimation.
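To see why Minimum Statistics over-estimates the noise floor on music, here is a simplified version of the estimator: it smooths the per-bin power over time and takes the minimum over a sliding window, so for a highly stationary signal the "minimum" stays close to the signal itself (a sketch for illustration only):

```python
import numpy as np

def minimum_statistics_noise(power_frames: np.ndarray, win: int = 100,
                             alpha: float = 0.9) -> np.ndarray:
    """Simplified minimum-statistics noise floor for power_frames of shape (frames, bins)."""
    smoothed = np.zeros_like(power_frames)
    prev = power_frames[0]
    for t, p in enumerate(power_frames):
        prev = alpha * prev + (1 - alpha) * p          # recursive smoothing of per-bin power
        smoothed[t] = prev
    noise = np.zeros_like(power_frames)
    for t in range(len(power_frames)):
        lo = max(0, t - win + 1)
        noise[t] = smoothed[lo:t + 1].min(axis=0)      # per-bin minimum over the sliding window
    return noise
```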

In addition, the Soundnet SDK is heavily optimized for all kinds of extreme situations, including non-causal systems, device frequency offsets, capture-signal overflow, and sound cards that apply their own system signal processing, to ensure that the algorithm works in every communication scenario.

A sound-quality-first noise reduction strategy

Noise reduction affects signal quality even more than the echo cancellation module does. This stems from a prior assumption made when the noise reduction algorithm is designed: that all floor noise is a stationary (at least short-term stationary) signal. Under this assumption, the distinction between music and floor noise is significantly weaker than the distinction between speech and floor noise.
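The consequence of that assumption is easy to see in a classic Wiener-style suppressor: any component whose power stays close to the estimated noise floor, including a steady musical tone, is attenuated. A minimal sketch (not the Soundnet algorithm):

```python
import numpy as np

def wiener_suppress(frame_fft: np.ndarray, noise_psd: np.ndarray,
                    floor: float = 0.1) -> np.ndarray:
    """Per-bin Wiener-style gain under the 'noise is stationary' assumption."""
    snr = np.maximum(np.abs(frame_fft) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)  # estimated a-priori SNR
    gain = np.clip(snr / (snr + 1.0), floor, 1.0)                              # Wiener gain with a floor
    return gain * frame_fft
```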

The Soundnet SDK places a signal classification module in front of the noise reduction module; it accurately detects the type of the incoming signal and adjusts the type and parameters of the noise reduction algorithm accordingly. Common signal types include ordinary speech, a cappella singing, music, and so on. The figure below shows the same signal processed by the two noise reduction algorithms. The test signal mixes speech and music: the first 15 seconds are noisy speech, followed by 40 seconds of music and then 10 seconds of noisy speech. The results show that, with comparable noise reduction on the speech segments, the music portion of the competitor's output is seriously damaged, while the Soundnet SDK's processing does not degrade the music quality.

In the second example, the audio is a singer's a cappella performance in which the singer repeatedly sings "ah". In the figure below, from top to bottom, are the original signal, the competitor's processing result, and the Soundnet SDK's processing result. The competitor's noise reduction severely damages the spectral components of the original voice, while the Soundnet SDK fully preserves its harmonic components, maintaining the sound quality singers expect when singing.

Conclusion

Ever since M. M. Sondhi of Bell Laboratories proposed adaptive filtering to cancel echoes in 1967, countless studies and practical systems have been devoted to this most basic problem of voice communication. To solve the echo problem well, a strong algorithm is only the foundation; a great deal of engineering optimization is also required. Soundnet will continue to improve the echo cancellation experience across different application scenarios.

In the next installment of this series, we will follow the audio signal from the device into the real-world network and talk about delay, jitter, and the optimization strategies behind packet-loss countermeasures in audio interactive scenarios, while taking a real-world tour of Shanghai. (A picture to briefly hint at what is coming; stay tuned.)