Good sound works like a filter. In live streaming especially, sound quality is a key driver of a room's popularity: a clear, pleasant voice quickly attracts the audience and leaves a good impression.

To build a live room with a "good voice", it is not enough for hosts to refine their delivery through tone, intonation and pacing; the application itself must provide the underlying technical support that flatters the voice and improves the listening experience. As live-streaming scenarios grow richer and more complex, each scenario brings its own technical requirements.

As the live-streaming business booms, co-hosting (Lianmai) has proven its value for user retention, stream activity and content quality, and has become a must-have capability across business scenarios.

The Rongyun Live SDK, built on Rongyun's IM + RTC + X full-communication capability, fully encapsulates the business scenario and provides seven mixed-stream layout modes that cover common co-hosted live-streaming scenes.

An essential technical capability in co-hosted live communication is Acoustic Echo Cancellation (AEC).

This article shares Rongyun's practice and results with AEC technology, covering basic concepts, classical algorithms, the main challenges, and the exploration of AI-based echo cancellation.

Introduction to Basic Concepts

What is an echo?

In live-streaming scenarios, echo mainly means acoustic echo (comprising both linear and nonlinear echo signals). The speech of the remote speaker (host A or listener A) is transmitted to the local device (listener B or host B) and played by the local loudspeaker; after a series of acoustic reflections it is picked up by the local microphone and transmitted back to the remote device (host A or listener A).

As a result, remote speakers hear their own voice again a short time later. The acoustic echo generation process is shown in Figure 1.
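To make the echo path concrete, here is a minimal NumPy sketch of how the near-end microphone signal is composed. The room impulse response and the tanh loudspeaker distortion are illustrative assumptions, not measurements of any real device; the distortion is what gives rise to the nonlinear echo component.

```python
import numpy as np

rng = np.random.default_rng(0)
far_end = rng.standard_normal(16000)          # far-end speech (1 s @ 16 kHz)
near_end = rng.standard_normal(16000) * 0.1   # local speech picked up directly

rir = np.zeros(512)                           # toy room impulse response
rir[[0, 80, 200]] = [0.6, 0.3, 0.1]           # direct path + two reflections

loudspeaker = np.tanh(far_end)                # loudspeaker distortion -> nonlinear echo
echo = np.convolve(loudspeaker, rir)[:16000]  # acoustic path to the microphone

mic = near_end + echo                         # what the microphone actually captures
```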

(Figure 1: Acoustic echo generation process)

How to avoid this phenomenon?

The answer is AEC (Acoustic Echo Cancellation): eliminate the echo contained in the signal captured by the near-end microphone, so that the voice heard by the remote speaker contains no echo, which in turn improves the user experience and the quality of the live room.

With AEC in place, the sound path between the two ends changes as shown in Figure 2, guaranteeing clean audio in co-hosted live streaming at the lowest layer of the stack.

(Figure 2: Application scenario of AEC technology)

Commonly used classical algorithms

A complete acoustic echo cancellation algorithm in common use today consists of three main modules: Time Delay Estimation (TDE), Linear Echo Cancellation (LEC) and Residual Echo Suppression (RES). Its block diagram is shown in Figure 3.

(Figure 3: Common echo cancellation algorithm)

Delay estimation module

There is a time delay between the echo signal and the reference signal. Its main sources are:

① the time from when the reference signal is obtained to when the loudspeaker plays it out;

② the acoustic propagation time from the loudspeaker to the microphone;

③ the time from when the microphone picks up the echo signal to when it reaches the AEC algorithm module.

None of these times is fixed, so the delay also jitters. Excessive delay or jitter degrades AEC performance. A time delay estimation (TDE) module is therefore needed to estimate the delay between the microphone signal and the reference signal and align them, so that the subsequent linear echo cancellation (LEC) and residual echo suppression (RES) processing is effective.

Classical TDE algorithms are based on the cross-correlation principle. Taking the TDE algorithm in WebRTC as an example: it transforms the reference signal and the microphone signal to the frequency domain, binarises each frame into 1/0 to indicate whether a voice signal is present, then shifts the two binary sequences against each other frame by frame and takes the offset with the highest agreement as the estimated delay.
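As a rough illustration (not the actual WebRTC implementation), the NumPy sketch below binarises per-frame spectra and scores candidate frame offsets by bitwise agreement; the frame size and search range are assumptions.

```python
import numpy as np

def binary_spectrum(signal, frame=256):
    """1/0 per frequency bin: is there energy above the per-bin running mean?"""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return (mag > mag.mean(axis=0, keepdims=True)).astype(np.int8)

def estimate_delay(reference, mic, max_delay_frames=25, frame=256):
    ref_bits = binary_spectrum(reference, frame)
    mic_bits = binary_spectrum(mic, frame)
    n = min(len(ref_bits), len(mic_bits)) - max_delay_frames
    scores = [
        np.mean(ref_bits[:n] == mic_bits[d : d + n])   # bitwise agreement at offset d
        for d in range(max_delay_frames)
    ]
    return int(np.argmax(scores)) * frame              # delay in samples
```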

Linear echo cancellation module

Classical linear echo cancellation (LEC) modules are mainly built around an Adaptive Filter (AF) such as LMS, NLMS, AP, RLS or a Kalman filter. AF design must weigh the following indicators (a minimal NLMS sketch follows the list):

**① Convergence rate:** the faster the better; this is the speed at which the AF moves from an unconverged state (for example the initial state, or an unconverged state caused by a change in the echo path) to a converged state;

**② Stability:** after convergence, the AF should keep working stably and effectively, so that the residual echo at its output stays small and steady;

**③ Algorithm complexity:** computational complexity should be as low as possible while still achieving a good filtering effect.
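For concreteness, here is a minimal NLMS sketch, one of the AF choices listed above. The filter length, step size and regulariser are illustrative assumptions, not tuned values.

```python
import numpy as np

def nlms(reference, mic, taps=512, mu=0.5, eps=1e-6):
    w = np.zeros(taps)                         # adaptive filter weights
    out = np.zeros(len(mic))                   # error signal = echo-cancelled output
    for n in range(taps, len(mic)):
        x = reference[n - taps : n][::-1]      # most recent reference samples
        y = w @ x                              # linear echo estimate
        e = mic[n] - y                         # residual after cancellation
        w += mu * e * x / (x @ x + eps)        # normalised weight update
        out[n] = e
    return out
```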

A Double Talk Detection (DTD) module is usually introduced when designing the linear echo cancellation (LEC) module. "Double talk" means the near end and the far end speak at the same time; "single talk" means only the far end speaks.

**Double talk detection (DTD)** freezes the adaptive filter (AF) update during "double talk", keeping the AF stable and preventing divergence; during single talk the AF keeps updating so that it can track changes in the echo path. A toy detector in this spirit is sketched below.
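A classic simple choice is the Geigel detector: declare "double talk" when the microphone sample is loud relative to recent reference samples. The window and threshold below are assumptions; production detectors are considerably more robust.

```python
import numpy as np

def geigel_dtd(reference, mic, window=512, threshold=0.5):
    flags = np.zeros(len(mic), dtype=bool)
    for n in range(window, len(mic)):
        ref_max = np.max(np.abs(reference[n - window : n]))
        flags[n] = np.abs(mic[n]) > threshold * ref_max   # True => freeze AF update
    return flags
```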

Residual echo suppression module

The adaptive filter (AF) usually models the echo path as a linear system, but the real system is not strictly linear. Moreover, the AF has a finite length, which makes it hard to accurately fit strong reverberation in the environment.

Therefore, residual echo still exists after linear echo cancellation (LEC) processing, and a residual echo suppression (RES) module must be introduced to suppress it further.

The residual echo suppression (RES) module usually exploits the correlations between the residual signal and the microphone signal, the reference signal and the linear echo estimate to estimate the residual echo; from that it estimates the a-posteriori/a-priori signal-to-echo ratio, derives a final gain via Wiener filtering, and applies it to obtain the output.
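The sketch below illustrates the Wiener-gain step just described. How the residual echo power is obtained (here, a fixed leakage factor applied to the linear echo estimate) is an assumption; real RES modules estimate it from the correlations mentioned above.

```python
import numpy as np

def res_wiener(residual_spec, echo_est_spec, leak=0.3, floor=0.1):
    res_pow = np.abs(residual_spec) ** 2
    echo_pow = leak * np.abs(echo_est_spec) ** 2      # assumed residual echo power
    ser_post = res_pow / (echo_pow + 1e-10)           # a-posteriori signal-to-echo ratio
    ser_prior = np.maximum(ser_post - 1.0, 0.0)       # decision-directed-style prior
    gain = ser_prior / (ser_prior + 1.0)              # Wiener gain per frequency bin
    return np.maximum(gain, floor) * residual_spec    # gain floor limits speech distortion
```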

Note that the RES module must balance residual echo suppression against near-end speech distortion, and algorithm quality against computational complexity.
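Putting the pieces together, this sketch chains the three toy functions defined above (estimate_delay, nlms, res_wiener) into the TDE → LEC → RES pipeline of Figure 3. It is a minimal illustration under the same assumptions, not a production AEC.

```python
import numpy as np

def aec_pipeline(reference, mic, frame=256):
    delay = estimate_delay(reference, mic)            # TDE (sketch above)
    aligned = np.concatenate([np.zeros(delay), reference])[: len(reference)]
    residual = nlms(aligned, mic)                     # LEC (sketch above)
    echo_est = mic - residual                         # rough linear echo estimate
    out = np.zeros_like(residual)
    for i in range(0, len(residual) - frame + 1, frame):
        r = np.fft.rfft(residual[i : i + frame])      # RES (sketch above), per frame
        e = np.fft.rfft(echo_est[i : i + frame])
        out[i : i + frame] = np.fft.irfft(res_wiener(r, e), frame)
    return out
```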

The main challenges of classical algorithms

Although classical algorithms are widely used in practice, some problems remain hard for them:

① In scenes with strong nonlinear echo, it is hard to achieve good echo suppression while keeping near-end speech undamaged, or damaged only to an acceptable degree.

② Traditional double-talk detection struggles to produce accurate results under strong nonlinear echo or non-stationary noise.

③ In strongly reverberant scenes, the finite length of the adaptive filter (AF) makes good echo suppression hard to achieve.

In co-hosted (Lianmai) live streaming, the real application scenarios of AEC are even more complex and bring further challenges:

① User terminal devices vary widely in type and behaviour, so the nonlinearities in echo generation differ greatly from device to device, posing a major challenge to the AEC algorithm.

② Devices are used in complex, diverse environments, including quiet rooms, noisy indoor and outdoor settings, and strongly reverberant rooms, all of which test how well AEC performs in practice.

③ A popular platform can see a flood of users; with many participants, the probability of "double talk" rises sharply, which further increases the difficulty of AEC.

The traditional AEC algorithm therefore faces multiple challenges across the many co-hosting layouts supported by the Rongyun Live SDK.

Artificial intelligence echo cancellation technology exploration

In recent years, deep learning has been applied more and more widely in speech signal processing, and its combination with AEC algorithms has also made progress.

The essence of deep learning is to build a deep model that fits the mapping between input and output, continuously adjusting itself until the error between the model output and the target converges to a stable minimum. For AEC, the deep network takes the reference and microphone signals as input, and outputs an estimate of the echo-free near-end speech.

Deep learning combined with AEC

Current research on combining deep learning with AEC mainly follows two directions:

① Traditional LEC + deep learning RES

A deep model is used to fit the nonlinear residual echo, while the classical linear echo cancellation (LEC) algorithm is retained (see the sketch after this list).

② Deep learning AEC

As deep networks have grown more expressive, more and more methods use a deep model directly to fit all of the echo (linear plus nonlinear).
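As a concrete but purely illustrative instance of approach ①, the PyTorch sketch below predicts a per-bin suppression mask from the magnitude spectra of the LEC residual and the reference signal. The GRU, layer sizes and feature choice are assumptions, not Rongyun's production model.

```python
import torch
import torch.nn as nn

class DeepRES(nn.Module):
    def __init__(self, bins=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=2 * bins, hidden_size=hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, bins), nn.Sigmoid())

    def forward(self, residual_mag, reference_mag):
        # inputs: (batch, frames, bins) magnitude spectra
        feats = torch.cat([residual_mag, reference_mag], dim=-1)
        h, _ = self.rnn(feats)
        return self.mask(h) * residual_mag    # masked residual = enhanced near-end

model = DeepRES()
res = torch.rand(1, 100, 257)     # dummy residual magnitudes
ref = torch.rand(1, 100, 257)     # dummy reference magnitudes
enhanced = model(res, ref)        # (1, 100, 257)
```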

Overall, more and more deep-learning-based AEC algorithms are being studied and gradually deployed in real systems.

Rongyun's exploration and application

Co-hosting (Lianmai) is one of the key functions of the live-streaming business, AEC is one of its indispensable algorithms, and AEC performance directly shapes the co-hosted live experience. Rongyun has therefore kept exploring AEC, including fusing the AEC algorithm with the Transformer, the most popular model in the NLP field in recent years.

The Transformer, proposed by a Google team in June 2017, abandons the traditional CNN and RNN and is built entirely on the attention mechanism.

Its major breakthrough for the industry was eliminating the RNN's dependence on historical results, a dependence that limited model parallelism and caused information loss in sequential computation.

The Transformer has become the dominant model in natural language processing and has branched out into other areas, including image synthesis, multi-object tracking, music generation, time-series prediction, visual-language modelling, and more.

Speech signals are time series, which is exactly where the Transformer excels.

Exploring the fusion of the Transformer with the AEC algorithm, Rongyun has built an AEC framework based on a Dual-path Transformer, sketched in Figure 4: an intra-transformer models local information and an inter-transformer models global information.

(Figure 4: Rongyun's exploration of AI echo cancellation technology)
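The PyTorch sketch below illustrates the dual-path idea only: split the frame sequence into chunks, run an intra-chunk transformer for local modelling, then an inter-chunk transformer across chunks for global modelling. Chunk size, model width and other hyper-parameters are assumptions and do not describe Rongyun's actual network.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.intra = layer()   # models frames inside each chunk (local)
        self.inter = layer()   # models the same position across chunks (global)

    def forward(self, x):                       # x: (batch, chunks, chunk_len, dim)
        b, c, l, d = x.shape
        x = self.intra(x.reshape(b * c, l, d)).reshape(b, c, l, d)
        x = x.transpose(1, 2)                   # (batch, chunk_len, chunks, dim)
        x = self.inter(x.reshape(b * l, c, d)).reshape(b, l, c, d)
        return x.transpose(1, 2)                # back to (batch, chunks, chunk_len, dim)

x = torch.rand(1, 10, 32, 64)                   # dummy chunked feature sequence
y = DualPathBlock()(x)                          # same shape out
```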

Following the research trend of deep learning AEC, Rongyun has also validated some existing deep-learning AEC algorithms in practice, such as DTLN-AEC, which is based on an LSTM deep learning model.

The results are shown in the figure below; the first part is "single talk" and the second part is "double talk".

The results show that the LSTM-based DTLN-AEC method outperforms traditional methods. Combining the strong fitting capacity of a deep model with datasets tailored to the live co-hosting scenario improves performance under nonlinear echo, reverberation, noise and "double talk", greatly easing the challenges of co-hosted live streaming.