Brief introduction: In recent years, with the development of real-time communication technology, online meetings have gradually become an indispensable office tool. According to incomplete statistics, about 75% of online meetings are pure audio conferences that do not use the camera or screen-sharing functions, so voice quality and clarity are central to the online meeting experience.

Author | Qiqi

Review | Taiyi


In real life, meeting environments are highly diverse, from open noisy spaces to transient, non-stationary keyboard clicks, which poses great challenges to traditional signal-processing-based speech front-end enhancement algorithms. At the same time, with the rapid development of data-driven methods, deep-learning-based intelligent speech enhancement algorithms have gradually emerged in academia [1] and industry [2,3,4] and achieved good results. The AliCloudDenoise algorithm emerged in this context. Leveraging the excellent nonlinear fitting ability of neural networks and combining it with traditional speech enhancement algorithms, it has, through continuous iterative optimization, undergone a series of improvements in noise suppression, performance, and cost for real-time meeting scenarios. The result fully guarantees noise reduction while preserving high-fidelity speech, providing an excellent audio conferencing experience for the Alibaba Cloud real-time video conferencing system.

I. The development status of speech enhancement algorithms

Speech enhancement refers to the technology of filtering out the various noises that disturb clean speech in real-life scenes, so as to improve the quality and intelligibility of the speech. Over the past few decades, traditional single-channel speech enhancement algorithms have developed rapidly. They are mainly divided into time-domain and frequency-domain methods. Time-domain methods can be roughly divided into parametric filtering methods [5,6] and signal subspace methods [7], while frequency-domain methods include spectral subtraction, Wiener filtering, and minimum mean-square-error speech amplitude spectrum estimation [8,9], among others.
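As a concrete example of a frequency-domain method, the classic spectral subtraction rule can be sketched in a few lines. This is a generic textbook form, not tied to any implementation mentioned in this article: a scaled noise estimate is subtracted from the noisy magnitude spectrum, and the result is clamped to a spectral floor to limit "musical noise" artifacts.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.02):
    # Over-subtract the noise magnitude estimate (alpha > 1), then
    # clamp to a spectral floor (beta * noisy magnitude) so that
    # bins never go negative or fully silent.
    clean_mag = noisy_mag - alpha * noise_mag
    return np.maximum(clean_mag, beta * noisy_mag)
```

The floor parameter `beta` trades residual noise against musical-noise artifacts; both parameter names here are illustrative.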

Traditional single-channel speech enhancement methods have the advantages of low computational cost and the ability to run online in real time, but their ability to suppress sudden, non-stationary noise, such as a car horn that suddenly sounds on the road, is poor. For such noise, traditional algorithms leave substantial residual noise after enhancement, which degrades subjective listening quality and can even harm the intelligibility of the speech. From the perspective of the algorithms' mathematical derivation, traditional methods also rely on too many assumptions in solving for an analytical solution, which places a clear upper bound on their effectiveness and makes it difficult for them to adapt to complex, changeable real-world scenes. Since 2016, deep learning methods have significantly improved the performance of many supervised learning tasks, such as image classification [10], handwriting recognition [11], automatic speech recognition [12], language modeling [13] and machine translation [14]. Many deep learning methods have also appeared for speech enhancement tasks.

Fig. 1 Flow chart of classical algorithms for traditional single-channel speech enhancement systems

Deep learning-based speech enhancement algorithms can be roughly divided into the following four categories according to their training objectives:

• Hybrid Method: In this kind of algorithm, one or more submodules of a traditional signal-processing-based speech enhancement algorithm are replaced by a neural network. The overall processing flow of the algorithm is generally unchanged; a typical representative is RNNoise [15].

• Mask-Based Method: This kind of algorithm trains a neural network to predict a time-frequency mask and applies the predicted mask to the spectrum of the noisy input to reconstruct the clean speech signal. Common time-frequency masks include the IRM [16], PSM [17], and cIRM [18]. The error function used during training is shown below:
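The error function itself is not reproduced in this text. For reference, a typical mask-estimation loss, assuming a predicted mask $\hat{M}_t$ and a target mask $M_t$ over $T$ frames (the exact loss in the original figure may differ), is the mean squared error:

```latex
J_{\text{mask}} \;=\; \frac{1}{T} \sum_{t=1}^{T} \bigl\| \hat{M}_t - M_t \bigr\|_2^2
```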

• Mapping-Based Method: This kind of algorithm trains a neural network to directly map noisy features to clean ones. Common features include the amplitude spectrum, the log-power spectrum, and the complex spectrum. The error function used during training is shown below:
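Again the error function is not reproduced in this text. A typical mapping loss, assuming a network $f$, a noisy feature $Y_t$ and a clean target feature $X_t$ (the exact loss in the original figure may differ), is:

```latex
J_{\text{map}} \;=\; \frac{1}{T} \sum_{t=1}^{T} \bigl\| f(Y_t) - X_t \bigr\|_2^2
```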

• End-to-End Method: End-to-end speech enhancement algorithms push the data-driven idea to the extreme. Given a reasonably distributed data set, directly mapping the time-domain speech signal end to end, without any frequency-domain transform, has been one of the most active research directions in academia over the past two years.

AlicloudDenoise speech enhancement algorithm

After comprehensively considering the business usage scenarios and weighing the noise reduction effect, performance overhead, real-time requirements and many other factors, the AliCloudDenoise speech enhancement algorithm adopts the hybrid method: it takes the ratio between the noise energy and the target speech energy in the noisy speech as the fitting target, then uses a gain estimator from traditional signal processing, such as the MMSE-STSA estimator, to obtain the frequency-domain denoising gain, and finally obtains the enhanced time-domain speech signal by the inverse transform. For the network structure, to balance real-time performance and power consumption, a TCN is chosen instead of an RNN structure. The basic network structure is shown in the figure below:
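The hybrid processing flow described above can be sketched in code. This is a minimal illustration, not the AliCloudDenoise implementation: `predict_xi` is a stand-in for the neural network that estimates the per-bin a priori SNR, and a plain Wiener gain $\xi/(1+\xi)$ replaces the MMSE-STSA estimator for brevity.

```python
import numpy as np

FRAME, HOP = 512, 256  # 32 ms frames at 16 kHz, 50% overlap

def stft(x):
    win = np.hanning(FRAME)
    starts = range(0, len(x) - FRAME + 1, HOP)
    return np.array([np.fft.rfft(win * x[s:s + FRAME]) for s in starts])

def istft(spec):
    win = np.hanning(FRAME)
    out = np.zeros(HOP * (len(spec) - 1) + FRAME)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * HOP:i * HOP + FRAME] += win * np.fft.irfft(frame, FRAME)
        norm[i * HOP:i * HOP + FRAME] += win ** 2
    return out / np.maximum(norm, 1e-8)  # weighted overlap-add

def wiener_gain(xi):
    # Plain Wiener gain from the a priori SNR xi; a stand-in here for
    # the MMSE-STSA gain estimator mentioned in the text.
    return xi / (1.0 + xi)

def enhance(noisy, predict_xi):
    spec = stft(noisy)
    xi = predict_xi(np.abs(spec))  # the network's role: estimate xi per bin
    return istft(wiener_gain(xi) * spec)
```

The key design point of the hybrid approach survives even in this sketch: the network only estimates an SNR-like quantity, while the actual spectral modification is done by a classical, well-understood gain rule.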

II. Algorithm optimization under real-time meeting scenario

1. What to do when there are many people around and it is noisy?

**Problem Background**

In real-time meeting scenarios, babble noise, i.e., background noise composed of the conversations of multiple speakers, is a common type of background noise. This kind of noise is not only non-stationary but also similar in composition to the target speech, which makes it difficult for speech enhancement algorithms to suppress. Here is a specific example:

**Problem Analysis and Improvement Plan**

After analyzing dozens of hours of office-scene audio containing babble noise and considering the mechanism of human speech production, we found that this kind of noise tends to persist quasi-stationarily over long periods. As is well known, contextual information has a significant impact on the effectiveness of speech enhancement algorithms. Therefore, for a strongly context-dependent noise type like babble noise, the AliCloudDenoise algorithm uses dilated convolutions to systematically aggregate features at key stages of the model, explicitly enlarging the receptive field, and additionally incorporates gating mechanisms. The improved model significantly improves the handling of babble noise. The figure below compares the key parts of the model before the improvement (TCN) and after it (GATCN).
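The two ingredients named above, dilated convolutions that enlarge the receptive field and a gating mechanism that modulates features, can be illustrated with a minimal numpy sketch. This is not the actual GATCN code, only the underlying mechanics:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation], with zero padding on the left."""
    k, T = len(w), len(x)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w[i] * xp[pad - i * dilation: pad - i * dilation + T]
               for i in range(k))

def gated_block(x, w_filter, w_gate, dilation):
    """Gated activation tanh(filter) * sigmoid(gate), as in gated TCN blocks."""
    gate = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    return np.tanh(causal_dilated_conv(x, w_filter, dilation)) * gate

def receptive_field(kernel_size, dilations):
    """Frames seen by one output of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel size 3 and dilations 1, 2, 4, 8, the receptive field is already 31 frames, which is why dilation is the standard way to give a TCN long temporal context at low cost.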

Results on the speech test set show that, under the IRM target, the GATCN model improves speech quality (PESQ [19]) by 9.7% and speech intelligibility (STOI [20]) by 3.4% over the TCN model. Under the Mapping a Priori SNR target [21], PESQ improves by 7.1% and STOI by 2.0% over the TCN model, outperforming all baseline models. The indicators are detailed in Table 1 and Table 2.

Table 1. PESQ comparison details of objective indicators of speech quality

Table 2. STOI comparison details of objective indicators of speech intelligibility

Improvement effect display:

2. How to avoid dropping words at critical moments?

**Problem Background**

In speech enhancement algorithms, swallowed or dropped words, such as the disappearance of the end of a sentence, are an important factor affecting the subjective listening quality of the enhanced speech. In real-time meeting scenarios, this phenomenon is especially common because of the variety of languages involved and the variety of what speakers say. A specific example is given below:

**Problem Analysis and Improvement Plan**

Based on a categorized voice test data set of more than 10,000 clips, we counted when swallowed and dropped words occurred after enhancement and visualized the corresponding frequency-domain features. The phenomenon mainly occurs on a few specific phonemes or words, such as voiced sounds, overlapping sounds and long sounds. Classification statistics along the signal-to-noise-ratio dimension also show that swallowed and dropped words increase significantly at low SNR. Accordingly, improvements were made in the following three aspects:

• Data level: We first ran distribution statistics on the relevant phonemes in the training data set. After concluding that their proportion was relatively small, we enriched the corresponding speech components in the training data set.

• Noise reduction strategy level: At low SNR, under certain conditions, a combined noise reduction strategy can be used: traditional noise reduction first, followed by AliCloudDenoise. Its disadvantages are twofold: first, combined noise reduction increases the computational cost of the algorithm; second, traditional noise reduction inevitably damages the spectrum and reduces overall sound quality. Although this method alleviates swallowed and dropped words, it is not used online because of these obvious shortcomings.

• Training strategy level: After the targeted enrichment of speech components in the training data set, swallowed and dropped words after enhancement did improve, but the phenomenon still existed. Further analysis found that the spectral characteristics of the affected components are highly similar to those of some noises, which makes it hard for the network training to converge locally. Based on this, the AliCloudDenoise algorithm adopts a training strategy that is used only during training, not during inference: the speech presence probability (SPP) is produced as an auxiliary output during training. The SPP is computed as follows:
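The SPP formula itself is not reproduced in this text. For reference, one common definition in the MMSE-estimator tradition of [8,9], which may differ from the exact formula used by the authors, expresses the probability of speech presence in bin $k$ from the a priori SNR $\xi_k$ and the a posteriori SNR $\gamma_k$:

```latex
P(H_1 \mid Y_k) \;=\;
\left[\, 1 + \frac{q}{1-q}\,(1+\xi_k)\, e^{-\nu_k} \right]^{-1},
\qquad
\nu_k \;=\; \frac{\gamma_k\, \xi_k}{1+\xi_k}
```

where $q$ is the a priori speech absence probability.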

Results on the speech test set show that, under the IRM target, the proposed dual-output auxiliary training strategy improves speech quality (PESQ) by 3.1% and speech intelligibility (STOI) by 1.2% over the original model. Under the Mapping a Priori SNR target, PESQ improves by 4.0% and STOI by 0.7% over the original model, outperforming all baseline models. The indicators are detailed in Table 3 and Table 4.

Table 3. PESQ comparison details of objective indicators of speech quality

Table 4. STOI comparison details of objective indicators of speech intelligibility

Improvement effect display:

III. Energy consumption optimization of the algorithm

For real-time meeting scenarios, the AliCloudDenoise algorithm typically runs on PCs, mobile devices and IoT devices. Although different operating environments have different energy-consumption requirements, CPU usage, memory capacity and bandwidth, and power consumption are all key performance indicators we care about. To enable the AliCloudDenoise algorithm to serve a wide range of business parties, we adopted a series of energy-consumption optimization methods, mainly including structural model pruning, a resource-adaptive strategy, weight quantization and training-time quantization. With some auxiliary convergence strategies, and at an accuracy loss on the order of 0.1%, we finally obtained an intelligent speech enhancement model of about 500 KB, which greatly widens the applicable range of the AliCloudDenoise algorithm.

Next, we briefly review the model lightweighting techniques involved in the optimization process, then introduce the resource-adaptive strategy and model quantization, and finally give the key energy-consumption indicators of the AliCloudDenoise algorithm.

1. Model lightweighting techniques adopted

Lightweighting for deep learning models generally refers to a set of techniques for optimizing a model's "operating cost", such as parameter count and size, computation, energy consumption and speed, so that the model can be deployed on a wide variety of hardware devices. Lightweighting is also widely used in compute-intensive cloud services, where it can reduce the cost of services and improve response speed.

The main difficulty of lightweighting is that the algorithm's effectiveness, generalization and stability must not be significantly affected while the operating cost is optimized. This is difficult in every respect for the usual "black box" neural network models. In addition, part of the difficulty also lies in differences between optimization objectives.

For example, reducing model size does not necessarily reduce the amount of computation; reducing computation does not necessarily improve running speed; and an increase in speed does not necessarily reduce energy consumption. These differences mean that lightweighting cannot solve all performance problems in one "package"; achieving a comprehensive reduction of operating cost requires combining multiple perspectives and techniques.

At present, common lightweighting techniques in academia and industry include parameter/operation quantization, pruning, compact modules, structural hyperparameter optimization, distillation, low-rank decomposition, weight sharing and so on. For example, parameter quantization alone can compress the storage space occupied by the model, but the values are still converted back to floating point for computation. Joint parameter + operation quantization can reduce both the parameter volume and the on-chip computation, but a speedup is achieved only when the processor supports the corresponding operations. Knowledge distillation uses a small student network to learn the high-level features of a large model and obtain a lightweight model with matching performance, but its optimization is somewhat difficult and it is mainly suitable for tasks with simplified outputs (such as classification).

Unstructured fine-grained pruning can eliminate the most redundant parameters and achieve excellent compression, but dedicated hardware support is required to actually reduce computation. Weight sharing can significantly reduce model size, but is difficult to translate into speedups or energy savings. AutoML structural hyperparameter search can automatically determine the optimal model stacking structure from small-scale test results, but its applicability is limited by the complexity of the search space and the quality of the iteration estimates. The figure below shows the main lightweighting techniques used by the AliCloudDenoise algorithm during energy-consumption optimization.

2. Resource adaptive strategy

The core of the resource-adaptive strategy is that the model can adaptively output results that satisfy a lower precision requirement when resources are insufficient, doing the best it can, and output results with optimal precision when resources are sufficient. The most direct way to achieve this is to train models of different sizes and deploy whichever one a device needs, but this incurs additional storage cost. The AliCloudDenoise algorithm instead adopts a hierarchical training scheme, as shown in the figure below:

In this scheme, the results of the intermediate layers are also output, and training is finally performed under a unified constraint using a joint loss. However, actual verification revealed the following two problems:

• The features extracted from shallow networks are relatively basic, and the enhancement effect of shallow networks is poor.

• After adding output heads to the intermediate layers, the enhancement result of the final layer degrades, because joint training also expects the shallow layers to output good enhancement results, which disturbs the feature distribution of the original network structure.

To solve these two problems, we adopted an optimization strategy of multi-scale dense connections plus offline hyperparameter pre-pruning, which ensures that the model can dynamically output speech enhancement results on demand with a precision spread of no more than 3.2%.
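The "best effort under a budget" idea behind the resource-adaptive strategy can be illustrated with a toy selector. The variant names and CPU costs below reuse the Mac-platform figures reported later in this article; the selection mechanism itself is a hypothetical sketch, not the actual scheduling logic.

```python
def select_exit(cpu_budget_pct,
                exits=((1.1, "524KB"), (1.3, "912KB"), (2.7, "2.6MB"))):
    # Pick the largest model variant whose measured CPU cost fits the budget.
    # Falls back to the smallest variant (best effort) if nothing fits.
    chosen = exits[0][1]
    for cost, name in exits:
        if cost <= cpu_budget_pct:
            chosen = name
    return chosen
```

The hierarchical training scheme makes this kind of switch cheap: all variants live in one model, so choosing an exit does not require loading a separate network.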

3. Model quantization

To optimize the memory capacity and bandwidth required by the model, we mainly used the MNN team's weight quantization tool [22] and Python offline quantization tool [23] to convert between FP32 and INT8. The schematic diagram is as follows:
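The essence of the FP32-to-INT8 conversion can be shown with a generic symmetric quantization scheme. This is a simplified illustration, not the MNN tools' actual algorithm:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8 quantization: w ~= scale * q,
    # with q in [-127, 127] and scale chosen from the max magnitude.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP32 approximation of the original weights.
    return q.astype(np.float32) * scale
```

Per-tensor quantization like this shrinks weight storage by 4x; the quantization error is bounded by half the scale step, which is why a well-conditioned model loses only a small fraction of accuracy.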

4. Key energy consumption indicators of the AliCloudDenoise algorithm

As shown in the figure above, on the Mac platform the rival product's algorithm library is 14 MB, while the main variants of the AliCloudDenoise algorithm library are 524 KB, 912 KB and 2.6 MB, a significant advantage. In terms of runtime consumption, tests on the Mac platform show that the rival product's CPU usage is 3.4%, while the 524 KB AliCloudDenoise library uses 1.1%, the 912 KB library uses 1.3%, and the 2.6 MB library uses 2.7%. The AliCloudDenoise algorithm thus has an obvious advantage, especially under long-running conditions.

IV. Evaluation results of the algorithm's technical indicators

The evaluation of the speech enhancement effect of the AliCloudDenoise algorithm focuses on two scenarios: the general scenario and the office meeting scenario.

1. Evaluation results in the general scenario

In the general-scenario test set, the speech data consist of Chinese and English parts (about 5,000 clips in total), and the noise data contain four typical types of common noise: stationary noise, non-stationary noise, babble noise and outdoor noise, with the noise intensity set between -5 and 15 dB SNR. The objective indicators are mainly PESQ for speech quality and STOI for speech intelligibility; the higher the two values, the better the enhanced speech.

As shown in the table below, the evaluation results on the general-scenario speech test set show that the AliCloudDenoise 524 KB algorithm library improves PESQ by 39.4% (English speech) and 48.4% (Chinese speech) over the traditional algorithm, and improves STOI by 21.4% (English speech) and 23.1% (Chinese speech), essentially matching the rival algorithms. The AliCloudDenoise 2.6 MB algorithm library improves PESQ by 9.2% (English speech) and 3.9% (Chinese speech), and STOI by 0.4% (English speech) and 1.6% (Chinese speech), over the rival algorithm, showing a clear advantage in effect.

2. Evaluation results in the office scenario

Combining the real acoustic scenarios of real-time meetings, we evaluated the office scenario separately. The noise consisted of actual recordings of noisy real office scenes, and a total of 5.3 hours of noisy evaluation speech was constructed. The figure below compares the AliCloudDenoise 2.6 MB algorithm library with rival product 1, rival product 2, traditional algorithm 1 and traditional algorithm 2 on the SNR, P.563, PESQ and STOI indicators. The AliCloudDenoise 2.6 MB algorithm library shows significant advantages.


V. Summary and outlook

In the context of real-time communication, AI + audio processing still has many research directions to be explored and implemented. By integrating data-driven thinking with classical signal processing algorithms, we can upgrade the effectiveness of audio front-end algorithms (ANS, AEC, AGC), audio back-end algorithms (bandwidth extension, real-time vocal beautification, voice change, sound effects), audio codecs, and weak-network audio processing algorithms (PLC, NetEQ), providing the ultimate audio experience for Alibaba Cloud Video Cloud users.


[1] Wang D L, Chen J. Supervised speech separation based on deep learning: An overview[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(10): 1702-1726.




[5] Gannot S, Burshtein D, Weinstein E. Iterative and sequential Kalman filter-based speech enhancement algorithms[J]. IEEE Transactions on speech and audio processing, 1998, 6(4): 373-385.

[6] Kim J B, Lee K Y, Lee C W. On the applications of the interacting multiple model algorithm for enhancing noisy speech[J]. IEEE transactions on speech and audio processing, 2000, 8(3): 349-352.

[7] Ephraim Y, Van Trees H L. A signal subspace approach for speech enhancement[J]. IEEE Transactions on speech and audio processing, 1995, 3(4): 251-266.

[8] Ephraim Y, Malah D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator[J]. IEEE Transactions on acoustics, speech, and signal processing, 1984, 32(6): 1109-1121.

[9] Cohen I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging[J]. IEEE Transactions on speech and audio processing, 2003, 11(5): 466-475.

[10]Ciregan D, Meier U, Schmidhuber J. Multi-column deep neural networks for image classification[C]//2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012: 3642-3649.

[11]Graves A, Liwicki M, Fernández S, et al. A novel connectionist system for unconstrained handwriting recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2008, 31(5): 855-868.

[12] Senior A, Vanhoucke V, Nguyen P, et al. Deep neural networks for acoustic modeling in speech recognition[J]. IEEE Signal Processing Magazine, 2012.

[13] Sundermeyer M, Ney H, Schlüter R. From feedforward to recurrent LSTM neural networks for language modeling[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(3): 517-529.

[14]Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.

[15] Valin J M. A hybrid DSP/deep learning approach to real-time full-band speech enhancement[C]//2018 IEEE 20th international workshop on multimedia signal processing (MMSP). IEEE, 2018: 1-5.

[16] Wang Y, Narayanan A, Wang D L. On training targets for supervised speech separation[J]. IEEE/ACM transactions on audio, speech, and language processing, 2014, 22(12): 1849-1858.

[17] Erdogan H, Hershey J R, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]//2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015: 708-712.

[18] Williamson D S, Wang Y, Wang D L. Complex ratio masking for monaural speech separation[J]. IEEE/ACM transactions on audio, speech, and language processing, 2015, 24(3): 483-492.

[19] Recommendation I T U T. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs[J]. Rec. ITU-T P. 862, 2001.

[20] Taal C H, Hendriks R C, Heusdens R, et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech[C]//2010 IEEE international conference on acoustics, speech and signal processing. IEEE, 2010: 4214-4217.

[21] Nicolson A, Paliwal K K. Deep learning for minimum mean-square error approaches to speech enhancement[J]. Speech Communication, 2019, 111:44-55.



