In recent years, with the development of real-time communication technology, online meetings have gradually become an indispensable office tool. According to incomplete statistics, about 75% of online meetings are audio-only conferences that use neither the camera nor screen sharing, so voice quality and clarity are central to the online meeting experience.

Author | Qi Qi

Review | Taiyi

Preface

In real life, conference environments are highly diverse, ranging from open, noisy spaces to transient, non-stationary sounds such as keyboard clicks, all of which pose a great challenge to traditional front-end speech enhancement algorithms based on signal processing. At the same time, with the rapid development of data-driven methods, intelligent speech enhancement algorithms based on deep learning have gradually emerged in academia [1] and industry [2,3,4] and achieved good results. The AliCloudDenoise algorithm emerged in this context. Combining deep learning with traditional speech enhancement algorithms and iterating continuously, it has been optimized for real-time meeting scenarios in terms of noise suppression, performance, and cost, ultimately guaranteeing strong noise reduction while preserving high-fidelity speech, and providing an excellent voice conferencing experience for Alibaba Cloud's real-time video conferencing system.

The development of speech enhancement algorithms

Speech enhancement refers to techniques that filter out noise when clean speech is disturbed by the various noises of real-life scenes, so as to improve speech quality and intelligibility. Over the past decades, traditional single-channel speech enhancement algorithms developed rapidly; they fall mainly into time-domain and frequency-domain methods. Time-domain methods can be roughly divided into parametric filtering [5,6] and signal-subspace methods [7], while frequency-domain methods include spectral subtraction, Wiener filtering, and speech amplitude spectrum estimation based on the minimum mean-square error [8,9].
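
To make the frequency-domain family concrete, here is a minimal spectral-subtraction sketch (an illustrative example, not code from AliCloudDenoise); `noise_mag` stands for a noise magnitude estimate, e.g. averaged over non-speech frames:

```python
import numpy as np

def spectral_subtraction(noisy_mag: np.ndarray, noise_mag: np.ndarray,
                         floor: float = 0.02) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from the noisy one.

    Clamping to a small spectral floor limits the "musical noise" artifacts
    that plain subtraction produces; the noisy phase is reused when
    resynthesizing the time-domain signal.
    """
    clean_mag = noisy_mag - noise_mag
    return np.maximum(clean_mag, floor * noisy_mag)
```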

Traditional single-channel speech enhancement methods have the advantage of low computational cost, making real-time online enhancement possible, but their ability to suppress sudden non-stationary noise, such as a car horn sounding on the road, is poor: substantial residual noise remains after enhancement, degrading the subjective listening experience and even harming speech intelligibility. From the perspective of the mathematical derivation, traditional algorithms also rely on too many assumptions when solving for an analytic solution, which gives them an obvious performance ceiling and makes them hard to adapt to complex, changeable real scenes. Since 2016, deep learning methods have significantly improved the performance of many supervised learning tasks, such as image classification [10], handwriting recognition [11], automatic speech recognition [12], language modeling [13], and machine translation [14], and many deep learning methods have likewise emerged for speech enhancement.

Figure 1: Flow chart of a classical traditional single-channel speech enhancement system

Speech enhancement algorithms based on deep learning can be roughly divided into the following four categories according to different training objectives:

• Hybrid method based on traditional signal processing. Most of these algorithms replace one or more sub-modules of a traditional signal-processing-based speech enhancement algorithm with a neural network, generally without changing the overall processing pipeline; a typical representative is RNNoise [15].

• Speech enhancement based on time-frequency mask approximation (mask-based method). These algorithms train a neural network to predict a time-frequency mask and apply the predicted mask to the spectrum of the noisy input to reconstruct the clean speech signal.

Commonly used time-frequency masks include the IRM [16], PSM [17], and cIRM [18]; the error function used during training is shown in the following formula:
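
A representative form of this error function, assuming a mean-square error between the predicted mask $\hat{M}(t,f)$ and the target mask $M(t,f)$ over $T$ frames and $F$ frequency bins, is:

$$\mathcal{L}_{\text{mask}} = \frac{1}{TF}\sum_{t=1}^{T}\sum_{f=1}^{F}\left(\hat{M}(t,f)-M(t,f)\right)^{2}$$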

• Feature-mapping-based speech enhancement (mapping-based method). These algorithms train a neural network to map input features directly to target features; commonly used features include the amplitude spectrum, log-power spectrum, and complex spectrum. The error function used during training is shown in the following formula:
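
A representative form, assuming a mean-square error between the predicted feature $\hat{S}(t,f)$ (e.g., an amplitude spectrum) and the corresponding clean-speech feature $S(t,f)$, is:

$$\mathcal{L}_{\text{map}} = \frac{1}{TF}\sum_{t=1}^{T}\sum_{f=1}^{F}\left(\hat{S}(t,f)-S(t,f)\right)^{2}$$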

• End-to-end speech enhancement (end-to-end method). These algorithms push the data-driven idea to its limit: given a training set with a reasonable data distribution, they map the time-domain speech signal numerically end to end, without any frequency-domain transform. This has been one of the most active research directions in academia over the past two years.

The AliCloudDenoise speech enhancement algorithm

1. Algorithm principle

After weighing factors such as noise reduction effectiveness, performance cost, and real-time capability, the AliCloudDenoise speech enhancement algorithm adopts the hybrid method, taking the ratio of noise energy to target speech energy in the noisy speech as the fitting target. A traditional gain estimator, such as the minimum mean-square error short-time spectral amplitude estimator (MMSE-STSA), is then used to obtain the denoising gain in the frequency domain, and finally the enhanced time-domain speech signal is obtained by the inverse transform. For the network structure, considering both real-time performance and power consumption, RNN-style structures were abandoned in favor of a TCN (temporal convolutional network). The basic network structure is shown in the following figure:
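
To complement the figure, here is a minimal sketch of the gain stage just described (an illustration under stated assumptions, not the production code): if the network predicts the per-bin ratio $r$ of noise energy to target speech energy, the a priori SNR can be taken as $\xi = 1/r$, and the classic MMSE-STSA estimator [8] yields the frequency-domain gain:

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled Bessel functions

def mmse_stsa_gain(xi: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Classic MMSE-STSA gain (Ephraim & Malah [8]).

    xi:    a priori SNR per time-frequency bin (here taken as 1/r, with r the
           network-predicted noise-to-speech energy ratio -- an assumption).
    gamma: a posteriori SNR per bin (|noisy|^2 / noise power estimate).
    """
    gamma = np.maximum(gamma, 1e-8)          # avoid division by zero
    v = xi / (1.0 + xi) * gamma
    # i0e/i1e are exp(-x)-scaled, so the exp(-v/2) factor of the textbook
    # formula is absorbed and the expression stays numerically stable.
    gain = (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * (
        (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0))
    return np.minimum(gain, 1.0)             # never amplify a bin

# enhanced_spec = gain * noisy_spec; the time-domain signal follows by iSTFT.
```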

2. Algorithm optimization in real-time conference scenarios

1. What if there are many people talking nearby and it is noisy?

Problem background

In real-time conference scenes, babble noise, i.e., background noise composed of several speakers talking at once, is common. This kind of noise is not only non-stationary but also similar in composition to the target speech that the enhancement algorithm is trying to preserve, which makes it especially difficult to suppress. Here is a specific example:

Problem analysis and improvement plan

After analyzing dozens of hours of office-scene audio containing babble noise and considering the mechanics of human speech, we found that this kind of noise tends to persist stably over long periods. As is well known, contextual information is very important to the effectiveness of speech enhancement algorithms, so for a strongly contextual noise type like babble noise, the AliCloudDenoise algorithm aggregates features at key stages of the model through dilated convolutions, explicitly enlarging the receptive field, and additionally incorporates gating mechanisms. The improved model handles babble noise noticeably better. The figure below compares the key parts of the model before (TCN) and after (GaTCN) the improvement.
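
The article does not spell out the GaTCN internals, but a gated, dilated residual block of the kind described, sketched in PyTorch with hypothetical layer names and sizes, could look like this:

```python
import torch
import torch.nn as nn

class GatedDilatedBlock(nn.Module):
    """Illustrative gated TCN block: a dilated causal conv enlarges the
    receptive field; a sigmoid gate modulates which features pass through."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad => causal conv
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); pad only on the left (past frames)
        xp = nn.functional.pad(x, (self.pad, 0))
        out = torch.tanh(self.filt(xp)) * torch.sigmoid(self.gate(xp))
        return x + out  # residual connection keeps deep stacks trainable

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth, aggregating the long context that babble noise
# requires.
```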

Results on the speech test set show that under the IRM target, GaTCN improves speech quality (PESQ [19]) by 9.7% and speech intelligibility (STOI [20]) by 3.4% relative to TCN. Under the mapped a priori SNR target [21], GaTCN improves PESQ by 7.1% and STOI by 2.0% relative to the TCN model, outperforming all baseline models. For details, see Table 1 and Table 2.

Table 1: PESQ (objective speech quality) comparison details

Table 2: STOI (objective speech intelligibility) comparison details

Demonstration of the improved effect:

2. What if words are swallowed or dropped at critical moments?

Problem background

In speech enhancement, swallowed or vanished words, such as sentence endings disappearing, are an important factor degrading the subjective quality of enhanced speech. In real-time meeting scenarios, given the diversity of languages involved and of speakers' content, the phenomenon is even more common. A specific example is given below:

Problem analysis and improvement plan

On a speech test set of over 10,000 utterances constructed by category, we statistically analyzed the occurrences of swallowed and dropped words after enhancement and visualized the corresponding frequency-domain characteristics. The phenomenon was found to occur mainly on a few specific phonemes or words, such as unvoiced, repeated, and prolonged sounds. Classifying by SNR further showed that swallowed and dropped words increase significantly at low SNR. Accordingly, the following three improvements were made:

• Data level: First, the distribution of the specific phonemes in the training data set was measured. Having concluded that their proportion was relatively small, we enriched the corresponding speech components in the training data set.

• Noise reduction strategy level: For the low-SNR case, a combined noise reduction strategy is used under certain conditions, namely traditional noise reduction first, followed by AliCloudDenoise. Its disadvantages are twofold: first, combined noise reduction increases the algorithm's cost; second, traditional noise reduction inevitably introduces spectrum-level quality damage, reducing overall sound quality. This method does alleviate swallowed and dropped words, but because of these obvious drawbacks it is not used online.

• Training strategy level: After the targeted enrichment of speech components in the training data set, swallowed and dropped words were indeed reduced, but the phenomenon persisted. Further analysis found that the spectral characteristics of the affected sounds are highly similar to those of some noises, making local convergence of the network training difficult. Based on this, the AliCloudDenoise algorithm additionally outputs a speech presence probability (SPP) as an auxiliary target during training, while dropping this branch at inference time. The SPP formula is as follows:
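
As a representative closed form (one common choice in the literature, not necessarily the exact formula used in AliCloudDenoise), the speech presence probability can be written in terms of the a posteriori SNR $\gamma(t,f)$ and a fixed a priori SNR $\xi_{H_1}$ assumed when speech is present:

$$P\left(H_1 \mid Y(t,f)\right)=\left(1+\left(1+\xi_{H_1}\right)\exp\left(-\frac{\gamma(t,f)\,\xi_{H_1}}{1+\xi_{H_1}}\right)\right)^{-1}$$

where $H_1$ denotes the speech-presence hypothesis and $Y(t,f)$ the noisy spectral coefficient.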

Results on the speech test set show that the proposed dual-output auxiliary training strategy improves PESQ by 3.1% and STOI by 1.2% under the IRM target compared with the original model. Under the mapped a priori SNR target, PESQ improves by 4.0% and STOI by 0.7% over the original model, outperforming all baseline models. For details, see Table 3 and Table 4.

Table 3: PESQ (objective speech quality) comparison details

Table 4: STOI (objective speech intelligibility) comparison details

Demonstration of the improved effect:

3. How to make the algorithm applicable to a wider range of devices

In real-time conference scenarios, AliCloudDenoise typically runs on PCs, mobile devices, and Internet of Things (IoT) devices, where CPU usage, memory capacity, bandwidth, and power consumption are all key performance indicators. To make AliCloudDenoise widely available to all business parties, we applied a series of efficiency optimizations, including model structure pruning, a resource-adaptive strategy, weight quantization, and quantization-aware training. With some auxiliary convergence strategies, we finally obtained an intelligent speech enhancement model of about 500 KB with only a 0.1% reduction in accuracy, greatly expanding the application range of the AliCloudDenoise algorithm.

Next, we briefly review the model lightweighting techniques involved in the optimization process, then introduce the resource-adaptive strategy and model quantization, and finally give the key energy consumption indicators of the AliCloudDenoise algorithm.

1. Model lightweighting techniques adopted

Lightweighting techniques for deep learning models generally refer to a series of technical means for optimizing a model's "operating costs", such as parameter count and size, computation, energy consumption, and speed, with the aim of facilitating deployment on various hardware devices. Lightweighting also has wide application in computation-intensive cloud services, where it helps reduce service cost and increase speed.

The main difficulty of lightweighting is that the algorithm's effectiveness, generalization, and stability must not be significantly affected while operating costs are optimized. This is difficult in every respect for the common "black box" neural network model. In addition, part of the difficulty also lies in the divergence of optimization goals.

For example, reducing model size does not necessarily reduce computation; reducing computation does not necessarily increase running speed; and increasing speed does not necessarily reduce energy consumption. These differences make it difficult for any single lightweighting technique to solve all performance problems in one package, so a comprehensive reduction of operating costs must be achieved by combining multiple perspectives and techniques.

Common lightweighting techniques in academia and industry currently include parameter/operation quantization, pruning, compact modules, structural hyperparameter optimization, distillation, low-rank decomposition, weight sharing, etc. These techniques differ in purpose and requirements. For example, parameter quantization can reduce the storage space a model occupies, but values are still restored to floating point at run time; global parameter-plus-operation quantization can simultaneously reduce parameter volume and chip computation, but the chip needs corresponding arithmetic units to realize the speedup. Knowledge distillation uses a small student network to learn the high-level features of a large model, obtaining a lightweight model with matching performance, but optimization is difficult and the approach mainly suits tasks with a simplified output (such as classification).

Unstructured fine-grained pruning can eliminate the most redundant parameters and achieves excellent compression, but requires dedicated hardware support to actually reduce computation. Weight sharing can significantly reduce model size, but makes acceleration and energy saving difficult. AutoML structural hyperparameter search can automatically determine the optimal model structure from small-scale test results, but its applicability is limited by the complexity of the search space and the quality of the iterative estimates. The following figure shows the main lightweighting techniques used by the AliCloudDenoise algorithm in its efficiency optimization.

2. Resource-adaptive strategy

The core of the resource-adaptive strategy is that when resources are insufficient, the model adaptively outputs a lower-precision but still acceptable result, doing the best it can, and when resources are sufficient it outputs the optimally accurate enhanced result. The most direct way to achieve this is to train models of different sizes and use whichever a device needs, but that incurs additional storage cost. The AliCloudDenoise algorithm instead uses a hierarchical training scheme, as shown in the following figure:

The outputs of intermediate layers are also produced, and training is constrained jointly through a combined loss. However, two problems emerged in actual verification:

• Features extracted by the shallow layers are relatively basic, so the enhancement produced by the shallow network is poor.

• After intermediate-layer output structures are added, the enhancement result of the final layer degrades, because joint training expects the shallow layers to also produce reasonably good enhancement results, which disturbs the feature distribution learned by the original network structure.

To solve these two problems, we adopted an optimization strategy of multi-scale dense connections and offline hyperparameter pruning, which ensures that the model can dynamically output speech enhancement results on demand with an accuracy spread of less than 3.2%.
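
A rough sketch of the multi-exit joint training described above (the exit weights and the MSE loss are hypothetical, not the production configuration):

```python
import torch
import torch.nn.functional as F

def joint_multi_exit_loss(exits: list, target: torch.Tensor,
                          weights=(0.2, 0.3, 0.5)) -> torch.Tensor:
    """Jointly constrain every exit of a hierarchically trained model.

    exits:  per-depth enhancement outputs, ordered shallow -> deep, so that
            at run time the model can stop early when resources are tight.
    target: the clean training target shared by all exits.
    """
    return sum(w * F.mse_loss(out, target) for w, out in zip(weights, exits))
```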

3. Model quantization

To optimize the memory capacity and bandwidth the model requires, the MNN team's weight quantization tool [22] and Python offline quantization tool [23] are mainly used to convert between FP32 and INT8. The scheme is illustrated below:
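
The MNN tools handle the actual conversion; the underlying idea, sketched here as simple symmetric per-tensor quantization (an illustration, not MNN's implementation), is:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map FP32 weights to INT8 with a single symmetric scale factor."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)      # guard all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale  # store INT8 weights plus one FP32 scale per tensor

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights at load/compute time."""
    return q.astype(np.float32) * scale
```

Storing one byte per weight instead of four shrinks the model roughly fourfold, which is where most of the memory and bandwidth savings come from.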

4. Key energy consumption indicators of the AliCloudDenoise algorithm

As shown in the figure above, the rival algorithm's library is 14 MB, while AliCloudDenoise currently ships 524 KB, 912 KB, and 2.6 MB versions, a significant size advantage. In terms of running cost, tests on the Mac platform showed 3.4% CPU usage for the rival algorithm, versus 1.1% for the 524 KB AliCloudDenoise library, 1.3% for the 912 KB library, and 2.7% for the 2.6 MB library, a clear advantage for AliCloudDenoise, especially under long-running conditions.

4. Evaluation results of the algorithm's technical indicators

The evaluation of the speech enhancement effect of the AliCloudDenoise algorithm focuses on two scenarios: a general scenario and an office meeting scenario.

1. Evaluation results in general scenarios

In the general-scenario test set, the speech data consist of Chinese and English parts (about 5,000 utterances in total), and the noise data contain four typical noise types: stationary noise, non-stationary noise, office babble noise, and outdoor noise. The noise is mixed at SNRs between -5 and 15 dB. The objective indicators are PESQ for speech quality and STOI for speech intelligibility; for both, larger values indicate better enhanced speech.
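
For reference, both metrics can be computed with the open-source `pesq` and `pystoi` Python packages (our sketch, not necessarily the authors' evaluation code); the file names are placeholders:

```python
import soundfile as sf
from pesq import pesq    # ITU-T P.862 PESQ [19]
from pystoi import stoi  # short-time objective intelligibility [20]

clean, fs = sf.read("clean.wav")       # reference speech (placeholder path)
enhanced, _ = sf.read("enhanced.wav")  # output of the enhancement algorithm

print("PESQ:", pesq(fs, clean, enhanced, "wb"))  # "wb" mode assumes fs=16 kHz
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```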

As shown in the following table, the 524 KB AliCloudDenoise library outperforms the traditional algorithm in PESQ by 39.4% (English) and 48.4% (Chinese), and in STOI by 21.4% (English) and 23.1% (Chinese), roughly matching the rival algorithms. The AliCloudDenoise 2.6 MB library improves on the rival algorithms by 9.2% (English) and 3.9% (Chinese) in PESQ, and by 0.4% (English) and 1.6% (Chinese) in STOI.

2. Evaluation results in the office scenario

Combined with the business acoustic scene of real-time meetings, we evaluated the office scenario separately. The noise was babble noise recorded in a real office, and about 5.3 hours of noisy evaluation speech was constructed in total. The following figure compares the AliCloudDenoise 2.6 MB library with rival 1, rival 2, traditional 1, and traditional 2 on SNR, P.563, PESQ, and STOI; the clear advantage of the 2.6 MB library is visible.

Future

In real-time communication scenarios, there are still many research directions for AI + audio processing to be explored and put into production. By integrating data-driven ideas with classical signal processing algorithms, we can upgrade audio front-end algorithms (ANS, AEC, AGC), audio back-end algorithms (bandwidth extension, real-time bel canto, voice change, sound effects), audio codecs, and weak-network audio processing algorithms (PLC, NetEQ), providing users of Alibaba Cloud Video Cloud with the ultimate audio experience.

References

[1] Wang D L, Chen J. Supervised speech separation based on deep learning: An overview[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(10): 1702-1726.
[2] venturebeat.com/2020/04/09/…
[3] venturebeat.com/2020/06/08/…
[4] medialab.qq.com/#/projectTe…
[5] Gannot S, Burshtein D, Weinstein E. Iterative and sequential Kalman filter-based speech enhancement algorithms[J]. IEEE Transactions on Speech and Audio Processing, 1998, 6(4): 373-385.
[6] Kim J B, Lee K Y, Lee C W. On the applications of the interacting multiple model algorithm for enhancing noisy speech[J]. IEEE Transactions on Speech and Audio Processing, 2000, 8(3): 349-352.
[7] Ephraim Y, Van Trees H L. A signal subspace approach for speech enhancement[J]. IEEE Transactions on Speech and Audio Processing, 1995, 3(4): 251-266.
[8] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 1109-1121.
[9] Cohen I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging[J]. IEEE Transactions on Speech and Audio Processing, 2003, 11(5): 466-475.
[10] Ciregan D, Meier U, Schmidhuber J. Multi-column deep neural networks for image classification[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
[11] Graves A, Liwicki M, Fernandez S, et al. A novel connectionist system for unconstrained handwriting recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 31(5): 855-868.
[12] Senior A, Vanhoucke V, Nguyen P, et al. Deep neural networks for acoustic modeling in speech recognition[J]. IEEE Signal Processing Magazine, 2012.
[13] Sundermeyer M, Ney H, Schlüter R. From feedforward to recurrent LSTM neural networks for language modeling[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(3): 517-529.
[14] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems. 2014: 3104-3112.
[15] Valin J M. A hybrid DSP/deep learning approach to real-time full-band speech enhancement[C]//2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2018: 1-5.
[16] Wang Y, Narayanan A, Wang D L. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858.
[17] Erdogan H, Hershey J R, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]//2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015: 708-712.
[18] Williamson D S, Wang Y, Wang D L. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 483-492.
[19] ITU-T Recommendation P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs[S]. 2001.
[20] Taal C H, Hendriks R C, Heusdens R, et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech[C]//2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010: 4214-4217.
[21] Nicolson A, Paliwal K K. Deep learning for minimum mean-square error approaches to speech enhancement[J]. Speech Communication, 2019, 111: 44-55.
[22] www.yuque.com/mnn/cn/mode…
[23] github.com/alibaba/MNN…
