Recently, the paper "Time-Frequency Attention for Monaural Speech Enhancement," co-authored by the Alibaba Cloud Video Cloud audio technology team and Professor Haizhou Li of the National University of Singapore, was accepted by ICASSP 2022, and the team was invited to present the research to academia and industry at the conference in May. ICASSP (International Conference on Acoustics, Speech and Signal Processing) is the world's largest and most comprehensive conference in the field of speech, spanning signal processing, statistical learning, and wireless communication.

Qi Qi | Author

In this collaborative paper, we propose a T-F attention (TFA) module that incorporates the distribution characteristics of speech energy and significantly improves objective speech enhancement metrics while adding almost no parameters.

Arxiv Link: arxiv.org/abs/2111.07…

Review of previous research findings:

INTERSPEECH 2021: Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement

Links:

www.isca-speech.org/archive/pdf…

1. Background

Speech enhancement algorithms are designed to remove unwanted signal components, such as background noise, from a speech signal. They are a basic building block of many speech processing applications, such as online video conferencing and calls, intelligent short-video editing, real-time video streaming, social entertainment, and online education.

2. Overview of the paper

At present, most supervised learning algorithms for speech enhancement do not explicitly consider the energy distribution of speech in the time-frequency (T-F) representation, which is crucial for accurately predicting the mask or spectrum. In this paper, we propose a simple and efficient T-F attention (TFA) module that explicitly introduces prior knowledge of the speech distribution characteristics into the modeling process. To verify the effectiveness of the proposed TFA module, we use a residual temporal convolutional network (ResTCN) as the base model and experiment with two training targets commonly used in the field of speech enhancement: the ideal ratio mask (IRM) [1] and the phase-sensitive mask (PSM) [2]. Our experimental results show that the proposed TFA module significantly improves five commonly used objective evaluation metrics with almost no additional parameters, and that the ResTCN+TFA model consistently outperforms the other baseline models by a large margin.
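For reference, these two training targets are standard in the literature. With $S(t,f)$, $N(t,f)$, and $Y(t,f)$ denoting the short-time Fourier transforms of the clean speech, the noise, and the noisy mixture, they are commonly defined as

$$\mathrm{IRM}(t,f)=\left(\frac{|S(t,f)|^{2}}{|S(t,f)|^{2}+|N(t,f)|^{2}}\right)^{1/2},\qquad \mathrm{PSM}(t,f)=\frac{|S(t,f)|}{|Y(t,f)|}\cos\bigl(\theta_{S}(t,f)-\theta_{Y}(t,f)\bigr),$$

where $\theta_{S}$ and $\theta_{Y}$ are the clean and noisy phase spectra, and the PSM is typically truncated to $[0, 1]$ for mask estimation.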

3. Method analysis

Figure 1 shows the network structure of the proposed TFA module, in which the TA and FA modules are marked by black and blue dotted boxes respectively. AvgPool and Conv1D stand for average pooling and 1-D convolution respectively; ⊗ and ⊙ denote matrix multiplication and element-wise multiplication respectively.

Figure 1: Network structure of the proposed TFA module

The TFA module takes the transformed time-frequency representation $X \in \mathbb{R}^{T \times F}$ as input. Two independent branches produce a 1-D time-frame attention map $A_{\mathrm{TA}} \in \mathbb{R}^{T \times 1}$ and a 1-D frequency-dimension attention map $A_{\mathrm{FA}} \in \mathbb{R}^{1 \times F}$, which are then fused into the final 2-D T-F attention map. The final result can be written as

$$A_{\mathrm{TFA}} = A_{\mathrm{TA}} \otimes A_{\mathrm{FA}}, \qquad \tilde{X} = X \odot A_{\mathrm{TFA}},$$

where ⊗ denotes matrix multiplication and ⊙ element-wise multiplication, as in Figure 1.
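To make the data flow concrete, below is a minimal PyTorch sketch of such a two-branch attention module. The class name, hidden width, and kernel sizes are illustrative assumptions for this post, not the exact hyper-parameters of the paper.

```python
import torch
import torch.nn as nn

class TFA(nn.Module):
    """Minimal sketch of a time-frequency attention (TFA) module.

    Hidden width and kernel size are illustrative assumptions,
    not the paper's exact configuration.
    """
    def __init__(self, hidden: int = 16, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # TA branch: attends over time frames from a pooled 1-D descriptor.
        self.ta = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size, padding=pad),
            nn.Sigmoid(),
        )
        # FA branch: attends over frequency bins, same structure.
        self.fa = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size, padding=pad),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq) T-F representation.
        # Average-pool across frequency -> per-frame descriptor (b, 1, T).
        time_desc = x.mean(dim=2, keepdim=True).transpose(1, 2)
        a_ta = self.ta(time_desc)                 # (b, 1, T) time attention
        # Average-pool across time -> per-bin descriptor (b, 1, F).
        freq_desc = x.mean(dim=1, keepdim=True)
        a_fa = self.fa(freq_desc)                 # (b, 1, F) frequency attention
        # Outer product fuses the two 1-D maps into a 2-D T-F map (b, T, F).
        a_tfa = torch.bmm(a_ta.transpose(1, 2), a_fa)
        return x * a_tfa                          # element-wise re-weighting

# Example: re-weight a batch of 4 spectrogram-like inputs
# with 100 frames and 257 frequency bins.
tfa = TFA()
x = torch.randn(4, 100, 257)
y = tfa(x)   # same shape: (4, 100, 257)
```

Because each branch operates only on a pooled 1-D descriptor with small 1-D convolutions, the module adds just a handful of parameters relative to the backbone, which matches the paper's observation that the parameter increment is negligible.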

4. Experimental results

Training error curves

Figures 2 and 3 show the training and validation error curves generated over 150 epochs of training for each model. Compared with ResTCN, the training and validation errors of ResTCN with the proposed TFA (ResTCN+TFA) are significantly reduced, which confirms the effectiveness of the TFA module. Meanwhile, compared with ResTCN+SA and MHANet, ResTCN+TFA achieves the lowest training and validation errors and shows a clear advantage. Among the three baseline models, MHANet performs best, and ResTCN+SA is superior to ResTCN. In addition, the comparison among ResTCN, ResTCN+FA, and ResTCN+TA demonstrates the efficacy of the TA and FA modules.

Figure 2: Training error curves under the IRM training target

Figure 3: Training error curves under the PSM training target

Evaluation with objective speech enhancement metrics

We used five metrics to evaluate enhancement performance: wideband perceptual evaluation of speech quality (PESQ) [3], extended short-time objective intelligibility (ESTOI) [4], and three composite mean opinion score (MOS) predictors [5] of signal distortion (CSIG), background-noise intrusiveness (CBAK), and overall signal quality (COVL).
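As an aside, for readers who want to compute the first two metrics themselves, the third-party Python packages pesq and pystoi are one common option. This is a convenience sketch with placeholder signals, not the paper's evaluation code.

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

fs = 16000                                  # sample rate in Hz (wideband)
ref = np.random.randn(fs * 3)               # placeholder: 3 s clean reference
deg = ref + 0.1 * np.random.randn(fs * 3)   # placeholder: enhanced signal

pesq_wb = pesq(fs, ref, deg, 'wb')          # wideband PESQ [3]
estoi = stoi(ref, deg, fs, extended=True)   # ESTOI [4]
print(f"PESQ (wb): {pesq_wb:.2f}, ESTOI: {estoi:.3f}")
```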

Tables 1 and 2 show the average PESQ and ESTOI scores at each SNR level (over four noise sources) respectively. The results show that the proposed ResTCN+TFA consistently achieves significant improvements over ResTCN in terms of PESQ and ESTOI under both IRM and PSM, with a negligible parameter increment, which proves the effectiveness of the TFA module. Specifically, at 5 dB under the IRM training target, ResTCN+TFA improves over the baseline ResTCN by 0.18 in PESQ and by 4.94% in ESTOI. Compared to MHANet and ResTCN+SA, ResTCN+TFA performs best in all cases and shows a clear performance advantage. Among the three baseline models, the overall ranking is MHANet > ResTCN+SA > ResTCN. Meanwhile, ResTCN+FA and ResTCN+TA both improve considerably over ResTCN, which further confirms the effectiveness of the FA and TA modules.

Table 3 lists the average CSIG, CBAK, and COVL scores over all test conditions. Consistent with the trend observed in Tables 1 and 2, the proposed ResTCN+TFA significantly outperforms ResTCN on all three metrics and performs best among all models. Specifically, compared with ResTCN, ResTCN+TFA under the PSM training target improves CSIG, CBAK, and COVL by 0.21, 0.12, and 0.18 respectively.

About the Alibaba Cloud Video Cloud audio technology team

The Alibaba Cloud Video Cloud audio technology team focuses on end-to-end audio technology spanning capture, playback, analysis, processing, and transmission, serving real-time communication, live streaming, video-on-demand, media production, media processing, and short- and long-form video businesses. By combining neural networks with traditional signal processing, the team continues to refine industry-leading 3A technology, deepen its work on device management and adaptation and QoS technology, and steadily improve the live-streaming and real-time audio communication experience across scenarios.


References

[1] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, 2014.

[2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. ICASSP, 2015, pp. 708–712.

[3] ITU-T Recommendation P.862.2, "Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs," International Telecommunication Union.

[4] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.

[5] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2008.

