This article summarizes eight of the 25 papers on speech emotion analysis from INTERSPEECH 2020.

This article is written by Tython and covers papers from INTERSPEECH 2020.

1. Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition (INTERSPEECH 2020)

(1) Data processing: four-class classification on IEMOCAP, leave-one-speaker-out, unweighted accuracy. openSMILE extracts 147-dimensional LLD features from short frames.

(2) Model method: an LSTM models the feature sequences of the multiple segments in an utterance. The output feature sequence is compressed by NetVLAD clustering, reducing the original N*D dimensions to K*D, and softmax then classifies the reduced features. The author also uses a label-smoothing strategy, i.e., adding mismatched (x, y) data pairs during training, also called label dropout: the true label is dropped and replaced with another label that is assigned a low weight. This improves the model's generalization and reduces overfitting (a rough sketch follows).
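
As a rough illustration of the label-smoothing idea described above, here is a minimal PyTorch sketch of a smoothed cross-entropy loss; the exact smoothing/label-dropout scheme and the smoothing weight used in the paper are assumptions.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, smoothing=0.1):
    """Turn hard class indices into smoothed one-hot targets: the true class
    keeps 1 - smoothing, the other classes share `smoothing` uniformly."""
    confidence = 1.0 - smoothing
    targets = torch.full((labels.size(0), num_classes), smoothing / (num_classes - 1))
    targets.scatter_(1, labels.unsqueeze(1), confidence)
    return targets

def label_smoothing_loss(logits, labels, smoothing=0.1):
    """Cross-entropy computed against the smoothed targets."""
    log_probs = F.log_softmax(logits, dim=-1)
    targets = smoothed_targets(labels, logits.size(-1), smoothing)
    return -(targets * log_probs).sum(dim=-1).mean()

# toy usage: a batch of 4 utterance-level logits over 4 emotion classes
logits = torch.randn(4, 4)
labels = torch.tensor([0, 2, 1, 3])
print(label_smoothing_loss(logits, labels).item())
```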

(3) NetVLAD is derived from VLAD, an image feature extraction method. It clusters the feature vectors, obtains cluster centers, and accumulates the residuals to those centers, compressing many local features into a global feature of a fixed size. See zhuanlan.zhihu.com/p/96718053 for details.
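
Below is a minimal PyTorch sketch of a NetVLAD-style pooling layer that compresses N frame features of dimension D into a K*D descriptor; the feature dimension, number of clusters, and normalization details are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Compress an N x D sequence of frame features into a K x D descriptor by
    soft-assigning frames to K learnable cluster centers and accumulating the
    residuals to each center."""
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x):                               # x: (batch, N, D)
        soft = F.softmax(self.assign(x), dim=-1)        # (batch, N, K)
        # residual of every frame to every centroid: (batch, N, K, D)
        resid = x.unsqueeze(2) - self.centroids.unsqueeze(0).unsqueeze(0)
        vlad = (soft.unsqueeze(-1) * resid).sum(dim=1)  # (batch, K, D)
        vlad = F.normalize(vlad, p=2, dim=-1)           # intra-normalization
        return F.normalize(vlad.flatten(1), p=2, dim=-1)  # (batch, K*D)

# toy usage: 60 LSTM output frames of dimension 128, 8 clusters
feats = torch.randn(2, 60, 128)
pooled = NetVLAD(dim=128, num_clusters=8)(feats)
print(pooled.shape)  # torch.Size([2, 1024])
```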

(4) Experiment: NetVLAD can be regarded as a pooling method; the final WA reaches 62.6%, 2.3 percentage points higher than weight-pooling. Without and with label smoothing, accuracy is 59.6% and 62% respectively, a difference of about two percentage points.

(5) Summary: the main contribution is applying a NetVLAD-style pooling operation to per-frame features to select the useful ones; in addition, label smoothing is introduced to improve the training procedure.

2. Removing Bias with Residual Mixture of Multi-view Attention for Speech Emotion Recognition (INTERSPEECH 2020)

(1) Data processing: classification on IEMOCAP, with Sessions 1-4 for training and Session 5 for testing. 23-dimensional log-mel filterbank features are extracted.

(2) Model method: an utterance is divided into N frames that are fed in turn into a BLSTM (hidden layer of 512 nodes), producing an N*1024 matrix, which goes into the first attention layer. The output of this layer is combined with the original matrix and fed into three Attention_i_Layer_2 layers. These three attention layers are independent and weighted by the hyperparameter gamma. The three outputs are summed, passed through a fully connected layer (1024 nodes), and finally classified by a softmax layer (a rough sketch follows).
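
A simplified sketch of such a two-stage multi-view attention stack is given below; the additive attention form, the way the first-stage context is combined with the frame matrix, and the gamma weighting are my assumptions, not the paper's exact MOMA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAttention(nn.Module):
    """Simple additive attention that pools a (batch, N, D) sequence into (batch, D)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                       # h: (batch, N, D)
        w = F.softmax(self.score(h), dim=1)     # attention weights over frames
        return (w * h).sum(dim=1)               # (batch, D)

class MultiViewAttentionSER(nn.Module):
    """BLSTM -> first attention -> three parallel second-stage attentions whose
    outputs are mixed with weights gamma, then FC + softmax classification."""
    def __init__(self, feat_dim=23, hidden=512, num_classes=4, gammas=(1.0, 1.0, 1.0)):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden                          # 1024 with hidden=512
        self.attn1 = FrameAttention(d)
        self.attn2 = nn.ModuleList(FrameAttention(d) for _ in range(3))
        self.gammas = gammas
        self.fc = nn.Linear(d, 1024)
        self.out = nn.Linear(1024, num_classes)

    def forward(self, x):                       # x: (batch, N, feat_dim)
        h, _ = self.blstm(x)                    # (batch, N, 1024)
        ctx1 = self.attn1(h)                    # (batch, 1024)
        # combine the first-stage context with the original frame matrix
        h2 = h + ctx1.unsqueeze(1)              # (batch, N, 1024)
        mixed = sum(g * a(h2) for g, a in zip(self.gammas, self.attn2))
        return self.out(F.relu(self.fc(mixed)))

# toy usage: two utterances, 100 frames of 23-dim log-mel filterbanks
logits = MultiViewAttentionSER()(torch.randn(2, 100, 23))
print(logits.shape)  # torch.Size([2, 4])
```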

(3) Experiment: WA and UA are used as evaluation metrics, but UA is defined incorrectly in the paper; what it calls UA is actually WA, and the definition of WA is also questionable. The reported UA of 80.5% is actually segment-level accuracy; no utterance-level accuracy is given, which is something of an evaluation trick.

(4) Conclusion: the paper's main innovation is applying multiple attention operations to the BLSTM features; this MOMA module brings a notable improvement. However, the improvement is only reflected in segment-level accuracy, so it is of limited reference value.

3. Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition

(1) Data processing: leave-one-speaker-out classification on IEMOCAP. Spectral features are extracted with an STFT using a Hamming window, with window lengths of 20 ms, 40 ms, and 10 ms.

(2) Model method: the same spectrogram is fed into two branches. One branch goes into the domain-aware attention module (time pooling, channel pooling, and a fully connected layer). The other branch goes into the emotion module after time pooling and a channel-wise fully connected layer (fully connected across all channels). The domain module outputs a vector that is turned into a diagonal matrix and multiplied by the output matrix of the emotion module, so that the domain information is integrated into the emotion embedding. Finally, multi-task learning is applied with separate domain and emotion losses. "Domain" here does not refer to different data domains but to side information such as gender and age (a rough sketch follows).
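
The sketch below illustrates the general idea under simplifying assumptions (the toy frontend, pooling choices, layer sizes, and the sigmoid domain vector are mine): a domain branch produces a channel-weight vector that acts as a diagonal matrix on the emotion branch's output, and the two heads are trained jointly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareSER(nn.Module):
    """Two branches over the same spectrogram: a domain branch (time/channel
    pooling + FC) yields a channel-weight vector, and an emotion branch is
    rescaled by that vector (equivalent to multiplying by a diagonal matrix).
    Both heads are trained jointly in a multi-task fashion."""
    def __init__(self, n_mels=64, channels=32, num_emotions=4, num_domains=2):
        super().__init__()
        self.frontend = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.domain_fc = nn.Linear(channels, channels)
        self.domain_head = nn.Linear(channels, num_domains)   # e.g. gender
        self.emotion_fc = nn.Linear(channels * n_mels, 256)
        self.emotion_head = nn.Linear(256, num_emotions)

    def forward(self, spec):                       # spec: (batch, 1, n_mels, T)
        feat = F.relu(self.frontend(spec))         # (batch, C, n_mels, T)
        # domain branch: pool over time and frequency, then FC
        d = feat.mean(dim=(2, 3))                  # (batch, C)
        d = torch.sigmoid(self.domain_fc(d))       # channel weights in (0, 1)
        # emotion branch: pool over time, rescale channels by diag(d)
        e = feat.mean(dim=3)                       # (batch, C, n_mels)
        e = e * d.unsqueeze(-1)                    # same as multiplying by diag(d)
        e = F.relu(self.emotion_fc(e.flatten(1)))  # (batch, 256)
        return self.emotion_head(e), self.domain_head(d)

# toy multi-task loss on a fake batch
model = DomainAwareSER()
spec = torch.randn(2, 1, 64, 100)
emo_logits, dom_logits = model(spec)
loss = F.cross_entropy(emo_logits, torch.tensor([0, 3])) \
     + F.cross_entropy(dom_logits, torch.tensor([1, 0]))
print(loss.item())
```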

(3) Experiment: WA reaches 73.02% and UA reaches 65.86%; the main weakness is inaccurate classification of the Happy class. Compared with single-task emotion recognition, multi-task learning improves WA by 3% and UA by 9%.

(4) Conclusion: this paper is essentially multi-task learning used to improve emotion classification.

4. Speech Emotion Recognition with Discriminative Feature Learning

(1) Data processing: classification on IEMOCAP, train:validation:test = 0.55:0.25:0.2. Every utterance is cut or padded to 7.5 s, and 40-dimensional log-mel filterbank LLD features are extracted with a window length of 25 ms and a window shift of 10 ms.

(2) Model method: the input spectrogram is passed through six CNN blocks to extract features; an LSTM then models the sequence, an attention module weights the LSTM sequence, and a fully connected layer with softmax performs the classification (a rough sketch follows).
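
A minimal sketch of a CNN-LSTM-attention classifier of this kind is shown below, using only two CNN blocks and arbitrary layer sizes for brevity; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNLSTMAttention(nn.Module):
    """Stacked CNN blocks over the log-mel spectrogram, an LSTM over the
    resulting frame sequence, attention pooling, and a softmax classifier."""
    def __init__(self, n_mels=40, num_classes=4, channels=32, hidden=128):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(),
                                 nn.MaxPool2d((2, 1)))   # pool frequency only
        self.cnn = nn.Sequential(block(1, channels), block(channels, channels))
        self.lstm = nn.LSTM(channels * (n_mels // 4), hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, spec):                      # spec: (batch, 1, n_mels, T)
        f = self.cnn(spec)                        # (batch, C, n_mels/4, T)
        f = f.permute(0, 3, 1, 2).flatten(2)      # (batch, T, C * n_mels/4)
        h, _ = self.lstm(f)                       # (batch, T, hidden)
        w = F.softmax(self.attn(h), dim=1)        # attention over time
        return self.out((w * h).sum(dim=1))       # (batch, num_classes)

# toy usage: 7.5 s of 40-dim log-mel frames at a 10 ms shift ~ 750 frames
print(CNNLSTMAttention()(torch.randn(2, 1, 40, 750)).shape)  # torch.Size([2, 4])
```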

(3) Experiment: UA reaches 62.3%, lower than the baseline (67.4%), but the paper's focus is that the model is light (fewer than 360K parameters) and fast. It also verifies that additive margin softmax loss, focal loss, and attention pooling perform similarly, reaching about 66% (an illustrative loss sketch follows).
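
As an illustration of one of the losses compared there, here is a minimal focal-loss implementation; gamma = 2 is a common default, not necessarily the paper's setting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_true)^gamma, which
    down-weights easy examples and focuses training on hard ones."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_p_true = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    p_true = log_p_true.exp()
    return -((1.0 - p_true) ** gamma * log_p_true).mean()

# toy usage on 4-class emotion logits
logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
print(focal_loss(logits, labels).item())
```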

(4) Conclusion: the paper's contribution lies not in the network structure but in examining the effect of different losses.

5. Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions

(1) Data processing: noise is artificially added to IEMOCAP, while CHEAVD is already recorded in noisy conditions, so no noise needs to be added.

(2) Model method: the paper builds a speech enhancement model. The input is the noisy spectrum, and the goal is to generate the clean speech spectrum and an ideal ratio mask (IRM). Three LSTM layers sit in the middle; each layer produces intermediate spectral features and corresponding masks, and the last layer outputs the clean speech spectrum and the IRM (a simplified sketch follows).
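
A simplified sketch of such an LSTM mask-based enhancer is shown below; unlike the paper, it only keeps the final-layer mask and spectrum, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    """Three-layer LSTM enhancement front end: from a noisy magnitude
    spectrogram it predicts an ideal ratio mask (IRM) and the enhanced
    spectrum obtained by applying the mask to the noisy input."""
    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=3, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_freq)

    def forward(self, noisy):                    # noisy: (batch, T, n_freq)
        h, _ = self.lstm(noisy)
        irm = torch.sigmoid(self.mask_head(h))   # IRM in [0, 1]
        clean_est = irm * noisy                  # masked (enhanced) spectrum
        return clean_est, irm

# toy usage: 2 utterances, 300 frames, 257 frequency bins
clean_est, irm = LSTMEnhancer()(torch.rand(2, 300, 257))
print(clean_est.shape, irm.shape)
```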

(3) Experiment: for IEMOCAP, the speech enhancement model is trained on IEMOCAP and WSJ0 data, and emotion is then predicted on the (noise-added) IEMOCAP test set. For CHEAVD, the enhancement model is first trained on a 1000-hour corpus, the CHEAVD data is then enhanced, and the enhanced speech is used for speech emotion recognition.

(4) Conclusion: when the speech enhancement model is trained on data that contains emotional speech, it clearly helps noisy speech emotion recognition; in segments with low signal-to-noise ratio, low energy, or laughter, enhancement tends to introduce distortion and SER performance may drop.

6. Comparison of glottal source parameter values in emotional vowels

(1) Data processing: voice data recorded at JAIST in Japan, in which four speakers (two men and two women) express four emotions (angry, happy, neutral, and sad) while pronouncing the vowel /a/.

(2) Model method: the ARX-LF model, which is widely used to represent the glottal source waveform and the vocal tract filter.

(3) Experiment: analysis of the glottal source waveform shows that sad vowels are more rounded while happy and angry ones are steeper. Among the parameters Tp, Te, Ta, Ee, and F0 (1/T0), the fundamental frequency F0 differs significantly across emotions.

(4) Conclusion: studying how the glottal source expresses emotion departs from the mainstream of speech emotion research, which is exploratory and commendable at a time when deep learning dominates. DL modeling of such data could be a future direction; the difficulty lies in collecting and labeling glottal signals, since such data is still rare, must be recorded manually, and is therefore costly and limited in quantity.

7. Learning to Recognize Per-rater's Emotion Perception Using Co-rater Training Strategy with Soft and Hard Labels

(1) Data processing: IEMOCAP and NNIME data are divided into three discrete categories of valence and activation: low, middle, and high. The features are openSMILE's 45-dimensional set, including MFCCs, F0, loudness, etc.

(2) Model method: each person perceives the emotion of a given audio clip differently. Traditionally, a voting mechanism is adopted and the mode is taken as the single label; this paper instead tries to predict each rater's own emotion label. The base model is a BLSTM-DNN model (part (a) of the figure in the paper). The training data is labeled in three ways: the target rater's hard label, plus a hard label and a soft label from all raters other than the target. A BLSTM-DNN model is trained separately on each of the three label types. The BLSTM-DNN parameters are then frozen, the outputs of the three models' dense layers are concatenated, three dense layers are stacked on top, and softmax finally predicts the target rater's hard label (a rough sketch follows). At prediction time each rater therefore has a corresponding model of their emotion perception: with N raters there are N models.
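
Below is a rough sketch of this per-rater setup under my own simplifying assumptions (encoder sizes, pooling, and head widths): three pre-trained BLSTM-DNN encoders are frozen, their dense outputs are concatenated, and a small dense stack predicts the target rater's label.

```python
import torch
import torch.nn as nn

class BLSTMDNN(nn.Module):
    """Base encoder: BLSTM over frame features, mean pooling, one dense layer."""
    def __init__(self, feat_dim=45, hidden=128, dense=128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * hidden, dense)

    def forward(self, x):                        # x: (batch, T, feat_dim)
        h, _ = self.blstm(x)
        return torch.relu(self.dense(h.mean(dim=1)))   # (batch, dense)

class PerRaterModel(nn.Module):
    """Concatenate the dense outputs of three frozen encoders (target rater's
    hard labels, co-raters' hard labels, co-raters' soft labels), then stack
    three dense layers and predict the target rater's label."""
    def __init__(self, encoders, dense=128, num_classes=3):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        for p in self.encoders.parameters():
            p.requires_grad = False              # freeze the pre-trained encoders
        self.head = nn.Sequential(
            nn.Linear(3 * dense, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes))

    def forward(self, x):
        z = torch.cat([enc(x) for enc in self.encoders], dim=-1)
        return self.head(z)

# toy usage: three pre-trained encoders over 45-dim openSMILE LLD frames
model = PerRaterModel([BLSTMDNN() for _ in range(3)])
print(model(torch.randn(2, 200, 45)).shape)  # torch.Size([2, 3])
```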

(3) Hard label and soft label: for an audio clip annotated by three raters as [L, L, M], the hard label is L, i.e., [1, 0, 0]; the soft label is [0.67, 0.33, 0], i.e., the proportion of each of the three categories.
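
A tiny helper reproducing this hard/soft label construction from the raters' votes:

```python
from collections import Counter

def hard_and_soft_labels(ratings, classes=("L", "M", "H")):
    """Hard label: one-hot of the majority class; soft label: class proportions."""
    counts = Counter(ratings)
    majority = max(classes, key=lambda c: counts[c])
    hard = [1.0 if c == majority else 0.0 for c in classes]
    soft = [round(counts[c] / len(ratings), 2) for c in classes]
    return hard, soft

# the example from the text: three raters give [L, L, M]
print(hard_and_soft_labels(["L", "L", "M"]))
# ([1.0, 0.0, 0.0], [0.67, 0.33, 0.0])
```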

(4) Experiment: the approach is 1-4 percentage points better than modeling with the individual's labels alone, and the combination of soft and hard labels helps SER. Only 50% of the target rater's data needs to be labeled to reach the performance obtained with 100% of the labels; in other words, a new user only needs to annotate 50% of the IEMOCAP data for the model to match the full-annotation result.

(5) Conclusion: in principle it is plausible that crowd-sourced labels help predict an individual's labels, but there is no comparison with other models, which is not the focus of this paper.

8. Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition

(1) Data processing: classification on IEMOCAP, with Sessions 1-4 for training and Session 5 for testing. 23-dimensional log-mel filterbank features are extracted.

(2) Model method: an utterance is divided into multiple frames and fed both into a BLSTM+attention model and into a CNN+attention model; the results of the two models are then fused (a rough sketch follows).
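
A minimal sketch of such result-level (late) fusion, assuming both branches output class logits and using a simple weighted average of posteriors:

```python
import torch
import torch.nn.functional as F

def late_fusion(blstm_logits, cnn_logits, weight=0.5):
    """Result-level fusion: average the class posteriors of the two models.
    `weight` balances the BLSTM+attention and CNN+attention branches."""
    p_blstm = F.softmax(blstm_logits, dim=-1)
    p_cnn = F.softmax(cnn_logits, dim=-1)
    fused = weight * p_blstm + (1.0 - weight) * p_cnn
    return fused.argmax(dim=-1), fused

# toy usage: logits from the two branches for 3 segments, 4 emotion classes
pred, probs = late_fusion(torch.randn(3, 4), torch.randn(3, 4))
print(pred)
```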

(3) Experiment: WA and UA are used as evaluation metrics, but UA is defined incorrectly in the paper; what it calls UA is actually WA, and the definition of WA is also questionable. The reported UA of 80.1% is actually segment-level accuracy; no utterance-level accuracy is given, which is something of an evaluation trick.

(4) Conclusion: the paper is a result-level fusion of two mainstream models, which is not highly innovative, and the improvement is only reflected in segment-level accuracy, so it is of limited reference value.
