This is my 40th day of participating in the First Challenge 2022

This paper, Multimodal Fusion Strategies for Physiological-Emotion Analysis, is a workshop paper at ACM MM 2021; the authors are from Renmin University of China.

Motivation

Physiological emotion reflects a person's true emotional state and does not change when people consciously mask their feelings. This paper targets the 2021 MuSe-Physio challenge, a multimodal emotion analysis task that predicts physiological emotion from audiovisual signals combined with the subjects' galvanic skin response in a highly stressful free-speech setting. Past multimodal sentiment analysis has relied on acoustic, textual and visual information, but these cues vary widely from person to person and are easy to disguise. Physiological signals collected from sensors can reveal a person's true emotional state; for example, electrodermal activity (EDA), the electrical conductance of the skin, increases when the skin sweats. The authors therefore design a multimodal fusion strategy that makes comprehensive use of all of this information for emotion analysis.

Method

Starting from the four modalities of audio, vision, text and physiology, the authors first extract a variety of features from each modality with different methods, and then propose two multimodal fusion strategies: feature-level fusion and prediction-level (pred-level) fusion. In the feature-level fusion strategy, all multimodal features are concatenated and an LSTM captures long-range temporal information. In the pred-level fusion strategy, the authors propose a two-stage training scheme.

Model

The overall structure of the model is as follows: X_j is the j-th segment of the video, Y is the emotion label, and A, V, L and P denote the four modalities of audio, visual, language and physiological signals, respectively.

Multi-modal Features

Pronunciation, intonation and tone in speech, facial expressions and body movements in vision, and the textual content of what is said can all express the speaker's inner feelings to some extent. Both low-level and high-level features matter. The authors extract four groups of features as inputs, as follows:

  • Text features: a pre-trained language model extracts word embeddings from the transcript, which are then averaged within each video segment to obtain segment-level features;
  • Acoustic features: several pre-trained models, such as DeepSpectrum and Wav2Vec, extract low-level and high-level acoustic emotion features from the audio, which are then down-sampled to segment-level features;
  • Visual features: DenseFace and VGGFace capture speakers' facial expressions as high-level features, while FAU, gaze and pose are extracted as low-level features using OpenFace, GazePattern and OpenPose, respectively. For OpenPose, the coordinates are averaged over each 500 ms window to form segment-level features;
  • Physiological features: heart rate (BPM), respiration (RESP) and electrocardiogram (ECG) signals, normalized with Z-score normalization (see the sketch after this list).
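To make the preprocessing concrete, here is a minimal sketch of Z-score normalization for a physiological signal and of averaging frame-level features into 500 ms segments. It assumes frame-level features are stored as NumPy arrays; the array shapes and frame rate are illustrative, not taken from the paper.

```python
import numpy as np

def zscore_normalize(signal: np.ndarray) -> np.ndarray:
    """Z-score normalize a 1-D physiological signal (e.g. BPM, RESP, ECG)."""
    mean, std = signal.mean(), signal.std()
    return (signal - mean) / (std + 1e-8)  # small epsilon avoids division by zero

def to_segment_features(frame_feats: np.ndarray, frame_rate: float,
                        segment_ms: int = 500) -> np.ndarray:
    """Average frame-level features (num_frames, dim) over fixed windows
    (here 500 ms) to obtain segment-level features (num_segments, dim)."""
    frames_per_segment = max(1, int(round(frame_rate * segment_ms / 1000.0)))
    num_segments = len(frame_feats) // frames_per_segment
    trimmed = frame_feats[: num_segments * frames_per_segment]
    return trimmed.reshape(num_segments, frames_per_segment, -1).mean(axis=1)

# Hypothetical example: 30 fps OpenPose coordinates -> one vector per 500 ms segment
pose = np.random.randn(300, 50)                  # (frames, keypoint dims)
pose_segments = to_segment_features(pose, frame_rate=30.0)
ecg = zscore_normalize(np.random.randn(10_000))  # normalized ECG trace
```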

Feature-level Fusion

For the input features of the different modalities, the authors first concatenate them and then project them into a shared embedding space. An LSTM then extracts contextual emotion information over time, and the model is trained with MSE as the loss.
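The sketch below shows this feature-level fusion pipeline in PyTorch: concatenate the segment-level features, project them into an embedding space, run an LSTM, and regress a per-segment emotion value with MSE loss. The feature dimensions, hidden size and sequence length are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Concatenate segment-level features from all modalities, project them into
    a shared embedding space, and model temporal context with an LSTM."""
    def __init__(self, modality_dims=(768, 1024, 512, 3), hidden=256):
        super().__init__()
        self.proj = nn.Linear(sum(modality_dims), hidden)  # shared embedding space
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)                   # per-segment emotion value

    def forward(self, feats):
        # feats: list of (batch, seq_len, dim_m) tensors, one per modality
        x = torch.cat(feats, dim=-1)                       # feature-level fusion
        x = torch.relu(self.proj(x))
        x, _ = self.lstm(x)
        return self.head(x).squeeze(-1)                    # (batch, seq_len)

model = FeatureLevelFusion()
criterion = nn.MSELoss()
batch = [torch.randn(2, 20, d) for d in (768, 1024, 512, 3)]  # dummy features
target = torch.randn(2, 20)                                   # dummy labels
loss = criterion(model(batch), target)
```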

Pred-level Fusion

In the first stage, a model is trained independently on each modality's feature set. In the second stage, the authors concatenate the predictions of the individual modalities and feed them into a separate LSTM that captures cross-modal information. Note that the two fusion strategies are trained separately.
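For clarity, here is a small sketch of the second stage of pred-level fusion: the per-segment predictions of the unimodal models (trained in stage one and treated as fixed inputs here) are stacked and passed through a separate LSTM. Shapes and sizes are assumed for illustration.

```python
import torch
import torch.nn as nn

class PredLevelFusion(nn.Module):
    """Stage two of prediction-level fusion: an LSTM over the concatenated
    per-segment predictions of the unimodal models."""
    def __init__(self, num_modalities=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(num_modalities, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, unimodal_preds):
        # unimodal_preds: (batch, seq_len, num_modalities), one column per modality
        x, _ = self.lstm(unimodal_preds)
        return self.head(x).squeeze(-1)                    # fused prediction

# Stage 1 (not shown): train one regressor per modality and collect its predictions.
# Stage 2: train only the fusion model on the stacked unimodal predictions.
fusion = PredLevelFusion()
preds = torch.randn(2, 20, 4)      # hypothetical stacked unimodal predictions
target = torch.randn(2, 20)
loss = nn.MSELoss()(fusion(preds), target)
```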


So what are the results? See you in the next issue~ (●'◡'●)

Multimodal Affective Computing: How AI Analyzes Your Physical Emotions (Part 2)