This article mainly refers to the paper “Deep Facial Expression Recognition: A Survey”.

Link to the paper: arxiv.org/abs/1804.08…

This article [1] is a survey of deep facial expression recognition (FER, i.e. emotion recognition) by Professor Weihong Deng's group at BUPT, and was accepted by CVPR. It is a good starting point for someone like me who is interested in emotion recognition but has never worked on it.


Introduction

Facial expression is something of a universal language: it crosses national, racial and gender boundaries, and everyone shares a common set of facial expressions. FER has been widely used in robotics, medical care, driver fatigue detection and human-computer interaction systems. Through cross-cultural studies in the 20th century, Ekman and Friesen defined six basic expressions: anger, fear, disgust, happiness, sadness and surprise, and later added "contempt". This groundbreaking work and its intuitive definitions have kept the model popular in automatic facial expression analysis (AFEA).

In terms of feature representation, FER systems can be divided into static image FER and video FER. Image FER extracts features only from the current image, while video FER also has to model the relationship between adjacent frames. In fact, all computer vision tasks can be divided into these two categories: images and videos.

Traditional FER methods rely on hand-crafted features and shallow learning; their disadvantages are not discussed at length here. Thanks to the development of deep learning and the emergence of the more challenging FER2013 dataset, more and more researchers are applying deep learning to FER.

Deep facial expression recognition

This section discusses the three steps of applying deep learning to facial expression recognition: preprocessing, feature extraction and feature classification. Each is described briefly, with references to related papers.

Preprocessing

Face alignment

Given a dataset, the first step is to remove the background and the non-face areas that are irrelevant to the face. The Viola-Jones (V&J) face detector [2] (implemented in OpenCV and Matlab) can crop the original image to obtain the face region. The second step is face alignment, which is important because it reduces the effect of face scale changes and rotation. The most commonly used implementation of face alignment is IntraFace [3], which uses the SDM algorithm to locate 49 facial landmarks (eyes, two eyebrows, nose and mouth).
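As a rough illustration of the first step, here is a minimal sketch that crops the face region with OpenCV's built-in Haar-cascade detector (a V&J-style detector). The file path, output size and "keep the largest detection" rule are my own assumptions, not from the survey.

```python
import cv2

# Viola-Jones-style cascade detector shipped with OpenCV
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(image_path, size=(48, 48)):
    """Detect the largest face in an image and return it resized to `size`."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # keep the largest detection as the face region
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return cv2.resize(gray[y:y + h, x:x + w], size)
```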

Data augmentation

Data augmentation can be done offline or online:

  • Offline: random perturbation, image transformations (rotation, translation, flipping and scaling), adding noise (salt-and-pepper noise and speckle noise), adjusting brightness and saturation, and adding 2D Gaussian noise between the eyes. In addition, generative adversarial networks (GANs) [4] have been used to synthesize faces, and a 3D CNN has been used with AUs to generate expressions. Whether GAN-generated faces actually improve network performance has not been verified.
  • Online: performed during training, mainly random cropping and horizontal flipping of the input images; the model is trained on these random perturbations (a minimal sketch follows this list).
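A minimal sketch of the online mode using torchvision: random cropping plus horizontal flipping applied on the fly during training. The crop size and input size (44 from a 48×48 face) are illustrative assumptions.

```python
from torchvision import transforms

# online augmentation applied on the fly during training
train_transform = transforms.Compose([
    transforms.RandomCrop(44),              # random crop from a 48x48 input
    transforms.RandomHorizontalFlip(p=0.5),  # random mirror
    transforms.ToTensor(),
])
```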

Face normalization

Variations in illumination and head pose can weaken the performance of the trained model. Two face normalization strategies are used to reduce these effects: illumination normalization and pose normalization.

  • Illumination normalization: the INface toolbox [5] is the most commonly used toolbox for illumination-invariant face recognition. Besides straightforward brightness adjustment, contrast can also be adjusted; common methods include histogram equalization, DCT normalization and DoG normalization (a simple sketch follows this list).
  • Pose normalization: this is a difficult problem, and current methods are not ideal. Approaches include 2D landmark alignment, 3D landmark alignment, estimation from image and camera parameters, and measurement with depth sensors. Newer models are GAN-based, including FF-GAN, TP-GAN and DR-GAN.
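For illustration, a simple illumination-normalization sketch with OpenCV: histogram equalization followed by a DoG (difference-of-Gaussians) filter. The sigma values are my own assumptions and this is not the INface toolbox pipeline.

```python
import cv2

def normalize_illumination(gray):
    """Histogram equalization followed by a simple DoG filter on a grayscale face."""
    eq = cv2.equalizeHist(gray)
    # difference of two Gaussian blurs acts as a band-pass (DoG) filter
    blur_small = cv2.GaussianBlur(eq, (0, 0), sigmaX=1.0)
    blur_large = cv2.GaussianBlur(eq, (0, 0), sigmaX=2.0)
    return cv2.subtract(blur_small, blur_large)
```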

Deep feature learning

This part focuses on feature extraction with deep learning models, including convolutional neural networks (CNN), deep belief networks (DBN), deep autoencoders (DAE) and recurrent neural networks (RNN). The overall pipeline of deep facial expression recognition is shown below. As can be seen from the figure, these four models are the ones most commonly used in deep FER. The survey only introduces the network models briefly, and I will not repeat that here either. The CNN model is described in detail in my previous posts on the structure and related algorithms of convolutional neural networks and on classic CNN models: LeNet-5, AlexNet, ZFNet, VGG16, GoogLeNet and ResNet. The remaining network models will be written up one by one later.
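As a toy illustration of the CNN branch (not any specific model from the survey), here is a minimal PyTorch network that maps a 48×48 grayscale face to a feature vector; the layer sizes and feature dimension are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class TinyExpressionCNN(nn.Module):
    """Minimal CNN feature extractor for 48x48 grayscale faces (illustrative only)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.fc = nn.Linear(128 * 6 * 6, feat_dim)

    def forward(self, x):                  # x: (batch, 1, 48, 48)
        x = self.features(x).flatten(1)
        return self.fc(x)                  # (batch, feat_dim)
```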

Facial expression classification

After feature extraction, the final step is classification. In a traditional FER system, feature extraction and classification are independent steps. Deep-learning-based FER is end-to-end: a loss layer is added at the end of the network to drive back-propagation, and the network directly outputs the prediction probability of each class. The two approaches can also be combined, i.e. extract features with a deep network and then classify them with an SVM or another classifier.
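A sketch of the combined option (deep features followed by an external classifier), assuming the TinyExpressionCNN sketch above as the feature extractor and scikit-learn's SVM; the images and labels below are dummy placeholders, not a real expression dataset.

```python
import torch
from sklearn.svm import SVC

# any trained feature extractor would do; here we reuse the sketch above
model = TinyExpressionCNN().eval()
images = torch.randn(100, 1, 48, 48)          # placeholder batch of faces
labels = torch.randint(0, 7, (100,)).numpy()  # placeholder expression labels

with torch.no_grad():
    feats = model(images).numpy()             # deep features, shape (100, 128)

# train a separate SVM classifier on the deep features
svm = SVC(kernel="linear")
svm.fit(feats, labels)
print(svm.predict(feats[:5]))
```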

Facial expression database

This section summarizes the public data sets available for FER.

  • CK+: contains 593 image sequences from 123 subjects; 327 of the sequences have emotion labels. In addition to neutral, it covers seven expressions: anger, contempt, disgust, fear, happiness, sadness and surprise.
  • MMI: contains 326 image sequences from 32 subjects, 213 of which have emotion labels. It covers 6 expressions (no contempt, unlike CK+); MMI is more challenging because many subjects wear accessories.
  • JAFFE: contains 213 images (each at 256×256 resolution) of Japanese women's faces, covering 7 expressions. All images are frontal faces, and the original images have been adjusted and cropped. The illumination is always a frontal light source, but its intensity varies.
  • TFD: a collection of several facial expression datasets. TFD contains 112,234 images (each resized to 48×48), aligned so that the eyes of all subjects are the same distance apart. Of these, 4,189 images are labeled with one of seven expressions.
  • FER2013: this database was collected automatically through the Google image search API, and all images are registered and resized to 48×48. It contains 28,709 training images, 3,589 validation images and 3,589 test images, with 7 expressions.
  • AFEW: the dataset used by the Emotion Recognition in the Wild Challenge (EmotiW) series, held annually since 2013. It consists of expressive video clips edited from movies and covers seven expression categories. The training, validation and test sets contain 773, 383 and 653 samples, respectively.
  • SFEW: static frames with expressions extracted from the AFEW dataset, covering 7 expression categories. The training, validation and test sets contain 958, 436 and 372 samples, respectively.
  • Multi-PIE: 337 subjects captured from 15 viewpoints under 19 illumination conditions in 4 recording sessions, for a total of 755,370 images. Contains 6 expressions (no contempt).
  • BU-3DFE: 606 facial expression sequences obtained from 100 subjects, covering 6 expressions (no contempt); mostly used for 3D facial expression analysis.
  • Oulu-CASIA: 2,880 image sequences collected from 80 subjects, covering 6 expressions (no contempt). Two imaging systems, near-infrared (NIR) and visible light (VIS), were used under three different illumination conditions.
  • RaFD: 1,608 images of 67 subjects with three different gaze directions (front, left and right), covering seven expressions.
  • KDEF: originally created for medical and psychological research. The dataset was collected from 70 actors showing six expressions from five angles.
  • EmotioNet: contains nearly a million facial expression images collected from the web.
  • RAF-DB: contains 29,672 facial images collected from the Internet, with 7 basic expressions and 11 compound expressions.
  • AffectNet: contains more than 1 million facial images collected from the web, 450,000 of which are hand-labeled with expression categories.

The state of the art in FER

This section summarizes the progress of FER based on static images and on dynamic image sequences (video).

Progress in static image FER

For each dataset, the table below shows the best results reported so far on that dataset.

Pre-training and fine-tuning

Directly training a deep network on a relatively small dataset easily leads to overfitting. To mitigate this problem, many studies pre-train networks on large datasets or fine-tune networks that have already been trained.

As shown in the image above, the network is first trained on ImageNet and then fine-tuned on a specific facial expression dataset. Fine-tuning works well, and there are various fine-tuning schemes for FER, such as multi-stage fine-tuning, freezing some layers, or fine-tuning different layers with different datasets; see the papers cited in the original survey for details. In addition, [6] points out that there are large differences between face recognition (FR) and FER data: a face identification model tends to suppress differences between expressions, and FaceNet2ExpNet is proposed to eliminate this effect. The model has two stages: first, features are extracted with a face recognition model; then an expression recognition network is trained to remove the weakening of expression differences caused by the face recognition model, as shown in the figure below.
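A minimal fine-tuning sketch with torchvision: load a ResNet-18 pre-trained on ImageNet, replace the classifier with a 7-class expression head, and optionally freeze the early layers. The learning rate and the choice of which layers to freeze are assumptions for the sketch, not a recipe from the survey.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# start from ImageNet weights, then fine-tune on an expression dataset
net = models.resnet18(pretrained=True)
net.fc = nn.Linear(net.fc.in_features, 7)   # 7 basic expressions

# optionally freeze everything except the last residual block and the new head
for name, param in net.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
```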

Diversified network input

The conventional approach uses the raw RGB image as the network input, but the raw data lacks important information such as texture, as well as invariance to image scaling, rotation, occlusion and illumination. Hand-designed features can therefore be used as input, such as SIFT, LBP, MBP, AGE and NCDV. Instead of the whole face, PCA can also be used to crop out the key facial parts for feature learning.
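As one example of a hand-designed input, a sketch that computes an LBP map with scikit-image and stacks it with the grayscale face as a two-channel network input; the radius, number of sampling points and the stacking choice are my own assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_input(gray):
    """Stack a grayscale face and its LBP map into a 2-channel network input."""
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    gray = gray.astype(np.float32) / 255.0
    lbp = lbp.astype(np.float32) / max(lbp.max(), 1.0)
    return np.stack([gray, lbp], axis=0)   # shape: (2, H, W)
```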

Auxiliary block and layer improvements

Based on classic CNN architectures, some studies design effective auxiliary modules or improved network layers. Several examples are listed in this part; if you are interested, you can look up the relevant papers. It is worth noting that the plain softmax loss does not perform so well in facial expression recognition, because the inter-class differences between expressions are small. The author summarizes several improvements to the expression classification layer.

  • Inspired by center loss, a penalty term on the distance between features and the corresponding class center is added; this comes in two variants (a minimal sketch of this kind of penalty follows the list)
    • one is island loss [7], which additionally increases the distance between class centers, as shown in the figure below
    • the other is LP loss [8] (locality-preserving loss), which reduces intra-class distance so that locally neighbouring features of the same class are pulled together.
  • Losses based on triplet loss; for the idea of triplet loss, see the original survey and this blog.
    • exponential triplet-based loss (increases the weight of hard samples)
    • (N+M)-tuples cluster loss (reduces the difficulty of anchor selection and the threshold-based triplet inequality), as shown in the figure below.
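A minimal sketch of the center-loss-style penalty that island loss and LP loss build on: each class has a learnable center, and features are pulled towards the center of their class. The class names, feature dimension and loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalize the distance between deep features and their class centers (sketch)."""
    def __init__(self, num_classes=7, feat_dim=128):
        super().__init__()
        # one learnable center per expression class
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (batch, feat_dim), labels: (batch,)
        batch_centers = self.centers[labels]           # center of each sample's class
        return ((features - batch_centers) ** 2).sum(dim=1).mean()

# total loss = softmax cross-entropy + lambda * center penalty, e.g.
# loss = ce_loss(logits, labels) + 0.01 * center_loss(features, labels)
```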

Network ensemble

Previous studies have shown that an ensemble of multiple networks can outperform a single network. Two things need to be considered when building an ensemble:

  • the individual network models should be sufficiently diverse to ensure that they complement each other
  • there should be a reliable fusion algorithm

On the first point, there are many ways to create diversity among networks: different training data, different preprocessing methods, different network architectures and different parameters all produce different networks.

On the second point, the fusion algorithm operates at two main levels: feature-level fusion and decision-level (output) fusion. The most common approach to feature fusion is to concatenate the features of different networks directly, as shown in the figure below.

Decision-level fusion uses a voting mechanism in which different networks carry different weights. Several decision fusion strategies are shown in the table below.
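A minimal sketch of decision-level fusion: a weighted average of the per-network softmax probabilities, followed by an argmax. The weights and shapes are illustrative assumptions, not one of the specific strategies tabulated in the survey.

```python
import torch

def weighted_vote(prob_list, weights):
    """Fuse per-network class probabilities by a weighted average.

    prob_list: list of tensors of shape (batch, num_classes), one per network.
    weights:   one scalar weight per network.
    """
    w = torch.tensor(weights, dtype=torch.float32)
    w = w / w.sum()
    stacked = torch.stack(prob_list, dim=0)        # (num_nets, batch, classes)
    fused = (w.view(-1, 1, 1) * stacked).sum(0)    # (batch, classes)
    return fused.argmax(dim=1)                     # predicted expression per sample
```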

Multi-task networks

At present, most networks output a single task, but in practice many other factors need to be considered at the same time. A multi-task model can learn additional information from the other tasks and improve the generalization ability of the network; see this blog post on the benefits of multi-task models. As shown below, the MSCNN [9] model integrates the two tasks of face verification and expression recognition in one network.
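A minimal two-head sketch of the multi-task idea: a shared trunk with one head per task. This reuses the TinyExpressionCNN sketch from earlier; the head sizes, the number of identities and the joint loss weighting are assumptions, not the actual MSCNN.

```python
import torch.nn as nn

class TwoTaskNet(nn.Module):
    """Shared CNN trunk with an expression head and a face-verification head (sketch)."""
    def __init__(self, feat_dim=128, num_expressions=7, num_identities=500):
        super().__init__()
        self.trunk = TinyExpressionCNN(feat_dim)       # shared feature extractor
        self.expr_head = nn.Linear(feat_dim, num_expressions)
        self.id_head = nn.Linear(feat_dim, num_identities)

    def forward(self, x):
        feat = self.trunk(x)
        return self.expr_head(feat), self.id_head(feat)

# joint loss (illustrative):
# total = ce(expr_logits, expr_labels) + alpha * ce(id_logits, id_labels)
```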

Network cascade

In a cascaded network, different modules handling different tasks are combined into a deeper network, in which the output of one module is used by the next. As shown in the figure below, the AUDN network consists of three parts.

Progress in dynamic image sequence FER

Dynamic expression recognition, i.e. recognition from moving image sequences (video), can be more comprehensive than recognition from static images.

Frame aggregation

Expressions change over time, but the result of each individual frame cannot simply be reported as the output; a single recognition result has to be produced for a whole frame sequence, which requires frame aggregation: the sequence is represented by one feature vector or one decision. Similar to ensemble methods, frame aggregation comes in two types, decision-level frame aggregation and feature-level frame aggregation (a sketch of both follows). Those interested in these two parts can refer to the paper.
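A sketch of both aggregation levels using simple averaging, assuming per-frame probabilities and features have already been computed by some frame-level network; the averaging choice is an assumption, and other pooling schemes are possible.

```python
import torch

def aggregate_sequence(frame_probs, frame_feats):
    """Frame aggregation for one video clip.

    frame_probs: (num_frames, num_classes) per-frame class probabilities.
    frame_feats: (num_frames, feat_dim) per-frame deep features.
    """
    # decision-level aggregation: average the per-frame probabilities, then decide
    clip_prediction = frame_probs.mean(dim=0).argmax()

    # feature-level aggregation: pool frame features into one clip-level vector
    clip_feature = frame_feats.mean(dim=0)
    return clip_prediction, clip_feature
```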

Expression intensity networks

Expressions in a video change subtly, and intensity refers to how strongly each frame expresses an expression. Usually the peak intensity occurs somewhere in the middle of the sequence; most methods focus on the peak frames and ignore the low-intensity frames at the beginning and end. This part introduces several deep networks whose input is a sample sequence carrying intensity information and whose output models the correlation between frames of different intensities within one expression class. For example, PPDN (peak-piloted deep network) models the correlation between frames within an expression sequence, and DCPN, a cascaded network based on PPDN, is deeper and more discriminative. Although these networks all model the expression changes within a sequence, and even design dedicated loss functions for the change trend, I honestly think the cost is not worthwhile for engineering purposes. If interested, you can look at the corresponding methods in the paper; they are not repeated here.

Deep spatiotemporal FER network

The frame aggregation and expression intensity networks introduced above still follow fairly traditional structured pipelines: frames of a video are fed in as separate images and the classification result of an expression class is output. RNNs can exploit sequence information, so video FER models use RNNs and C3D:

  • RNN: in theory it can make use of arbitrarily long sequences of information and can model changes along the time axis.
  • C3D: 3D spatio-temporal convolution, formed by adding a time dimension to the usual 2D spatial convolution over images, e.g. 3DCNN-DAP [10]; the network model is shown in the figure below.

There is also a "brute-force" approach that ignores the time dimension: the frame sequence is concatenated into one large vector and then classified with a CNN, as in DTAN [11].

  • Facial landmark trajectories: changes in facial expression can be analyzed from the trajectories of facial landmarks, as in the deep temporal geometry network (DTGN). This method concatenates the x and y coordinates of the landmarks in each frame; after normalization, the landmark sequence is treated as a motion trajectory, or pairwise L2 distances between landmark points are computed as features, and PHRNN is used to capture spatial change information within each frame. In addition, the landmark points can be divided into four blocks according to the facial parts and fed into BRNNs to capture local features, as shown in the figure below:
  • Cascaded networks: the idea is the same as for the static-image case above, mainly using a CNN to extract features and a cascaded RNN to classify the sequence features. Examples include LRCN, a cascade of CNN and LSTM; similarly, a cascaded DAE for feature extraction with an LSTM for classification; ResNet-LSTM, where an LSTM directly connects low-level CNN features across the sequence; and 3DIR, which builds a 3D Inception-ResNet feature layer with LSTM units. There are many similar cascaded networks, including variants that replace the LSTM with CRFs (a minimal CNN-LSTM sketch follows this list).
  • Network ensembles: for example, the two-stream CNN model from action recognition, where one stream is trained on dense optical flow over multiple frames to capture temporal information and the other learns features from single frames, and the outputs of the two streams are fused at the end. Multi-stream training has also been used: for example, one stream is trained on the optical flow between the neutral face and the expressive face and another on expression features, followed by three fusion strategies (average fusion, SVM fusion and DNN fusion). In addition, the PHRNN temporal network and the MSCNN spatial network can be combined to extract local and global relations, geometric changes, and static and dynamic information. Besides fusion, there is also joint training, such as joint fine-tuning of DTAN and DTGN.
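A minimal CNN-LSTM cascade sketch in the spirit of the LRCN-style models mentioned above (not a reproduction of any of them): a per-frame CNN extracts features, an LSTM models the sequence, and the last hidden state is classified. It reuses the TinyExpressionCNN sketch from earlier; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class CnnLstmFER(nn.Module):
    """Per-frame CNN features fed into an LSTM for sequence-level expression recognition."""
    def __init__(self, feat_dim=128, hidden_dim=64, num_classes=7):
        super().__init__()
        self.cnn = TinyExpressionCNN(feat_dim)             # per-frame extractor
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                  # clips: (batch, frames, 1, 48, 48)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))  # (batch*frames, feat_dim)
        feats = feats.view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)         # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])        # (batch, num_classes)
```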

At present, the best results of dynamic sequence expression recognition on the various datasets are shown in the following table:

Putting all of this together was not easy, so please give it a follow, or visit my personal blog, weak's Blog.

References

[1]: Li S, Deng W. Deep Facial Expression Recognition: A Survey[J]. 2018.

[2]: Viola P, Jones M. Rapid object detection using a boosted cascade of simple features[C]// Proc. CVPR, 2001, 1: 511.

[3]: Torre F D L, Chu W S, Xiong X, et al. IntraFace[C]// IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. IEEE, 2015: 1-8.

[4]: Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]// International Conference on Neural Information Processing Systems. MIT Press, 2014:2672-2680.

[5]: [http://luks.fe.uni-lj.si/sl/osebje/vitomir/face tools/INFace/](http://luks.fe.uni-lj.si/sl/osebje/vitomir/face tools/INFace/)

[6]: Ding H, Zhou S K, Chellappa R. FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition[J]. 2016:118-126.

[7]: Cai J, Meng Z, Khan A S, et al. Island Loss for Learning Discriminative Features in Facial Expression Recognition[J]. 2017.

[8]: Li S, Deng W, Du J P. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2017:2584-2593.

[9]: Zhang K, Huang Y, Du Y, et al. Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks[J]. IEEE Transactions on Image Processing, 2017, PP(99): 1-1.

[10]: Liu M, Li S, Shan S, et al. Deeply Learning Deformable Facial Action Parts Model for Dynamic Expression Analysis[M]// Computer Vision — ACCV 2014. Springer International Publishing, 2014: 143-157.

[11]: Jung H, Lee S, Yim J, et al. Joint Fine-Tuning in Deep Neural Networks for Facial Expression Recognition[C]// IEEE International Conference on Computer Vision. IEEE, 2016: 2983-2991.