Original reference: zhuanlan.zhihu.com/FaceRec

Contents

This series of columns parses key papers on deep learning-based face recognition, from DeepFace's debut in 2014 to the latest algorithms. It also introduces the loss functions used for face recognition: the Euclidean distance-based losses (Contrastive Loss, Triplet Loss, Center Loss) and the angular margin-based losses (L-Softmax Loss, A-Softmax Loss, COCO Loss, CosFace Loss, ArcFace Loss).

Development history of deep learning-based face recognition

  • In 2014, the DeepFace and DeepID series first trained a Softmax multi-class classifier as the face recognition framework, then extracted the feature layer and used it to train a separate face verification framework (another neural network, a Siamese network, or Joint Bayesian). To have both face verification and face identification systems, two networks must be trained separately. Moreover, the size of the linear transformation matrix W grows linearly with the number of identities n.
  • The DeepFace and DeepID framework is CNN+Softmax: the network forms strongly discriminative face features in its first FC layer, which are used for face recognition.
  • DeepID2, DeepID2+ and DeepID3 all adopt CNN+Softmax+Contrastive Loss, making the L2 distance between features of the same identity as small as possible and the L2 distance between features of different identities larger than a given margin.
  • In 2015, FaceNet proposed a unified framework for most face tasks: learn an embedding directly, then base face recognition, face verification and face clustering on that feature. Building on DeepID2, FaceNet abandoned the classification layer and improved Contrastive Loss into Triplet Loss (sketched after this list) to obtain intra-class compactness and inter-class separation. However, the number of face triplets explodes, especially on large datasets, greatly increasing the number of iterations, and the sample mining strategy makes it difficult to train the model effectively.
  • In 2017, In Defense of the Triplet Loss for Person Re-Identification proposed a soft-margin formula to replace the original Triplet Loss expression, and introduced Batch Hard sampling.
  • In 2017, Wu, C. et al. explained, from the perspective of the derivative, why non-squared distance is better than squared distance, and proposed a Margin Based Loss built on this insight. They also proposed Distance Weighted Sampling, arguing that FaceNet's semi-hard sampling and the random hard / batch hard sampling of Deep Face Recognition do not reliably produce the large-gradient (large-loss) triplets needed to train the model.
  • In 2015, VGGFace first trained the face recognition model with traditional Softmax in order to speed up the subsequent Triplet Loss training: because of the strong supervision of the classification signal, the model fits quickly. The classification layer is then removed, and Triplet Loss is used to fine-tune the feature layer.
  • In 2016, Center Loss learns a center for each class and pulls all feature vectors of a class toward its center (see the sketch after this list). Intra-class compactness comes from penalizing the Euclidean distance between each feature vector and its class center; inter-class dispersion is guaranteed by the joint Softmax Loss penalty. However, updating the class centers during training is difficult, since the number of face classes available for training has grown dramatically.
  • In 2017, COCO Loss normalized the weights C and features F and multiplied by a scale factor, reaching 99.86% on LFW.
  • In 2017, SphereFace proposed A-Softmax, an improvement on L-Softmax: it introduced an angular margin penalty and normalized the weights W, so that training focuses on optimizing the angle between deep features and weight vectors and is less affected by class imbalance.
  • Learning Towards Minimum HyperSpherical Energy notes that A-Softmax's loss function requires a series of approximations to compute, making network training unstable. To stabilize training, a hybrid loss including the standard Softmax Loss was proposed; empirically, the Softmax Loss then dominates the training process, because the integer multiplicative angular margin makes the target logit curve very steep, which hinders convergence.
  • In 2018, CosFace added a cosine margin penalty directly to the target logit, using an additive cosine margin cos(θ) − m with normalized feature vectors and weights. Compared with SphereFace, it achieves better performance, is easier to implement, and removes the need for joint supervision with Softmax Loss (see the sketch after this list).
  • In 2018, ArcFace proposed an additive angular margin loss, cos(θ + m), with normalized feature vectors and weights; geometrically it has a constant linear angular margin and directly optimizes in radians. ArcFace achieves stable performance without joint supervision from other loss functions.
  • In 2018, MobileFaceNets combined MobileNetV2 with ArcFace Loss as a lightweight model.
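A minimal PyTorch sketch of the Euclidean-distance losses above (Triplet Loss, the soft-margin variant, and Center Loss). The margin value, tensor shapes and use of non-squared distances are illustrative assumptions, not the papers' exact settings:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: the anchor-positive distance should be
    at least `margin` smaller than the anchor-negative distance."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def soft_margin_triplet_loss(anchor, positive, negative):
    """Soft-margin variant (In Defense of the Triplet Loss): the hard
    hinge is replaced by the smooth softplus log(1 + exp(x))."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.softplus(d_ap - d_an).mean()

class CenterLoss(torch.nn.Module):
    """Center Loss: pull each feature toward its learned class center;
    used jointly with Softmax loss (total = softmax + lambda * center)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # squared Euclidean distance of each feature to its own class center
        return 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).mean()
```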
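Likewise, a hedged sketch of the additive-margin logits behind CosFace (cos θ − m) and ArcFace (cos(θ + m)). The scale s and margin m defaults are illustrative assumptions, not the papers' exact settings:

```python
import torch
import torch.nn.functional as F

def margin_logits(features, weight, labels, s=64.0, m=0.35, kind="cosface"):
    """features: (B, D) embeddings; weight: (C, D) class weights; labels: (B,).
    Returns scaled, margin-penalized logits to feed into F.cross_entropy."""
    # Normalizing both features and weights makes each logit equal cos(theta).
    cos = F.linear(F.normalize(features), F.normalize(weight))
    cos = cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    onehot = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), 1.0)
    if kind == "cosface":
        cos_m = cos - m * onehot                     # additive cosine margin
    else:  # "arcface"
        cos_m = torch.cos(cos.acos() + m * onehot)   # additive angular margin
    return s * cos_m

# usage: loss = F.cross_entropy(margin_logits(x, W, y, kind="arcface", m=0.5), y)
```

Either variant feeds straight into an ordinary cross-entropy loss, which is one reason these additive margins are easier to implement than A-Softmax's integer multiplicative angle.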

Figure: face recognition loss functions based on Euclidean distance (top) and on Angular Margin (bottom).

Series contents: 1. DeepFace; 2. DeepID; 3. DeepID2; 4. DeepID2+; 5. DeepID3; 6. FaceNet; 7. VGGFace; 8. SphereFace; 9. CosFace; 10. ArcFace

 

Overview

FR tasks, system structure, feature development, loss function development, backbone development, data set development with examples of training and test data sets, Siamese networks, and face recognition loss functions.

1. FR tasks, structure, feature development, loss function development, backbone development

Face recognition (FR) task classification

(1) 1:1 (face verification)

  • 1:1, commonly called the face verification task, is a binary decision (e.g., comparing one face against another)
  • Face Verification (also called face check / identity verification) = verify that you are who you claim to be (1:1 matching)
  • Used for buying air and train tickets online, hospital registration, government services for citizens, and account opening for securities, telecom and Internet finance

(2) 1:N (face identification)

  • Searching for a face in a database of many faces
  • Face Identification = find out who you are (1:N matching)
  • 1:N is essentially many 1:1 comparisons (A/B, A/C, A/D, ...), as sketched after this list. The biggest problem is that the larger the gallery, the slower the computation; once it exceeds roughly 200,000 identities there will be multiple similar results (among 200,000 people, many look alike), requiring manual assistance to confirm the match.
  • It is mainly used for face retrieval, investigating criminal suspects, full-database searches for missing persons, and screening for one person holding multiple IDs. Listing candidate results by similarity can greatly improve investigative efficiency.
  • The actual usage scenarios of 1:1 are more limited and uniform, while those of 1:N are more varied and uncontrolled, so 1:1 is easier to make highly reliable and 1:N is harder.
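As a rough illustration of 1:N as repeated 1:1 comparisons (referenced in the list above), here is a minimal NumPy sketch; the L2-normalized gallery and the 0.6 acceptance threshold are illustrative assumptions:

```python
import numpy as np

def identify(probe, gallery, threshold=0.6):
    """probe: (D,) L2-normalized embedding; gallery: (N, D) L2-normalized rows.
    Returns the best-matching gallery index, or -1 if no score clears the threshold."""
    sims = gallery @ probe          # N cosine similarities = N 1:1 comparisons
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1
```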

(3) N:N

  • N:N is essentially the 1:N algorithm applied to multiple inputs at once, e.g., frame-by-frame processing of a video stream, which places strict demands on the server's computing environment

(4) Face clustering

  • Face clustering = find people who look alike (grouping similar faces)

Challenges of face recognition

  • Intra-personal variation
  • Inter-personal variation

Deep face recognition system

  • First, a face detector locates faces. Each face is then aligned to standardized canonical coordinates. Finally, the FR module performs recognition (the full pipeline is sketched after this list).
  • Anti-spoofing in the FR module determines whether a face is live or fake; face processing handles recognition difficulties before training and testing.
  • During training, discriminative deep features are extracted with different architectures and loss functions. At test time, the deep features of the test data are extracted and a face matching method performs the feature classification.
  • The figure below lists some important methods of data processing, structural design, loss function and face matching
  • Deep Face Recognition: A Survey. Mei Wang, Weihong Deng. 2018.04
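A hedged sketch of the detect → align → embed → match pipeline described above; detect_faces, align_face, embed and match are hypothetical stand-ins for a real detector, aligner, trained FR backbone and matcher:

```python
def recognize(image, gallery, detect_faces, align_face, embed, match):
    """Run the full FR pipeline on one image against a gallery of known faces."""
    results = []
    for box, landmarks in detect_faces(image):    # 1. locate faces
        face = align_face(image, box, landmarks)  # 2. warp to canonical coordinates
        feature = embed(face)                     # 3. extract a deep feature
        results.append(match(feature, gallery))   # 4. match against the gallery
    return results
```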

Development of FR feature representation

  • In the 1990s and early 2000s, a holistic approach dominated FR
  • From the early 2000s to the early 2010s, local feature-based FR and learning-based local descriptors were developed
  • In 2014, DeepFace and DeepID achieved state-of-the-art accuracy, and the research focus shifted to deep learning-based approaches

Development of FR loss function

  • DeepFace and DeepID in 2014 marked the birth of deep learning-based FR, trained with Softmax Loss
  • After 2015, Euclidean distance-based losses played an important role among loss functions, such as Contrastive Loss, Triplet Loss and Center Loss
  • In 2017, feature and weight normalization began to show excellent performance, leading to Softmax variants such as L2-Softmax
  • In 2016 and 2017, large-margin losses further advanced large-margin feature learning, such as L-Softmax, A-Softmax, CosFace and ArcFace

Red, green, blue and yellow represent the Softmax-based deep methods, the Euclidean distance-based loss methods, the Softmax-variant methods, and the angle/cosine margin-based loss methods, respectively.

FR backbone network development

  • The architecture of deep FR has always followed the network structures of deep object classification, evolving from AlexNet to SENet

2. The evolution of FR data sets

  • Prior to 2007, FR’s early work focused on constrained and small-scale data sets.
  • The introduction of LFW datasets in 2007 marked the beginning of FR under unconstrained conditions. Since then, more test databases with different tasks and scenarios have been designed.
  • In 2014, CASIA-WebFace provided the first widely available large-scale public training data set.
  • Red rectangles represent training datasets and other colored rectangles represent test datasets with different tasks and scenarios

Common FR data sets used for training

Common FR data sets used for testing

Data set list

LFW face recognition data set

  • An unconstrained natural-scene face recognition dataset consisting of 13,233 face images of public figures collected from the Internet (with varying poses, expressions and lighting). Of the 5,749 identities, 1,680 have two or more images, only 85 have more than 15, and 4,069 have just one. Each face image is identified by a unique name, ID and serial number.
  • The LFW benchmark mainly tests face verification accuracy: 6,000 face pairs are randomly selected, 3,000 of which are two photos of the same person and 3,000 are photos of two different people (a minimal scoring sketch follows this list).
  • vis-www.cs.umass.edu/lfw/
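A minimal sketch of scoring the 6,000 LFW pairs with a cosine-similarity threshold. This simplifies the official protocol, which uses 10-fold cross-validation with the threshold chosen on held-out folds; the single threshold sweep here is an illustrative assumption:

```python
import numpy as np

def lfw_accuracy(emb1, emb2, same, thresholds=np.linspace(-1.0, 1.0, 400)):
    """emb1, emb2: (6000, D) L2-normalized embeddings of each pair;
    same: (6000,) bool, True if the pair shows the same person."""
    sims = (emb1 * emb2).sum(axis=1)                      # cosine similarity per pair
    accs = [((sims >= t) == same).mean() for t in thresholds]
    return float(max(accs))                               # accuracy at best threshold
```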

FDDB face detection dataset

  • An unconstrained natural-scene face detection dataset containing 5,171 faces in 2,845 images taken from various natural scenes. Each face has a specified coordinate position.
  • The FDDB benchmark mainly tests face detection accuracy. An algorithm must detect the faces in each image and output their positions; then, against the ground truth provided by the dataset, the numbers of correctly and wrongly detected faces are counted to judge the quality of the detector.
  • vis-www.cs.umass.edu/fddb/

CelebA (CelebFaces, CelebFaces+) Face attribute recognition data set

  • The Large-scale CelebFaces Attributes (CelebA) dataset, published by Professor Tang Xiaoou's lab at the Chinese University of Hong Kong, is a large face dataset mainly used for face attribute recognition: 202,599 face images of 10,177 identities, each image annotated with 5 landmark locations and 40 binary attributes.

YouTube Faces (YTF)

  • YouTube Faces is for face verification in video: the algorithm must determine whether two videos show the same person. Many methods that work well on photos may be ineffective or inefficient on video, where image quality is poorer. The dataset contains 3,425 videos of 1,595 people; the shortest clip is 48 frames, the longest 6,070 frames, and the average clip length is 181.3 frames.

CASIA-WebFace

  • Released by the Chinese Academy of Sciences (CASIA), this was the first widely used large-scale public training data set for FR: 494,414 images of 10,575 identities.

IJB-A data set

  • The IJB-A (IARPA Janus Benchmark A) dataset includes not only still images of its subjects but also video clips. Because of this, it introduces the concept of a template: the collection of all face media of a subject gathered under unconstrained conditions, including both still images and video clips.
  • All media in the dataset were collected in completely unconstrained environments; many subjects were captured with wildly varying facial pose, wildly varying lighting and different image resolutions.
  • Its disadvantage is its small size: IJB-A contains only 5,396 still images and 20,412 video frames from 500 subjects.

MegaFace data set

  • The MegaFace dataset, published by the University of Washington, includes 690,572 identities and about 4.7 million images, pushing the scale of face data to a new level.
  • The dataset is set up differently: dozens of images each of celebrities, plus one million images of ordinary people as distractor data. It leans more toward face verification under heavy noise than toward identification, and the data distribution is unbalanced, with only 7 images per identity on average and little variation within each identity.

MS-Celeb-1M data set

  • Released by Microsoft Research Asia, the dataset contains 100,000 identities and about 10 million images, the largest face recognition data set to date. Despite its scale, the distribution is unbalanced: faces with large pose variation account for a small proportion, and there is a lot of noisy data.
  • From 1M celebrities, the 100K most popular were selected; a search engine then retrieved about 100 images per person, for a total of 100K × 100 = 10M images. The test set consists of 1,000 celebrities randomly sampled from the 1M, labeled by Microsoft; each has about 20 images, none of which can be found online.
  • MSR IRC is one of the largest and highest-level image recognition competitions in the world, initiated by Zhang Lei, leader of the Image Analysis and Big Data Mining research group at MSRA (Microsoft Research Asia), and held regularly every year.

3. Siamese (twin) network architecture

  • A Siamese network is a “conjoined” neural network, and the Siamese architecture is a framework: the “conjoining” is realized by weight sharing. The name comes from Siamese twins. Siamese networks are used when the two inputs are “relatively similar”, such as measuring the semantic similarity of two sentences or words.
  • The left and right branch networks have exactly the same weights; in code they can even be the same network, so there is no need to implement a second copy. Both branches can be LSTMs or CNNs.
  • Pseudo-Siamese network: if the left and right sides do not share weights but are two different neural networks, the result is a pseudo-Siamese network. Its two branches can be different network types (e.g., one LSTM and one CNN) or the same type. Pseudo-Siamese networks are useful when the two inputs are “somewhat different”, e.g., verifying whether a title is consistent with a body of text (their lengths differ greatly), or whether a piece of text describes an image (one input is an image, the other text).
  • The purpose of a Siamese network is to measure the similarity of two inputs: the two inputs are fed into the left and right branches, which map them into vectors in a new space; similarity is then judged in that space via cosine distance, an EXP function, Euclidean distance, etc. Through training on the loss, the distance D between similar images decreases and the distance D between dissimilar images increases (a minimal sketch follows this list).
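A minimal PyTorch sketch of the Siamese architecture just described: one shared-weight encoder applied to both inputs, then a distance in the new space. The convolution and embedding sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Shared weights: the encoder is defined once and applied to both inputs.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x1, x2):
        z1 = F.normalize(self.encoder(x1))   # same encoder for both branches
        z2 = F.normalize(self.encoder(x2))
        return F.pairwise_distance(z1, z2)   # similarity D in the new space
```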

  • The traditional Siamese network uses Contrastive Loss. Other loss functions are possible; Softmax is certainly a choice, but not necessarily the optimal one, even for classification problems. The figure below uses the contrastive loss:
  • For neighboring (similar) pairs, pay the squared penalty D², which pulls the two outputs together
  • For non-neighboring (dissimilar) pairs, pay the squared penalty max(0, m − D)², which pushes the outputs apart until D exceeds the margin m, at which point the loss is 0 (see the sketch below)
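A hedged sketch of the contrastive loss described above; the margin value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(distance, y, margin=1.0):
    """distance: (B,) pairwise distances D; y: (B,) 1.0 = similar, 0.0 = dissimilar."""
    pos = y * distance.pow(2)                          # pull similar pairs together
    neg = (1 - y) * F.relu(margin - distance).pow(2)   # push dissimilar apart until D > m
    return 0.5 * (pos + neg).mean()
```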

  • There are many applications in both NLP and CV fields:
  • In 1993, Yann LeCun used a Siamese network for signature verification, i.e., verifying whether the signature on a check matches the bank's signature on file. NIPS 1993: Signature Verification using a “Siamese” Time Delay Neural Network
  • In 2010, Nair and Hinton used it for face verification with very good results: two faces are fed into convolutional networks and the output is same/different, a binary classification. Rectified Linear Units Improve Restricted Boltzmann Machines
  • Handwriting recognition
  • Visual tracking based on Siamese networks has also become a hot topic: Fully-Convolutional Siamese Networks for Object Tracking
  • Semantic similarity analysis of words and matching of question and answer in QA
  • In a Kaggle question-pair competition that judges whether two questions are the same, the winning team used handcrafted features plus a Siamese network. Reference: ref 1

4. Loss function of FR

  • Label prediction (the last fully connected layer) acts like a linear classifier, so for closed-set classification the deep features only need to be separable; Softmax loss can then solve the classification problem directly.
  • For face recognition tasks, however, deep features need to be not only separable but also discriminative, so they can generalize to identify unseen categories without label prediction.

For details on the Euclidean distance-based loss functions (Contrastive Loss, Triplet Loss, Center Loss) and the Angular Margin-based loss functions (L-Softmax Loss, A-Softmax Loss, COCO Loss, CosFace Loss, ArcFace Loss), see the following chapters.