Original reference: zhuanlan.zhihu.com/FaceRec

Contents

This series of columns parses key papers on deep learning-based face recognition, from DeepFace's debut in 2014 to the latest algorithms. It also introduces the loss functions used for face recognition: the Euclidean distance-based losses (Contrastive Loss, Triplet Loss, Center Loss) and the angular margin-based losses (L-Softmax Loss, A-Softmax Loss, COCO Loss, CosFace Loss, ArcFace Loss).

Development history of deep learning-based face recognition

  • In 2014, the DeepFace and DeepID series first trained a Softmax multi-class classifier as the face recognition framework, then extracted the feature layer and used it to train a separate face verification framework (another neural network, a Siamese network, or Joint Bayesian). To have both face verification and face identification systems, two networks must be trained separately. Moreover, the size of the linear transformation matrix W grows linearly with the number of identities n.
  • The DeepFace and DeepID framework is CNN+Softmax: the network forms strongly discriminative face features in its first FC layer, which are used for face recognition.
  • DeepID2, DeepID2+ and DeepID3 all adopt CNN+Softmax+Contrastive Loss, making the L2 distance between features of the same identity as small as possible and the L2 distance between features of different identities larger than a given margin.
  • In 2015, FaceNet proposed a unified framework for most face tasks: learn an embedding directly, then base face recognition, face verification and face clustering on that feature. Building on DeepID2, FaceNet abandoned the classification layer and improved Contrastive Loss into Triplet Loss (sketched after this list) to obtain intra-class compactness and inter-class separation. However, the number of face triplets explodes, especially on large datasets, greatly increasing the number of iterations, and the sample mining strategy makes it difficult to train the model effectively.
  • In 2017, In Defense of the Triplet Loss for Person Re-Identification proposed a soft-margin formula to replace the original Triplet Loss expression, and introduced Batch Hard sampling.
  • In 2017, Wu, C. et al. explained, from the perspective of the derivative, why non-squared distance is better than squared distance, and proposed a Margin Based Loss built on this insight. They also proposed Distance Weighted Sampling, arguing that FaceNet's semi-hard sampling and the random hard / batch hard sampling of Deep Face Recognition do not reliably produce the large-gradient (large-loss) triplets needed to train the model.
  • In 2015, VGGFace first trained the face recognition model with traditional Softmax in order to speed up the subsequent Triplet Loss training: because of the strong supervision of the classification signal, the model fits quickly. The classification layer is then removed, and Triplet Loss is used to fine-tune the feature layer.
  • In 2016, Center Loss learns a center for each class and pulls all feature vectors of a class toward its center (see the sketch after this list). Intra-class compactness comes from penalizing the Euclidean distance between each feature vector and its class center; inter-class dispersion is guaranteed by the joint Softmax Loss penalty. However, updating the class centers during training is difficult, since the number of face classes available for training has grown dramatically.
  • In 2017, COCO Loss normalized the weights C and features F and multiplied by a scale factor, reaching 99.86% on LFW.
  • In 2017, SphereFace proposed A-Softmax, an improvement on L-Softmax: it introduced an angular margin penalty and normalized the weights W, so that training focuses on optimizing the angle between deep features and weight vectors and is less affected by class imbalance.
  • Learning Towards Minimum HyperSpherical Energy notes that A-Softmax's loss function requires a series of approximations to compute, making network training unstable. To stabilize training, a hybrid loss including the standard Softmax Loss was proposed; empirically, the Softmax Loss then dominates the training process, because the integer multiplicative angular margin makes the target logit curve very steep, which hinders convergence.
  • In 2018, CosFace added a cosine margin penalty directly to the target logit, using an additive cosine margin cos(θ) − m with normalized feature vectors and weights. Compared with SphereFace, it achieves better performance, is easier to implement, and removes the need for joint supervision with Softmax Loss (see the sketch after this list).
  • In 2018, ArcFace proposed an additive angular margin loss, cos(θ + m), with normalized feature vectors and weights; geometrically it has a constant linear angular margin and directly optimizes in radians. ArcFace achieves stable performance without joint supervision from other loss functions.
  • In 2018, MobileFaceNets combined MobileNetV2 with ArcFace Loss as a lightweight model.
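A minimal PyTorch sketch of the Euclidean-distance losses above (Triplet Loss, the soft-margin variant, and Center Loss). The margin value, tensor shapes and use of non-squared distances are illustrative assumptions, not the papers' exact settings:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: the anchor-positive distance should be
    at least `margin` smaller than the anchor-negative distance."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def soft_margin_triplet_loss(anchor, positive, negative):
    """Soft-margin variant (In Defense of the Triplet Loss): the hard
    hinge is replaced by the smooth softplus log(1 + exp(x))."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.softplus(d_ap - d_an).mean()

class CenterLoss(torch.nn.Module):
    """Center Loss: pull each feature toward its learned class center;
    used jointly with Softmax loss (total = softmax + lambda * center)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # squared Euclidean distance of each feature to its own class center
        return 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).mean()
```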
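Likewise, a hedged sketch of the additive-margin logits behind CosFace (cos θ − m) and ArcFace (cos(θ + m)). The scale s and margin m defaults are illustrative assumptions, not the papers' exact settings:

```python
import torch
import torch.nn.functional as F

def margin_logits(features, weight, labels, s=64.0, m=0.35, kind="cosface"):
    """features: (B, D) embeddings; weight: (C, D) class weights; labels: (B,).
    Returns scaled, margin-penalized logits to feed into F.cross_entropy."""
    # Normalizing both features and weights makes each logit equal cos(theta).
    cos = F.linear(F.normalize(features), F.normalize(weight))
    cos = cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    onehot = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), 1.0)
    if kind == "cosface":
        cos_m = cos - m * onehot                     # additive cosine margin
    else:  # "arcface"
        cos_m = torch.cos(cos.acos() + m * onehot)   # additive angular margin
    return s * cos_m

# usage: loss = F.cross_entropy(margin_logits(x, W, y, kind="arcface", m=0.5), y)
```

Either variant feeds straight into an ordinary cross-entropy loss, which is one reason these additive margins are easier to implement than A-Softmax's integer multiplicative angle.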

Figure: face recognition loss functions based on Euclidean distance (top) and on Angular Margin (bottom).

Series contents: 1. DeepFace; 2. DeepID; 3. DeepID2; 4. DeepID2+; 5. DeepID3; 6. FaceNet; 7. VGGFace; 8. SphereFace; 9. CosFace; 10. ArcFace

 

Overview

FR tasks, system structure, feature development, loss function development, backbone development, data set development with examples of training and test data sets, Siamese networks, and face recognition loss functions.

1. FR tasks, structure, feature development, loss function development, backbone development

Face recognition (FR) task classification

(1) 1:1 (face verification)

  • 1:1, commonly called the face verification task, is a binary decision (e.g., comparing one face against another)
  • Face Verification (also called face check / identity verification) = verify that you are who you claim to be (1:1 matching)
  • Used for buying air and train tickets online, hospital registration, government services for citizens, and account opening for securities, telecom and Internet finance

(2) 1:N (face identification)

  • Searching for a face in a database of many faces
  • Face Identification = find out who you are (1:N matching)
  • 1:N is essentially many 1:1 comparisons (A/B, A/C, A/D, ...), as sketched after this list. The biggest problem is that the larger the gallery, the slower the computation; once it exceeds roughly 200,000 identities there will be multiple similar results (among 200,000 people, many look alike), requiring manual assistance to confirm the match.
  • It is mainly used for face retrieval, investigating criminal suspects, full-database searches for missing persons, and screening for one person holding multiple IDs. Listing candidate results by similarity can greatly improve investigative efficiency.
  • The actual usage scenarios of 1:1 are more limited and uniform, while those of 1:N are more varied and uncontrolled, so 1:1 is easier to make highly reliable and 1:N is harder.
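As a rough illustration of 1:N as repeated 1:1 comparisons (referenced in the list above), here is a minimal NumPy sketch; the L2-normalized gallery and the 0.6 acceptance threshold are illustrative assumptions:

```python
import numpy as np

def identify(probe, gallery, threshold=0.6):
    """probe: (D,) L2-normalized embedding; gallery: (N, D) L2-normalized rows.
    Returns the best-matching gallery index, or -1 if no score clears the threshold."""
    sims = gallery @ probe          # N cosine similarities = N 1:1 comparisons
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1
```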

(3) N:N

  • N:N is essentially the 1:N algorithm applied to multiple inputs at once, e.g., frame-by-frame processing of a video stream, which places strict demands on the server's computing environment

(4) Face clustering

  • Face clustering = find people who look alike (grouping similar faces)

Challenges of face recognition

  • Intra-personal variation
  • Inter-personal variation

Deep face recognition system

  • First, a face detector locates faces. Each face is then aligned to standardized canonical coordinates. Finally, the FR module performs recognition (the full pipeline is sketched after this list).
  • Anti-spoofing in the FR module determines whether a face is live or fake; face processing handles recognition difficulties before training and testing.
  • During training, discriminative deep features are extracted with different architectures and loss functions. At test time, the deep features of the test data are extracted and a face matching method performs the feature classification.
  • The figure below lists some important methods of data processing, structural design, loss function and face matching
  • Deep Face Recognition: A Survey. Mei Wang, Weihong Deng. 2018.04
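A hedged sketch of the detect → align → embed → match pipeline described above; detect_faces, align_face, embed and match are hypothetical stand-ins for a real detector, aligner, trained FR backbone and matcher:

```python
def recognize(image, gallery, detect_faces, align_face, embed, match):
    """Run the full FR pipeline on one image against a gallery of known faces."""
    results = []
    for box, landmarks in detect_faces(image):    # 1. locate faces
        face = align_face(image, box, landmarks)  # 2. warp to canonical coordinates
        feature = embed(face)                     # 3. extract a deep feature
        results.append(match(feature, gallery))   # 4. match against the gallery
    return results
```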

Development of FR feature representation

  • In the 1990s and early 2000s, a holistic approach dominated FR
  • From the early 2000s to the early 2010s, local feature-based FR and learning-based local descriptors were developed
  • In 2014, DeepFace and DeepID achieved state-of-the-art accuracy, and the research focus shifted to deep learning-based approaches

Development of FR loss function

  • DeepFace and DeepID in 2014 marked the birth of deep learning-based FR, trained with Softmax Loss
  • After 2015, Euclidean distance-based losses played an important role among loss functions, such as Contrastive Loss, Triplet Loss and Center Loss
  • In 2017, feature and weight normalization began to show excellent performance, leading to Softmax variants such as L2-Softmax
  • In 2016 and 2017, large-margin losses further advanced large-margin feature learning, such as L-Softmax, A-Softmax, CosFace and ArcFace

Red, green, blue and yellow represent the Softmax-based deep methods, the Euclidean distance-based loss methods, the Softmax-variant methods, and the angle/cosine margin-based loss methods, respectively.

FR backbone network development

  • The architecture of deep FR has always followed the network structures of deep object classification, evolving from AlexNet to SENet

2. The evolution of FR data sets

  • Prior to 2007, FR’s early work focused on constrained and small-scale data sets.
  • The introduction of LFW datasets in 2007 marked the beginning of FR under unconstrained conditions. Since then, more test databases with different tasks and scenarios have been designed.
  • In 2014, CASIA-WebFace provided the first widely available large-scale public training data set.
  • Red rectangles represent training datasets and other colored rectangles represent test datasets with different tasks and scenarios

Common FR data sets used for training

Common FR data sets used for testing

Data set list

LFW face recognition data set

  • An unconstrained natural-scene face recognition dataset consisting of 13,233 face images of public figures collected from the Internet (with varying poses, expressions and lighting). Of the 5,749 identities, 1,680 have two or more images, only 85 have more than 15, and 4,069 have just one. Each face image is identified by a unique name, ID and serial number.
  • The LFW benchmark mainly tests face verification accuracy: 6,000 face pairs are randomly selected, 3,000 of which are two photos of the same person and 3,000 are photos of two different people (a minimal scoring sketch follows this list).
  • vis-www.cs.umass.edu/lfw/
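A minimal sketch of scoring the 6,000 LFW pairs with a cosine-similarity threshold. This simplifies the official protocol, which uses 10-fold cross-validation with the threshold chosen on held-out folds; the single threshold sweep here is an illustrative assumption:

```python
import numpy as np

def lfw_accuracy(emb1, emb2, same, thresholds=np.linspace(-1.0, 1.0, 400)):
    """emb1, emb2: (6000, D) L2-normalized embeddings of each pair;
    same: (6000,) bool, True if the pair shows the same person."""
    sims = (emb1 * emb2).sum(axis=1)                      # cosine similarity per pair
    accs = [((sims >= t) == same).mean() for t in thresholds]
    return float(max(accs))                               # accuracy at best threshold
```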

FDDB face detection dataset

  • An unconstrained natural-scene face detection dataset containing 5,171 faces in 2,845 images taken from various natural scenes. Each face has a specified coordinate position.
  • The FDDB benchmark mainly tests face detection accuracy. An algorithm must detect the faces in each image and output their positions; then, against the ground truth provided by the dataset, the numbers of correctly and wrongly detected faces are counted to judge the quality of the detector.
  • vis-www.cs.umass.edu/fddb/

CelebA (CelebFaces, CelebFaces+) Face attribute recognition data set

  • The Large-scale CelebFaces Attributes (CelebA) dataset, published by Professor Tang Xiaoou's lab at the Chinese University of Hong Kong, is a large face dataset mainly used for face attribute recognition: 202,599 face images of 10,177 identities, each image annotated with 5 landmark locations and 40 binary attributes.

YouTube Faces (YTF)

  • YouTube Faces is for face verification in video: the algorithm must determine whether two videos show the same person. Many methods that work well on photos may be ineffective or inefficient on video, where image quality is poorer. The dataset contains 3,425 videos of 1,595 people; the shortest clip is 48 frames, the longest 6,070 frames, and the average clip length is 181.3 frames.

CASIA-WebFace

  • Released by the Chinese Academy of Sciences (CASIA), this was the first widely used large-scale public training data set for FR: 494,414 images of 10,575 identities.

IJB-A data set

  • The IJB-A (IARPA Janus Benchmark A) dataset includes not only still images of its subjects but also video clips. Because of this, it introduces the concept of a template: the collection of all face media of a subject gathered under unconstrained conditions, including both still images and video clips.
  • All media in the dataset were collected in completely unconstrained environments; many subjects were captured with wildly varying facial pose, wildly varying lighting and different image resolutions.
  • Its disadvantage is its small size: IJB-A contains only 5,396 still images and 20,412 video frames from 500 subjects.

MegaFace data set

  • The MegaFace dataset, published by the University of Washington, includes 690,572 identities and about 4.7 million images, pushing the scale of face data to a new level.
  • The dataset is set up differently: dozens of images each of celebrities, plus one million images of ordinary people as distractor data. It leans more toward face verification under heavy noise than toward identification, and the data distribution is unbalanced, with only 7 images per identity on average and little variation within each identity.

MS-Celeb-1M data set

  • Released by Microsoft Research Asia, the dataset contains 100,000 identities and about 10 million images, the largest face recognition data set to date. Despite its scale, the distribution is unbalanced: faces with large pose variation account for a small proportion, and there is a lot of noisy data.
  • From 1M celebrities, the 100K most popular were selected; a search engine then retrieved about 100 images per person, for a total of 100K × 100 = 10M images. The test set consists of 1,000 celebrities randomly sampled from the 1M, labeled by Microsoft; each has about 20 images, none of which can be found online.
  • MSR IRC is one of the largest and highest-level image recognition competitions in the world, initiated by Zhang Lei, leader of the Image Analysis and Big Data Mining research group at MSRA (Microsoft Research Asia), and held regularly every year.

3. Siamese (twin) network architecture

  • A Siamese network is a “conjoined” neural network, and the Siamese architecture is a framework: the “conjoining” is realized by weight sharing. The name comes from Siamese twins. Siamese networks are used when the two inputs are “relatively similar”, such as measuring the semantic similarity of two sentences or words.
  • The left and right branch networks have exactly the same weights; in code they can even be the same network, so there is no need to implement a second copy. Both branches can be LSTMs or CNNs.
  • Pseudo-Siamese network: if the left and right sides do not share weights but are two different neural networks, the result is a pseudo-Siamese network. Its two branches can be different network types (e.g., one LSTM and one CNN) or the same type. Pseudo-Siamese networks are useful when the two inputs are “somewhat different”, e.g., verifying whether a title is consistent with a body of text (their lengths differ greatly), or whether a piece of text describes an image (one input is an image, the other text).
  • The purpose of a Siamese network is to measure the similarity of two inputs: the two inputs are fed into the left and right branches, which map them into vectors in a new space; similarity is then judged in that space via cosine distance, an EXP function, Euclidean distance, etc. Through training on the loss, the distance D between similar images decreases and the distance D between dissimilar images increases (a minimal sketch follows this list).
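A minimal PyTorch sketch of the Siamese architecture just described: one shared-weight encoder applied to both inputs, then a distance in the new space. The convolution and embedding sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Shared weights: the encoder is defined once and applied to both inputs.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x1, x2):
        z1 = F.normalize(self.encoder(x1))   # same encoder for both branches
        z2 = F.normalize(self.encoder(x2))
        return F.pairwise_distance(z1, z2)   # similarity D in the new space
```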

  • The traditional Siamese network uses Contrastive Loss. Other loss functions are possible; Softmax is certainly a choice, but not necessarily the optimal one, even for classification problems. The figure below uses the contrastive loss:
  • For neighboring (similar) pairs, pay the squared penalty D², which pulls the two outputs together
  • For non-neighboring (dissimilar) pairs, pay the squared penalty max(0, m − D)², which pushes the outputs apart until D exceeds the margin m, at which point the loss is 0 (see the sketch below)
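A hedged sketch of the contrastive loss described above; the margin value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(distance, y, margin=1.0):
    """distance: (B,) pairwise distances D; y: (B,) 1.0 = similar, 0.0 = dissimilar."""
    pos = y * distance.pow(2)                          # pull similar pairs together
    neg = (1 - y) * F.relu(margin - distance).pow(2)   # push dissimilar apart until D > m
    return 0.5 * (pos + neg).mean()
```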

  • There are many applications in both NLP and CV fields:
  • In 1993, Yann LeCun used a Siamese network for signature verification, i.e., verifying whether the signature on a check matches the bank's signature on file. NIPS 1993: Signature Verification using a “Siamese” Time Delay Neural Network
  • In 2010, Nair and Hinton used it for face verification with very good results: two faces are fed into convolutional networks and the output is same/different, a binary classification. Rectified Linear Units Improve Restricted Boltzmann Machines
  • Handwriting recognition
  • Visual tracking based on Siamese networks has also become a hot topic: Fully-Convolutional Siamese Networks for Object Tracking
  • Semantic similarity analysis of words and matching of question and answer in QA
  • In a Kaggle question-pair competition that judges whether two questions are the same, the winning team used handcrafted features plus a Siamese network. Reference: ref 1

4. Loss function of FR

  • Label prediction (the last fully connected layer) acts like a linear classifier, so for closed-set classification the deep features only need to be separable; Softmax loss can then solve the classification problem directly.
  • For face recognition tasks, however, deep features need to be not only separable but also discriminative, so they can generalize to identify unseen categories without label prediction.

For details on the Euclidean distance-based loss functions (Contrastive Loss, Triplet Loss, Center Loss) and the Angular Margin-based loss functions (L-Softmax Loss, A-Softmax Loss, COCO Loss, CosFace Loss, ArcFace Loss), see the following chapters.