Face detection and face recognition

Face detection is the first step of face recognition and face processing. It detects and locates the faces in an image and returns high-precision face bounding-box coordinates and facial landmark coordinates. Face recognition then extracts the identity features contained in each face and compares them with known faces in order to identify each person. The application scenarios of face detection and recognition have gradually moved from indoor to outdoor, and from single constrained scenes to squares, stations, subway entrances and other open scenes, so the requirements keep rising: face sizes vary widely, faces may be numerous, poses are diverse (including faces shot from above, faces wearing hats and masks, exaggerated expressions, makeup and camouflage), lighting conditions can be poor, and resolution can be so low that even the naked eye can hardly distinguish the face. With the development of deep learning, face detection and recognition methods based on deep learning have achieved great success. This article mainly introduces the deep learning model MTCNN for face detection and the deep learning model FaceNet for face recognition.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li and Yu Qiao proposed the MTCNN (Multi-task Cascaded Convolutional Networks) model for face detection in 2016. MTCNN is a multi-task face detection framework that uses a cascade of three CNNs to perform face detection and facial landmark detection at the same time. The detection effect is shown in the figure below:

Google engineers Florian Schroff, Dmitry Kalenichenko and James Philbin proposed the FaceNet model for face recognition. Instead of using a traditional Softmax classifier, FaceNet learns a mapping from face images to a Euclidean embedding space and then performs face recognition, face verification and face clustering on top of that embedding. The face recognition effect is shown in the figure below. The number on each horizontal line is the distance between the two faces; when the distance is less than 1.06, the faces can be regarded as the same person.

MTCNN model

MTCNN is a deep learning model for face detection based on multi-task cascaded CNNs, which jointly performs face bounding-box regression and facial landmark detection. The overall network architecture of MTCNN is shown in the figure below:

First, the input image is scaled to a series of different sizes to form an image pyramid. PNet produces candidate face windows and the bounding-box regression vectors of the face regions; the candidate windows are then calibrated with the regression vectors, and highly overlapping candidates are merged by non-maximum suppression (NMS). RNet takes the candidate boxes produced by PNet, refines them with its own bounding-box regression values, and again uses NMS to remove overlapping windows. ONet works in a similar way to RNet, except that while removing the remaining overlapping candidate windows it also outputs the positions of five facial landmarks.
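Non-maximum suppression is used after every stage of the cascade. As a minimal illustration (a generic NumPy implementation, not the MTCNN project's own code), greedy NMS can be written as follows, assuming each box is given as (x1, y1, x2, y2, score):

import numpy as np

def nms(boxes, threshold=0.7):
    """Greedy non-maximum suppression.

    boxes: N x 5 array, each row (x1, y1, x2, y2, score).
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2, scores = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3], boxes[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap between the current best box and all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        # Keep only boxes whose IoU with the current box is below the threshold.
        order = order[1:][iou <= threshold]
    return keep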

The face detection training data can be downloaded from mmlab.ie.cuhk.edu.hk/projects/WI… A total of 93,703 faces are annotated, as shown below:

The format of the tag file is as follows:

# File name
File name
# bounding box
x1, y1, w, h, blur, expression, illumination, invalid, occlusion, pose

Here x1, y1 are the coordinates of the upper-left corner of the face box, w and h are the width and height of the box, and blur, expression, illumination, invalid, occlusion and pose are attributes of the face, indicating whether it is blurred, its expression, the illumination condition, whether the annotation is invalid, whether the face is occluded, and its pose.
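As a small illustration (not code from any of the referenced projects), one annotation line with the ten fields listed above can be parsed like this in Python:

FIELDS = ["x1", "y1", "w", "h", "blur", "expression",
          "illumination", "invalid", "occlusion", "pose"]

def parse_face_annotation(line):
    """Parse one bounding-box line of the annotation file into a dict."""
    values = [int(v) for v in line.split()]
    return dict(zip(FIELDS, values))

# Hypothetical example: a face box at (449, 330) of size 122 x 149 with all attributes 0.
box = parse_face_annotation("449 330 122 149 0 0 0 0 0 0")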

Training data for facial landmark detection can be downloaded from http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm. The dataset contains 5,590 images from the LFW dataset and 7,876 images downloaded from the web, as follows:

The format of the tag file is:

In each line, the first field is the file name; the second and third fields are the left and right x-coordinates of the face box; the fourth and fifth fields are the top and bottom y-coordinates of the box; the remaining ten fields are the (x, y) coordinates of five facial landmarks: left eye, right eye, nose, left mouth corner and right mouth corner.

lfw_5590\Abbas_Kiarostami_0001.jpg 75 165 87 177 106.750000 108.250000 143.750000 108.750000 131.250000 127.250000 106.250000 155.250000 142.750000 155.250000

PNet is a fully convolutional network; its structure is shown in the following figure:

The training input of PNet is a 12*12 image, so PNet training data must be generated before training. Candidate bounding boxes are produced by sliding windows or random cropping and labeled by their IoU (intersection over union) with the ground truth boxes. The training data are divided into three kinds: positive samples, whose IoU with a ground truth box is greater than 0.65; negative samples, whose IoU is less than 0.3; and part (intermediate) samples, whose IoU is between 0.4 and 0.65.
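A minimal sketch of this labeling rule, assuming candidate and ground-truth boxes are given as (x1, y1, x2, y2); the helper names are illustrative, not taken from the original code:

import numpy as np

def iou(box, gt_boxes):
    """IoU between one candidate box and an array of ground-truth boxes (x1, y1, x2, y2)."""
    box_area = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0] + 1) * (gt_boxes[:, 3] - gt_boxes[:, 1] + 1)
    xx1 = np.maximum(box[0], gt_boxes[:, 0])
    yy1 = np.maximum(box[1], gt_boxes[:, 1])
    xx2 = np.minimum(box[2], gt_boxes[:, 2])
    yy2 = np.minimum(box[3], gt_boxes[:, 3])
    w = np.maximum(0, xx2 - xx1 + 1)
    h = np.maximum(0, yy2 - yy1 + 1)
    inter = w * h
    return inter / (box_area + gt_area - inter)

def label_sample(box, gt_boxes):
    """Assign a PNet training label based on the best IoU with any ground-truth box."""
    best_iou = iou(box, gt_boxes).max()
    if best_iou > 0.65:
        return "positive"
    if best_iou < 0.3:
        return "negative"
    if best_iou > 0.4:
        return "part"        # intermediate sample
    return "ignore"          # IoU between 0.3 and 0.4: not used as any of the three kinds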

Next, the cropped bounding boxes are resized to 12*12, giving 12*12*3 inputs for PNet training. Ten 3*3*3 convolution kernels followed by 2*2 max pooling (stride=2) produce ten 5*5 feature maps. Sixteen 3*3*10 convolution kernels then produce sixteen 3*3 feature maps, and thirty-two 3*3*16 convolution kernels produce thirty-two 1*1 feature maps. Finally, from the 32 1*1 feature maps, two 1*1*32 convolution kernels produce two 1*1 feature maps for face classification, four 1*1*32 convolution kernels produce four 1*1 feature maps for bounding-box regression, and ten 1*1*32 convolution kernels produce ten 1*1 feature maps for facial landmark prediction.
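The structure described above can be sketched in Keras roughly as follows (an illustrative reimplementation, not the original code; PReLU activations, which MTCNN uses after its convolution layers, are assumed):

from tensorflow.keras import layers, Model

def build_pnet():
    inp = layers.Input(shape=(12, 12, 3))                    # 12*12*3 training patches
    x = layers.Conv2D(10, 3)(inp)                            # 10 kernels of 3*3*3 -> 10*10*10
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)       # -> 5*5*10
    x = layers.Conv2D(16, 3)(x)                              # 16 kernels of 3*3*10 -> 3*3*16
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.Conv2D(32, 3)(x)                              # 32 kernels of 3*3*16 -> 1*1*32
    x = layers.PReLU(shared_axes=[1, 2])(x)
    face_cls = layers.Conv2D(2, 1, activation="softmax")(x)  # face / non-face classification
    bbox_reg = layers.Conv2D(4, 1)(x)                        # bounding-box regression
    landmark = layers.Conv2D(10, 1)(x)                       # five facial landmarks (x, y)
    return Model(inp, [face_cls, bbox_reg, landmark])

Because PNet is fully convolutional, the 12*12 input size is only needed for training; at inference time the same weights can be slid over each level of the image pyramid.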

The model structure of RNet is as follows:

The model input is a 24*24 image. Twenty-eight 3*3*3 convolution kernels followed by 3*3 max pooling (stride=2) produce twenty-eight 11*11 feature maps; forty-eight 3*3*28 convolution kernels followed by 3*3 max pooling (stride=2) produce forty-eight 4*4 feature maps; sixty-four 2*2*48 convolution kernels then produce sixty-four 3*3 feature maps. The 3*3*64 feature maps are flattened into a fully connected layer of size 128. On top of it, face classification uses a fully connected layer of size 2, bounding-box position regression uses a fully connected layer of size 4, and facial landmark regression uses a fully connected layer of size 10.
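A corresponding Keras sketch of RNet (again illustrative; the pooling padding is chosen so that the feature-map sizes match the description above):

from tensorflow.keras import layers, Model

def build_rnet():
    inp = layers.Input(shape=(24, 24, 3))
    x = layers.Conv2D(28, 3)(inp)                                        # 28 kernels of 3*3*3 -> 22*22*28
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)   # -> 11*11*28
    x = layers.Conv2D(48, 3)(x)                                          # 48 kernels of 3*3*28 -> 9*9*48
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2)(x)                   # -> 4*4*48
    x = layers.Conv2D(64, 2)(x)                                          # 64 kernels of 2*2*48 -> 3*3*64
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)                                             # fully connected layer of size 128
    x = layers.PReLU()(x)
    face_cls = layers.Dense(2, activation="softmax")(x)                  # face / non-face classification
    bbox_reg = layers.Dense(4)(x)                                        # bounding-box regression
    landmark = layers.Dense(10)(x)                                       # facial landmark regression
    return Model(inp, [face_cls, bbox_reg, landmark])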

ONet is the last network in MTCNN and produces the final output. The generation of ONet training data is similar to that of RNet: the training samples are the bounding boxes detected by the PNet and RNet networks, again divided into positive, negative and part (intermediate) samples. The model structure of ONet is as follows:

The model input is a 48*48*3 image. Thirty-two 3*3*3 convolution kernels followed by 3*3 max pooling (stride=2) produce thirty-two 23*23 feature maps; sixty-four 3*3*32 convolution kernels followed by 3*3 max pooling (stride=2) produce sixty-four 10*10 feature maps; sixty-four 3*3*64 convolution kernels followed by 3*3 max pooling (stride=2) produce sixty-four 4*4 feature maps; one hundred and twenty-eight 2*2*64 convolution kernels then produce one hundred and twenty-eight 3*3 feature maps. A fully connected layer of size 256 follows, from which face classification features of size 2, bounding-box regression features of size 4 and facial landmark regression features of size 10 are produced.
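And an analogous illustrative sketch of ONet:

from tensorflow.keras import layers, Model

def build_onet():
    inp = layers.Input(shape=(48, 48, 3))
    x = layers.Conv2D(32, 3)(inp)                                        # 32 kernels of 3*3*3 -> 46*46*32
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)   # -> 23*23*32
    x = layers.Conv2D(64, 3)(x)                                          # 64 kernels of 3*3*32 -> 21*21*64
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2)(x)                   # -> 10*10*64
    x = layers.Conv2D(64, 3)(x)                                          # 64 kernels of 3*3*64 -> 8*8*64
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)   # -> 4*4*64
    x = layers.Conv2D(128, 2)(x)                                         # 128 kernels of 2*2*64 -> 3*3*128
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256)(x)                                             # fully connected layer of size 256
    x = layers.PReLU()(x)
    face_cls = layers.Dense(2, activation="softmax")(x)                  # face / non-face classification
    bbox_reg = layers.Dense(4)(x)                                        # bounding-box regression
    landmark = layers.Dense(10)(x)                                       # five facial landmarks
    return Model(inp, [face_cls, bbox_reg, landmark])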

MTCNN model inference

The inference process of MTCNN is shown in the figure below:

The original image is fed into PNet to generate candidate bounding boxes. The original image and the bounding boxes from PNet are then fed into RNet to generate corrected bounding boxes. Finally, the original image and the bounding boxes from RNet are fed into ONet, which generates the final corrected bounding boxes and the facial landmark points. The execution process is as follows, with a consolidated sketch after the list:

  1. Read the input image: image = cv2.imread(image_path)
  2. Load the trained model parameters and build the detector object: detector = MtcnnDetector(...)
  3. Perform inference: all_boxes, landmarks = detector.detect_face(image)
  4. Draw each detected face box on the image, e.g. cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)
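Putting these steps together (a sketch that assumes the MtcnnDetector class and its detect_face interface from the MTCNN-Tensorflow project in reference [2]; how the detector object is constructed depends on that code base):

import cv2

image = cv2.imread("test.jpg")

# 'detector' is an MtcnnDetector built from the trained PNet/RNet/ONet models;
# its construction is implementation-specific (see reference [2]).
all_boxes, landmarks = detector.detect_face(image)

# Draw each detected face box (here assumed to be x1, y1, x2, y2, score) on the image.
for box in all_boxes:
    x1, y1, x2, y2 = [int(v) for v in box[:4]]
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)
cv2.imwrite("result.jpg", image)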

FaceNet model

FaceNet is mainly used to verify whether two faces belong to the same person and to recognize who a person is from the face. The main idea of FaceNet is to map face images into a multi-dimensional space and to express the similarity of faces through spatial distance: images of the same person lie close together, while images of different people lie far apart, so face recognition can be realized through this spatial mapping of face images. FaceNet uses a deep convolutional neural network to learn the image mapping and a triplet-based loss function to train it, and the network directly outputs a 128-dimensional embedding vector.

FaceNet training data can be downloaded from http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html; it contains 10,575 people and a total of 453,453 images. The validation dataset can be downloaded from http://vis-www.cs.umass.edu/lfw/ and contains 13,000 images. The organization of the training data is shown below, where each directory name is a person's name and the files under the directory are that person's photos.

Aaron_Eckhart
    Aaron_Eckhart_0001.jpg

Aaron_Guiel
    Aaron_Guiel_0001.jpg

Aaron_Patterson
    Aaron_Patterson_0001.jpg

Aaron_Peirsol
    Aaron_Peirsol_0001.jpg
    Aaron_Peirsol_0002.jpg
    Aaron_Peirsol_0003.jpg
    Aaron_Peirsol_0004.jpg
    ...

Then each image in the training data is preprocessed, and faces are detected through MTCNN model to generate training data of FaceNet, as shown in the figure below:

The corresponding data structure is as follows:

Aaron_Eckhart
    Aaron_Eckhart_0001_face.jpg
Aaron_Guiel
    Aaron_Guiel_0001_face.jpg
...

The network structure of FaceNet is shown in the figure below:

Batch represents a batch of face training data, which is fed into a deep convolutional neural network; L2 normalization is then applied to obtain the feature representation (embedding) of each face image, and finally the Triplet Loss function is computed.

The following figure shows the Inception architecture of the deep convolutional neural network used in FaceNet:

Triplet Loss is used at the end of the model structure instead of a conventional classification loss. The motivation is that a traditional loss function only tends to map face images with the same features to the same region of the space, whereas Triplet Loss explicitly tries to separate one person's face images from everyone else's. A triplet consists of three samples, such as (anchor, pos, neg), judged by their distance relation: in as many triplets as possible, the distance between the anchor and the positive example should be smaller than the distance between the anchor and the negative example, as shown in the figure below:

It can be expressed as:
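In the notation of the FaceNet paper [3], with embedding f(x), anchor x_i^a, positive example x_i^p, negative example x_i^n and margin \alpha, the triplet loss is

L = \sum_{i=1}^{N} \Big[ \, \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \, \Big]_+

where [\cdot]_+ denotes max(\cdot, 0), so a triplet only contributes to the loss when the negative is not farther from the anchor than the positive by at least the margin \alpha.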

During the training of each mini-batch, reasonable triplets must be selected to compute the triplet loss. If a brute-force approach were used to find the hardest negative and the hardest positive over all samples before optimizing, the search time would be far too long, and mislabeled images would make training convergence difficult. Instead, triplets are generated online: within each mini-batch, all anchor-positive pairs are found, and then hard negative samples are selected for each anchor-positive pair. The main process is as follows (a sketch follows the list):

  1. At the beginning of each mini-batch, face photos are sampled from the training dataset, for example a fixed number of people per batch and a fixed number of pictures per person.
  2. The embeddings of the sampled images are computed with the current network model, so that triplets can be formed by computing the Euclidean distances between embeddings.
  3. The triplet loss is computed from the selected triplets, the model is optimized, and the embeddings are updated.
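A simplified NumPy sketch of this online selection (illustrative only; it picks, for each anchor-positive pair in the batch, the hardest negative, whereas the FaceNet paper prefers semi-hard negatives):

import numpy as np

def select_triplets(embeddings, labels):
    """Pick (anchor, positive, hard-negative) index triplets inside one mini-batch.

    embeddings: N x 128 array of L2-normalized embeddings.
    labels:     length-N array of person identities.
    """
    # Pairwise squared Euclidean distances between all embeddings in the batch.
    dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    triplets = []
    for a in range(len(labels)):
        for p in range(len(labels)):
            if p == a or labels[p] != labels[a]:
                continue                       # keep only anchor-positive pairs
            neg_mask = labels != labels[a]
            if not neg_mask.any():
                continue
            # Hardest negative for this anchor: the closest embedding of another person.
            neg_ids = np.where(neg_mask)[0]
            n = neg_ids[np.argmin(dists[a, neg_ids])]
            triplets.append((a, p, n))
    return triplets

The selection is recomputed on every mini-batch, since the embeddings keep moving as the model is optimized.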

FaceNet model inference

The inference process of the FaceNet model is as follows (a small sketch follows the list):

  1. Face images are extracted from the photos with the MTCNN face detection model.
  2. Each face image is fed into FaceNet to compute its embedding feature vector.
  3. The Euclidean distance between the feature vectors is compared to decide whether they belong to the same person; for example, when the feature distance is less than 1 the faces are considered the same person, and when it is greater than 1 they are considered different people.
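A minimal sketch of step 3, assuming emb1 and emb2 are the 128-dimensional embeddings produced by FaceNet for two aligned face crops (the exact threshold depends on the trained model and the validation set):

import numpy as np

def is_same_person(emb1, emb2, threshold=1.0):
    """Compare two FaceNet embeddings by Euclidean distance."""
    distance = np.linalg.norm(emb1 - emb2)
    return distance < threshold, distance

# Example with random vectors standing in for real embeddings.
same, dist = is_same_person(np.random.rand(128), np.random.rand(128))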

conclusion

This article first introduced face detection and face recognition: face detection locates the faces in an image, while face recognition identifies whose faces they are. It then explained the main idea of the MTCNN model and analyzed its key techniques, including the training data, the network architecture, PNet, RNet, ONet and model inference. It then explained the main ideas and key techniques of the FaceNet model, including the training data, the network structure, the loss function and triplet selection. Readers can apply the MTCNN and FaceNet model architectures to face detection and recognition scenarios in industry.

reference

[1] MTCNN: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks.

[2] https://github.com/AITTSMD/MTCNN-Tensorflow

[3] FaceNet: A Unified Embedding for Face Recognition and Clustering.

[4] https://github.com/davidsandberg/facenet

About the author

Wu Wei (WeChat: Allawnweiwu): PhD, currently an architect at IBM, mainly engaged in research and development of deep learning platforms and applications in the big data field.