Face detection with MTCNN: an analysis

The authors' source code (test code only; the results are quite good): kpzhang93.github.io/MTCNN_face_…

A student implemented MTCNN training code based on MXNet; the work is fairly complete and a valuable reference: github.com/Seanlinx/mt…

— — — — — — Pipeline — — — — — —

The figure above shows the pipeline of this method. Like CascadeCNN, covered in my earlier post, it is a three-stage cascade.

Stage 1: on top of an image pyramid, a fully convolutional network performs detection, and the candidates are refined with bounding-box regression and NMS. (Note: the fully convolutional network here is not the deconvolution-style FCN used for segmentation; it simply means the network contains only convolutional layers, so it accepts input of any size, and the sliding window is realized automatically by the network stride.)
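As a minimal MATLAB sketch, this is how such a scale pyramid can be built; minsize = 20 and factor = 0.709 are illustrative defaults taken from common MTCNN demos, not values stated in this post:

% Build the P-Net scale pyramid for an h x w image.
minsize = 20;             % smallest face side to detect, in pixels (assumed)
factor  = 0.709;          % scale step between pyramid levels (assumed)
h = 480; w = 640;         % example input size

m = 12 / minsize;         % map minsize onto P-Net's 12x12 input window
minl = min(h, w) * m;
scales = [];
while minl >= 12
    scales(end+1) = m;    % resize ratio for this pyramid level
    m = m * factor;
    minl = minl * factor;
end
% Each scale resizes the image once; the fully convolutional P-Net then
% covers it in a single forward pass, the stride acting as the slide window.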

Stage 2: every window proposed by Stage 1 is evaluated again by a second network, once more followed by bounding-box regression and NMS.

Stage 3: similar to Stage 2, but with a stronger constraint: the five facial landmark points.

— — — — — — Network — — — — — —

Stage1: Proposal Net

Stage2: Refine Net

Stage3: Output Net

As can be seen above, its networks are slightly deeper than CascadeCNN's, but each layer has fewer parameters, so this method performs better while running at nearly the same speed as CascadeCNN.

Supplement:

(1) Training uses an online hard example mining strategy: within each mini-batch, only the top 70% of samples ranked by loss are used for backpropagation.

(2) The weights of the classification, bounding-box regression, and landmark losses differ across the three stages;

(3) Four types of training data are used, in a ratio of 3:1:1:2: negatives (IoU < 0.3), positives (IoU > 0.65), part faces (IoU between 0.4 and 0.65), and landmark faces (crops with five annotated landmark points).

— — — — — — Results — — — — — —

FDDB performance:

Speed: about 15 FPS on CPU.

 

The MTCNN algorithm comes from Prof. Yu Qiao's group at the Shenzhen Institutes of Advanced Technology; the paper appeared in 2016 [3] (the group had at least one CVPR and one ECCV paper that year).

 

 

Now, down to business.

Theoretical Basis:

 

As shown in the figure above, MTCNN consists of three networks (P-Net, R-Net, and O-Net).

Proposal Network (P-Net): this network produces candidate face windows together with their bounding-box regression vectors. The candidates are calibrated with those vectors, and highly overlapping candidates are then merged by non-maximum suppression (NMS).

Refine Network (R-Net): this network removes false-positive regions, again using bounding-box regression and NMS.

Unlike P-Net, R-Net ends with a fully connected layer, which gives it stronger suppression of false positives.

Output Network (O-Net): this network has one more convolutional layer than R-Net, so its results are more refined. It plays the same role as R-Net, but adds stronger supervision of the face region and also outputs the five facial landmarks.

 

The detailed network structure is shown in the figure below:

The more detailed prototxt network definitions are det1, det2, and det3, respectively.

det1.prototxt structure:

det2.prototxt structure:

det3.prototxt structure:

 

Training:

MTCNN training consists of three tasks: face/non-face classification, bounding-box regression, and landmark localization.

Face classification:

 

Face classification uses a cross-entropy loss, where p_i is the predicted probability that sample i is a face and y_i^det is the ground-truth label.
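In LaTeX form (the standard cross-entropy, as defined in the paper [3]):

L_i^{det} = -\left( y_i^{det} \log p_i + (1 - y_i^{det}) \log(1 - p_i) \right), \qquad y_i^{det} \in \{0, 1\}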

Bounding-box regression:

 

The regression loss is the squared Euclidean distance between the prediction ŷ obtained from the network and the ground-truth box coordinates y. Each y is a 4-tuple: (left-top x, left-top y, height, width).
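In LaTeX form (following the paper's notation [3]):

L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2, \qquad y_i^{box} \in \mathbb{R}^4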

Landmark localization:

 

As with box regression, the loss is the squared Euclidean distance between the landmark positions ŷ predicted by the network and the ground-truth landmarks y, and this distance is minimized. Since there are five points with two coordinates each, y is a 10-tuple.
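In LaTeX form [3]:

L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2, \qquad y_i^{landmark} \in \mathbb{R}^{10}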

Training for multiple input sources:

 

The whole training process minimizes a weighted sum of these losses, where N is the number of training samples, α_j is the importance of task j, β_i^j ∈ {0, 1} indicates whether sample i participates in task j, and L_i^j is the corresponding loss above.
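In LaTeX form, with the task weights given in the paper [3] (α_det = 1, α_box = 0.5, α_landmark = 0.5 for P-Net and R-Net; α_det = 1, α_box = 0.5, α_landmark = 1 for O-Net):

\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^j \, L_i^j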

During training, to achieve better results, the authors backpropagate only the gradients of the top 70% of samples (ranked by loss) at each step, which ensures the propagated signal comes from informative, hard samples. This is somewhat reminiscent of hard example mining in latent SVM, but the authors implement it end-to-end within deep learning.
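A minimal MATLAB sketch of this online hard example mining for one mini-batch (the variable names are assumptions; loss stands for the per-sample losses from the forward pass):

% Online hard example mining: keep only the hardest 70% of a batch.
keep_ratio = 0.7;
loss = rand(128, 1);                          % placeholder per-sample losses
[~, order] = sort(loss, 'descend');           % hardest samples first
keep = order(1:ceil(keep_ratio * numel(loss)));
mask = false(size(loss));
mask(keep) = true;                            % only these samples backprop
batch_loss = sum(loss(mask)) / nnz(mask);     % gradient ignores the easy 30%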

During training, samples are categorized by the IoU (Intersection-over-Union) between the candidate crop and the ground-truth box as follows:

0-0.3: Non-face

0.65-1.00: face

0.4-0.65: Part face

0.3-0.4: landmark (note: in the paper itself, landmark faces are defined as crops with five annotated landmark points rather than by an IoU range)

Proportion of training samples: negative : positive : part : landmark = 3:1:1:2 (the ratio given in the paper).
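A minimal sketch of how a candidate crop's type can be assigned from its IoU with the nearest ground-truth box (thresholds as listed above; variable names are assumptions):

% Assign a training sample type from the crop's IoU with ground truth.
iou = 0.5;                          % example value
if iou < 0.3
    sample_type = 'negative';       % face/non-face classification, label 0
elseif iou > 0.65
    sample_type = 'positive';       % classification (label 1) + box regression
elseif iou >= 0.4
    sample_type = 'part';           % box regression only
else
    sample_type = 'unused';         % 0.3 <= IoU <= 0.4
end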

Installation steps:

Caffe – Windows Installation:

blog.csdn.net/qq_14845119…

Installing Piotr's Toolbox (pdollar toolbox):

 

Piotr's Toolbox was written by Piotr Dollar of UCSD and focuses on feature extraction and classification algorithms related to object recognition and detection. It is a specialized, well-crafted toolbox, mainly containing the algorithms from Dollar's object detection papers, and should be very useful for object recognition research. Its image and matrix utilities also serve as a complement to MATLAB's Image Processing Toolbox. The main modules are:

* channels: image feature extraction, including HOG and others. Dollar's research proposed channel features, so this module covers the basic algorithms feature extraction needs, such as gradients and convolution.
* classify: some fast classification-related algorithms, including random ferns, RBF functions, and PCA.
* detector: the detection algorithms corresponding to channel features.
* filters: some conventional image filters.
* images: some conventional image and video operations, with many very practical functions.
* matlab: general MATLAB utilities for matrix computation, display, and variable manipulation; very practical.
* videos: some conventional video manipulation functions.

 

Download link: github.com/pdollar/too…

After downloading the toolbox, unpack it into any directory, such as E:\MATLAB\MATLAB Production Server\toolbox.

Enter the following at the MATLAB command line:

 

addpath(genpath('toolbox-masterROOT')); savepath;

 

This adds the unpacked directory to the MATLAB path, where toolbox-masterROOT stands for the directory you unpacked to. For example, for E:\MATLAB\MATLAB Production Server\toolbox the call becomes addpath(genpath('E:\MATLAB\MATLAB Production Server\toolbox')); savepath;

 

With that, Piotr's Image & Video MATLAB Toolbox is installed.

 

Next, add Caffe's library directory to the system Path environment variable; that is, append your own Caffe path as in the example below.

 

Open demo.m and modify caffe_path, pdollar_toolbox_path, and caffe_model_path. In addition, since my computer has no GPU, I changed the demo to CPU mode as follows.
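The modification presumably looks like the following sketch; in the matcaffe interface the CPU/GPU switch is made with these calls:

% Run Caffe on the CPU instead of the GPU.
caffe.set_mode_cpu();
% instead of:
% caffe.set_mode_gpu();
% caffe.set_device(0);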

 

Experimental results:

The total running time was 1.2 s for 18 detected faces, i.e. roughly 66 ms per face (release build). Judging from the results, both detection and alignment are very good. In my experience, Face++ has the best alignment overall; among the remaining open-source methods MTCNN is the best, followed by SDM.

 

According to the authors' evaluation on FDDB, WIDER FACE, and AFLW, accuracy is roughly 95%. Clearly, MTCNN is very strong in both performance and efficiency.

 

The experimental results show that the second alignment in the second row of the figure above is wrong, so I made minor changes to the program. The actual result is shown in the figure below; both runtime and quality improved.

Download link: download.csdn.net/detail/qq_1…

Later, with help from colleagues at the company, I finally finished a C-language version. Honestly, it was not easy, with many detours along the way. I am posting a result image here to remember those hard times with a smile.

References:

[1] kpzhang93.github.io/MTCNN_face_…

[2] github.com/kpzhang93/M…

[3] Zhang K., Zhang Z., Li Z., et al. Joint Face Detection and Alignment Using Multi-task Cascaded Convolutional Networks. arXiv preprint arXiv:1604.02878, 2016.

 

Experiments

 

Three datasets are mainly used in this work: FDDB, WIDER FACE, and AFLW.

A. Training data

This paper divides the data into four types:

negative faces, positive faces, part faces, and landmark faces.

They are used to train the three different tasks: negatives and positives for face classification, positives and part faces for bounding-box regression, and landmark faces for landmark localization.

B. Results

The results for face detection and facial landmark localization are very good. Crucially, the algorithm is fast: up to 16 fps on a 2.6 GHz CPU and up to 99 fps on an NVIDIA Titan GPU.

Conclusion

This work uses a cascade structure for joint face detection and landmark detection. The method is fast and accurate, and could be considered for use on mobile devices. It is also a coarse-to-fine approach, similar to Viola-Jones' cascaded AdaBoost.

The parallels with Viola-Jones: 1. how to select regions to examine: image pyramid + P-Net; 2. how to extract target features: CNN; 3. how to decide whether a candidate is the target: cascaded decisions.

Appendix

Bounding-box regression, IoU, and NMS are methods that appear throughout the object detection literature. Here is a detailed introduction to each.

Bounding box regression

Here’s a good answer:

www.caffecn.cn/?/question/…

Simply put, the goal is to move the predicted box onto the ground-truth box: the input is the feature extracted from the candidate region, and the regression target is the transformation between the two boxes.
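For reference, the classic R-CNN parameterization of that transformation, with P the proposal box and G the ground-truth box (MTCNN itself regresses offsets of the box coordinates directly, so take this as a representative formulation rather than the exact one used here):

t_x = (G_x - P_x) / P_w, \quad t_y = (G_y - P_y) / P_h, \quad t_w = \log(G_w / P_w), \quad t_h = \log(G_h / P_h)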

IoU

Degree of overlap (IoU):

Object detection must localize the object's bounding box: as in the picture below, we not only need to localize the vehicle's bounding box but also recognize that the object inside it is a vehicle.

A very important concept for bounding-box localization accuracy: since our algorithm can never match the manually annotated boxes 100%, an evaluation metric for localization accuracy is needed: IoU. It measures the degree of overlap between two bounding boxes, as shown in the figure below.

 

IoU is the ratio of the overlapping area of rectangles A and B to the area of their union: IoU(A, B) = area(A ∩ B) / area(A ∪ B).
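A minimal MATLAB helper that computes this (the box format [x1 y1 x2 y2] is an assumption):

% Intersection-over-Union of two axis-aligned boxes a and b.
function iou = bbox_iou(a, b)
    iw = max(0, min(a(3), b(3)) - max(a(1), b(1)));   % intersection width
    ih = max(0, min(a(4), b(4)) - max(a(2), b(2)));   % intersection height
    inter = iw * ih;
    area_a = (a(3) - a(1)) * (a(4) - a(2));
    area_b = (b(3) - b(1)) * (b(4) - b(2));
    iou = inter / (area_a + area_b - inter);          % overlap / union
end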

 

Non-maximum suppression (NMS) :

R-CNN finds N candidate rectangles in an image and computes a classification probability for each:

Just like in the picture above, locating a vehicle, the algorithm eventually finds a bunch of boxes, and we need to determine which ones are useless. The method of non-maximum suppression is as follows: first assume that there are six rectangular boxes, sort them according to the classification probability of classifier, and assume that the probability of belonging to vehicles from small to large is A, B, C, D, E and F respectively. (1) Starting from the maximum probability rectangular box F, judge whether the overlap degree IOU of A~E and F is greater than A set threshold; (2) Assume that the overlap degree of B, D and F exceeds the threshold value, then throw away B and D; And mark the first rectangle, F, which we kept. (3) From the remaining rectangular boxes A, C and E, select E with the highest probability, and judge the degree of overlap between E and A and C. If the degree of overlap is greater than A certain threshold, throw it away; And mark E as the second rectangle that we keep. Keep repeating, finding all the remaining rectangles. Non-maxima suppression (NMS), as its name implies, inhibits elements that are not maxima and searches for local maxima. This local represents a neighborhood with two variables: dimension and size of the neighborhood. The general NMS algorithm is not discussed here, but is used to extract the window with the highest score in target detection. For example, in pedestrian detection, after features are extracted from sliding Windows and classified and identified by classifiers, each window will get a score. However, sliding Windows can cause many Windows to contain or mostly cross with other Windows. In this case, NMS is needed to select the neighborhood with the highest score (the highest probability of being a pedestrian) and suppress the Windows with the lowest score.