An Analysis of Face Recognition Development in Deep Learning

 

Contents

Introduction to Face Recognition

Face recognition algorithm

Hands-on analysis

References


Introduction to Face Recognition

 

What is face recognition

 

Face recognition problems are divided into two categories: 1. Face verification (also known as face comparison) 2. Face recognition.

 

Face verification is a one-to-one comparison: it determines whether the people in two images are the same person. The most common application scenario is face unlocking: a terminal device (such as a mobile phone) only needs to compare the user's pre-registered photo with the photo captured on the spot to decide whether they are the same person, thereby completing identity authentication.

 

Face recognition, in contrast, is a 1-to-N comparison: it determines which of the many people the system has seen in advance the person currently in front of it is. Typical applications include suspect tracking, community access control, venue check-in, and customer identification in new-retail scenarios.

 

The common feature of these scenarios is that the face recognition system stores a large number of faces and their identity information in advance; at run time, the captured face must be compared against the many stored faces to find the matching one.

 

In the early years (2012-2015), the two tasks were handled by different algorithmic frameworks: to support both face verification and face recognition, two neural networks had to be trained separately. The publication of Google's FaceNet [1] paper in 2015 changed this situation and unified the two tasks under one framework.

 

How does face recognition work?

 

This section just illustrates a core idea: different faces are made up of different features.

 

To understand this idea, the first thing we need to introduce is the concept of features. Take a look at this example:

 

 

Assuming that these five features are sufficient to describe a human face, each face can be represented as a combination of these five features:

 

(Feature 1, Feature 2, Feature 3, Feature 4, Feature 5)

 

A Western boy with double eyelids, a straight nose, blue eyes, fair skin, and brown hair could then be represented as (see the bold items in the table):

 

(1, 1, 0, 1, 0)

 

Iterating through the feature table above can therefore represent a total of 2^5 = 32 different faces. Thirty-two faces isn't nearly enough to cover a population of more than seven billion. In order to cover enough faces with different characteristics, we need to expand the feature table above, and we can do so from two directions: rows and columns.

 

Expanding along the columns is simple: just add more features (feature 6: face shape, feature 7: ..., feature 8: lip thickness, ...). In practice, 128, 256, 512, or 1024 different features are usually used. Where do these features come from? Do we design them one by one? This question will be answered later.

 

Expanding along the rows means enlarging the range of values each feature can take. For example, for "feature 3", in addition to the value 0 for blue and 1 for gray, could we add a value 2 for black and 3 for no hair? Going further, beyond these discrete integers we can also use continuous decimals: for feature 3, a value of 0.1 could represent "blue with a slight tinge of black", and 0.9 could represent "gray with a tinge of blue", and so on.

 

After this extension, the feature space becomes infinite. A face in the expanded feature space may be expressed as:

 

(0, 1, 0.3, 0.5, 0.1, 2, 2.3, 1.75…)

 

Back to the question posed earlier: where do the many features used to represent faces come from? This is where deep learning (deep neural networks) comes in. After learning and training on face databases of tens of millions or even billions of images, the network automatically summarizes the features best suited for a computer to understand and to distinguish faces.

 

Algorithm engineers often need some kind of visualization to know which features the machine has learned to distinguish between different people, but that’s not the focus of this section.

 

Having illustrated that different faces are composed of different features, we now have enough background to analyze how face recognition actually recognizes faces.

 

Now consider the simplest, most idealized case, where only two features are needed to distinguish one person from another: feature 1 and feature 2. Each face can then be represented as a coordinate (feature 1, feature 2), i.e., a point in the feature space (in this case, a two-dimensional space).

 

Face recognition rests on an assumption that is taken for granted: different photos of the same person map to points that are very close together in the feature space.

 

Why does this assumption hold by default? Imagine a person with brown hair: under different lighting, occlusion, and angles the hair color looks slightly different, but it is still very close to the true color. Reflected in the feature value for hair color, it might vary between 0 and 0.1.

 

Another task, and challenge, of deep learning is to extract features accurately under a wide variety of extremely complex environmental conditions.

 

 

The picture above is a slide from Kumamoto's talk on denoising large-scale face datasets. Three photos of Tomohisa Yamashita are mapped by the neural network into three points (red) in a 128-dimensional feature space, while Rimi Ishihara's feature points are shown in green.

 

This slide expresses the same idea: features extracted from different photos of the same person are very close together in feature space, while faces of different people are far apart.

 

Consider two problems in the field of face recognition: face verification and face recognition.

 

Face verification

 

Take FaceID face unlocking as an example. The iPhone stores a photo of the user in advance (registration), and that photo is converted into a series of feature values, i.e., a point in the feature space. When the user tries to unlock the phone, the device only needs to compare the geometric distance, in feature space, between the currently captured face and the registered face: if the distance is small enough, they are the same person and the phone unlocks; if not, unlocking fails. The distance threshold is set by algorithm engineers through extensive experiments.
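To make the 1:1 comparison concrete, here is a minimal sketch (my own illustration, not Apple's or any particular library's pipeline; the features are assumed to come from a trained embedding network, and the threshold value is purely illustrative):

```python
import numpy as np

def verify(feat_registered: np.ndarray, feat_probe: np.ndarray, threshold: float = 1.1) -> bool:
    """1:1 face verification: the two faces belong to the same person
    iff their feature points are close enough in feature space."""
    a = feat_registered / np.linalg.norm(feat_registered)  # L2-normalize both features
    b = feat_probe / np.linalg.norm(feat_probe)
    distance = np.linalg.norm(a - b)                       # Euclidean distance in feature space
    return bool(distance < threshold)                      # threshold tuned by experiments

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
person_a = rng.normal(size=128)
print(verify(person_a, person_a + 0.05 * rng.normal(size=128)))  # True: two noisy views of one face
print(verify(person_a, rng.normal(size=128)))                    # very likely False: a different face
```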

 

Face recognition

 

Consider another scenario: face-based attendance. Company X has employees A, B, and C. When they join, the company requires each of the three to provide a photo for registration in the company system; after feature extraction, the photos sit quietly in the feature space.

 

The next morning, when employee A arrives at work and points his face at the attendance machine, the system maps the currently captured face into the feature space and compares it with the faces registered earlier, and finds that, among the registered faces, employee A's is the closest to the currently captured face.
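The 1:N case can be sketched in the same spirit (again my own illustration; the gallery features would come from the same embedding network used at registration time):

```python
import numpy as np

def identify(probe_feat, gallery_feats, gallery_names, threshold=1.1):
    """1:N face identification: return the registered identity whose feature is
    nearest to the probe, or None if even the nearest one is too far away."""
    probe = probe_feat / np.linalg.norm(probe_feat)
    gallery = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(gallery - probe, axis=1)  # distance to every registered face
    best = int(np.argmin(dists))
    return gallery_names[best] if dists[best] < threshold else None

rng = np.random.default_rng(1)
gallery = rng.normal(size=(3, 128))               # registered features of employees A, B, C
probe = gallery[0] + 0.05 * rng.normal(size=128)  # employee A at the attendance machine
print(identify(probe, gallery, ["A", "B", "C"]))  # expected: "A"
```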

 

Once you understand the basic principles of face recognition, you can also see its technical limitations. Below are some examples of failure cases that are easy to produce:

 

 

Under many conditions such as poor illumination, occlusion, deformation (e.g., laughing), and profile views, it is difficult for the neural network to extract features close to those of the "standard" face; the abnormal face then falls into the wrong position in feature space, causing recognition or verification to fail. This is a limitation of modern face recognition systems and, to some extent, of deep learning (deep neural networks) itself.

 

In the face of this limitation, three countermeasures are usually taken to make the face recognition system work normally:

 

1. Engineering perspective: develop a quality model to assess the quality of the detected face; faces of poor quality are not sent for recognition/verification.

 

2. Application perspective: impose scene restrictions, for example requiring users to face the camera under good lighting conditions for face unlocking, face gates, and venue check-in, so that poor-quality images are not captured in the first place.

 

3. Algorithm perspective: improve the performance of the face recognition model by adding more photos from complex scenes and of varying quality to the training data, thereby strengthening the model's robustness to interference.

 

All in all, face recognition and deep learning are not nearly as smart as people imagine. I hope that after reading this first section, readers can better judge the truth of claims circulating on social networks and self-media, treat artificial intelligence more rationally, and give it the time and tolerance to grow.

 

Face recognition algorithm

 

This section reviews modern face recognition algorithms along two lines of thought:

 

Idea 1: Metric Learning, including Contrastive Loss, Triplet Loss, and related sampling methods.

 

Idea 2: Margin Based Classification, including Softmax with Center Loss, SphereFace, NormFace, AM-Softmax (CosFace), and ArcFace.

 

Keywords: DeepID2, FaceNet, Center Loss, Triplet Loss, Contrastive Loss, sampling methods, SphereFace, Additive Margin Softmax (CosFace), ArcFace.

 

Idea 1: Metric Learning

 

Contrastive Loss 

 

DeepID2 [2] was among the first to apply the Metric Learning idea to deep-learning-based face recognition. The "features" described in Section 1 are called the "DeepID vector" in this paper.

 

DeepID2 trains Verification and Classification simultaneously on the same network (that is, with two supervision signals), where the verification signal introduces a Contrastive Loss on the feature layer.

 

Contrastive Loss essentially pulls photos of the same person close together in feature space and pushes different people apart until their distance exceeds a certain threshold m (which sounds a lot like Triplet Loss).
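For reference, here is my reconstruction of the contrastive verification loss in the notation of this section (check the DeepID2 paper for the exact form it uses):

```latex
\mathrm{Verif}(f_i, f_j, y_{ij}, m) =
\begin{cases}
\tfrac{1}{2}\,\lVert f_i - f_j \rVert_2^2, & y_{ij} = 1 \quad \text{(same person)} \\[4pt]
\tfrac{1}{2}\,\max\!\big(0,\; m - \lVert f_i - f_j \rVert_2\big)^2, & y_{ij} = -1 \quad \text{(different people)}
\end{cases}
```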

 

Based on this insight, DeepID2 is trained not on single images but on image pairs: two images are fed in at a time, with a verification label of 1 for the same person and -1 for different people. The idea behind the parameter updates is shown in the following formula (excerpted from the DeepID2 paper):

 

 

DeepID2 was a highly influential work in the face recognition field in 2014, and it pioneered the introduction of Metric Learning into face recognition.

 

Triplet Loss from FaceNet

 

FaceNet, published by Google in 2015, was another watershed in face recognition. Not only did it apply Triplet Loss to achieve state-of-the-art results on benchmarks, it also proposed a unified framework for most face problems: identification, verification, search, and so on can all be carried out in the feature space; the key problem to solve is simply how to map faces into that feature space better.

 

Therefore, on the basis of DeepID2, Google abandoned the classification layer (i.e., the Classification Loss) and improved Contrastive Loss into Triplet Loss, with a single purpose: to learn better features.

 

The idea of Triplet Loss is also very simple. The input is no longer an image pair but a triplet of three images: an Anchor face, a Positive face, and a Negative face. The Anchor and the Positive are the same person, while the Negative is a different person. Triplet Loss can then be expressed as:

 

 

The intuitive interpretation: in feature space, the distance between the Anchor and the Positive must be smaller than the distance between the Anchor and the Negative by at least a margin α.
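For reference, the FaceNet triplet loss can be written as follows (my rendering; a, p, and n denote the anchor, positive, and negative samples, and [·]₊ is the hinge):

```latex
L = \sum_{i=1}^{N}\Big[\, \lVert f(x_i^{a}) - f(x_i^{p}) \rVert_2^2
      \;-\; \lVert f(x_i^{a}) - f(x_i^{n}) \rVert_2^2 \;+\; \alpha \,\Big]_{+}
```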

 

With a good face feature space, face problems reduce to the simple, intuitive description at the end of Section 1. Attached is a slide I made summarizing Contrastive Loss and Triplet Loss:

 

 

Problems with Metric Learning

 

Metric Learning based on Contrastive Loss and Triplet Loss matches human intuition and has achieved good results in practice. However, it has two serious problems that make it painful to apply:

 

1. The model takes a very long time to converge (the FaceNet paper mentions training times on the order of months). The training samples of Contrastive Loss and Triplet Loss are pairs or triplets, so the number of possible samples is O(N²) or O(N³).

 

When the training set is large, it is almost impossible to traverse all possible samples (or all samples that provide a sufficient gradient), so convergence generally takes a long time. It took me nearly a month to converge on an Asian dataset of 10,000 identities and roughly 500,000 images.

 

2. The quality of the model depends heavily on the method used to sample the training data. An ideal sampling method not only improves the final performance of the algorithm but also somewhat accelerates training.

 

Many researchers have followed up on these two issues. The content below is an extension of Metric Learning and can be skipped without loss of continuity.

 

Further reading on Metric Learning

 

1. Deep Face Recognition [3]

 

In order to speed up Triplet Loss training, this paper first trains the face recognition model with a conventional Softmax. Because the classification signal provides strong supervision, the model converges quickly (usually in less than two days, sometimes just a few hours).

 

The classification layer on top is then removed, and Triplet Loss is used to fine-tune the feature layer of the model, which achieves good results. This paper also released the VGG-Face dataset.

 

2. In Defense of the Triplet Loss for Person Re-Identification [4]

 

The article makes three very interesting points:

 

  • The authors report that in their experiments the squared Euclidean distance performed worse than the plain (non-squared) Euclidean distance; put simply, the square in the formula below is removed.

     

  • A soft-margin loss formulation is proposed to replace the original Triplet Loss expression.

     

  • Batch Hard Sampling was introduced (a minimal sketch of batch-hard triplet mining follows this list).
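To make batch-hard mining concrete, here is a minimal NumPy sketch of how I understand it: for each anchor in a batch, take the farthest positive and the closest negative, then apply the margin (or soft-margin) triplet loss. The function and variable names are my own, not the paper's:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard triplet loss (sketch).
    embeddings: (B, D) array of L2-normalized features; labels: (B,) array of identity ids."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)            # (B, B) pairwise Euclidean distances

    same = labels[:, None] == labels[None, :]              # same-identity mask (includes self)
    eye = np.eye(len(labels), dtype=bool)

    # Hardest positive: the farthest sample with the same label (excluding the anchor itself).
    pos_dist = np.where(same & ~eye, dist, -np.inf).max(axis=1)
    # Hardest negative: the closest sample with a different label.
    neg_dist = np.where(~same, dist, np.inf).min(axis=1)

    # Hinge form; [4] also proposes the soft-margin variant log(1 + exp(pos - neg)).
    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()
```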

 

 

3. Sampling Matters in Deep Embedding Learning [5]

 

The article makes two valuable points:

 

  • From the perspective of the gradient, it explains why the non-squared distance mentioned in point 2 is better than the squared distance, and based on this insight proposes a Margin Based Loss (essentially a variant of Triplet Loss, shown in the figure below).

     

  • Distance Weighted Sampling is proposed. According to the paper, the semi-hard sampling in FaceNet, the random hard sampling in Deep Face Recognition [3], and the batch hard sampling in [4] all struggle to consistently produce triplets with a large gradient (i.e., a large loss, the triplets that actually help training), so the paper derives a Distance Weighted Sampling method from a statistical perspective.

 

 

4. My impressions from experiments

 

  • I have tried the methods from points 2 and 3 in my experiments. Intuitively, both the soft-margin loss and the Margin Based Loss are easier to use than the original Triplet Loss, and the Margin Based Loss performed better in my experiments.

     

  • Distance Weighted Sampling did not bring a significant improvement in my experiments.

 

Readers can refer to whichever of these papers interests them. Finally, it is worth noting that Triplet Loss has also achieved good results in person re-identification, although it may well be overtaken there by Margin Based Classification as well.

 

Idea 2: Margin Based Classification

 

As the name implies, Margin Based Classification does not compute a loss directly on the feature layer the way Metric Learning does, which imposes intuitive, strong constraints on the features. Instead, face recognition is still trained as a classification task: by modifying the Softmax formulation, a margin constraint is imposed on the feature layer indirectly, making the features learned by the network more discriminative.

 

This part starts with Sphereface [6].

 

Sphereface

 

Let's start from the authors' insight.

 

 

Figure (a) shows features trained with the original Softmax loss function, and Figure (b) shows the same features after normalization. It is not hard to see that the features learned by Softmax have an intrinsically angular (radial) distribution.

 

So why not optimize the angle directly? If the weights of the classification layer are normalized and the bias is ignored, we obtain the modified loss function:

 

 

It is not hard to see that, for a feature x_i, the optimization direction of this loss is to pull it toward the center of its own class y_i and away from the centers of the other classes. This objective is consistent with the goal of face recognition: minimize intra-class distance and maximize inter-class distance.
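For reference, with the weights normalized (||W_j|| = 1) and the bias set to zero, the modified softmax loss discussed here can be written as follows (my reconstruction of the form used in the SphereFace paper):

```latex
L_i = -\log \frac{e^{\,\lVert x_i \rVert \cos\theta_{y_i,\,i}}}
               {\sum_{j} e^{\,\lVert x_i \rVert \cos\theta_{j,\,i}}}
```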

 

However, in order to guarantee correct face comparison, the maximum intra-class distance must be smaller than the minimum inter-class distance, and the loss function above does not guarantee this. The authors therefore introduce the idea of a margin, in the same spirit as the margin α in Triplet Loss.

 

So how does the author further improve the above formula and introduce margin?

 

Inside the red box above is the cosine of the angle between the sample feature and its class center. Our goal is to reduce this angle, that is, to increase this cosine value. Conversely, the smaller this value is, the larger the loss becomes, i.e., the heavier the penalty for deviating from the optimization target.

 

So if, during training, we deliberately replace this term with a smaller, harder-to-satisfy value, the network is forced to pull samples even closer to their class centers, which further reduces intra-class distance and increases inter-class distance, exactly our goal. Based on this idea, the final loss function is as follows:

 

 

The original cos(θ) is replaced by ψ(θ). The simplest form of ψ(θ) is just cos(mθ); the more complicated form in the paper only serves to extend the function's domain to [0, π] and keep it monotonically decreasing over that domain.
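For reference, the piecewise definition used in the SphereFace paper is, as best I recall:

```latex
\psi(\theta) = (-1)^{k}\cos(m\theta) - 2k,\qquad
\theta \in \left[\tfrac{k\pi}{m},\, \tfrac{(k+1)\pi}{m}\right],\quad k \in \{0, 1, \dots, m-1\}
```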

 

This m is the added margin factor. When m = 1, ψ(θ) equals cos(θ); when m > 1, ψ(θ) becomes smaller and the loss becomes larger. The hyperparameter m controls the severity of the penalty: the larger m is, the harsher the penalty.

 

To simplify computation, m is usually set to an integer. The authors prove mathematically that m ≥ 3 guarantees the maximum intra-class distance is smaller than the minimum inter-class distance, and the multiple-angle formulas are used when implementing cos(mθ).

 

In addition, SphereFace training is tricky; the paper does not give the training details but refers to the authors' earlier paper [10]. You can also find training details in the GitHub issues of the official repository, which contain plenty of discussion.

 

Normface

 

SphereFace works well, but it is not elegant. In the testing phase, SphereFace measures similarity by the cosine between features, that is, by angle.

 

But in the training phase, as you may have noticed, SphereFace's loss function does not directly optimize the angle between the feature and the class center; what it optimizes is that angle term multiplied by the length of the feature.

 

That is to say, my earlier statement about the optimization direction of the SphereFace loss is not precise; in fact, part of the optimization goes into increasing the length of the features.

 

I ran experiments on the MNIST dataset; the figures below visualize the features for m = 1 and m = 4, respectively. The point above can be verified by paying attention to the scale of the coordinate axes.

 

 

However, the length of the feature is of no use when we deploy the model, which results in an inconsistency between training and testing; in the original words of NormFace, "there is a gap."

 

This is where the core idea of NormFace comes in: why not normalize the features during training as well? The corresponding loss function is as follows:

 

 

where W_j is the normalized weight vector of class j, f_i is the normalized feature, and their dot product is then the cosine of the angle between them. The parameter s is introduced for its mathematical properties: it keeps the gradient magnitudes reasonable. The original paper gives a more intuitive explanation, which is not the focus here.
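In plain text, the normalized softmax loss described above reads (my rendering; both W_j and f_i are L2-normalized):

```latex
L_i = -\log \frac{e^{\,s\, W_{y_i}^{\top} f_i}}{\sum_{j} e^{\,s\, W_{j}^{\top} f_i}}
    = -\log \frac{e^{\,s\cos\theta_{y_i}}}{\sum_{j} e^{\,s\cos\theta_{j}}}
```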

 

Training does not converge without s. As for how to set s, it can be treated as a learnable parameter, but the authors recommend treating it as a hyperparameter whose recommended value depends on the number of classes; a formula for it is given in the paper's appendix.

 

The paper also points out that, after normalization, the Euclidean distance used in FaceNet and the cosine distance are in fact equivalent. There are many other interesting discussions about normalizing weights and features; interested readers are encouraged to read the original paper.

 

AM-softmax [11] / CosFace [12]

 

These two papers propose essentially the same thing. NormFace solves the train/test inconsistency of SphereFace through feature normalization, but it has no notion of a margin. AM-Softmax can be seen as introducing a margin on top of NormFace. Going straight to the loss function:

 

 

Where the weights and features are normalized.

 

Intuitively, cos(θ) − m is smaller than cos(θ), so the loss is larger than in NormFace, hence the notion of a margin.

 

m is a hyperparameter that controls the severity of the penalty: the larger m is, the stronger the penalty. The authors recommend m = 0.35. The margin here is introduced in a much gentler way than in SphereFace, which makes the method not only easy to reproduce but also effective without many training tricks.
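As a concrete illustration, here is a minimal NumPy sketch of the AM-Softmax loss described above (my own code, not the authors' implementation; s = 30 and m = 0.35 are commonly used values):

```python
import numpy as np

def am_softmax_loss(features, weights, labels, s=30.0, m=0.35):
    """AM-Softmax / CosFace loss (sketch).
    features: (B, D) embeddings; weights: (D, C) class weights; labels: (B,) class indices."""
    # Normalize features and class weights so the logits become cosines of angles.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w                                            # (B, C): cos(theta_j) for every class

    # Subtract the margin m only from the target class's cosine, then scale by s.
    margin = np.zeros_like(cos)
    margin[np.arange(len(labels)), labels] = m
    logits = s * (cos - margin)

    # Standard cross-entropy on the modified logits (shift by the max for stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```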

 

ArcFace [13]

 

Compared with AM-Softmax, the difference lies in how ArcFace introduces the margin. The loss function:

 

 

Does it look just like AM-Softmax at first glance? Notice that here m is inside the cosine. The paper argues that the decision boundary between features obtained from this formulation is better and has a stronger geometric interpretation.
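In plain text, the ArcFace loss reads (my rendering; compare AM-Softmax, where the exponent of the target class is s(cos θ_{y_i} − m)):

```latex
L_i = -\log \frac{e^{\,s\cos(\theta_{y_i} + m)}}
               {e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{\,s\cos\theta_j}}
```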

 

However, is there any problem with introducing the margin this way? Think about it: is cos(θ + m) always smaller than cos(θ)?

 

Finally, let's use a figure from the ArcFace paper to answer this question and summarize the Margin Based Classification part of this chapter.

 

Summary

 

 

This figure is from the ArcFace paper. The abscissa θ is the angle between the feature and its class center, and the ordinate is the target logit, i.e., the exponent in the numerator of the loss (ignoring s). The smaller this value is, the larger the loss becomes.

 

After reading so many classification-based face recognition papers, you may get the feeling that everyone is essentially working on the loss function, or more precisely, on designing the target-logit-versus-θ curve shown above.

 

This curve determines how samples that deviate from the target are optimized, i.e., how heavy a penalty is applied depending on how far off-target a sample is. Two points:

 

1. Overly strong constraints do not generalize well. For example, SphereFace's loss with m = 3 or 4 satisfies the requirement that the maximum intra-class distance be smaller than the minimum inter-class distance; the loss is large, i.e., the target logit is small. But that does not mean the property generalizes to samples outside the training set. Imposing constraints that are too strong degrades model performance and makes training hard to converge.

 

2. It matters which samples you choose to optimize. ArcFace points out that over-penalizing samples with θ ∈ [60°, 90°] may cause training to diverge; optimizing samples with θ ∈ [30°, 60°] can improve model accuracy, while over-optimizing samples with θ ∈ [0°, 30°] brings no significant gain. As for samples at even larger angles, which deviate too far from the target, forcing them to be optimized is likely to hurt model performance.

 

This also answers the question left open above: the ArcFace curve turning upward at large angles is not a problem, and may even be a good thing, because optimizing hard samples at very large angles may not be beneficial. This is similar in spirit to FaceNet's semi-hard strategy for sample selection.
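To see these curves for yourself, the small matplotlib sketch below plots the target logit as a function of θ for plain Softmax (cos θ), SphereFace's ψ(θ) with m = 4, AM-Softmax (cos θ − m), and ArcFace (cos(θ + m)); the margin values used are the commonly recommended ones and purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.0, np.pi, 500)

def sphereface_psi(theta, m=4):
    """psi(theta) = (-1)^k * cos(m*theta) - 2k on [k*pi/m, (k+1)*pi/m]."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1).astype(int)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

deg = np.degrees(theta)
plt.plot(deg, np.cos(theta), label="Softmax: cos(theta)")
plt.plot(deg, sphereface_psi(theta, m=4), label="SphereFace: psi(theta), m=4")
plt.plot(deg, np.cos(theta) - 0.35, label="AM-Softmax: cos(theta) - 0.35")
plt.plot(deg, np.cos(theta + 0.5), label="ArcFace: cos(theta + 0.5)")
plt.xlabel("theta between feature and class center (degrees)")
plt.ylabel("target logit (without s)")
plt.legend()
plt.show()
```

Note how the ArcFace curve bends upward as θ approaches 180°, which is exactly the behavior discussed above.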

 

Further reading on Margin Based Classification

 

1. A discriminative feature learning approach for deep face recognition [14]

 

Proposes Center Loss, added as a weighted term to the original Softmax Loss. By maintaining a class center in Euclidean space, it reduces intra-class distance and enhances the discriminative power of the features.

 

2. Large-margin softmax loss for convolutional neural networks [10]

 

An earlier paper by the SphereFace authors: it introduces a margin into the Softmax Loss without normalizing the weights, and it also covers SphereFace's training details.

 

Note: Idea 2 was written by Chen Chao

 

Hands-on analysis

 

Based on the knowledge in the first two sections, I achieved 99.47% on LFW. The model was trained on VGGFace2 without any deduplication against LFW and without a painful tuning process, which can be regarded as a direct benefit of the AM-Softmax loss function.

 

I stepped into quite a few pits along the way. This section organizes the experimental results and experience from that period, and also answers some of the questions engineers care most about when working on face recognition.

 

Project Address:

 

Github.com/Joker316701…

 

It includes code to reproduce all of the experimental results.

 

A standard face recognition pipeline includes these stages: face detection & landmark detection -> face alignment -> face recognition.

 

Face Detection & Landmark Detection

 

At present, the most popular face and landmark detector is MTCNN [7], but MTCNN occasionally misses faces and its landmark localization is not accurate enough; both issues adversely affect subsequent alignment and recognition.

 

In addition, as mentioned in the COCO Loss paper [8], with good detection and alignment, plain Softmax alone can reach 99.75%, which beats the results of most recent papers. More details are given in the COCO Loss GitHub issue [16].

 

In addition, because alignment algorithms differ in performance, papers from 2017 onward pay more attention to comparing relative experimental results, so as to factor out the advantages and disadvantages introduced by alignment and make comparisons between face recognition algorithms more direct. The fact that reaching 99% or more on LFW has become easy is another reason relative results are currently preferred.

 

Face alignment

 

Face alignment applies a geometric transformation so that the detected face and its landmarks are moved to relatively fixed positions in the image, providing a strong prior.

 

The widely used alignment method is Similarity Transformation. For more transformation methods and experiments, please refer to this Zhihu article [17].
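As an illustration of similarity-transform alignment, here is a sketch of the common 5-landmark approach (my own code; the reference template coordinates below are typical values for a 112×112 crop and are illustrative, not taken from the author's repository):

```python
import cv2
import numpy as np
from skimage import transform as trans

# Reference positions of {left eye, right eye, nose tip, left/right mouth corner}
# in a 112x112 template (illustrative values commonly seen in face repositories).
TEMPLATE = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                     [41.5, 92.4], [70.7, 92.2]], dtype=np.float32)

def align_face(image, landmarks, size=(112, 112)):
    """Warp `image` so that the five detected landmarks land on the template positions."""
    tform = trans.SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float32), TEMPLATE)  # least-squares fit
    matrix = tform.params[:2, :]            # 2x3 matrix: rotation + uniform scale + translation
    return cv2.warpAffine(image, matrix, size, borderValue=0.0)
```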

 

Author code implementation:

 

Github.com/Joker316701…

 

One question worth asking: are face detection and alignment really necessary? In practical applications, face landmarks often cannot be detected, and the similarity transformation cannot be applied without them.

 

There is related research on this problem, for example using a Spatial Transformer Network [9] to "let the network learn alignment by itself", as in End-to-End Spatial Transform Face Detection and Recognition. Research in this direction is not yet mature, so most practical systems still follow the detection -> alignment pipeline.

 

Face recognition

 

It is fair to say that most of the problems in face recognition projects are really face detection and alignment problems; the gaps between recognition models are less pronounced. Even so, training AM-Softmax surfaced a few noteworthy issues.

 

ResFace20, the network introduced with SphereFace and also used in AM-Softmax, reached only 94% on LFW when I reproduced it in exactly the same way.

 

In TensorFlow, the model converged with the following configuration:

 

Adam, no weight decay, use batch normalization.

 

whereas the original papers' configuration was:

 

Momentum, weight decay, no batch normalization.

 

What I found in my experiments: no optimizer other than Adam achieved the desired result, which may come down to differences between the underlying implementations of the frameworks. SphereFace and AM-Softmax are based on Caffe, while all experiments here use TensorFlow, so reaching different conclusions is not surprising.

 

On the other hand, porting the ResNet-Inception-v1 from Sandberg's FaceNet repository to AM-Softmax gave less than 97% on LFW, which was a somewhat puzzling result in the process.

 

Judging from other papers, with the right loss, ResNet-Inception, ResNets of different depths, and even architectures such as MobileNet and SqueezeNet should not differ dramatically in performance (at least 99% in the case of AM-Softmax).

 

In addition, directly applying ArcFace did not converge; further experiments are needed.

 

Finally, one interesting detail in Sandberg's code: the train_op is defined inside the facenet.train() function. Reading that function closely reveals that, after each gradient update, Sandberg's code does not use the raw values of the network parameters; instead, it uses their moving averages as the actual parameter values of the network.

 

This is also why Sandberg never feeds a value for "is_training" through a placeholder in the batch_norm configuration: by default, both training and testing use the local-statistics mode.

 

This use of batch_norm would be wrong were it not for the moving averages applied to all parameters. How well Sandberg's approach works can only be judged from experimental results.

 

If you want to use the network parameters and batch norm in the normal way, i.e., not use moving averages as parameters and not leave "is_training" switched on all the time, simply replace the facenet.train() function with a plain optimizer and feed batch_norm's "is_training" through a placeholder (see my AM-Softmax implementation for details).
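A minimal sketch of what that change looks like in TF1-style code (my own illustration, not Sandberg's actual code; the layer sizes, class count, and loss are placeholders):

```python
import tensorflow.compat.v1 as tf  # assumes a TF 1.x-style graph, matching the era of the repo
tf.disable_eager_execution()

images = tf.placeholder(tf.float32, [None, 112, 112, 3])
labels = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool, name="is_training")   # fed True at train time, False at test

x = tf.layers.conv2d(images, 64, 3, padding="same")
x = tf.layers.batch_normalization(x, training=is_training)  # BN now respects train/test mode
x = tf.nn.relu(x)
embeddings = tf.layers.dense(tf.layers.flatten(x), 512)
logits = tf.layers.dense(embeddings, 1000)                   # placeholder classifier head
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Plain optimizer instead of facenet.train(): the raw variables are the model's parameters
# (no moving-average shadow weights), and BN statistics are updated via UPDATE_OPS.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```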

 

 

Thank you for reading to the end; let's finish with a TensorBoard plot!

 

 

References

 

[1] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proc. CVPR, 2015. 

[2] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773, 2014.

[3] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015 

[4] A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification. arXiv preprint arXiv:1703.07737, 2017.

[5] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. arXiv preprint arXiv:1706.07567, 2017.

[6] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017 

[7] Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multi-task cascaded convolutional networks. arXiv preprint, 2016 

[8] Y. Liu, H. Li, and X. Wang. Learning Deep Features via Congenerous Cosine Loss for Person Recognition. arXiv preprint arXiv:1702.06890, 2017.

[9] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015. 

[10] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016. 

[11] F. Wang, W. Liu, H. Liu, and J. Cheng. Additive margin softmax for face verification. In arXiv:1801.05599, 2018. 

[12] CosFace: Large Margin Cosine Loss for Deep Face Recognition 

[13] Deng, J., Guo, J., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Arxiv preprint. 2018 

[14] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016. 

[15] Y. Liu, H. Li, and X. Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.

[16] github.com/sciencefans…

[17] zhuanlan.zhihu.com/p/29515986