What attracted me most about the iPhone X, unveiled last week, was not the silly bunny ears but Apple's FaceID. Behind Apple's move to replace TouchID with FaceID is strong confidence in its vision algorithms, which give the iPhone X the ability to detect all manner of spoofing and masquerading, daring to make FaceID the most important security check.

As we all know, deep learning algorithms have advanced rapidly in recent years, driving the growth of artificial intelligence. The success of AlexNet on the ImageNet data set in 2012 brought the long-dormant Convolutional Neural Network (CNN) back into the spotlight, and a series of models such as VGG and ResNet followed, giving computers visual ability approaching that of human beings. The Recurrent Neural Network (RNN), applied in natural language processing and speech recognition, has broken through the bottleneck of traditional models that struggle to capture temporal data. As a basic capability, deep learning also promotes the development of related fields such as transfer learning and reinforcement learning, and, of course, the transformation of computer vision.

To this end, we have integrated a number of classical deep learning algorithms and models into the Jarvis algorithm library and made them available through the Tesla platform. At present, Jarvis integrates nine deep learning algorithms, distributed across three categories.

This article focuses on several classical algorithms in the field of computer vision, including their principles, models, and applicable scenarios. Articles covering the other categories will follow; please keep an eye out for them.

Introduction to Computer Vision

Computer vision is the science of making machines “see” the world. It combines image processing, pattern recognition, and artificial intelligence, focusing on the analysis of one or more images in order to extract the required information. Computer vision can therefore also be seen as the science of making artificial systems “perceive” from images or multidimensional data. Concretely, it includes object detection and recognition, object tracking, image restoration (removing noise, etc.), scene reconstruction, and more.

Since AlexNet won the ILSVRC title in 2012, deep learning has been unstoppable, and deep models have topped ILSVRC every year since. The models shown below represent milestones in the field of deep vision. As models get deeper and deeper, the top-5 error rate keeps falling: ResNet reached around 3.5%, whereas on the same ImageNet data set the human error rate is about 5.1%, which means that the recognition ability of deep learning models has surpassed that of humans.

(Image source: ai.51cto.com/art/201704/…)

Classical algorithms and models

Networks such as AlexNet and VGG not only achieved good results on ImageNet but can also be used in other scenarios. So that users can train these models flexibly and quickly, we integrated these algorithms into Jarvis. Users can simply drag out the corresponding algorithm node on Tesla and train on their own images, without writing complex network definitions or model training code, to obtain a basic usable model. Of course, if you want really good results, you will still have to tune it patiently yourself.

1. AlexNet

The network structure of AlexNet is shown in the figure below. It can be seen that AlexNet still follows the principles of a traditional CNN in its overall structure, being composed of convolution, downsampling, and fully connected layers stacked one after another. The odd-looking split into upper and lower halves is simply two-GPU parallelism.

(Image source: papers.nips.cc/paper/4824-…)

AlexNet achieved excellent results on ImageNet. Besides using a deeper and wider network structure to increase learning capacity, the following points in data processing and training technique are worth learning from:

  • Data Augmentation

Data augmentation is a series of operations on the original data to produce additional data, which enriches the diversity of the data set and can reduce overfitting to some degree. AlexNet used the following data augmentation methods in its ImageNet training, sketched in code below:

* Horizontal flipping
* Random cropping
* Color/lighting changes, etc.
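A minimal TF 1.x-style sketch of these operations; the crop size and delta values are illustrative, and `image` is assumed to be a float32 tensor of shape [H, W, 3] larger than the crop:

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)           # horizontal flip
    image = tf.random_crop(image, [224, 224, 3])             # random cropping
    image = tf.image.random_brightness(image, max_delta=63)  # lighting change
    image = tf.image.random_contrast(image, lower=0.2, upper=1.8)  # color/contrast change
    return image
```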
  • Dropout

Dropout temporarily removes neural network units from the network with a certain probability during the training of a deep network. For stochastic gradient descent, each mini-batch can be regarded as training a different network because of the random dropping. It is an effective way to prevent network overfitting and improve model generalization.

Each dropout pass is equivalent to sampling a “thinner” structure from the original network, as shown in the image on the right. During the training phase, we set a dropout factor p in the range 0–1, representing the proportion of connections randomly disconnected during the forward pass, and only the weights that were not disconnected are updated during backpropagation. In the test phase, all connections are used, but the weights are multiplied by 1 − p.

Note that each disconnect-and-update is performed randomly with probability p, so the set of dropped connections differs from iteration to iteration. For a neural network with n nodes and a dropout factor p = 0.5, given enough forward passes, up to 2^n connection combinations can occur over the whole training process, which is equivalent to training, and then combining, 2^n thinned models.
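A minimal NumPy sketch of the scheme just described: mask activations with probability p at training time, scale by 1 − p at test time.

```python
import numpy as np

def dropout_train(x, p):
    # Randomly zero activations with probability p; a fresh mask every iteration.
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask

def dropout_test(x, p):
    # At test time all units are kept, but outputs are scaled by (1 - p)
    # so the expected activation matches the training phase.
    return x * (1.0 - p)
```

TensorFlow's `tf.nn.dropout` implements the equivalent “inverted” formulation, scaling the kept activations by 1/keep_prob at training time so that no test-time scaling is needed.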

(Image source: blog.csdn.net/stdcoutzyx/…)

  • ReLU activation function

Compared with the traditional tanh or logistic functions, the ReLU activation function has the following advantages:

* The forward computation and the backward partial derivatives are very simple, requiring no exponentials or divisions
* There is no "flat spot" as with tanh and logistic, so vanishing gradients are much less of a problem
* Because the left side is clamped to zero, many hidden-layer outputs become 0, making the network sparse; this has a certain regularization effect and can alleviate overfitting

The disadvantage of ReLU is that its left side is shut off: if a node's output falls on the left side of the function, it may “never turn over” again (the so-called dying-ReLU problem). To solve this, improved activation functions such as PReLU appeared later, giving the left side a small gradient as well, so that nonlinearity is preserved without letting nodes “die”.
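For concreteness, a NumPy sketch of ReLU and its “leaky” variant; PReLU additionally learns the slope alpha rather than fixing it:

```python
import numpy as np

def relu(x):
    # Closed on the left: negative inputs are clamped to 0.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope alpha on the negative side keeps "dead" units trainable.
    return np.where(x > 0, x, alpha * x)
```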

(Image source: gforge.se/2015/06/ben…)

  • Local Response Normalization (LRN)

There is a neurobiological concept called lateral inhibition, in which activated neurons suppress their neighbors in some way.

LRN borrows the idea of lateral inhibition to create a competition mechanism among the activities of local neurons, so that larger responses become relatively larger still, improving the generalization ability of the model. LRN has two normalization modes, within-channel and across-channel. In the ImageNet experiments of the AlexNet paper, LRN reduced the top-5 error rate by 1.2%. In our experiments on other data sets, however, the improvement from LRN was not as significant. Clearly the LRN operation does not suit every scenario, and only experimentation will tell.
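A hedged usage sketch in TF 1.x. The hyperparameters shown are the ones reported in the AlexNet paper (n = 5, k = 2, alpha = 1e-4, beta = 0.75); note that TF's `depth_radius` is the half-width of the normalization window, so n = 5 corresponds to `depth_radius=2`. The input shape is assumed to be AlexNet's conv1 output.

```python
import tensorflow as tf

conv1 = tf.placeholder(tf.float32, [None, 55, 55, 96])  # e.g. AlexNet's conv1 output
norm1 = tf.nn.local_response_normalization(
    conv1, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75, name='norm1')
```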

  • Overlapping pooling

As the name implies, the pooling windows overlap. Ordinarily, pooling sub-samples the input by taking the maximum or the average within blocks that do not overlap each other. In AlexNet, adjacent pooling windows partially overlap. As with LRN, this trick may not work in all scenarios.
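In TF 1.x, the overlap is simply a matter of choosing a kernel larger than the stride. A minimal sketch with AlexNet's 3×3 window and stride 2 (the input shape is illustrative); non-overlapping pooling would use e.g. a 2×2 window with stride 2:

```python
import tensorflow as tf

norm1 = tf.placeholder(tf.float32, [None, 55, 55, 96])  # example input
# ksize > stride, so neighbouring 3x3 windows overlap by one pixel.
pool1 = tf.nn.max_pool(norm1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                       padding='VALID', name='pool1')   # -> [None, 27, 27, 96]
```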

On the whole, AlexNet remains a very classic algorithm, so we integrated it into the vision category of Jarvis's deep learning algorithms. In our implementation, following the original structure and training method of AlexNet and Alex Krizhevsky's paper “One weird trick for parallelizing convolutional neural networks”, we do the following:

  • Remove the LRN layer and change parameter initialization to the Xavier method.
  • Use the ReLU activation function.
  • Use L2 regularization.
  • Apply dropout with a coefficient of 0.5 to both the fc6 and fc7 fully connected layers.
  • Automatic resize: the network input is 224x224x3. If the image read is larger than this size, it is randomly cropped to 224x224x3; if it is smaller, it is resized to 224x224x3.
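A hedged sketch, expressed with Slim, of the settings just listed: Xavier initialization, ReLU activations, L2 regularization, and dropout of 0.5 on fc6/fc7. The 4096-unit widths follow the original AlexNet, and the 0.0005 coefficient is illustrative (it matches the regularization-factor default in the parameter table later in this article); this is not Jarvis's actual code.

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim

def alexnet_head(net, num_classes, is_training=True):
    with slim.arg_scope([slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        weights_regularizer=slim.l2_regularizer(0.0005)):
        net = slim.fully_connected(net, 4096, scope='fc6')
        net = slim.dropout(net, keep_prob=0.5, is_training=is_training)
        net = slim.fully_connected(net, 4096, scope='fc7')
        net = slim.dropout(net, keep_prob=0.5, is_training=is_training)
        # Final classifier layer: no nonlinearity before the softmax loss.
        net = slim.fully_connected(net, num_classes, activation_fn=None, scope='fc8')
    return net
```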

In addition, we provide some data augmentation operations for the input images, including horizontal flips and color and lighting variations, which are described later in this article.

2. VGG16

VGG16 inherits the framework of AlexNet but achieves a higher image recognition rate by adding more convolutional layers, resulting in the deeper network structure shown in the figure below. For more detailed network structure parameters, please refer to the paper [2].

(Image source: www.cs.toronto.edu/~frossard/p…)

The obvious differences between VGG16 and AlexNet are:

  • Continuous convolution blocks and smaller filter sizes

As the network structure diagram shows, VGG16 contains multiple runs of consecutive convolution operations (the parts in the black boxes in the figure), and the kernels of these convolution layers are all 3×3, much smaller than AlexNet's 7×7. VGG achieves the same effective receptive field by shrinking the filters and stacking more convolutional layers: two stacked 3×3 convolutions cover a 5×5 region, and three cover 7×7. As the paper puts it:

“This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters.”

Small convolution kernels also reduce the number of weights that need to be trained. Assuming the input and output both have C channels, a single 7×7 convolution layer has 7×7×C×C = 49C² weights, whereas the three consecutive 3×3 convolutional layers in VGG have 3×(3×3×C×C) = 27C². The reduction in weights benefits both training and generalization.

  • The number of channels increased

Comparing AlexNet and VGG shows that AlexNet's channel counts are significantly smaller than VGG's. The wider layers allow the network to extract richer features from the input, and VGG's higher accuracy is in large part related to this increase in channel count.

VGG16 is also integrated under the vision module of Jarvis's deep learning algorithms. Apart from the different network definition, everything is similar to AlexNet, including the dropout settings for the fully connected layers, ReLU activation functions, and L2 regularization.

3. VGG19

VGG19 and VGG16, described in the previous section, actually come from the same paper and are two configurations of the same approach. The network structure of VGG19 is very similar to VGG16's; the difference is that VGG19 adds one more convolution layer in each of the third, fourth, and fifth “convolution blocks”, so it has three more weight layers than VGG16. It will not be repeated here.

Beyond the AlexNet and VGG-series structures above, Jarvis will gradually integrate GoogLeNet, ResNet, and others; stay tuned.

Custom network

Classical algorithms are good, but they always have limitations. To provide better flexibility, Jarvis also integrates classification and regression algorithms based on convolutional neural networks, whose biggest advantage is that the network structure can be customized flexibly to suit different scenarios.

Defining a convolution + activation layer in TensorFlow typically takes several lines of code:

```python
with tf.name_scope('conv2') as scope:
    kernel = tf.Variable(tf.truncated_normal([5, 5, 96, 256], dtype=tf.float32,
                                              stddev=1e-1), name='weights')
    biases = tf.Variable(tf.constant(0.0, shape=[256], dtype=tf.float32),
                         trainable=True, name='biases')
    conv2 = tf.nn.conv2d(pool1, kernel, [1, 1, 1, 1], padding='SAME')
    conv2 = tf.nn.relu(tf.nn.bias_add(conv2, biases))
```

And that is just the construction of a single convolution layer. Building a network of a dozen or several dozen layers this way could easily run to hundreds of lines of code, and worse, copying and pasting layers of a similar nature makes it easy to leave a wrong parameter unnoticed.

Fortunately, TensorFlow also supports high-level interface wrappers such as Slim, Keras, and TensorLayer, which let a convolution layer be written in a single line. For example, Slim can define a convolution layer operation like this:

```python
net = layers.conv2d(inputs, 64, [11, 11], 4, padding='VALID', scope='conv1')
```

Although these high-level interfaces make network definition much easier, the hard-coded form is still inconvenient when the network structure has to be adjusted repeatedly during training.

In view of this, when we implemented CNN on Tesla we extracted the network structure separately and pass it into the algorithm as a modifiable parameter. The network-structure parameter is a JSON file in which each layer entry describes one layer and the last few entries carry information about the data input.

The network structure of a CifarNet is shown below:

(Image source: inclass.kaggle.com/c/computer-…)

An example of converting the CifarNet into our customized JSON file is as follows:

{ "layer1" : { "operation": "conv", "maps": 64, "kernel_height": 5, "kernel_width": 5, "stride_height": 1,"stride_width": 1, "padding": "SAME", "activation_func": "relu"}, "layer2" : { "operation": "max_pool", "kernel_height": 3,"kernel_width":3,"stride_height": 2,"stride_width":2, "padding": "SAME"}, "layer3" : { "operation": "conv", "maps": 64, "kernel_height": 5,"kernel_width":5, "stride_height": 1,"stride_width":1, "padding": "SAME", "activation_func": "relu"}, "layer4" : { "operation": "max_pool", "kernel_height": 3,"kernel_width":3, "stride_height": 2,"stride_width":2, "padding": "SAME"}, "layer5" : { "operation": "fc", "maps": 384, "dropout_rate": 1.0, "activation_func": "relu"}, "layer6" : {"operation": "fc", "maps": 192, "dropout_rate": 1.0, "activation_func": "relu"}, "Layer7 ": {"operation":" FC ", "maps": 100}, "initial_image_height": 32, "initial_image_width": 32, "input_image_height": 32, "input_image_width": 32, "normalize": 1,"crop": 1,"whitening": 1,"contrast": 1,"flip_left_right": 1,"brightness": 1 }Copy the code

Lazy folks can even copy this template and just change the parameters to match their own network definition. When running in the Tesla interface, if the network structure needs modifying, you can edit the JSON directly in the parameter configuration on the right side of the interface, with no need to modify and re-upload code.

Based on this customizable network structure, we provide two algorithms: CNN Classification and CNN Regression.

1. CNN Classification

A classification algorithm based on convolutional neural networks (CNN). It supports the network structure customization described above and can adapt to image classification tasks in different scenarios.

2. CNN Regression

The regression algorithm based on convolutional neural networks is similar to CNN Classification; the difference is that a Euclidean distance loss function is used in training, so it accepts floating-point labels. Note that when configuring the network structure, the number of feature maps in the last layer should be 1, rather than the number of categories as in classification.
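A sketch of one common formulation of the Euclidean (squared L2) regression loss; the exact variant used by the platform is not spelled out here. Both tensors are assumed to be float32 of shape [batch_size, 1], matching the single-output last layer.

```python
import tensorflow as tf

predictions = tf.placeholder(tf.float32, [None, 1])
labels = tf.placeholder(tf.float32, [None, 1])
# Squared L2 distance per example, averaged over the batch.
loss = tf.reduce_mean(tf.reduce_sum(tf.square(predictions - labels), axis=1))
```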

Using the Jarvis vision algorithms

At present, Jarvis is exposed mainly through the Tesla platform, and training and using its models is very simple. Broadly, there are three steps: data preparation, model training, and model use.

1. Data preparation

Before starting training, we need to prepare the training set and test set data. For the current algorithms in the computer vision directory, the input is image data, which must all be converted into TFRecord format.

Open Tesla's working interface, drag out a data set node under Input -> Data source, click the node, and complete the parameter configuration options on the right of the interface to finish configuring a data node. Transformation nodes can be appended to data nodes; for convenience, we provide an image -> TFRecord conversion tool under the Input -> Data conversion directory.
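For readers unfamiliar with the format, here is a generic sketch of what an image -> TFRecord conversion typically looks like (not the Tesla tool itself; the feature keys `image/encoded` and `image/label` are illustrative and the actual record layout may differ):

```python
import tensorflow as tf

def write_tfrecord(image_label_pairs, output_path):
    """Write (image_path, int_label) pairs into one TFRecord file."""
    writer = tf.python_io.TFRecordWriter(output_path)
    for image_path, label in image_label_pairs:
        with open(image_path, 'rb') as f:
            encoded = f.read()  # raw encoded bytes (e.g. JPEG)
        example = tf.train.Example(features=tf.train.Features(feature={
            'image/encoded': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[encoded])),
            'image/label': tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())
    writer.close()
```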

2. Model training

In the Tesla workflow interface, drag out the algorithm node you want to train from the deep learning algorithms on the left. Click the node and the parameter configuration options appear on the right, comprising algorithm IO parameters, algorithm parameters, and resource parameters.

  • Resource parameters

    • Specifies the GPU and CPU resources required for model training.
  • Algorithm IO parameters

    • Specify the Ceph paths for the data set, model storage, and TensorBoard storage. Only the data set path needs to be filled in here; the model and TensorBoard paths are assigned by Tesla by default.
    • If a dataset node is connected to the algorithm node, the dataset node's data path automatically fills the algorithm's data input path. Users who prefer not to drag out a dataset node can also fill in the algorithm's data input path manually.
| Parameter | Description | Default |
| --- | --- | --- |
| Continue training from model | Path of the folder containing the model to resume from (including the checkpoint file); training starts from that model. Fill in 0 to train from scratch | 0 |
| Data input | Training set data; must be in TFRecord format | none |
| Test data | Test set data; must be in TFRecord format | none |
  • Algorithm parameters

    • Used to specify the parameters needed during training. Some parameters are common to all the deep learning algorithms, such as batch size, learning rate, and number of iterations, but individual algorithms may also have their own special parameters; these are explained in the detailed introduction linked for each algorithm.
| Parameter | Description | Default |
| --- | --- | --- |
| Batch size | Number of examples processed per training step | 128 |
| Number of iterations | Total number of iterations during training | 100000 |
| Test interval | Run a test every n iterations | 100 |
| Initial learning rate | Learning rate at the start of training | 0.01 |
| Learning rate decay steps | Decay the learning rate every n iterations | 1000 |
| Learning rate decay factor | Multiplicative factor applied at each decay | 0.1 |
| Regularization factor | Regularization coefficient, using L2 regularization | 0.0005 |
| Model save interval | Save the model every n iterations | 1000 |
| Number of categories | Number of classes in the classification task | 1000 |
| Original image height | Height of the image read from the TFRecord, which may differ from the size fed into the network (after crop, resize, etc.) | 256 |
| Original image width | Width of the image read from the TFRecord, which may differ from the size fed into the network (after crop, resize, etc.) | 256 |
| Original image channels | Number of channels of the image read from the TFRecord: 3 for color images, 1 for grayscale | 3 |
| Horizontal flip | Whether to randomly flip the data horizontally | no |
| Brightness change | Whether to randomly change the brightness of the data | no |
| Contrast change | Whether to randomly change the contrast of the data | no |
| Standardization | Whether to standardize the data | no |

For the CNN Classification/Regression algorithms, because they support custom network structures, the parameters differ slightly from the table above; the parameter configuration panel on the right of the workflow is authoritative.
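As an aside, the three learning-rate parameters above typically combine as a staircase exponential schedule: lr = initial_rate × decay_factor ^ floor(step / decay_steps). A hedged sketch, with the table's default values; the exact schedule the platform uses is not documented here.

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,   # initial learning rate
    global_step=global_step,
    decay_steps=1000,     # learning rate decay steps
    decay_rate=0.1,       # learning rate decay factor
    staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
```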

After all parameters are configured, right-click the algorithm node and choose Start to begin training the model. During training, log information and TensorBoard visualizations can be viewed through the TensorFlow console, making it easy to follow how training is progressing.

3. Model collection and use

Model use means making predictions with a trained model. Besides predicting right after training completes, Tesla also helpfully provides the model collection function, so a trained model can be saved for future predictions.

Model collection: under the model's operations, select Collect model to save it to the personal model directory on the left of the Tesla interface. The next time you need this model node, drag it out directly, fill in the corresponding configuration parameters, then right-click and choose Start to run it.

Conclusion

This article introduced several classical deep learning algorithms in the field of computer vision and showed how to train and use deep learning models quickly and flexibly on the Tesla platform. Examples of each algorithm can be found under the algorithm Demo for reference.

Finally, one more word on Apple's FaceID. Its successful implementation is undoubtedly great progress for the application of computer vision in everyday life, and its lower false-accept rate compared with fingerprint unlocking is inseparable from the machine learning, especially deep learning, behind it. Of course, the face recognition technology behind FaceID is far more complicated than the algorithms outlined in this article, and we will not go into it here (honestly, I could not answer if you asked). Apple published the SimGAN paper “Learning from Simulated and Unsupervised Images through Adversarial Training” at CVPR last year, and announced at last year's NIPS that it would begin to make its research results public, which should bring many surprises to academia and industry. That decision is well worth looking forward to.

One more finally: this article was produced under the guidance of andymhuang and Roy li; thanks to them here. If you encounter problems while using the algorithms, you are welcome to consult me (Joyjxu) or Royalli, and all kinds of criticism and corrections are strongly welcome!

References

[1] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.

[2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

[3] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A simple way to prevent neural networks from overfitting[J]. The Journal of Machine Learning Research, 2014, 15(1): 1929-1958.

[4] Krizhevsky A. One weird trick for parallelizing convolutional neural networks[J]. arXiv preprint arXiv:1404.5997, 2014.

[5] Shrivastava A, Pfister T, Tuzel O, et al. Learning from simulated and unsupervised images through adversarial training[J]. arXiv preprint arXiv:1612.07828, 2016.