This Chat aims to help readers understand some of the programming ideas and models behind AI and to map out a path for their own growth.


Author/Speaker | Li Jiaxuan


Author of “TensorFlow Technology Analysis and Practice” and a lecturer at InfoQ, 51CTO, O’Reilly Strata and other conferences. He is active in the major Chinese technical communities and answers programming questions on Zhihu. He specializes in the architecture of deep learning frameworks, source code analysis, and applications in different fields, and has hands-on deep learning experience in image processing, sentiment analysis of social text data, data mining and other areas. He took part in a hackathon on a deep-learning-based 2D perception system for autonomous driving and previously worked as an R&D engineer at Baidu. He is currently studying NLP, chatbots, and the performance optimization and FPGA compilation of TensorFlow.


This article is published by Gitchat, AI Technology Base

This Chat mainly includes:

  1. An overview of the overall knowledge system of artificial intelligence.

  2. What are the career prospects for AI/deep learning engineers?

  3. Can people from unrelated majors teach themselves? How should they build a self-study method and a path for advancement?

  4. After 8 years working in Java/client/front-end/iOS, what barriers does a newcomer face in becoming a deep learning engineer, and how can they be overcome?


I. The overall knowledge system of artificial intelligence

What does the body of knowledge in the field of AI currently look like? In other words, to truly become a deep learning engineer, what should those of us who are already engineers prepare?

At present, the people working on deep learning fall mainly into the following three groups.

  • Scholars. They mainly do theoretical research on deep learning: how to design a “network model”, how to tune its parameters, and why a given change works well. Their daily work focuses on frontier and theoretical research, model experiments and so on, and they are highly sensitive to new technologies and theories.

  • Algorithm improvers. To adapt existing network models to their own applications and get better results, these people improve the models and apply new algorithmic improvements to them. They mainly build basic application services, such as basic speech recognition and basic face recognition services, providing strong models for the applications built on top.

  • Industrial researchers. This group does not go very deep into algorithms; they mainly master the network structures of various models and some algorithm implementations. They tend to read good papers, reproduce them, and apply them to their own industry. People at this level are also the mainstream of deep learning work.

Given the career transitions most readers are aiming for, the second and third groups are the best match, and the third is the majority. That is what deep learning engineers do.

With that in mind, the following sections sort out the knowledge system from several aspects.

The framework

There are many DL frameworks available today, such as TensorFlow, Caffe and PyTorch, and many people have compared their performance. We recommend TensorFlow as the first choice based on its popularity and ease of use. Here’s the framework popularity trend as of March:

So regardless of the specific framework, what should a deep learning/machine learning framework do functionally?

  • A Tensor library that is transparent across CPU and GPU and implements many operations (slicing, array and matrix manipulation, etc.). Transparent here means that the framework takes care of how an operation runs on each device; the user only needs to specify which device performs which operation.

  • A completely separate code base that operates on Tensors from a scripting language (ideally Python) and implements everything deep learning needs, including forward/back propagation, graph computation, etc.

  • Pretrained models (such as Caffe’s models and the Slim module in TensorFlow) that can be shared easily.

  • No compilation process. As deep learning moves toward larger, more complex networks, the time spent compiling complex graphs multiplies. Furthermore, compiling costs you interpretability and the ability to log effectively. For more, read The Unreasonable Effectiveness of Recurrent Neural Networks.
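
As a rough illustration of the forward/back propagation and graph computation mentioned in the list above, here is a minimal sketch in plain Python. The `Value` class and its methods are hypothetical names invented for this example; they are not part of TensorFlow or any other framework, which operate on tensors and dispatch to device-specific kernels.

```python
class Value:
    """A scalar node in a toy computation graph (illustrative only)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # nodes this value was computed from
        self.grad_fns = grad_fns    # one function per parent: upstream grad -> parent grad

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Build a topological order so each node is processed after its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, grad_fn in zip(node.parents, node.grad_fns):
                parent.grad += grad_fn(node.grad)


# Forward pass: y = w * x + b
x, w, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b

# Backward pass: fills in dy/dw, dy/dx and dy/db
y.backward()
print(y.data, w.grad, x.grad, b.grad)
```

The forward pass builds the graph as a side effect of the arithmetic, and the backward pass walks it in reverse; this is only the scalar core of the idea.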

TensorFlow provides Python, C++, and Java interfaces to build user programs, and the core is implemented in C++.

The following figure shows the system architecture of TensorFlow. From bottom to top it is divided into the device layer and network layer, the data operation layer, the graph computing layer, the API layer and the application layer. Of these, the device layer and network layer, the data operation layer, and the graph computing layer are the core layers of TensorFlow.

Here is a detailed overview of TensorFlow’s architecture from the bottom up. The lowest layer consists of the network communication layer and the device management layer. The network communication layer includes gRPC (Google Remote Procedure Call) and Remote Direct Memory Access (RDMA), both of which are needed for distributed computing. The device management layer covers TensorFlow’s implementations on CPU, GPU, FPGA and other devices and provides a unified interface to the layers above, so that the upper layers only need to handle the logic of, say, a convolution and do not need to care how the convolution is carried out on the hardware.

Above it is the data operation layer, which mainly includes operations such as convolution functions and activation functions. Above that is the graph computing layer, which is the core of what we want to understand and contains the implementation of both local and distributed computation graphs. On top are the API layer and the application layer.

There are many companies using TensorFlow. In addition to Google using TensorFlow in its product line, companies like JD.com, Xiaomi, Sina and ZTE in China, as well as foreign companies like Uber, eBay, Dropbox and Airbnb, are all trying to use TensorFlow.

The paper

Reading one paper a week, and implementing a paper (or reading an open source implementation of one) each month, is a reasonable pace of study.

Those transitioning from engineering, who have not been in the habit of reading papers, may struggle for a while. Add the English language barrier, and they can linger outside the field for a long time without getting started.

A good tip here is:

First read a Chinese survey related to the paper’s main ideas, then Chinese doctoral theses, and then English surveys.


Chinese surveys let us first learn the basic terminology of the field and the common experimental methods. If you start directly from the paper, the author writes at a level above ours, and it is easy either to take things for granted or not to get through it at all. Therefore, before reading a paper, make sure you thoroughly understand the basic knowledge it involves.

So which papers should you read during the transition to grasp the essentials as quickly as possible? Let’s take the development of CNNs as an example.

The development process of convolutional neural network is shown in the figure.

The starting point of convolutional neural networks is the Neocognitron model, in which convolutional structures had already appeared. The first convolutional neural network model was born in 1989, and its inventor was LeCun. The recommended reading for learning convolutional neural networks is LeCun’s paper, which explains in detail what a convolutional neural network is, why convolution is needed, why downsampling is needed, how radial basis functions (RBF) are used, and so on.

LeCun proposed LeNet in 1998, but afterwards the edge of convolutional neural networks was gradually overshadowed by SVMs and other classifiers based on hand-designed features. With the introduction of ReLU and Dropout, and the historic opportunities brought by GPUs and big data, convolutional neural networks saw a historic breakthrough in 2012: AlexNet.

As shown in the figure, the evolution of convolutional neural networks after AlexNet follows four main directions:

  • First, deepening the network;

  • Second, enhancing the function of the convolution layer;

  • Third, moving from classification tasks to detection tasks;

  • Fourth, adding new functional modules.

For each stage shown in the figure above, find the papers on several of the networks and understand their structures and characteristics. These networks are implemented under TensorFlow Models.

Understand the code and run it yourself, then fine-tune on your own data set. You will gain an intuitive understanding of how deep learning networks are developed in industry.

The following is a brief description of the structure and characteristics of several networks at each stage.

Network deepening

LeNet

LeNet’s paper can be found at:

http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf

LeNet contains the following components.

  • Input layer: 32 x 32

  • Convolution layer: 3

  • Downsampling layer: 2

  • Fully connected layer: 1

  • Output layer (Gaussian connection): 10 categories (probabilities of the digits 0 to 9)
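
The component list above can be turned into a small size calculation. The sketch below assumes 5×5 convolution kernels with stride 1 and no padding, and 2×2 non-overlapping downsampling windows; those values come from LeCun’s LeNet-5 paper rather than from this article, and the function names are made up for the example.

```python
def conv_output_size(size, kernel, stride=1):
    """Width of a feature map after a convolution without padding."""
    return (size - kernel) // stride + 1


def pool_output_size(size, window=2):
    """Width of a feature map after non-overlapping window x window downsampling."""
    return size // window


# Layer order of LeNet: three convolution layers and two downsampling layers.
LAYERS = [("C1", "conv"), ("S2", "pool"), ("C3", "conv"), ("S4", "pool"), ("C5", "conv")]


def lenet_feature_sizes(input_size=32, kernel=5):
    """Return (layer name, feature map width) for each layer, starting from the input."""
    sizes = []
    size = input_size
    for name, kind in LAYERS:
        if kind == "conv":
            size = conv_output_size(size, kernel)
        else:
            size = pool_output_size(size)
        sizes.append((name, size))
    return sizes


sizes = lenet_feature_sizes()
for name, size in sizes:
    print(f"{name}: {size}x{size}")
```

Under these assumptions the 32×32 input shrinks to 28, 14, 10, 5 and finally 1, which is why C5 behaves like a fully connected layer in front of F6.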

The network structure of LeNet is shown in the figure.



The purpose and significance of each layer are described below.

  • Input layer. The input image size is 32×32. This is larger than the images in the MNIST data set (28×28), which are zero-padded to this size. The purpose is to let potentially distinctive features, such as stroke endpoints and corners, appear in the center of the receptive field of the highest-level feature detectors.

  • Convolution layers (C1, C3, C5). The main purpose of the convolution operation is to enhance the features of the original signal and reduce noise. A visual online demo shows how the output feature maps differ between convolution kernels, as shown in the figure.

  • Downsampling layers (S2, S4). Downsampling mainly aims to reduce the number of parameters the network has to train and the degree of overfitting of the model. There are usually two ways to do it.

    • Max pooling: Find the maximum value in a selected area as the sampled value.

    • Mean pooling: Use the mean values of selected areas as the sampled values.

  • Fully connected layer (F6). F6 computes the dot product of the input vector and the weight vector, plus a bias. The result is then passed through the sigmoid function to produce the state of unit i.

  • Output layer. The output layer consists of Euclidean radial basis function (RBF) units, one for each category (digits 0 to 9), with 84 inputs per unit. That is, each output RBF unit computes the Euclidean distance between its input vector and the parameter vector of its class. The farther the input is from the parameter vector, the greater the RBF output.
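
The two downsampling variants in the list above (max pooling and mean pooling) can be sketched in a few lines of plain Python. This is only an illustration on a nested list; the `pool2d` helper and the sample feature map are invented for the example, and it assumes non-overlapping square windows.

```python
def pool2d(feature_map, window, reduce_fn):
    """Apply reduce_fn to each non-overlapping window x window area of a 2D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        pooled_row = []
        for c in range(0, cols - window + 1, window):
            area = [feature_map[r + i][c + j]
                    for i in range(window) for j in range(window)]
            pooled_row.append(reduce_fn(area))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, 2, max)     # keeps the largest value per area
mean_pooled = pool2d(feature_map, 2, mean)   # keeps the average value per area
print(max_pooled)
print(mean_pooled)
```

Both variants halve the width and height of the feature map here; they differ only in how each 2×2 area is summarized.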

In testing, LeNet reaches an error rate of 0.95% when trained on the 60,000 original images of the data set. The error rate falls to 0.8% when it is trained on 540,000 artificially distorted images plus the 60,000 original images.

Then came the historical turning point: in 2012, Geoffrey Hinton and his student Alex Krizhevsky won the ImageNet contest and broke the record for image classification, answering the doubts about convolutional methods through the competition itself. The network they used in the competition was called AlexNet.

AlexNet

In the 2012 ImageNet image classification competition, AlexNet achieved a top-5 error rate of 15.3%. The 2011 winner, based on traditional shallow models, had a top-5 error rate of 25.8%, and AlexNet was also far ahead of the 2012 runner-up, whose error rate was 26.2%. For details, see the AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton.

The structure of AlexNet is shown below. The diagram clearly shows the division of responsibilities between the two GPUs: one GPU runs the layers in the top part of the diagram, and the other runs the layers in the bottom part. The GPUs communicate with each other only at certain layers.

AlexNet consists of 5 convolution layers, 3 max-pooling layers and 3 fully connected layers, with about 60 million adjustable parameters. The output of the last fully connected layer is fed to a 1000-way softmax layer, which produces a distribution over the 1000 class labels.

AlexNet succeeded in bringing deep convolutional methods back into view because it used the following methods.
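
To make the last step concrete, here is a minimal sketch of a softmax that turns 1000 raw scores into a probability distribution. The scores are made-up values for illustration, not outputs of a real AlexNet, and the function is written in plain Python rather than with any framework call.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last fully connected layer: one score per class.
scores = [0.0] * 1000
scores[42] = 5.0  # pretend class 42 received the highest score

probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class, round(sum(probabilities), 6))
```

The class with the highest score also gets the highest probability; softmax only rescales the scores so that they can be read as a distribution over the 1000 classes.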

  • Prevent overfitting: Dropout, data augmentation.

  • Nonlinear activation function: ReLU.

  • Big data training: 1.2 million ImageNet images.

  • Other: GPU implementation, and use of the LRN (Local Response Normalization) layer.

To learn so many parameters without overfitting, two methods can be used: data augmentation and Dropout.

  • Data augmentation: Adding training data is a good way to avoid overfitting and improve the accuracy of the algorithm. When training data is limited, transformations can be applied to the existing training set to generate new data and expand the amount of training data. The following transformations are commonly used; their effects are shown in the figure.

    • Flip the image horizontally (also called reflection or flip).

    • Randomly crop patches (e.g. 224×224) from the original image (size 256×256), which amounts to a random translation.

    • Add some random lighting changes to the image (also called lighting or color transformation, or color jitter).

  • Dropout. AlexNet sets the output of each hidden-layer neuron to 0 with probability 0.5. Neurons dropped in this way take part in neither forward nor back propagation. Thus, each time a sample is presented, the neural network tries a new structure, but all of these structures share weights. Because a neuron cannot rely on the presence of particular other neurons, this technique reduces complex co-adaptation between neurons. As a result, the network is forced to learn more robust features that are useful in combination with many different random subsets of the other neurons. Without Dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required for convergence.
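
The two methods in the list above can be sketched in plain Python as follows. This is an illustration on nested lists, not AlexNet’s actual code: the function names are invented for the example, the image is a dummy grid of numbers, and the test-time scaling by 0.5 follows the description in the AlexNet paper rather than this article.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D image."""
    return [row[::-1] for row in image]


def random_crop(image, crop_size, rng):
    """Cut a crop_size x crop_size patch at a random position."""
    top = rng.randrange(len(image) - crop_size + 1)
    left = rng.randrange(len(image[0]) - crop_size + 1)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


def dropout_train(activations, rng, drop_prob=0.5):
    """Training: set each activation to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob=0.5):
    """Testing: keep every neuron but scale outputs by the keep probability."""
    return [a * (1.0 - drop_prob) for a in activations]


rng = random.Random(0)

# Data augmentation on a dummy 256x256 "image".
image = [[r * 256 + c for c in range(256)] for r in range(256)]
patch = random_crop(image, 224, rng)
flipped = horizontal_flip(patch)

# Dropout on a dummy hidden layer of 1000 activations.
hidden = [1.0] * 1000
dropped = dropout_train(hidden, rng)
print(len(patch), len(patch[0]), dropped.count(0.0))
```

Each call to `random_crop` or `dropout_train` gives a different result, which is the point: the network rarely sees exactly the same input or the same set of active neurons twice.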

Alex replaced sigmoid with the nonlinear activation function ReLU and found that SGD converged much faster than with sigmoid/tanh. A single GTX 580 GPU has only 3 GB of memory, which limits the size of the network that can be trained on it. As the structure diagram of AlexNet shows, the network is spread across two GPUs, which can read from and write to each other’s memory directly without going through host memory, greatly increasing the scale of training.
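
One common explanation for the faster convergence is that sigmoid saturates: its gradient shrinks toward zero for large inputs, while the gradient of ReLU stays at 1 for any positive input. The sketch below simply evaluates both gradients at a few points; it is an illustration of that argument, not an experiment from the AlexNet paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


for x in (0.0, 2.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid gradient is at most 0.25 and falls off quickly as the input grows, so the updates that SGD passes back through many sigmoid layers become very small.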

Enhance the function of the convolution layer

VGGNet

VGGNet can be viewed as a deeper version of AlexNet. See Karen Simonyan and Andrew Zisserman’s paper, Very Deep Convolutional Networks for Large-Scale Image Recognition.

VGGNet and GoogLeNet (discussed below) were the runner-up and winner of the 2014 ImageNet contest, with top-5 error rates of 7.32% and 6.67%, respectively. Like AlexNet, VGGNet can be seen as having 8 parts: 5 convolution groups, 2 fully connected layers for image features and 1 fully connected layer for classification. The VGGNet paper gives five configurations, A to E, that differ in the first five convolution groups, as shown in the figure. The number of convolution layers increases from 8 (configuration A) to 16 (configuration E). VGGNet differs from AlexNet in that it uses more layers, usually 16 to 19, while AlexNet has only 8.
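
The layer counts quoted above can be checked with a little arithmetic. The per-group numbers of convolution layers below are taken from the configuration table in the VGGNet paper, not from this article, and the variable names are chosen for the example.

```python
# Convolution layers in each of the five convolution groups, per configuration.
CONV_LAYERS_PER_GROUP = {
    "A": [1, 1, 2, 2, 2],
    "B": [2, 2, 2, 2, 2],
    "C": [2, 2, 3, 3, 3],
    "D": [2, 2, 3, 3, 3],
    "E": [2, 2, 4, 4, 4],
}
FULLY_CONNECTED_LAYERS = 3  # 2 for image features + 1 for classification


def count_layers(config):
    """Return (convolution layers, total weight layers) for a configuration."""
    conv = sum(CONV_LAYERS_PER_GROUP[config])
    return conv, conv + FULLY_CONNECTED_LAYERS


summary = {name: count_layers(name) for name in CONV_LAYERS_PER_GROUP}
for name, (conv, total) in summary.items():
    print(f"{name}: {conv} convolution layers, {total} weight layers")
```

Configurations D and E give the 16-layer and 19-layer networks usually meant by “VGG-16” and “VGG-19”.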

GoogLeNet

Before discussing GoogLeNet, we first look at the idea of NIN (Network in Network); see the paper Network in Network by Min Lin, Qiang Chen and Shuicheng Yan. It makes two improvements to the traditional convolution approach: it replaces the original linear convolution layer with a multilayer perceptron layer, and it replaces the fully connected layer with global average pooling. This took convolutional neural networks down another branch of evolution, enhancing the function of the convolution module, and GoogLeNet (Inception V1) was born in 2014. Google’s GoogLeNet, the winner of the 2014 ILSVRC Challenge, reduced the top-5 error rate to 6.67%. More about GoogLeNet can be found in the paper Going Deeper with Convolutions by Christian Szegedy, Wei Liu et al.
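
Global average pooling, the second NIN improvement mentioned above, is simple enough to sketch directly: each feature map is reduced to a single number, its mean, so no fully connected weights are needed at that point. The function name and the tiny feature maps below are made up for illustration.

```python
def global_average_pool(feature_maps):
    """Reduce each 2D feature map to the average of all its values."""
    pooled = []
    for feature_map in feature_maps:
        total = sum(sum(row) for row in feature_map)
        count = len(feature_map) * len(feature_map[0])
        pooled.append(total / count)
    return pooled


# Two hypothetical 2x2 feature maps, e.g. one per class.
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]

pooled = global_average_pool(feature_maps)
print(pooled)
```

With one feature map per class, the resulting vector can be fed straight into a softmax, which is how the paper removes the parameter-heavy fully connected layer.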

The main ideas of GoogLeNet revolve around “depth” and “width”.

  • Depth. The network is deeper; the paper uses 22 layers. To avoid the vanishing gradient problem, GoogLeNet cleverly adds two loss functions at different depths so that the gradient does not vanish during back propagation.

  • Width. Convolution kernels of several sizes are added, such as 1×1, 3×3 and 5×5. If all of these were applied directly to the feature maps, the combined feature maps would be very thick. Instead, the Inception module with dimensionality reduction, shown on the right of Figure 6-11, is adopted: a 1×1 convolution kernel is added before the 3×3 and 5×5 convolutions and after max pooling to reduce the thickness of the feature maps.
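
The saving from the 1×1 convolution can be seen by counting weights. The sketch below compares a direct 5×5 convolution with a 1×1 reduction followed by a 5×5 convolution. The channel counts (192 in, 16 after reduction, 32 out) are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


IN_CHANNELS = 192      # thickness of the incoming feature maps (assumed)
REDUCED_CHANNELS = 16  # thickness after the 1x1 convolution (assumed)
OUT_CHANNELS = 32      # thickness produced by the 5x5 convolution (assumed)

# 5x5 convolution applied directly to the thick input.
direct = conv_weights(5, IN_CHANNELS, OUT_CHANNELS)

# 1x1 convolution to reduce thickness, then the 5x5 convolution.
reduced = (conv_weights(1, IN_CHANNELS, REDUCED_CHANNELS)
           + conv_weights(5, REDUCED_CHANNELS, OUT_CHANNELS))

print(direct, reduced, round(direct / reduced, 1))
```

With these numbers the reduced branch needs roughly a tenth of the weights, which is what lets the module run several kernel sizes side by side.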

To be continued…