Deep learning terminology can be very difficult for newcomers to learn. This deep learning vocabulary collects some common terms used in the field to help readers gain insight into specific topics.

The line between deep learning and “general” machine learning terminology is blurry. I try to keep this vocabulary focused on deep learning, but there may be a little overlap. For example, I am not including “cross-validation” here, because it is a generic technique used across all of machine learning. However, I decided to include terms like SoftMax or Word2Vec because they are often associated with deep learning, even if they are not exclusively deep learning techniques.

Activation Function

Inspired by the way neurons in the human brain process information, neural networks apply a nonlinear activation function at each layer so that they can model complex relationships. A signal enters a neuron, passes through the nonlinear activation function, and is passed on to the next neuron; this repeats until the signal reaches the output layer. Common activation functions include the sigmoid, tanh, and ReLU functions, as well as variations of these.
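
Below is a minimal sketch (in Python with NumPy, an assumption on my part) of the three activation functions named above, applied elementwise to a layer’s pre-activations:

import numpy as np

# Three common activation functions, applied elementwise to a layer's pre-activations.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example pre-activations of one layer
print(sigmoid(z), tanh(z), relu(z))          # each transforms the signal nonlinearly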

AdaDelta algorithm

The AdaDelta algorithm was designed mainly to address a shortcoming of the AdaGrad algorithm, so it helps to understand AdaGrad first. The behaviour is adaptive: in the early stage of descent, when the accumulated gradient is still small, the effective learning rate is relatively large; in the middle and late stages, as the accumulated gradient grows near the minimum, the effective learning rate shrinks and the updates slow down so that the iterations can settle at the lowest point. AdaDelta is an extension of AdaGrad: the basic idea is still an adaptive constraint on the learning rate, but the computation is simplified. AdaGrad adds up all of the previous squared gradients, whereas AdaDelta restricts the accumulation to a fixed-size window and, instead of storing those terms directly, approximates the accumulation with a running average.
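
As a rough illustration, here is a minimal NumPy sketch of a single AdaDelta update in the spirit of Zeiler’s formulation; the function name and the hyperparameter values (rho, eps) are illustrative assumptions, not a reference implementation:

import numpy as np

# Minimal sketch of one AdaDelta update for a single parameter vector.
# acc_grad and acc_update are running averages of squared gradients and squared updates.
def adadelta_step(param, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2            # decaying average of g^2
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    acc_update = rho * acc_update + (1 - rho) * update ** 2      # decaying average of update^2
    return param + update, acc_grad, acc_update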

Adagrad algorithm

Adagrad is an adaptive learning rate algorithm that tracks the squared gradients over time and automatically adapts the learning rate for each parameter. It can be used in place of vanilla SGD and is particularly helpful for sparse data, where it assigns a higher effective learning rate to parameters that are updated infrequently.
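
A minimal NumPy sketch of the AdaGrad update (function name and hyperparameter values are illustrative assumptions); note how the accumulated squared gradients give rarely-updated parameters a larger effective step:

import numpy as np

# Minimal sketch of the AdaGrad update: each parameter gets its own effective learning rate,
# scaled down by the squared gradients accumulated so far.
def adagrad_step(param, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    grad_sq_sum = grad_sq_sum + grad ** 2                       # accumulate squared gradients
    param = param - lr * grad / (np.sqrt(grad_sq_sum) + eps)    # rarely-updated parameters keep a larger step
    return param, grad_sq_sum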

Adam

Adam is an adaptive learning rate algorithm similar to RMSprop, but it not only keeps running-average estimates of the first and second moments of the gradient, it also applies a bias-correction term to each.

RMSprop, Adadelta, and Adam behave similarly in many cases. Adam adds bias correction and momentum on top of RMSprop, and as gradients become sparse, Adam tends to perform better than RMSprop.

Overall, Adam is often the best default choice.
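
For illustration, here is a minimal NumPy sketch of one Adam step, showing the running first- and second-moment estimates and the bias-correction terms described above (the function name and default hyperparameters are assumptions):

import numpy as np

# Minimal sketch of one Adam step: running averages of the first moment (m) and
# second moment (v) of the gradient, each with a bias-correction term.
# t is the 1-based step count.
def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad               # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2          # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v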

Affine Layer

An affine layer is a fully connected layer in a neural network. “Affine” means that every neuron in the previous layer is connected to every neuron in the current layer; in many ways, this is the “standard” layer of a neural network. An affine layer is usually added on top of the output of a convolutional or recurrent neural network before the final prediction is made. An affine layer typically has the form y = f(Wx + b), where x is the layer input, W is the weight matrix, b is the bias vector, and f is a nonlinear activation function.
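
A minimal NumPy sketch of the affine layer y = f(Wx + b) described above (the shapes and the choice of tanh as the nonlinearity are illustrative assumptions):

import numpy as np

# Minimal sketch of an affine (fully connected) layer: y = f(W x + b).
def affine_forward(x, W, b, f=np.tanh):
    return f(W @ x + b)

x = np.random.randn(4)        # input from the previous layer (4 neurons)
W = np.random.randn(3, 4)     # every input neuron connects to every output neuron
b = np.zeros(3)               # bias vector
y = affine_forward(x, W, b)   # output of 3 neurons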

Attention Mechanism

The attention mechanism is inspired by human visual attention, the ability to focus on specific parts of an image. Attention is most commonly used to improve the performance of encoder-decoder models built on RNNs (LSTMs or GRUs), and this is usually what is meant by “the attention mechanism.” Attention can be incorporated into both language-processing and image-recognition architectures, giving the model the ability to weigh different pieces of information and helping it learn what to “focus” on when making predictions.
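
As one concrete but simplified example, here is a NumPy sketch of dot-product attention over a set of encoder states; this is only one common form of attention, and the names and dimensions are illustrative assumptions:

import numpy as np

# Minimal sketch of (scaled) dot-product attention over a sequence of encoder states.
# The weights show what the decoder "focuses" on at this step.
def attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.shape[-1])   # similarity of the query to each position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over positions
    return weights @ values, weights                    # weighted sum of values ("context vector")

keys = values = np.random.randn(6, 8)   # 6 encoder states of dimension 8
query = np.random.randn(8)              # current decoder state
context, attn_weights = attention(query, keys, values)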

Alexnet

AlexNet, designed by Geoffrey Hinton and his student Alex Krizhevsky, won the 2012 ImageNet competition and is named after its first author, Alex Krizhevsky. AlexNet is a convolutional neural network architecture, and its appearance rekindled interest in CNNs for image recognition. It is an 8-layer deep network consisting of 5 convolutional layers and 3 fully connected layers, not counting the LRN and pooling layers.
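
Assuming torchvision is installed, the standard AlexNet architecture can be instantiated and inspected directly, which makes the 5-convolution + 3-fully-connected structure easy to see:

from torchvision import models

model = models.alexnet()   # randomly initialised AlexNet
print(model)               # shows the convolutional "features" block and the fully connected "classifier" block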

For AlexNet and convolutional neural network models for classification, please refer to www.atyun.com/37216.html.

Autoencoder

The autoencoder is another important topic in deep learning. Input features are compressed (encoded) and then decompressed (decoded). Whereas an ordinary neural network is trained end to end on large amounts of data to continuously improve its accuracy, an autoencoder designs an encode-and-decode process that brings the output closer and closer to the input, which makes it an unsupervised learning method.
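
A minimal NumPy sketch of an autoencoder forward pass and its reconstruction loss (the dimensions, weight initialisation, and tanh encoder are illustrative assumptions):

import numpy as np

# Minimal sketch of an autoencoder forward pass: the encoder compresses the input to a
# low-dimensional code and the decoder tries to reconstruct the original input.
def encode(x, W_enc, b_enc):
    return np.tanh(W_enc @ x + b_enc)        # compressed representation ("code")

def decode(code, W_dec, b_dec):
    return W_dec @ code + b_dec              # reconstruction of the input

x = np.random.randn(64)                      # e.g. a flattened 8x8 patch
W_enc, b_enc = np.random.randn(16, 64) * 0.1, np.zeros(16)
W_dec, b_dec = np.random.randn(64, 16) * 0.1, np.zeros(64)
reconstruction = decode(encode(x, W_enc, b_enc), W_dec, b_dec)
loss = np.mean((reconstruction - x) ** 2)    # training minimises this reconstruction error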

Average-Pooling

Pooling is a basic operation in convolutional neural networks and is usually applied after a convolution layer. In recent years, however, the mainstream classification models on ImageNet have used max-pooling, and average-pooling is rarely used; generally speaking, max-pooling works better. Although both max-pooling and average-pooling down-sample the data, max-pooling acts more like feature selection, keeping the features that discriminate best between classes and adding some nonlinearity.

According to the relevant theory, the error of feature extraction comes mainly from two sources: (1) the variance of the estimate increases because of the limited neighborhood size; and (2) parameter error in the convolution layer biases the estimated mean. Generally speaking, average-pooling reduces the first kind of error and preserves more of the image’s background information, while max-pooling reduces the second kind of error and preserves more texture information. Average-pooling emphasizes down-sampling the overall feature information and contributes more to reducing the parameter dimension.
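
The following NumPy sketch compares average- and max-pooling on a single 4x4 feature map with a 2x2 window (the reshape trick is just one simple way to form the pooling windows):

import numpy as np

# Sketch comparing average- and max-pooling on one 4x4 feature map with a 2x2 window.
feature_map = np.arange(16, dtype=float).reshape(4, 4)
blocks = feature_map.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)

avg_pooled = blocks.mean(axis=-1)   # keeps overall/background information
max_pooled = blocks.max(axis=-1)    # keeps the strongest response (texture / feature selection)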

Backpropagation

Backpropagation is an efficient algorithm for computing gradients in neural networks. It operates on a feedforward computational graph and boils down to applying the chain rule for composite functions: differentiation starts from the network output and the gradients are propagated backward through the graph. The first use of backpropagation can be traced back to Vapnik in the 1960s, but the paper “Learning Representations by Back-propagating Errors” is often cited as the original source.
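
A tiny worked NumPy example of backpropagation on a two-layer network with a squared-error loss, applying the chain rule from the output backwards (the shapes and the tanh hidden layer are illustrative assumptions):

import numpy as np

# Tiny worked example of backpropagation: forward pass, then chain rule from output to input.
x = np.random.randn(3)
W1, W2 = np.random.randn(4, 3), np.random.randn(1, 4)
target = np.array([1.0])

h = np.tanh(W1 @ x)              # forward pass, hidden layer
y = W2 @ h                       # forward pass, output
loss = 0.5 * np.sum((y - target) ** 2)

dy = y - target                  # dL/dy
dW2 = np.outer(dy, h)            # dL/dW2 via the chain rule
dh = W2.T @ dy                   # gradient propagated back to the hidden layer
da = (1 - h ** 2) * dh           # through the tanh nonlinearity
dW1 = np.outer(da, x)            # dL/dW1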

Backpropagation Through Time (BPTT)

The backpropagation algorithm was first proposed in the 1970s, but its importance was not fully appreciated until 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published the famous paper “Learning Representations by Back-propagating Errors”. Backpropagation Through Time (BPTT) is the backpropagation algorithm applied to recurrent neural networks (RNNs). BPTT can be regarded as the standard backpropagation algorithm applied to an RNN unrolled in time, where each time step represents a layer and the parameters are shared across layers. Since the RNN shares the same parameters at every time step, the error at one time step must be propagated “through time” back to all previous time steps, hence the name. When dealing with long sequences (hundreds of inputs), a truncated version of BPTT is often used to reduce the computational cost. Truncated BPTT stops backpropagating the error after a fixed number of steps.
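
Here is a rough NumPy sketch of truncated BPTT for a vanilla RNN: the forward and backward passes are restricted to the last k time steps, so gradients stop flowing beyond the truncation window (all names, shapes, and the squared-error loss are illustrative assumptions):

import numpy as np

# Hypothetical tiny RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1}), y_t = W_hy h_t.
# Truncated BPTT over a window of k steps: gradients stop flowing past the window.
def truncated_bptt(xs, targets, W_xh, W_hh, W_hy, h0, k=5):
    xs, targets = xs[-k:], targets[-k:]          # keep only the truncation window
    hs, h = [h0], h0
    for x in xs:                                 # forward pass through the window
        h = np.tanh(W_xh @ x + W_hh @ h)
        hs.append(h)
    dW_xh, dW_hh, dW_hy = [np.zeros_like(W) for W in (W_xh, W_hh, W_hy)]
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):           # backward pass (chain rule through time)
        y = W_hy @ hs[t + 1]
        dy = y - targets[t]                      # gradient of 0.5*||y - target||^2
        dW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next
        da = (1 - hs[t + 1] ** 2) * dh           # backprop through tanh
        dW_xh += np.outer(da, xs[t])
        dW_hh += np.outer(da, hs[t])
        dh_next = W_hh.T @ da                    # error propagated "through time"
    return dW_xh, dW_hh, dW_hy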

That is all for today’s update. More terms will follow, and I will explain them gradually for like-minded readers.