1. Introduction

Vision is processed hierarchically: the first layer of the problem is recognizing edge structure. Classical image classification pipeline: hand-crafted features -> SVM. Hyperparameters are chosen with a validation set; if the training set is small, use cross-validation (for example 10-fold: split the training set into 10 parts, and for a given hyperparameter X train 10 times on 9 of the parts, evaluate on the remaining part each time, and take the average validation score as the quality of that hyperparameter; then compare hyperparameters. This is more reliable but consumes more resources).
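
A minimal sketch of the k-fold procedure described above, assuming `train_and_score` is a hypothetical callback that trains a classifier under the given hyperparameter and returns its validation accuracy:

```python
import numpy as np

def cross_validate(X, y, hyperparam, train_and_score, k=10):
    """Estimate the quality of one hyperparameter value with k-fold cross-validation."""
    folds_X = np.array_split(X, k)
    folds_y = np.array_split(y, k)
    scores = []
    for i in range(k):
        # Fold i is the validation set; the remaining k-1 folds form the training set.
        X_tr = np.concatenate(folds_X[:i] + folds_X[i + 1:])
        y_tr = np.concatenate(folds_y[:i] + folds_y[i + 1:])
        scores.append(train_and_score(X_tr, y_tr, folds_X[i], folds_y[i], hyperparam))
    return np.mean(scores)  # average validation score = quality of this hyperparameter
```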

2. The classifier

2.1 The nearest neighbor classifier KNN

For each image in the test set, find the most similar image(s) in the training set and predict its label from theirs.
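
A minimal sketch of such a classifier, assuming images are flattened into the rows of a matrix and labels are non-negative integers (the class and method names here are illustrative, not from the original notes):

```python
import numpy as np

class NearestNeighbor:
    """k-nearest-neighbor classifier using the L1 distance."""

    def train(self, X, y):
        # KNN does no learning: it simply memorizes the training data.
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        preds = np.empty(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L1 distance from the i-th test image to every training image.
            dists = np.abs(self.X_train - X[i]).sum(axis=1)
            nearest = self.y_train[np.argsort(dists)[:k]]
            preds[i] = np.bincount(nearest).argmax()  # majority vote among the k neighbors
        return preds
```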

2.2 Linear classifier

  1. Linear mapping (score function): s = Wx + b
  2. Multiclass SVM (hinge) loss function
  3. Regularization penalty: the L2 norm is often used because it expresses a preference over the weights W. Fitting a few outliers requires large weights, and a larger w produces a larger penalty derivative, while for |w| < 1 the squared penalty term becomes smaller and smaller, so the penalty favors many small weights over a few large ones.
  4. Softmax classifier
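
As a hedged sketch of items 2 and 4, the two loss functions can be written in a few lines of numpy, assuming `scores` is an (N, C) matrix of class scores and `y` holds the correct class indices:

```python
import numpy as np

def svm_hinge_loss(scores, y, delta=1.0):
    """Multiclass SVM loss: penalize every wrong class whose score comes within delta of the true one."""
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0          # the true class contributes no margin
    return margins.sum() / N

def softmax_cross_entropy(scores, y):
    """Softmax classifier loss: cross entropy between predicted and true distributions."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(scores.shape[0]), y].mean()
```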

3. Neural networks

Understanding the characteristics of the network:

3.1 Loss function

  1. Why use cross entropy as the loss function

Information amount (self-information): first, find a function that satisfies (1) it is a monotone function of p(x), and (2) for two unrelated (independent) events x and y, the information obtained by observing them together equals the sum of observing them separately: I(x, y) = I(x) + I(y), while p(x, y) = p(x)p(y). This forces a logarithmic form, I(x) = -log p(x); the negative sign guarantees that the amount of information is positive and that the information brought by an event is inversely related to its probability.

Information entropy: the expectation of self-information, H(p) = -Σ p(x) log p(x). The more certain the outcome, the smaller the entropy; conversely, the more uncertain it is, the more information is needed to determine it.

Relative entropy (KL divergence): for the same random variable X with two separate probability distributions P(x) and Q(x), the relative entropy KL(P‖Q) = Σ P(x) log(P(x)/Q(x)) measures the difference between the two distributions.

Cross entropy: H(P, Q) = -Σ P(x) log Q(x). Since relative entropy = cross entropy - information entropy, and the information entropy of the true distribution is fixed, optimizing the cross entropy is equivalent to optimizing the KL divergence.
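
A small numeric sketch of the identity above; the distributions p and q are made up purely for illustration:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

entropy       = -(p * np.log(p)).sum()      # H(p): information entropy
cross_entropy = -(p * np.log(q)).sum()      # H(p, q)
kl_divergence =  (p * np.log(p / q)).sum()  # KL(p || q): relative entropy

# Relative entropy = cross entropy - information entropy
assert np.isclose(kl_divergence, cross_entropy - entropy)
```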

3.2 Activation function

  1. Why activation functions: to introduce nonlinearity, so the network can represent complex nonlinear mappings between inputs and outputs and approximate complicated functions (think of fitting a curve); without them, stacked linear layers collapse into a single linear map.

Tanh: suffers from gradient vanishing, but its output is zero-centered. ReLU: piecewise linear, not linear overall, so it can still fit arbitrary functions. Advantages: (1) the first two (sigmoid and tanh) consume computational resources, while ReLU only needs a threshold comparison; (2) convergence is much faster; (3) no gradient vanishing in the positive region. Disadvantages: (1) the output is not zero-mean; (2) the data is not compressed, so activations can keep growing; (3) "dead" neurons whose parameters are never updated again, for two reasons: bad parameter initialization, or a learning rate that is too high so the parameter updates during training are too large. Leaky ReLU: gives the negative region a small slope to solve the dying-neuron problem.
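
A minimal numpy sketch of the activation functions discussed (sigmoid included for comparison):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates (vanishing gradient), not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered output, but still saturates

def relu(x):
    return np.maximum(0.0, x)              # just a threshold: cheap, no saturation for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope avoids dead neurons
```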

3.3 Data Preprocessing


1. Zero-centering: subtract the mean;

2. Normalization: after subtracting the mean, divide by the standard deviation to adjust the numerical range (giving each dimension mean 0 and variance 1, so different dimensions cover the same range). Each sample consists of multiple features whose units and magnitudes differ, so this puts them on the same scale.

3. PCA and whitening: for example, reduce the data from N×D to N×100, keeping the 100 dimensions of maximum variance, then whiten by equalizing the scale of each retained dimension (not commonly used in convolutional networks).
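
A hedged sketch of the three preprocessing steps; the data matrix and its dimensions are stand-ins for illustration:

```python
import numpy as np

X = np.random.randn(500, 200)   # stand-in data: 500 samples, 200-dimensional features

X = X - X.mean(axis=0)          # 1. zero-center every dimension
X = X / X.std(axis=0)           # 2. normalize: each dimension now has mean 0, variance 1

# 3. PCA and whitening (rarely used with convolutional networks).
cov = X.T @ X / X.shape[0]                 # covariance matrix of the zero-centered data
U, S, _ = np.linalg.svd(cov)
X_pca   = X @ U[:, :100]                   # keep the 100 directions of largest variance
X_white = X_pca / np.sqrt(S[:100] + 1e-5)  # whitening: equalize the scale of every direction
```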

3.4 Weight initialization

A BN layer can reduce the dependence on weight initialization.

1. Do not initialize all weights to zero (or to any equal value): every neuron then produces the same output, receives the same gradient in backpropagation, and gets the same update, so the symmetry is never broken.
2. Random initialization with small numbers, W = 0.01 * np.random.randn(D, H); but very small values are not necessarily good, because the gradients propagated backward become too small.
3. Calibrate the variance with 1/sqrt(n).

Reference: www.cnblogs.com/shine-lee/p…


E(w) = 0; we need to control Var(w) so that the inputs to the activations of different layers have the same variance. Both the forward and the backward pass should be considered. Expectation and variance: the derivation.

Since we want the activations a and pre-activations z of every layer to be identically distributed, the variance of W should be 1/fan_in, i.e. W = np.random.randn(n) / sqrt(n). As the number of inputs n grows, the variance of the weighted sum also grows, and dividing by the square root of the number of inputs rescales the range, ensuring that every neuron has approximately the same output distribution at the start; the initialization is thus normalized.

"Understanding the difficulty of training deep feedforward neural networks" gives a similar analysis; in that paper the authors suggest initializing with Var(W) = 2 / (n_in + n_out), where n_in and n_out are the numbers of neurons in the previous and next layers respectively. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" derives the weight initialization for ReLU neurons: the neuron variance needs to be 2.0/n, i.e. w = np.random.randn(n) * sqrt(2.0/n), which is the current recommendation when using ReLU neurons in neural networks.

4. Sparse initialization.
5. Bias initialization.
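
A brief sketch contrasting the initialization schemes above; the layer sizes are made-up examples:

```python
import numpy as np

fan_in, fan_out = 512, 256   # example layer sizes

# 2. Small random numbers: breaks symmetry, but backward gradients may become too small.
W_small = 0.01 * np.random.randn(fan_in, fan_out)

# 3. Calibrate the variance with 1/sqrt(n), n = fan_in.
W_calibrated = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# Glorot/Xavier suggestion: Var(W) = 2 / (n_in + n_out).
W_xavier = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

# He initialization for ReLU neurons: Var(W) = 2 / n.
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

# 5. Bias initialization: zeros are the common choice.
b = np.zeros(fan_out)
```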

3.5 Regularization


1. **L1/L2 regularization, and their combination**

2. Max-norm constraint: put an upper bound on the norm of each weight vector so that the weights cannot explode.
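
A minimal sketch of both ideas, assuming W stores one neuron's incoming weights per column (the function names and coefficients are illustrative):

```python
import numpy as np

def l1_l2_penalty(W, lam_l2=1e-4, lam_l1=0.0):
    """Combined L1/L2 regularization term added to the data loss."""
    return lam_l2 * np.sum(W * W) + lam_l1 * np.sum(np.abs(W))

def max_norm(W, c=3.0):
    """Max-norm constraint: rescale any column whose norm exceeds the upper limit c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / (norms + 1e-8))
```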

3.6 Updating Parameters

1. First-order methods: stochastic gradient descent, momentum.
2. Learning-rate annealing (learning-rate decay): step decay, exponential decay, 1/t decay.
3. Per-parameter adaptive learning-rate methods: Adagrad / RMSprop / Adam.

Supplement on parameter counting: the parameter count of a convolutional layer is the number of kernels times kernel width × kernel height × input depth.
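
A hedged sketch of two of the update rules: plain SGD with momentum, and Adam as a per-parameter adaptive method (argument defaults are common choices, not values from the original notes):

```python
import numpy as np

def sgd_momentum(w, dw, v, lr=1e-2, mu=0.9):
    """Momentum update: v accumulates a running direction of descent."""
    v = mu * v - lr * dw
    return w + v, v

def adam(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: adapts the step size per parameter via first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction (t is the step count, starting at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```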

3.7 Batch normalization

Why it is needed:

1. At the input layer, part of the samples can differ greatly from the rest, e.g. some features in [0, 1] and others in [10, 100]. A shallow model then spends its time describing one part and then the other, repeatedly undoing its own work, which is inefficient. The same thing happens in the remaining layers with the output of the previous layer, so we want mean = 0 and variance = 1; this also helps against gradient vanishing.

2. The fourth step can be understood as follows: for sigmoid, restricting the input to a standard normal distribution confines it to the nearly linear region and reduces the nonlinear expressive power, so each layer learns for itself whether a shift and scale are needed to fit the distribution of the original data.

The first three steps are the normalization itself, which yields a standard normal distribution. But for sigmoid the gradient changes little on [-1, 1], so the nonlinearity is weak; moreover the data itself may be asymmetric, so a standard normal distribution is not necessarily ideal. The fourth step therefore avoids the loss of network expressive power caused by forcing a normal distribution: a learned scale and shift change the variance and the mean so the result better fits the real distribution. => At test time, BN uses the mean and variance accumulated over the whole training set (saved during training) as the mean and variance for the test samples.

In addition, BN controls the magnitude of the input to each layer, which goes some way toward solving gradient explosion/vanishing. Why do gradients vanish or explode? The value produced by backpropagation depends not only on the derivative formulas but, to a large extent, on the magnitude of the inputs: if the input magnitude at each node of the computation graph is greater than 1, the gradient inevitably grows geometrically after being propagated back through many layers. BN's function is essentially to control the magnitude of each layer's input, which is why it largely resolves the gradient explosion/vanishing phenomenon.
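
A compact sketch of the four BN steps and of the running statistics kept for test time, assuming a fully connected input of shape (N, D):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """Normalize each feature over the batch, then apply a learned scale and shift."""
    if train:
        mu = x.mean(axis=0)                       # step 1: batch mean
        var = x.var(axis=0)                       # step 2: batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)     # step 3: normalize to mean 0, variance 1
        # Save running statistics; they replace the batch statistics at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta                    # step 4: learned scale and shift
    return out, running_mean, running_var
```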

3.8 Preventing overfitting

1. Data augmentation: image transformations such as translation/cropping and horizontal flipping, and changing the intensity of the RGB channels.
2. Dropout.
3. Regularization.
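
A brief sketch of items 1 and 2, assuming images are (H, W, C) arrays with values in [0, 255]; the dropout here is the inverted variant, so nothing changes at test time:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: randomly zero activations during training only (p = keep probability)."""
    if not train:
        return x
    mask = (np.random.rand(*x.shape) < p) / p   # divide by p so the expected activation is unchanged
    return x * mask

def augment(img):
    """Data augmentation: random horizontal flip plus RGB channel intensity jitter."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]                                # horizontal flip
    img = img + np.random.uniform(-10, 10, size=(1, 1, 3))   # shift each RGB channel
    return np.clip(img, 0, 255)
```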