Linear models are among the most basic and important tools in machine learning. Taking logistic regression and linear regression as examples, they can fit data efficiently and reliably through either a closed-form solution or convex optimization. In real-world settings, however, we usually run into data that are not linearly separable, so a nonlinear transformation is needed to remap their distribution. In a deep neural network we therefore apply a nonlinear function after each layer's linear transformation; otherwise a multi-layer network would be equivalent to a single linear layer, and it is this nonlinearity that lets the network fit nonlinear problems.
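
The equivalence claim is easy to check numerically. Here is a toy NumPy sketch (my own illustration, not code from the original post; the shapes and names are arbitrary) showing that two stacked linear layers collapse into one, while inserting a nonlinearity prevents the collapse:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))        # a batch of 5 inputs with 4 features
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

# Two stacked linear layers...
two_layers = (x @ W1) @ W2
# ...are exactly one linear layer with the merged weight matrix W1 @ W2.
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))       # True

# With a nonlinearity (here tanh) in between, the composition is no longer
# linear and cannot be collapsed into a single matrix.
with_activation = np.tanh(x @ W1) @ W2
print(np.allclose(with_activation, one_layer))  # False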

1. Activation functions

1. Sigmoid

The sigmoid function, sigmoid(x) = 1 / (1 + e^(-x)), limits its output to the interval (0, 1): for a very large negative input the output is close to 0, and for a very large positive input it is close to 1, so the values do not easily diverge as they are propagated through the network.

2. Tanh

Tanh is a transformed sigmoid: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2 * sigmoid(2x) - 1. Its output is zero-centered, which usually works better than sigmoid in practice.

3. ReLU

ReLU (Rectified Linear Unit), ReLU(x) = max(0, x), has become a popular activation function in recent years: when the input is less than 0 the output is 0, and when the input is greater than 0 the output equals the input.
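
To make the three definitions concrete, here is a minimal NumPy sketch (my own, not code from the original post) that implements them and evaluates them on a few inputs:

import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered variant; equivalent to 2 * sigmoid(2x) - 1.
    return np.tanh(x)

def relu(x):
    # Passes positive inputs through, zeroes out negative ones.
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))  # approaches 0 for large negative x, 1 for large positive x
print(tanh(x))     # approaches -1 and 1 instead
print(relu(x))     # [ 0.  0.  0.  1. 10.]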

2. Sigmoid and tanh cause vanishing gradients

From the sigmoid formula and its output range, the result tends to 1 when the input is very large and to 0 when the input is very small. Its derivative, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), tends to 0 for both very large and very small inputs, which causes the gradient to vanish.

Likewise, from its formula and output range, tanh tends to 1 when the input is very large and to -1 when the input is very small. Its derivative, tanh'(x) = 1 - tanh(x)^2, also tends to 0 at both extremes, so the gradient vanishes as well.
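
A small numerical check of this saturation (a sketch of my own, using the derivative formulas given above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    # Derivatives computed from the closed-form expressions above.
    d_sigmoid = s * (1.0 - s)
    d_tanh = 1.0 - np.tanh(x) ** 2
    print(x, d_sigmoid, d_tanh)
# At x = 10 the sigmoid derivative is about 4.5e-5 and the tanh derivative
# about 8.2e-9: gradients passing through many such saturated units shrink
# toward zero.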

3. Advantages, drawbacks, and variants of ReLU

Advantages:

(1) From a computational perspective, both the sigmoid and tanh activation functions require evaluating an exponential, which is relatively expensive, whereas ReLU only needs a simple threshold comparison.

(2) ReLU does not saturate in the positive region, which effectively alleviates the vanishing-gradient problem and provides a relatively wide activation boundary.

(3) The one-sided suppression of ReLU (negative inputs are set to zero) gives the network a sparse representation, as the short sketch below illustrates.
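
A rough illustration of that sparsity claim (a toy sketch of my own, not from the original post): with zero-mean random pre-activations, ReLU silences about half of the units.

import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)       # zero-mean inputs to the units
activations = np.maximum(0.0, pre_activations)  # ReLU
print((activations == 0).mean())                # roughly 0.5: half the outputs are exactly zero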

Disadvantages:

Training can run into the problem of dying neurons. From the ReLU formula, any negative input produces an output of 0 and a gradient of 0; if a unit is pushed into a state where its input is always negative, it will not be activated by any data afterwards: the gradient flowing through that neuron is always 0 and it stops responding to the data.

Variants such as Leaky ReLU (LReLU), PReLU, and RReLU address this by giving negative inputs a small non-zero slope.
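
A minimal NumPy sketch of the Leaky ReLU idea (the slope 0.01 is a common default, not a value given in the post):

import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Unlike plain ReLU, negative inputs keep a small non-zero slope,
    # so the gradient never becomes exactly zero and the unit can recover.
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))  # [-0.1  -0.01  0.    1.   10.  ]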

4. Parameter initialization during neural network training

Consider a fully connected network: the neurons in any given layer are homogeneous, receiving the same input and feeding the same output. If all parameters are initialized to the same value, then every neuron in a layer computes identical values in the forward pass and receives identical gradients in the backward pass, so learning can never differentiate the features, and the parameters of the neurons within each layer remain identical.

We therefore initialize the parameters of the neural network randomly to break this symmetry.
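
A small numerical demonstration of the symmetry problem (a toy two-layer network of my own construction, with manually derived gradients; none of these names come from the original post):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # a small batch of inputs
y = rng.normal(size=(8, 1))          # regression targets

def hidden_gradients(W1, W2):
    # One forward/backward pass of a 2-layer tanh network with MSE loss.
    h = np.tanh(x @ W1)              # hidden activations, shape (8, 3)
    pred = h @ W2                    # outputs, shape (8, 1)
    d_pred = 2 * (pred - y) / len(x) # dLoss/dpred
    d_h = d_pred @ W2.T              # backprop into the hidden layer
    d_W1 = x.T @ (d_h * (1 - h ** 2))
    return d_W1                      # shape (4, 3): one column per hidden unit

# Constant initialization: every hidden unit receives exactly the same
# gradient column, so the units stay identical forever.
W1_const, W2_const = np.full((4, 3), 0.5), np.full((3, 1), 0.5)
g = hidden_gradients(W1_const, W2_const)
print(np.allclose(g[:, 0], g[:, 1]) and np.allclose(g[:, 1], g[:, 2]))  # True

# Random initialization breaks the symmetry.
g = hidden_gradients(rng.normal(scale=0.1, size=(4, 3)), rng.normal(scale=0.1, size=(3, 1)))
print(np.allclose(g[:, 0], g[:, 1]))  # False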

5. Model optimization of neural networks

For dealing with underfitting and overfitting, in addition to the usual machine learning methods, neural networks also have dropout, which I won't go into here.