

Summary of activation function/excitation function

We briefly mentioned activation functions in the three basic problems of neural networks and used sigmoid as the example. At the time we gave only one use for the activation function: mapping the result into (0, 1) so that it can be read as a probability. That is actually not the primary purpose of activation functions, and there are far more activation functions than sigmoid.

The activation function, as its name suggests, activates our neural network!

⭐️ Why is an activation function needed?

The activation function is the source of nonlinearity in a neural network. If the activation functions are removed, only linear operations are left in the whole network, and no matter how many layers the network has, a composition of linear operations is still a linear operation, so the final effect is equivalent to a single-layer linear model. 💡 Recall from high school mathematics: composing low-order (linear) functions can never produce higher-order terms, and only nonlinear functions can fit complex curves.

The essence of a neural network is to learn, through training, a function that fits the data, and the more layers the network has, the more complex the curves it can express. In theory, given enough layers (with nonlinear activations), it can fit arbitrarily complex curves.
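As a concrete illustration (a minimal NumPy sketch, not code from the referenced articles), two stacked linear layers with no activation collapse into a single linear layer, while inserting a ReLU between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                              # 5 samples, 3 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)     # layer 1
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)     # layer 2

# "Deep" network without an activation function
deep_out = (x @ W1 + b1) @ W2 + b2

# Equivalent single linear layer: W = W1 @ W2, b = b1 @ W2 + b2
single_out = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(deep_out, single_out))                 # True: the extra layer added nothing

# With a nonlinearity between the layers, the equivalence no longer holds
relu_out = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(deep_out, relu_out))                   # False in general
```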

Common activation functions

The activation function is also called the excitation function.

The Sigmoid function

The sigmoid function is also called the logistic function, because it can be derived from logistic regression (LR) and is the activation function specified by the logistic regression model. For binary classification problems, its (0, 1) range is exactly what is needed to predict a probability, so sigmoid can be used in the output layer of a binary classifier while other activation functions are used in the other layers.

The value range of the sigmoid function is (0, 1), so the output of the network can be mapped into this range for convenient analysis.

The Sigmoid function used to be used a lot, but in recent years it has been used less and less, mainly because of its inherent disadvantages.

Formula and image
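The sigmoid function and its derivative (whose small range drives the disadvantages listed below) are:

\sigma(x)=\frac{1}{1+e^{-x}},\qquad \sigma'(x)=\sigma(x)\bigl(1-\sigma(x)\bigr)\in\Bigl(0,\ \tfrac{1}{4}\Bigr]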

advantages

  • Smooth and easy to differentiate (as can be seen from the image)
  • Well suited to binary classification problems.

disadvantages

  • The activation function requires a lot of computation and takes a lot of time in large-scale deep networks (both forward and back propagation involve exponentiation and division).
  • ⭐️ Sigmoid's output is not zero-centered (its mean is 0.5). As a result, neurons in later layers receive the non-zero-mean outputs of the previous layer as input, which shifts the original distribution of the data as the network deepens.
  • Vanishing gradients in deep networks: the derivative of sigmoid lies in (0, 0.25]. Because back propagation chains these derivatives together in a "chain reaction", the gradient easily vanishes. For example, in a 10-layer network, the gradient of the 10th layer's error with respect to the first layer's parameters is a product of many such factors and becomes a very small value; this is called "gradient vanishing" (see the sketch after this list). ⚠️ The speed of learning depends on the partial derivatives: put simply, gradient descent updates a parameter by subtracting the derivative, and if the derivative is too small, layers far from the output barely learn.
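A rough NumPy sketch (an illustration of the chain-rule argument above, not the articles' code) of how chained sigmoid derivatives shrink the gradient across 10 layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
z = rng.normal(size=10)                        # toy pre-activation values, one per layer

local_grads = sigmoid(z) * (1 - sigmoid(z))    # sigma'(z) for each layer, each at most 0.25
chained = np.prod(local_grads)                 # the chain rule multiplies them together

print(local_grads.max())                       # never exceeds 0.25
print(chained)                                 # a product of ten small factors: essentially vanished
```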

Tanh function

The tanh function is something of an upgrade of the sigmoid function and is a little better than sigmoid in almost every way. The mean of sigmoid's output is 0.5, while the mean of tanh's output is 0; passing outputs that are close to zero to the next layer of neurons lets them work more efficiently.

Formula and image
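The tanh function and its derivative are:

\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=2\sigma(2x)-1,\qquad \tanh'(x)=1-\tanh^{2}(x)\in(0,\ 1]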

advantages

  • Compared with the Sigmoid function, the range of tanh is (-1, 1), so its output is zero-centered (mean 0), which solves sigmoid's non-zero-mean problem.

disadvantages

  • The serious problem of costly exponentiation remains.
  • The derivative of tanh lies in (0, 1], which is larger than sigmoid's (0, 0.25] but still small, so gradient vanishing is relieved but not solved (a quick numerical check follows this list).
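A quick numerical check of the two derivative ranges (illustrative only):

```python
import numpy as np

x = np.linspace(-10, 10, 100001)               # includes x = 0, where both derivatives peak

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1 - sigmoid)            # sigmoid's derivative, peaks at 0.25
d_tanh = 1.0 - np.tanh(x) ** 2                 # tanh's derivative, peaks at 1.0

print(d_sigmoid.max(), d_tanh.max())           # 0.25 1.0
```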

ReLu function

ReLU (Rectified Linear Unit): the rectified linear unit. The function is simple in form, but it is one of the most widely used activation functions.

The derivative ranges of sigmoid and tanh lead to the gradient-vanishing problem, while the derivative of ReLU is constantly 1 for positive inputs, which effectively avoids gradient vanishing in the back-propagation chain, so back propagation can proceed normally.

Another very good property of ReLU is that its output for negative inputs is 0 (setting it to 0 means shielding that feature), so only part of the neurons are activated at any one time; this keeps the network sparse and greatly improves computational efficiency. The following passage explains this well:

Before describing this feature, it is necessary to clarify the goal of deep learning: to find the key information (key features) hidden in intricate data relationships from a large number of samples. In other words, a dense matrix is transformed into a sparse matrix that retains the key information of the data and removes the noise, which makes the model robust. ReLU sets the output to 0 wherever x < 0, which is exactly such a denoising, sparsifying process. Moreover, this sparsity is adjusted dynamically during training: the network automatically tunes the sparsity ratio to keep the most effective features in the matrix.

However, ReLU forcibly sets the output to 0 wherever x < 0 (shielding that feature), which may prevent the model from learning effective features. If the learning rate is set too high, most neurons in the network may end up in a "dead" state, so the learning rate of a ReLU network cannot be set too high.

Formula and image

\mathrm{ReLU}(x)=\max(0,\ x)

advantages

  • As can be seen, the derivative of ReLU is constantly 1 for x > 0, which effectively solves the gradient-vanishing problem in deep networks.
  • Compared with sigmoid and tanh, ReLU dispenses with the expensive exponentiation and speeds up computation.
  • The output for negative inputs is 0, so not all neurons are activated at the same time; the network stays sparse and computational efficiency improves (see the sketch after this list).
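A minimal ReLU sketch (illustrative; the helper names are my own) showing the constant unit gradient for positive inputs and the sparsity of the activations:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 for x < 0 (the point x = 0 is conventionally assigned 0 here)
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
z = rng.normal(size=1000)           # toy pre-activations

a = relu(z)
print((a == 0).mean())              # roughly half of the outputs are zero: a sparse activation
print(np.unique(relu_grad(z)))      # [0. 1.] -- active units pass the gradient through unchanged
```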

disadvantages

  • Although gradient vanishing is solved, gradient explosion may still occur; the usual remedy is to further constrain the weights so that they stay within the range (0, 1).
  • If the learning rate is too high, most neurons may be in a dead state.

Leaky ReLu function

The ReLu function can leave neurons in a dead state. We can consider replacing the 0 output for negative inputs with a non-zero but very small value (such as 0.01x), which turns the zero gradient into a very small one; this is the Leaky ReLu function.

Formula and image

\mathrm{LeakyReLU}(x)=\begin{cases}x & x>0\\ \lambda x & x\leqslant 0\end{cases}
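A minimal Leaky ReLU sketch (illustrative; the slope 0.01 for x ≤ 0 is the commonly used default, not a value fixed by the article):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 for x > 0 and alpha (small but nonzero) otherwise,
    # so negative inputs still receive a learning signal
    return np.where(x > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))                 # [-0.02  -0.005  0.     0.5    2.   ]
print(leaky_relu_grad(z))            # [0.01   0.01   0.01   1.     1.   ]
```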

How to choose the activation function?

Different application scenarios and different training data call for different activation functions. To find a suitable one, we usually experiment with a small amount of training data and try the candidates one by one. There is no definitive method for this problem; most of it is rule of thumb.

  • There are some generally accepted selection rules. ReLU is usually used the most, but ReLU is only used in hidden layers. Based on experience, start with the ReLU activation function; if ReLU does not solve the problem well, then try another activation function (a sketch following these rules appears after this list).
  • If you are using ReLU, be careful when setting the learning rate and avoid having too many "dead" neurons in the network. If this is a problem, try Leaky ReLU, PReLU or Maxout.
  • Tanh is superior to sigmoid in every respect (except for the output layer in binary classification applications).
  • In addition, deep learning often needs a lot of time to process a large amount of data, so the convergence rate of the model is particularly important. In general, training deep networks should therefore use zero-centered data (which can be achieved through data preprocessing) and zero-centered outputs, so activation functions with zero-centered outputs should be chosen as far as possible to accelerate the convergence of the model.
  • Theoretically, each layer could use a different activation function, but this is generally not done.
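As a closing illustration of these rules (a sketch only, not code from the referenced articles): ReLU in the hidden layer and sigmoid only at the output of a binary classifier.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                               # 4 samples, 8 features

W1, b1 = 0.1 * rng.normal(size=(8, 16)), np.zeros(16)     # hidden layer parameters
W2, b2 = 0.1 * rng.normal(size=(16, 1)), np.zeros(1)      # output layer parameters

hidden = relu(x @ W1 + b1)                                # ReLU in the hidden layer
prob = sigmoid(hidden @ W2 + b2)                          # (0, 1) output read as a probability

print(prob.ravel())                                       # one predicted probability per sample
```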