Preface
The computer vision tutorial series is paused for now; I expect to resume my study notes and tutorials on CV project development after mid-August. The number of followers has now passed 100. Thanks to my family and friends for their support, and to fellow machine learning enthusiasts for their encouragement; I will keep to my learning route and continue to share it with you. In this post, I take up a question about activation functions that a friend asked in yesterday's study group. I have collected some related material and organized it here to compare the strengths and weaknesses of common activation functions.
What is the activation function
This article approaches activation functions from several angles: the basic concept, mathematical forms, Python implementations, and a comparison of advantages and disadvantages. In an artificial neural network, the activation function of a node defines the output of that node given an input or a set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that are either "ON" (1) or "OFF" (0) depending on the input; this is similar in spirit to logistic regression. In other words, an activation function is a function added to a neural network to help it learn complex patterns in the data. Analogous to neuron-based models of the human brain, the activation function ultimately determines what is transmitted to the next neuron. It is the mathematical equation that determines the output of the network.
When a neuron receives signals from other neurons or from the outside world, it applies a linear transformation to the input using weights and a bias. Because a linear mapping is simple and has limited ability to solve complex problems, an activation function is added to transform the input nonlinearly, allowing the network to learn and perform more complex tasks. The activation function therefore matters a great deal, and so does choosing an appropriate one.
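As a minimal sketch of this idea (the weight, bias, and input values below are purely illustrative, not from any particular model), a single neuron first applies the linear transformation and then the activation:

```python
import numpy as np

def neuron_output(x, w, b, activation):
    z = np.dot(w, x) + b        # linear transformation: weights and bias
    return activation(z)        # nonlinearity enables learning complex patterns

def relu(z):
    return np.maximum(0.0, z)

# Illustrative values
y = neuron_output(np.array([1.0, 2.0]), np.array([0.5, -1.0]), 0.2, relu)
```

Without the activation, stacking such neurons would still compute only a linear function of the input.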
Activation function type
Common activation functions can be divided into three categories: ridge functions, radial functions, and folding activation functions applied in convolutional neural networks.

Ridge function: a function of several variables acting on a linear combination of the input variables

Linear function:
$\phi (\mathbf {v} )=a+\mathbf {v} '\mathbf {b}$
ReLU function:
$\phi (\mathbf {v} )=\max(0,\,a+\mathbf {v} '\mathbf {b} )$
The Heaviside function:
$\phi (\mathbf {v} )=1_{a+\mathbf {v} '\mathbf {b} >0}$
The Logistic function:
$\phi (\mathbf {v} )=(1+\exp(-a-\mathbf {v} '\mathbf {b} ))^{-1}$

Radial activation function: operates on the distance between points in Euclidean space; radial functions work well as general-purpose function approximators.

Gaussian function:
$\phi (\mathbf {v} )=\exp \left(-{\frac {\|\mathbf {v} -\mathbf {c} \|^{2}}{2a^{2}}}\right)$
Multiquadratic function:
$\phi (\mathbf {v} )={\sqrt {\|\mathbf {v} -\mathbf {c} \|^{2}+a^{2}}}$

Note: c is the vector marking the center of the function, and a is a parameter affecting the spread of the radius
 Folding activation function: widely used in the pooling layers of convolutional neural networks and in the output layer of multiclass classification networks. These activations aggregate over their inputs, e.g. by taking the mean, minimum, or maximum. Softmax is often used for multiclass classification
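The radial and folding families above can be sketched in a few lines of NumPy (the center, radius parameter, and toy 4x4 feature map below are purely illustrative):

```python
import numpy as np

def gaussian_rbf(v, c, a):
    # Radial: depends only on the Euclidean distance between v and the center c
    return np.exp(-np.sum((v - c) ** 2) / (2 * a ** 2))

def multiquadratic(v, c, a):
    return np.sqrt(np.sum((v - c) ** 2) + a ** 2)

# Folding: collapse many inputs into one value, e.g. 2x2 max pooling
fmap = np.arange(16.0).reshape(4, 4)   # toy 4x4 feature map
pooled = np.array([[fmap[i:i+2, j:j+2].max() for j in (0, 2)] for i in (0, 2)])
```

At the center v = c the Gaussian attains its maximum of 1, and the multiquadratic reduces to a.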
Activation function mathematical properties
Each activation function has its own characteristics, and depending on the characteristics, it may be suitable for a particular model to show better results. In addition to the mathematical structure, the activation function has different mathematical properties:
 Nonlinear: when the activation function is nonlinear, a two-layer neural network can be proved to be a universal function approximator. Conversely, without nonlinear activations, a network of many layers collapses into a model equivalent to a single layer. Nonlinearity lets the network approximate arbitrary nonlinear functions, which is what makes neural networks applicable to so many nonlinear models.
 Range: the output range of an activation function can be finite or infinite. With a finite range, gradient-based optimization tends to be more stable, because only a limited set of weights significantly affects the feature representation. With an infinite range, training the model is generally more efficient, but note that a smaller learning rate is usually required.
 Continuous differentiability: this property is not strictly required (the non-differentiability of ReLU at isolated points has little practical effect), but it guarantees that gradients are computable during optimization.
 Non-saturation: saturation means the gradient approaches 0 over some interval (the vanishing gradient), so the parameters can no longer be updated. The derivative of Sigmoid approaches 0 as its input approaches plus or minus infinity, and the step function has zero gradient almost everywhere, so it cannot be used as an activation function. Improved activation functions have been proposed by researchers to address this problem.
 Monotonicity: the sign of the derivative does not change. Most activation functions have this property. Monotonicity makes the gradient direction of the activation less likely to flip, which helps training converge and makes it more effective.
 Approximate identity transformation: this property makes the network more stable, and only a few activation functions have it; Tanh has derivative 1 only near the origin, and ReLU is linear only for x > 0. The same design idea appears in ResNet for CNNs and in LSTM for RNNs.
 Fewer parameters: Fewer parameters can reduce the network size
 Normalization: The main idea is to automatically normalize the sample distribution to zero mean, unit variance distribution, so as to stabilize the training.
No single one of these mathematical properties decisively determines how well a model performs, and none of them is uniquely indispensable; they are guides that help us select an appropriate activation function when building a model.
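The saturation property above is easy to verify numerically; a small sketch using the Sigmoid derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

# Near the origin the gradient is at its maximum of 0.25;
# far from the origin the function saturates and the gradient vanishes.
grad_center = sigmoid_grad(0.0)
grad_far = sigmoid_grad(10.0)
```

With many stacked saturating layers these near-zero gradients multiply together, which is exactly the vanishing-gradient problem described above.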
Activation function comparison
Now comes the most important part: the comparison of activation functions. Having covered their types and properties, we can ask which activation functions are commonly used and what characterizes each of them. The following collection of common activation functions includes function curves, mathematical forms, ranges, differentiable intervals, and continuity.
Full collection of common activation functions! (The tables that appeared here in the original, covering the ridge functions and the folding functions, listed each function's curve, mathematical form, range, differentiable interval, and continuity.)
How to select the appropriate activation function
By understanding these functions and analyzing their properties, we can summarize how to choose an appropriate activation function. Depending on the nature of the problem, this lets us make better choices when building a model. Based on the experience reported in several articles, the selection rules are as follows (for reference only):
 First consider the common activation functions: Sigmoid, TanH, ReLU, Leaky ReLU, ELU, SoftPlus, Binary Step, Maxout, and Mish
 When used as a classifier, the Sigmoid function and its combination usually work better
 To avoid the vanishing-gradient problem, avoid Sigmoid and TanH in deep networks
 Try ReLU first, since it is the fastest; observe the model's performance, and if the results are poor, try Leaky ReLU or Maxout
 ReLU can only be used in hidden layers
 In CNN with few layers, the activation function has little influence.
Code implementation
With the theory in place, the next step is to build these functions from scratch in practice. I suggest saving these code snippets for future use.
Sigmoid code implementation: suitable for binary classification; for multiclass problems the effect is mediocre. Watch out for the vanishing-gradient problem
import numpy as np

def sigmoid(x):
    s = 1 / (1 + np.exp(-x))
    return s

# TensorFlow 2.x version
sigmoid_fc = tf.keras.activations.sigmoid(x)
# PyTorch version
sigmoid_fc = torch.nn.Sigmoid()
output = sigmoid_fc(x)
TanH code implementation: note the vanishing-gradient problem
import numpy as np

def tanh(x):
    s1 = np.exp(x) - np.exp(-x)   # numerator
    s2 = np.exp(x) + np.exp(-x)   # denominator
    s = s1 / s2
    return s

# TensorFlow 2.x version
tanh_fc = tf.keras.activations.tanh(x)
# PyTorch version
tanh_fc = torch.nn.Tanh()
output = tanh_fc(x)
ReLU code implementation: the most commonly used; use it only in hidden layers
import numpy as np

def relu(x):
    s = np.where(x < 0, 0, x)
    return s

# TensorFlow 2.x version
relu_fc = tf.keras.activations.relu(x)
# PyTorch version
relu_fc = torch.nn.ReLU()
output = relu_fc(x)
Leaky ReLU code implementation: useful when the network contains a large number of inactive ("dead") neurons
import numpy as np

def lrelu(x, alpha=0.01):
    s = np.where(x >= 0, x, alpha * x)
    return s

# TensorFlow 2.x version
lrelu_fc = tf.keras.activations.relu(x, alpha=0.01)  # the size of alpha must be specified
# PyTorch version
lrelu_fc = torch.nn.LeakyReLU(0.01)
output = lrelu_fc(x)
ELU code implementation
import numpy as np

def elu(x, alpha=0.1):
    s = np.where(x >= 0, x, alpha * (np.exp(x) - 1))
    return s

# TensorFlow 2.x version
elu_fc = tf.keras.activations.elu(x, alpha=0.1)  # the size of alpha must be specified
# PyTorch version
elu_fc = torch.nn.ELU(0.1)
output = elu_fc(x)
Softmax code implementation
import numpy as np

def softmax(x):
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    s = x_exp / x_sum
    return s

# TensorFlow 2.x version
softmax_fc = tf.keras.activations.softmax(x)
# PyTorch version
softmax_fc = torch.nn.Softmax(dim=1)
output = softmax_fc(x)
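One caveat about the plain NumPy version: np.exp overflows for large inputs. A common, numerically stable variant subtracts the row-wise maximum first, which leaves the result mathematically unchanged:

```python
import numpy as np

def softmax_stable(x):
    x_shift = x - np.max(x, axis=1, keepdims=True)  # shift does not change the result
    x_exp = np.exp(x_shift)                          # now safe from overflow
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)
```

For example, softmax_stable works on inputs like [[1000.0, 1000.0]] where the naive version returns NaN.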
Binary Step code implementation
import numpy as np
import matplotlib.pyplot as plt

def binaryStep(x):
    '''Returns 0 if the input is less than zero, otherwise returns 1.'''
    return np.heaviside(x, 1)

x = np.linspace(-10, 10)
plt.plot(x, binaryStep(x))
plt.axis('tight')
plt.title('Activation Function: binaryStep')
plt.show()
Maxout code implementation: used in competitions
import tensorflow as tf  # TensorFlow 1.x style code

x = tf.random_normal([5, 3])
m = 4  # number of output units
k = 3  # number of affine pieces per unit
d = 3  # input dimension
W = tf.Variable(tf.random_normal(shape=[d, m, k]))  # 3 x 4 x 3
b = tf.Variable(tf.random_normal(shape=[m, k]))     # 4 x 3
dot_z = tf.tensordot(x, W, axes=1) + b  # 5 x 4 x 3
print(dot_z)
z = tf.reduce_max(dot_z, axis=2)  # 5 x 4
print(z)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([x, dot_z, z]))
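The session-based code above requires TensorFlow 1.x. The same Maxout computation can be sketched framework-free in NumPy (shapes follow the example above; the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 5, 3, 4, 3                  # batch size, input dim, output units, pieces
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))

dot_z = np.tensordot(x, W, axes=1) + b   # (n, m, k): k affine pieces per unit
z = dot_z.max(axis=2)                    # (n, m): max over the k pieces
```

Each output unit is thus the maximum of k learned linear functions of the input.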
Mish code implementation: a newer activation function, reported to outperform ReLU and Swish; it is built from TanH and Softplus
Here’s how it works: Take a look if you’re interested
from __future__ import absolute_import, division, print_function
import keras
from keras import backend as K
from keras.engine.base_layer import Layer
from keras.layers import Dense

class Mish(Layer):
    '''Mish Activation Function.

    .. math::
        mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))

    Shape:
        - Input: Arbitrary. Use the keyword argument `input_shape`
          (tuple of integers, does not include the samples axis)
          when using this layer as the first layer in a model.
        - Output: Same shape as the input.

    Examples:
        >>> X_input = Input(input_shape)
        >>> X = Mish()(X_input)
    '''

    def __init__(self, **kwargs):
        super(Mish, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.tanh(K.softplus(inputs))

    def get_config(self):
        base_config = super(Mish, self).get_config()
        return dict(list(base_config.items()))

    def compute_output_shape(self, input_shape):
        return input_shape

# Functional form, usable directly as a layer activation
def mish(x):
    return keras.layers.Lambda(lambda x: x * K.tanh(K.softplus(x)))(x)

###### Use in your model ##########
model.add(Dense(128, activation=mish))
Cheat sheet: a minimal NumPy version
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softplus(x):
    return np.log(1 + np.exp(x))

def mish(x):
    return x * tanh(softplus(x))
Conclusion
By reviewing and summarizing the meaning, types, mathematical properties, and scope of use of activation functions, we can understand them well and know how to choose one when building a model. I have seen this question asked in the learning exchange group, and after reading many articles online I think it is a topic worth summarizing and studying. Digging into the principles also tells us why the choice matters. We should not be mere code porters; we should think for ourselves and aim higher, because a broader vision leaves more room to grow. Writing this article also involved many formulas, which greatly improved my LaTeX skills. I find that tools become valuable only when I actually use them, and only then can I grow faster.
Welcome to exchange and study!
References:
1. Activation function, Wikipedia: en.wikipedia.org/wiki/Activa…
2. Nonlinear activation functions: zhuanlan.zhihu.com/p/260970955
3. Common activation functions: cloud.tencent.com/developer/a…
4. Mish as Neural Networks Activation Function: sefiks.com/2019/10/28/…
5. Understand Maxout Activation Function in Deep Learning – Deep Learning Tutorial: www.tutorialexample.com/understand…
Recommended reading
 Differential operator method
 Using PyTorch to build a neural network model for handwriting recognition
 Building a neural network model and backpropagation computation with PyTorch
 How to optimize model parameters and ensemble models
 TorchVision object detection fine-tuning tutorial