### Preface

The computer-vision tutorial is finished for now; I expect to resume posting study notes and tutorials on CV project development after mid-August. The follower count has now passed 100. Thanks to my family and friends for their support, and to the machine-learning enthusiasts for their encouragement; I will keep to my learning route and continue sharing with you. Yesterday a friend in the study group asked a question about activation functions, so I have collected some related material and organized it here to compare the strengths and shortcomings of the common activation functions.

### What is the activation function

This article covers the concept of the activation function, its mathematical forms, Python implementations, and a comparison of advantages and disadvantages. In an artificial neural network, the activation function of a node defines that node's output given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that are either "ON" (1) or "OFF" (0) depending on the input; this definition is similar to that of logistic regression. In other words, an activation function is a function added to a neural network to help it learn complex patterns in the data. Like the neurons in models of the human brain, the activation function ultimately determines what is transmitted to the next neuron: it is the mathematical equation that determines the output of the network.

When a neuron receives signals from other neurons or from the outside world, it applies a linear transformation to the input using weights and a bias. Because a linear equation is simple and limited in its ability to solve complex problems, an activation function is added to transform the input nonlinearly, allowing the network to learn and perform more complex tasks. The activation function is therefore very important, and so is choosing an appropriate one.
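To make this concrete, here is a small NumPy sketch (toy sizes and made-up random weights, purely for illustration) showing that two linear layers without an activation collapse into a single linear map, while inserting a ReLU breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# two "layers" of weights and biases (toy sizes)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# without an activation, two linear layers collapse into one linear layer
two_layer = W2 @ (W1 @ x + b1) + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, collapsed))  # True

# inserting a nonlinearity (ReLU) between the layers breaks this equivalence,
# which is exactly what lets deeper networks represent more complex functions
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x + b1) + b2
```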

### Activation function type

Common activation functions fall into three categories: ridge functions, radial functions, and the folding activation functions used in convolutional neural networks.

• Ridge function: a multivariate function acting on a linear combination of the input variables

• Linear function:

${\displaystyle \phi (\mathbf {v} )=a+\mathbf {v} '\mathbf {b} }$
• ReLU function:

${\displaystyle \phi (\mathbf {v} )=\max(0,a+\mathbf {v} '\mathbf {b} )}$
• The Heaviside function:

${\displaystyle \phi (\mathbf {v} )=1_{a+\mathbf {v} '\mathbf {b} >0}}$
• The Logistic function:

${\displaystyle \phi (\mathbf {v} )=(1+\exp(-a-\mathbf {v} '\mathbf {b} ))^{-1}}$
• Radial activation function: operates on the distance between points in Euclidean space; radial basis functions work well as universal function approximators.

• Gaussian function:

${\displaystyle \phi (\mathbf {v} )=\exp \left(-{\frac {\|\mathbf {v} -\mathbf {c} \|^{2}}{2a^{2}}}\right)}$
• Multiquadratic function:

${\displaystyle \,\phi (\mathbf {v} )={\sqrt {\|\mathbf {v} -\mathbf {c} \|^{2}+a^{2}}}}$

Note: c is the vector marking the center of the function, and a is a parameter affecting the spread of the radius

• Folding activation function: widely used in the pooling layers of convolutional neural networks and in the output layers of multi-class networks. These functions aggregate over their inputs, for example taking the mean, minimum, or maximum. Softmax, applied over multiple classes, is a common example.
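As a quick illustration of the ridge functions listed above (with made-up values for a, v, and b), all four evaluate the same linear combination a + v'b and differ only in the function applied to it:

```python
import numpy as np

# toy input and parameters, chosen only for illustration
v = np.array([1.0, -2.0, 0.5])
b = np.array([0.4, 0.3, -0.2])
a = 0.1

z = a + v @ b  # the shared linear combination a + v'b

linear    = z                          # identity / linear ridge function
relu      = max(0.0, z)                # ReLU
heaviside = 1.0 if z > 0 else 0.0      # Heaviside step
logistic  = 1.0 / (1.0 + np.exp(-z))   # logistic (sigmoid)
```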

### Activation function mathematical properties

Each activation function has its own characteristics, and depending on those characteristics it may suit a particular model and produce better results. Beyond their mathematical structure, activation functions differ in several mathematical properties:

• Nonlinearity: when the activation function is nonlinear, a two-layer neural network can be shown to be a universal function approximator. Without nonlinear activations, a network of many layers is equivalent to a single-layer linear model. Nonlinearity lets the network approximate arbitrary nonlinear functions, which is what makes it applicable to so many nonlinear problems.
• Range: the output range of the activation function can be finite or infinite. With a finite range, gradient-based optimization tends to be more stable, because only a limited set of weights significantly affects the feature representation. With an infinite range, training is often more efficient, but a smaller learning rate is generally required.
• Continuous differentiability: this property is not strictly required (the non-differentiability of ReLU at isolated points has little effect in practice), but it guarantees that gradients can be computed during optimization.
• Non-saturation: saturation means the gradient approaches 0 over some interval (the gradient vanishes), so the parameters can no longer update. The derivative of Sigmoid approaches 0 toward both plus and minus infinity, and the step function has zero gradient almost everywhere, so it cannot be used as an activation function. Several improved activation functions have been proposed to address this problem.
• Monotonicity: the sign of the derivative does not change. Most activation functions have this property. Monotonicity makes the direction of the gradient less likely to flip, so training converges more readily and effectively.
• Approximate identity near the origin: this property makes the network more stable, and only a few activation functions have it: the derivative of Tanh is 1 only near the origin, and ReLU is linear only for x > 0. The residual connections in ResNet and the gating structure in LSTM are designed around this idea.
• Few parameters: fewer parameters keep the network smaller.
• Normalization: the main idea is to automatically normalize the sample distribution toward zero mean and unit variance, which stabilizes training.

No single one of these properties decisively determines a model's performance, and no one property is useful in isolation, but together they help us select an appropriate activation function when building a model.
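The saturation property above is easy to verify numerically. A minimal sketch: the sigmoid's derivative peaks at 0.25 at the origin and collapses toward zero for large |x|, which is exactly the vanishing-gradient behaviour described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the maximum
print(sigmoid_grad(10.0))   # ~4.5e-05: saturated, the gradient has all but vanished
```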

### Activation function comparison

Next comes the most critical part: the comparison of activation functions. Having covered their types and properties, we can now look at which activation functions are commonly used and what characterizes each of them. The original post collects the common activation functions in a set of charts, showing each function's curve, mathematical form, range, differentiable interval, and continuity, and does the same for the folding functions.

(Figure: comparison charts of the common activation functions and of the folding functions.)

### How to select the appropriate activation function


By understanding these functions and analyzing their properties, we can summarize how to choose the right activation function; depending on the nature of the problem, we can make better choices when building the model. Based on the experience reported in several articles, the selection rules are as follows (for reference only):

• First consider the common activation functions: Sigmoid, TanH, ReLU, Leaky ReLU, ELU, SoftPlus, Binary Step, Maxout, and Mish
• As a classifier output, the Sigmoid function and its combinations usually work well
• To avoid the vanishing-gradient problem, avoid Sigmoid and TanH in deep hidden layers
• Try ReLU first; it is the fastest. Observe the model's performance, and if the results are poor, try Leaky ReLU or Maxout
• ReLU should only be used in hidden layers
• In a CNN with few layers, the choice of activation function has little influence
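Putting the rules above together, here is a minimal NumPy forward pass (toy shapes and random weights, purely illustrative) that uses ReLU in the hidden layer and sigmoid only at the binary-classifier output:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy binary classifier: ReLU in the hidden layer, sigmoid at the output
x = rng.normal(size=(5, 8))                     # batch of 5 samples, 8 features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

h = relu(x @ W1 + b1)      # hidden layer: ReLU, fast and non-saturating for z > 0
p = sigmoid(h @ W2 + b2)   # output layer: sigmoid squashes scores into (0, 1)
```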

### Code implementation

After completing the theoretical groundwork, the next step is to build the wheel ourselves in practice. I suggest saving these code snippets for a rainy day.

Sigmoid implementation: suitable for binary classification; for multi-class problems the results are mediocre. Watch out for the vanishing-gradient problem.

```python
import numpy as np
import tensorflow as tf
import torch

def sigmoid(x):
    s = 1 / (1 + np.exp(-x))
    return s

# TensorFlow 2.x version (x is an input tensor)
sigmoid_fc = tf.keras.activations.sigmoid(x)

# PyTorch version
sigmoid_fc = torch.nn.Sigmoid()
output = sigmoid_fc(x)
```

TanH implementation: note the vanishing-gradient problem.

```python
import numpy as np
import tensorflow as tf
import torch

def tanh(x):
    s1 = np.exp(x) - np.exp(-x)
    s2 = np.exp(x) + np.exp(-x)
    return s1 / s2

# TensorFlow 2.x version (x is an input tensor)
tanh_fc = tf.keras.activations.tanh(x)

# PyTorch version
tanh_fc = torch.nn.Tanh()
output = tanh_fc(x)
```

ReLU implementation: the most commonly used; use it only in hidden layers.

```python
import numpy as np
import tensorflow as tf
import torch

def relu(x):
    s = np.where(x < 0, 0, x)
    return s

# TensorFlow 2.x version (x is an input tensor)
relu_fc = tf.keras.activations.relu(x)

# PyTorch version
relu_fc = torch.nn.ReLU()
output = relu_fc(x)
```

Leaky ReLU implementation: use when the network contains a large number of dead (inactive) neurons.

```python
import numpy as np
import tensorflow as tf
import torch

def lrelu(x, alpha=0.01):
    s = np.where(x >= 0, x, alpha * x)
    return s

# TensorFlow 2.x version (the size of alpha must be specified)
lrelu_fc = tf.keras.activations.relu(x, alpha=0.01)

# PyTorch version
lrelu_fc = torch.nn.LeakyReLU(0.01)
output = lrelu_fc(x)
```

ELU implementation

```python
import numpy as np
import tensorflow as tf
import torch

def elu(x, alpha=0.1):
    s = np.where(x >= 0, x, alpha * (np.exp(x) - 1))
    return s

# TensorFlow 2.x version (the size of alpha must be specified)
elu_fc = tf.keras.activations.elu(x, alpha=0.1)

# PyTorch version
elu_fc = torch.nn.ELU(0.1)
output = elu_fc(x)
```

Softmax implementation

```python
import numpy as np
import tensorflow as tf
import torch

def softmax(x):
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    s = x_exp / x_sum
    return s

# TensorFlow 2.x version (x is an input tensor)
softmax_fc = tf.keras.activations.softmax(x)

# PyTorch version
softmax_fc = torch.nn.Softmax(dim=1)
output = softmax_fc(x)
```
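One practical note: a naive softmax overflows for large inputs, since np.exp(1000) is inf. The standard fix, sketched here, subtracts the row maximum before exponentiating, which leaves the result mathematically unchanged:

```python
import numpy as np

def softmax_stable(x):
    # subtracting the row max before exp avoids overflow for large inputs;
    # softmax is invariant to a constant shift, so the result is identical
    x_shift = x - np.max(x, axis=1, keepdims=True)
    x_exp = np.exp(x_shift)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

big = np.array([[1000.0, 1001.0, 1002.0]])
print(softmax_stable(big))  # finite probabilities; the naive version would overflow
```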

Binary step implementation

```python
import numpy as np
import matplotlib.pyplot as plt

def binaryStep(x):
    '''Returns 0 if the input is less than zero, otherwise 1.'''
    return np.heaviside(x, 1)

x = np.linspace(-10, 10)
plt.plot(x, binaryStep(x))
plt.axis('tight')
plt.title('Activation Function: binaryStep')
plt.show()
```

Maxout implementation: often used in competitions. Note that this snippet uses the TensorFlow 1.x API (tf.Session).

```python
import tensorflow as tf

x = tf.random_normal([5, 3])
m = 4
k = 3
d = 3

W = tf.Variable(tf.random_normal(shape=[d, m, k]))  # 3 x 4 x 3
b = tf.Variable(tf.random_normal(shape=[m, k]))     # 4 x 3
dot_z = tf.tensordot(x, W, axes=1) + b              # 5 x 4 x 3
print(dot_z)
z = tf.reduce_max(dot_z, axis=2)                    # 5 x 4
print(z)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([x, dot_z, z]))
```
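For readers without a TensorFlow 1.x environment, the same computation can be sketched in plain NumPy (toy shapes chosen for illustration): each of the m output units has k candidate linear responses, and maxout keeps the largest one.

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout(x, W, b):
    # W: (d, m, k), b: (m, k) — k linear pieces per output unit;
    # maxout returns the elementwise maximum over the k pieces
    z = np.tensordot(x, W, axes=1) + b   # (batch, m, k)
    return z.max(axis=2)                 # (batch, m)

x = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 4, 2))
b = rng.normal(size=(4, 2))
out = maxout(x, W, b)
print(out.shape)  # (5, 4)
```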

Mish implementation: a newer activation function, reported to outperform ReLU and Swish; it combines TanH and Softplus, as mish(x) = x · tanh(softplus(x)).

Here is how it works, if you are interested:

```python
from __future__ import absolute_import, division, print_function

import numpy as np
import keras
from keras import backend as K
from keras.engine.base_layer import Layer

class Mish(Layer):
    '''Mish activation function.

    .. math:: mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))

    Shape:
        - Input: arbitrary. Use the keyword argument input_shape
          (tuple of integers, does not include the samples axis)
          when using this layer as the first layer in a model.
        - Output: same shape as the input.

    Example:
        >>> X_input = Input(input_shape)
        >>> X = Mish()(X_input)
    '''

    def __init__(self, **kwargs):
        super(Mish, self).__init__(**kwargs)

    def call(self, inputs):
        return inputs * K.tanh(K.softplus(inputs))

    def get_config(self):
        base_config = super(Mish, self).get_config()
        return dict(base_config.items())

    def compute_output_shape(self, input_shape):
        return input_shape

def mish(x):
    return keras.layers.Lambda(lambda x: x * K.tanh(K.softplus(x)))(x)

###### Use in your model ##########
```

Cheat sheet: a small standalone version

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softplus(x):
    return np.log(1 + np.exp(x))

def mish(x):
    return x * tanh(softplus(x))
```
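As a sanity check, a self-contained sketch (using log1p for softplus) reproduces the key behaviour of Mish: exactly zero at the origin, near-identity for large positive inputs, and a small negative response for negative inputs.

```python
import numpy as np

def mish(x):
    # mish(x) = x * tanh(softplus(x)); np.log1p(np.exp(x)) is softplus
    return x * np.tanh(np.log1p(np.exp(x)))

print(mish(0.0))        # 0.0
print(mish(5.0) / 5.0)  # ~0.9999: mish approaches the identity for large x
print(mish(-5.0))       # a small negative value: mish lets a little gradient through
```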

### Conclusion

By reviewing the meaning, types, mathematical properties, and applicable scope of activation functions, we now have a good understanding of what they are and how to choose one when building a model. I have seen this question asked in the study group and read many articles online, and I think it is a topic worth summarizing. Digging into the principles also explains why the choice matters. We should not be mere code porters; we should think for ourselves and aim higher, because a broader vision leaves more room for development. Writing out so many formulas for this article also greatly improved my LaTeX skills, and I find that only by actually using things do they become valuable, and only then do I grow faster.

Welcome to exchange and study!

### References:

1. Activation function, Wikipedia: en.wikipedia.org/wiki/Activa…

2. Nonlinear activation functions: zhuanlan.zhihu.com/p/260970955

3. Common activation functions: cloud.tencent.com/developer/a…

4. Mish as Neural Networks Activation Function: sefiks.com/2019/10/28/…

5. Understand Maxout Activation Function in Deep Learning – Deep Learning Tutorial: www.tutorialexample.com/understand-…