The original reference: zhuanlan.zhihu.com/FaceRec

0-1 Loss (binary classification), Cross Entropy Loss (classification), Softmax Loss (multi-class classification), Hinge Loss (SVM), Mean Squared Error (linear regression), Modified Huber Loss (classification), Exponential Loss (AdaBoost)

1 0-1 Loss

  • Generally speaking, a binary classification model consists of two parts: a linear output, typically s = wx, and a non-linear mapping such as the Sigmoid function g(s).

After the Sigmoid function, g(s) is limited to [0, 1]. If s ≥ 0, then g(s) ≥ 0.5 and the prediction is positive; if s < 0, then g(s) < 0.5 and the prediction is negative.

  • There are two common ways to represent the positive and negative classes: {+1, -1} or {1, 0}. When y ∈ {+1, -1}, if ys ≥ 0 the prediction is correct, and if ys < 0 the prediction is wrong. The sign and magnitude of ys reflect the correctness and confidence of the prediction. The four cases of predicted versus true category are:
  • s ≥ 0, y = +1: prediction is correct
  • s ≥ 0, y = -1: prediction is wrong
  • s < 0, y = +1: prediction is wrong
  • s < 0, y = -1: prediction is correct
  • 0-1 Loss is the simplest loss function. For a binary classification problem, if the predicted category y^ differs from the true category y, then L = 1; if y^ = y, then L = 0. Equivalently, with y ∈ {+1, -1}, L = 1 when ys < 0 and L = 0 when ys ≥ 0.
  • With L as the loss and ys on the horizontal axis, the loss curve is a step function that drops from 1 to 0 at ys = 0 (a code sketch appears at the end of this section).

  • Advantages: very intuitive and easy to understand.
  • Disadvantages: the raw 0-1 Loss is rarely used in practice.
  • It imposes the same penalty (a loss of 1) on every misclassified point, so it cannot give a larger penalty to points with larger errors (ys far less than 0), which is unreasonable.
  • It is discontinuous, non-convex, and non-differentiable, so gradient-based optimization algorithms are difficult to apply.
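
A minimal NumPy sketch of the 0-1 loss in terms of the margin ys (function and variable names are illustrative):

```python
import numpy as np

def zero_one_loss(y, s):
    """0-1 loss with labels y in {+1, -1} and linear scores s = wx.

    Returns 1 for each misclassified sample (ys < 0) and 0 otherwise.
    """
    ys = y * s
    return np.where(ys < 0, 1.0, 0.0)

# Example: two correct predictions and one mistake
y = np.array([+1, -1, +1])
s = np.array([2.0, -0.5, -1.0])
print(zero_one_loss(y, s))  # [0. 0. 1.]
```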

2 Cross Entropy Loss

  • Used for logistic regression and Softmax classification, in combination with the Sigmoid (binary classification) or Softmax (multi-class) function. Note that logistic regression is a classification problem despite its name.
  • Cross entropy can be derived from the perspective of Shannon's information theory:
  • Information content: I(x) = −log(p(x)). The greater the probability of an event, the smaller its information content.
  • Entropy: for a random variable X, the expectation E[I(X)] of the information content over all possible values is called entropy, H(X) = −Σ p(x)·log(p(x)).

  • Expectation: for a discrete random variable, the sum of each possible value x_i multiplied by its probability p(x_i) is the expectation E(X). It generalizes the simple arithmetic mean and is essentially a weighted average.

  • Relative entropy: also known as KL divergence or KL distance, D(p||q) = Σ p(x)·log(p(x)/q(x)), is a measure of the difference between two distributions. It measures the inefficiency of assuming the distribution is q when the true distribution is p.
  • The goal of machine learning is for the estimated probability distribution q to be as close as possible to the true distribution p, i.e. for the relative entropy to approach its minimum value of 0.

  • Cross entropy: the latter part of the relative entropy formula is the cross entropy H(p, q), which also reflects how similar the distributions p and q are; the smaller the value, the more similar they are.
  • Since the true distribution p is fixed, the first part of the relative entropy formula is a constant, so minimizing the relative entropy also minimizes the cross entropy. Optimizing q is therefore equivalent to minimizing the cross entropy (see the decomposition below).
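
For reference, the standard decomposition behind the statements above, written in LaTeX (textbook material, not recovered from the original figures):

```latex
D_{KL}(p \| q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}
               = \underbrace{\sum_{x} p(x)\log p(x)}_{-H(p)\ \text{(constant)}}
                 \;+\; \underbrace{\Big(-\sum_{x} p(x)\log q(x)\Big)}_{H(p,q)\ \text{(cross entropy)}}
```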

  • Cross entropy loss of binary logistic regression: when y^ is produced by the Sigmoid function and the output labels are {0, 1}, the loss is L = −[y·log(y^) + (1 − y)·log(1 − y^)]. When y = 1 but y^ is not 1, the loss grows as y^ decreases.
  • For multiple categories, y^ is produced by Softmax and L = −y_i·log(y_i^) = −log(y_i^), where i is the true category (y_i = 1 in the one-hot label).

  • From the perspective of maximum likelihood, minimizing the cross entropy is equivalent to maximum likelihood estimation. With output labels in {0, 1}, the predicted class probability can be written as P(y|x) = (y^)^y · (1 − y^)^(1−y); taking the negative logarithm, the larger P(y|x) is, the smaller −log P(y|x) is (see the derivation below):
  • When the true label y = 1 the second term vanishes; when y = 0 the first term vanishes.
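
The maximum-likelihood derivation sketched above, in LaTeX (the standard form of the binary cross entropy, with y^ written as \hat{y}):

```latex
P(y \mid x) = \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y}, \qquad y \in \{0, 1\}

-\log P(y \mid x) = -\big[\, y\log\hat{y} + (1-y)\log(1-\hat{y}) \,\big] = L
```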

  • Loss curves when the output label y is represented by {1, 0}:
  • When y = 1, L = −log(y^); substituting the Sigmoid y^ = 1/(1 + e^(−s)) gives the loss curve. The larger the linear output s, the smaller L is (left figure, with s on the horizontal axis).
  • When y = 0, L = −log(1 − y^); the smaller s is, the smaller L is (right figure).

  • When the output label y is represented by {−1, +1}: the cross entropy loss is the same quantity, just written differently, with s replaced by ys, giving L = log(1 + e^(−ys)). Using ys as the abscissa makes the curve easy to draw and gives it a concrete physical meaning, the margin (a numerical check follows below).
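
A small NumPy check of the claim that the two label conventions give the same loss (illustrative code; y01 ∈ {0, 1} and ypm ∈ {−1, +1} label the same samples):

```python
import numpy as np

def ce_01(y01, s):
    """Cross entropy with labels in {0,1}: L = -[y*log(y^) + (1-y)*log(1-y^)]."""
    y_hat = 1.0 / (1.0 + np.exp(-s))
    return -(y01 * np.log(y_hat) + (1 - y01) * np.log(1 - y_hat))

def ce_pm(ypm, s):
    """Cross entropy with labels in {-1,+1}: L = log(1 + exp(-ys))."""
    return np.log(1.0 + np.exp(-ypm * s))

s = np.array([2.0, -0.7, 0.3])
print(ce_01(np.array([1, 0, 1]), s))     # same values...
print(ce_pm(np.array([+1, -1, +1]), s))  # ...as this margin form
```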

  • Advantages: it is one of the most widely used loss functions.
  • The loss is defined over the whole real line and grows approximately linearly when ys << 0: the larger the error, the larger the penalty.
  • Cross entropy loss is continuously differentiable, which makes gradient computation easy.
  • The model is relatively insensitive to outliers.

Reference: Derivation of the cross entropy loss function formula

3 Softmax Loss

  • It is mainly used for multi-class classification problems in neural networks.

Concepts to distinguish

  • Softmax regression: the generalization of logistic regression, which extends the logistic activation function to C categories. Softmax regression with C = 2 reduces to logistic regression.
  • Softmax classifier: a multi-class classifier. In a neural network without hidden layers, the decision boundary between any two classes is linear, but the input space can still be divided into multiple classes by multiple different linear functions (one per output unit). The deeper the network, the more complex the nonlinear decision boundaries it can learn.
  • Softmax layer: the last (output) layer of the neural network, with one unit per category. After computing the linear prediction z of each unit, the Softmax activation function converts the values into per-class probabilities.
  • In image classification backbones such as AlexNet/VGG/ResNet/MobileNet, the last layer of the deployed prediction model is “Softmax”, while the last layers during training and validation are “Accuracy” + “SoftmaxWithLoss” (Caffe layer names), because image classification only needs to make the deep features separable.
  • Softmax function (activation function): maps the score vector z to probabilities, σ_i(z) = e^(z_i) / Σ_j e^(z_j). It needs the whole vector z because it normalizes over all outputs, and its output is also a vector. It is usually used on the output layer.
  • In contrast, the Sigmoid and ReLU activation functions each take a single real number z and output a real number; Sigmoid is typically used on the output layer and ReLU in hidden layers.
  • Softmax-Loss (Softmax loss function): the loss function of a neural network with a Softmax output layer, i.e. the loss function of the Softmax classifier; the outputs of the Softmax activation function serve as the inputs of the classifier's loss function.
  • Hardmax: sets the largest element of z to 1 and all other positions to 0. Softmax's mapping from z to probabilities is gentler and produces actual probability values.

Softmax activation function, multi-category probability

  • As in logistic regression, its role is to convert the linear prediction into a class probability. The Softmax function is defined as σ_i(z) = e^(z_i) / Σ_j e^(z_j).
  • The top layer of a classical deep neural network is essentially a logistic regression classifier.

  • Let z_i = w_i·x + b_i be the linear prediction for the i-th category. Substituting into Softmax amounts to exponentiating each z_i so that it becomes non-negative (avoiding positive and negative terms cancelling in the sum), then dividing by the sum of all terms to normalize into per-class probabilities between 0 and 1. Each o_i = σ_i(z) can then be interpreted as the probability, or likelihood, that the sample x belongs to category i (with m categories, and one sample x having many features x_0, x_1, x_2, …). A code sketch follows after this list.
  • Note: in logistic regression, z is converted into a probability between 0 and 1 by the Sigmoid activation function, giving the probability that the sample x belongs to the positive class y = 1.
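
A minimal NumPy sketch of the Softmax activation described above (the function name is illustrative; subtracting the maximum is a standard numerical-stability trick not mentioned in the text and does not change the result):

```python
import numpy as np

def softmax(z):
    """Map a vector of linear scores z to per-class probabilities."""
    z = z - np.max(z)           # shift for numerical stability (result is unchanged)
    exp_z = np.exp(z)           # exponentiate so every term is positive
    return exp_z / exp_z.sum()  # normalize so the probabilities sum to 1

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))               # e.g. [0.659 0.242 0.099], sums to 1
```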

Multinomial Logistic Loss is the Loss function (cross entropy Loss) used by the Softmax classifier

  • The objective function (cost/loss function) of logistic regression is built from the maximum likelihood principle. Suppose the true category of sample x is y, and o_y is the probability of x belonging to the correct category y as computed by the Softmax function. Maximum likelihood maximizes o_y; the negative log-likelihood is usually used instead of the likelihood, i.e. −log(o_y) is minimized, and the two are mathematically equivalent. So minimizing the cost function means minimizing L = −log(o_y).

  • The loss commonly used for Softmax classification (as explained by Andrew Ng) is the cross entropy L = −Σ_j y_j·log(y_j^).
  • If the correct class is the second one, then y_2 = 1 and all other y_j are 0 (so the formula is the cross entropy loss function).
  • Minimizing the loss function, i.e. maximum likelihood, means minimizing −log(o_y), which is the same as maximizing o_y; here o_y is y_2^.

  • The loss over the whole training set, J, is the average of the sum of all per-sample losses; the expression above is the loss of a single training sample. Expanding this loss gives an algebraic expression in the parameters w, b of each unit of the network layers, and the gradient is computed with respect to them. Gradient descent needs the gradient of the loss with respect to the weights of all layers and all elements, and updates all the weights (a sketch follows below).
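
A minimal NumPy sketch of the averaged loss J described above (array names and shapes are illustrative):

```python
import numpy as np

def softmax_cross_entropy(Z, y):
    """Average cross entropy loss J over a batch.

    Z: (n, C) array of linear scores, y: (n,) array of true class indices.
    """
    Z = Z - Z.max(axis=1, keepdims=True)           # stability shift
    probs = np.exp(Z)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax, row by row
    n = Z.shape[0]
    return -np.log(probs[np.arange(n), y]).mean()  # average of -log(o_y)

Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 1.5,  0.3]])
y = np.array([0, 1])
print(softmax_cross_entropy(Z, y))                 # ~0.34
```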

Softmax-Loss

  • Softmax-Loss substitutes the Softmax activation function into the loss L above; it involves the score s_{y_i} of the true category and the scores s_j of all categories: L_i = −log(e^(s_{y_i}) / Σ_j e^(s_j)). This is the representation used in CS231n (R(W) is the regularization term, which penalizes W to reduce model complexity).

  • Softmax-Loss is the combination of the Softmax function and the Cross Entropy Loss function.
  • The Multinomial Logistic Loss and the Softmax activation function can be combined into a single Softmax-Loss layer or kept separate. Keeping them separate is more flexible but suffers from numerical instability and requires more computation (see the sketch at the end of this subsection).
  • Loss curve: when the true-class score s << 0, the loss is approximately linear; when s >> 0, the loss tends to zero.

  • Advantages: Softmax Loss is also relatively insensitive to outliers.
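
Regarding the numerical-stability point above: a sketch of a fused Softmax-Loss computed through the log-sum-exp identity, which avoids exponentiating large scores (illustrative code, not taken from the original article):

```python
import numpy as np

def softmax_loss_fused(z, y):
    """Numerically stable -log(softmax(z)[y]) for one sample.

    Uses -log(e^{z_y} / sum_j e^{z_j}) = log(sum_j e^{z_j - m}) + m - z_y,
    with m = max(z), so no large exponentials are ever formed.
    """
    m = np.max(z)
    return np.log(np.sum(np.exp(z - m))) + m - z[y]

z = np.array([1000.0, 10.0, -5.0])  # a naive exp(1000) would overflow
print(softmax_loss_fused(z, 0))     # ~0.0, computed without overflow
```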

Weighted Softmax Loss

  • Suppose we have a classification problem with only two categories, but the sample sizes of the two categories differ greatly. For example, in edge detection, edge pixels matter more than non-edge pixels, so the samples can be weighted accordingly.

  • w_c is the class weight, where c = 0 denotes edge pixels and c = 1 denotes non-edge pixels; we can set w_0 = 1 and w_1 = 0.001, i.e. relatively increase the weight of edge pixels (a sketch follows below). The weights can of course also be set adaptively.
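
A sketch of one common way to implement such class weighting (the normalization by the weight sum is one possible choice, not prescribed by the text):

```python
import numpy as np

def weighted_softmax_loss(Z, y, class_weights):
    """Class-weighted cross entropy: each sample is weighted by w_c of its true class.

    Z: (n, C) scores, y: (n,) true class indices, class_weights: (C,) weights.
    """
    Z = Z - Z.max(axis=1, keepdims=True)
    probs = np.exp(Z)
    probs /= probs.sum(axis=1, keepdims=True)
    n = Z.shape[0]
    per_sample = -np.log(probs[np.arange(n), y])  # ordinary per-sample loss
    w = class_weights[y]                          # weight of each sample's true class
    return np.sum(w * per_sample) / np.sum(w)     # weighted average (one possible normalization)

Z = np.array([[0.2, 1.0], [2.0, -1.0], [0.5, 0.5]])
y = np.array([0, 1, 0])                           # 0 = edge pixel, 1 = non-edge pixel
print(weighted_softmax_loss(Z, y, np.array([1.0, 0.001])))
```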

Reference: Softmax and SoftMax-Loss AI path

4 Hinge Loss

  • Hinge Loss is generally used in support vector machines (SVM, a large-margin classifier) and reflects SVM's idea of maximizing the margin.
  • Advantages: when the loss is greater than 0, it is a linear function, which makes the gradient descent derivation straightforward.

Hinge Loss for binary classification

  • Taking SVM's linear output model as an example, y^ = s = wx, the loss is L = max(0, 1 − y·y^), and ys = y·y^ is the horizontal axis of the loss curve (a code sketch follows after this list).
  • If y·y^ < 1, the loss is 1 − y·y^ > 0: the sample is misclassified or falls inside the margin.
  • When the output label y = −1, this corresponds to y^ > −1; when y = +1, to y^ < 1.
  • If y·y^ ≥ 1, the loss is 0 and the classification is correct with sufficient margin.
  • When the output label y = −1, this corresponds to y^ ≤ −1; when y = +1, to y^ ≥ 1; in both cases the loss equals 0.
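
A minimal NumPy sketch of the binary hinge loss just described:

```python
import numpy as np

def hinge_loss(y, s):
    """Binary hinge loss with labels y in {+1, -1} and scores s = wx.

    Zero when y*s >= 1 (correct with margin), linear in 1 - y*s otherwise.
    """
    return np.maximum(0.0, 1.0 - y * s)

y = np.array([+1, +1, -1])
s = np.array([2.0, 0.3, -0.5])
print(hinge_loss(y, s))   # [0.  0.7 0.5]
```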

Hinge Loss for maximum margin

  • Here the loss is written in terms of the margin between positive and negative sample scores: L = max(0, y′ − y + m).
  • When y > y′ + m, the loss is 0, i.e. the score of the positive sample exceeds the score of the negative sample by at least the margin.
  • y is the score of the positive sample, y′ is the score of the negative sample, and m is the required margin between the two.

  • The intent is for the positive sample score to be as high as possible and the negative sample score as low as possible, but a gap of at most m between the two scores is enough; widening the gap further brings no additional reward.
  • Representation in CS231n: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + Δ), where the s_j are class scores and Δ is the margin (a sketch follows below).
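
A sketch of the CS231n-style multi-class SVM loss for a single sample (delta plays the role of the margin m; the example numbers are illustrative):

```python
import numpy as np

def multiclass_hinge_loss(scores, y, delta=1.0):
    """CS231n-style multi-class SVM loss for one sample.

    scores: (C,) class scores, y: index of the true class, delta: margin.
    L = sum over j != y of max(0, scores[j] - scores[y] + delta)
    """
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0          # the true class does not contribute
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])
print(multiclass_hinge_loss(scores, y=0))  # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```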

5 Mean Squared Error (MSE)

  • Used for linear regression problems; it belongs to ordinary least squares (OLS).
  • The basic principle of least squares is that the best-fit line is the one that minimizes the sum of squared distances from each point to the regression line.
  • Using the mean squared error (MSE) as the measure: L = (1/n)·Σ_i (y_i − y_i^)² (a sketch follows below).
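
A one-function NumPy sketch of MSE as defined above:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between regression targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.4])
print(mse(y_true, y_pred))   # ~0.06
```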

  • Why not use cross entropy for linear regression? The cross entropy formula takes the log of the prediction, but regression targets can take arbitrary values such as −1.5, and log(−1.5) cannot be computed, so cross entropy is generally not used to optimize regression problems.
  • Why not use MSE for classification? Classification computes the probability of each label against a one-hot target and then uses argmax to decide the class, with Softmax usually producing the probabilities. The problem with computing the loss by MSE is that, composed with the Softmax output, the loss surface fluctuates and has many local extrema, i.e. a non-convex optimization problem; computing the loss with cross entropy remains a convex optimization problem that gradient descent can solve.

References: Variance and MSE; Why cross entropy is used for classification problems and MSE for regression problems

6 Modified Huber Loss

  • It is used by the SGDClassifier in scikit-learn (loss='modified_huber').
  • Huber Loss can also be applied to classification problems, where it is called Modified Huber Loss: L = max(0, 1 − ys)² when ys ≥ −1, and L = −4·ys when ys < −1.
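
A NumPy sketch of this piecewise form (matching scikit-learn's modified_huber definition as stated above):

```python
import numpy as np

def modified_huber_loss(y, s):
    """Modified Huber loss with labels y in {+1, -1} and scores s.

    Quadratic hinge max(0, 1 - ys)^2 for ys >= -1, linear -4*ys below that.
    """
    ys = y * s
    return np.where(ys >= -1.0, np.maximum(0.0, 1.0 - ys) ** 2, -4.0 * ys)

y = np.array([+1, +1, +1])
s = np.array([2.0, 0.5, -3.0])
print(modified_huber_loss(y, s))  # [0. 0.25 12.]
```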

  • Loss curve: Modified Huber Loss combines the advantages of MSE, Hinge Loss and cross entropy Loss.

  • Advantages: on the one hand, the loss is exactly zero when ys > 1, which yields sparse solutions and improves training efficiency; on the other hand, the penalty for samples with ys < −1 increases only linearly, which means less interference from outliers.

7 Exponential Loss

  • Used mostly in AdaBoost
  • Exponential loss: L = e^(−ys).

  • Exponential Loss is similar in shape to cross entropy Loss, but it changes exponentially, so its gradient is larger.
  • Disadvantages: it is more strongly affected by outliers.

Comparing the tolerance of each loss to outliers

  • Plot the five losses: the range of ys in the left figure is [−2, +2], and in the right figure it is [−5, +5] (a plotting sketch follows below):
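
A matplotlib sketch that reproduces this kind of comparison in terms of the margin ys; the exact set of curves in the original figure is not fully recoverable here, so the margin-based forms discussed above are used:

```python
import numpy as np
import matplotlib.pyplot as plt

ys = np.linspace(-2, 2, 400)

losses = {
    "0-1":            np.where(ys < 0, 1.0, 0.0),
    "Cross Entropy":  np.log(1.0 + np.exp(-ys)),
    "Hinge":          np.maximum(0.0, 1.0 - ys),
    "Modified Huber": np.where(ys >= -1, np.maximum(0.0, 1.0 - ys) ** 2, -4.0 * ys),
    "Exponential":    np.exp(-ys),
}

for name, L in losses.items():
    plt.plot(ys, L, label=name)
plt.xlabel("ys")
plt.ylabel("Loss")
plt.legend()
plt.show()
```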

  • Exponential Loss is much larger than the other losses when ys << 0. From the training perspective, if there are outliers in the sample, Exponential Loss assigns them a much higher penalty weight, which may sacrifice the prediction quality on normal data points and reduce the overall performance of the model.
  • Compared with Exponential Loss, the other four losses, including Softmax Loss, have better “tolerance” for outliers, are less disturbed by them, and yield more robust models.