The problem background

Suppose you need to determine whether a given flower is an iris. Flowers of different species grow differently, so we can describe a flower by a number of outward features (sepal length, sepal width, petal length, petal width, etc.). Based on this idea, we collect N flowers and label each one, obtaining a data set of N feature-label pairs.

Consider the simplest case: Y (whether the flower is an iris) is linearly correlated with the features X, with W defined as the correlation coefficients (weights). The model f can then be expressed by the following formula:

$$ f(x) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b $$

Reduced to vectorized form (absorbing the bias b into W by appending a constant feature x_0 = 1):

$$ f(X) = W^T X $$

This is linear regression.

Now the question arises: yes and no are two discrete states, which in computing we usually express with a 1/0 switch. But looking at the range of the formula above, f(X) can take any real value and cannot directly express a 0/1 switch. How do we solve this? By passing the output through a transformation function (also called an activation function), linear regression is converted into logistic regression.
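
To make this concrete, here is a minimal sketch (the feature values, weights, and bias are made up for illustration) showing that a linear model emits an arbitrary real number rather than a 0/1 label:

```python
import numpy as np

# Four made-up iris features: sepal length, sepal width, petal length, petal width
x = np.array([5.1, 3.5, 1.4, 0.2])

# Hypothetical weights and bias -- not fitted values, purely illustrative
W = np.array([0.8, -1.2, 2.0, 0.5])
b = -1.0

score = W @ x + b
print(score)  # ~1.78 -- an unbounded real number, not a 0/1 switch
```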

Modeling ideas

Not every function can serve as an activation function; a good activation function has the following properties:

Nonlinear. A linear function of a linear function is still a linear function, so a linear activation function brings no nonlinear transformation: it cannot increase the expressive power of the model, and therefore cannot fit complex real-world problems. The activation function must be nonlinear (a numerical sketch of this point follows the list).

Continuously differentiable. If the function is not differentiable, there is no way to iterate via gradient descent toward an approximately optimal solution. A non-differentiable activation function would require other, more complicated mathematical tools: on one hand a solution may not exist, and on the other the computational cost may be too high to be practical.

Monotonic. The underlying linear function is itself monotonic, and this monotonicity carries real meaning (a larger raw score means stronger evidence for the positive class), so the activation-function transformation should preserve it rather than change it.
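
A minimal numerical sketch of the first property, as promised above (the coefficients are arbitrary): composing two linear functions yields another linear function, which is why linear activations add no expressive power.

```python
import numpy as np

# Two arbitrary linear functions f(x) = a1*x + b1 and g(x) = a2*x + b2
a1, b1 = 2.0, 1.0
a2, b2 = -0.5, 3.0

f = lambda x: a1 * x + b1
g = lambda x: a2 * x + b2

# Their composition g(f(x)) = (a2*a1)*x + (a2*b1 + b2) is itself linear
x = np.linspace(-5, 5, 11)
print(np.allclose(g(f(x)), (a2 * a1) * x + (a2 * b1 + b2)))  # True
```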

Direct conversion

One option is a piecewise function that maps f(x) directly to 0 or 1 by thresholding at some value k, as shown in the formula:

$$ y = \begin{cases} 1, & f(x) \ge k \\ 0, & f(x) < k \end{cases} $$

However, this piecewise function is discontinuous and non-differentiable at the threshold, provides zero gradient everywhere else, and takes an additional parameter k, so it is not suitable as an activation function.
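
A minimal sketch of such a step function (the threshold k = 0 is arbitrary) makes the problem visible: the output jumps abruptly at the threshold, and away from it the gradient is zero, so a gradient-based optimizer gets no signal.

```python
import numpy as np

def step(z, k=0.0):
    """Map a raw score directly to 0/1 by thresholding at k."""
    return np.where(z >= k, 1.0, 0.0)

z = np.array([-2.0, -0.001, 0.0, 0.001, 2.0])
print(step(z))  # [0. 0. 1. 1. 1.] -- a discontinuous jump at k
# The derivative is 0 everywhere except at the jump, where it is undefined,
# so gradient descent cannot use this function.
```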

Indirect mapping

Instead of mapping directly to 0 or 1, we can compress the range of f(x) into the interval (0, 1), as shown in the formula:

$$ g(z) = \frac{1}{1 + e^{-z}}, \qquad z = f(X) = W^T X $$

This is the sigmoid function. Its graph is an S-shaped curve that approaches 0 for large negative inputs, approaches 1 for large positive inputs, and passes through 0.5 at z = 0.



This function clearly has all three of the desirable activation-function properties listed above. At the same time, because the output is compressed into the interval (0, 1), there is a very intuitive reading: the output value can be treated as a probability, here the probability that the flower is an iris. When this probability is greater than 0.5 we predict iris (1); otherwise we predict not iris (0). This achieves discriminative classification.
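
A minimal sigmoid sketch (the piecewise evaluation is a standard numerical-stability trick, not part of the derivation), including the 0.5 threshold used for the final class decision:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + exp(-z))."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = sigmoid(z)
print(p)                      # all outputs squashed into (0, 1)
print((p > 0.5).astype(int))  # [0 0 0 1 1] -- the 0/1 class decision
```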

Implementation logic

Now that we have the sigmoid activation function, how do we use it to train the model? Training relies on two tools: the loss function and gradient descent. The former quantifies the error between the model's predictions and the true results, determining the objective function to be optimized; the latter tells us how to reduce that error by concretely optimizing that objective.

Loss function

The output of the sigmoid activation function can be viewed as a probability; specifically, as the probability that the predicted result is "yes":

$$ \hat{y} = P(y = 1 \mid x) = g(W^T x) $$

The predicted classification result is either yes or no; with only these two cases, the label of a sample X obviously follows a Bernoulli (0-1) distribution. When the true label y is 1, the model assigns it probability ŷ (the sigmoid output, the probability of predicting "yes"); since 0 is the mutually exclusive complement of 1, a true label of 0 gets probability 1 − ŷ (the probability of predicting "no"). The conditional probability P(y | x) therefore quantifies the accuracy of the model's prediction:

$$ P(y = 1 \mid x) = \hat{y}, \qquad P(y = 0 \mid x) = 1 - \hat{y} $$

Merging and simplifying the two cases into a unified form:

$$ P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y} $$

P(y | x) measures the model's prediction on a single sample: the closer its value is to 1, the better the prediction. The data set has N samples, each independent of the others, so the quality of the model on the whole data set can be defined as the likelihood:

$$ L(W) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i} $$

Obviously, to make the model perform best, we need to find the optimal parameter W* that maximizes L(W). This is maximum likelihood estimation (MLE), and with it we have found our loss function:

$$ W^* = \arg\max_W L(W) $$
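
A minimal numerical sketch of these steps (labels and predicted probabilities are made up): the unified form reduces to ŷ where y = 1 and to 1 − ŷ where y = 0, the likelihood is the product over samples, and MLE looks for the W that makes this product as large as possible.

```python
import numpy as np

# Made-up true labels and model-predicted probabilities for 4 samples
y = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])

# Unified per-sample form: P(y|x) = y_pred^y * (1 - y_pred)^(1 - y)
per_sample = y_pred**y * (1 - y_pred)**(1 - y)
print(per_sample)        # [0.9 0.8 0.7 0.6] -- y_pred where y=1, 1-y_pred where y=0

# Likelihood of the whole (independent) data set: the product over samples
print(per_sample.prod())  # 0.3024 -- MLE seeks the W that maximizes this
```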



Next, we transform the loss function by taking the logarithm (which preserves monotonicity and turns the complex product into a simple sum) and adding a negative sign (which turns the maximization into a minimization that gradient descent can solve). The result is the cross-entropy loss:

$$ J(W) = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] $$
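
A minimal sketch of this loss (same made-up labels and probabilities as above; the eps clipping is a standard safeguard against log(0), not part of the derivation):

```python
import numpy as np

def cross_entropy(y, y_pred, eps=1e-12):
    """Negative log-likelihood of 0/1 labels under predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

y = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(cross_entropy(y, y_pred))  # ~1.196, exactly -log(0.3024) from the likelihood above
```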

Gradient descent

After the objective function is determined, the parameter W can be updated iteratively using gradient descent, moving it toward an extremum of the objective (η is the learning rate):

$$ W_{t+1} = W_t - \eta \, \frac{\partial J}{\partial W} \Big|_{W = W_t} $$

Gradient derivation: applying the chain rule, and using the convenient identity g'(z) = g(z)(1 − g(z)) for the sigmoid, the gradient of the cross-entropy loss collapses to a remarkably simple form:

$$ \frac{\partial J}{\partial W} = \sum_{i=1}^{N} (\hat{y}_i - y_i) \, x_i $$

Substituting this into the update rule above gives the concrete iteration:

$$ W_{t+1} = W_t - \eta \sum_{i=1}^{N} (\hat{y}_i - y_i) \, x_i $$
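
As a sanity check on this derivation, here is a minimal sketch (all data is randomly generated, purely illustrative) comparing the analytic gradient against a finite-difference approximation of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ W), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))              # 8 made-up samples, 3 features
y = rng.integers(0, 2, size=8)           # made-up 0/1 labels
W = rng.normal(size=3)

analytic = X.T @ (sigmoid(X @ W) - y)    # sum_i (y_pred_i - y_i) * x_i

numeric = np.zeros_like(W)
h = 1e-6
for j in range(len(W)):                  # central differences, one coordinate at a time
    dW = np.zeros_like(W)
    dW[j] = h
    numeric[j] = (loss(W + dW, X, y) - loss(W - dW, X, y)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-4))  # True -- the derivation checks out
```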



With a suitable learning rate, each update reduces the loss, so the model's prediction error at step t+1 is smaller than at step t. Through such iterations the model keeps learning and adjusting until it reaches a (local) optimum where the partial derivatives are 0; at that point the parameters can no longer be improved and training stops.
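
Putting the pieces together, here is a minimal end-to-end sketch of training by gradient descent on a tiny made-up data set (feature values, learning rate, and iteration count are illustrative assumptions, not tuned values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny made-up data set: 6 samples, 2 petal features, labels 0/1
X = np.array([[1.4, 0.2], [1.3, 0.2], [1.5, 0.3],    # "not iris" cluster
              [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]])   # "iris" cluster
y = np.array([0, 0, 0, 1, 1, 1])

# Append a constant feature so the bias b is absorbed into W
Xb = np.hstack([X, np.ones((len(X), 1))])

W = np.zeros(Xb.shape[1])
eta = 0.01                                # learning rate (illustrative)

for t in range(2000):
    y_pred = sigmoid(Xb @ W)              # forward pass: predicted probabilities
    grad = Xb.T @ (y_pred - y)            # gradient of the cross-entropy loss
    W -= eta * grad                       # gradient-descent update

print((sigmoid(Xb @ W) > 0.5).astype(int))  # [0 0 0 1 1 1] -- matches the labels
```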

Conclusion

Logistic regression is the simplest and most basic model framework and paradigm in machine learning; it is no exaggeration to say that it is one of the cornerstones of the field. Many later machine learning models take this basic framework as their starting point and propose extensions and improvements of various forms. A deep understanding of the logistic regression model, and of the modeling ideas, cause and effect, and implementation logic behind it, helps us build a more comprehensive and clear picture of the methodology of machine learning.