Some basic concepts

LogisticRegression: although its name contains "regression", it is a classification model. Logistic regression wraps a sigmoid mapping around a linear function of the input, so its output is a nonlinear function of the features, but its essence is still a linear model: it only layers a sigmoid on top of a linear score, so in the space constrained by the sigmoid, the separating hyperplane of LR is linear. LR is therefore essentially a linear classification model.

Note that many bloggers describe logistic regression simply as a linear classification model, and "prove" the linearity by showing that its partition hyperplane is linear. I don't think that is quite right. The hyperplane that is linear for logistic regression separates not the original samples, but the samples after the sigmoid mapping; in the original sample space the partition surface is not necessarily linear. By the same logic, an SVM with a kernel function would also be a "linear model", because in the high-dimensional feature space the SVM separating hyperplane is always linear. So I can only say that logistic regression and SVM are linear models in nature, not flatly that they are linear models. (Refutations welcome.)

Why does LR use sigmoid?

The fundamental reason is the maximum-entropy property behind the sigmoid. Entropy measures the uncertainty contained in a probability distribution: the larger the entropy, the greater the uncertainty. The uniform distribution therefore has maximum entropy, because a new data point is equally likely to take any value. What we care about here is the maximum-entropy distribution given some assumptions: the distribution should be as uniform as possible while still satisfying those assumptions. For example, the well-known normal distribution is the maximum-entropy distribution for a given mean and variance. So what does logistic regression assume? First, we are modeling P(Y|X), and Y|X follows a Bernoulli distribution, so we only need P(Y=1|X). Second, we want a linear model, so P(Y=1|X) = f(wx), and all that remains is to determine f. The f derived from the maximum-entropy principle is the sigmoid: writing the Bernoulli distribution in exponential-family form and inverting its natural parameter gives 1 / (1 + e^{-z}).
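As a quick check of that last step, the sketch below (plain Python, function names my own) writes Bernoulli(p) in terms of its natural parameter, the log-odds, and inverts it with the sigmoid to recover p:

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

# Bernoulli(p) in exponential-family form: the natural parameter is
# the log-odds z = log(p / (1 - p)); inverting it recovers p = sigmoid(z).
p = 0.8
z = math.log(p / (1 - p))      # natural parameter (logit)
print(round(sigmoid(z), 6))    # recovers 0.8
```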

Advantages:

Unlike linear regression, which requires the variables to follow a normal distribution, logistic regression makes no assumption about the distribution of the variables.

  1. It is suitable for scenarios where a classification probability is needed: its output can not only be used for classification, but also represents the probability that a sample belongs to a certain class.
  2. Computational cost is low, and it is easy to understand and implement. LR is quite efficient in both time and memory; it can be applied to distributed data, and online algorithms exist that process large data sets with few resources.
  3. LR is robust to small noise in the data and is not particularly affected by slight multicollinearity. (Severe multicollinearity can be handled by combining logistic regression with L2 regularization, though L2 regularization is not the best choice for a reduced model because it keeps all features.)
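To illustrate the probability output (point 1) and L2 regularization (point 3), here is a minimal sketch assuming scikit-learn is installed; the toy data is made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 1 for larger x (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# L2 regularization is the default penalty; C is the inverse strength.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# predict_proba returns class probabilities, not just hard labels.
proba = clf.predict_proba([[1.5]])[0]
print(clf.predict([[1.5]]), proba)
```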

Disadvantages:

  1. Prone to underfitting, so classification accuracy may not be high.
  2. Performance is poor when data features are missing or the feature space is very large.

Deriving the formulas

  1. Determine the classification decision function.

Linear binary classification model:

$$z = w^\top x + b$$
The logistic regression decision function nests this linear score inside a sigmoid function:

$$h_w(x) = \sigma(z) = \frac{1}{1 + e^{-(w^\top x + b)}}$$
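A minimal sketch of this decision function in plain Python (the helper names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y=1 | x) = sigmoid(w . x + b) for a single sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def predict(w, b, x, threshold=0.5):
    """Hard label: threshold the probability at 0.5 by default."""
    return 1 if predict_proba(w, b, x) >= threshold else 0
```

Because the sigmoid is monotonic, thresholding the probability at 0.5 is the same as thresholding the linear score z at 0.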
  2. Derive the loss function.

We use the likelihood function as the objective for model updates, but here we are maximizing it, so strictly speaking the final optimization "maximizes" the loss function (equivalently, it minimizes the negative log-likelihood):

$$L(w) = \prod_{i=1}^{m} h_w(x_i)^{y_i} \left(1 - h_w(x_i)\right)^{1 - y_i}$$
This product is hard to differentiate, so we take its logarithm to obtain the log-likelihood function:

$$\ell(w) = \sum_{i=1}^{m} \left[ y_i \log h_w(x_i) + (1 - y_i) \log\left(1 - h_w(x_i)\right) \right]$$
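The log-likelihood can be computed directly from the data; a small sketch in plain Python (the function name is my own):

```python
import math

def log_likelihood(w, b, X, y):
    """Sum over samples of y_i*log(h(x_i)) + (1 - y_i)*log(1 - h(x_i))."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        h = 1.0 / (1.0 + math.exp(-z))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total
```

For example, with w = 0 and b = 0 every sample gets h = 0.5, so each term contributes log(0.5) regardless of its label.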
  3. Gradient descent (ascent) optimization.

A special property of the sigmoid function:

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
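This derivative identity is easy to verify numerically against a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Property: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
# Check against a central finite difference at a few points.
for z in (-2.0, 0.0, 1.5):
    analytic = sigmoid(z) * (1.0 - sigmoid(z))
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(analytic - numeric) < 1e-8
```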
Using this property, the gradient of the log-likelihood is:

$$\frac{\partial \ell(w)}{\partial w_j} = \sum_{i=1}^{m} \left(y_i - h_w(x_i)\right) x_{ij}$$

which gives the gradient-ascent update rule:

$$w_j \leftarrow w_j + \alpha \sum_{i=1}^{m} \left(y_i - h_w(x_i)\right) x_{ij}$$
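Putting the pieces together, a minimal batch gradient-ascent trainer in plain Python (all names and the toy data are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.1, epochs=1000):
    """Batch gradient ascent on the log-likelihood:
    w_j += lr * sum_i (y_i - h(x_i)) * x_ij."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - h                      # (y_i - h(x_i))
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj + lr * gj for wj, gj in zip(w, grad_w)]
        b += lr * grad_b
    return w, b

# Toy 1-D data: class 1 for larger x (illustrative values only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train(X, y)
# After training, P(y=1|x) should be low at x=0 and high at x=3.
p0 = sigmoid(w[0] * 0.0 + b)
p3 = sigmoid(w[0] * 3.0 + b)
```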
A few points to note: in an actual implementation we do not loop over the samples one by one; expressing the gradient as a matrix multiplication is a little cleaner. Writing the likelihood sample by sample, as above, is simply more intuitive for the derivation.