3.1 Logistic regression

3.1.1 Introduction to logistic regression and conditional probability

As mentioned above, the perceptron algorithm may fail to converge when the classes are not linearly separable, so logistic regression can be used instead to obtain a better classifier. Note that although it is called logistic regression, it is actually a classification model, not a regression model.

To introduce logistic regression, we first need the concept of the odds, mathematically defined as $\frac{p}{1-p}$: the ratio of the probability that an event happens to the probability that it does not. We then define the logit function as the logarithm of the odds:


$$\mathrm{logit}(p) = \log \frac{p}{1-p}$$
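As a quick illustration (a minimal NumPy sketch; the function name `logit` is just a local helper defined here), a few probabilities and their log-odds:

```python
import numpy as np

def logit(p):
    """Log-odds: maps a probability in (0, 1) to a real number."""
    return np.log(p / (1 - p))

for p in [0.1, 0.5, 0.9]:
    print(f"logit({p}) = {logit(p):+.3f}")
# logit(0.1) = -2.197, logit(0.5) = +0.000, logit(0.9) = +2.197
```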

The input $p$ of the logit function is a number in the interval $(0,1)$, and its output ranges over all real numbers. In this way, we can relate the probability that a sample belongs to a certain class to the feature values of that sample:


$$\mathrm{logit}(p(y=1 \mid x)) = \sum_{i=0}^{n} w_i x_i = w^T x$$

Here $p(y=1 \mid x)$ denotes the probability that the sample belongs to class 1 given its features $x$. Just as before, if we set $z = w^T x$, we can solve for the probability $p$:


$$p = \phi(z) = \frac{1}{1+e^{-z}}$$

$\phi(z)$ is then used to represent the probability that the sample belongs to a given class. The graph of the function is as follows.

The sigmoid function takes a real value as its input and maps it to the interval $(0,1)$, with an inflection point at $z = 0$, where $\phi(z) = 0.5$.
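As a minimal sketch (assuming NumPy; `sigmoid` is a local helper, not a library function), we can evaluate the function and check its inflection point directly:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(sigmoid(z))            # ~[0.0025 0.2689 0.5 0.7311 0.9975]
print(sigmoid(0.0) == 0.5)   # True: the inflection point
```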

In this way, given the features of a sample, we can obtain the probability that it belongs to a certain class and then convert that probability into a binary output:


$$\hat{y} = \begin{cases} 1, & \phi(z) \ge 0.5 \ (\text{i.e., } z \ge 0) \\ 0, & \text{otherwise} \end{cases}$$
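A minimal prediction sketch under these definitions (the weights and samples below are hypothetical, and the first column of `X` is the bias input $x_0 = 1$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    """Return 1 where phi(w^T x) >= 0.5 (equivalently w^T x >= 0), else 0."""
    z = X @ w
    return np.where(sigmoid(z) >= 0.5, 1, 0)

w = np.array([-0.5, 1.2, -0.7])        # hypothetical weights (w_0 is the bias)
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 0.1, 1.5]])        # two samples, first column is x_0 = 1
print(predict(X, w))                   # [1 0]
```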

3.1.2 Loss function of logistic regression

Previously, we used the sum of squared errors as the loss function:


$$J(w) = \frac{1}{2} \sum_{i} \left(y^{(i)} - \phi(z^{(i)})\right)^2$$

However, in logistic regression the sum of squared errors is no longer suitable: because $\phi(z)$ is a nonlinear function of $w$, $J(w)$ becomes non-convex and gradient descent may get stuck in a local rather than the global minimum. We therefore need a new loss function.

We can instead use the maximum likelihood approach. Assuming the samples are independent, the likelihood is:


$$L(w) = P(y \mid x; w) = \prod_{i=1}^{n} P\!\left(y^{(i)} \mid x^{(i)}; w\right)$$

Because $y^{(i)}$ takes only the two values 0 and 1, each factor is a Bernoulli probability and the above equation can be written as


$$L(w) = P(y \mid x; w) = \prod_{i=1}^{n} \phi\!\left(z^{(i)}\right)^{y^{(i)}} \left(1-\phi\!\left(z^{(i)}\right)\right)^{1-y^{(i)}}$$

Taking the logarithm turns the product into a sum, which is easier to differentiate and avoids numerical underflow:


$$l(w) = \log L(w) = \sum_{i=1}^{n} \left[ y^{(i)} \log\!\left(\phi(z^{(i)})\right) + \left(1-y^{(i)}\right) \log\!\left(1-\phi(z^{(i)})\right) \right]$$
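A quick numerical check (a sketch with hypothetical values for $y^{(i)}$ and $z^{(i)}$) that the log of the likelihood product equals the sum above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y   = np.array([1, 0, 1])
phi = sigmoid(np.array([1.2, -0.8, 0.4]))          # hypothetical phi(z^(i))

likelihood     = np.prod(phi**y * (1 - phi)**(1 - y))
log_likelihood = np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))
print(np.isclose(np.log(likelihood), log_likelihood))   # True
```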

Our goal is to maximize this function, whereas a loss function is something we minimize, so the loss function we need is simply its negative:


$$J(w) = \sum_{i=1}^{n} \left[ -y^{(i)} \log\!\left(\phi(z^{(i)})\right) - \left(1-y^{(i)}\right) \log\!\left(1-\phi(z^{(i)})\right) \right]$$
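A minimal sketch of computing this loss for a batch of hypothetical samples (the clipping constant `eps` is an assumption added here to keep the logarithms finite):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, phi, eps=1e-15):
    """Negative log-likelihood J(w), summed over all samples."""
    phi = np.clip(phi, eps, 1 - eps)   # keep log() away from 0
    return np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))

y = np.array([1, 0, 1, 0])
z = np.array([2.0, -1.5, 0.3, -0.2])   # hypothetical net inputs w^T x
print(log_loss(y, sigmoid(z)))
```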

To better understand this function, let’s look at the loss function for a single sample:


$$J(\phi(z), y; w) = \begin{cases} -\log(\phi(z)), & y = 1 \\ -\log(1-\phi(z)), & y = 0 \end{cases}$$

The image is as follows:

It can be seen that when the actual value is 1 and the predicted probability $\phi(z)$ approaches 1, the loss tends to 0 (the solid line); but when the actual value is 1 and the predicted probability approaches 0, the loss tends to infinity. In other words, wrong predictions are penalized increasingly heavily.
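To make this concrete, a tiny sketch of the single-sample loss $-\log(\phi(z))$ for $y = 1$ at a few predicted probabilities:

```python
import numpy as np

# Loss for a sample whose true label is y = 1: the closer phi is to 1,
# the smaller the loss; as phi approaches 0 the loss grows without bound.
for phi in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"phi = {phi:>4}: loss = {-np.log(phi):.3f}")
# 0.010, 0.105, 0.693, 2.303, 4.605
```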