When you're job hunting, it's not enough to have learned an algorithm on the surface: interviewers love to ask about the dark corners, so everything you've studied has to be carefully polished again, always asking why. There are always details no book or blog spells out, and those are exactly the details dear interviewers like to ask about. You have to be able to build a rocket from start to finish before you're allowed in to turn the screws.

This is basically a summary for myself. I have a hard time remembering things, so I won't pile up formulas I can't read later; but it can't be formula-free either, because formulas are what the interviewer wants to see. So I'll try to weave the algorithm and the formulas together so that both stay clear.

* Introduction

Suppose we just got a great offer, and to get a feel for the real salary situation we ask the Magic Conch, which hands us the salaries of a lot of the company's employees. Staring at the numbers alone gets us nowhere, until we hear that salary depends on length of service and performance; the Magic Conch obligingly gives us each employee's length of service and performance too. Now we can be happy: we vaguely feel we can use a scientific method to calculate our future salary, and the promotion-and-raise queue materializes before us... wait, that's only the salary. What about the promotion? The Magic Conch senses our question and hands over the promotion records as well. With this pile of data in hand, watching the conch's retreating back, we know the secret to promotion and pay raises is ours.

* Linear regression

Let's start with linear regression, because logistic regression is a linear model and there is no way around linear regression anyway. The data at our disposal looks like this:

| Salary | Length of service | Performance |
| --- | --- | --- |
| 5k | 2 | 3.5 |
| 10k | 5 | 3.5 |
| 15k | 6 | 3.75 |

We treat salary as $y$, and length of service and performance as $x_1$ and $x_2$. Expressed as a linear equation:


$$y=f(x)=\omega_1x_1+\omega_2x_2+b$$

Let me write it a little bit more generally


$$y=f(x)=\omega^Tx_i+b$$

Here $\omega$ and $x$ are both vectors. So we have an equation (really, a model), and what we need to figure out is how to find the parameters $\omega$ and $b$. Wait, we are already using vectors, so why not fold $b$ into $\omega$? We artificially add a constant column to the data:

| Salary | Length of service | Performance | Constant |
| --- | --- | --- | --- |
| 5k | 2 | 3.5 | 1 |
| 10k | 5 | 3.5 | 1 |
| 15k | 6 | 3.75 | 1 |
Then fold $b$ into $\omega$, giving

$$\omega=(\omega_1,\omega_2,\omega_3,\dots,\omega_n)^T$$

where $\omega_n=b$.

And correspondingly for $X$:


$$X=(X_1,X_2,X_3,\dots,X_n)$$

Then the model can be expressed as


$$\hat{y}=X\cdot\omega$$

Now, how do we solve for $\omega$? Suppose we already have some parameter vector $\omega$; then we at least know how to evaluate it: does it work? We compare our real salary $y$ with the model's output $\hat{y}$ — if they are close, the model is good; if they are far apart, it is not. So we choose the squared error as the evaluation function (that is, the loss function):


$$L=\sum_{i=1}^{m} (y_i-\hat{y}_i)^2$$

Notice that with MSE as the loss function, the gap between the predicted value and the true value is mapped into $[0,+\infty)$; obviously, the closer the result is to 0, the better.

Note: why MSE rather than MAE? MAE is not differentiable everywhere, and there is also the L1/L2-norm and sparsity story — too much to unpack here, I'll explain it next time.


Written in matrix form, the loss is

$$L(y,X)=(y-X\omega)^T(y-X\omega)$$

Expanding, we have


$$L(y,X)=\omega^TX^TX\omega-2(X\omega)^Ty+y^Ty$$

Taking the derivative with respect to $\omega$:


$$\frac{\partial L}{\partial \omega}=2X^TX\omega-2X^Ty$$

Setting it equal to zero and solving:


$$\omega^*=(X^TX)^{-1}X^Ty$$

So now we know how to calculate the parameter $\omega$; with the parameter we have the model, and with that our linear model is complete. Let's briefly summarize the linear-regression workflow: ① first we decide the form of the model (linear); ② then we identify the parameters we need to solve for ($\omega$); ③ next we find an evaluation function to judge whether the model is good (MSE); ④ finally we plug the model into MSE and work out the parameters mathematically, completing the model.
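The four steps above can be sketched numerically. Here is a minimal NumPy sketch using made-up numbers loosely based on the toy salary table (the data, like the table, is hypothetical):

```python
import numpy as np

# Hypothetical data modeled on the salary table:
# columns = length of service, performance, constant 1 (the folded-in b)
X = np.array([
    [2.0, 3.50, 1.0],
    [5.0, 3.50, 1.0],
    [6.0, 3.75, 1.0],
])
y = np.array([5.0, 10.0, 15.0])  # salary in k

# Closed-form solution: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T @ X) @ X.T @ y

y_hat = X @ w_star  # model predictions
```

In real code `np.linalg.lstsq` (or the pseudo-inverse) is preferred over an explicit inverse for numerical stability; the explicit inverse is kept here only to mirror the formula.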

Isn't it a bit of a relief that we derived a model so easily? Now you know what it's like for the people who specialize in math: no experiments required, you just have to prove it.

* Logistic regression

Time for the main course. Now that we know our salary, when will we be promoted? Say the data we have looks like this:

| Promoted within a year | Length of service | Performance |
| --- | --- | --- |
| yes | 2 | 3.5 |
| no | 5 | 3.5 |
| yes | 6 | 3.75 |

So how do we figure out the probability of being promoted within a year? Following linear regression, we'd start with a linear equation. Wait, that's not quite right: if we used a linear equation to fit the probability of promotion within a year, the first column of our data would have to be a probability, not a yes/no outcome. So what do we do? The model will output probabilities, while the data gives us outcomes — how do we evaluate a model that has to bridge the two? Here the sigmoid function comes to our rescue:


$$\mathrm{sigmoid}=\theta(z)=\frac{1}{1+e^{-z}}$$

I’m sure you’re all familiar with this function, and I won’t show you the picture, but it has two properties that are important and will be used here


  • $\theta(-z)=1-\theta(z)$

  • $\theta'(z)=\theta(z)(1-\theta(z))$
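Both properties are easy to verify numerically; a quick sketch (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # property 2: theta'(z) = theta(z) * (1 - theta(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 101)

# property 1: theta(-z) = 1 - theta(z)
prop1 = np.allclose(sigmoid(-z), 1.0 - sigmoid(z))

# property 2: closed form vs. a numerical derivative of sigmoid
prop2 = np.allclose(np.gradient(sigmoid(z), z), sigmoid_grad(z), atol=1e-2)
```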

The nice thing about this function is that it maps values from the whole real line into $(0,1)$, so the output can be viewed as a probability — and the input is simply the output of our original linear model. If that's hard to picture, let's walk through the formulas. Say we have a linear model


$$z=f(x)=\omega_1x_1+\omega_2x_2+b$$

Write it in vector form


$$\hat{z}=X\cdot\omega$$

Looks familiar — it's just linear regression. So do I just take this, pick MSE as the loss function, plug it in, and solve?

But that's not how it goes. Say our data is one-dimensional: we don't know how good each person is, only how long they have worked, and we plot whether or not they have been promoted. How could such data be elegantly fitted with a straight line? It obviously can't.

This is a point I used to wonder about — why is the sigmoid necessary, why can't we do linear regression directly? Sometimes you just have to try it yourself and the answer is right in front of you. I'll redraw this picture in more detail later.

So enter the sigmoid. We feed the output $z$ of the linear part into the sigmoid:


$$h(X)=\theta(X\omega)= \frac{1}{1+e^{-X\omega}}$$

Here $z=X\cdot\omega$ is the linear part of logistic regression. Passing this linear part through the sigmoid gives the final probability, the predicted probability of class 1. So the parameter we want is still $\omega$. That brings us to step three: to evaluate the model we need a loss function — but which one? Our old friend MSE? Let's keep going in order and come back later to see whether MSE is appropriate here.

** Derivation of probability distribution

Since we are predicting categories — class 1 and class 0 — we can talk about the probability that a sample belongs to class 1 or class 0. That is an assumption we can make in classification problems; in regression there is no such thing as "the probability that the regression equals some value", which I think you can see. Let the probability of class 1 be $\hat p$; then


$$P(y=1|X)=\hat p$$


$$P(y=0|X)=1-\hat p$$

These equations say that when we receive a feature vector $X$, the probability that it belongs to the positive class (1) is $\hat p$ and the probability that it belongs to the negative class (0) is $1-\hat p$. By multiplying, we can merge the two into a single equation:


$$P(y|X)=\hat p^y \ast (1-\hat p)^{(1-y)}$$

You can check that plugging in $y=1$ and $y=0$ recovers exactly the two separate equations. At this point other blogs will tell you that $P(y|X)$ should be maximized — but why?

Imagine we currently have a feature vector $X$, and peeking at the answer we find $X$ is actually a positive sample, i.e. its label is 1. Shouldn't we then pray that the probability our model assigns to class 1 is as close to 1 as possible — ideally equal to 1? That probability is $\hat p$, so the bigger $\hat p$ the better. Conversely, suppose we peek and find $X$ is actually a negative sample with label 0; then we want the model's predicted probability of class 0 to be as high as possible. That probability is $(1-\hat p)$, so the smaller $\hat p$ the better.

Sometimes we want $\hat p$ as big as possible, sometimes as small as possible — how do we measure that with one criterion? This is precisely the advantage of the combined equation. When the sample label is 1, $P(y|X)$ becomes $P(y=1|X)$, and we want $P(y=1|X)$ as big as possible, i.e. $P(y|X)$ as big as possible. When the sample label is 0, $P(y|X)$ becomes $P(y=0|X)$, and we want $P(y=0|X)$ as big as possible — again $P(y|X)$ as big as possible. So no matter the case, the bigger $P(y|X)$ the better.

The point of all this rambling: the higher the probability each sample assigns to its true label, the better.

So we have a goal. Can you write down the loss function now? Take the logarithm of the probability $P(y|X)$ and negate it:


$$L(y,X)=-[y\ln(\hat p)+(1-y)\ln(1-\hat p)]$$

Here we applied a logarithm, which does not change monotonicity, and then negated, which flips the goal from maximizing the old function to minimizing the new one — after all, "minimize" fits "loss". So our loss function is the expression above, a sum of two logarithmic terms, also known as the cross-entropy function. One more thing to note: the loss is written without an explicit parameter $\omega$; that's because $\omega$ lives inside the probability $\hat p$, which is the predicted probability output by the sigmoid.

The loss function here is for a single sample; for the whole sample set you just add a summation sign.
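As a sketch, the summed cross-entropy loss might look like this in NumPy (the variable names, toy data, and the `eps` clipping to dodge `log(0)` are my own additions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y, eps=1e-12):
    # p_hat = theta(X w): predicted probability of class 1 for each sample
    p_hat = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    # L = -sum_i [ y_i*ln(p_hat_i) + (1 - y_i)*ln(1 - p_hat_i) ]
    return -np.sum(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

# toy data: length of service, performance, constant column
X = np.array([
    [2.0, 3.50, 1.0],
    [5.0, 3.50, 1.0],
    [6.0, 3.75, 1.0],
])
y = np.array([1.0, 0.0, 1.0])  # promoted within a year?

loss_at_zero = cross_entropy(np.zeros(3), X, y)  # p_hat = 0.5 everywhere
```

With all-zero weights every $\hat p$ is 0.5, so the loss is $3\ln 2$.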

Now that we have the loss function, we could go find the parameter $\omega$ that minimizes it. But let's pause for a second: there is another way to derive the loss function.

Just now we worked from the viewpoint of probability, starting from the probability distribution of the binary problem (the Bernoulli distribution, in fact), which gave us the optimization goal and the loss function. Now let's derive the loss function from another angle: maximum likelihood.

The result here may not look like the usual form of the logistic-regression loss, but I think it helps understanding — just in case the interviewer asks.

** Maximum likelihood derivation

This time, we analyze from the perspective of data, assuming that the existing data are:


$$D=(x_1,1),(x_2,1),(x_3,1),\dots,(x_n,-1)$$

In each pair, the first element $x_i$ is the feature vector and the second is the corresponding label. Note that positive samples are still labeled 1, but negative samples are now labeled -1.

Important: the negative label is no longer 0.

Each data point is generated independently of the others, so the probability of obtaining the dataset $D$ is


$$P(x_1,1)\ast P(x_2,1)\ast P(x_3,1)\ast\dots\ast P(x_n,-1)$$

Rewrite it in terms of conditional probability


$$P(x_1)P(1|x_1)\ast P(x_2)P(1|x_2)\ast P(x_3)P(1|x_3)\ast\dots\ast P(x_n)P(-1|x_n)$$

Here $P(y_i|x_i)$ can be understood as the model's probability that input $x_i$ belongs to class $y_i$, while $P(x_i)$ may not have an obvious physical meaning — we are manipulating it purely mathematically, and as we continue we will see that $P(x_i)$ doesn't actually matter. We assume the probability distribution of each data point is


$$P(y_i|x_i)= \begin{cases} f(x_i)& {y=+1}\\ 1-f(x_i)& {y=-1} \end{cases}$$

Rewriting the conditional probabilities above:


$$P(x_1)f(x_1)\ast P(x_2)f(x_2)\ast P(x_3)f(x_3)\ast\dots\ast P(x_n)(1-f(x_n))$$

Now, the probability of obtaining our data is the expression above. By the idea of maximum likelihood, the fact that I obtained this particular dataset $D$ hints that reality looks like this — this is the true relationship between "feature data" and its "label" — so under the real distribution, $D$ should be the outcome with the highest probability. What we can adjust is $f(x)$, so we choose $f(x)$ to maximize this probability. Within it, $P(x_i)$ is label-independent, so maximizing the whole expression over $f(x)$ reduces, thanks to the product structure, to maximizing the product of the $f(x)$ factors alone — that is, making


$$f(x_1)\ast f(x_2)\ast f(x_3)\ast\dots\ast(1-f(x_n))$$

as large as possible. Then we can plug in our model,


$$h(x)=\frac{1}{1+e^{-z}}=\frac{1}{1+e^{-X\omega}}$$

get


$$h(x_1)\ast h(x_2)\ast h(x_3)\ast\dots\ast(1-h(x_n))$$

Using the properties of the sigmoid function, this becomes


$$h(x_1)\ast h(x_2)\ast h(x_3)\ast\dots\ast h(-x_n)$$

Then, the optimization objective is


$$\min_{\omega} -\prod_{i=1}^n h(y_ix_i)$$

Here $y_i$ is folded into the sigmoid's argument to cover both cases at once — note this works precisely because the labels are $+1$ and $-1$. Taking the logarithm:


$$\min_{\omega} -\sum_{i=1}^{n} \ln h(y_ix_i)$$

This turns the product into a sum, and expanding the expression of $h(x)$ gives our loss function:


$$L(y,\boldsymbol{X}) = \sum_{i=1}^{n} \ln (1+e^{-y_i\boldsymbol{X} \omega})$$

Thus we derive another expression for the logistic-regression loss. The two expressions don't look the same — are they both loss functions of logistic regression? Are they really one equation, or genuinely two different losses?

Honestly, I don't know — the original paper probably discusses this, but I haven't gotten around to reading it, so I'll guess from my own derivation. If any expert knows the relationship between them, please do enlighten me...

Let's list the two forms of the loss, each for a single sample:

$$L_1(y,X)=-[y_1\ln(\hat p)+(1-y_1)\ln(1-\hat p)]$$ — Formula 1

$$L_2(y,X)=\ln(1+e^{-y_2X\omega})=-\ln(\theta(y_2X\omega))$$ — Formula 2

For $\hat p$ we have $\hat p=\frac{1}{1+e^{-X\omega}}=\theta(X\omega)$. Differentiating both losses with respect to $\omega$:


$$\frac{\partial L_1}{\partial \omega}=(\hat p-y_1)X =(\theta(X\omega)-y_1)X$$


$$\frac{\partial L_2}{\partial \omega}=(\theta(y_2X\omega)-1)y_2X$$

Recall that the first form's labels take values in $\{1,0\}$, while the second form's take values in $\{+1,-1\}$. When the label is positive, $y=1$ in both cases and the two derivatives coincide, so let's focus on the negative label. For $L_1$, with $y_1=0$, the derivative is


$$\frac{\partial L_1}{\partial \omega}=(\theta(X\omega)-y_1)X=\theta(X\omega)X$$

For $L_2$, with $y_2=-1$,


$$\frac{\partial L_2}{\partial \omega}=-(\theta(-X\omega)-1)X=(1-\theta(-X\omega))X= \theta(X\omega)X$$

The derivations above repeatedly use the two properties of the sigmoid function mentioned earlier.

So the derivatives (gradients) of the two forms agree, which is why I believe they are two forms of the same thing. The conclusion is not rigorous — equal derivatives do not imply equal functions — and I haven't checked the paper, so it remains a guess. That said, the usual gradient-descent procedure only ever computes the gradient, and their gradients are one and the same thing, XD
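The claim can also be checked numerically: feeding the two gradient formulas the same sample with matching labels ({1, 0} vs {+1, -1}) should produce identical vectors. A quick sketch (function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_form1(w, x, y01):
    # labels in {1, 0}: dL1/dw = (theta(x.w) - y) * x
    return (sigmoid(x @ w) - y01) * x

def grad_form2(w, x, ypm):
    # labels in {+1, -1}: dL2/dw = (theta(y * x.w) - 1) * y * x
    return (sigmoid(ypm * (x @ w)) - 1.0) * ypm * x

rng = np.random.default_rng(42)
w = rng.normal(size=3)
x = rng.normal(size=3)

same_pos = np.allclose(grad_form1(w, x, 1.0), grad_form2(w, x, 1.0))
same_neg = np.allclose(grad_form1(w, x, 0.0), grad_form2(w, x, -1.0))
```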

*** Small summary of the loss function

Before continuing, let’s briefly summarize the loss function of logistic regression, using a sample as an example, which we obtained in turn

Since the two forms of the loss are the same, let's use the first one as the example.

  • $\hat{z}=X\cdot\omega$ — the linear part

  • $\hat p=h(X)=\theta(X\omega)$ — probability output by the sigmoid function

  • $P(y|X)=\hat p^y \ast (1-\hat p)^{(1-y)}$ — predicted probability of the sample under its class

  • $L(y,X)=-[y\ln(\hat p)+(1-y)\ln(1-\hat p)]$ — the loss function

So is logistic regression done from beginning to end? Oh no, not yet — we have only derived the loss function. Knowing the loss function, we still need to solve it: find the value of $\omega$ that minimizes it. Then logistic regression is done.

** Optimization solution

To minimize, the default move is to take the first derivative, set it to zero, and solve — does that work here? Unfortunately, the parameter $\omega$ sits in the exponent of $e$ inside the sigmoid, which leads to messy matrix manipulation for multidimensional features; there is no tidy closed-form solution. So we take a different route: pick a random initial point, then iterate, making the loss a little smaller with each iteration, so that after many iterations we arrive at the parameter $\omega$ that minimizes the loss.

The description of optimization here is not rigorous — it's just meant to convey the idea of optimizing by iteration.

There are many common optimization algorithms — gradient descent, Newton's method, and so on — plus many recent improvements on these classics. I won't introduce them in detail here (the topic is too large), just briefly sketch the basic idea of gradient descent.

Gradient descent starts from a random initial point for the parameter $\omega$, computes the gradient (that is, the derivative) of the loss at that point, then takes a step against the gradient to get a new $\omega$, repeating the process until the gradient is zero (an extremum is reached) or falls below a threshold. The update formula is


$$\omega=\omega - \alpha \frac{\partial{L(y,X)}}{\partial{\omega}}$$

Here $\alpha$ is the learning rate: once we know the direction, it controls how far we step along it. It is not part of the model itself and must be set by hand. With this approach we can finally obtain the optimal $\omega$ for logistic regression and complete the whole model. Again, a brief summary of the logistic-regression workflow: ① first we decide the form of the model (a linear core inside a sigmoid shell); ② then we identify the parameter to solve for ($\omega$); ③ then, via two different routes, we find a loss function that can evaluate the model (cross entropy); ④ finally, combined with an optimization algorithm, we find the optimal parameter $\omega$ and the model is complete.
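Putting steps ①–④ together, here is a minimal batch-gradient-descent sketch (the toy data and hyperparameters are my own choices, not tuned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the summed cross-entropy loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p_hat = sigmoid(X @ w)
        grad = X.T @ (p_hat - y)  # dL/dw = sum_i (p_hat_i - y_i) * x_i
        w = w - lr * grad         # step against the gradient
    return w

# toy 1-D data with a folded-in constant column
X = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = fit_logistic(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(float)
```

On this linearly separable toy set the fitted model classifies every sample correctly.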

With the model finished, we remember the MSE we left behind: why didn't we simply choose MSE as the loss function? Let's work backwards from the result. If we chose MSE as the loss, then for a single sample


$$L=\frac{1}{2}(y-\hat p)^2$$

whose gradient is


$$\frac{\partial{L}}{\partial{\omega}}=(\hat p-y) h'(x)X$$

while the gradient of the cross entropy is


$$\frac{\partial L}{\partial \omega}=(\hat p-y)X$$

Comparing the two, the MSE gradient carries an extra factor $h'(x)$ relative to the cross-entropy gradient, and that factor is the derivative of the sigmoid. So let's look at the derivative of the sigmoid.

The derivative of the sigmoid has a maximum of 0.25 and tends to 0 as the input grows large in either direction. But our optimization goal is for the output to approach 1 on positive samples, which requires a large input. So with MSE as the loss, the gradient is likely to vanish just as we approach the optimization goal, and we can no longer make progress toward the minimum of the loss.

In addition, plugging the sigmoid into MSE makes the loss non-convex, with many local optima, which hinders the optimization algorithm's search for the global optimum.
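The vanishing-gradient argument can be seen in numbers. For a single positive sample with $y=1$ and scalar feature $x=1$ (so $z=\omega$), compare the gradient magnitudes of the two losses as $z$ grows (a sketch; function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# single positive sample (y = 1), scalar feature x = 1, so z = w
z = np.array([0.0, 2.0, 5.0, 10.0])
p_hat = sigmoid(z)

mse_grad = np.abs((p_hat - 1.0) * sigmoid_grad(z))  # carries the h'(z) factor
ce_grad = np.abs(p_hat - 1.0)                       # cross entropy

# the ratio is exactly sigmoid_grad(z): at most 0.25, shrinking toward 0
ratio = mse_grad / ce_grad
```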

* summary

That's logistic regression from beginning to end. Let's briefly summarize the key points.

Linear regression

  • Linear regression aims to fit data points with linear models and can be used in regression problems
  • Use linear equations as models
  • Use MSE as the loss function
  • Put the model into the loss function to obtain the parameter expression

Logistic regression

  • Logistic regression aims to train a classification model for classification problems
  • Use sigmoid as the outer layer; a linear model serves as the inner model
  • Cross entropy is used as the loss function
  • Put the model into the loss function and use the optimization algorithm to get the parameter expression
  • There are two expressions of cross entropy, but they are unified in fact, and the difference lies in the label setting value
  • The loss function does not use MSE for two reasons: ① vanishing gradients, ② local optima

Well, it turns out logistic regression is actually quite logical, XD

If you have questions or find mistakes in what I wrote, you're welcome to discuss and point them out — after all, I'm still a rookie and errors are inevitable; more discussion deepens understanding.


*END