
Logistic Regression

Unlike linear regression, the y predicted here is discrete. The Logistic Regression algorithm is one of the most popular and widely used learning algorithms today.

Classification examples:

  • Email: spam / not spam?
  • Online transactions: fraudulent (yes / no)?
  • Tumor: malignant / benign?

If you remember, all of these are discrete supervised learning problems. Here is what the three questions above have in common:

$y \in \{0, 1\}$

0: “Negative Class”

1: “Positive Class”

Of course, not all discrete problems are black and white with only two outcomes; in general $y \in \{0, 1, 2, \ldots, n\}$, countably and finitely many.

For example, let's start with a simple two-class problem:

To predict whether a tumor is benign or malignant from its size:

If we still use linear regression to fit this data, it will not be effective. Like this:

Now suppose that, for this fitted line, $h(x) > 0.5$ means malignant, and anything below means benign. You might say, well, that's a good fit.

But what if:

Many malignant tumors would be judged benign.

So linear regression is not suitable for discrete problems.

Logistic regression

So how do we do regression on a discrete problem? First we need to make sure that:


$$0 \leq h_{\theta}(x) \leq 1$$

so that the problems of linear regression do not arise. In linear regression, no matter how you fit, once the input goes beyond a certain range you will always get $h(x) > 1$.

So how do you fix that?

This makes use of the Sigmoid function, also called the Logistic function:


$$g(z) = \frac{1}{1+e^{-z}}$$

Applying the linear regression hypothesis to the Logistic function, we get:


$$h_{\theta}(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$$
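To make this concrete, here is a minimal Octave sketch of these two formulas (the function names `sigmoid` and `hypothesis` are my own, not from the course):

```octave
% Sigmoid / Logistic function: g(z) = 1 / (1 + e^(-z))
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % elementwise, so z may be a scalar, vector, or matrix
endfunction

% Hypothesis h_theta(x) = g(theta' * x), where x is a column vector with x0 = 1
function h = hypothesis(theta, x)
  h = sigmoid(theta' * x);
endfunction
```

Whatever $\theta^T x$ evaluates to, the sigmoid squashes it into $(0, 1)$.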

Now that we know how to keep $h_{\theta}(x)$ between 0 and 1, what does the output mean?

Estimated probability that $y = 1$ on input $x$.

That is, it describes how likely the conclusion $y = 1$ is, given the input $x$.

For example, suppose $y = 1$ represents a malignant tumor, and for a user's input $x$ the hypothesis outputs 0.7.

You can’t say, “Congratulations, you have a malignant tumor.” You can say, “There’s a 70 percent chance you have a malignant tumor.”

The normal result, expressed in probabilistic terms, is:


$$h_{\theta}(x) = P(y = 1 \mid x; \theta)$$

We can also derive that $P(y = 1 \mid x; \theta) + P(y = 0 \mid x; \theta) = 1$, since $y$ can only be 0 or 1.

The decision boundary

So when describing the result above, I spoke of the probability that $y = 1$ given the input $x$. But strictly speaking, not every output should be phrased in terms of $y = 1$.

Suppose:

  • predict $y = 1$ if $h_{\theta}(x) \geq 0.5$
  • predict $y = 0$ if $h_{\theta}(x) < 0.5$

In general, $h_{\theta}(x) \geq 0.5$ corresponds to $y = 1$, and $h_{\theta}(x) < 0.5$ corresponds to $y = 0$.

Note the distinction in phrasing: if for some input $x$ you get $h_{\theta}(x) = 0.2$, the prediction is $y = 0$; rather than saying "you have a 20% chance of having a malignant tumor", you would say "you have an 80% chance of having a benign tumor".

The above rule can also be stated equivalently:

Suppose:

  • predict $y = 1$ if $\theta^T x \geq 0$
  • predict $y = 0$ if $\theta^T x < 0$

This follows from the shape of $g(z) = \frac{1}{1+e^{-z}}$: $g(z) \geq 0.5$ exactly when $z \geq 0$, and $g(z) < 0.5$ when $z < 0$. Since $h_{\theta}(x) = g(\theta^T x)$, the conditions above can be rewritten this way.

And the decision boundary is $\theta^T x = 0$.

Let me draw a picture to make it more intuitive:

The picture above assumes we have found the hypothesis $h_{\theta}(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$, which substituted in gives $h_{\theta}(x) = g(-3 + x_1 + x_2)$.

Don't worry for now about how this hypothesis was found; that comes later in the article.

Here $\theta^T x = -3 + x_1 + x_2 = 0$, i.e. $x_1 + x_2 = 3$, defines a line, and this line is the decision boundary. The red crosses above the line lie in the region where $x_1 + x_2 \geq 3$, called the $y = 1$ region; the blue circles below lie in the $y = 0$ region.
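A quick numeric sketch of this example (the test point is my own choice, picked to land above the line):

```octave
theta = [-3; 1; 1];
x = [1; 2; 2];                      % x0 = 1, x1 = 2, x2 = 2, so x1 + x2 = 4 >= 3
h = 1 / (1 + exp(-(theta' * x)));   % theta' * x = 1, so h = g(1), about 0.73
prediction = (h >= 0.5)             % 1: the point falls in the y = 1 region
```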

The decision boundary is a property of the hypothesis.

The decision boundary is a property of the hypothesis function, not of the data. In other words, we need the data set to determine $\theta$, but once we have it, our decision boundary is set; the rest is independent of the data set, and we don't even have to plot the data set on the image.

Now for a more complicated example:

For this image, our hypothesis is $h_{\theta}(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ with $\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$.

Predict $y = 1$ when $x_1^2 + x_2^2 \geq 1$

Predict $y = 0$ when $x_1^2 + x_2^2 < 1$

In this case the decision boundary is the circle $x_1^2 + x_2^2 = 1$.
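Again a small sketch to check a point against this circular boundary (the test point is my own choice):

```octave
theta = [-1; 0; 0; 1; 1];
x1 = 0.5; x2 = 0.5;
features = [1; x1; x2; x1^2; x2^2];   % same feature mapping as the hypothesis
z = theta' * features;                % 0.25 + 0.25 - 1 = -0.5 < 0
prediction = (z >= 0)                 % 0: inside the circle, so predict y = 0
```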

How to fit Logistic Regression

Training set:


$$\left\{\left(x^{(1)}, y^{(1)}\right),\left(x^{(2)}, y^{(2)}\right), \cdots,\left(x^{(m)}, y^{(m)}\right)\right\}$$

$m$ examples, $x \in \left[\begin{array}{c} x_{0} \\ x_{1} \\ \vdots \\ x_{n} \end{array}\right]$, $x_{0} = 1$, $y \in \{0, 1\}$


$$h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}}$$

How to choose the parameters $\theta$?

Let's start with our training set of $m$ points. Stack $x$ into a vector as before, adding $x_0 = 1$, so that $\theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$.

So how do we choose $\theta$?

To compute $\theta$, we first have to find a cost function.

Cost function

Remember the cost function of linear regression?


$$J\left(\theta\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$$

Written another way: $\operatorname{Cost}(h_{\theta}(x), y) = \frac{1}{2}(h_{\theta}(x) - y)^{2}$

$\operatorname{Cost}(h_{\theta}(x), y)$ represents the difference between the prediction and the actual value; the cost function $J$ sums this cost over all training samples and averages it.

So the linear regression cost function can also be written as:


$$J\left(\theta\right)=\frac{1}{m} \sum_{i=1}^{m}\operatorname{Cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right)$$

In the discrete case, if you keep using this cost function, the graph you draw ends up being a non-convex function, with a long bumpy bottom full of local minima. This means you cannot smoothly find the optimal solution.

Therefore, we need to find a convex function as the cost function of logistic regression:

Logistic regression cost function


$$\operatorname{Cost}\left(h_{\theta}(x), y\right)=\left\{\begin{aligned} -\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{aligned}\right.$$

If y=1 the graph looks like this:

We know that $h_{\theta}(x)$ ranges between 0 and 1, and given the shape of the log curve, we can understand how the image above arises.

This function has some interesting nice properties:


$$\operatorname{Cost} = 0: \quad \text{if} \quad y=1, \; h_{\theta}(x)=1$$

When the cost is 0, our hypothesis outputs $h_{\theta}(x) = 1$, meaning we predict a malignant tumor, and the actual data has $y = 1$, meaning the patient really does have one. So a cost of 0 means we predicted correctly.


$$\text{But as} \quad h_{\theta}(x) \to 0, \quad \operatorname{Cost} \to \infty$$

But if our hypothesis goes to 0, the cost goes to infinity.

This captures the intuition that if $h_{\theta}(x) = 0$ (we predict $P(y=1 \mid x; \theta) = 0$) but actually $y = 1$, we penalize the learning algorithm with a very large cost.

A hypothesis output of 0 is the same as saying that for $y = 1$, a patient whose tumor is malignant, we predicted a probability of 0.

In real life, that is like telling a patient: you cannot possibly have a malignant tumor! If the tumor really is malignant, the doctor's words amount to medical malpractice, and doctors pay a high price for that. But with this function, $h_{\theta}(x)$ only tends toward 0; it never actually equals 0.

Let’s look at y=0:


$$\operatorname{Cost} = 0: \quad \text{if} \quad y=0, \; h_{\theta}(x)=0$$

A cost of 0 with $y = 0$ means the patient has a benign tumor, and our hypothesis $h_{\theta}(x) = 0$ predicts benign as well. The prediction is completely correct, so the cost is 0.


$$\text{But as} \quad h_{\theta}(x) \to 1, \quad \operatorname{Cost} \to \infty$$

This captures the intuition that if $h_{\theta}(x) = 1$ (we predict $P(y=1 \mid x; \theta) = 1$, i.e. $P(y=0 \mid x; \theta) = 0$) but actually $y = 0$, we again pay a very large cost.

$y = 0$ means the patient has a benign tumor, but if our hypothesis outputs 1, we are predicting a malignant tumor, which tells the patient: you cannot possibly have a benign tumor. If the tumor actually is benign, the doctor's words cause unnecessary panic…

So the interesting, nice property of this cost function is that it punishes overconfidence: the more certain and wrong a prediction is, the more it costs.

We already talked about the cost function:

Logistic regression cost function


$$J(\theta)=\frac{1}{m} \sum_{i=1}^{m} \operatorname{Cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right)$$


$$\operatorname{Cost}\left(h_{\theta}(x), y\right)=\left\{\begin{aligned}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\-\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{aligned}\right.$$

Note: $y = 0$ or $1$ always

Now let’s simplify it:


$$\operatorname{Cost}(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1-y) \log(1-h_{\theta}(x))$$

Why does the piecewise definition simplify to this? Just plug in the two possible values of $y$:

  • $y = 1$: $\operatorname{Cost}(h_{\theta}(x), 1) = -1 \cdot \log(h_{\theta}(x)) - (1-1)\log(1-h_{\theta}(x)) = -\log(h_{\theta}(x))$
  • $y = 0$: $\operatorname{Cost}(h_{\theta}(x), 0) = -0 \cdot \log(h_{\theta}(x)) - (1-0)\log(1-h_{\theta}(x)) = -\log(1-h_{\theta}(x))$

So the two cases merge into a single expression, and we no longer have to branch on the value of $y$.
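You can sanity-check the merged form numerically; a tiny sketch, with a function name of my own choosing:

```octave
% Cost for a single example, merged form
function c = exampleCost(h, y)
  c = -y * log(h) - (1 - y) * log(1 - h);
endfunction

% exampleCost(0.7, 1)   % = -log(0.7) ≈ 0.357: right direction, small cost
% exampleCost(0.7, 0)   % = -log(0.3) ≈ 1.204: wrong direction, larger cost
```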

Now we can write the cost function of logistic regression:

Logistic regression cost function


$$\begin{aligned} J(\theta) &=\frac{1}{m} \sum_{i=1}^{m} \operatorname{Cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right) \\ &=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \end{aligned}$$

According to this cost function, we look for $\min_{\theta} J(\theta)$, that is, the parameter $\theta$ that minimizes the cost.
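As a sketch, the full cost function can be computed in vectorized Octave like this (I am assuming `X` is the m×(n+1) design matrix with the $x_0 = 1$ column and `y` the m×1 vector of 0/1 labels):

```octave
function J = logisticCost(theta, X, y)
  m = length(y);                     % number of training examples
  h = 1 ./ (1 + exp(-X * theta));    % m x 1 vector of predictions h_theta(x^(i))
  J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));
endfunction
```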

So we’re going to do gradient descent again.

Gradient descent

Gradient Descent


$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$$

We want $\min_{\theta} J(\theta)$: Repeat {


$$\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$$

} (simultaneously update all $\theta_{j}$)

Taking the partial derivative of $J(\theta)$ above and substituting it into the gradient descent formula yields the following form:
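For reference, here is the key step of that derivative, using the sigmoid identity $g'(z) = g(z)(1-g(z))$:

$$\begin{aligned} \frac{\partial}{\partial \theta_{j}} J(\theta) &=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_{\theta}\left(x^{(i)}\right)}-\frac{1-y^{(i)}}{1-h_{\theta}\left(x^{(i)}\right)}\right] h_{\theta}\left(x^{(i)}\right)\left(1-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)} \\ &=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \end{aligned}$$

The $\frac{1}{m}$ factor can be absorbed into the learning rate $\alpha$, which is why it does not appear explicitly in the update below.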

Gradient Descent


$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$$

We want $\min_{\theta} J(\theta)$: Repeat {


$$\theta_{j}:=\theta_{j}-\alpha \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$$

}

(simultaneously update all $\theta_{j}$)

Now, have you noticed that the gradient descent formula for logistic regression looks exactly the same as the linear regression gradient descent formula?

Why the emphasis on looks? Because their hypothesis functions are, after all, different:


$$\begin{aligned} &h_{\theta}(x)=\theta^{T} x \\ &h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}} \end{aligned}$$

So it looks the same, but it’s very different.
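Putting the update rule together, a possible Octave sketch of the loop (alpha, num_iters, and the variable names are my assumptions, not prescribed by the course):

```octave
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = 1 ./ (1 + exp(-X * theta));   % logistic hypothesis, m x 1
    grad = (1 / m) * X' * (h - y);    % all partial derivatives at once
    theta = theta - alpha * grad;     % simultaneous update of every theta_j
  endfor
endfunction
```

The only line that differs from the linear regression version is the one computing `h`.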

Advanced optimization

When we talked about linear regression, we said that besides gradient descent you can use the normal equation to find the best $\theta$. The same is true for logistic regression: we can use algorithms other than gradient descent. (Like gradient descent, the three algorithms below need the cost function and its partial derivatives; they differ in how they iterate.)

  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages:

  • No need to manually pick α
  • Often faster than gradient descent

Disadvantages:

  • More complex

Compared with gradient descent, these three algorithms require no choice of the learning rate α and are usually faster. The downside is that they are more complex.

Of course, being more complicated is hardly a disadvantage in practice, because you don't need to know how they work internally; you can just use implementations others have already written. As Ng said: “I have been using them for more than 10 years, but I only found out some details about them a few years ago.”

It reminds me of a funny meme: expensive things have only one flaw, that they are expensive; and that is not the thing's fault, it is mine.

Octave and MATLAB have libraries that you can use directly. If you use C, C++, Python, etc., you may have to try several libraries to find a better implementation.

How does it apply to logistic regression?

$$\theta=\left[\begin{array}{c} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{array}\right]$$

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient(1) = [code to compute ∂J(θ)/∂θ_0];
  gradient(2) = [code to compute ∂J(θ)/∂θ_1];
  ...
  gradient(n+1) = [code to compute ∂J(θ)/∂θ_n];
```

Here’s an example:


$$\begin{aligned} &\text { Example: }\\ &\theta=\left[\begin{array}{l} \theta_{1} \\ \theta_{2} \end{array}\right]\\ &J(\theta)=\left(\theta_{1}-5\right)^{2}+\left(\theta_{2}-5\right)^{2}\\ &\frac{\partial}{\partial \theta_{1}} J(\theta)=2\left(\theta_{1}-5\right)\\ &\frac{\partial}{\partial \theta_{2}} J(\theta)=2\left(\theta_{2}-5\right) \end{aligned}$$

Now we have an example with two parameters, and you can see that the cost function is minimal (equal to 0) when both $\theta$ values equal 5. OK, now let's pretend we don't know that result and let the algorithm find it.

```octave
function [jVal, gradient] = costFunction(theta)
  % cost
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  % partial derivatives
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
endfunction
```

The costFunction function returns two values: jVal, the cost, and gradient, the vector holding the partial derivatives.

```octave
% Octave:
options = optimset('GradObj', 'on', 'MaxIter', '100');
initTheta = zeros(2, 1);
[Theta, J, Flag] = fminunc(@costFunction, initTheta, options)
```
  • optimset sets the optimizer options:
    • 'GradObj', 'on': tells the optimizer that our function also returns the gradient
    • 'MaxIter', '100': sets the maximum number of iterations to 100
  • fminunc: Octave's unconstrained minimization function, which takes three arguments:
    • the function handle; when passing your own function, always put @ in front of it
    • the initial theta, which must be a vector of two or more dimensions; if it is a scalar, the function will fail
    • the options set above

The final run looks something like this: Theta holds the value of $\theta$ at which the cost function is minimized, J is the minimum value of the cost function, and Flag = true indicates convergence.
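Bringing this back to logistic regression: a sketch of a cost function you could hand to fminunc (the name `logisticCostFunction` and the use of an anonymous function to bind `X` and `y` are my assumptions):

```octave
function [jVal, gradient] = logisticCostFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % predictions
  jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)); % cost J(theta)
  gradient = (1 / m) * X' * (h - y);                       % all partial derivatives
endfunction

% usage, with X and y already loaded:
% options   = optimset('GradObj', 'on', 'MaxIter', '100');
% initTheta = zeros(size(X, 2), 1);
% [Theta, J, Flag] = fminunc(@(t) logisticCostFunction(t, X, y), initTheta, options);
```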

Multiclass classification

What is multicategory classification?

Let's say you want your email to be automatically sorted into categories: work, friends, family, other.

What do we do with this classification?

Using an idea called one-versus-all classification (sometimes also called one-versus-rest), we can take what we built for binary classification and make it work for multi-class classification as well. Here is how it works:

Let’s say, we have a training set

Triangles are class 1, squares are class 2, crosses are class 3.

Now turn this into three separate binary classification problems:


$$h_{\theta}^{(i)}(x)=P(y=i \mid x ; \theta) \quad (i=1,2,3)$$

So, for example, when $i = 1$, the triangles are the positive class and everything else is negative. One such classifier is trained for each of the three situations.

One-vs-all: train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

On a new input $x$, to make a prediction, pick the class $i$ that maximizes

$$\max _{i} h_{\theta}^{(i)}(x)$$

In other words, we get one logistic regression classifier $h_{\theta}^{(i)}(x)$ per class, predicting the probability that $y = i$. To make the final prediction for a new input $x$, we run $x$ through every classifier and choose the class whose classifier outputs the largest value; that class is the predicted $y$.
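A minimal sketch of that prediction step (assuming `all_theta` holds one trained $\theta$ per row, which is my own convention here):

```octave
function pred = predictOneVsAll(all_theta, X)
  % all_theta: K x (n+1), one row of learned parameters per class
  % X: m x (n+1) design matrix with the x0 = 1 column
  probs = 1 ./ (1 + exp(-X * all_theta'));   % m x K: probs(i, k) = h^(k)(x^(i))
  [~, pred] = max(probs, [], 2);             % index of the most probable class per example
endfunction
```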