Up front: I will describe commonly used machine learning algorithms from a beginner's perspective. My level is limited, so if there are any mistakes I hope you will point them out to avoid misleading anyone. This is the second article in the series. If you have not read the first article, I recommend starting with the basic algorithm of linear regression, which covers a lot of the mathematical background and machine learning ideas that are very helpful for understanding this article; many of those things will not be repeated here. Rather than walking through the usual machine learning introduction of why each step is taken, this article explains directly why logistic regression is implemented this way, focusing more on the implementation of logistic regression itself than on understanding the machine learning process.

Experimental code – Reference books – [Reference blogs: see the end of the article]

First, the effect image ^_^

0 Preparation

0.1 Environment

Python: 3.5

TensorFlow: Anaconda3 envs

IDE: PyCharm 2017.3.3

0.2 Basic Understanding

If you read the introduction to linear regression, you know that linear regression fits a class of linearly distributed data and then trains a linear model to make predictions about that data. Logistic regression, for now, can be understood as dealing with classification problems. For example, given a person's longitude and latitude, it is difficult to build a good linear model that predicts the person's region from those coordinates using only linear regression thinking. What we actually need are the national boundary lines, and those boundary lines are what logistic regression gives us.

1 Binary Classification

1.1 Proposing the fitting function: Sigmoid

If you don't quite understand what kind of problems logistic regression handles and how it differs from linear regression, let's take a quick example (note: the numbers are special values chosen for simplicity) to introduce today's hero:

Looking at this picture, it's obvious that you can't draw a straight line well; a line will always incur a large error. But if there is a function such as (this matches the curve plotted in the code below):

$$y = \frac{1}{1 + e^{-6(x - 5)}}$$
Then it perfectly conforms to the current data distribution, and our intuition tells us that there is a high probability it can predict the data well. The next step is to find a function of this form to use as the fitting function. It's obvious that our function is not linear, so I'm going to pull out a guy called the Sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$
As you can see from the graph, this function is perfect as we expected.

# coding: utf-8

import numpy as np
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

train_X = np.asarray(x)
train_Y = np.asarray(y)

fig = plt.figure()
plt.xlim(-1, 12)
plt.ylim(-0.5, 1.5)
plt.scatter(train_X, train_Y)

s_X = np.linspace(-2, 12, 100)
s_Y = 1 / (1 + np.power(np.e, -6 * (s_X - 5)))
plt.plot(s_X, s_Y)
plt.savefig("linear2LogisticreGressionGraphAddSigmoid.png")
# plt.savefig("linear2LogisticreGressionGraph.png")
plt.show()

1.2 Expanded Classification

As mentioned above, in order to better understand logistic regression, the data was kept a little special and simple. Let's look at another set of data:

A scatter plot with an obvious clustering effect can be regarded as a classification problem, so how can we use the logistic regression introduced above to handle such problems?

Here the thinking shifts a little. Previously we took a continuous value $y$ as the output; here that is not the case. The result to be predicted is: given an input, is the point a red circle or a blue triangle? So we look for a function shaped like the Sigmoid above, and that function will help us categorize. Suppose $y = 1$ and $y = 0$ each represent one of the two cases; then from $h(x)$ we can infer whether the current situation is a red circle or a blue triangle. And what is $h(x)$? It is the result of preprocessing the input and passing it through the Sigmoid. As you probably already know, the Sigmoid takes the whole real line as its domain and maps it onto the range $(0, 1)$. So given any input we get a value within that interval, and that is exactly the behaviour we want.

Mimicking the earlier machine-learning approach, for a given input we use a unified input matrix, written $X$, and then add weights $W$ to obtain a function $z = WX$. What we're going to do is define a model to distinguish between these distributions. If you look at the code, the data was constructed to be linearly separable by default, so it's natural to choose the linear model as the first choice. As for how to choose the model, I think the simplest representation that works is often the best. Let's take a look at Mr. Ng's example to get a feel for the process.

Well, since it's linear, I can just write $z = W^T x$. A reminder here: we add a bias feature $x_0 = 1$, and our default vectors are all column vectors. If you understand this, then the fitting function is easy to obtain:

$$h_W(x) = g(W^T x) = \frac{1}{1 + e^{-W^T x}}$$
1.3 Deriving the loss function from intuition

Now that we've processed the data, it doesn't matter whether the input has one variable or two; it's just a single $z = W^T x$ and nothing more. So, as with linear regression, we need to find a function that describes the error. Look at the picture:

Let's analyze how to express the loss for a point whose true label is $y = 1$ as a function of the prediction $h(x)$. The first thing we can be sure of is that a prediction near $h(x) = 0$ should be punished most severely, because from the graph it is the most outrageous prediction. So the loss should be a decreasing function of $h(x)$. Next, is the decrease uniform? Clearly not: the closer $h(x)$ gets to $0$, the faster the loss should grow, something like exponential growth. That way, the parameters the converged loss function produces will fit most of the data, and extreme classifications won't appear, because the penalty for extreme mistakes is too large. So this function has three characteristics: it is decreasing; the absolute value of its derivative also decreases (the rate of decrease shrinks as $h(x)$ grows); and its range over the domain $(0, 1)$ must be large, ideally spanning from $+\infty$ down to $0$, because if $h(x) = 1$ we should not penalize the point at all, so the function value there must be $0$. You may already be thinking that the logarithm is exactly this case. So here is the result:

$$\mathrm{Cost}(h_W(x), y) = \begin{cases} -\log\left(h_W(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_W(x)\right) & \text{if } y = 0 \end{cases}$$
I'm not going to post the function graph here; analyze the $y = 0$ case for yourself. If you only look at the final formula and skip the thinking process, your understanding may be a little lacking. Many books and blogs give the formula directly; it's not hard to memorize, but then you miss the experience of working through it yourself. Here's a picture for you to think about:

The final result is given directly below:

$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_W(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_W(x^{(i)})\right) \right]$$
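
As a quick numeric sanity check (my own illustrative snippet, not from the original article), here is the per-point cost for a few predictions, showing how a confident wrong prediction is punished far more heavily than a confident correct one:

import numpy as np

def point_cost(h, y):
    # Cross-entropy cost for a single prediction h in (0, 1) with true label y in {0, 1}.
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(point_cost(0.9, 1))   # ~0.105: confident and correct -> small penalty
print(point_cost(0.5, 1))   # ~0.693: undecided -> moderate penalty
print(point_cost(0.01, 1))  # ~4.605: confident and wrong -> large penalty
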
1.4 Analyze the loss function from the perspective of probability theory

Remember that logistic regression can be viewed as a classification problem? The prediction of a classification problem can be expressed as a probability. So the function $h_W(x)$ can be viewed as $P(y = 1 \mid x; W)$: given the parameters $W$ and the input $x$, the probability that $y = 1$. All we have to do is find a $W$ such that, for the given samples, the model complies with the data as much as possible. At this point you notice that the result $y$ follows a Bernoulli distribution: the random variable $y$ can only be $0$ or $1$, and $P(y = 1 \mid x)$ is $h_W(x)$; when this value is greater than $0.5$ we classify the point as $1$.

All right, set that aside for a moment and get into a probabilistic way of thinking. Suppose there is an infinite number of points in this plane, and what we have is just a sample. We can't list all of them, so what we do is estimate the overall distribution from the sample we are given. This is also why machine learning needs a large number of training samples: the larger the amount of data, the smaller the accidental error, and the closer the sample distribution is to the overall distribution. Back to probability theory — I find it easy to get off track when writing 😂😂😂. What our statistics follow here is a Bernoulli distribution. For the situation of knowing the distribution and finding parameters that match it, probability theory provides a parameter-estimation method called maximum likelihood estimation. If you're not familiar with it, here's a brief introduction:

Maximum likelihood estimation (MLE), commonly understood, uses the known sample results to find the model parameter values that are most likely (with maximum probability) to have produced those results. In other words, maximum likelihood estimation provides a way to evaluate model parameters given observed data, i.e., "the model is determined, the parameters are unknown".

From @Yizhen: Understanding maximum likelihood estimation

According to the above, it's obvious that for a single sample $(x, y)$:

$$P(y \mid x; W) = h_W(x)^{y} \left(1 - h_W(x)\right)^{1 - y}$$
This is the probability for a single sample. What about the entire sample, i.e., $n$ independent repetitions? The probability of all of them occurring is the product of the individual probabilities. Here the vectors $X$ and $Y$ are used to represent the entire data set. That is:

$$L(W) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)}; W\right) = \prod_{i=1}^{n} h_W(x^{(i)})^{y^{(i)}} \left(1 - h_W(x^{(i)})\right)^{1 - y^{(i)}}$$
You can see that this is a function of $W$. Now comes the slightly metaphysical part: since these events did happen, we argue they should have the highest probability of happening among all possibilities — otherwise, why would they have happened? Ha ha, so metaphysical. Maximize it? Easy to do, right? No, a raw product is too hard to work with, so we use the most common trick from high school: turn multiplication into addition by taking the logarithm. Then multiply by $-\frac{1}{n}$ and minimize instead:

$$J(W) = -\frac{1}{n} \log L(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_W(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_W(x^{(i)})\right) \right]$$
1.5 Finding the extremum with gradient descent

The loss function was given above, and again we use the gradient descent algorithm to update the weights. Since we're writing the code ourselves, we still have to take the partial derivatives 😂😂😂

This is just basic calculus. I'm being lazy here: the symbols are a bit hard to type, so the derivation goes in as a screenshot — after all, it isn't anything difficult:
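
In case the screenshot is not visible, here is a sketch of the standard derivation (my own reconstruction using the notation above, relying on the Sigmoid property $g'(z) = g(z)(1 - g(z))$):

$$\frac{\partial J(W)}{\partial W_j} = -\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{y^{(i)}}{h_W(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_W(x^{(i)})} \right] \frac{\partial h_W(x^{(i)})}{\partial W_j} = \frac{1}{n} \sum_{i=1}^{n} \left( h_W(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$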

So the update rule is:

$$W_j := W_j - \frac{\alpha}{n} \sum_{i=1}^{n} \left( h_W(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Notice that $x^{(i)}$ is a column vector and $x_j^{(i)}$ denotes the $j$-th component of the $i$-th sample. In fact, if you think about it carefully, the thing being updated, $W_j$, is just a number, and when you take the partial derivative with respect to it, what remains is the factor it multiplies. I started from a single point, so subscripts made things easier to understand, but you have to keep the matrix form in mind when you use a linear boundary function. I won't go into detail here, but for multiple variables you do need the vector form.

1.6 Handwritten Code

OK, the basic logic is clear. It is similar to linear regression, except that we're dealing with a linear binary classification problem: we process the raw input into a linear value and then map that value onto the $(0, 1)$ range of the Sigmoid function; based on the Sigmoid's characteristics, we classify the data as positive when the output is at least $0.5$, and then evaluate the result.

If you still don't understand the formula, please refer to the article Machine learning — The Derivation of the Logistic regression calculation process, which explains how to vectorize it. If you're not familiar with linear algebra, read that on your own. I'm just going to go through the code and not repeat the principles.

In order to get better accuracy, it is best to use float64:

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.animation as animation

x = [1, 2, 3, 4, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

train_X = np.asarray(np.row_stack((np.ones(shape=(1, len(x))), x)), dtype=np.float64)
train_Y = np.asarray(y, dtype=np.float64)
train_W = np.asarray([-1, -1], dtype=np.float64).reshape(1, 2)

Define the Sigmoid function and the loss function:

def sigmoid(X):
    return 1 / (1 + np.power(np.e, -(X)))


def lossfunc(X, Y, W):
    n = len(Y)
    return (- 1 / n) * np.sum(Y * np.log(sigmoid(np.matmul(W, X))) + (1 - Y) * np.log((1 - sigmoid(np.matmul(W, X)))))

Implement parameter update (gradient descent algorithm)

def gradientDescent(X, Y, W, learningrate=0.001, trainingtimes=500):
    n = len(Y)
    for i in range(trainingtimes):
        W = W - (learningrate / n) * np.sum((sigmoid(np.matmul(W, X)) - Y) * X, axis=1)

One important point here is vectorizing the earlier analysis; it's easy to see how the dimensions change. Among them, np.sum(..., axis=1) is an interesting detail: it sums along a particular dimension, here the sample dimension.
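
As a small aside (my own sketch, not part of the original code), for the shapes used here the axis=1 sum is equivalent to a matrix product, which can make the dimensions easier to check:

import numpy as np

# Shapes as in the article: W is (1, 2), X is (2, n), Y is (1, n)
W = np.asarray([[-1.0, -1.0]])
X = np.asarray([[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]])
Y = np.asarray([[0.0, 0.0, 1.0]])

h = 1 / (1 + np.exp(-np.matmul(W, X)))          # predictions, shape (1, n)
grad_sum = np.sum((h - Y) * X, axis=1)          # shape (2,)
grad_mat = np.matmul(h - Y, X.T)                # shape (1, 2)

print(np.allclose(grad_sum, grad_mat.ravel()))  # True: same gradient values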

In fact, at this point the algorithm is basically complete. Let's visualize it below; you can refer to my previous article on saving GIFs with Matplotlib. The effect and source code are given below:

# coding: utf-8

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.animation as animation

x = [1, 2, 3, 4, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

train_X = np.asarray(np.row_stack((np.ones(shape=(1, len(x))), x)), dtype=np.float64)
train_Y = np.asarray(y, dtype=np.float64)
train_W = np.asarray([-1, -1], dtype=np.float64).reshape(1, 2)


def sigmoid(X):
    return 1 / (1 + np.power(np.e, -(X)))


def lossfunc(X, Y, W):
    n = len(Y)
    return (- 1 / n) * np.sum(Y * np.log(sigmoid(np.matmul(W, X))) + (1 - Y) * np.log((1 - sigmoid(np.matmul(W, X)))))


Training_Times = 100000
Learning_Rate = 0.3

loss_Trace = []
w_Trace = []
b_Trace = []


def gradientDescent(X, Y, W, learningrate=0.001, trainingtimes=500):
    n = len(Y)
    for i in range(trainingtimes):
        W = W - (learningrate / n) * np.sum((sigmoid(np.matmul(W, X)) - Y) * X, axis=1)
        # for GIF
        if 0 == i % 1000 or (100 > i and 0 == i % 2):
            b_Trace.append(W[0, 0])
            w_Trace.append(W[0, 1])
            loss_Trace.append(lossfunc(X, Y, W))
    return W


final_W = gradientDescent(train_X, train_Y, train_W, learningrate=Learning_Rate, trainingtimes=Training_Times)

print("Final Weight:", final_W)
print("Weight details trace: ", np.asarray([b_Trace, w_Trace]))
print("Loss details trace: ", loss_Trace)

fig, ax = plt.subplots()
ax.scatter(np.asarray(x), np.asarray(y))
ax.set_title(r'$Fitting\ line$')


def update(i):
    try:
        ax.lines.pop(0)
    except Exception:
        pass
    plot_X = np.linspace(-1, 12, 100)
    W = np.asarray([b_Trace[i], w_Trace[i]]).reshape(1, 2)
    X = np.row_stack((np.ones(shape=(1, len(plot_X))), plot_X))
    plot_Y = sigmoid(np.matmul(W, X))
    line = ax.plot(plot_X, plot_Y[0], 'r-', lw=1)
    ax.set_xlabel(r"$Cost\ %.6s$" % loss_Trace[i])
    return line


ani = animation.FuncAnimation(fig, update, frames=len(w_Trace), interval=100)
ani.save('logisticregression.gif', writer='imagemagick')

plt.show()
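
Once trained, using the model for prediction is just a threshold on the Sigmoid output. A minimal sketch (my own addition, reusing the sigmoid function and final_W from the script above):

def predict(W, x_values):
    # Predict class labels (0 or 1) for raw 1-D inputs using the trained weights.
    X = np.row_stack((np.ones(shape=(1, len(x_values))), x_values))  # add the bias row
    probs = sigmoid(np.matmul(W, X))                                 # P(y = 1 | x)
    return (probs >= 0.5).astype(int)

print(predict(final_W, [2, 8]))  # expected [[0 1]]: left of the boundary -> 0, right of it -> 1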

2 Multivariate Classification

Start with three-class classification: add the following distribution and try to find the boundaries:

It's a lot harder than the previous case. First, a single linear split obviously won't work here, and second, there is one more class involved... Don't worry, compare the three pictures and you'll see.

So here we solve the problem of having one more class by picking one class as the protagonist and setting the labels of all the other classes to $0$; then you can split as before. Another question is how to pick the features. Now that we can convert the problem into binary classification, for each binary problem we just pick features. There are obviously two basic ones, $x_1$ and $x_2$, but to draw a more appropriate boundary these two are not enough, so we add others such as $x_1^2$, $x_1 x_2$, $x_2^2$ and so on; depending on what kind of boundary you want to synthesize, you choose the corresponding features. Similarly, in three dimensions you can choose feature terms that express a curved surface. Picking the weights, i.e. the coefficients in front of the features, then works as before: if the features are chosen well, we just need to add a bias term on top and it'll be ok.
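
For concreteness, here is a minimal one-vs-all training sketch (my own illustration, reusing the sigmoid and gradientDescent functions from the full script above; the feature matrix X is assumed to already contain the bias row, and labels is an array of integer class ids):

def one_vs_all(X, labels, num_classes, learningrate=0.1, trainingtimes=10000):
    # Train one binary logistic classifier per class.
    models = []
    for c in range(num_classes):
        Y_c = (np.asarray(labels) == c).astype(np.float64)  # current class -> 1, all others -> 0
        W0 = np.zeros(shape=(1, X.shape[0]))                 # one weight per feature row
        models.append(gradientDescent(X, Y_c, W0, learningrate, trainingtimes))
    return models


def predict_multiclass(models, X):
    # Run every binary model and pick the class with the highest probability.
    probs = np.vstack([sigmoid(np.matmul(W, X)) for W in models])  # (num_classes, n_samples)
    return np.argmax(probs, axis=0)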

Here is a small example of how to do this:

There's obviously a tendency to encircle, so it makes sense to think of circles. In other words, selecting the features $x_1$, $x_2$, $x_1^2$, $x_2^2$, the fitting function can be defined as:

$$h_\theta(x) = g\left(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2\right)$$
The solution is similar to the one above: if you abstract everything into vectors, the number of weights matches the number of feature dimensions. Remember, the number of training runs equals the number of classes; that is to say, each class takes its turn as the protagonist to get the boundary that separates it from the others, and prediction then runs all the models, choosing the class with the highest probability as the classification result.
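
A minimal sketch of building such a feature matrix (my own illustration; the row layout follows the bias-row, column-vector convention used earlier):

def circular_features(x1, x2):
    # Stack [1, x1, x2, x1^2, x2^2] as rows: one column per sample.
    x1 = np.asarray(x1, dtype=np.float64)
    x2 = np.asarray(x2, dtype=np.float64)
    return np.row_stack((np.ones_like(x1), x1, x2, x1 ** 2, x2 ** 2))

X = circular_features([0.5, -1.0, 2.0], [1.0, 0.0, -2.0])
print(X.shape)  # (5, 3): five feature rows, three sample columns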

I won’t go into the details of the application here, but there will be opportunities to supplement the code practices in the future. Here are a few reference blogs:

  • “Actual Machine Learning” Logistic Regression Algorithm (1)
  • Logistic regression gradient descent method
  • In-depth Machine learning series 3- Logistic regression

3 Model Optimization

3.1 Algorithm Direction

Gradient descent is not the only way to solve the optimization problem; there are other methods that converge faster, for example the conjugate gradient method, BFGS (the variable metric method), and L-BFGS (the limited-memory variable metric method). I haven't looked at the concrete implementation of these algorithms yet, so consider this a FLAG planted here. In practice we usually call these algorithms from existing libraries anyway; an algorithm we write ourselves will be flawed in one way or another.
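
As an illustration of calling an existing library (my own sketch, not from the original article; it assumes the train_X, train_Y and lossfunc definitions from the script above, and that SciPy is installed), scipy.optimize.minimize with the L-BFGS-B method can minimize the same loss:

import numpy as np
from scipy.optimize import minimize

def loss_flat(w_flat):
    # Adapt the (1, 2) weight matrix used above to the flat vector SciPy expects.
    return lossfunc(train_X, train_Y, w_flat.reshape(1, 2))

# Gradients are estimated numerically here; supplying an analytic jac= would be faster.
result = minimize(loss_flat, x0=np.zeros(2), method='L-BFGS-B')
print(result.x)  # weights found by L-BFGS, comparable to the gradient-descent result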

3.2 Degree-of-Fit Direction

In the previous example you saw that, once the features are chosen, the number of iterations matters. A quick example: suppose we fit a scatter plot that follows a quadratic distribution. If you choose a purely linear model, it will never work out until you add the quadratic feature term; otherwise it under-fits. Similarly, if you choose the quadratic model correctly but the number of iterations is not enough, it will also under-fit. In fact, if the model matches the true distribution, the more iterations the better. But if you're not sure and add extra higher-order feature terms, the training count becomes critical: with well-controlled training you can still obtain a good model in which the weights of the redundant terms go to zero, but with badly controlled training you get under-fitting (too few iterations) or over-fitting (too many). This is not to say that a smaller loss value is always better; what matters is that the fitted function follows the overall trend of the data. From the probability-theory point of view, what we have is only one sample drawn from the population; since obtaining the whole population is not practical, we can only use the sample to estimate it, so what we estimate should be a trend.
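
To make the degree-of-fit point concrete, here is a tiny illustration (my own sketch, separate from the article's code) fitting quadratic-looking data with polynomials of different degrees:

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 20)
y = x ** 2 + rng.normal(scale=0.5, size=x.shape)  # quadratic trend plus noise

for degree in (1, 2, 6):
    coeffs = np.polyfit(x, y, degree)        # fit a polynomial of the given degree
    pred = np.polyval(coeffs, x)
    print(degree, np.mean((pred - y) ** 2))  # training error alone keeps shrinking,
                                             # but degree 1 under-fits and degree 6 chases the noise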

So how do you avoid over-fitting or under-fitting? I think there’s a way to think about it:

  • Amount of data (more is better)
  • Feature terms (the principle of the fewest features that still describe the trend)
  • Number of training iterations (the more you train in the direction of the trend, the better)

In short, the most important thing here is the accuracy of the model. If you choose the wrong model at the beginning, no iteration will produce good results.

The following is a slide from Mr. Andrew Ng's course for you to get a feel for it:

As you can see, there is another way to handle this: regularization

I'll dare to explain it from my own understanding; if you already have your own understanding, you can skip this to avoid being misled. Regularization generally targets over-fitting, that is, the situation where we cannot determine the trend at all and therefore add many redundant feature terms. What kind of result would we accept for such a model? If the coefficients of those extra features end up close to zero, doesn't that mean we get a good fit, or even essentially the same result as if those interfering features weren't there? And that is exactly what penalizing the coefficients of those feature terms does.

As can be seen from the explanation above, we do not need to penalize $\theta_0$, so we end up with the loss function below. Of course, this is only one approach; you can choose whatever works, the main purpose being to avoid weights that are too large and a fitted curve that bends too sharply (because after taking the derivative, the slope is positively correlated with the coefficients). The philosophy here is that, as long as the fit is acceptable, the smoother the curve, the higher the fault tolerance and the more representative the sample is of the population.
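
A minimal sketch of what a regularized loss and a regularized gradient step could look like in the code above (my own illustration; lam is the regularization strength, and the bias weight W[0, 0] is deliberately left unpenalized):

def lossfunc_reg(X, Y, W, lam=1.0):
    n = len(Y)
    h = sigmoid(np.matmul(W, X))
    cross_entropy = (-1 / n) * np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))
    penalty = (lam / (2 * n)) * np.sum(W[0, 1:] ** 2)    # skip the bias term
    return cross_entropy + penalty


def gradient_step_reg(X, Y, W, learningrate=0.001, lam=1.0):
    n = len(Y)
    grad = (1 / n) * np.sum((sigmoid(np.matmul(W, X)) - Y) * X, axis=1)
    reg = (lam / n) * np.concatenate(([0.0], W[0, 1:]))  # no penalty on the bias weight
    return W - learningrate * (grad + reg)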

Here’s an example of a screenshot:

  • Regularized linear regression

  • Regularized logistic regression

4 Summary

After some careful reading, my understanding of the whole process deepened a lot. Hands-on practice is the best way to learn. Before this, I thought understanding the formulas was enough, but this time, while writing the code, I misplaced a variable inside the parentheses and could not get the fit to work no matter what. After a few hours I decided to derive the formulas again and re-type the formula code (because the printed data looked weird), and went through everything once more. So if you are just starting to learn, even if you only type along with a tutorial, afterwards organize your ideas into text or code, ideally into a blog post, so you can review it later.

Taking the time to understand the details is how you save time.

Finally: Happy Year Of Dog ^_^!

5 Reference Materials

  • @Yizhen: Understanding maximum likelihood estimation
  • Machine learning — Derivation of Logistic regression calculation process
  • “Actual Machine Learning” Logistic Regression Algorithm (1)
  • Logistic regression gradient descent method
  • In-depth Machine learning series 3- Logistic regression
  • A Brief analysis of Logistic Regression