Abstract: This paper introduces multivariable linear regression and presents two common tips for gradient descent.

This article is shared from the Huawei Cloud community post “Multivariate Linear regression (I) with Machine Learning!”, originally written by Skytier.

1. Multidimensional features

Since this is multivariable linear regression, there must be multiple variables, that is, multiple features. In the linear regression studied before, there was only a single feature: $x$ represented the house area, and we hoped to use it to predict $y$, the house price. Of course, reality is rarely that simple. Anyone buying a house also considers the number of bedrooms, the number of floors, the age of the house, and all kinds of other information. So we can use $x_1, x_2, x_3, x_4$ to represent these four features, and still use $y$ to represent the output variable we want to predict.

Now we have four features, as shown in the figure below:

Lowercase $n$ denotes the number of features, so in this case $n = 4$. This $n$ is a different concept from the earlier symbol $m$, which denotes the number of samples: if the table has 47 rows, then $m = 47$, the number of rows in the table, or the number of training samples.

We use $x^{(i)}$ to denote the input feature vector of the $i$-th training sample. For example, $x^{(2)}$ is the feature vector of the second training sample, that is, the four feature values used to predict the price of the second house. The superscript 2 is an index into the training set; do not read it as $x$ squared. It corresponds to the second row of the table, the second training sample, as a four-dimensional vector (more generally, an $n$-dimensional vector). And $x_j^{(i)}$ is the value of the $j$-th feature in the $i$-th training sample. In the table, $x_3^{(2)}$ is the value of the third feature in the second training sample, which is 2.
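To make the indexing concrete, here is a small NumPy sketch. The array values are made up for illustration (the original table is not reproduced here); they are only chosen so that the third feature of the second sample is 2, as in the text. Note that the text counts from 1 while NumPy counts from 0.

```python
import numpy as np

# Hypothetical training set: each row is one sample, each column one feature
# (size, bedrooms, floors, age). The numbers are illustrative only.
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
])

m, n = X.shape      # m = number of samples, n = number of features

# The text uses 1-based indices; NumPy uses 0-based indices.
x_2 = X[1]          # x^(2): feature vector of the second training sample
x_2_3 = X[1, 2]     # x_3^(2): third feature of the second training sample
```

So `X[i - 1, j - 1]` plays the role of $x_j^{(i)}$ in this layout.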

OK, now that we know the notation, the form of the hypothesis naturally changes too, because there are more variables. The hypothesis for a single variable is $h_\theta(x) = \theta_0 + \theta_1 x$. By analogy, Mi is going to quiz you: what about several variables? There are multiple variables, and it is still linear. Yes, the multivariable hypothesis takes the form:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

Here’s an example:

In detail, this hypothesis predicts the house price, measured in thousands: the base price of a house might be 80 w, plus 1000 yuan per square meter; the price then keeps rising with the number of floors and with the number of bedrooms, but falls as the house ages and depreciates. Looked at this way, the example is quite reasonable!

But Mi, being a bit obsessive-compulsive, wants a more uniform form. For convenience, we can define $x_0 = 1$. Concretely, this means every sample $i$ has a feature vector $x^{(i)}$ with $x_0^{(i)} = 1$; you can think of it as defining an extra zeroth feature. So now the feature vector $x$ is an $(n+1)$-dimensional vector indexed from 0. The parameter $\theta$ is likewise an $(n+1)$-dimensional vector indexed from 0.

Thus the hypothesis simplifies to $h_\theta(x) = \theta^T x$, where the superscript $T$ denotes the transpose. Clever, isn't it? I really have to praise $x_0$ here: it is what lets us write the hypothesis in this very concise form. This is the form of the hypothesis when there are multiple features, and it is called multivariable linear regression. "Multivariable" means that we use multiple features, or variables, to predict $y$.
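As a rough sketch of how the compact form $\theta^T x$ with $x_0 = 1$ might look in NumPy: the helper names `add_intercept` and `hypothesis` and the numbers below are assumptions made for illustration, not part of the original article.

```python
import numpy as np

def add_intercept(X):
    """Prepend the column x_0 = 1 to an (m, n) feature matrix."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

def hypothesis(X, theta):
    """Evaluate h_theta(x) = theta^T x for every row of X at once."""
    return X @ theta

# Made-up example with n = 2 features and m = 2 samples.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
theta = np.array([0.5, 1.0, -1.0])   # theta_0, theta_1, theta_2

X = add_intercept(X_raw)             # shape (2, 3), first column all ones
predictions = hypothesis(X, theta)   # one prediction per sample
```

Because the column of ones is stored with the features, $\theta_0$ needs no special treatment: a single matrix product covers all $n+1$ terms.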

2. Multivariable gradient descent

Ok, so we have the form of the multivariable linear regression hypothesis, which means we have the ingredients, and we’re ready to start cooking!

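For the multivariable hypothesis we have, by convention, set $x_0 = 1$. The parameters of this model are $\theta_0, \theta_1, \ldots, \theta_n$, but instead of thinking of them as $n+1$ separate parameters, think of them as a single $(n+1)$-dimensional parameter vector $\theta$. The cost function $J(\theta)$ is then expressed as:

$$J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$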

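Our goal, as in the univariate linear regression problem, is to find the set of parameters that minimizes the cost function. The batch gradient descent algorithm for multivariable linear regression is:

$$\text{repeat} \; \left\{ \; \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \ldots, \theta_n) \; \right\}$$

That is:

$$\text{repeat} \; \left\{ \; \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \; \right\}$$

After taking the derivative, we get:

$$\text{repeat} \; \left\{ \; \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(simultaneously for } j = 0, 1, \ldots, n \text{)} \; \right\}$$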

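We keep updating every parameter $\theta_j$: the new $\theta_j$ is the old $\theta_j$ minus the learning rate $\alpha$ times the derivative term. Written out, the update is $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$. This gives a workable gradient descent method for multivariable linear regression.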
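The update rule can be sketched in vectorized NumPy roughly as follows. This is a minimal illustration, not the article's own implementation: the function name `gradient_descent`, the 1-D array shapes, and the tiny data set (generated from $y = 1 + 2x_1$) are all assumptions.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for linear regression.

    X: (m, n+1) with first column equal to 1, y: (m,), theta: (n+1,).
    Returns the final theta and the cost J(theta) after each iteration.
    """
    m = len(y)
    theta = theta.astype(float).copy()
    cost_history = []
    for _ in range(num_iters):
        error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every sample
        gradient = (X.T @ error) / m       # partial derivative for every theta_j
        theta = theta - alpha * gradient   # simultaneous update of all theta_j
        cost_history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return theta, cost_history

# Made-up data generated by y = 1 + 2 * x_1, with x_0 = 1 already included.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta, cost_history = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=2000)
```

Computing the whole gradient before touching `theta` is what makes the update simultaneous: no $\theta_j$ is updated using an already-modified value of another parameter.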

Code examples:

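Calculate the cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, where $h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$.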

Python code:

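import numpy as np

def computeCost(X, y, theta):
    # X: (m, n+1), y: (m, 1), theta: (1, n+1); @ is a matrix product for arrays and np.matrix alike
    inner = np.power((X @ theta.T) - y, 2)
    return np.sum(inner) / (2 * len(X))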

3. Gradient descent practice – feature scaling

Next, we will learn some practical techniques for gradient descent. First, we will introduce a method called feature scaling, which is as follows:

If you have a machine learning problem with multiple features, and you can make sure the features take values in similar ranges, then gradient descent will converge faster. Specifically, suppose you have a problem with two features, where $x_1$ is the size of the house, with values between 0 and 2000, and $x_2$ is the number of bedrooms, perhaps between 1 and 5. If you plot the contours of the cost function $J(\theta)$, they look something like this. $J(\theta)$ is a function of the parameters $\theta_0$, $\theta_1$ and $\theta_2$, but I'm going to ignore $\theta_0$ here and treat it as a function of $\theta_1$ and $\theta_2$ only. If the range of $x_1$ is much greater than that of $x_2$, the contours of $J(\theta)$ take on a very skewed, elliptical shape, and the 2000-to-5 ratio makes the ellipses even more elongated. So the contour plot of $J(\theta)$ consists of very tall, thin ellipses. If you run gradient descent on this kind of cost function, it may oscillate back and forth and take a long time to converge to the global minimum.

In fact, as you can imagine, if these contours were even more exaggerated, thinner and longer still, gradient descent would be even slower: it would oscillate back and forth for longer before finding its way to the global minimum. In such cases, an effective remedy is feature scaling. Specifically, define the feature $x_1$ as the size of the house divided by 2000, and $x_2$ as the number of bedrooms divided by 5. The contours of the cost function $J(\theta)$ then become much less skewed and look more like circles. If you run gradient descent on a cost function like this, it can be shown mathematically that it finds a much more direct path to the global minimum, instead of following a convoluted one. So, by scaling the features so that their ranges are close (in this case, both $x_1$ and $x_2$ end up between 0 and 1), you get gradient descent that converges faster.
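The scaling described here, dividing the size by 2000 and the number of bedrooms by 5, takes only a line of NumPy. The house data below is made up for illustration.

```python
import numpy as np

# Made-up houses: column 0 = size (0 to 2000), column 1 = bedrooms (1 to 5).
X = np.array([[2000.0, 5.0],
              [1000.0, 2.0],
              [500.0, 1.0]])

scale = np.array([2000.0, 5.0])   # the divisors used in the text
X_scaled = X / scale              # broadcasting divides each column by its own divisor
```

After this step both columns lie between 0 and 1, so neither feature dominates the shape of the cost function's contours.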

More generally, when we perform feature scaling, the usual goal is to bring every feature into roughly the range -1 to +1. The feature $x_0$ is always equal to 1, so it is already in this range, but other features may need to be divided by different numbers to get there. The exact numbers -1 and +1 are not so important. If you have a feature $x_1$ between 0 and 3, that is fine. If you have another feature $x_2$ between -2 and +0.5, that is also close enough to -1 and +1, so it is fine too. But if you have another feature, say $x_3$, between -100 and +100, its range is very different from -1 to +1, so it is probably a poorly scaled feature. Similarly, if a feature $x_4$ takes values in a very, very small range, like -0.0001 to +0.0001, that is again far smaller than -1 to +1, and I would also call it poorly scaled.

So the range you accept can be somewhat larger or smaller than -1 to +1, just not too large, like 100, and not too small, like the 0.0001 above. Different people have different rules of thumb. The way I usually think about it: if a feature ranges from -3 to +3, that is acceptable, but if it goes well beyond that I start paying attention. If it ranges from -1/3 to +1/3, that is fine too, as is 0 to 1/3 or -1/3 to 0; all of these typical ranges are acceptable to me. But if the range is much smaller than that, like $x_4$ above, I start paying attention as well. In general, don't worry too much about whether your features are in exactly the same range; as long as they are close enough, gradient descent will work fine.

Besides dividing a feature by its maximum, feature scaling sometimes also includes mean normalization: for a feature $x_i$, replace it with $x_i - \mu_i$ so that the feature has approximately zero mean. Obviously we do not apply this step to $x_0$, because it is always equal to 1 and so cannot have zero mean. But for other features, such as the size of the house, which lies between 0 and 2000, if the average size is 1000 you can use

$$x_1 = \frac{\text{size} - 1000}{2000}$$

Similarly, if the houses have between 1 and 5 bedrooms and the average house has 2 bedrooms, you can normalize the second feature with

$$x_2 = \frac{\text{bedrooms} - 2}{5}$$

In both cases the new features $x_1$ and $x_2$ range roughly from -0.5 to +0.5. That is not exactly true ($x_2$ can be slightly above 0.5), but it is close.

More generally, you can replace $x_1$ with $\frac{x_1 - \mu_1}{s_1}$, where $\mu_1$ is the average of feature $x_1$ over the training set and $s_1$ is the range of that feature, meaning the maximum minus the minimum. For those who know about standard deviation, setting $s_1$ to the standard deviation of the feature works too, but the maximum minus the minimum is fine. Similarly, for the second feature $x_2$ you subtract the average and divide by the range. These formulas may not put your features in exactly the range above, but in something close to it. By the way, the features only need to end up in similar ranges; feature scaling does not need to be precise. Its only purpose is to make gradient descent run faster, with fewer iterations needed to converge.
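Mean normalization, $(x - \mu) / s$ with $s$ taken as either the range or the standard deviation, could be sketched like this. The function name `mean_normalize`, its `use_std` flag, and the data are hypothetical, chosen only for illustration.

```python
import numpy as np

def mean_normalize(X, use_std=False):
    """Return (X - mu) / s per column, along with mu and s.

    s is the range (max - min) by default, or the standard deviation if use_std is True.
    Do not pass the constant column x_0 = 1: its range is 0, which would divide by zero.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# Made-up houses: column 0 = size, column 1 = bedrooms.
X = np.array([[0.0, 1.0],
              [1000.0, 3.0],
              [2000.0, 5.0]])

X_norm, mu, s = mean_normalize(X)
```

Returning `mu` and `s` matters in practice: a new house has to be transformed with the same values before the learned parameters are applied to it.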

4. Gradient descent practice – learning rate

Another technique that makes gradient descent work better in practice concerns the learning rate $\alpha$. Specifically, looking at the update rule of the gradient descent algorithm, there are two things we need to know: how to debug it, with a few tricks for making sure gradient descent is working, and how to choose the learning rate $\alpha$. What I usually do first is make sure gradient descent is working correctly. The job of gradient descent is to find a value of $\theta$ that, hopefully, minimizes the cost function $J(\theta)$. So it is common to plot the value of the cost function $J(\theta)$ while the gradient descent algorithm runs.

The x-axis here is the number of iterations of gradient descent. As the algorithm runs, you might get a curve like this one. Note that the x-axis is the number of iterations; in the plots of $J(\theta)$ we drew earlier, the x-axis was the parameter vector $\theta$, which is different from the current plot. Specifically, the first red dot means: run gradient descent for 100 iterations, take whatever $\theta$ you have at that point, and evaluate $J(\theta)$. The second red dot is the value of $J(\theta)$ for the $\theta$ obtained after 200 iterations. So the curve shows the value of the cost function after each iteration of gradient descent. If gradient descent is working properly, $J(\theta)$ should go down after every iteration. The curve also tells you that between 300 and 400 iterations $J(\theta)$ hardly goes down any more, so by 400 iterations it looks quite flat: gradient descent has more or less converged. In this way the curve helps you judge whether gradient descent has converged.

By the way, the number of iterations gradient descent needs can vary widely from problem to problem. For one problem it may converge in 30 iterations; for another it may take 3,000, or 3,000,000, or more. It is really hard to tell in advance how many iterations convergence will take. You can also use an automatic convergence test, that is, have an algorithm tell you whether gradient descent has converged.

The number of iterations gradient descent needs to converge varies from model to model and cannot be predicted in advance. We can plot the cost function against the number of iterations to see when the algorithm is converging.

There are also automatic convergence tests: for example, declare convergence if the cost function decreases by less than some threshold (e.g., 0.001) in one iteration. But choosing a suitable threshold is very difficult, so to check whether gradient descent has converged, people tend to look at the plot rather than rely on an automatic convergence test.
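One possible form of such an automatic test is sketched below. The function name `has_converged` and the choice to compare only the last two recorded costs are assumptions; the default threshold of 0.001 is the example value from the text.

```python
def has_converged(cost_history, threshold=1e-3):
    """Return True if the last iteration lowered the cost by less than `threshold`.

    An increase in cost is reported as not converged, since it signals a problem
    (for example a learning rate that is too large) rather than convergence.
    """
    if len(cost_history) < 2:
        return False
    decrease = cost_history[-2] - cost_history[-1]
    return 0 <= decrease < threshold
```

As the text warns, the result depends heavily on the threshold, so a check like this is best read alongside the plot of the cost curve, not instead of it.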

In addition, this plot can give early warning that the algorithm is not working properly. Specifically, if the curve of the cost function $J(\theta)$ against the number of iterations is actually going up, gradient descent is not working properly, and such a curve usually means I should use a smaller learning rate. Every iteration of gradient descent is affected by the learning rate. If the learning rate is too small, a very large number of iterations is needed and convergence is slow. If it is too large, an iteration may fail to reduce the cost function and may overshoot the minimum, so the algorithm fails to converge. So if we plot the cost function, we can see exactly what is going on.

Specifically, we can usually try a series of $\alpha$ values:

For example: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, … Each value is roughly three times the previous one. Then, for each of these $\alpha$ values, plot $J(\theta)$ against the number of iterations, and choose the $\alpha$ that makes $J(\theta)$ go down fastest.
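A sweep over these $\alpha$ values might look like the following sketch. The helper `cost_curve`, the tiny data set (generated from $y = 1 + 2x_1$), and the choice of 50 iterations are assumptions for illustration. Comparing the final cost is used here as a simple stand-in for inspecting the curves by eye.

```python
import numpy as np

def cost_curve(X, y, alpha, num_iters=50):
    """Run batch gradient descent from theta = 0 and return J(theta) after each iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return costs

# Made-up data generated by y = 1 + 2 * x_1, with x_0 = 1 already included.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
final_costs = {alpha: cost_curve(X, y, alpha)[-1] for alpha in alphas}
best_alpha = min(final_costs, key=final_costs.get)
```

On this toy data the cost grows at every step for $\alpha = 1$, while the smallest values barely move in 50 iterations. The lists returned by `cost_curve` can be passed to any plotting library to draw the curves described in the text.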
