Reproduced from: https://blog.csdn.net/zouxy09/article/details/24971995

Today we're going to talk about a very common problem in machine learning: overfitting, and its antidote, regularization. We'll first take a brief look at the commonly used L0, L1, L2, and nuclear norm regularizers, and finally discuss how to choose the regularization parameter. So that the length doesn't scare you off, I have split the five parts across two posts. My knowledge is limited; what follows are some of my shallow views, and if I have misunderstood anything, please don't hesitate to correct me. Thank you.

Supervised machine learning is nothing more than "minimize your error while regularizing your parameters". Minimizing the error makes our model fit the training data, while regularizing the parameters prevents the model from overfitting that data. What a philosophy of simplicity! Too many parameters increase the complexity of the model and make it easy to overfit; in other words, the training error will be very small. But a small training error is not our ultimate goal. Our goal is to minimize the test error of the model, that is, to predict new samples accurately. So we want to minimize the training error while keeping the model "simple", so that the learned parameters generalize well (that is, the test error is also small), and this "simplicity" of the model is achieved through the regularization term. The regularization term can also be used to constrain the characteristics of the model: it lets us inject prior knowledge into learning, forcing the learned model to have desired properties such as sparsity, low rank, or smoothness. You know, sometimes people's priors are very important. Previous experience lets you avoid many detours; that is why we always try to find a great mentor, whose single remark can clear away the clouds in front of us and show us a clear sky. The same is true of machine learning: with a little prodding, it can certainly learn its task faster. But since there is no direct way to talk to a machine, for now the regularization term is the medium that carries the message.

There are several ways to look at regularization. Regularization conforms to the principle of Occam's razor. What an aggressive name, Razor! But the idea is approachable: among all possible models, we should choose the one that explains the data well and is also simple. From the perspective of Bayesian estimation, the regularization term corresponds to a prior probability on the model. There is also the view that regularization implements the strategy of structural risk minimization, which adds a regularizer or penalty term to the empirical risk.

Generally speaking, supervised learning can be seen as minimizing the following objective function:

$$w^* = \arg\min_w \sum_i L\big(y_i, f(x_i; w)\big) + \lambda\,\Omega(w)$$

Here the first term $L(y_i, f(x_i; w))$ measures the error between our model's prediction $f(x_i; w)$ for the i-th sample (classification or regression) and the true label $y_i$. Since the model has to fit the training samples, we want this term to be small, i.e., we want the model to fit the training data as well as possible. But, as said above, we do not only want a small training error; we also want a small test error, so we add the second term, the regularization function $\Omega(w)$ of the parameters $w$, to constrain the model to be as simple as possible.

OK, now, if you've been fighting machine learning for years, you'll find that, wow, most parametric machine learning models not only look like this, they are exactly this: just variations of these two terms. For the first term, the loss function: if it is the square loss, we get least squares; if it is the hinge loss, we get the famous SVM; if it is the exp-loss, we get Boosting; if it is the log-loss, we get logistic regression; and so on. Different loss functions have different fitting characteristics and should be chosen according to the problem at hand. But here we won't dwell on the loss function; let's focus on the regularization term $\Omega(w)$.
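To make this two-term structure concrete, here is a minimal sketch (my own, not from the original post; it assumes NumPy and uses labels y in {-1, +1} with a linear score) of the loss functions just named, plus the generic "loss + lambda * regularizer" objective:

```python
import numpy as np

# Common losses, written for labels y in {-1, +1} and a linear score s = X @ w.
def square_loss(y, s):   # least squares
    return (y - s) ** 2

def hinge_loss(y, s):    # SVM
    return np.maximum(0.0, 1.0 - y * s)

def exp_loss(y, s):      # Boosting
    return np.exp(-y * s)

def log_loss(y, s):      # logistic regression
    return np.log1p(np.exp(-y * s))

# Example regularizers, the Omega(w) of the text:
l1 = lambda w: np.abs(w).sum()      # L1 norm
l2_sq = lambda w: (w ** 2).sum()    # squared L2 norm

def objective(w, X, y, loss, penalty, lam):
    """Generic supervised objective: sum of losses plus lam * regularizer."""
    return loss(y, X @ w).sum() + lam * penalty(w)
```

Swapping `loss` and `penalty` here reproduces the whole family of models listed above.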

The regularization function $\Omega(w)$ also comes in many flavors. It is generally a monotonically increasing function of model complexity: the more complex the model, the larger the regularization value. For example, the regularization term can be a norm of the model's parameter vector. Different choices constrain $w$ differently and achieve different effects. The common ones in papers are the zero norm, one norm, two norm, trace norm, Frobenius norm, nuclear norm, and so on. All these norms, what do they mean? What can they do? When should they be used? No hurry; below we pick a few common ones to talk about.

First, the L0 norm and the L1 norm

The L0 norm of a vector is its number of non-zero elements. If we regularize a parameter matrix W with the L0 norm, we are asking for most elements of W to be 0. That's very intuitive, very blatant: in other words, make the parameters W sparse. OK, seeing the word "sparse", we should perk up: this is exactly how the currently popular "compressed sensing" and "sparse coding" achieve their "sparsity". But now you start to wonder, don't you? In the world of papers, isn't sparsity always achieved with the L1 norm? Isn't $\|w\|_1$ everywhere you look? Almost everywhere, yes. That's why we put L0 and L1 together: they have an unusual relationship. So what is the L1 norm? Why can it produce sparsity? And why does everyone use the L1 norm rather than the L0 norm to achieve sparsity?

The L1 norm is the sum of the absolute values of the elements of a vector; as a regularizer it is also known as the lasso. Now let's analyze the million-dollar question: why does the L1 norm make weights sparse? One answer you might hear is: "it is the optimal convex approximation of the L0 norm." In fact, there is a more direct answer: any regularizer that is non-differentiable at $w_i = 0$ and can be decomposed into a sum over coordinates can achieve sparsity. This applies here: the L1 norm is a sum of absolute values, and $|w|$ is not differentiable at $w = 0$, but that alone is not very intuitive. We need a comparison with the L2 norm, so for an intuitive picture of the L1 norm, see section 2 below.

Oh, and here is another question: if L0 gives sparsity, why do we use L1 instead of L0? My personal understanding: first, optimizing the L0 norm is hard (an NP-hard problem); second, the L1 norm is the optimal convex approximation of the L0 norm and is much easier to optimize. That's why everyone has turned their attention to the L1 norm.

OK, to sum up in one sentence: both the L1 norm and the L0 norm can achieve sparsity, and L1 is widely used because it is much easier to optimize than L0.
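An aside of my own (not from the original post; it assumes NumPy): the non-differentiability at zero shows up very concretely in the proximal operators of the two penalties. The L1 prox is soft-thresholding, which maps small weights to exactly zero, while the squared-L2 prox only shrinks weights and never zeroes them:

```python
import numpy as np

def prox_l1(w, t):
    """Soft-thresholding: argmin_x 0.5*(x - w)^2 + t*|x|, per coordinate."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_l2_sq(w, t):
    """Shrinkage: argmin_x 0.5*(x - w)^2 + (t/2)*x^2, per coordinate."""
    return w / (1.0 + t)

w = np.array([-0.3, 0.05, 0.8, -0.02, 1.5])
print(prox_l1(w, 0.1))     # the small entries become exactly 0
print(prox_l2_sq(w, 0.1))  # every entry just shrinks a little
```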

Okay, so at this point we kind of know that L1 can induce sparsity. But we wonder: why sparsity? What is the advantage of making our parameters sparse? Here are two points:

1) Feature Selection:

One key reason sparse regularization is so popular is that it performs automatic feature selection. In general, most of the elements of $x_i$ (the features) have nothing to do with the final output $y_i$, or provide no information about it. When minimizing the objective function, taking these extra features into account may yield a smaller training error, but at prediction time this useless information interferes with the correct prediction of $y_i$. The sparse regularizer takes on the glorious mission of automatic feature selection: it learns to remove the uninformative features, that is, to reset their corresponding weights to 0.

2) Interpretability:

Another reason to favour sparsity is that sparse models are easier to interpret. For example, say the probability of getting a certain disease is y, and the data we collect, x, is 1000-dimensional; we want to find out how these 1000 factors affect the probability of the disease. Suppose we have a regression model: y = w1*x1 + w2*x2 + … + w1000*x1000 + b (of course, to limit y to [0,1], we usually wrap this in a logistic function). If after learning only a few elements of w are non-zero, say only 5 of the wi, then we have reason to believe that the corresponding features provide crucial, decisive information for analyzing the disease. In other words, the disease is related only to these 5 factors, and the doctor's job gets much easier. But if all 1000 of the wi are non-zero, the doctor will feel exhausted facing 1000 factors.
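A quick sanity check of this story (my own sketch; it assumes scikit-learn, and the sample sizes and alpha are invented for illustration): fit a lasso on synthetic 1000-dimensional data in which only 5 features actually matter, then count the surviving weights:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 5               # 200 samples, 1000 features, 5 informative
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:k] = rng.uniform(1.0, 3.0, size=k)
y = X @ w_true + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)
print("non-zero weights:", np.count_nonzero(model.coef_))  # typically close to 5
```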

Second, the L2 norm

Besides the L1 norm, there is an even more favored regularization norm: the L2 norm, $\|w\|_2$. It is in no way inferior to the L1 norm, and it goes by two nice names: in regression, some call regression with it "ridge regression", and others call it "weight decay". It gets a lot of use because it can alleviate the single most important problem in machine learning: overfitting. As for what overfitting is, it was explained above: the model's training error is very small, but its test error is large. That is, our model is complex enough to fit every training sample, yet it is terrible at predicting new samples. Colloquially, it is great at taking exams but poor at real work: good at reciting knowledge, but unable to apply it flexibly. For example, as shown below (from Ng's course):

The top row is linear regression, and the bottom row is logistic regression, i.e., classification. From left to right are three cases: underfitting (also called high bias), appropriate fitting, and overfitting (also called high variance). As you can see, if the model is complex enough (it can fit any function), it can fit all the data points with essentially no error. For regression, the function curve passes through every data point, as on the right. For classification, the decision boundary classifies every data point correctly, as on the right. These two cases are clearly overfitting.

OK, so that brings us to our crucial question: why does the L2 norm prevent overfitting? Before answering, let's look at what the L2 norm is.

The L2 norm is the square root of the sum of the squares of a vector's elements. When we minimize the L2 regularizer $\|w\|_2$, we make each element of $w$ small, close to zero; but unlike the L1 norm, it will not make them exactly zero, only close to zero, and that is a big difference. Smaller parameters mean a simpler model, and a simpler model is less likely to overfit. Why do smaller parameters mean a simpler model? My understanding is that limiting the parameters to be small effectively limits the influence of some components of the polynomial to be small (see the linear regression fitting figures above), which is something like reducing the number of parameters. Actually, I don't fully understand this either; I hope someone can give pointers below.

Here we can also sum up in a sentence: through the L2 norm we can constrain the model space, and thus avoid overfitting to some extent.
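To see the "small but not zero" behavior in numbers (again my own sketch, assuming scikit-learn; the data and the alphas are invented): with ridge regression, increasing the regularization strength shrinks the weights toward zero, but essentially none of them becomes exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(100)

for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:7.2f}  max|w|={np.abs(coef).max():.4f}  "
          f"exact zeros={np.count_nonzero(coef == 0)}")
# max|w| shrinks as alpha grows, while the count of exact zeros stays at 0.
```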

What are the benefits of the L2 norm? Here are two points:

1) Perspective of learning theory:

From the perspective of learning theory, the L2 norm can prevent overfitting and improve a model's generalization ability.

2) Optimization perspective:

From the point of view of optimization or numerical computation, the L2 norm helps to handle the difficulty of matrix inversion when the condition number is bad. Eh, what is a condition number? Let me Google it first.

Here let's try to talk about optimization in a slightly elegant way. Optimization has two big problems: one is local minima, the other is ill-conditioning. I won't say much about the former; we all understand that we want the global minimum, and if there are too many local minima, our optimization algorithm easily falls into one and cannot extricate itself, which is obviously not the story the audience wants to see. So let's talk about ill-conditioning. Ill-conditioned is the opposite of well-conditioned. What do they mean? Suppose we have a system of equations Ax = b and we need to solve for x. If a slight change in A or b makes the solution x change greatly, then the system is ill-conditioned, and vice versa. Let's look at a concrete example:

Look at the left one first. Suppose the first row is our original Ax = b; in the second row we change b slightly, and the resulting x is very different from before, as you can see. In the third row we change the coefficient matrix A slightly, and again the solution changes a lot. In other words, this system's solution is too sensitive to the coefficient matrix A and to b. Since A and b are usually estimated from experimental data, they carry errors. If our system could tolerate those errors, fine; but the system is so sensitive to them that the error in the solution gets amplified, making the solution unreliable. So this system of equations is ill-conditioned: abnormal, unstable, problematic, haha. The one on the right is a well-conditioned system.
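The original example here is an image; as a stand-in (my own numbers, assuming NumPy), here is a classic nearly-singular system where a tiny change in b flips the solution completely:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])
print(np.linalg.solve(A, b))    # roughly [1., 1.]

b2 = np.array([2.0, 2.0002])    # change one entry of b by just 0.0001...
print(np.linalg.solve(A, b2))   # roughly [0., 2.]: the solution jumps completely

print(np.linalg.cond(A))        # condition number around 4e4: ill-conditioned
```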

For an ill-conditioned system, a small change in the input produces a large change in the output. That's bad; it means the system is not practical. Think about it: for a regression problem y = f(x), we use training samples x to train the model f so that y produces the desired values, say 0. If we then encounter a sample x' that differs only slightly from the training samples, the system should output something close to the y above, say 0.00001; but it gives me 0.9999, which is clearly wrong. It's as if a person you know very well grows a pimple on his face and you suddenly no longer recognize him; your brain is broken, haha. So if a system is ill-conditioned, we have doubts about its results. How much should we believe it? We need a yardstick, because some systems are not so sick that their results can't be trusted; one size doesn't fit all. The condition number is this yardstick for ill-conditioning: it measures how much the output changes when the input changes slightly, i.e., the system's sensitivity to small perturbations. If the condition number is small, the system is well-conditioned; if it is large, the system is ill-conditioned.

If a square matrix A is non-singular, then the condition number of A is defined as:

$$\kappa(A) = \|A\|\,\|A^{-1}\|$$

That is, the norm of A times the norm of its inverse, so its exact value depends on which norm you choose. If the square matrix A is singular, its condition number is positive infinity. In fact, every invertible square matrix has a condition number; to compute it in practice we need to know the norm of the matrix and the machine epsilon. Why a norm? A norm measures the size of a matrix. Matrices have no single "size", but when we ask how much matrix A or vector b changes and how much the solution x changes in response, there has to be something that measures the magnitude of matrices and the length of vectors. Right: the norm. OK, after a fairly simple derivation, for Ax = b we can get the following conclusion (here for a perturbation of b; a similar bound holds for perturbations of A):

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A)\,\frac{\|\delta b\|}{\|b\|}$$

That is, the relative change in the solution x is bounded by the relative change in A or b, with the condition number κ(A) as the multiplier. See? It is the bound on how much x can change.

Condition number in one sentence: the condition number measures the stability, or sensitivity, of a matrix (or of the linear system it describes). If a matrix's condition number is near 1, it is well-conditioned; if it is far greater than 1, it is ill-conditioned. And if a system is ill-conditioned, don't trust its output too much.

Well, that was a lot of words for one concept. Why were we talking about this again? To return to the opening sentence: from the perspective of optimization or numerical computation, the L2 norm helps handle the difficulty of matrix inversion when the condition number is bad. Because if the objective function is quadratic, as in linear regression with square loss, there is actually an analytic solution: taking the derivative and setting it to zero gives the optimum:

$$w^* = (X^T X)^{-1} X^T y$$

However, if the number of samples is smaller than the dimension of each sample, the matrix $X^T X$ will not be full rank, i.e., $X^T X$ is not invertible, so $w^*$ cannot be computed directly. Or, more precisely, there will be infinitely many solutions (because the number of equations is smaller than the number of unknowns). In other words, our data is not enough to determine a solution, and if we pick one at random from all the feasible solutions, it is probably not a truly good one. In short, we overfit.

But if we add the L2 regularization term, we get the following, and the matrix becomes directly invertible:

$$w^* = (X^T X + \lambda I)^{-1} X^T y$$

Here, to get the solution we usually do not invert the matrix directly but solve a linear system instead (e.g., by Gaussian elimination). Without the regularization term, i.e., with λ = 0, if the condition number of $X^T X$ is very large, solving the linear system will be numerically quite unstable, and introducing the regularization term improves the condition number.
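A small numeric check of this (my sketch, assuming NumPy; the shapes and lambda are arbitrary), showing how the ridge term repairs the conditioning of the normal equations:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 200))   # fewer samples than dimensions
y = rng.standard_normal(50)

G = X.T @ X                          # 200x200 but rank <= 50: singular
print(np.linalg.cond(G))             # astronomically large (effectively infinite)

lam = 0.1
G_reg = G + lam * np.eye(200)        # ridge: X^T X + lambda * I
print(np.linalg.cond(G_reg))         # now a modest, finite number

w = np.linalg.solve(G_reg, X.T @ y)  # stable linear solve, no explicit inverse
```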

In addition, if we use an iterative optimization algorithm, a large condition number still causes trouble: it slows down the convergence of the iterations. From the optimization point of view, the regularization term actually turns the objective function λ-strongly convex. Oh, and here appears another term, λ-strongly convex. What is λ-strong convexity?

When f satisfies:

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\lambda}{2}\|y - x\|^2 \quad \text{for all } x, y,$$

we call f a λ-strongly convex function, where the parameter λ > 0. When λ = 0, this falls back to the definition of an ordinary convex function.

Before we get to strong convexity visually, let's look at what ordinary convexity looks like. Suppose we take the first-order Taylor approximation of f at the point a:

$$f(x) = f(a) + \nabla f(a)^T (x - a) + o(\|x - a\|).$$

Intuitively, convexity means that the function curve lies above its tangent line, i.e., its linear approximation, at every point, while strong convexity further requires the curve to lie above a quadratic function at each point. That is, the function must not be too "flat"; it must bend upward at a certain rate. Specifically, a convex function is only guaranteed to lie above its first-order Taylor approximation at every point, whereas a strongly convex function is guaranteed to have a very beautiful quadratic lower bound everywhere. That is of course a strong assumption, but also an important one. It may be hard to picture, so let's draw it.

One look at the picture above and it all becomes clear; I hardly need to say anything. But let me be verbose anyway. Take our optimal solution w*. If our function f(w), the red function in the left figure, lies above the blue dashed quadratic, then even when wt is close to w*, the values f(wt) and f(w*) differ noticeably. This guarantees a sizeable gradient near the optimum w*, so we can reach w* within a relatively small number of iterations. In the right figure, however, the red f(w) is only constrained to lie above a linear blue dashed line. In the unfortunate (very flat) case drawn there, the approximate gradient (f(wt) − f(w*))/(wt − w*) is already tiny while wt is still far from w*, and the exact gradient ∂f/∂w at wt is even smaller. So gradient descent, $w_{t+1} = w_t - \alpha\,\partial f/\partial w$, changes w extremely slowly, crawling like a snail toward the optimum, and within a finite number of iterations it remains very far from w*.

So convexity alone does not guarantee that, under gradient descent with a finite number of iterations, the point w we reach will be a good approximation of the global minimum w*. (By the way, stopping the iterations near, rather than at, the optimum is actually itself a way of regularizing and improving generalization.) As analyzed above, if f(w) is very flat around the global minimum w*, we may end up at a point far away. But with strong convexity we have a handle on the situation: we can get a bound on how good the approximation is, and how good that bound is depends on the size of the constant λ in the strong-convexity condition. Do you see the cleverness here? If we want the objective to be strongly convex, what do we do? The simplest way is to add a term $(\lambda/2)\|w\|^2$ to it.
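To make the snail story concrete, here is a toy experiment of my own (the flat function, step size, and iteration count are all invented for illustration): gradient descent on a convex function that is very flat at its minimum, with and without an added (λ/2)·w² term:

```python
def gd(grad, w0, lr=0.1, steps=200):
    """Plain gradient descent; returns the final iterate."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

lam = 0.5
flat_grad   = lambda w: 4 * w**3            # f(w) = w^4: convex but flat near w* = 0
strong_grad = lambda w: 4 * w**3 + lam * w  # f(w) + (lam/2)*w^2: strongly convex

print(gd(flat_grad, 1.0))    # still visibly far from the optimum w* = 0
print(gd(strong_grad, 1.0))  # essentially at 0 after the same 200 steps
```

In this toy case both functions share the same minimizer w* = 0, so the added quadratic changes only the curvature near the optimum, not the answer itself.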

Well, that took a lot of space just to say the word "convex". In fact, in gradient descent, the upper bound on the convergence rate of the objective function is related to the condition number of the matrix $X^T X$: the smaller the condition number of $X^T X$, the smaller the upper bound, i.e., the faster the convergence.

This one optimization topic went on for quite a while. To sum up: the L2 norm not only prevents overfitting but also makes the optimization stable and fast.

OK, so let's make good on the earlier promise and talk about the difference between L1 and L2. Why does minimizing the absolute value versus minimizing the square make such a big difference? I have seen two geometrically intuitive explanations:

1) Descent speed:

We know that both L1 and L2 are regularizers: we add the L1 or L2 norm of the weights to the cost function, and the model then tries to minimize these weight parameters. This minimization is like going downhill. The difference between L1 and L2 lies in the "slope": L1 descends along the slope of the absolute-value function, while L2 descends along the slope of the quadratic function, as shown in the figure below. Near 0, L1 therefore descends faster than L2, so it drives weights down to 0 very quickly (a tiny numeric check follows below). Though I suspect this explanation is not quite on point; of course, maybe the problem is my own understanding.

L1 is known as the lasso, L2 as ridge. But the names are confusing: looking at the figure above, the lasso plot actually looks like a ridge, and the ridge plot looks like a lasso.
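Numerically, the point is simple (my own illustration, assuming NumPy): the gradient of the L1 penalty pushes a weight toward zero with constant force, while the pull of the L2 penalty fades as the weight shrinks, which is why only L1 drives weights all the way to zero:

```python
import numpy as np

w = np.array([1.0, 0.1, 0.01, 0.001])
print(np.sign(w))  # gradient of |w|:  [1. 1. 1. 1.], a constant push toward 0
print(2 * w)       # gradient of w^2:  [2. 0.2 0.02 0.002], fading away near 0
```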

2) Limiting the model space:

In fact, the L1- and L2-regularized cost functions can be rewritten in the following constrained form:

$$\min_w \sum_i L\big(y_i, f(x_i; w)\big) \quad \text{s.t.} \quad \|w\|_1 \le C$$

(and likewise with $\|w\|_2 \le C$ for L2).

That is, we limit the model space to an L1-ball of radius C around the origin. For visualization, consider the two-dimensional case: on the (w1, w2) plane we can draw the contour lines of the objective function, and the constraint becomes a norm ball of radius C in the plane. The point where the contour lines first touch the norm ball is the optimal solution:

It can be seen that the difference between the L1-ball and the L2-ball is that the L1-ball has "corners" where it meets each coordinate axis, and the contour lines of the objective function will mostly touch the ball at a corner unless they are positioned exceptionally well. Note that the corner positions produce sparsity, e.g., the touching point in the figure has w1 = 0. In higher dimensions (can you picture what a three-dimensional L1-ball looks like?) there are, besides corner points, also many edges and faces of the polytope; they too have a high probability of being the first contact point, and they too produce sparsity.

The L2-ball, in contrast, has no such property: since it has no corners, the probability that the first contact happens at a sparse location is very small. This intuitively explains why L1 regularization produces sparsity while L2 regularization does not.
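Here is a little brute-force check of this geometry (my own sketch, assuming NumPy; the quadratic objective and the radius are invented): minimize the same objective over the boundary of an L1 ball and of an L2 ball, and look where the minimizers land:

```python
import numpy as np

# Quadratic objective whose unconstrained optimum (2, 0.5) lies outside the balls.
f = lambda w: (w[0] - 2.0) ** 2 + (w[1] - 0.5) ** 2

theta = np.linspace(0, 2 * np.pi, 100000)
u = np.stack([np.cos(theta), np.sin(theta)])  # directions in the plane

w_l2 = u                              # points on the L2 unit circle
w_l1 = u / np.abs(u).sum(axis=0)      # points on the L1 unit "diamond"

best_l2 = w_l2[:, np.argmin(f(w_l2))]
best_l1 = w_l1[:, np.argmin(f(w_l1))]
print("L2-ball minimizer:", best_l2)  # both coordinates non-zero
print("L1-ball minimizer:", best_l1)  # the corner (1, 0): w2 is exactly 0
```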

So, in a nutshell: L1 tends to select a small number of features and set all the others exactly to zero, while L2 keeps all the features, with weights that are all close to zero. The lasso is therefore very useful for feature selection, whereas ridge is just plain regularization.


OK, that's all for now. In the next post we'll talk about the nuclear norm and the selection of the regularization parameter. The references for the whole article are also given in the next post and are not repeated here. Thank you.