Reposted from: http://www.cnblogs.com/xxrxxr/p/7535587.html

There is a lot of material about SVM online and in books, but I find some details are not explained very clearly. Below is my derivation of the full mathematical principle of SVM, in which every step of the logical derivation should be reasonable and well-founded. I am writing it up here to share with everyone.

At present I have not studied the kernel-function side of SVM's mathematical principles very deeply, so the write-up below does not involve kernel functions. And since there are still parts I have not fully digested, I have organized the material into the following parts:

(1) The maximum-margin classifier, including a step-by-step derivation of the optimization objective, together with the necessary optimization background: the Lagrangian function, the KKT conditions, and the dual problem

(2) The soft-margin formulation, i.e. a step-by-step derivation of the optimization objective after slack variables are added

(3) The SMO algorithm

The following is the first part: deriving the basic form of SVM.

1. Preface #

Although there are still many things I don't fully understand, my current level of understanding has hit a bottleneck, and perhaps working through everything from scratch will help me break through. (This later proved to be very rewarding.)

I have to say that SVM is really hard. The difficulty of neural networks lies in the backpropagation algorithm, but that is only the chain rule for derivatives; SVM involves a great deal of optimization mathematics. After all, SVM alone can fill an entire book (*An Introduction to Support Vector Machines*)...

OK, let's start working through the whole idea of SVM. It may be quite confusing at first, but take your time and things will fall into place.

I would like to draw some figures myself, but I haven't found a good tool yet, so for now I will clip figures from books and mark the source.

2. Deriving the most basic form #

(I don't know of a better place to start.)

(Figure from Machine Learning P121)

I think many of you are already familiar with what SVM does, but let me be a bit more verbose here:

Here is our training data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where each label $y_i \in \{-1, +1\}$. All we need to do is find a hyperplane that separates these two classes of data. Now suppose this data set is **linearly separable**. If the data set is linearly separable, then we can actually find many hyperplanes that separate the two classes, as can be seen in the figure above. So which one are we looking for?

Intuitively, we want the hyperplane that is "farthest" from both classes of data on either side, like the bold line in the figure above. That line appears to be the "farthest" from the two sets of data, so it is likely to be more accurate when predicting, which I think is easy to accept. But this is only intuition. If we want to formulate it mathematically, we first need a standard for measuring the distance between the hyperplane and the data set, which leads to the concepts of the **functional margin** and the **geometric margin**.

(Finding the maximum margin is intuitively optimal, and in fact there is a corresponding mathematical principle behind it. I remember it is explained in chapter 4 of *An Introduction to Support Vector Machines*, but I still don't understand it very well at present; have a look if you are interested.)

1. Definitions of the functional margin and the geometric margin #

First we write the hyperplane as $\omega^T x + b = 0$. Reviewing what we learned in high school, the distance from a sample point $x_i$ to this hyperplane is $\dfrac{|\omega^T x_i + b|}{\|\omega\|}$, and this is exactly the definition of the **geometric margin**.

(If this looks unfamiliar, recall the formula for the distance from a point $(x_0, y_0)$ to the line $Ax + By + C = 0$, namely $\dfrac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}$.)
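To connect the two, identify $\omega = (A, B)^T$, $x = (x_0, y_0)^T$ and $b = C$ (this identification is just my way of checking, not something from the books); the general hyperplane distance then reduces to the familiar high-school formula:

$$
\frac{|\omega^T x + b|}{\|\omega\|} = \frac{|A x_0 + B y_0 + C|}{\sqrt{A^2 + B^2}}, \qquad \|\omega\| = \sqrt{A^2 + B^2}.
$$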

And what about the **functional margin**? What I want to emphasize here is the value of $y_i$: the label is $+1$ for positive samples and $-1$ for negative samples. (So multiplying by $y_i$ does not change the absolute value.)

We use $\hat{\gamma}_i = y_i(\omega^T x_i + b)$ to denote the **functional margin** of a sample point $(x_i, y_i)$ with respect to the hyperplane $(\omega, b)$.

(The definition of the functional margin can be found in chapter 2 of *An Introduction to Support Vector Machines*.)

2. The relationship between the functional margin and the geometric margin #

It should be noted that if we find a hyperplane that correctly separates the data set, then for every sample point $(x_i, y_i)$ the functional margin is actually greater than or equal to 0, namely:

$$y_i(\omega^T x_i + b) \geq 0$$

I think this is a key point for understanding. Concretely: for a positive sample, its label is $y_i = +1$, and because it is correctly classified we have $\omega^T x_i + b \geq 0$, so $y_i(\omega^T x_i + b) \geq 0$. The same argument applies to negative samples: $y_i = -1$ and $\omega^T x_i + b \leq 0$, so the product is again $\geq 0$.

(Many books state this directly; I think that is a bit of a leap and hard to follow without the explanation above.)

From this we can actually conclude: if a sample is correctly classified, then

$$y_i(\omega^T x_i + b) = |\omega^T x_i + b|.$$

So for correctly classified samples the functional margin can also be written as $\hat{\gamma}_i = |\omega^T x_i + b|$.

Look back at the earlier formula for the geometric margin, $\dfrac{|\omega^T x_i + b|}{\|\omega\|}$. In many texts the numerator is written without the absolute-value sign: we can replace the absolute value in the numerator with the functional margin, so the **geometric margin** of this sample with respect to the hyperplane can in fact be written as:

$$\gamma_i = \frac{y_i(\omega^T x_i + b)}{\|\omega\|} = \frac{\hat{\gamma}_i}{\|\omega\|}$$

So this, I think, is **the relationship between the functional margin and the geometric margin**: for a hyperplane that correctly classifies a linearly separable sample set, the geometric margin of each sample point with respect to the hyperplane is simply its functional margin divided by the norm $\|\omega\|$.
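As a quick numerical sanity check (a toy sketch I add here; the hyperplane and the points are just made-up illustrative numbers), the snippet below computes the functional and geometric margins of a few points and confirms that the geometric margin is the functional margin divided by $\|\omega\|$:

```python
import numpy as np

# A hand-picked hyperplane w^T x + b = 0 (illustrative values only)
w = np.array([2.0, -1.0])
b = 0.5

# A few toy samples: rows of X are points x_i, y holds labels in {-1, +1}
X = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [-1.0, 2.0]])
y = np.array([1, 1, -1])

functional_margin = y * (X @ w + b)                        # hat{gamma}_i = y_i (w^T x_i + b)
geometric_margin = functional_margin / np.linalg.norm(w)   # gamma_i = hat{gamma}_i / ||w||

print(functional_margin)   # [1.5 4.5 3.5], all positive: every point is correctly classified
print(geometric_margin)    # the same values scaled by 1 / ||w||
```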

3. Defining the margin from the hyperplane to the training set #

The geometric margin and functional margin above were all about the distance between a single point and the hyperplane. Now let us define the margin from a hyperplane to the training set as the smallest geometric margin between that hyperplane and the points of the training set.

(For this definition, I refer to the explanation of SVM in Pattern Recognition and Machine Learning, which is the most detailed one I have ever seen.)

This distance between the hyperplane and the data set is exactly the measure we need to turn the earlier intuition into mathematics.

In this sense, when we measure the distance from a hyperplane to the data set, we really only need to look at the nearest of all the sample points.

So for a hyperplane $(\omega, b)$, its margin to the given training set is:

$$\gamma = \min_{i} \frac{y_i(\omega^T x_i + b)}{\|\omega\|} = \frac{1}{\|\omega\|} \min_{i} \left[ y_i(\omega^T x_i + b) \right]$$

This formula is easy to understand: the $\min$ means taking the smallest of the per-sample geometric margins. Because $\frac{1}{\|\omega\|}$ does not depend on the specific sample point, it can be pulled out of the $\min$, and the remaining part, $\min_i \left[ y_i(\omega^T x_i + b) \right]$, is the minimum functional margin over all points in the training set.

The other thing to understand is that once a separating hyperplane is given, its margin to the training set is fixed, because the training set is given and every point in it is fixed. If you adjust the hyperplane, this margin changes.
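A small sketch of this definition (again my own toy numbers; the helper name `dataset_margin` is just mine): the margin of a hyperplane to a training set is the minimum geometric margin over the points, and it indeed changes when $(\omega, b)$ is adjusted:

```python
import numpy as np

def dataset_margin(w, b, X, y):
    """Margin of the hyperplane (w, b) to the training set: min_i y_i(w^T x_i + b) / ||w||."""
    return (y * (X @ w + b)).min() / np.linalg.norm(w)

# Toy linearly separable data (illustrative only)
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

print(dataset_margin(np.array([2.0, -1.0]), 0.5, X, y))  # one candidate hyperplane
print(dataset_margin(np.array([1.0, -0.5]), 0.0, X, y))  # a different (w, b): the margin changes
```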

4. Getting the most original optimization objective #

So now we need to find a hyperplane that not only classifies the training set correctly, but also, among all correctly classifying hyperplanes, has the largest margin to the training set. This is exactly the intuitive problem we started with.

Then we arrive at our most original optimization objective:

$$\max_{\omega, b} \left\{ \frac{1}{\|\omega\|} \min_{i} \left[ y_i(\omega^T x_i + b) \right] \right\}$$

A question I considered here: how can we be sure that the hyperplane found by this optimization objective lies exactly midway between the positive sample set and the negative sample set? After some reflection, I realized this is already implied by the objective. We can think of optimizing the plane as optimizing $\omega$ and optimizing $b$ separately; for the same $\omega$, planes with different $b$ are parallel to each other. When $\omega$ is fixed, the sum of the smallest distance from the positive samples to the plane and the smallest distance from the negative samples to the plane is also fixed! The following explanation fills this in.

Different values of $b$, however, give different margins: changing $b$ shifts the plane parallel to itself, and the objective takes the minimum. Suppose the plane is shifted slightly off-center toward the positive samples; then the distance to the nearest positive sample becomes smaller and the distance to the nearest negative sample becomes larger, but because we take the minimum, the minimum drops below half of that fixed sum, which is worse than the centered position. So from this point of view it is optimal for the plane to sit exactly in the middle.

(This original optimization objective is not spelled out in many books. I found it in *Pattern Recognition and Machine Learning*, which explains it quite well.)
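To illustrate the argument above numerically (my own toy sketch, with made-up one-dimensional data): fix $\omega$ and sweep $b$; the minimum distance over all samples is largest when the plane sits midway between the closest positive and closest negative points:

```python
import numpy as np

# Toy 1D data: positives on the right, negatives on the left
X = np.array([[3.0], [4.0], [-1.0], [-2.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0])  # fixed direction; only b varies below

best_b, best_margin = None, -np.inf
for b in np.linspace(-3.0, 3.0, 601):
    margin = (y * (X @ w + b)).min() / np.linalg.norm(w)
    if margin > best_margin:
        best_b, best_margin = b, margin

# The closest points are x = 3 (positive) and x = -1 (negative), so the best plane is
# x - 1 = 0, i.e. b is (approximately) -1, sitting midway between them with margin 2.
print(best_b, best_margin)
```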

3. Deriving the basic form of SVM #

Once we have this very original optimization objective, trying to solve it directly is complicated, so we simplify further. Notice that if we scale $\omega$ and $b$ simultaneously, i.e. replace them by $k\omega$ and $kb$, and substitute these into the original expression, the original optimization problem is unaffected: the factor $k$ cancels. We can use this property to simplify things.

So we might as well set $\min_i y_i(\omega^T x_i + b) = 1$, which means scaling $\omega$ and $b$ simultaneously so that the sample points closest to the hyperplane have functional margin exactly 1. This adjustment has no influence on the original optimization problem and simplifies the calculation.

(The scaling needs to be understood properly here: scaling $\omega$ and $b$ simultaneously changes every sample point's functional margin, but does not change any sample point's geometric margin to the hyperplane. So we may as well let the functional margin of the point nearest the hyperplane be 1; this simplifies the calculation, but also introduces a constraint.)
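Written out, the cancellation looks like this: replacing $(\omega, b)$ by $(k\omega, kb)$ with $k > 0$ multiplies every functional margin by $k$ but leaves every geometric margin unchanged:

$$
\frac{y_i\big((k\omega)^T x_i + kb\big)}{\|k\omega\|}
= \frac{k \, y_i(\omega^T x_i + b)}{k\,\|\omega\|}
= \frac{y_i(\omega^T x_i + b)}{\|\omega\|}.
$$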

But now that we have set $\min_i y_i(\omega^T x_i + b) = 1$, it means we have added the following constraint:

For the sample points nearest to the hyperplane: $y_i(\omega^T x_i + b) = 1$.

For the other sample points: $y_i(\omega^T x_i + b) > 1$.

So now we can say: when a hyperplane (normalized as above) correctly classifies the given training set, the functional margin of each sample point satisfies:

$$y_i(\omega^T x_i + b) \geq 1$$

This is the constraint that simplification imposes

Now let's look at the simplified optimization objective:

$$\max_{\omega, b} \frac{1}{\|\omega\|}$$

subject to the constraint above holding for every sample point: $y_i(\omega^T x_i + b) \geq 1$.

But we generally convert this optimization problem into an equivalent one: since maximizing $\frac{1}{\|\omega\|}$ is the same as minimizing $\|\omega\|$, which in turn is the same as minimizing $\|\omega\|^2$, the objective becomes

$$\min_{\omega, b} \frac{1}{2}\|\omega\|^2$$

A factor of $\frac{1}{2}$ is usually added; the book says it is purely for convenience in the later derivation (the gradient of $\frac{1}{2}\|\omega\|^2$ is simply $\omega$).

So now our optimization problem becomes:

$$\min_{\omega, b} \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \geq 1$$

where $i = 1, 2, 3, \ldots, m$.

This is the basic form of SVM as described in *Machine Learning*.
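As a sketch (my own example, assuming the `cvxpy` package is available; it is not part of the derivation), this basic form can be handed directly to a generic convex solver. That is a useful sanity check, even though real SVM solvers work on the dual and use SMO, which comes later:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative only)
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Basic (hard-margin) SVM: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print((y * (X @ w.value + b.value)).min())  # every functional margin should be >= 1 (up to solver tolerance)
```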

Now let's recap the process that led to this basic form:

(1) The most original optimization objective: $\max_{\omega, b} \left\{ \frac{1}{\|\omega\|} \min_i \left[ y_i(\omega^T x_i + b) \right] \right\}$, i.e. among all planes that classify the data set correctly, we look for the one whose **margin** to the data set is largest, where the **margin** is defined as the minimum distance between the samples of the data set and the plane.

(2) The optimization objective remains unchanged when $\omega$ and $b$ are scaled together, so we can set the smallest functional margin $\min_i y_i(\omega^T x_i + b) = 1$ to simplify the problem, and then convert it into the equivalent form $\min_{\omega, b} \frac{1}{2}\|\omega\|^2$.

But notice that this brings the constraint $y_i(\omega^T x_i + b) \geq 1,\ i = 1, 2, \ldots, m$!