Machine learning reading Notes - Support vector machines

Assuming that the data is linear, that means that we can find a line that will separate it, of the multiple lines that will separate it, we want to choose the fattest line. Each line to call it with a w equation, we see how fat it is that’s the way we see the line with the distance of each point to calculate a calculate, after that, the minimum distance, which is the most close to the point, can be used to measure our line have a more than fat, we want to find out the most fat line, finally we have to do a maximum maximize.

In addition we also asked the line to correct, to all the tic tac toe to separate, this means we calculate the score with w with actual y is the same size, with a relatively simple mathematical expressions is multiplied by the x and y w calculated the value of the multiplication, because is the same, so it must be greater than zero, so we put the above expression make some changes, Become our ultimate goal.

In the next expression, we pull out w0 in terms of b.

So let’s first calculate the distance between the point and the plane, and if we already have a plane in hand, we can think of two points on the plane, one point I’ll call x prime, and one point I’ll call x”, where wx prime plus b is equal to 0, wx” plus b is equal to 0, so w times x prime is equal to minus b, and w times x” is equal to minus b. So w is also going to be equal to 0 if you subtract minus b minus minus b. Notice that x”, x prime are both points on the plane, so w times any vector on the plane is going to be equal to 0, which means that w is going to be perpendicular to the plane, which means that W is a normal vector to the plane.

For any point on the space x, point to the distance of the plane is to take this point to any point on the plane to come out even a vector, then the vector projection in the direction of w, so do their inner product, such as inner product of time but do want to get rid of the original length of w, such ability can correct calculated in that direction.

The first term is the inner product of w with x, the second term is the inner product of w with x prime, so the inner product of w with x prime is minus b, so we end up with something that looks like this.

The line we’re thinking about is the line that separates the circle from the cross perfectly, so the score that we’re going to get is the same number as the circle we want to get. We can use this property to take the absolute value off here, because y times this fraction must be positive.

Next, for mathematical convenience, we made the following special scaling:

Now we want to do a minimal action more, we had to do to maximize tired enough, now take out a minimal action, this as if is not too easy, that we tried to put a little bit to relax this condition, we put the conditions to relax into two type multiplication are greater than or equal to 1 the necessary condition. The question is if we relax and say each of them is greater than or equal to 1, are they all greater than 1? For example, if I say y times the value of the fraction, everything is now greater than 1.126. If we do a scaling between b and W, we divide b by 1.126, and we divide w by 1.126, and we just had each of them greater than 1.126, so after doing this division, it’s still going to be greater than or equal to 1, which satisfies all of our criteria. But 1 / | | w | | is what we want to maximize. Now we’re dividing w by 1.126, so we’re making w shorter, we’re making w shorter, so the reciprocal of w is going to be bigger. So this b and w are not going to be the best solution. This means that when we relax the conditions, it has no effect on the results.

We then convert the maximum value to the minimum value and remove the square root sign. Finally, the target to be optimized becomes as shown in the figure:

For example:

Quadratic programming for the general solution of the problem we want to optimize, we can use QUADRATIC programming:

Here, we can do a nonlinear transformation of x, as shown in the figure:

In addition to the above quadratic programming solution, we can also apply the Lagrange multiplier method to transform the original conditional optimization problem into an extreme value problem without conditions to solve it. As shown below:

(1). The calculated values that do not meet the restriction conditions are all greater than 0, multiplied by an alpha greater than or equal to 0 to perform the maximization operation. The larger alpha is, the better, and finally tends to be infinite. (2). What meets the restriction conditions is that the calculated values are all less than or equal to 0, multiplied by an alpha greater than or equal to 0 to perform the maximization operation. The larger the alpha is, the better, and the minimum of each alpha is 0. And then the last thing that’s left is the ones that need to be optimized.

Through a minimization process, I can weed out those that don’t meet the constraints. So what we are doing is transforming the SVM into a seemingly unconditional formula. There seems to be no condition, but the condition is hidden in maximization.

What’s the advantage of taking the problem on the left hand side and converting it to the problem on the right hand side according to strong duality, and the advantage of that is that we are taking the problem of optimizing b and W and making it unconditional, because now the problem of optimizing B and W is in the inside, which is the first step.