Ng Machine Learning: Support Vector Machines

These course notes are quite different from last time, because it took me a long time to understand SVMs. As with the earlier neural networks lectures, Ng's Coursera presentation is fairly limited, and to understand SVMs you have to turn to other sources. After comparing several resources online, I found the Stanford CS229 handout to be the clearest. The Coursera course is a simplified version of CS229, the machine learning course Andrew Ng teaches at Stanford; CS229 is still being updated and makes a good further reference. (I have packaged the handout and the SMO paper together with the code for you to download.)

These notes explain the main idea of SVM: starting from the maximum margin classifier, we obtain the dual of the original problem through Lagrange duality, which lets us apply the kernel trick to handle high-dimensional problems efficiently.

You can follow Ng's lessons by clicking on the course video link, and the Python code for the course assignments has been posted on GitHub. Click on the course code link to go to GitHub (if you cannot access GitHub, you can click on Coding instead). Corrections and improvements to the code are welcome.

Here are the notes from week 6 of Ng’s machine learning course.

Maximum margin classifier

Consider the prediction function of logistic regression, $h_\theta(x) = g(\theta^T x)$. When $\theta^T x \gg 0$ we can be quite sure that $h_\theta(x)$ is close to 1, and when $\theta^T x \ll 0$ that it is close to 0. The intuition is that the farther a sample lies from the decision boundary, the more certain we are of its classification. So it is natural to think that, among multiple possible decision boundaries, we should choose the one that is farthest from all of the samples (in the linear case).



To obtain such a decision boundary, let's first see how to express the problem mathematically. With a bit of simple computational geometry we can compute the distance from each sample to the decision boundary, the geometric margin $\gamma^{(i)}$. For the sake of the derivation we also define the functional margin $\hat{\gamma}^{(i)}$.

In what follows, the SVM labels are $y^{(i)} \in \{1, -1\}$: we predict $1$ when $\theta^T x \ge 0$ and $-1$ when $\theta^T x < 0$. Let $\hat{\gamma}$ be the smallest of the functional margins $\hat{\gamma}^{(i)}$, and $\gamma$ the smallest of the geometric margins $\gamma^{(i)}$. The problem of finding a decision boundary that is far from all samples then becomes the problem of maximizing $\gamma$ under the given constraints.

However, the above problem is not a directly solvable optimization problem; it needs to be transformed (the idea is to turn it into a standard convex optimization problem). First, we replace the geometric margin with the functional margin, and the original problem becomes the following.
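Following the CS229 notes, the intermediate problem is (roughly):

$$
\max_{\hat{\gamma}, w, b} \; \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \hat{\gamma}, \quad i = 1, \ldots, m
$$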

Since scaling $w, b$ by any positive constant does not change the decision boundary (the geometric margin does not depend on the size of $\|w\|$), we are free to fix the functional margin to $\hat{\gamma} = 1$. Maximizing $\frac{1}{\|w\|}$ is then equivalent to minimizing $\|w\|^2$, so the optimization problem takes its final form.
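That final form is the standard hard-margin primal from the CS229 notes:

$$
\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m
$$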

Such a problem can be solved with standard optimization algorithms. For SVM, however, we will instead examine a corresponding dual problem, and to do that we need to know a bit about Lagrange duality.

Lagrange duality

There is a lot to say about Lagrange duality; I recommend the CS229 notes and this blog post for the details. Consider the general convex optimization problem.
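In the CS229 notation it has the form:

$$
\begin{aligned}
\min_{w} \quad & f(w) \\
\text{s.t.} \quad & g_i(w) \le 0, \; i = 1, \ldots, k \\
& h_i(w) = 0, \; i = 1, \ldots, l
\end{aligned}
$$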

Let its Lagrange function be:
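Here the $\alpha_i$ are the multipliers for the inequality constraints and the $\beta_i$ those for the equality constraints:

$$
L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)
$$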

Consider the function:
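This is the primal-side maximization of the Lagrangian over the multipliers, whose minimum over $w$ recovers the primal optimal value $p^*$:

$$
\theta_P(w) = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} L(w, \alpha, \beta), \qquad p^* = \min_{w} \theta_P(w)
$$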

When $w$ satisfies the constraints, $\theta_P(w) = f(w)$; otherwise $\theta_P(w) = \infty$. So minimizing $\theta_P(w)$ gives $p^*$, the same result as the original problem.

The interesting thing is that this problem always has a corresponding dual problem, and under certain conditions the two have the same solution. To obtain the dual problem, we simply swap the $\min$ and the $\max$ and consider the function below.
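Namely, the dual-side objective and the dual optimal value:

$$
\theta_D(\alpha, \beta) = \min_{w} L(w, \alpha, \beta), \qquad d^* = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \theta_D(\alpha, \beta)
$$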

The optimal value of the dual problem, $d^*$, satisfies the following inequality, which shows that it is a lower bound on the optimal value of the original problem.
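This is the weak-duality relation:

$$
d^* = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \; \min_{w} \; L(w, \alpha, \beta) \;\le\; \min_{w} \; \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \; L(w, \alpha, \beta) = p^*
$$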

Finally, under what circumstances does the solution of the dual problem coincide with that of the original problem? If $f$ and the $g_i$ are all convex functions, the $h_i$ are affine (linear), and there exists a $w$ with $g_i(w) < 0$ for all $i$, then there exist $w^*, \alpha^*, \beta^*$ such that $d^* = p^* = L(w^*, \alpha^*, \beta^*)$. Moreover, this solution satisfies the KKT conditions, and conversely any point satisfying the KKT conditions is also a solution of the optimization problem (the KKT conditions are listed below).

$$
\begin{aligned}
\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) &= 0, & i &= 1, \ldots, m \\
\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) &= 0, & i &= 1, \ldots, l \\
\alpha_i^* \, g_i(w^*) &= 0, & i &= 1, \ldots, k \\
g_i(w^*) &\le 0, & i &= 1, \ldots, k \\
\alpha_i^* &\ge 0, & i &= 1, \ldots, k
\end{aligned}
$$

The dual problem

From Lagrange duality we know that the maximum margin problem above has a corresponding dual problem. Let's now derive it using Lagrange duality.

The constraints can be rewritten as $g_i(w) = -y^{(i)}(w^T x^{(i)} + b) + 1 \le 0$, and we set up the Lagrange function accordingly.
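Since there are no equality constraints, only the $\alpha_i$ multipliers appear:

$$
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right]
$$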

To minimize $L(w, b, \alpha)$ with respect to $w$ and $b$, we take the derivatives with respect to $w$ and $b$ and set them to zero.
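Doing so (as in the CS229 derivation) gives:

$$
w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
$$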

Substituting this expression for $w$ back into $L(w, b, \alpha)$ and simplifying, we obtain a function of $\alpha$ alone.

Together with the constraints $\alpha_i \ge 0$ and $\sum_{i=1}^m \alpha_i y^{(i)} = 0$, we arrive at the final dual problem:
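As in the CS229 notes (the inner product $\langle x^{(i)}, x^{(j)} \rangle$ is what the kernel trick will later replace):

$$
\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.} \quad & \alpha_i \ge 0, \; i = 1, \ldots, m \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
\end{aligned}
$$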

The inner product term $\langle x^{(i)}, x^{(j)} \rangle$ (the angle brackets denote an inner product) is the most important part: it is what allows us to use the kernel trick to reduce computational cost, especially when features need to be mapped to very high or even infinite dimensions.

Kernel function

Suppose each sample has three features $x_1, x_2, x_3$. Usually we map them to a higher-dimensional space in order to fit more complex functions. Suppose the mapping function is:
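Following the CS229 example, $\phi$ maps the three features to all nine pairwise products:

$$
\phi(x) = \begin{bmatrix} x_1 x_1 & x_1 x_2 & x_1 x_3 & x_2 x_1 & x_2 x_2 & x_2 x_3 & x_3 x_1 & x_3 x_2 & x_3 x_3 \end{bmatrix}^T
$$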

The inner product between two different samples after mapping is:
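Expanding the nine products and regrouping shows that it collapses to the square of the original inner product:

$$
\langle \phi(x), \phi(z) \rangle = \sum_{i=1}^{3} \sum_{j=1}^{3} x_i x_j z_i z_j = \left( \sum_{i=1}^{3} x_i z_i \right)^2 = (x^T z)^2
$$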

It is not hard to see that the inner product of the nine mapped features is equal to the square of the inner product of the original three features. This $K(x, z) = (x^T z)^2$ is one example of a kernel function. For $n$ features, computing the kernel $K(x, z) = (x^T z)^2$ takes only $O(n)$ time, whereas computing the inner product of the $n^2$ mapped features directly takes $O(n^2)$ time. More generally, the kernel $K(x, z) = (x^T z + c)^d = \langle \phi(x), \phi(z) \rangle$ corresponds to a feature mapping into a $\binom{n+d}{d}$-dimensional space. There is also a very interesting kernel, the Gaussian kernel, which is equivalent to mapping the features into infinitely many dimensions.
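For reference, the Gaussian kernel (with bandwidth $\sigma$) is:

$$
K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)
$$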

So now you can see what the kernel trick means: since the dual problem involves only inner products of features, and the kernel value computed in the lower-dimensional space equals the inner product of the features mapped to the higher-dimensional space, we obtain a very efficient way to solve our problem. (See the CS229 handout for a more detailed discussion of kernels.)
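Here is a small numerical check of the equivalence above. It is only an illustrative sketch, not part of the course code; the helper names `phi`, `poly_kernel`, and `gaussian_kernel` are mine:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 mapping: all pairwise products x_i * x_j (9 features when n = 3)."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z, c=0.0, d=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^d, computed in O(n) time."""
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel, corresponding to an infinite-dimensional feature map."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

print(phi(x) @ phi(z))        # 1024.0, inner product of the explicitly mapped features
print(poly_kernel(x, z))      # 1024.0, same value without ever building the 9 features
print(gaussian_kernel(x, z))  # a value in (0, 1]
```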

Regularization

Let's briefly talk about regularization for SVM. To handle data sets that are not linearly separable, we adjust the original problem so that the margin is allowed to be smaller than 1, but a corresponding cost is added to the optimization objective.
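Concretely, the soft-margin primal from the CS229 notes introduces a slack variable $\xi_i$ per sample and a penalty weight $C$:

$$
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, m
$$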

Lagrange duality can again be used to derive the corresponding dual problem, which is the problem the SVM algorithm actually solves:
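It looks just like the earlier dual, except that each $\alpha_i$ is now also bounded above by $C$:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \; i = 1, \ldots, m \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
\end{aligned}
$$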

The SMO algorithm

Here is the main idea of the SMO algorithm; for the details you can refer to the paper. To maximize $W(\alpha_1, \alpha_2, \ldots, \alpha_m)$, coordinate ascent picks one variable $\alpha_i$ at a time, fixes the other variables, maximizes the objective with respect to that variable, and repeats until the value converges (figure below).



For the SVM dual problem, the constraint $\sum_{i=1}^m \alpha_i y^{(i)} = 0$ means a single $\alpha_i$ cannot be changed in isolation, so SMO selects two variables $\alpha_i, \alpha_j$ at a time and fixes the others. Then $\alpha_i$ can be written as $(\zeta - \alpha_j y^{(j)}) y^{(i)}$, and the optimization objective reduces to a quadratic in $\alpha_j$ of the form $a \alpha_j^2 + b \alpha_j + c$, whose maximum is easy to find. The result is then clipped so that the constraints on $\alpha_j$ still hold. Finally, the process is repeated until the optimization objective converges.
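Below is a minimal sketch of that pairwise update, following the simplified SMO variant described in the CS229 materials. It assumes a linear kernel and that the caller picks the pair $(i, j)$; the function name `smo_step` and the toy usage loop are my own, and a full implementation would add Platt's heuristics for choosing which pair to update:

```python
import numpy as np

def smo_step(X, y, alpha, b, i, j, C=1.0, tol=1e-5):
    """One simplified-SMO update of the pair (alpha[i], alpha[j]) with a linear kernel."""
    K = X @ X.T                            # kernel matrix K[p, q] = <x_p, x_q>
    f = (alpha * y) @ K + b                # decision values f(x_p) for every sample
    E_i, E_j = f[i] - y[i], f[j] - y[j]    # prediction errors

    # Feasible interval [L, H] keeps alpha[j] in [0, C] while preserving sum(alpha * y)
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if L == H:
        return alpha, b, False

    eta = 2 * K[i, j] - K[i, i] - K[j, j]  # second derivative along the constraint line
    if eta >= 0:
        return alpha, b, False

    a_i_old, a_j_old = alpha[i], alpha[j]
    alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)  # maximize, then clip
    if abs(alpha[j] - a_j_old) < tol:
        alpha[j] = a_j_old
        return alpha, b, False
    alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])       # keep sum(alpha * y) fixed

    # Recompute the threshold b from whichever multiplier lies strictly inside (0, C)
    b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] - y[j] * (alpha[j] - a_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] - y[j] * (alpha[j] - a_j_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = (b1 + b2) / 2
    return alpha, b, True

# Toy usage: repeatedly update random pairs on a small separable data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
alpha, b = np.zeros(len(y)), 0.0
for _ in range(200):
    i, j = rng.choice(len(y), size=2, replace=False)
    alpha, b, _ = smo_step(X, y, alpha, b, i, j)
```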

So, that’s all for week 6. Thank you for your patience.

P.S. This week's content is somewhat compressed, so I suggest you read the CS229 handout carefully. As Andrew Ng's course is coming to an end, I'm planning to start some small projects to practice my skills; after all, applying what you learn is the real learning. I'll post any interesting things or ideas I come across on Weibo, and everyone is welcome to study together (● - ●)~


hertzcat

2018-05-13