Introduction to Support Vector Machines (Understanding the three levels of SVM)

By July. Thanks to pluskid, Bai Shi, and JerryLead. Note: this article was originally written in June 2012 and has since been revised and polished many times, most recently in November 2016. Disclaimer: this is a set of "study notes", with links to all references from 2012 onward, and it draws in particular on the articles by pluskid et al.; the 2013 PDF at the end of this article serves as a record. In addition, if the pictures or formulas in this article do not display properly, please open the **July online question bank version of this article**.

 

Preface

Writing this piece on support vector machines took a great deal of effort and ran into real difficulty: for one thing, SVM is not easy to understand — studying it in depth takes plenty of time and energy — and for another, it is not easy to explain clearly. Although some friends online have done a good job (see the links at the end of this article), the treatment of the mathematics is still not complete enough. Thanks to Bai Shi's mathematical proofs, I still wanted to give it a try, in the hope that this article can be a reasonably complete, easy-to-follow summary and introduction to support vector machines.

While writing this article I consulted a good deal of material, including An Introduction to Support Vector Machines, Statistical Learning Methods, and pluskid's support vector machine series, among others. This remains a set of study notes to which I have added my own understanding and summaries; please bear with any imperfections. On the macro level the article aims at a full understanding of the concept and use of support vector machines; on the micro level it digs into the origin and context of certain theorems and the details of proofs and principles, so as to keep the logic clear and easy to follow.

When reading this article, it is also recommended that you use Chrome or a similar browser so that the formulas display properly. In addition, keep paper and pen at hand and derive all of the theorems and formulas in the article yourself, or simply print them out (you can print the web version or the PDF attached at the end of this article) and work through the calculations on the document, so as to enjoy the pleasure of thinking and calculating anytime, anywhere.

OK, as always: if you have any questions, you are welcome to point them out or leave a comment at any time. Thank you.

 

 

The first layer: understanding SVM

The support vector machine — Support Vector Machine in English, commonly abbreviated SVM — is, broadly speaking, a binary classification model. Its basic model is defined as the linear classifier with the largest margin in feature space, and its learning strategy is margin maximization.

1.1 Origin of the classification labels: logistic regression

To understand SVM, we must first understand a concept: linear classifier.

Given some data points belonging to two different classes, we want to find a linear classifier that divides the data into the two classes. If x denotes a data point and y its class (y can be 1 or −1, representing the two different classes), the learning goal of a linear classifier is to find a hyperplane in the n-dimensional data space. The equation of this hyperplane can be written as (the T in w^T denotes the transpose):

    w^T x + b = 0

Readers may wonder why the classes are taken to be 1 or −1. In fact, this 1/−1 labeling convention originates from logistic regression.

The purpose of logistic regression is to learn a 0/1 classification model from the features, a model that takes a linear combination of the features as its independent variable; since that linear combination ranges from minus infinity to plus infinity, the logistic function (or sigmoid function) is used to map it into (0, 1), and the mapped value is interpreted as the probability that y = 1.

The hypothesis function is

h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))

where x is the n-dimensional feature vector and g is the logistic function, g(z) = 1 / (1 + e^(−z)).

The graph of g(z) is shown below.

As the graph shows, g maps (−∞, +∞) into (0, 1): g(z) tends to 0 as z → −∞ and tends to 1 as z → +∞.

We then interpret the value of the hypothesis function h_θ(x) as the probability that the feature vector belongs to class y = 1.

Thus, to decide which class a new feature vector x belongs to, we only need to compute h_θ(x): if the value is greater than 0.5, x belongs to class y = 1; otherwise it belongs to class y = 0.

Furthermore, h_θ(x) depends only on θ^T x: if θ^T x > 0, then h_θ(x) > 0.5. The function g(z) merely performs the mapping; the real class decision is still made by θ^T x. Moreover, when θ^T x ≫ 0, h_θ(x) ≈ 1, and conversely when θ^T x ≪ 0, h_θ(x) ≈ 0. If our goal is just the class decision, the model should aim to make θ^T x far greater than 0 for the features of the y = 1 training data and far less than 0 for the features of the y = 0 training data. This is exactly what logistic regression has to learn, and the goal should be achieved on all of the training instances.

Next, let us transform logistic regression slightly. First, replace the result labels y = 0 and y = 1 with y = −1 and y = 1; then, in θ^T x = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ, rename θ₀ as b and replace θ₁x₁ + θ₂x₂ + … + θₙxₙ with w^T x, so that θ^T x becomes w^T x + b. In other words, apart from relabeling y = 0 as y = −1, the formal representation of the linear classification function f(x) = w^T x + b is no different from logistic regression's h_θ(x) = g(θ^T x).

Going one step further, we can simplify the hypothesis function by replacing g(z) with a function that maps directly to y = −1 and y = 1. The mapping is as follows:

g(z) = 1 if z ≥ 0,  g(z) = −1 if z < 0
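To make the two mappings concrete, here is a small numerical sketch (my own illustration with made-up numbers, not from the references) comparing the logistic squashing g(z) = 1/(1 + e^(−z)) with the simplified hard mapping to ±1:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hard_label(z):
    """Simplified SVM-style mapping: +1 if z >= 0, else -1."""
    return np.where(z >= 0, 1, -1)

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(sigmoid(z))     # values near 0 for very negative z, near 1 for very positive z
print(hard_label(z))  # [-1 -1  1  1  1]
```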

1.2 An example of linear classification

Here is a simple example. As the figure below shows, we have a two-dimensional plane with two kinds of data on it, represented by circles and crosses. Since the data are linearly separable, the two types of data can be separated by a straight line, which acts as a hyperplane: all the points on one side of it correspond to y = −1 and all the points on the other side correspond to y = 1.

This hyperplane can be described by the classification function f(x) = w^T x + b. When f(x) = 0, x is a point on the hyperplane; points with f(x) > 0 correspond to data points with y = 1, and points with f(x) < 0 correspond to points with y = −1, as shown in the figure below:

Note: some references define the function from features to output as f(x) = w^T x − b, which is in substance the same as the w^T x + b used here. Why? Because whether we write w^T x + b or w^T x − b does not affect the final optimization result. You will see below that, when we turn to optimization, we set y·f(x) to 1 for convenience — that is, y(w^T x + b), or equally y(w^T x − b) — and this has no effect on the objective max 1/‖w‖ that we optimize.

(Note: a reader asked about the functional margin γ̂ = y(w^T x + b) = y·f(x): does y only take the values 1 and −1, and is y's only role to ensure that the functional margin is non-negative? Is that really so? Of course not; for details see the 43rd comment below this article.)

In other words, when classifying a new data point x, we substitute x into f(x): if f(x) < 0, the class label of x is −1, and if f(x) > 0, its class label is 1.

The question now is: how do we determine this hyperplane? Intuitively, it should be the line that best separates the two classes of data, and the criterion for "best" is that the line be as far as possible from the data on both sides. So we must search for the hyperplane with the maximum margin.

1.3 Functional margin and geometric margin

For the hyperplane w^T x + b = 0, |w^T x + b| indicates how far the point x is from the hyperplane, and by checking whether the sign of w^T x + b agrees with the sign of the class label y we can tell whether the classification is correct. So the sign of y·(w^T x + b) can be used to judge, or indicate, the correctness of the classification. This leads to the notion of the functional margin.

Define the functional margin (denoted γ̂) as:

γ̂ = y(w^T x + b) = y·f(x)

The functional margin of the hyperplane (w, b) with respect to the training data set T is then defined as the minimum of the functional margins over all sample points (x_i, y_i) in T (where x_i is the feature, y_i the result label, and i indexes the i-th sample):

γ̂ = min γ̂_i,  i = 1, …, n

The problem with this definition of margin is that if w and b are scaled proportionally (say, to 2w and 2b), the value of f(x), and hence the functional margin, doubles even though the hyperplane itself has not changed; so the functional margin alone is not enough.

In fact, we can impose a normalization constraint on the normal vector w, and this leads to the real notion of the distance from a point to a hyperplane — the geometric margin.

For a point x, let x0 be the foot of the perpendicular from x onto the hyperplane; w is a vector perpendicular to the hyperplane, and γ is the distance from the sample x to the hyperplane, as shown in the figure below:

By plane geometry we have

x = x0 + γ · w/‖w‖

where ‖w‖ is the second (Euclidean) norm of w (a norm is a notion of length, analogous to the modulus of a vector), and w/‖w‖ is the unit vector in the direction of w (a vector divided by its own norm is a unit vector).

Since x0 is a point on the hyperplane, it satisfies f(x0) = 0, i.e., w^T x0 + b = 0.

Multiplying both sides of x = x0 + γ · w/‖w‖ by w^T, and using w^T x0 + b = 0 together with w^T w = ‖w‖², we can solve for γ:

γ = (w^T x + b)/‖w‖ = f(x)/‖w‖

To obtain the absolute value of γ, multiply it by the corresponding class label y, which gives the geometric margin (denoted γ̃):

γ̃ = y·γ = γ̂/‖w‖

Comparing the two margins: the geometric margin is the functional margin divided by ‖w‖. The functional margin y(w^T x + b) = y·f(x) is really just |f(x)|, a man-made measure of margin, whereas the geometric margin |f(x)|/‖w‖ is, intuitively, the genuine distance from the point to the hyperplane.
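As a small sketch (my own toy hyperplane and points, not from the article), the functional and geometric margins can be computed directly from their definitions, and the scaling behavior just described can be checked:

```python
import numpy as np

# A toy hyperplane w^T x + b = 0 and a few labeled points (hypothetical numbers).
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[2.0, 3.0], [0.0, 0.0], [1.0, -2.0]])
y = np.array([1, -1, -1])

f = X @ w + b                        # f(x) = w^T x + b
functional_margin = y * f            # y * f(x), positive iff correctly classified
geometric_margin = functional_margin / np.linalg.norm(w)   # divide by ||w||

print(functional_margin)   # [6. 1. 1.]
print(geometric_margin)    # same values scaled by 1/||w||

# Scaling (w, b) by 2 doubles the functional margin but leaves the
# geometric margin unchanged -- the hyperplane itself did not move.
print(y * (X @ (2 * w) + 2 * b))
print((y * (X @ (2 * w) + 2 * b)) / np.linalg.norm(2 * w))
```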

 

1.4 The definition of the maximum margin classifier

When classifying a data point, the larger the "margin" between the hyperplane and the data point, the higher the confidence of the classification. So, to make the classification as confident as possible, the chosen hyperplane should maximize this "margin" value, which is half of the gap shown in the figure below.

From the analysis above, the functional margin is not suitable for maximization: once the hyperplane is fixed, we can scale the length of w and the value of b proportionally and make the functional margin γ̂ arbitrarily large while the hyperplane stays the same. The geometric margin, however, divides by ‖w‖, so scaling w and b leaves its value γ̃ unchanged; it changes only with the hyperplane itself, which makes it the more appropriate notion of margin. In other words, the margin we maximize — the "margin" in "maximum-margin separating hyperplane" — is the geometric margin.

Therefore, the objective function of the maximum margin classifier can be defined as:

max γ̃

subject to some conditions which, by the definition of the margins, read:

y_i(w^T x_i + b) = γ̂_i ≥ γ̂,  i = 1, …, n

(Here "s.t." means "subject to"; it introduces the constraints.)

Recall the definition of the geometric margin, γ̃ = γ̂/‖w‖: if we set the functional margin γ̂ equal to 1 (setting γ̂ = 1 is purely for convenience of derivation and optimization, and doing so has no effect on the optimization of the objective function — as to why, see the comments below this article), then γ̃ = 1/‖w‖ and the constraints become y_i(w^T x_i + b) ≥ 1. The objective function above is thus transformed into:

max 1/‖w‖,   s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, …, n

That is: subject to the corresponding constraints, maximize the value of 1/‖w‖ — and 1/‖w‖ is exactly the geometric margin.

As shown in the figure below, the middle solid line is the optimal hyperplane found; its distances to the two dashed boundary lines are equal. This distance is the geometric margin γ̃, the distance between the two dashed lines equals 2γ̃, and the points lying on the dashed margin boundaries are the support vectors. Since these support vectors lie exactly on the dashed boundaries, they satisfy y(w^T x + b) = 1 (remember that we set the functional margin to 1? In the previous section, for convenience of derivation and optimization, we set γ̂ = 1), while for every point that is not a support vector we clearly have y(w^T x + b) > 1.
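A brief sketch of this picture in code (using scikit-learn on a toy dataset of my own; a very large C approximates the hard-margin case): the support vectors satisfy y·f(x) ≈ 1 and the geometric margin equals 1/‖w‖.

```python
import numpy as np
from sklearn.svm import SVC

# A tiny linearly separable 2-D dataset (hypothetical numbers).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)     # geometric margin = 1/||w||

print("support vectors:\n", clf.support_vectors_)
# For the support vectors, y * (w^T x + b) should be ~1 (functional margin = 1).
print("y*f(x) on support vectors:",
      y[clf.support_] * (clf.support_vectors_ @ w + b))
print("geometric margin:", margin)
```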

OK, this much can be regarded as the first level of understanding SVM. For readers who only care about how to use SVM, it is enough; there is no need to dig further into the deeper principles.

 

The second layer: going deeper into SVM

2.1 From linearly separable to linearly inseparable

2.1.1 From the primal problem to the dual problem

Now consider the objective function obtained previously:

max 1/‖w‖,   s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, …, n

Since maximizing 1/‖w‖ is the same as minimizing (1/2)‖w‖², the objective above is equivalent to the following (w moves from the denominator to the numerator, and the original max problem becomes a min problem; clearly the two are equivalent):

min (1/2)‖w‖²,   s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, …, n

Since the objective function is quadratic and the constraints are linear, this is a convex quadratic programming problem. It can be solved with an off-the-shelf QP (Quadratic Programming) optimization package. In a word: under certain constraints, make the objective optimal and the loss minimal.
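For illustration, here is a minimal sketch (my own, not from the article) of handing this convex QP to an off-the-shelf solver, assuming the cvxopt package is installed. The variables are stacked as z = (w₁, w₂, b), the objective is ½‖w‖², and each constraint y_i(w^T x_i + b) ≥ 1 is rewritten as −y_i(w^T x_i + b) ≤ −1:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (hypothetical numbers).
X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n, d = X.shape

# Variables z = (w_1, ..., w_d, b); minimize (1/2) z^T P z with P = diag(1,...,1,0).
P = matrix(np.diag([1.0] * d + [0.0]))
q = matrix(np.zeros(d + 1))
# Constraints: -y_i (w^T x_i + b) <= -1 for every sample.
G = matrix(-y[:, None] * np.hstack([X, np.ones((n, 1))]))
h = matrix(-np.ones(n))

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h)
z = np.array(sol["x"]).ravel()
w, b = z[:d], z[d]
print("w =", w, "b =", b)
print("y * f(x) =", y * (X @ w + b))   # all >= 1; support vectors hit 1
```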

In addition, because of the special structure of this problem, it can also be converted, via Lagrange duality, into an optimization problem over the dual variables — that is, the optimal solution of the original problem can be obtained by solving the dual problem equivalent to it. This is the dual algorithm of the support vector machine under the linearly separable condition. Its advantages are, first, that the dual problem is often easier to solve, and second, that it naturally introduces the kernel function, which then extends to nonlinear classification problems.

So what is Lagrange duality? Simply put, by attaching a Lagrange multiplier α_i to each constraint, we define the Lagrangian function (which folds the constraints into the objective, so that the whole problem is expressed by a single function):

L(w, b, α) = (1/2)‖w‖² − Σ_i α_i ( y_i(w^T x_i + b) − 1 )

Then let

θ(w) = max_{α_i ≥ 0} L(w, b, α)

It is easy to verify that when some constraint is violated, for example y_i(w^T x_i + b) < 1, we clearly have θ(w) = ∞ (just let α_i → ∞). When all constraints are satisfied, the optimal value is θ(w) = (1/2)‖w‖², which is exactly the quantity we wanted to minimize in the first place.

So minimizing (1/2)‖w‖² subject to the constraints is equivalent to directly minimizing θ(w) (of course, with the constraints α_i ≥ 0, i = 1, …, n on the multipliers), because whenever the constraints are violated, θ(w) equals infinity, which can never be the minimum we want.

Writing this out, the objective function becomes:

min_{w,b} θ(w) = min_{w,b} max_{α_i ≥ 0} L(w, b, α) = p*

Here p* denotes the optimal value of this problem, which is the same as the optimal value of the original problem. Solving it directly is not easy: we would first have to deal with w and b while α sits inside inequality constraints. So instead we swap the positions of the min and the max:

max_{α_i ≥ 0} min_{w,b} L(w, b, α) = d*

This new problem is the dual of the original problem, and its optimal value is denoted d*. We always have d* ≤ p*; under certain conditions the two are equal, and then the original problem can be solved indirectly by solving the dual.

In other words, the reason we turn the min–max primal problem into the max–min dual problem is, first, that d* is a lower bound of p* (d* ≤ p*), and second, that the problem is much easier to solve once it is turned into the dual.

So the plan is: first minimize L with respect to w and b, and then maximize L with respect to α.

2.1.2 KKT conditions

As noted above, "d* ≤ p*, and the two are equal as long as certain conditions are satisfied"; by "certain conditions" we mean the KKT conditions.

Correction: reader qq_28543029 points out that the condition here should not strictly be called the KKT conditions; for the two values to be equal, strong duality must hold. Scholars later formulated the KKT conditions under strong duality, and for the KKT conditions to apply, certain constraint qualifications must be met, one of which is the Slater condition. The Slater condition says: for a convex optimization problem, if there exists a point x such that all equality constraints hold and all inequality constraints hold strictly (strict inequality, not equality), then the condition is satisfied. For our problem here the Slater condition holds, so d* ≤ p* can be attained with equality.

In general, an optimization problem can be written in the following standard form:

min f(x)
s.t.  h_j(x) = 0,  j = 1, …, p
      g_k(x) ≤ 0,  k = 1, …, q
      x ∈ X ⊂ R^n

Where, f(x) is a function to be minimized, h(x) is an equality constraint, g(x) is an inequality constraint, and p and q are the number of equality constraints and inequality constraints respectively.

Meanwhile, two notions should be kept in mind:

  • Convex optimization: X ⊂ R^n is a convex set and f: X → R is a convex function; convex optimization means finding a point x* ∈ X such that every x ∈ X satisfies f(x*) ≤ f(x).
  • The meaning of the KKT conditions: they are conditions that an optimal solution of a nonlinear programming problem must satisfy (and, for convex problems, they are also sufficient).

The KKT conditions say that the minimum point x* of the optimization problem in the standard form above must satisfy the following (with multipliers λ_j for the equality constraints and μ_k for the inequality constraints):

  1. Stationarity: ∇f(x*) + Σ_j λ_j ∇h_j(x*) + Σ_k μ_k ∇g_k(x*) = 0;
  2. Primal feasibility: h_j(x*) = 0 and g_k(x*) ≤ 0 for all j, k;
  3. Dual feasibility and complementary slackness: μ_k ≥ 0 and μ_k g_k(x*) = 0 for all k.

It can be verified that our problem satisfies the KKT conditions (the Slater condition has already been shown to hold, and f and the g_i are differentiable — i.e., L is differentiable with respect to w and b), so we now turn to solving the second problem, the dual.

In other words, by satisfying the KKT conditions, the original problem has been converted into its dual problem. Solving this dual learning problem takes three steps: first, minimize L(w, b, α) with respect to w and b; then maximize the result with respect to α; finally, use the SMO algorithm to solve for the Lagrange multipliers in the dual problem.

2.1.3 Three steps of solving dual problems

    (1) First fix α. To minimize L with respect to w and b, set the partial derivatives ∂L/∂w and ∂L/∂b to zero:

∂L/∂w = 0  ⇒  w = Σ_i α_i y_i x_i

∂L/∂b = 0  ⇒  Σ_i α_i y_i = 0

Substituting these results back into the previous L gives:

L(w, b, α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

Note: some readers may ask how this result is derived. To be honest, the full derivation is fairly involved; it is shown step by step in the figure below:

Finally, we get:

As JerryLead notes, the step from the fourth-to-last line to the third-to-last line uses the transpose operation of linear algebra: since α_i and y_i are real numbers, they equal their own transposes. The step from the third-to-last line to the second-to-last line uses the multiplication rule (a + b + c + …)(a + b + c + …) = aa + ab + ac + ba + bb + …. The last step is simply a reordering of the previous one.

From the final expression above we can see that the Lagrangian now involves only one set of variables, namely the α_i. Once α is computed, w and b follow, and the core quantity mentioned in Section 1.2 — the classification function f(x) = w^T x + b — can then be obtained easily.

    (2) Maximize with respect to α; this is the dual optimization problem itself. After the first step, the Lagrangian no longer contains the variables w and b, only α. From the expressions above, we obtain the dual problem:

max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
s.t.  α_i ≥ 0, i = 1, …, n,   Σ_i α_i y_i = 0

With α* in hand, w = Σ_i α_i y_i x_i follows from the relation above, then b can be computed, and we finally obtain the separating hyperplane and the classification decision function.

    (3) Having minimized L(w, b, α) over w and b and formed the maximization over α, the last step is to use the SMO algorithm to actually solve for the Lagrange multipliers of the dual problem.

The formula above is a maximization problem in the parameters α; w can be recovered from α, and the x_i and y_i are known data. To see how the SMO algorithm is derived, skip ahead to Section 3.5 on the SMO algorithm below.
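As a sketch of steps (1)–(2) (again my own illustration, assuming cvxopt; the real SMO algorithm of step (3) appears in Section 3.5), the dual can also be handed to a generic QP solver, after which w = Σ_i α_i y_i x_i and b are recovered from the support vectors:

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

# Dual as a standard QP:  min (1/2) a^T P a + q^T a,  G a <= h,  A a = b
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j <x_i, x_j>
q = matrix(-np.ones(n))                  # maximizing sum(a) == minimizing -sum(a)
G = matrix(-np.eye(n))                   # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # sum_i a_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

sv = alpha > 1e-6                        # non-zero multipliers = support vectors
w = (alpha * y) @ X                      # w = sum_i a_i y_i x_i
b_val = np.mean(y[sv] - X[sv] @ w)       # b from any support vector: y_s - w^T x_s
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b_val)
```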

So far our SVM is still rather limited: it can only handle the linear case. Next we introduce the kernel function and extend it to nonlinear classification problems.

2.1.4 Linear inseparability

OK, before moving on to the kernel functions of Section 2.2, let us look at some interesting forms that come out of the derivation above. First, about our hyperplane: to classify a data point x, we simply plug x into f(x) = w^T x + b and decide by the sign of the result. In the derivation above we obtained

w = Σ_i α_i y_i x_i

Therefore, the classification function is:

f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b

What is interesting about this form is that to classify a new point x we only need to compute its inner products with the training points (⟨·, ·⟩ denotes the inner product of two vectors). This is crucial: it is the basic premise of the nonlinear generalization via kernels later on. In addition, the so-called support vectors also show up here: the coefficients α_i corresponding to all non-support vectors are zero, so the inner-product computation for a new point really only involves the small number of "support vectors", not all of the training data.

Why are the α_i of the non-support vectors zero? Intuitively, these points "behind" the margin boundary — as we analyzed before — have no influence on the hyperplane; since classification is completely determined by the hyperplane, such unrelated points take no part in the computation of the classification problem and thus have no effect.

Recall the objective we obtained via the Lagrange multipliers in Section 2.1.1:

max_{α_i ≥ 0}  (1/2)‖w‖² − Σ_i α_i ( y_i(w^T x_i + b) − 1 )

Notice that if x_i is a support vector, the part shown in red — the factor ( y_i(w^T x_i + b) − 1 ) — equals 0 (because the functional margin of a support vector is 1), while for a non-support vector the functional margin is greater than 1, so that factor is positive. Since α_i is non-negative and we are maximizing, α_i must then be 0. This is exactly how those non-support-vector points are ruled out.

Thus we obtain a maximum-margin hyperplane classifier — this is the support vector machine. Of course, up to now our SVM is still limited and can only handle the linear case; but having obtained the dual form, generalizing to the nonlinear case through kernels becomes very easy (recall what was said at the start of this section: "the optimal solution is obtained by solving the dual problem, which is the dual algorithm of the support vector machine under the linearly separable condition — its advantages being, first, that the dual problem is often easier to solve, and second, that it naturally introduces the kernel function, which then extends to nonlinear classification problems").

2.2 Kernel functions

2.2.1 Implicit mapping of feature space: kernel function

In fact, most of the time data is not linearly separable, and a hyperplane satisfying the separation condition simply does not exist. Up to now we have only discussed how SVM handles linearly separable data — what about nonlinear data? For the nonlinear case, the SVM approach is to choose a kernel function κ(·, ·) and, by mapping the data into a high-dimensional space, resolve the problem that the data is not linearly separable in the original space.

Specifically, in the linearly inseparable case, the support vector machine first completes the computation in the low-dimensional input space, then uses the kernel function to map the input space into a high-dimensional feature space, and finally constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating nonlinear data that no plane could separate in the original space. As the figure shows, a set of data that cannot be split in two dimensions can be split once it is mapped into three dimensions:

 

Before kernel functions came along, if one wanted to learn a nonlinear relationship with a linear learner, the usual approach was to choose a set of nonlinear features and re-express the data in this new representation. This amounts to applying a fixed nonlinear mapping that sends the data into a feature space and then using a linear learner in that feature space, so the hypotheses considered are functions of the form:

f(x) = Σ_i w_i φ_i(x) + b

Here φ: X → F is a mapping from the input space to some feature space, which means that building a nonlinear learner consists of two steps:

  1. First we use a nonlinear mapping to transform the data into a feature space F;
  2. Then we use the linear learner in the feature space to classify.

Since the dual form is an important property of linear learners, the hypothesis can be expressed as a linear combination of the training points, so the decision rule can be written using only the inner products between the test point and the training points:

f(x) = Σ_i α_i y_i ⟨φ(x_i), φ(x)⟩ + b

If there is a way to compute the inner product ⟨φ(x_i), φ(x)⟩ in the feature space directly as a function of the original input points, it becomes possible to fuse the two steps together when building a nonlinear learner. Such a method of direct computation is called a kernel method:

A kernel is a function K such that for all x, z ∈ X, K(x, z) = ⟨φ(x), φ(z)⟩, where φ is a mapping from X to an inner-product feature space F.

2.2.2 Kernel functions: how to handle nonlinear data

Let us look at an example of a kernel. As shown in the figure below, the two classes of data are distributed as two rings; such data is itself linearly inseparable. How, then, can these two classes be separated (a corresponding three-dimensional plot will be shown further below)?

In fact, the data set shown above was generated from two circles of different radii plus a small amount of noise, so the ideal boundary would be a "circle" rather than a line (hyperplane). If we denote the two coordinates of the two-dimensional plane by X1 and X2, the equation of a conic section (of which the circle is a special case) can be written in the form:

a1·X1 + a2·X1² + a3·X2 + a4·X2² + a5·X1X2 + a6 = 0

Note the form above: if we construct a five-dimensional space whose coordinates take the values Z1 = X1, Z2 = X1², Z3 = X2, Z4 = X2², Z5 = X1X2, then the equation above can clearly be written in the new coordinate system as:

Σ_{i=1}^{5} a_i Z_i + a6 = 0

In the new coordinates Z, this is exactly the equation of a hyperplane! In other words, if we define a mapping φ: R² → R⁵ that maps X to Z according to the rule above, then in the new space the original data become linearly separable, and the linear classification algorithm we derived before can handle them. This is the basic idea of the kernel method for dealing with nonlinear problems.

Before digging into the details of kernels, let us see what the example above looks like after the mapping. Of course, neither you nor I can draw a five-dimensional picture, but since I used a special case to generate the data, the actual equation of the hyperplane here takes a simpler form (a perfect circle centered on the X2 axis):

So it suffices to map the data into a three-dimensional space with Z1 = X1², Z2 = X2², Z3 = X2. The figure below shows the result of the mapping: after rotating the axes appropriately, it is obvious that the data can be separated by a plane (pluskid: the GIF below was first drawn with Matlab and then assembled with ImageMagick):
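A small sketch of this idea (my own, using scikit-learn's make_circles as a stand-in for the ring-shaped data): after explicitly mapping (X1, X2) to (X1, X2, X1², X2², X1·X2), a plain linear classifier should separate the two rings nearly perfectly, while in the original 2-D space it cannot.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric rings with a little noise: not linearly separable in 2-D.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

def phi(X):
    """Explicit 5-D feature map: (X1, X2, X1^2, X2^2, X1*X2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

linear_2d = LinearSVC(C=10.0).fit(X, y)
linear_5d = LinearSVC(C=10.0).fit(phi(X), y)

print("accuracy in the original 2-D space:", linear_2d.score(X, y))       # poor
print("accuracy after the 5-D mapping:    ", linear_5d.score(phi(X), y))  # ~1.0
```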

The kernel approach amounts to taking the original classification function,

f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b,

and mapping it to:

f(x) = Σ_i α_i y_i ⟨φ(x_i), φ(x)⟩ + b

where α can be obtained by solving the following dual problem:

max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩
s.t.  α_i ≥ 0, i = 1, …, n,   Σ_i α_i y_i = 0

Does this solve the problem? It seems so: given nonlinear data, find a mapping φ, map the original data into the new space, and then run linear SVM there. But it is not that simple.

Think about it. Is there something wrong with that?

  • In the earlier example we mapped a two-dimensional space; the new space we chose consisted of all first- and second-order combinations of the original coordinates, giving five dimensions;
  • If the original space is three-dimensional (combinations up to third order), we get 3 (linear) + 3 (quadratic cross terms) + 3 (squares) + 3 (cubes) + 1 (x1·x2·x3) + 2·3 (mixed terms with one variable to the first power and another squared, such as x1·x2²) = 19 dimensions in the new space. The number of dimensions grows explosively, which makes computing φ very difficult — and in the infinite-dimensional case there is no way to compute it at all.

This is where Kernel might come in.

So let us go back to our very simple example. Suppose we have two vectors x1 = (η1, η2)^T and x2 = (ξ1, ξ2)^T, and φ(·) is the mapping to the five-dimensional space described above. The inner product after the mapping is:

⟨φ(x1), φ(x2)⟩ = η1ξ1 + η1²ξ1² + η2ξ2 + η2²ξ2² + η1η2ξ1ξ2

(Note on the derivations: the first expression maps x1 and x2 into the five-dimensional space following the mapping rule described earlier — review that rule above — and then computes the inner product of the mapped x1 and x2 term by term; the second expression, below, is obtained simply by squaring (⟨x1, x2⟩ + 1) and expanding the parentheses, which is easy to work out.)

In addition, we notice that:

(⟨x1, x2⟩ + 1)² = 2η1ξ1 + η1²ξ1² + 2η2ξ2 + η2²ξ2² + 2η1η2ξ1ξ2 + 1

The two expressions have a great deal in common. In fact, if we merely rescale some of the dimensions linearly and add one constant dimension — concretely, if we use the mapping

φ(x1, x2) = (√2·x1, x1², √2·x2, x2², √2·x1x2, 1)^T,

then the inner product ⟨φ(x1), φ(x2)⟩ computes exactly (⟨x1, x2⟩ + 1)². What, then, is the difference between the two approaches?

  1. One is to map it to a higher dimensional space, and then calculate it according to the formula for the inner product;
  2. The other is computed directly in the original low-dimensional space without having to explicitly write the result of the mapping.

(Note on the formulas: of the last two expressions above, the first is just the expansion of the perfect square involving the inner product, and the second is obtained from the first by regrouping terms.)

Recall the dimension explosion mentioned a moment ago: in that situation the former approach is no longer computable, while the latter remains perfectly manageable — even in the infinite-dimensional case.

We call a function that computes the inner product of two vectors in the implicitly mapped space a kernel function. In the example just given, the kernel function is:

κ(x1, x2) = (⟨x1, x2⟩ + 1)²

The kernel function simplifies the inner-product computation in the mapped space — and it just so "happens" that in our SVM all the quantities we need to compute involve the data vectors only through inner products. Compared with what we wrote above, the classification function now becomes:

f(x) = Σ_i α_i y_i κ(x_i, x) + b

where α is obtained from the following dual problem:

max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j κ(x_i, x_j)
s.t.  α_i ≥ 0, i = 1, …, n,   Σ_i α_i y_i = 0

In this way the computational problem is solved, and the result is equivalent to working directly in the high-dimensional space — without ever doing the computation there! Of course, this example is simple enough that I could construct the kernel corresponding to φ(·) by hand; for an arbitrary mapping, constructing the kernel explicitly would be very difficult.
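The equivalence can be checked numerically. The sketch below (my own) compares the explicit 6-dimensional mapping φ(x) = (√2·x1, x1², √2·x2, x2², √2·x1x2, 1) with the direct kernel evaluation κ(x, z) = (⟨x, z⟩ + 1)²:

```python
import numpy as np

def phi(x):
    """Explicit mapping whose inner product reproduces (<x, z> + 1)^2."""
    x1, x2 = x
    return np.array([np.sqrt(2) * x1, x1**2,
                     np.sqrt(2) * x2, x2**2,
                     np.sqrt(2) * x1 * x2, 1.0])

def kernel(x, z):
    """Degree-2 polynomial kernel: computed entirely in the original 2-D space."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])

print(np.dot(phi(x), phi(z)))  # explicit high-dimensional inner product
print(kernel(x, z))            # same number, without ever forming phi(.)
```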

2.2.3 Several common kernel functions

Usually, people choose from a few commonly used kernel functions (different problems and data call for different parameters, which in effect gives different kernels), for example:

  • The polynomial kernel κ(x1, x2) = (⟨x1, x2⟩ + R)^d. The example we just worked through is the special case R = 1, d = 2. Although writing it out is cumbersome and unnecessary, the mapping corresponding to this kernel can actually be written explicitly; the dimension of the mapped space is C(m+d, d), where m is the dimension of the original space.
  • The Gaussian kernel κ(x1, x2) = exp(−‖x1 − x2‖² / (2σ²)). This is exactly the kernel, mentioned at the beginning, that maps the original space into an infinite-dimensional one. However, if σ is chosen large, the weights on the higher-order features decay very quickly, so numerically it is approximately equivalent to a low-dimensional subspace; on the other hand, if σ is chosen very small, almost any data can be mapped so as to be linearly separable — which is not necessarily a good thing, since it can cause severe overfitting. Overall, by tuning the parameter σ, the Gaussian kernel is very flexible, and it is one of the most widely used kernel functions. The figure below shows low-dimensional, linearly inseparable data mapped into a high-dimensional space by the Gaussian kernel (a small numerical sketch of these three kernels follows this list):

  • The linear kernel κ(x1, x2) = ⟨x1, x2⟩. This is simply the inner product in the original space. Its main purpose is to give the problem "in the mapped space" and the problem "in the original space" a unified form (that is, when writing code or formulas we can write one template and just plug in different kernels; with the linear kernel included, the linear and nonlinear cases are formally unified, so there is no need to write one linear version and one nonlinear version).
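Here is the compact sketch promised above (my own), writing the three kernels as plain functions of two vectors; note how shrinking σ makes the Gaussian kernel treat even nearby points as almost orthogonal, which is the overfitting risk mentioned above:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def poly_kernel(x, z, R=1.0, d=2):
    return (np.dot(x, z) + R) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([1.5, 1.0])

print(linear_kernel(x, z))          # 3.5
print(poly_kernel(x, z))            # (3.5 + 1)^2 = 20.25
print(gaussian_kernel(x, z, 1.0))   # moderate similarity
print(gaussian_kernel(x, z, 0.1))   # ~0: tiny sigma makes the points look dissimilar
```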

2.2.4 Essence of kernel function

After all this, readers may still not understand what a kernel is. Let me briefly summarize the following three points:

  1. In practice, we often encounter linear inseparable samples. At this time, we usually map the sample features to a high-dimensional space (as shown in the first figure in section 2.2 above, after mapping to a high-dimensional space, the relevant features are separated, thus achieving the purpose of classification).
  2. But further, if every linear inseparable sample is mapped to a higher dimensional space, then the dimension size can be horribly high (19 dimensions and even infinite dimensions in the example above). So what do we do?
  3. This is where the kernel function comes in. The value of the kernel is that, although it too corresponds to a transformation of the features from low to high dimension, the kernel does its computation beforehand in the low-dimensional space, while the classification effect is, in essence, that of the high-dimensional space; it thereby avoids the complex computation directly in high dimensions mentioned above.

Finally, an example is given to illustrate the intuitive effect of kernel function in solving nonlinear problems.

Suppose you are a farmer with a flock of sheep, and you need to build a fence to protect them in case wolves attack. Where should the fence be built? You probably need to build a "classifier" based on the positions of the sheep and the wolves. Comparing several different classifiers in the figure, we can see that the SVM gives a nearly perfect solution.

This example briefly illustrates, from the side, the advantage of SVM as a nonlinear classifier, whereas the logistic-regression model and the decision-tree model in the comparison both used straight-line boundaries.

OK, I will not expand further here; readers with more interest in kernel functions can also look at this article.

2.3 Using slack variables to handle outliers

At the beginning of the first part of this article we assumed that the data are linearly separable, i.e., that a feasible hyperplane exists that separates the data completely. Later, to deal with nonlinear data, the kernel method of Section 2.2 generalized the original linear SVM so that nonlinear cases can also be processed. Although mapping the original data into a high-dimensional space via φ(·) greatly increases the chance that they become linearly separable, some situations remain hard to handle.

For example, the difficulty may not come from the data being inherently nonlinear but simply from noise. In our original SVM model, the presence of an outlier can have a large impact, because the hyperplane itself is determined by only a handful of support vectors; if an outlier happens to be among those support vectors, the impact is significant. For example:

The blue point circled in black is an outlier: it has strayed out of the half-space it should belong to. If we simply ignored it, the original hyperplane would be fine; with it, the hyperplane gets squeezed over, as shown by the black dashed line in the figure (this is only a schematic; the exact coordinates were not computed precisely), and the margin shrinks accordingly. More seriously, if the outlier moved any further to the right, we would no longer be able to construct any hyperplane that separates the data.

To handle this situation, SVM allows data points to deviate from the hyperplane to some extent. In the figure above, the length of the black solid segment is the distance by which the outlier deviates; if it were moved back by that amount, it would land exactly on the original blue margin boundary, and the hyperplane would not be distorted.

Reader @copper_pku's understanding: "In other words, once slack is allowed, the outliers also belong to the set of support vectors (SVs). Meanwhile, different support vectors have different values of the Lagrange parameter, as shown in the following figure from the paper 'Large Scale Machine Learning':"

For points far from the separating plane the value is 0; for points on the margin boundary the value lies in [0, 1/L], where L is the number of training examples, i.e., the size of the data set; for the outliers and the points inside the margin the value is 1/L. See reference 51 at the end of this article for more details.

OK, back to our problem. Our original constraints were:

y_i(w^T x_i + b) ≥ 1,   i = 1, …, n

Now, taking outliers into account, the constraints become:

y_i(w^T x_i + b) ≥ 1 − ξ_i,   i = 1, …, n

Here the ξ_i ≥ 0 are called slack variables; ξ_i measures the amount by which data point x_i is allowed to violate the functional margin. Of course, if we allowed the ξ_i to be arbitrarily large, then any hyperplane would do. So we add a term to the original objective function to keep the sum of the ξ_i as small as possible:

min  (1/2)‖w‖² + C Σ_i ξ_i

Here C is a parameter that controls the weight between the two terms of the objective ("find the hyperplane with the largest margin" versus "keep the total deviation of the data points as small as possible"). Note that the ξ_i are variables to be optimized, while C is a predefined constant. Written out in full, the problem is:

min_{w,b,ξ}  (1/2)‖w‖² + C Σ_i ξ_i
s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, …, n

Folding the constraints into the objective function as before, we obtain the new Lagrangian:

L(w, b, ξ, α, r) = (1/2)‖w‖² + C Σ_i ξ_i − Σ_i α_i ( y_i(w^T x_i + b) − 1 + ξ_i ) − Σ_i r_i ξ_i

The analysis proceeds as before: after converting the problem into its min–max form, we first minimize L with respect to w, b and ξ by setting ∂L/∂w = 0, ∂L/∂b = 0 and ∂L/∂ξ_i = 0, which gives

w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0,   C − α_i − r_i = 0

Substituting w back into L and simplifying, we obtain the same objective function as before:

max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩

However, since C − α_i − r_i = 0 and r_i ≥ 0 (a requirement on Lagrange multipliers), we have α_i ≤ C, so the whole dual problem now reads:

max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
s.t.  0 ≤ α_i ≤ C, i = 1, …, n,   Σ_i α_i y_i = 0

Compare the results before and after adding slack (correction: the dual formulation should read "maximize"):

The only difference is that the dual variables α now have an upper bound C. The kernelized (nonlinear) form is exactly the same: just replace ⟨x_i, x_j⟩ with κ(x_i, x_j). With this we finally have a complete support vector machine, one that handles both linear and nonlinear problems and tolerates noise and outliers.
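A short sketch (my own, using scikit-learn) of what this upper bound C does in practice: with a small C the classifier tolerates the noisy point and keeps a wide margin, while a very large C behaves almost like the hard-margin case and lets the outlier squeeze the margin.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable clusters plus one noisy point (index 6) labeled +1
# but sitting on the wrong side of the natural boundary.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.5],
              [4.5, 1.5]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 1.0 / np.linalg.norm(w)
    violation = y[6] * (X[6] @ w + clf.intercept_[0])   # y*f(x) for the noisy point
    print(f"C={C:>7}: geometric margin = {margin:.3f}, "
          f"noisy point y*f(x) = {violation:.3f}")
```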

Having written this far, a (not entirely precise) summary is in order. SVM is in essence a classification method whose classification function is defined as f(x) = w^T x + b; seeking the w and b with the largest margin leads to the objective (1/2)‖w‖². Introducing Lagrange multipliers turns the problem into solving for the multipliers α (the solution involves convex quadratic programming and a series of optimization steps), so finding w and b is equivalent to finding α, and α can be solved with the fast learning algorithm SMO. As for the kernel function, it handles the nonlinear case: mapping directly into high dimensions could cause a dimension explosion, so we compute in the low-dimensional space in a way that is equivalent to working in the high-dimensional one.

OK, understanding SVM up to this second level already satisfies most people's curiosity about how it works, but it is not enough for those who want to understand SVM at the level of proofs. Before entering the third level, you need a solid mathematical foundation and the ability to follow proofs — otherwise you will, like me, take quite a beating.

 

The third layer: proving SVM

Honestly, anything that involves proofs and theory is not easy to handle. Most of the time, understanding a thing is not hard, but proving it requires some mathematical skill; and going a step further, inventing it from scratch is harder still (in any era, most people's work builds on the research results of their predecessors; what the predecessors did was pioneering work, often the hardest and the most valuable — they are the true trailblazers. Newton said he stood on the shoulders of giants; that is all the more true for you and me).

As Academician Chen Xiru puts it in Chapter 4, "The Least Squares Method", of his book A Brief History of Mathematical Statistics: many innovations and breakthroughs in research did not come easily. A theorem may have been put forward by one person in a certain period, and today we take it all for granted; but before it was discovered, many top scholars may have racked their brains and spent decades of effort before finally succeeding.

If you want to prove something, you need to know where its foundations lie — which theories it rests on. OK, what follows is essentially the proof of some theorems not covered above, together with the logic and background behind them; in other words, reading notes.

Outline of this part:

  • Section 3.1, linear learners: mainly the perceptron algorithm;
  • Section 3.2, nonlinear learners: mainly Mercer's theorem;
  • Section 3.3, loss functions;
  • Section 3.4, the least squares method;
  • Section 3.5, the SMO algorithm;
  • Section 3.6, a brief discussion of SVM applications.

3.1 Linear learners

3.1.1 Perceptron algorithm

The perceptron algorithm was proposed back in 1956; despite its age, it still has influence today, though it is certainly not optimal — more on that later. What matters is understanding what the algorithm does: through training and trial-and-error it finds an appropriate hyperplane (yes, it is that simple).

Here’s an example. As shown in the following figure, according to our intuition, the red line in the figure is the optimal hyperplane, while the blue line is the one that is continuously trained according to the perceptron algorithm. Finally, if the blue line can move to the red line position through continuous training, the training is successful.

Since the blue line must be trained repeatedly before it becomes the optimal separating hyperplane, how many rounds of training does that take? Novikoff's theorem tells us that when the margin is positive, the perceptron algorithm converges within a finite number of iterations. In other words, Novikoff's theorem proves the convergence of the perceptron algorithm — it gives a bound, so the process cannot go on forever.

  • Novikoff's theorem: if a separating hyperplane exists, the perceptron needs only a finite number of iterations on the training sequence S — the number of mistakes is bounded by (2R/γ)² — before a separating hyperplane is found and the algorithm stops.

Here γ is the margin of the training set with respect to the extended (augmented) weight vector and R = max_{1≤i≤n} ‖x_i‖. The mistake bound shows that the number of iterations is governed by the margin of the training set corresponding to the extended weights (which include the bias).

Incidentally, a word about this so-called extended margin γ: it is the distance from a sample to the separating hyperplane — the quantity from which the maximum classification margin was derived. Remember the beginning of Section 1.3 above? As follows:

Before giving the definition of the geometric margin, we saw (figure above) that for a point x, letting x0 be the foot of the perpendicular from x to the hyperplane, and noting that w is a vector perpendicular to the hyperplane, γ is the distance from the sample x to the hyperplane, and

x = x0 + γ·w/‖w‖

For how the maximum classification margin is then derived, please go back to parts one and two of this article; it will not be repeated here.

At the same time, note that although the perceptron algorithm can, through simple iteration, produce a correctly classifying hyperplane on linearly separable data, the result is not optimal; how to obtain the optimal result is precisely the maximum-margin separating hyperplane discussed in part one above. In addition, a proof of Novikoff's theorem can be found here.
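A minimal sketch of the perceptron update just described (my own illustration): whenever a sample is misclassified, i.e., y_i(w^T x_i + b) ≤ 0, nudge w and b toward it; on linearly separable data the loop stops after finitely many mistakes, as Novikoff's theorem guarantees.

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Train a perceptron; returns (w, b). Assumes labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi               # update rule: w <- w + eta * y_i * x_i
                b += lr * yi                    # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:                       # converged: everything classified correctly
            break
    return w, b

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])
w, b = perceptron(X, y)
print("w =", w, "b =", b, "predictions:", np.sign(X @ w + b))
```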

3.2 Nonlinear learners

3.2.1 Mercer's theorem

Mercer's theorem: let K be a function from R^n × R^n to R (i.e., it maps two n-dimensional vectors to a real number). Then K is a valid kernel (also called a Mercer kernel) if and only if, for any finite training sample {x1, …, xm}, the corresponding kernel (Gram) matrix is symmetric positive semi-definite.

To understand Mercer's theorem, one first needs to know what a positive semi-definite matrix is, and before that, what a positive definite matrix is. A proof of the theorem can be consulted here.
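A quick numerical sketch (my own) of the "symmetric positive semi-definite Gram matrix" condition: build the kernel matrix on a sample and check that its eigenvalues are (numerically) non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # 20 random 3-D points

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Gram (kernel) matrix K_ij = k(x_i, x_j)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= ~0 for a valid (Mercer) kernel

# A function that is NOT a valid kernel, e.g. k(x, z) = -<x, z>, fails the test:
K_bad = -(X @ X.T)
print("min eigenvalue of the invalid 'kernel':", np.linalg.eigvalsh(K_bad).min())  # clearly negative
```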

As @copper_pku says: the kernel function has an important influence on the classification performance of SVM. Finally, here is a tutorial for reference.

3.3 Loss functions

In Section 1.0 of this article there was a sentence to the effect that "SVM is a machine-learning method developed in the mid-1990s on the basis of statistical learning theory; by seeking to minimize the structural risk it improves the generalization ability of the learning machine, minimizing the empirical risk and the confidence bound, so that good statistical regularity can be achieved even when the sample size is small." A reader seeing this for the first time, however, may not know what structural risk and empirical risk are. To understand these two kinds of "risk", we have to start from supervised learning.

Supervised learning is in fact the problem of minimizing an empirical-risk or structural-risk function. The risk function measures how well the model predicts on average, and each individual prediction is measured with a loss function. Supervised learning chooses a model f from the hypothesis space F as the decision function: for a given input X, f(X) gives the corresponding output, and the predicted value f(X) may or may not agree with the true value Y. A loss function is used to measure the degree of prediction error; it is written L(Y, f(X)).

The commonly used loss functions are the following (essentially quoted from Statistical Learning Methods):

  • 0-1 loss: L(Y, f(X)) = 1 if Y ≠ f(X), and 0 if Y = f(X);
  • quadratic (squared) loss: L(Y, f(X)) = (Y − f(X))²;
  • absolute loss: L(Y, f(X)) = |Y − f(X)|;
  • logarithmic (log-likelihood) loss: L(Y, P(Y|X)) = −log P(Y|X).
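A small sketch (my own) of these losses, plus the hinge loss that underlies the SVM, written as plain functions of the true label and the raw prediction (the log loss is omitted here since it needs probability outputs):

```python
import numpy as np

def zero_one_loss(y, fx):        # 1 if the predicted sign disagrees with y, else 0
    return (np.sign(fx) != y).astype(float)

def squared_loss(y, fx):         # (Y - f(X))^2
    return (y - fx) ** 2

def absolute_loss(y, fx):        # |Y - f(X)|
    return np.abs(y - fx)

def hinge_loss(y, fx):           # max(0, 1 - y*f(X)) -- the loss used by SVM
    return np.maximum(0.0, 1.0 - y * fx)

y = np.array([1, 1, -1, -1])          # true labels
fx = np.array([2.0, 0.3, -0.5, 1.5])  # raw classifier outputs f(x)

for name, loss in [("0-1", zero_one_loss), ("squared", squared_loss),
                   ("absolute", absolute_loss), ("hinge", hinge_loss)]:
    print(f"{name:>8}: {loss(y, fx)}")
```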

Seen this way, SVM admits a second reading: optimization plus minimization of a loss. Or, as @Xiafen_Baidu put it, "SVM, Boosting, LR and other algorithms can all be examined, with different pay-offs, from the perspective of their loss functions and optimization algorithms."

OK, for more on statistical learning methods, see this article.

On loss functions, a reader commented below: take a look at Zhang Tong's "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization". The loss functions commonly used by the various algorithms are basically Fisher-consistent, and the classifiers obtained by optimizing these loss functions can be regarded as "proxies" of the posterior probability. Zhang Tong also wrote another paper, "Statistical Analysis of Some Multi-Category Large Margin Classification Methods", which analyzes margin losses in the multi-class case. These two papers give a thorough analysis of the loss functions used in Boosting and SVM.

3.4 The least squares method

3.4.1 What is the least squares method?

Since the least squares method was mentioned a moment ago, let us describe it briefly, drawing on "The Past and Present of the Normal Distribution".

In everyday speech we often say "on average". For example, on average, non-smokers are healthier than smokers; the word "average" is added because there are exceptions to everything — there is always some particular person who smokes and yet, thanks to regular exercise, may be healthier than his non-smoking friends. One of the simplest examples of least squares is the arithmetic mean.

The method of least squares (also called the least-squares method) is a mathematical optimization technique. It finds the best functional fit to the data by minimizing the sum of squared errors. Using least squares, unknown parameters can be obtained easily, and the sum of squared errors between the fitted values and the actual data is made as small as possible. Expressed as a function:

min Σ_i (y_i − f(x_i))²

This method of obtaining estimates by minimizing the sum of squared errors — where an error is the difference between an observed value and the true (fitted) value — is called the least squares method, and the estimates so obtained are called least squares estimates. Of course, taking the sum of squares as the objective function is only one of many possible choices.

The general form of the least squares method can be written as:

objective = Σ (observed value − theoretical value)²

An effective least squares method was published by Legendre in 1805. Its basic idea is that measurements contain error, so the cumulative error over all the equations is

cumulative error = Σ (observed value − theoretical value)²,

and the parameters that minimize this cumulative error can then be solved for:

θ̂ = argmin_θ Σ_i (y_i − f_θ(x_i))²

Legendre made several explanations on the advantages of the least squares method in his paper:

  • Least squares minimizes the sum of squares of errors and establishes a balance between the errors of each equation, thus preventing one extreme error from gaining dominance
  • In the calculation, only partial derivative is required to solve the linear equations, and the calculation process is clear and convenient
  • Least squares yields the arithmetic mean as an estimate

This last point is an important property from the statistical point of view. The reasoning goes as follows: suppose the true value is θ and x1, …, xn are n measurements, the error of each measurement being e_i = x_i − θ. By least squares, the cumulative error is

L(θ) = Σ_i e_i² = Σ_i (x_i − θ)²

Minimizing L(θ) gives θ̂ = (1/n) Σ_i x_i, which is exactly the arithmetic mean.

Since the arithmetic mean is a time-tested method, and the reasoning above shows that the arithmetic mean is a special case of least squares, this illustrates from another angle the soundness of the least squares method and gives us more confidence in it.

After its publication, the least squares method was quickly accepted and widely applied in practical data analysis. Historically, however, some have credited Gauss with its invention: Gauss published the method in 1809 as well and claimed to have been using it for many years. Gauss invented the mathematics of asteroid orbit determination and used least squares in his data analysis to accurately predict the position of Ceres.

All of this may seem to have nothing to do with SVM, the topic of this article, but bear with me. In essence, least squares is a method of parameter estimation; and speaking of parameter estimation, let us start from the univariate linear model.

3.4.2 Solving the least squares problem

What is a univariate linear model? Quoting from here, let me first review a few basic concepts:

  • In supervised learning, if the predicted variable is discrete, we call the task classification (e.g., decision trees, support vector machines); if the predicted variable is continuous, we call it regression.
  • In regression analysis, if only one independent variable and one dependent variable are involved and their relationship can be approximated by a straight line, the analysis is called univariate (simple) linear regression.
  • If the regression involves two or more independent variables and the dependent variable is linear in them, it is called multiple linear regression.
  • In two dimensions, "linear" means a line; in three-dimensional space it means a plane, and in higher-dimensional space a hyperplane…

For the univariate linear regression model, suppose we have n pairs of observations (X1, Y1), (X2, Y2), …, (Xn, Yn). Infinitely many curves could be used to fit these n points in the plane, but we require the sample regression function to fit this set of values as well as possible. Intuitively, the fitted line should lie at the "center" of the sample data.

The criterion for selecting the best-fitting line can therefore be set as: minimize the total fitting error (i.e., the total residual). Three criteria could be considered:

  1. One way to determine the position of the line is to minimize the sum of the residuals. However, it is soon discovered that positive and negative residuals in the "residual sum" cancel each other out.
  2. Another way is to minimize the sum of the absolute values of the residuals. But absolute values are inconvenient to work with.
  3. The least squares principle determines the line by minimizing the sum of the squared residuals. Besides being convenient to compute, the estimator obtained by least squares has excellent properties. (This approach is, however, sensitive to outliers.)

The most commonly used method is ordinary least squares (OLS): the chosen regression model should minimize the sum of squared residuals over all observations — i.e., the square loss function is adopted.

We define the sample regression model as:

Y_i = β̂₀ + β̂₁X_i + e_i,   i = 1, …, n

where e_i is the error of sample (X_i, Y_i).

Next, define the square loss function Q:

Q = Σ_i e_i² = Σ_i (Y_i − β̂₀ − β̂₁X_i)²

The line is then determined by minimizing Q, i.e., by choosing β̂₀ and β̂₁. Viewing Q as a function of β̂₀ and β̂₁, this becomes an extremum problem, which can be solved by taking derivatives.

Setting the partial derivatives of Q with respect to the two parameters to be estimated equal to zero gives:

∂Q/∂β̂₀ = −2 Σ_i (Y_i − β̂₀ − β̂₁X_i) = 0

∂Q/∂β̂₁ = −2 Σ_i X_i (Y_i − β̂₀ − β̂₁X_i) = 0

We know from mathematics that the extreme point of the function is the point where the partial derivative is zero.

Solving these two equations gives:

β̂₁ = Σ_i (X_i − X̄)(Y_i − Ȳ) / Σ_i (X_i − X̄)²  =  (n ΣX_iY_i − ΣX_i·ΣY_i) / (n ΣX_i² − (ΣX_i)²)

β̂₀ = Ȳ − β̂₁X̄

This is the least squares method: find the extremum of the squared loss function. From here you can already sense how solving least squares resembles solving the SVM problem methodologically — in particular, define a loss function and then find the extremum through partial derivatives.
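As a sketch (my own numbers), the closed-form solution above can be checked directly against numpy's generic least-squares routine:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Closed-form OLS estimates from the normal equations above.
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()
print("closed form:    b0 =", b0, " b1 =", b1)

# Cross-check with the generic least-squares solver.
A = np.column_stack([np.ones_like(X), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("np.linalg.lstsq:", coef)   # same (b0, b1) up to floating point error
```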

OK, for more information, please see chapter 4, least squares method in Academician Chen Xiru’s Brief History of Mathematical Statistics.

3.5 The SMO algorithm

Earlier we mentioned the Sequential Minimal Optimization (SMO) algorithm for solving the dual problem, but not how it actually solves it. First, let us look again at the problem that was left open — the dual problem derived above:

max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j κ(x_i, x_j)
s.t.  0 ≤ α_i ≤ C,  i = 1, …, n,   Σ_i α_i y_i = 0

which is equivalent to solving:

min_α  (1/2) Σ_{i,j} α_i α_j y_i y_j κ(x_i, x_j) − Σ_i α_i
s.t.  0 ≤ α_i ≤ C,  i = 1, …, n,   Σ_i α_i y_i = 0

In 1998, John C. Platt of Microsoft Research published the paper "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines", which presents a solution to the above problem: the SMO algorithm. It quickly became the fastest quadratic-programming optimization algorithm for SVM training, performing especially well for linear SVMs and sparse data.

Let’s take a look at this article by John C. Platt to see what the solution to SMO looks like.

3.5.1 Derivation of SMO algorithm

Let us first define the output function from the features to the result:

u = w^T x − b

Note: this u is in substance the same as the f(x) = w^T x + b we defined earlier (only the sign convention for the threshold b differs).

Then restate our original optimization problem; reviewing, it is:

min (1/2)‖w‖²,   s.t.  y_i(w^T x_i − b) ≥ 1,  i = 1, …, n

Taking the derivatives of the Lagrangian with respect to w and b and setting them to zero yields:

w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0

Substituting w into u gives u = Σ_i α_i y_i ⟨x_i, x⟩ − b.

Passing to the dual problem via the Lagrange multipliers, we get:

min_α  Ψ(α) = (1/2) Σ_{i,j} y_i y_j ⟨x_i, x_j⟩ α_i α_j − Σ_i α_i

s.t.:

α_i ≥ 0,  i = 1, …, n

and

Σ_i α_i y_i = 0

Note: the min problem obtained here is in substance the same as the max problem obtained earlier — turning max into min is just a matter of flipping the sign — and the y_i here are the same labels as before, and likewise for y_j.

After adding the slack variables ξ_i, the model becomes:

min  (1/2)‖w‖² + C Σ_i ξ_i,   s.t.  y_i(w^T x_i − b) ≥ 1 − ξ_i,  ξ_i ≥ 0

So ultimately our problem becomes:

min_α  Ψ(α) = (1/2) Σ_{i,j} y_i y_j κ(x_i, x_j) α_i α_j − Σ_i α_i,   s.t.  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0

The problem to be solved next is: minimize the objective function above over the multipliers α_i subject to these constraints. To solve for the multipliers, pick any two of them, α₁ and α₂, fix all the multipliers other than α₁ and α₂, and the objective becomes a function of α₁ and α₂ only. In this way we repeatedly pick two multipliers out of the whole set and iteratively solve the resulting sub-problems, and in the end the original problem is solved.

The objective function of the two-variable sub-problem of the dual can then be written as:

Ψ = (1/2)K₁₁α₁² + (1/2)K₂₂α₂² + s·K₁₂α₁α₂ + y₁v₁α₁ + y₂v₂α₂ − α₁ − α₂ + Ψ_constant

where K_ij = κ(x_i, x_j), s = y₁y₂, and v_i = Σ_{j=3}^n y_jα_jK_ij collects the terms involving the multipliers that are being held fixed.

To solve this sub-problem, the first question is how to choose α₁ and α₂ at each step. In fact, one of the multipliers is chosen as the one violating the KKT conditions most severely, while the other is chosen according to a second criterion derived from the constraints.

The KKT conditions tell us what the value of α_i in the objective function means:

α_i = 0      ⇔  y_i·u_i ≥ 1
0 < α_i < C  ⇔  y_i·u_i = 1
α_i = C      ⇔  y_i·u_i ≤ 1

Here the α_i are the Lagrange multipliers:

  1. In the first case, x_i is correctly classified and lies beyond the margin boundary (a point we already classify with room to spare);
  2. In the second case, x_i is a support vector, lying exactly on the margin boundary;
  3. In the third case, x_i lies between the two margin boundaries (or is misclassified).

An optimal solution must satisfy the KKT conditions, i.e., all three statements above must hold. The following situations violate them:

  • y_i u_i <= 1 but α_i < C: not satisfied, since we should have α_i = C;
  • y_i u_i >= 1 but α_i > 0: not satisfied, since we should have α_i = 0;
  • y_i u_i = 1 but α_i = 0 or α_i = C: not satisfied, since we should have 0 < α_i < C.

In other words, any α_i that violates the KKT conditions needs to be updated; that is the first constraint on the update. In addition, the update must also respect the second constraint, namely the equality Σ_i α_i y_i = 0.

Suppose we choose two multipliers α1 and α2, with values α1^old, α2^old before the update and α1^new, α2^new afterwards. To keep the sum constraint satisfied, the values before and after the update must obey α1^new·y1 + α2^new·y2 = α1^old·y1 + α2^old·y2 = ζ,

where ζ is a constant.

It is awkward to solve for both multipliers at once, so we first solve for the second multiplier's update α2^new, and then express the first multiplier's update α1^new in terms of α2^new.

To solve for α2^new we must first determine its feasible range. Let its upper and lower bounds be H and L respectively, so that L ≤ α2^new ≤ H.

Next, combining the two constraints 0 ≤ α_i ≤ C and α1·y1 + α2·y2 = ζ, let's work out this range for α2^new.

When y1 ≠ y2, the equality constraint means that α1 − α2 is a fixed constant, which yields the bounds shown in the figure below:

Similarly, when y1 = y2, the equality constraint means that α1 + α2 is a fixed constant, which yields the bounds shown in the figure below:

In this way, according to whether y1 and y2 have the same sign or not, the upper and lower bounds of α2^new are:
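Following the standard SMO derivation, these bounds are:

    y_1 \ne y_2: \quad L = \max(0, \; \alpha_2^{old} - \alpha_1^{old}), \qquad H = \min(C, \; C + \alpha_2^{old} - \alpha_1^{old}),
    y_1 = y_2: \quad L = \max(0, \; \alpha_1^{old} + \alpha_2^{old} - C), \qquad H = \min(C, \; \alpha_1^{old} + \alpha_2^{old}).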

Returning to the second (equality) constraint α1·y1 + α2·y2 = ζ and multiplying both sides by y1 (recall y1² = 1), we obtain α1 + s·α2 = ζ·y1,

where s = y1·y2.

So α1 can be expressed in terms of α2, and the objective of the subproblem thereby becomes a function of α2 alone:

Taking the derivative with respect to α2, we get:

After simplification:

Then, substituting the earlier expressions (for α1 and for the outputs u1, u2) into the equation above, we get:

 

 

Let E_i = u_i − y_i (the difference between the predicted value and the true label). Dividing both sides of the equation by η = K11 + K22 − 2K12, we obtain the solution for the single variable α2:

This solution does not yet take the box constraint 0 ≤ α2 ≤ C into account; it is the unclipped solution.

Taking that constraint into account and clipping to [L, H], we obtain the analytical solution for α2^new:

Having found α2^new, we can then solve for α1^new.
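Collecting the pieces, and following Platt's derivation with E_i = u_i − y_i and K_{ij} = K(x_i, x_j), the update formulas read:

    \eta = K_{11} + K_{22} - 2 K_{12},
    \alpha_2^{new,unc} = \alpha_2^{old} + \frac{y_2 (E_1 - E_2)}{\eta},
    \alpha_2^{new} = \begin{cases} H, & \alpha_2^{new,unc} > H \\ \alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H \\ L, & \alpha_2^{new,unc} < L \end{cases}
    \alpha_1^{new} = \alpha_1^{old} + y_1 y_2 \, (\alpha_2^{old} - \alpha_2^{new}).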

So how do we choose the two multipliers α1 and α2?

  • For the first multiplier, α1, look for a sample that falls into one of the three KKT-violation cases listed above;
  • For the second multiplier, α2, choose the one that gives the largest step, i.e., the one that maximizes |E1 − E2|.

And b must satisfy the following conditions:

Update b:

After each optimization of the two multipliers, b and the corresponding E_i values need to be recomputed.
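With the +b sign convention used in this article, the standard threshold update is (b_1 is used when 0 < α1^new < C, b_2 when 0 < α2^new < C, and their average otherwise):

    b_1 = b^{old} - E_1 - y_1 (\alpha_1^{new} - \alpha_1^{old}) K_{11} - y_2 (\alpha_2^{new} - \alpha_2^{old}) K_{12},
    b_2 = b^{old} - E_2 - y_1 (\alpha_1^{new} - \alpha_1^{old}) K_{12} - y_2 (\alpha_2^{new} - \alpha_2^{old}) K_{22}.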

Finally, once all the α_i and b have been updated, the model is determined, and the classification function we proposed at the beginning can be evaluated:

In addition, there is a similar article here that you can refer to.

3.5.2 Steps for the SMO algorithm

To sum up, the main steps of SMO are as follows:

That is:

  1. Step 1: pick a pair of multipliers α_i and α_j, using a heuristic selection method;
  2. Step 2: fix all multipliers other than α_i and α_j, find the extremum of W with respect to this pair, and express α_i in terms of α_j.

Suppose that in one iteration we need to update the two Lagrange multipliers α1 and α2; then this small-scale quadratic programming problem can be written as:
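In the usual notation (with K_{ij} = K(x_i, x_j), and v_i collecting the contribution of the multipliers that stay fixed), this subproblem is:

    \min_{\alpha_1, \alpha_2} \; W(\alpha_1, \alpha_2) = \frac{1}{2} K_{11} \alpha_1^2 + \frac{1}{2} K_{22} \alpha_2^2 + y_1 y_2 K_{12} \, \alpha_1 \alpha_2 - (\alpha_1 + \alpha_2) + y_1 v_1 \alpha_1 + y_2 v_2 \alpha_2
    \text{s.t.} \quad \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{n} y_i \alpha_i = \zeta, \qquad 0 \le \alpha_1, \alpha_2 \le C,

where v_i = \sum_{j=3}^{n} \alpha_j y_j K_{ij}.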

So in each iteration, how do you update the multiplier? To quote two slides here:

Now that we know how to update our multipliers, which multipliers do we pick to update? The specific selection method has the following two steps:

  1. Step 1: first “scan” all multipliers, and take the first one that violates the KKT conditions as the object to update; call it a1;
  2. Step 2: among the remaining multipliers that do not violate the KKT conditions, choose the a2 that maximizes |E1 − E2| for the update, i.e., the one that increases the objective function the most (similar in spirit to gradient descent; here E_i = u_i − y_i, the difference between the predicted value u_i of the function for input x_i and its true class label y_i).

Finally, after each pair of multipliers is optimized, b and the corresponding E_i values must be recomputed.

In summary, the basic idea of the SMO algorithm is to push the chunking method proposed by Vapnik in 1982 to its limit: in each iteration SMO selects only two components, a_i and a_j, to adjust, keeps the other components fixed, and then uses the newly solved a_i and a_j to improve the others. Compared with the usual decomposition algorithms it may need more iterations, but each iteration is so cheap that the algorithm converges quickly in practice, does not need to store the kernel matrix, and involves no matrix operations.

3.5.3 Implementation of SMO algorithm

By this point I believe that, once SVM has clicked for you, you really can derive the whole chain of formulas in your head from beginning to end: the original classification function, maximizing the classification margin, max 1/||w||, min (1/2)||w||^2, convex quadratic programming, the Lagrangian, the transformation into the dual problem, the SMO algorithm, and finally an optimal solution, i.e., an optimal separating hyperplane. Combing through it step by step, you can see why each piece is needed and what it ultimately achieves, as shown in the figure below:

As for the kernel function, it is there to handle the nonlinearly separable case better, while the slack variables are there to tolerate or restrain the few “unruly” points that refuse to be separated cleanly.

Professor Lin Chih-Jen from Taiwan has written the libSVM library, which encapsulates the SVM algorithms; it is well worth a look. In addition, there is an annotated edition of the libSVM source code.

Besides the pseudocode Platt gives in the paper “Fast Training of Support Vector Machines Using Sequential Minimal Optimization”, there is also an SMO implementation here that you can look at.
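To make the procedure concrete, here is a minimal Python sketch of a simplified SMO loop (it picks the second multiplier at random instead of using Platt's full second-choice heuristic and error cache, and defaults to a linear kernel); it is only an illustration of the update equations above, not Platt's production pseudocode.

import numpy as np

def linear_kernel(a, b):
    return a @ b

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5, kernel=linear_kernel):
    """Simplified SMO: returns (alpha, b) for the soft-margin dual.
    X: (n, d) array of samples, y: (n,) array of +1/-1 labels."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

    def f(i):                                   # decision value for sample i
        return (alpha * y) @ K[:, i] + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            # select i if it violates the KKT conditions (within tolerance tol)
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                E_j = f(j) - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                # feasible range [L, H] for alpha[j], depending on the signs of y[i], y[j]
                if y[i] != y[j]:
                    L, H = max(0.0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0.0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                if L == H:
                    continue
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
                if eta <= 0:
                    continue
                # unclipped update for alpha[j], then clip to [L, H]
                alpha[j] = a_j_old + y[j] * (E_i - E_j) / eta
                alpha[j] = min(H, max(L, alpha[j]))
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                # update alpha[i] so that the equality constraint still holds
                alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
                # recompute the threshold b from the two candidate values
                b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] - y[j] * (alpha[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] - y[j] * (alpha[j] - a_j_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

On a small linearly separable toy set, alpha, b = simplified_smo(X, y) should return nonzero alpha only for the support vectors; a new point x is then classified by the sign of sum(alpha[k] * y[k] * kernel(X[k], x) for k in range(len(X))) + b.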

3.6 SVM applications

You may have heard that SVM is widely applied in fields such as text classification, image classification, biological sequence analysis, biological data mining, and handwritten character recognition, but you may not realize that the fields where SVM can be applied successfully go far beyond those currently being explored.

3.6.1 Text classification

A text classification system is not only a natural language processing system, but also a typical pattern recognition system. The input of the system is the text to be classified, and the output of the system is the category associated with the text. Due to space constraints, other more specific content will not be detailed in this article.
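As a small illustration (assuming scikit-learn is available; its LinearSVC uses the LIBLINEAR solver rather than SMO, but the model is the same linear soft-margin SVM with penalty parameter C), a minimal text classifier might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy training data: the input is raw text, the output is its category
texts = ["cheap pills, buy now", "meeting at 10am tomorrow",
         "limited offer, click here", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF features followed by a linear soft-margin SVM (C plays the same role as in the dual above)
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(texts, labels)

print(clf.predict(["free offer, buy pills now"]))   # expected output: ['spam']

In practice the pipeline stays the same; only the corpus, the feature extraction, and the choice of C (and possibly a kernel, via SVC) change.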

OK, the title of this part promises a proof of SVM, but the perceptive reader will have noticed that there is not actually that much proof in it (apologies). What to do? For the rigorous treatment, see the concise and interesting book An Introduction to Support Vector Machines (reference 1 at the end of this article).

 

 

Reader comments

After this article was published, many friends on Weibo offered a lot of comments. Here are some of the better excerpts:

  1. //@zlkysl: I grew up reading July’s blog posts //@zlkysl: It was after reading the last one that I decided to pursue SVM as my research direction. — weibo.com/1580904460/… .
  2. @Zhang Jinhui: “The three levels of SVM — a post I simply had to repost. Actually, Andrew Ng covers support vector machines on Coursera, but he clearly does not focus on them, and I found it hard to fully understand SVM from Ng's treatment alone. I only really began to understand support vector machines after reading this ‘three levels’ article: first get a general idea of the algorithm and how to use it, then go into the principles of why it works, and then on to the proofs. All in all, this article started a months-long phase of my research into support vector machines that continues to this day.” — zhan.renren.com/profile/249… .
  3. @Lonely Watchman: “Finally, the cost function of SVM is the hinge loss; comparing it with the cost functions of other methods shows that their objective functions are actually very similar, so the question becomes: why is SVM so popular? You could add some of the mathematics of VC dimension and error bounds to provide an idea and a direction.” — weibo.com/1580904460/… .
  4. @Xiaofan_Baidu: “In interviews, studying SVM lets you probe a candidate's machine-learning ability from every angle: the objective function, the optimization procedure, parallelization, convergence of the algorithm, sample complexity, applicable scenarios, and tuning experience; but I think studying Boosting and LR is also worthwhile. Moreover, as statistical machine learning has advanced, SVM is now treated simply as the study of the hinge loss as a surrogate for the 0-1 loss, and more general frameworks have been proposed: loss-function research studies the relationship between surrogate loss and Bayes loss, algorithmic-stability research studies the relationship between loss and generalization performance, and convex-optimization research studies how to solve convex objective functions. SVM, Boosting and the like are just specific instances of these general frameworks.”
  5. @Curie monkey: As for the problem of SVM loss function, Take a look at professor Zhang Tong’s Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. The loss functions commonly used in various algorithms are basically fisher consistent, and the classifier obtained by optimizing these loss functions can be regarded as the “proxy” of a posteriori probability. In addition, Professor Zhang Tong also published another paper “Statistical Analysis of Some Multi-Category Large Margin Classification Methods”, which analyzed margin loss in the case of multi-category. These two articles have a thorough analysis of the loss function used in Boosting and SVM.
  6. @Xiaofan_Baidu: SVM uses the hinge loss, which is not differentiable, so it is less convenient than other surrogate losses for optimization and for converting outputs into probabilities. The kernel function is also not that useful: in the big-data era the sample size is huge, and it is hard to imagine storing and computing an n^2 kernel matrix. Besides, nonlinearity is nowadays generally handled by deep learning. // @copper_pku: What are the typical industrial applications of SVM? How is the kernel function chosen in industry, by rule of thumb? How is the SVM training process optimized?
  7. @copper_pku: I personally think July's SVM tutorial could add or revise the following parts: (1) for the explanation of support vectors, combine figures with the Lagrange parameters; the support vectors in the slack (soft-margin) case are not covered. (2) For the SMO algorithm, add the algorithm from Joachims' paper and the way SMO selects its working set, including SMO's convergence test and the earlier conjugate-gradient solution method; although that is an older algorithm, it is very helpful for understanding SMO. Both the optimization and the solution of the model are iterative processes, and adding the historical algorithms gives the picture more depth. — weibo.com/1580904460/… .
  8. // @Liao Lintran: The reason why SGD is better for large training sets is that it is much faster to optimize each iteration using a subset of samples than using the whole training set (especially in the order of millions). 2. If the objective function is convex or pseudo-convex, SGD will almost certainly converge to the global optimal; Otherwise, it converges to the local optimal; 3.SGD generally does not need to converge to the global optimal, and can be stopped immediately as long as a good enough solution is obtained. // @copper_pku: The core idea of SGD: The loss function based on the current W (t) can be calculated every time a sample is obtained. T represents the TTH training, and then the next W (t+1) is updated. W (t+1)= W (t)-(learning rate) * gradient of loss function. This is analogous to the parameter training method in BP neural network. Sample by sample means to process only one sample at a time instead of one batch.
  9. // @copper_pku: From the perspective of lossfunction: Primal problem can be understood as regularization term +lossfunction. The goal of solving is to strike a balance between the two. If the lossfunction is emphasized to be minimum, it will be overfitting, so there is parameter C. // @researcher July: SVM really is to find the optimal value of the objective function under certain limited conditions, that is, the constraint condition. At the same time, in order to reduce the misjudgment rate, try to minimize the loss.
  10. .

I really enjoy this era of open discussion: no one is necessarily right or wrong, but everyone expresses their own understanding. Great!

 

 

References and recommended reading

  1. Introduction to Support Vector Machines, by Nello Cristianini/John Shawe-Taylor;
  2. Support website for introduction to Support Vector Machines: www.support-vector.net/;
  3. An Introduction to Data Mining, by Pang-Ning Tan/Michael Steinbach/Vipin Kumar;
  4. Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber;
  5. A New Method in Data Mining: Support Vector Machines, by Naiyang Deng and Yingjie Tian;
  6. Support Vector Machines: Theory, Algorithm and Extension, by Naiyang Deng and Yingjie Tian;
  7. Support Vector Machine series, PlusKid: blog.pluskid.org/?page_id=68… ;
  8. www.360doc.com/content/07/… ;
  9. Ten classical algorithms of data mining;
  10. A Tutorial on Support Vector Machines for Pattern Recognition, by C. J. C. Burges;
  11. Statistical Learning Methods, by Hang Li;
  12. Statistical Natural Language Processing, chap. 12, text classification.
  13. Introduction to the SVM series, Jasper:www.blogjava.net/zhenandaci/… ;
  14. Implementation and comparison of nearest-neighbour decision and SVM for digit recognition, author unknown;
  15. Hand-deriving SVM from a blank whiteboard (video): www.julyedu.com/video/play/…
  16. Machine learning courses at Stanford original notes: www.cnblogs.com/jerrylead/a… ;
  17. Stanford machine learning course notes: www.cnblogs.com/jerrylead/t… ;
  18. www.cnblogs.com/jerrylead/a… ;
  19. The SMO algorithm mathematical deduction: www.cnblogs.com/jerrylead/a… ;
  20. Knowledge of probability theory and mathematical statistics required in data mining;
  21. Articles about machine learning, can read: www.cnblogs.com/vivounicorn… ;
  22. Mathematics department textbook recommendation: blog.sina.com.cn/s/blog_5e63… ;
  23. Neural Networks and Learning Machines (third edition), by Simon Haykin;
  24. Normal distribution: t.cn/zlH3Ygc;
  25. A Brief History of Mathematical Statistics, by Academician Chen Xiru;
  26. Optimization Theory and Algorithms (2nd edition), edited by Baolin Chen.
  27. A Gentle Introduction to Support Vector Machines in Biomedicine: www.nyuinformatics.org/downloads/s in… ; this PPT is very good: apart from not going deep enough into the convex quadratic programming problem after introducing the Lagrangian dual variables, the rest is excellent, with outstanding illustrations; several figures in this article are quoted from this PPT;
  28. An SVM tutorial PPT from Carnegie Mellon University (CMU): www.autonlab.org/tutorials/s… ;
  29. Libsvm: wenku.baidu.com/link?url=PW… ;
  30. Staff.ustc.edu.cn/~ketang/PPT… ;
  31. Introduction to Support Vector Machines (SVM), By Debprakash Patnai M.E (SSA), www.google.com.hk/url?sa=t&rc… ;
  32. Libsvm recommended by many people: www.csie.ntu.edu.tw/~cjlin/libs… ;
  33. Machine Learning in Action (Chinese edition)
  34. Sequential Minimal Optimization A Fast Algorithm for Training Support Vector Machines Research.microsoft.com/en-us/um/pe… ;
  35. The Nature of Statistical Learning Theory by Vladimir N. Vapnik. Very obscure. Not much to recommend;
  36. Zhang Zhaoxiang, Machine Learning Lecture 5: Support Vector Machines irip.buaa.edu.cn/~zxzhang/co… ;
  37. A theoretical explanation of the VC dimension: www.svms.org/vc-dimensio… ; another explanation of the VC dimension: xiaoxia001.iteye.com/blog/116333… ;
  38. Handout on SVM by Jason Weston from NEC Labs America www.cs.columbia.edu/~kathy/cs47… ;
  39. SVM handout from MIT: www.mit.edu/~9.520/spri… ;
  40. PAC question: www.cs.huji.ac.il/~shashua/pa… ;
  41. Two papers by Zhang Tong from Baidu: Statistical Behavior and Consistency of Classification Methods Based on convex Risk Minimization “home.olemiss.edu/~xdang/676/… , Statistical Analysis of Some Multi-category Large Margin Classification Methods;
  42. jacoxu.com/?p=39;
  43. Matrix Analysis and Applications, By Zhang Xianda, Tsinghua University;
  44. An implementation of the SMO algorithm: blog.csdn.net/techq/artic… ;
  45. A simple overview of the common ideas behind machine-learning interview algorithms: www.cnblogs.com/tornadomeet… ;
  46. Matrix wikipedia page: zh.wikipedia.org/wiki/%E7%9F… ;
  47. The least squares method and its implementation: blog.csdn.net/qll12559671… ;
  48. An introduction to statistical learning methods: blog.csdn.net/qll12559671… ;
  49. www.csdn.net/article/201… ;
  50. A Tutorial on Support Vector Regression: alex.smola.org/papers/2003… ; a concise version of SVR: www.cmlab.csie.ntu.edu.tw/~cyy/learni… .
  51. SVM Org: www.support-vector-machines.org/;
  52. R. Collobert. Large Scale Machine Learning. The Universite Paris VI. PhD thesis 2004:ronan.collobert.com/pub/matos/2… ;
  53. Making Large-Scale SVM Learning Practical: www.cs.cornell.edu/people/tj/p… ;
  54. Text classification with SVM:blog.csdn.net/zhzhl202/ar… ;
  55. Working Set Selection Using Second Order Information for Training Support Vector Machines: www.csie.ntu.edu.tw/~cjlin/pape… ;
  56. SVM Optimization: Inverse Dependence on Training Set Size: icml2008.cs.helsinki.fi/… ;
  57. Large-scale Support Vector Machines: Algorithms and Theory cseweb.ucsd.edu/~akmenon/Re… ;
  58. The concept of convex optimization: cs229.stanford.edu/section/cs2… ;
  59. Convex Optimization, by Stephen Boyd and Lieven Vandenberghe;
  60. Large-scale Non-Linear Classification: Algorithms and Evaluations Zhuang Wang: ijcai13.org/files/tutor… ;
  61. SVM based on SMO algorithm: www.cs.iastate.edu/~honavar/sm… ;
  62. Some SVM-related papers recommended by copper (of course, many of them have already been mentioned in the items above): c.blog.sina.com.cn/profile.php… ;
  63. Online editing Latex formula: private.codecogs.com/latex/eqned… .

 

 

Afterword

OK, this article, from its first draft in May 2012 through the constant revisions since, has become the piece I have spent the longest time writing, put the most effort into, and revised the most, because my goal is that readers with no machine-learning background at all can follow it. So I keep changing it, again and again, unwilling to let a single detail slip. To borrow Hou Jie's words: the world's masterpieces are made of details.

Finally, I would like to thank pluskid and many other friends for their articles and books, which gave me the opportunity to summarize and deepen my understanding. If anything is wrong, readers are welcome to criticize and correct it at any time. Thank you.

 

 

PDF versions of this article

  • November 25, 2013: open the article in Chrome, right-click and choose Print, and in the print dialog set the destination in the upper-left corner to “Save as PDF”; this produced the first PDF: vdisk.weibo.com/s/zrFL6OXKg… .
  • December 7, 2013: my friend Wu Xinlong used the Evernote note-taking app to extract the blog text and edited it into a PDF in Office: vdisk.weibo.com/s/zrFL6OXKg… ; compared with the previous release, complete bookmarks were added.
  • February 18, 2014: my friend Wu Shuzhe retypeset all the formulas in LaTeX and tagged every formula and figure in the article; the LaTeX-version PDF can be downloaded at: vdisk.weibo.com/s/zrFL6OXKg… .
  • January 8, 2015: my friend Chen Sheng produced a second LaTeX version of this SVM article, which can be downloaded at pan.baidu.com/s/1eQrgiOU.

This article will keep being updated. Also, the four PDFs above do not give the best reading experience; if anyone makes a better PDF, please share it with me: weibo.com/julyweibo.

The Nth revision (N > 100).

Breaking news: my new book, The Way to Code: Tips on Interviewing and Algorithms, finally hit the shelves on October 14, 2015! JD.com purchase link: item.jd.com/11786791.ht… . JD.com, Dangdang, Amazon and other major online bookstores now have it in stock. The new book includes this SVM material, and its quality is much higher than the blog version. July, October 29, 2015.