Reproduced from the original article: https://blog.csdn.net/v_july_v/article/details/7624837


Preface

Writing this piece on support vector machines took a great deal of effort, for a simple reason: the subject is not easy to understand in the first place, studying it in depth takes considerable time and energy, and explaining it clearly is harder still. Although some people online have done a good job (see the links at the end of this article), describing the mathematics well is another matter. Thanks to the mathematical proofs from my classmate Bai Shi, I still wanted to attempt it, hoping this article can serve as a complete summary and introduction to support vector machines while remaining easy to follow.

While writing this article I consulted a great deal of material, including "An Introduction to Support Vector Machines", "Statistical Learning Methods", and the support vector machine series by the netizen pluskid, among others. It remains, in essence, a set of study notes with my own understanding and summary added; if anything is inappropriate, I ask for your forbearance. The article first builds a macroscopic understanding of the concept and use of support vector machines, then studies the details of the proofs and principles microscopically, keeping the logic clear and easy to follow.

When reading this article, I recommend using Chrome or a similar browser so that the formulas display properly. Better still, keep paper and pen at hand and work through every theorem and formula derivation in this article yourself (you can also print the web version or the PDF attached at the end), so you can enjoy the pleasure of thinking and calculating anytime, anywhere.

OK, as always: if you have any question, you are welcome to leave a comment at any time. Thank you.

The first layer: understanding SVM

A support vector machine, commonly abbreviated SVM, is essentially a binary classification model. Its basic form is a linear classifier with the largest margin in the feature space.

1.1 Origin of classification criteria: Logistic regression

To understand SVM, we must first understand the concept of linear classifier.

Given some data points belonging to two different classes, we want to find a linear classifier that divides the data into the two classes. If x denotes a data point and y its class label (y can be 1 or -1, representing the two different classes), the learning goal of a linear classifier is to find a separating hyperplane in the n-dimensional data space. The equation of the hyperplane can be written as (the T in w^T denotes transpose):

w^T x + b = 0

Some readers may have doubts about the category of 1 or -1. In fact, the classification criteria of 1 or -1 originated in logistic regression.

Logistic regression aims to learn a 0/1 classification model from the features. This model takes a linear combination of the features as its argument; since that linear combination ranges from minus infinity to plus infinity, the logistic function (sigmoid function) is used to map it into (0, 1), and the mapped value is interpreted as the probability that y = 1.

The hypothesis function is

h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))

where x is the n-dimensional feature vector and g is the logistic function

g(z) = 1 / (1 + e^(-z))

whose graph is the familiar S-shaped curve. You can see that it maps (-infinity, +infinity) to (0, 1), and the hypothesis h_θ(x) is interpreted as the probability that the features correspond to y = 1.

Thus, when we want to determine which class a new feature vector belongs to, we only need to compute h_θ(x): if it is greater than 0.5 we predict y = 1, otherwise we predict y = 0.

Note, further, that h_θ(x) depends only on θ^T x: if θ^T x > 0, then h_θ(x) > 0.5. The function g(z) is just a mapping; the real class decision is still made by θ^T x. Moreover, when θ^T x >> 0, h_θ(x) approaches 1, and when θ^T x << 0, h_θ(x) approaches 0. Starting from θ^T x, then, the goal of the model is to make θ^T x much greater than 0 for the features of the y = 1 training examples and much less than 0 for the features of the y = 0 examples. Logistic regression is about learning θ so that the positive examples' scores are far greater than 0 and the negative examples' scores are far less than 0, and achieving this on all training examples.

Next, let us transform logistic regression slightly. First, replace the result labels y = 0 and y = 1 with y = -1 and y = 1. Then, in θ^T x = θ_0 + θ_1 x_1 + ... + θ_n x_n, replace θ_0 with b, and replace θ_1 x_1 + ... + θ_n x_n with w_1 x_1 + ... + w_n x_n (i.e. w^T x). In this way θ^T x = w^T x + b. In other words, apart from y changing from y = 0 to y = -1, the formal representation of the linear classification function f(x) = w^T x + b is no different from that of logistic regression.

Further, we can simplify the hypothesis by replacing g(z) with a mapping directly onto y = -1 and y = 1:

g(z) = 1 if z >= 0, and g(z) = -1 otherwise.
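As a minimal sketch of the two mappings just described (the smooth logistic function used by logistic regression, and the hard -1/+1 thresholding used by SVM), here is a small Python snippet; the sample values of z are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), mapping (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def svm_label(z):
    """Simplified mapping used by SVM: threshold directly onto the labels -1 / +1."""
    return np.where(z >= 0, 1, -1)

z = np.array([-3.0, -0.2, 0.0, 0.5, 4.0])
print(sigmoid(z))    # probabilities of y = 1 under logistic regression
print(svm_label(z))  # hard labels under the SVM convention
```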

1.2. An example of linear classification

To take a simple example, as shown in the figure below, suppose we have a two-dimensional plane containing two different types of data, denoted by circles and crosses. Since these data are linearly separable, the two types can be separated by a line. This line corresponds to a hyperplane; the data points on one side of it all have y = 1, and those on the other side all have y = -1.

The hyperplane can be denoted by f(x) = w^T x + b. When f(x) = 0, x is a point on the hyperplane; points with f(x) > 0 correspond to data points with y = 1, and points with f(x) < 0 correspond to data points with y = -1, as shown in the figure below:

Note: some texts define the feature-to-output function as f(x) = w^T x - b rather than the f(x) = w^T x + b defined here. The two are essentially the same. Why? Because neither choice affects the final optimization result. You will see below that when we turn to optimization we set y f(x) to 1 for convenience; whether y f(x) is y(w^T x + b) or y(w^T x - b) has no effect on the formula max 1/||w|| that we optimize.

(A reader, Mare_Desiderii, asked after reading the above definition: for the SVM functional margin, γ̂ = y(w^T x + b) = y f(x), are the values of y only 1 and -1? Is y's only function to ensure that the functional margin is non-negative? Is that true? Of course not; see the 43rd-floor reply in the comments for details.)

Of course, in some cases, or rather most of the time, the data are not linearly separable, and a hyperplane satisfying these conditions does not exist at all; how to handle that problem is discussed later. Here we start the derivation from the simplest case and assume the data are linearly separable, i.e. such a hyperplane does exist.

In other words, during classification, when a new data point x is encountered, we substitute x into f(x): if f(x) < 0, x is assigned the class -1, and if f(x) > 0, x is assigned the class 1.

The next question is: how do we determine this hyperplane? Intuitively, it should be the line that best separates the two types of data, and the criterion for "best" is that the line is as far as possible from the data on both sides. So we must look for the hyperplane with the maximum margin.

1.3. Functional margin and geometric margin

For a hyperplane w^T x + b = 0, |w^T x + b| can represent how far the point x is from the hyperplane, and by checking whether the sign of w^T x + b agrees with the sign of the class label y we can tell whether the classification is correct. Therefore the sign of y(w^T x + b) can be used to judge, or express, the correctness of the classification. This leads to the concept of the functional margin.

Define the functional margin (denoted γ̂) as:

γ̂ = y (w^T x + b) = y f(x)

The functional margin of the hyperplane (w, b) with respect to the training data set T is the minimum of the functional margins over all sample points (x_i, y_i) in T (where x is the feature, y the result label, and i indexes the i-th sample):

γ̂ = min γ̂_i   (i = 1, ..., n)

However, the functional margin defined this way has a problem: if w and b are scaled proportionally (say, changed to 2w and 2b), the functional margin of f(x) becomes twice its original value even though the hyperplane itself has not changed. So the functional margin alone is not enough.

In fact, we can impose a normalization constraint on the normal vector w, which leads to the quantity that really measures the distance from a point to the hyperplane: the geometric margin.

Suppose that for a point x, its perpendicular projection onto the hyperplane is x0. Since w is a vector perpendicular to the hyperplane, let γ denote the distance from the sample x to the hyperplane, as shown in the figure below:

From plane geometry, we have

x = x0 + γ * w / ||w||

where ||w|| is the 2-norm of w (a norm is a measure of length, analogous to the modulus of a vector), and w/||w|| is the unit vector in the direction of w (a vector divided by its length is a unit vector).

Since x0 is a point on the hyperplane, f(x0) = 0, i.e. w^T x0 + b = 0. Substituting x0 = x - γ w/||w|| into the hyperplane equation, multiplying out, and using w^T w = ||w||^2, we can compute:

γ = (w^T x + b) / ||w|| = f(x) / ||w||

To obtain the absolute value of γ, multiply it by the corresponding class label y; this gives the geometric margin (denoted γ̃):

γ̃ = y γ = γ̂ / ||w||

Comparing the two definitions: the geometric margin is the functional margin divided by ||w||. The functional margin y(w^T x + b) = y f(x) is really just |f(x)|, a measure of margin defined for convenience, whereas the geometric margin |f(x)|/||w|| is, intuitively, the true distance from the point to the hyperplane.
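A minimal numeric sketch of the two margins, assuming a toy hyperplane (w, b) and one labelled point chosen purely for illustration:

```python
import numpy as np

# Hypothetical hyperplane parameters and one labelled sample (illustration only).
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])
y = -1  # class label in {+1, -1}

f_x = np.dot(w, x) + b                                     # f(x) = w^T x + b
functional_margin = y * f_x                                # y * f(x)
geometric_margin = functional_margin / np.linalg.norm(w)   # y * f(x) / ||w||

# Scaling (w, b) by 10 multiplies the functional margin by 10,
# but leaves the geometric margin (the true distance) unchanged.
print(functional_margin, geometric_margin)
print(10 * functional_margin, (10 * functional_margin) / np.linalg.norm(10 * w))
```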

1.4. Definition of Maximum Margin Classifier

When classifying a data point, the greater the “distance” between the hyperplane and the data point, the greater the confidence of classification. Therefore, in order to make the classification as confident as possible, the selected hyperplane needs to be able to maximize this “interval” value. This interval is half the Gap in the figure below.

From the previous analysis, the functional margin is not suitable for maximizing, because once the hyperplane is fixed, w and b can be scaled proportionally to make the functional margin γ̂ arbitrarily large while the hyperplane itself stays the same. The geometric margin, however, is divided by ||w||, so scaling w and b leaves γ̃ unchanged; it changes only when the hyperplane changes, which makes it the more appropriate notion of margin. In other words, the "margin" in the maximum-margin separating hyperplane we are looking for is the geometric margin.

Therefore, the objective function of the maximum margin classifier can be defined as:

max γ̃

At the same time, some conditions must be satisfied. From the definition of the margin, we have

s.t.  y_i (w^T x_i + b) = γ̂_i >= γ̂,   i = 1, ..., n

where s.t. means "subject to", introducing the constraint conditions.

Recall the definition of the geometric margin: γ̃ = γ̂ / ||w||. If we set the functional margin γ̂ equal to 1 (this is done for convenience of derivation and optimization, and it has no effect on the optimization of the objective function; as to why, see the comments on this article), then γ̃ = 1/||w||, and the objective function above becomes

max 1/||w||,   s.t.  y_i (w^T x_i + b) >= 1,  i = 1, ..., n

That is, under the corresponding constraints, maximize the value 1/||w||, and 1/||w|| is precisely the geometric margin.

As shown in the figure below, the solid line in the middle is the optimal hyperplane found; its distance to each of the two dashed lines is the geometric margin γ̃, and the distance between the two dashed margin boundaries equals 2γ̃. The points on the dashed margin boundaries are the support vectors. Since these support vectors lie exactly on the dashed boundaries, they satisfy y(w^T x + b) = 1 (remember that we set the functional margin to 1? In the previous section: for convenience of derivation and optimization, we may set γ̂ = 1), while every point that is not a support vector obviously satisfies y(w^T x + b) > 1.

OK, so far, it is understood that the first layer of SVM, for those friends who only care about how to use SVM is enough, there is no need to go further into the deeper principle.

The second layer: going deeper into SVM

2.1. From linearly separable to linearly inseparable

2.1.1 From the primal problem to the dual problem

Now consider the objective function obtained previously:

max 1/||w||,   s.t.  y_i (w^T x_i + b) >= 1,  i = 1, ..., n

Since maximizing 1/||w|| is equivalent to minimizing (1/2)||w||^2, the objective function above is equivalent to the following (w moves from the denominator to the numerator, so the original max problem becomes a min problem; clearly the two are equivalent):

min (1/2)||w||^2,   s.t.  y_i (w^T x_i + b) >= 1,  i = 1, ..., n

Because the objective function is now quadratic and the constraints are linear, this is a convex quadratic programming problem, which can be solved with an off-the-shelf QP (Quadratic Programming) optimization package. In one sentence: under certain constraints, make the target optimal and the loss minimal.
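The text only says an off-the-shelf solver can handle this QP; as a quick, hedged illustration (not the specific QP package referred to), here is a sketch using scikit-learn's SVC, which solves this same maximum-margin problem internally. The toy data are made up, and a very large C approximates the hard-margin case discussed so far.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem min 1/2 ||w||^2.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
```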

In addition, owing to the special structure of this problem, it can also be transformed, via Lagrange duality, into an optimization problem over the dual variables: the optimal solution of the primal problem is obtained by solving the dual problem equivalent to it. This is the dual algorithm of support vector machines under the linearly separable condition. The advantages of this approach are twofold: the dual problem is often easier to solve, and the kernel function can be introduced naturally, which then extends the method to nonlinear classification problems.

So what is Lagrange duality? Briefly, we add a Lagrange multiplier α_i to each constraint and define the Lagrangian function (the Lagrangian folds the constraints into the objective function, so that a single function expression states our problem completely):

L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [ y_i (w^T x_i + b) - 1 ]

Then let

θ(w) = max_{α_i >= 0} L(w, b, α)

It is easy to verify that when some constraint is violated, for example y_i (w^T x_i + b) < 1, then clearly θ(w) = infinity (just let α_i go to infinity). When all constraints are satisfied, the optimal value is θ(w) = (1/2)||w||^2, which is exactly the quantity we originally wanted to minimize.

Therefore, minimizing (1/2)||w||^2 under the constraints is equivalent to minimizing θ(w) directly (of course, the constraints α_i >= 0, i = 1, ..., n remain), because if the constraints are not met, θ(w) is infinite and certainly not the minimum we are looking for.

Writing this out, the objective function becomes:

min_{w,b} θ(w) = min_{w,b} max_{α_i >= 0} L(w, b, α) = p*

Here p* denotes the optimal value of this problem, which equals that of the original problem. Solving it directly is not easy, because we would have to handle w and b first while α_i sits on inequality constraints. So let us swap the positions of the min and the max:

max_{α_i >= 0} min_{w,b} L(w, b, α) = d*

The new problem after the swap is the dual of the original problem, and its optimal value is denoted d*. We always have d* <= p*, and under certain conditions the two are equal, in which case the original problem can be solved indirectly by solving the dual.

In other words, the reason for passing from the min-max of the primal problem to the max-min of the dual problem is, first, that d* is a lower bound on (an approximation to) p*, and second, that once converted into the dual, the problem is much easier to solve.

So we first find the minimum of L with respect to w and b, and then maximize L with respect to α.

2.1.2 KKT conditions

It was said above that "d* <= p*, and the two are equal when certain conditions are satisfied". The "certain conditions" referred to are the KKT conditions.

In general, an optimization mathematical model can be expressed in the following standard form:

Where, f(x) is the function to be minimized, h(x) is the equality constraint, g(x) is the inequality constraint, p and q are the number of equality constraint and inequality constraint respectively.

At the same time, understand the following two points:

  • The concept of convex optimization: X ⊆ R^n is a convex set and f is a convex function; convex optimization is the problem of finding a point x* in X such that f(x*) <= f(x) for every x in X.
  • The significance of the KKT conditions: they are the necessary and sufficient conditions for an optimal solution of this kind of nonlinear programming problem.

The KKT condition refers to the minimum point x* in the standard form of the optimization mathematical model above must satisfy the following conditions:

It can be shown that the problem we are dealing with here satisfies the KKT conditions (first, Slater's condition is satisfied, and second, f and the g_i are differentiable, i.e. L is differentiable in w and b), so now we turn to solving this second problem.

In other words, the primal problem, satisfying the KKT conditions, has been transformed into the dual problem. Solving this dual problem is divided into three steps: first minimize L(w, b, α) with respect to w and b, then maximize with respect to α, and finally use the SMO algorithm to solve for the Lagrange multipliers of the dual problem.

2.1.3 Three steps of solving the dual problem

(1) First fix α and minimize L with respect to w and b. We take the partial derivatives of L with respect to w and b and set them to zero, i.e. ∂L/∂w = 0 and ∂L/∂b = 0, which gives

w = Σ_i α_i y_i x_i,    Σ_i α_i y_i = 0

Substituting these results back into the earlier L:

L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [ y_i (w^T x_i + b) - 1 ]

we get:

L(w, b, α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

Warning: some readers may ask where this derivation comes from. To be honest, the intermediate steps are somewhat involved, as shown in the figure below.

As jerrylead notes: the step from the fourth-to-last line to the third-to-last line uses the transpose operation of linear algebra; since α_i and y_i are real numbers, their transposes equal themselves. The step to the second-to-last line uses the expansion (a + b + c + ...)(a + b + c + ...) = aa + ab + ac + ba + bb + ..., and the last step is just a reordering of the previous one.

From the final expression above we can see that the Lagrangian now contains only one variable, namely α_i; once α_i is computed, w and b can be obtained, and the core quantity mentioned in Section 1.2 above, the classification function f(x) = w^T x + b, can then easily be worked out.

(2) Maximizing with respect to α is the optimization problem of the dual form. After the first step above, the Lagrangian no longer contains the variables w and b, only α. From the formulas above, the dual problem is:

max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
s.t.  α_i >= 0,  i = 1, ..., n;   Σ_i α_i y_i = 0

Thus, once the α_i achieving the maximum are found, we can solve for w via w = Σ_i α_i y_i x_i, then obtain b, and finally get the separating hyperplane and the classification decision function.
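As a hedged sketch of recovering (w, b) from a dual solution: the alphas below are made-up toy values, not actually solved for here, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical dual solution for a toy problem (alphas are made up, not solved here).
X = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
alpha = np.array([0.25, 0.0, 0.25, 0.0])   # non-zero entries play the role of support vectors

# w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X

# b is recovered from any support vector x_s (alpha_s > 0): y_s (w^T x_s + b) = 1  =>  b = y_s - w^T x_s
s = np.argmax(alpha > 0)
b = y[s] - X[s] @ w

print("w =", w, "b =", b)
```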

(3) Having minimized L(w, b, α) with respect to w and b and posed the maximization with respect to α, the remaining step is to use the SMO algorithm to solve for the Lagrange multipliers of the dual problem.

The formula above is the problem of maximizing W(α) with respect to the parameters α, where the x_i and y_i are all known quantities. To see how the SMO algorithm is derived, skip ahead to Section 3.5, "SMO algorithm", below.
So far, our SVM is still weak and can only deal with linear cases. Next, we will introduce kernel function and then extend it to nonlinear classification problems.

2.1.5. Linear inseparability

OK, before moving on to the kernel functions of Section 2.2, let us look at some interesting forms that emerge from the derivation above. First, regarding our hyperplane: to classify a data point x, we actually substitute x into f(x) = w^T x + b, compute the value, and classify according to its sign. From the earlier derivation we obtained

w = Σ_i α_i y_i x_i

Therefore, the classification function is:

f(x) = Σ_i α_i y_i <x_i, x> + b

What is interesting about this form is that to predict a new point x we only need to compute its inner product with the training data points (<., .> denotes the inner product of vectors). This is crucially important and is the basic premise of the nonlinear generalization via kernels later. In addition, the so-called support vectors also show up here: in fact, all the coefficients α_i corresponding to points that are not support vectors are zero, so the inner products for a new point only need to be computed against a small number of "support vectors" rather than all the training data.

Why are the α_i corresponding to non-support-vectors zero? Intuitively, these points "behind" the boundary, as we analysed before, have no effect on the hyperplane. Since the classification is determined entirely by the hyperplane, these irrelevant points do not participate in the computation of the classification problem and therefore have no influence on it.
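A minimal sketch of prediction using only the support vectors and inner products, reusing the hypothetical toy values from the previous sketch (the offset b = -2.0 is the value computed there):

```python
import numpy as np

def decision_function(x_new, X, y, alpha, b):
    """f(x) = sum_i alpha_i y_i <x_i, x> + b, summing only over support vectors."""
    sv = alpha > 1e-8                      # non-support vectors have alpha_i = 0
    return np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b

# Hypothetical toy solution (same values as the previous sketch).
X = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
alpha = np.array([0.25, 0.0, 0.25, 0.0])
b = -2.0

x_new = np.array([3.5, 2.0])
print(np.sign(decision_function(x_new, X, y, alpha, b)))  # predicted class in {-1, +1}
```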

Recall our objective function obtained by the Lagrange Multiplier in Section 2.1.1:

Notice that if x_i is a support vector, the part shown in red in the equation above, y_i (w^T x_i + b) - 1, equals 0 (because the functional margin of a support vector equals 1), while for a non-support-vector the functional margin is greater than 1, so that part is greater than zero; since α_i is non-negative, to maximize the expression α_i must equal 0. This is the fate of those non-support-vector points.

Everything from Section 1.4 up to this point yields a maximum margin classifier, which is exactly what the support vector machine is. Of course, so far our SVM is still rather weak and can only handle the linear case; however, once we have the dual form, generalizing to nonlinear cases via kernels becomes very easy (recall what was said at the beginning of this section: "the optimal solution is obtained by solving the dual problem; this is the dual algorithm of support vector machines under the linearly separable condition. The advantages of this approach are that the dual problem is often easier to solve, and that the kernel function can be introduced naturally, extending the method to nonlinear classification problems").

2.2 Kernel function

2.2.1 Implicit mapping of feature space: kernel function

In fact, most of the time data are not linearly separable, and a hyperplane satisfying the condition simply does not exist. Above we have seen how SVM handles the linearly separable case; how does it handle nonlinear data? For the nonlinear case, SVM's approach is to choose a kernel function κ(⋅,⋅) and, by mapping the data into a high-dimensional space, resolve the problem of linear inseparability in the original space.

More specifically, because the training examples generally do not appear in isolation but always in the form of pairwise inner products, and because in the dual form of learning the number of adjustable parameters does not depend on the number of input attributes, replacing the inner product with a suitable kernel function implicitly maps the training data, nonlinearly, into a high-dimensional space without increasing the number of tunable parameters (provided, of course, that the kernel can compute the inner product of the images of two input feature vectors).

Concretely, in the linearly inseparable case, the support vector machine first completes the computation in the low-dimensional space, then maps the input space into a high-dimensional feature space through the kernel function, and finally constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating nonlinear data that cannot be separated in the plane itself. As shown in Figure 7-7, a pile of data that cannot be divided in two-dimensional space can be mapped into three-dimensional space and then divided:

Before kernel functions enter the picture, if we want to learn a nonlinear relation with a linear learner in the original setting, we must choose a set of nonlinear features and rewrite the data in that new representation. This amounts to applying a fixed nonlinear mapping that sends the data into a feature space in which the linear learner is then used. Therefore, the hypothesis set considered consists of functions of the following type:

Here ϕ : X->F is a mapping of the input space to a feature space, which means that building a nonlinear learner has two steps:
  1. First, a nonlinear mapping is used to transform the data into a feature space F,
  2. Then linear learner classification is used in the feature space.
Since the duality form is an important property of the linear learner, it means that the hypothesis can be expressed as a linear combination of training points, so the decision rule can be expressed by the inner product of test points and training points:

If there were a way to compute the inner product <φ(x_i), φ(x)> directly in the feature space, just as a function of the original input points, it would be possible to fuse the two steps together into a single nonlinear learner. Such a direct-computation method is called the kernel function method:

A kernel is a function K such that for all x, z in X, K(x, z) = <φ(x), φ(z)>, where φ is a mapping from X to an (inner product) feature space F.

2.2.2 Kernel Function: How to deal with nonlinear data

Let us take an example of a kernel. As shown in the figure below, the two types of data are distributed in the shape of two circles; such data are inherently linearly inseparable. How, then, can we separate these two types of data (a corresponding three-dimensional plot will be shown below)?

In fact, the data set above was generated by adding a small amount of noise to two circles of different radii, so the ideal boundary would be a "circle" rather than a line (hyperplane). If we denote the two coordinates of this two-dimensional plane by X1 and X2, we know that the equation of a conic (a circle is a special case of a conic) can be written as:

a1 X1 + a2 X1^2 + a3 X2 + a4 X2^2 + a5 X1 X2 + a6 = 0

Notice the form above: if we construct a five-dimensional space with coordinates Z1 = X1, Z2 = X1^2, Z3 = X2, Z4 = X2^2, Z5 = X1 X2, then clearly the equation above can be written in the new coordinate system as:

a1 Z1 + a2 Z2 + a3 Z3 + a4 Z4 + a5 Z5 + a6 = 0

With respect to the new coordinates Z, this is exactly the equation of a hyperplane! That is, if we apply the mapping φ : R^2 → R^5, mapping X to Z according to the rules above, the original data become linearly separable in the new space and can be handled with the linear classification algorithm derived earlier. This is the basic idea of the kernel method for dealing with nonlinear problems.

Before going into the Kernel details, let’s look at an intuitive example of this mapping. Of course, you and I might not be able to draw a 5-dimensional space, but since I used a special case to generate the data here, specifically, the actual equation of my hyperplane here looks like this (a perfect circle centered on the X2 axis) :

So here we only need to map to a three-dimensional space with Z1 = X1^2, Z2 = X2^2, Z3 = X2. The figure below shows the result of the mapping; after a suitable rotation of the axes, it is obvious that the data can be separated by a plane (pluskid: the GIF animation below was drawn frame by frame with Matlab and assembled with ImageMagick):
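A small sketch of this kind of explicit lift, assuming made-up two-ring data (the same general shape as in the figure, not the author's actual data set):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two noisy circles in 2-D: radius ~1 (class -1) and radius ~2 (class +1). Made-up data.
def ring(radius, n):
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + 0.1 * rng.standard_normal(n)
    return np.c_[r * np.cos(theta), r * np.sin(theta)]

X = np.vstack([ring(1.0, 50), ring(2.0, 50)])
y = np.r_[-np.ones(50), np.ones(50)]

# Explicit lift to 3-D, in the spirit of Z1 = X1^2, Z2 = X2^2, Z3 = X2 described above.
Z = np.c_[X[:, 0] ** 2, X[:, 1] ** 2, X[:, 1]]

# In the lifted space the two rings are separable by a plane of the form Z1 + Z2 = constant.
print("inner ring, mean of Z1+Z2:", (Z[y == -1, 0] + Z[y == -1, 1]).mean())
print("outer ring, mean of Z1+Z2:", (Z[y == 1, 0] + Z[y == 1, 1]).mean())
```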

Using a kernel function amounts to replacing the original classification function

f(x) = Σ_i α_i y_i <x_i, x> + b

with its mapped version

f(x) = Σ_i α_i y_i <φ(x_i), φ(x)> + b

where α is obtained by solving the following dual problem:

max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j <φ(x_i), φ(x_j)>
s.t.  α_i >= 0,  i = 1, ..., n;   Σ_i α_i y_i = 0

Does that solve the problem? It seems so: given nonlinear data, find a mapping φ, map the original data into the new space, and then run linear SVM. But it is not that simple! Think about it for a moment and the problem appears: in the example above we mapped from a two-dimensional space, and the new space, consisting of all first- and second-order combinations of the original coordinates, already has five dimensions; if the original space were three-dimensional, the new space would have 19 dimensions. This number explodes, which makes φ(⋅) very hard to compute, and in the infinite-dimensional case it cannot be computed at all. This is where the kernel comes in.

Let us return to the simple example from the start. Suppose we have two vectors x1 and x2 in the original two-dimensional space, and φ(⋅) is the mapping to the five-dimensional space described above; then the inner product after the mapping is <φ(x1), φ(x2)>, computed as follows:

(Note on the formulas: both derivations above assume the five-dimensional mapping described earlier. Review that mapping rule and then look again: the first derivation actually computes the coordinate-wise products of x1 and x2 and then combines them, while the second simply expands the square and removes the parentheses. Both are easy to verify.)

In addition, we notice that the expansion of

(<x1, x2> + 1)^2

is very similar to the result above. In fact, if we simply scale some of the dimensions linearly and add one constant dimension, this expression corresponds exactly to another mapping whose inner product

<φ(x1), φ(x2)>

gives the same result. So what is the difference between the two approaches?

  1. One first maps into the higher-dimensional space and then computes the inner product by its formula there;
  2. The other computes directly in the original low-dimensional space, without ever writing out the mapped result explicitly.

(Note on the formulas: of the last two formulas above, the first is the perfect square containing the inner product, which can be expanded term by term; the second is then obtained from the first by grouping terms back together.)

Recall the dimension explosion of the mapping just mentioned: in the infinite-dimensional case the former approach can no longer be computed at all, whereas the latter handles it with ease.

We call the function that computes the inner product of two vectors in the implicitly mapped space a kernel function. In the example just given, our kernel function is

κ(x1, x2) = (<x1, x2> + 1)^2

The kernel function simplifies the inner product computation in the mapped space, and it just so "happens" that in our SVM all the quantities that need to be computed from data vectors appear as inner products. Compared with the formula written just above, our classification function now is:

where α is obtained from the following dual problem (with the kernel κ in place of the inner product):

In this way the computational problem is solved: direct computation in the high-dimensional space is avoided, and the result is equivalent! Of course, because the example here is so simple, the kernel corresponding to φ(⋅) can be constructed by hand; for an arbitrary mapping, constructing the corresponding kernel explicitly would be very difficult.
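A hedged sketch of the equivalence just described. The explicit map below is my own choice of a map whose inner product matches (<x1, x2> + 1)^2; it differs from the text's 5-dimensional map only by linear scalings of some coordinates plus a constant dimension, exactly the adjustment the text mentions.

```python
import numpy as np

def phi(x):
    """One explicit mapping whose inner product equals (<x1, x2> + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def kernel(a, b):
    return (np.dot(a, b) + 1) ** 2

a = np.array([0.3, -1.2])
b = np.array([2.0, 0.7])

print(np.dot(phi(a), phi(b)))  # inner product computed after the explicit mapping
print(kernel(a, b))            # same value, computed directly in the original 2-D space
```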

2.2.3 Several kernel functions

In practice people usually choose from a few common kernels (and, depending on the problem and the data, choose different parameters, which in effect gives different kernels), for example the three below (a small sketch of these three kernels follows the list):

  • Polynomial kernel: κ(x1, x2) = (<x1, x2> + R)^d. Clearly, the example we just gave is a special case of this polynomial kernel (R = 1, d = 2). Although it is cumbersome and unnecessary, the mapping this kernel corresponds to can actually be written out explicitly; the dimension of that mapped space is C(m+d, d), where m is the dimension of the original space.
  • Gaussian kernel: κ(x1, x2) = exp(-||x1 - x2||^2 / (2σ^2)). This is precisely the kernel, mentioned at the beginning, that maps the original space into infinitely many dimensions. However, if σ is chosen very large, the weights on the higher-order features decay very quickly, so the map effectively (numerically, approximately) behaves like one into a low-dimensional subspace; on the other hand, if σ is chosen very small, arbitrary data can be mapped so as to become linearly separable, which is not always a good thing, since it can lead to serious overfitting. Overall, by tuning the parameter σ, the Gaussian kernel is remarkably flexible and is one of the most widely used kernel functions. The example below maps low-dimensional, linearly inseparable data into a higher-dimensional space using a Gaussian kernel:
  • Linear kernel: κ(x1, x2) = <x1, x2>. This is simply the inner product in the original space. Its main purpose is to unify, in form, the "mapped-space problem" and the "unmapped problem": when writing code or formulas, we write one template or general expression and plug in different kernels, so the linear and nonlinear cases share the same form and we do not have to write one linear version and one nonlinear version.
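Here is the small sketch of the three kernels referred to above; the parameter defaults and test vectors are my own choices for illustration.

```python
import numpy as np

def polynomial_kernel(x1, x2, R=1.0, d=2):
    """(<x1, x2> + R)^d; with R = 1, d = 2 this is the example from Section 2.2.2."""
    return (np.dot(x1, x2) + R) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    """exp(-||x1 - x2||^2 / (2 sigma^2)); the width sigma controls flexibility."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def linear_kernel(x1, x2):
    """Plain inner product in the original space."""
    return np.dot(x1, x2)

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(a, b), gaussian_kernel(a, b), linear_kernel(a, b))
```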

2.2.4. The nature of kernel functions

After all this, the reader may still not understand what a kernel is. Let me briefly summarize the following three points:
  1. In practice we often encounter linearly inseparable samples; the usual approach is then to map the sample features into a higher-dimensional space (as in the figure at the start of Section 2.2 above: after mapping into the higher-dimensional space, the relevant features are pulled apart, achieving the classification).
  2. But going further: if every linearly inseparable case is handled by mapping to a higher dimension, that dimension can become horribly large (the 19-dimensional, or even infinite-dimensional, example above). So what do we do?
  3. This is where the kernel function comes in. The value of the kernel is that, although it too performs the conversion of features from low to high dimensions, it does all of its computation beforehand in the low-dimensional space while exhibiting the classification effect, in substance, in the high dimension; as stated above, this avoids the complex computation that direct work in the high-dimensional space would require.

Finally, an example here is used to illustrate the intuitive effect of kernel function to solve nonlinear problems.

Suppose you are a farmer and you have a flock of sheep, but you need to build a fence around them in case wolves attack them. But where should the fence go? You will most likely need to create a “classifier” based on the location of the herd and the Wolf pack. Comparing the different classifiers shown in the following figure, we can see that SVM provides a perfect solution.

This example illustrates, from the side, the advantage of SVM's use of a nonlinear classifier, whereas the logistic-regression model and the decision-tree model here both produced straight-line boundaries.

OK, I will not go into more detail here; if you are more interested in kernel functions, you can also take a look at this article.

2.3. Using slack variables to handle outliers

At the very beginning of this article, when discussing support vector machines, we assumed that the data are linearly separable, i.e. that a feasible hyperplane exists that separates the data completely. Later, in Section 2.2, the kernel method was used to extend the original linear SVM so that nonlinear cases can also be handled. Although mapping the raw data into a high-dimensional space via φ(⋅) greatly increases the chance that they become linearly separable, some situations remain hard to handle.

For example, it may not be because the data itself is non-linear, but simply because the data is noisy. For such data points that deviate far from the normal position, we call them outliers. In our original SVM model, the existence of outliers may cause great influence, because the hyperplane itself is composed of only a few support vectors. If there are outliers in these support vectors, the impact is significant. For example:

The blue point circled in black is an outlier; it has strayed out of the half-space it should belong to. If it were simply ignored, the original separating hyperplane would be fine, but because of this outlier the hyperplane is forced to be squeezed sideways into the black dashed line in the figure (this is only a sketch; the exact coordinates were not computed rigorously), and the margin shrinks correspondingly. Of course, an even worse case is that if this outlier moved a little further to the right, we would no longer be able to construct any hyperplane that separates the data.

To deal with this, SVM allows data points to deviate from the hyperplane to some extent. In the figure above, the black solid segment is the distance by which the outlier has deviated; if it were moved back by that distance, it would land exactly on the original blue margin boundary without distorting the hyperplane.

In other words, once slack is allowed, outliers also belong to the support vectors (SV), and different support vectors have different values of the Lagrange parameter, as shown in the following figure from the paper "Large Scale Machine Learning":

For points far from the classification plane, the value is 0; for points on the margin boundary, the value lies in [0, 1/L], where L is the number of training examples, i.e. the size of the data set; for outliers and interior points, the value is 1/L. (For more details, see reference 51 at the end of this article.)

OK, back to our problem. The original constraints were:

y_i (w^T x_i + b) >= 1,   i = 1, ..., n

Taking outliers into account, the constraints become:

y_i (w^T x_i + b) >= 1 - ξ_i,   i = 1, ..., n

where the ξ_i >= 0 are called slack variables, corresponding to the amount by which the data point x_i is allowed to deviate from the functional margin. Of course, if we allow the ξ_i to be arbitrarily large, then any hyperplane qualifies, so we add a term to the original objective function that keeps the total slack small:

min (1/2)||w||^2 + C Σ_i ξ_i

where C is a parameter controlling the weight of the two terms in the objective function ("find the hyperplane with the largest margin" versus "keep the total deviation of the data points small"). Note that the ξ_i are among the variables to be optimized, while C is a predetermined constant. The full version reads:

min (1/2)||w||^2 + C Σ_i ξ_i
s.t.  y_i (w^T x_i + b) >= 1 - ξ_i,   ξ_i >= 0,   i = 1, ..., n

Add the constraints or constraints to the objective function using the previous method to obtain a new Lagrange function, as shown below:

The analysis proceeds as before: after switching to the dual problem, we minimize L with respect to w, b and ξ by setting the corresponding partial derivatives to zero, substitute the results back into L, and simplify, obtaining the same objective function as in the original problem. However, since the multipliers r_i >= 0 of the constraints ξ_i >= 0 satisfy C - α_i - r_i = 0 (a condition from the Lagrange multipliers), we get α_i <= C, so the whole dual problem is now written as:

max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j <x_i, x_j>
s.t.  0 <= α_i <= C,  i = 1, ..., n;   Σ_i α_i y_i = 0

As you can see, the only difference is that the dual variables α now have an extra upper bound C. The kernelized nonlinear form is exactly the same: just replace <x_i, x_j> with κ(x_i, x_j). At this point we finally have a complete support vector machine that handles both the linear and the nonlinear case and tolerates noise and outliers.
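As a hedged illustration of the role of C, here is a sketch using scikit-learn's SVC, whose soft-margin formulation matches the C-parameterised problem above; the overlapping toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two overlapping Gaussian blobs: not perfectly separable, so slack is needed.
X = np.vstack([rng.normal(0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more margin violations (more support vectors, wider margin);
    # large C punishes slack heavily and pushes toward hard-margin behaviour.
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```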

Having written this far, a rough summary is in order. SVM is essentially a classification method whose classification function is defined by w^T x + b; to determine w and b we seek the largest margin, which leads to the objective (1/2)||w||^2; Lagrange multipliers are then introduced, turning the problem into solving for the multipliers α (the solution involves a series of optimization techniques such as convex quadratic programming), so that solving for w and b is equivalent to solving for α, and α in turn can be solved with the fast learning algorithm SMO. As for the kernel function, it handles the nonlinear case: mapping directly into a high-dimensional space could cause a dimension explosion, so instead we compute in the low-dimensional space in a way that is equivalent to working in the high-dimensional one.

OK, understanding SVM down to this second layer already satisfies most people's curiosity about how it works, but it is not enough for those who want to understand SVM at the level of its proofs. Before entering the third level of understanding, you need a solid mathematical foundation and the ability to follow logical proofs, or you will, like me, take quite a beating.

The third layer: proving SVM

Honestly, anything involving proofs and theory is generally not easy to deal with. Most of the time, understanding a thing is not hard; proving it requires some mathematical skill; and harder still is inventing it from nothing in the first place. That is always the most difficult and the most valuable work (in any era, most people's results build on the research of their predecessors; the predecessors' pioneering work is the hardest part, and they deserve to be called true pioneers. Newton said he stood on the shoulders of giants; the same is all the more true for you and me).

As Academician Chen Xiru put it in Chapter 4, "The Least Squares Method", of his book A Brief History of Mathematical Statistics: many innovations and breakthroughs in research come only with great difficulty. A theorem may have been proposed by one person in a certain period, and today we take it entirely for granted, but before it was discovered, many top scholars may have devoted decades of effort, even their whole lives, before someone finally succeeded.

To prove something, you need to know where it is based, that is, what theories it is based on. OK, the following content is basically a proof of some theorems not mentioned above, including the logic behind them, the background of the source, etc., or reading notes.

This section introduces

  • Section 3.1, linear learners, mainly describes the perceptron algorithm;
  • Section 3.2, nonlinear learners, mainly presents Mercer's theorem;
  • Section 3.3, loss functions;
  • Section 3.4, the least squares method;
  • Section 3.5, the SMO algorithm;
  • Section 3.6 briefly discusses applications of SVM.

3.1. Linear learner

3.1.1 Perceptron algorithm

The perceptron algorithm dates back to the 1950s. It is quite old, and it is certainly not optimal, as we will see shortly. One thing, though, is worth understanding: what the algorithm is for, namely constant trial and error in the hope of finding a suitable hyperplane (yes, it is that simple).
Let us take an example. As shown in the figure below, intuition tells us that the red line in the figure is the optimal hyperplane, while the blue line is what the perceptron algorithm produces as it keeps training. If, through continued training, the blue line can move to the position of the red line, the training has succeeded.

How many iterations does it take for the blue line to become a correctly classifying hyperplane? The Novikoff theorem tells us that when the margin is positive, the perceptron algorithm converges within a finite number of iterations; in other words, the Novikoff theorem proves the convergence of the perceptron algorithm, giving a bound that guarantees the process does not go on forever.

  • Novikoff theorem: if a separating hyperplane exists, then iterating over the sequence S, the perceptron finds a separating hyperplane after a number of mistakes bounded by (2R / γ̃)^2, and the algorithm stops. Here R = max ||x_i||, and γ̃ is the expanded margin. According to this mistake bound, the number of iterations is related to the margin of the training set corresponding to the expanded (bias-included) weight vector.

Incidentally, let us explain this so-called expanded margin: γ̃ is simply the distance from a sample to the separating boundary, i.e. the geometric margin from the earlier derivation of the maximum margin. Do you remember the figure at the start of Section 1.3 above? As follows:

Before defining the geometric margin we first noted that, as shown in the figure above, for a point x whose perpendicular projection onto the hyperplane is x0, since w is a vector perpendicular to the hyperplane, γ is the distance from the sample x to the separating boundary, and we have x = x0 + γ w/||w||.

For how the maximum margin is then derived, please go back to the first and second parts of this article.
Note also that although the perceptron algorithm can, by simple iteration, produce a correctly classifying hyperplane on linearly separable data, it is not the optimal result; obtaining the optimal result is precisely the search for the maximum-margin hyperplane described in the first part above. In addition, a proof of the Novikoff theorem can be found here.

3.2 nonlinear learner

3.2.1. Mercer’s Theorem

Mercer's theorem: suppose K is a function from R^n x R^n to R (i.e. it takes two n-dimensional vectors to a real number). Then K is a valid kernel (also called a Mercer kernel) if and only if, for any training sample {x_1, ..., x_m}, the corresponding kernel matrix is symmetric positive semi-definite.

To understand Mercer's theorem one has to know what a positive semi-definite matrix is, and to understand that one first has to know what a positive definite matrix is. Matrix theory runs deep, and I have not yet mastered it thoroughly myself; I will rewrite this part once I have it clear (in the meantime, I can recommend the book Matrix Analysis and Applications, which I have been reading). There is also a proof of this theorem, which you can take a look at.
As @copper_pku said: the kernel function plays an important role in the classification performance of SVM. Finally, here is a tutorial you can consult.

3.3. Loss function

In Section 1.0 of the original article there is a sentence to the effect that "the support vector machine is a machine learning method, developed in the mid-1990s and based on statistical learning theory, which improves the generalization ability of the learning machine by seeking the minimum structural risk, minimizing the empirical risk and the confidence bound, so that good statistical regularities can be obtained even when the sample size is small." A first-time reader, however, may not know what structural risk and empirical risk are. To understand these two kinds of "risk", we have to start from supervised learning.

Supervised learning is, in fact, a problem of optimizing an empirical-risk or structural-risk functional. The risk functional measures, in an average sense, how good the model's predictions are, and the quality of each individual prediction is measured by a loss function. Supervised learning selects a model f from the hypothesis space F as the decision function; for a given input X, f(X) gives the corresponding output Y, and this predicted value f(X) may or may not agree with the true value Y. The loss function is written L(Y, f(X)).

The commonly used loss functions are as follows (basically quoted from Statistical Learning Methods) :

Thus, SVM has a second understanding, that is, optimization + minimum loss, or as @Xiafen_ Baidu said, “SVM, Boosting, LR and other algorithms may have different gains from the perspective of loss function and optimization algorithm”.

OK, see this article for more questions about statistical learning methods.

About the loss function, as described in the reader comments below: Check out Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization by Zhang Tong. The loss functions commonly used in various algorithms basically have Fisher consistency, and the classifier obtained by optimizing these loss functions can be regarded as the “proxy” of posterior probability. In addition, Zhang Tong also wrote another paper Statistical Analysis of Some Multi-Category Large Margin Classification Methods, which analyzed margin loss under the condition of multiple classification. These two articles thoroughly analyze the loss function used by Boosting and SVM.

3.4. Least square method

3.4.1 What is the least square method?

Since the least square method was mentioned at the beginning of this section, let’s make it a little bit more simple by quoting from Normal Distribution.

We often say, colloquially, "generally speaking" or "on average". If, on average, non-smokers are healthier than smokers, the reason I must add "on average" is that there are exceptions to everything: there is always some particular person who smokes but, because he exercises regularly, is healthier than his non-smoking friends. One of the simplest examples of least squares is the arithmetic mean.

The least squares method (also called the method of least squares) is a mathematical optimization technique. It finds the best functional match for the data by minimizing the sum of squared errors. With least squares, unknown quantities can be obtained easily such that the sum of squared errors between the fitted values and the observed data is minimized. Expressed as a formula, the quantity to be minimized is the sum, over the observations, of (observed value - theoretical value)^2.

The method of finding an estimate by minimizing the sum of squares of the errors “the error is, of course, the difference between the observed value and the actual value” is called the least square method, and the estimate obtained by using the least square method is called the least square estimate. Of course, taking the sum of squares as the target function is just one of many ways to do it.

The general form of the least square method can be expressed as:

A workable least squares method was published by Legendre in 1805. Its basic idea is that since measurements carry error, the cumulative error over all the equations is

cumulative error = sum over observations of (observed value - theoretical value)^2

and we solve for the parameters that minimize this cumulative error.

Legendre explained the advantages of the least square method in his paper:

  • The least square method minimizes the sum of the squares of errors and establishes a balance between the errors of the various equations, thus preventing any one extreme error from dominating
  • In the calculation, only the partial derivative is required to solve the linear equations, and the calculation process is clear and convenient
  • The least square results in the arithmetic mean as an estimate

This last point is a statistically important property. The reasoning is as follows: assume the true value is θ, and x1, ..., xn are n measurements, each with error e_i = x_i - θ. By least squares, the accumulated error is

L(θ) = Σ_i e_i^2 = Σ_i (x_i - θ)^2

Minimizing L(θ) gives θ̂ = (1/n) Σ_i x_i, which is exactly the arithmetic mean.

Since the arithmetic mean is a tested method, and the above reasoning shows that the arithmetic mean is a special case of the least square method, so it shows the superiority of the least square method from another Angle, which makes us more confident in the least square method.

The least squares method was quickly accepted and widely used in data-analysis practice. However, the method has also been credited to Gauss, and for good reason: Gauss published the least squares method in 1809 as well and claimed to have used it for years. Gauss invented the mathematics of asteroid orbit determination, used least squares in his data analysis, and accurately predicted the location of Ceres.

All this may seem to have nothing to do with SVM, the topic of this article, but bear with me and let me continue. In essence, the least squares method is a parameter estimation method, and to talk about parameter estimation we have to start from linear models.

3.4.2 Solution of least square method

What is a linear unary model? Let me quote from here, but let me tease out a few basic concepts:
  • In supervised learning, if the predicted variables are discrete, we call them classification (such as decision tree, support vector machine, etc.); if the predicted variables are continuous, we call them regression.
  • In regression analysis, if only one independent variable and one dependent variable are included, and the relationship between them can be approximated by a straight line, the regression analysis is called unary linear regression analysis.
  • If the regression analysis includes two or more independent variables and the relationship between dependent variables and independent variables is linear, it is called multiple linear regression analysis.
  • In two dimensions, "linear" means a straight line; in three-dimensional space it means a plane; in higher-dimensional space it means a hyperplane...
For the unary linear regression model, suppose n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn) are obtained from the population. These n points in the plane can be fitted by infinitely many curves; we require the sample regression line to fit this set of values as well as possible. Taken together, this line should pass as closely as possible through the center of the sample data.
The criterion for selecting the best fitting curve can be determined as: to minimize the total fitting error (that is, the total residual). There are three criteria to choose from:
  1. One way to pin down the line is to minimize the sum of residuals, but it soon becomes apparent that positive and negative residuals cancel each other out.
  2. Another way is to minimize the sum of absolute residuals, but absolute values are awkward to work with analytically.
  3. The principle of least squares is to determine the line by "minimizing the sum of squared residuals". Besides being convenient to compute, the least squares estimator has excellent statistical properties. (It is, however, quite sensitive to outliers.)

The most commonly used method is Ordinary Least Square (OLS) : the regression model chosen should minimize the sum of squares of residuals of all observations, i.e., the Square loss function is used.

We define the sample regression model as:

Y_i = β0 + β1 X_i + e_i

where e_i is the error of the sample (X_i, Y_i).
Next, define the squared loss function Q:

Q = Σ_i e_i^2 = Σ_i (Y_i - β̂0 - β̂1 X_i)^2

We then choose the line that minimizes Q, that is, we determine β̂0 and β̂1. Treating them as the variables of Q turns this into a problem of finding an extremum, which can be solved by differentiation.
Take the partial derivatives of Q with respect to the two parameters to be estimated and set them to zero (from calculus, the extremum of the function is attained where the partial derivatives vanish):

∂Q/∂β̂0 = 0,   ∂Q/∂β̂1 = 0

Solving these two equations gives:

β̂1 = ( n Σ X_i Y_i - Σ X_i Σ Y_i ) / ( n Σ X_i^2 - (Σ X_i)^2 ),    β̂0 = Ȳ - β̂1 X̄

This is the least squares solution, the extreme point of the squared loss function. From this you can see how similar solving least squares is to solving the SVM problem, in particular the pattern of defining a loss function and then finding the extremum via partial derivatives.
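A small sketch of the unary least-squares solution obtained by setting the partial derivatives to zero; the sample data are made up, roughly following Y = 2 + 3X plus noise.

```python
import numpy as np

# Made-up observations (X_i, Y_i) roughly following Y = 2 + 3X + noise.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, 30)
Y = 2.0 + 3.0 * X + rng.normal(0, 1.0, 30)

# Closed-form estimates from dQ/dbeta0 = 0 and dQ/dbeta1 = 0:
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

print("beta0 =", beta0, "beta1 =", beta1)
# Cross-check with numpy's polynomial least-squares fit (returns [beta1, beta0]).
print(np.polyfit(X, Y, deg=1))
```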

OK, for more information, please refer to chapter 4, least Square method, in a Brief History of Mathematical Statistics by Academician Chen Xiru.

3.5 SMO algorithm

Above, we mentioned the sequential minimal optimization (SMO) algorithm for solving the dual problem, but not how it actually works. Let us first restate the problem we left off with:

Equivalent to solving:

In 1998, John C. Platt of Microsoft Research published the paper "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines", giving a solution to the problem above: the SMO algorithm. It quickly became the fastest quadratic-programming optimization algorithm for SVM training, especially for linear SVMs and sparse data.

Next, let’s refer to this article by John C. Platt to see what the SMO solution looks like.

3.5.1 Derivation of SMO algorithm

Let us first define the output function from features to result:

u = w^T x - b

Note: despite the different sign convention for b, this u is essentially the same as the f(x) = w^T x + b we defined earlier.

Next, we restate our original optimization problem in this notation:

min (1/2)||w||^2   s.t.  y_i (w^T x_i - b) >= 1,  i = 1, ..., n

Setting the derivatives to zero gives

w = Σ_i y_i α_i x_i

and substituting this into u = w^T x - b expresses u in terms of the α_i.

Converting to the dual problem via Lagrange multipliers yields:

min_α  (1/2) Σ_i Σ_j y_i y_j K(x_i, x_j) α_i α_j - Σ_i α_i
s.t.  α_i >= 0,  i = 1, ..., n
and
Σ_i y_i α_i = 0

Note: the min obtained here is essentially the same as our earlier max; flipping the sign turns the min problem into the max problem, and y_i and y_j play the same roles as before.

After adding the slack variables ξ_i, the model becomes:

min (1/2)||w||^2 + C Σ_i ξ_i   s.t.  y_i (w^T x_i - b) >= 1 - ξ_i,  ξ_i >= 0

so that ultimately our problem becomes:

min_α  (1/2) Σ_i Σ_j y_i y_j K(x_i, x_j) α_i α_j - Σ_i α_i
s.t.  0 <= α_i <= C,  i = 1, ..., n;   Σ_i y_i α_i = 0

What remains is to find the minimum of this objective function over α = (α_1, ..., α_n). To solve for these multipliers, pick any two of them, α_1 and α_2, fix all the multipliers other than α_1 and α_2, and the objective becomes a function of α_1 and α_2 only. By repeatedly picking two multipliers out of the pile and iteratively solving these sub-problems, we eventually solve the original problem.

The objective function of this sub-problem of the dual can then be expressed as a function of α_1 and α_2 alone, where K_ij denotes K(x_i, x_j) and the terms involving the other, fixed multipliers are collected into constants.

To solve this sub-problem, the first question is how to choose α_1 and α_2 each time. In fact, the first multiplier is chosen as the one that violates the KKT conditions most severely, and the second multiplier is then chosen according to another criterion (described below).

According to the KKT conditions, the values of α_i in the objective function have the following meanings:

α_i = 0        <=>  y_i u_i >= 1
0 < α_i < C    <=>  y_i u_i = 1
α_i = C        <=>  y_i u_i <= 1

Here:
  1. The first case means x_i is correctly classified, lying on the correct side beyond the margin boundary (a point we know to be correctly classified);
  2. The second case means x_i is a support vector, lying exactly on the margin boundary;
  3. The third case means x_i lies between the two margin boundaries.

The optimal solution must satisfy the KKT conditions, i.e. all three of the cases above must hold. The following situations indicate a violation:

  • y_i u_i <= 1 but α_i < C: not satisfied, since it should have α_i = C
  • y_i u_i >= 1 but α_i > 0: not satisfied, since it should have α_i = 0
  • y_i u_i = 1 but α_i = 0 or α_i = C: not satisfied, since it should have 0 < α_i < C

That is, whenever some α_i violates the KKT conditions, it needs to be updated; this is the first constraint on the update. In addition, updates must preserve the second constraint, Σ_i y_i α_i = 0.

Suppose the two multipliers chosen are α_1 and α_2. Before the update they are α_1_old and α_2_old, and after the update they are α_1_new and α_2_new. To keep the constraint Σ_i y_i α_i = 0, the values before and after the update must satisfy:

α_1_new y_1 + α_2_new y_2 = α_1_old y_1 + α_2_old y_2 = ζ

where ζ is a constant.

It is awkward to solve for both multipliers at once, so we first solve for the second multiplier $\alpha_2^{new}$, and then express the solution $\alpha_1^{new}$ in terms of $\alpha_2^{new}$.

To solve for $\alpha_2^{new}$, we first need its feasible range. Call its upper and lower bounds H and L, so that:

$$L\le\alpha_2^{new}\le H$$

Next, combining the two constraints $0\le\alpha_i\le C$ and $\alpha_1y_1+\alpha_2y_2=\zeta$, we work out the feasible range of $\alpha_2^{new}$.

When $y_1\ne y_2$, the equality constraint gives $\alpha_1-\alpha_2=\zeta$ (a constant), so:

$$L=\max(0,\ \alpha_2^{old}-\alpha_1^{old}),\qquad H=\min(C,\ C+\alpha_2^{old}-\alpha_1^{old})$$

When $y_1=y_2$, the constraint becomes $\alpha_1+\alpha_2=\zeta$, and similarly:

$$L=\max(0,\ \alpha_1^{old}+\alpha_2^{old}-C),\qquad H=\min(C,\ \alpha_1^{old}+\alpha_2^{old})$$

Geometrically, $(\alpha_1,\alpha_2)$ is confined to a diagonal line segment inside the box $[0,C]\times[0,C]$.

In this way, depending on whether $y_1$ and $y_2$ have the same sign or not, the lower and upper bounds of $\alpha_2^{new}$ are:

$$L,H=\begin{cases}\max(0,\ \alpha_2^{old}-\alpha_1^{old}),\ \ \min(C,\ C+\alpha_2^{old}-\alpha_1^{old}) & y_1\ne y_2\\ \max(0,\ \alpha_1^{old}+\alpha_2^{old}-C),\ \ \min(C,\ \alpha_1^{old}+\alpha_2^{old}) & y_1=y_2\end{cases}$$
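
A minimal sketch of these bounds in Python (the variable names are my own):

```python
def alpha2_bounds(alpha1_old, alpha2_old, y1, y2, C):
    """Feasible interval [L, H] for the new alpha2, per the case split above."""
    if y1 != y2:
        L = max(0.0, alpha2_old - alpha1_old)
        H = min(C, C + alpha2_old - alpha1_old)
    else:
        L = max(0.0, alpha1_old + alpha2_old - C)
        H = min(C, alpha1_old + alpha2_old)
    return L, H
```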

Looking back at the second constraint, $\alpha_1y_1+\alpha_2y_2=\zeta$, and multiplying both sides by $y_1$ (recall $y_1^2=1$), we get

$$\alpha_1=\gamma-s\alpha_2$$

where $\gamma=\zeta y_1=\alpha_1^{old}+s\alpha_2^{old}$ and $s=y_1y_2$. So $\alpha_1$ can be expressed in terms of $\alpha_2$, and the subproblem's objective becomes a function of $\alpha_2$ alone:

$$W(\alpha_2)=\frac{1}{2}K_{11}(\gamma-s\alpha_2)^{2}+\frac{1}{2}K_{22}\alpha_2^{2}+sK_{12}(\gamma-s\alpha_2)\alpha_2+y_1v_1(\gamma-s\alpha_2)+y_2v_2\alpha_2-(\gamma-s\alpha_2)-\alpha_2+\text{const}$$


Taking the derivative with respect to $\alpha_2$ and setting it to zero gives:

$$(K_{11}+K_{22}-2K_{12})\,\alpha_2=s(K_{11}-K_{12})\gamma+y_2(v_1-v_2)+1-s$$

To simplify, define

$$E_i=u_i-y_i\ \ (\text{the difference between the predicted value and the true label}),\qquad \eta=K_{11}+K_{22}-2K_{12}$$

Substituting $v_1$, $v_2$, $E_i$ and $\eta$ into the equation above and simplifying yields:

$$\eta\,\alpha_2^{new}=\eta\,\alpha_2^{old}+y_2(E_1-E_2)$$

Dividing both sides by $\eta$ gives the single-variable solution:

$$\alpha_2^{new,\,unclipped}=\alpha_2^{old}+\frac{y_2(E_1-E_2)}{\eta}$$

This solution does not yet take the box constraint $L\le\alpha_2\le H$ into account, so it is the unclipped solution. Clipping it to the feasible interval gives the final analytical solution:

$$\alpha_2^{new}=\begin{cases}H, & \alpha_2^{new,\,unclipped}>H\\ \alpha_2^{new,\,unclipped}, & L\le\alpha_2^{new,\,unclipped}\le H\\ L, & \alpha_2^{new,\,unclipped}<L\end{cases}$$

Having found $\alpha_2^{new}$, we can then solve for $\alpha_1^{new}$:

$$\alpha_1^{new}=\alpha_1^{old}+y_1y_2\,(\alpha_2^{old}-\alpha_2^{new})$$
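
Putting the last few formulas together, here is a minimal sketch of one pair update in Python, continuing the earlier helper sketches; the `eps` threshold and the skipping of the degenerate $\eta\le 0$ case are simplifications of Platt's full treatment:

```python
def update_pair(i1, i2, X, y, alphas, b, C, kernel=linear_kernel, eps=1e-12):
    """Analytically optimize alphas[i1], alphas[i2]; return (a1_new, a2_new) or None."""
    y1, y2 = y[i1], y[i2]
    a1_old, a2_old = alphas[i1], alphas[i2]
    E1 = decision_value(X[i1], X, y, alphas, b, kernel) - y1
    E2 = decision_value(X[i2], X, y, alphas, b, kernel) - y2
    L, H = alpha2_bounds(a1_old, a2_old, y1, y2, C)
    if L == H:
        return None                                   # no room to move
    eta = kernel(X[i1], X[i1]) + kernel(X[i2], X[i2]) - 2 * kernel(X[i1], X[i2])
    if eta <= eps:                                    # Platt handles eta <= 0 separately; we skip it
        return None
    a2_new = a2_old + y2 * (E1 - E2) / eta            # unclipped solution
    a2_new = min(H, max(L, a2_new))                   # clip to [L, H]
    a1_new = a1_old + y1 * y2 * (a2_old - a2_new)
    return a1_new, a2_new
```
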
So how are the two multipliers $\alpha_1$ and $\alpha_2$ chosen?
  • For $\alpha_1$, the first multiplier, pick one that violates any of the three KKT conditions listed above;
  • For $\alpha_2$, the second multiplier, pick the one that maximizes $|E_1-E_2|$.

The threshold $b$ must satisfy the KKT conditions for whichever updated multiplier ends up strictly between 0 and C. Under Platt's convention ($u=\vec{w}^{T}\vec{x}-b$), the two candidate thresholds are:

$$b_1=E_1+y_1(\alpha_1^{new}-\alpha_1^{old})K_{11}+y_2(\alpha_2^{new}-\alpha_2^{old})K_{12}+b^{old}$$

$$b_2=E_2+y_1(\alpha_1^{new}-\alpha_1^{old})K_{12}+y_2(\alpha_2^{new}-\alpha_2^{old})K_{22}+b^{old}$$

Update $b$ as follows: use $b_1$ if $0<\alpha_1^{new}<C$, use $b_2$ if $0<\alpha_2^{new}<C$, and otherwise take $b=(b_1+b_2)/2$.

After each pair of multipliers is optimized, $b$ and the corresponding $E_i$ values must be recomputed.
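
A minimal sketch of this threshold update (same sign convention and hypothetical names as the sketches above):

```python
def update_threshold(b_old, E1, E2, y1, y2, a1_old, a2_old, a1_new, a2_new,
                     k11, k12, k22, C):
    """Recompute the threshold b after a pair update (Platt's convention u = w.x - b)."""
    b1 = E1 + y1 * (a1_new - a1_old) * k11 + y2 * (a2_new - a2_old) * k12 + b_old
    b2 = E2 + y1 * (a1_new - a1_old) * k12 + y2 * (a2_new - a2_old) * k22 + b_old
    if 0 < a1_new < C:
        return b1
    if 0 < a2_new < C:
        return b2
    return (b1 + b2) / 2.0
```
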
Once all the $\alpha_i$ and $b$ have been updated to convergence, the model is complete, and the classification function we set out to find at the very beginning can be evaluated:

$$f(x)=\sum_{i=1}^{n}\alpha_i y_i\langle\vec{x}_i,\vec{x}\rangle+b$$

with the predicted class given by its sign.
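
In code, prediction is simply the sign of the decision value (continuing the sketch series, which uses Platt's $-b$ convention throughout):

```python
def predict(x, X, y, alphas, b, kernel=linear_kernel):
    """Predict the class label (+1 or -1) of a new point x."""
    return 1 if decision_value(x, X, y, alphas, b, kernel) >= 0 else -1
```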

In addition, there is a similar derivation elsewhere that you can refer to (linked in the original post).

3.5.2 Steps of SMO algorithm

To sum up, the main steps of SMO are:

  1. Step 1: pick a pair of multipliers $\alpha_i$ and $\alpha_j$, using a heuristic selection method;
  2. Step 2: fix all parameters other than $\alpha_i$ and $\alpha_j$, find the extremum of $W$ with respect to them, and express $\alpha_i$ in terms of $\alpha_j$.

Suppose that in one iteration the multipliers to be updated are $\alpha_1$ and $\alpha_2$, with corresponding labels $y_1$ and $y_2$; then the small-scale quadratic-programming subproblem is written as:

$$\min_{\alpha_1,\alpha_2}\ \frac{1}{2}K_{11}\alpha_1^{2}+\frac{1}{2}K_{22}\alpha_2^{2}+y_1y_2K_{12}\alpha_1\alpha_2-(\alpha_1+\alpha_2)+y_1\alpha_1\sum_{i=3}^{n}y_i\alpha_iK_{i1}+y_2\alpha_2\sum_{i=3}^{n}y_i\alpha_iK_{i2}$$

$$\text{s.t.}\quad \alpha_1y_1+\alpha_2y_2=-\sum_{i=3}^{n}y_i\alpha_i=\zeta,\qquad 0\le\alpha_1,\alpha_2\le C$$

So in each iteration, how are the multipliers updated? This is exactly the analytical update derived in section 3.5.1: compute the unclipped $\alpha_2^{new}$, clip it to $[L,H]$, and then update $\alpha_1^{new}$. (The original post illustrated this step with two presentation slides, not reproduced here.)

Now that we know how to update the multipliers, which multipliers do we pick to update? The selection method consists of the following two steps:

  1. Step 1: first "scan" all multipliers and take the first one that violates the KKT conditions as the update object; call it $\alpha_1$;
  2. Step 2: among the remaining multipliers, choose as $\alpha_2$ the one that maximizes $|E_1-E_2|$, i.e. the one expected to improve the objective the most (similar in spirit to gradient descent; here $E_i=u_i-y_i$, the difference between the prediction $u_i$ on input $x_i$ and the true label $y_i$). A small sketch of this heuristic follows the list.
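
A minimal sketch of this second-choice heuristic, continuing the earlier Python sketches; Platt's full version additionally caches the error values and restricts the scan to non-bound examples, which is omitted here:

```python
def select_second(i1, X, y, alphas, b, kernel=linear_kernel):
    """Second-choice heuristic: pick i2 maximizing |E1 - E2| (simplified)."""
    E1 = decision_value(X[i1], X, y, alphas, b, kernel) - y[i1]
    best_i2, best_gap = -1, -1.0
    for i2 in range(len(y)):
        if i2 == i1:
            continue
        E2 = decision_value(X[i2], X, y, alphas, b, kernel) - y[i2]
        if abs(E1 - E2) > best_gap:
            best_i2, best_gap = i2, abs(E1 - E2)
    return best_i2
```
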
Finally, after each pair of multipliers is optimized, $b$ and the corresponding $E_i$ values must be recalculated.

To sum up, the basic idea of SMO is to push the chunking method proposed by Vapnik in 1982 to its extreme: in each iteration only two components, $\alpha_i$ and $\alpha_j$, are selected for adjustment while all the other components stay fixed; once $\alpha_i$ and $\alpha_j$ are solved, they are used to improve the other quantities. Compared with the usual decomposition algorithms, SMO may need more iterations, but each iteration is so cheap that the algorithm shows better overall convergence; moreover, it needs neither the kernel matrix to be stored nor any matrix operations.

3.5.3 Implementation of SMO algorithm

Having written this far, I believe that once you understand SVM, you really can, at some point, derive the whole chain of formulas in your head from beginning to end: the original classification function, maximizing the classification margin, $\max 1/\|w\|$, $\min \frac{1}{2}\|w\|^2$, the convex quadratic program, the Lagrangian, the conversion to the dual problem, the SMO algorithm, and finally the optimal solution and the optimal separating hyperplane. Combing through it step by step, you see why each piece is the way it is; there is much worth pursuing, and in the end it all comes together. (The original post shows a summary diagram here.)

As for the kernel function, it exists to handle the linearly non-separable case better, while the slack variables exist to tolerate or constrain the few "unruly" points that stray from the crowd and are hard to classify.

Professor Chih-Jen Lin of National Taiwan University has written the LIBSVM library, which packages up the SVM algorithm; it is well worth a look. There is also a set of annotated notes on the LIBSVM source code.

In addition, Platt gives pseudocode for SMO in Fast Training of Support Vector Machines Using Sequential Minimal Optimization; a copy of an SMO implementation is also linked in the original post, which you can look at.
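
To make the overall flow concrete, here is a minimal end-to-end sketch of a simplified SMO training loop in Python, stitched together from the hypothetical helper sketches above. It follows the "simplified SMO" style rather than Platt's full pseudocode (no error cache, no non-bound prioritization), so treat it as an illustration, not a reference implementation:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=10, kernel=linear_kernel):
    """Simplified SMO: returns (alphas, b). X is an (n, d) array, y holds labels +1/-1."""
    n = len(y)
    alphas = np.zeros(n)
    b = 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i1 in range(n):
            if not violates_kkt(i1, X, y, alphas, b, C, tol, kernel):
                continue
            i2 = select_second(i1, X, y, alphas, b, kernel)
            result = update_pair(i1, i2, X, y, alphas, b, C, kernel)
            if result is None:
                continue
            a1_new, a2_new = result
            # Errors are recomputed with the old alphas (wasteful but keeps the sketch simple).
            E1 = decision_value(X[i1], X, y, alphas, b, kernel) - y[i1]
            E2 = decision_value(X[i2], X, y, alphas, b, kernel) - y[i2]
            b = update_threshold(b, E1, E2, y[i1], y[i2],
                                 alphas[i1], alphas[i2], a1_new, a2_new,
                                 kernel(X[i1], X[i1]), kernel(X[i1], X[i2]),
                                 kernel(X[i2], X[i2]), C)
            alphas[i1], alphas[i2] = a1_new, a2_new
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alphas, b
```

Usage would then be along the lines of `alphas, b = smo_train(X, y, C=1.0)` on a small two-class dataset, followed by `predict(x_new, X, y, alphas, b)`.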

3.6. Application of SVM

You may have heard that SVM is applied in many fields, such as text classification, image classification, biological sequence analysis, biological data mining, and handwritten character recognition, but you may not fully appreciate that the range of fields where SVM can be applied successfully goes far beyond those currently being explored.

3.6.1 Text classification

A text classification system is not only a natural language processing system but also a typical pattern recognition system: the input is the text to be classified, and the output is the category associated with that text. Due to space limitations, more specific details are not covered in this article.
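
As a taste of what such a system looks like in practice, here is a minimal sketch using scikit-learn (assumed to be installed); the toy corpus and labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: the input is raw text, the output is a category label.
texts = ["the match ended in a draw", "stocks rallied on earnings",
         "the striker scored twice", "the central bank cut rates"]
labels = ["sports", "finance", "sports", "finance"]

# TF-IDF features + a linear SVM: the classic text-classification baseline.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["the striker scored in the match"]))  # likely prints ['sports']
```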

OK, the title of this part promised to prove SVM, but sharp readers will have noticed that this section does not actually contain much proof (sorry). What to do? Please see Introduction to Support Vector Machines, a concise and readable book, for the missing rigor.

Reader comments

After this article was published, many of my friends on Weibo gave me a lot of comments. Here are some of the best ones:
  1. @zlkysl: It was after reading the last post that I decided my research direction was SVM. — weibo.com/1580904460/… .
  2. Zhang Jinhui: "The three levels of SVM — I simply had to repost this. Andrew Ng does cover support vector machines on Coursera, but he clearly does not dwell on them, and the way he presents them was hard for me to digest fully, so I only took in about half of it. I only really began to understand support vector machines after reading this 'three levels' article: first a general idea of the algorithm and how to use it, then the principles behind it, and finally the proofs. All in all, this article kicked off the months of research into support vector machines that I am still doing today." — zhan.renren.com/profile/249… .
  3. "At the end, the cost function of SVM is the hinge loss; comparing it with the cost functions of other methods shows that their objective functions are actually very similar. The question then is why SVM became so popular. You could add a little material on the VC dimension and error bounds to give readers an idea and a direction." — weibo.com/1580904460/… .
  4. "In interviews, SVM can be used to probe many aspects of machine-learning ability: the objective function, the optimization process, parallelization, convergence of the algorithm, sample complexity, applicable scenarios, and parameter-tuning experience. That said, boosting and LR are also good probes. Moreover, as statistical machine learning has progressed, SVM is now studied mainly as one surrogate (the hinge loss) for the 0-1 loss, and more general frameworks have been proposed: research on loss functions studies the relationship between surrogate losses and Bayes risk, research on algorithmic stability studies the relationship between surrogate losses and generalization performance, and research on convex optimization studies how to solve convex objective functions; SVM, boosting and similar algorithms are just concrete instances of these general frameworks."
  5. @Curie monkey sister: On SVM's loss function, see Zhang Tong's paper "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization". The loss functions commonly used by the various algorithms basically all have Fisher consistency, and the classifiers obtained by optimizing these losses can be regarded as "proxies" for the posterior probability. Zhang Tong also wrote another paper, "Statistical Analysis of Some Multi-category Large Margin Classification Methods", which analyzes margin losses in the multi-class setting. These two papers give a thorough analysis of the loss functions used by boosting and SVM.
  6. SVM uses the hinge loss, which is not differentiable, so it is less convenient to optimize than other surrogate losses and more troublesome to convert into probabilities. Kernels are not used that much either: in today's big-data era the sample sizes are enormous, and it is hard to imagine how an $n^2$ kernel matrix could be stored and computed; nowadays nonlinear problems are generally handled with deep learning. // @Copper_PKU: What are SVM's typical applications in industry? How does industry choose kernel functions: by rules of thumb? How is the SVM training process optimized?
  7. Copper_PKU: For July's SVM tutorial, I personally think the following could be added or revised: in the SMO section, add the algorithm from Joachims' paper and the working-set selection method used with SMO, including SMO's convergence criterion, as well as the earlier conjugate-gradient solution method; although it is an older algorithm, it helps in understanding SMO. The optimization and solution of the model is an iterative process, and adding the historical algorithms gives a fuller picture. — weibo.com/1580904460/… .
  8. // @Liao Linchuan: The reasons SGD suits large training sets better: 1. each SGD iteration uses only a subset of the samples, which is much faster than using the complete training set (especially at the scale of millions of examples); 2. if the objective function is convex or pseudo-convex, SGD converges to the global optimum almost surely, and otherwise to a local optimum; 3. SGD usually does not need to reach the global optimum, and can stop as soon as a good-enough solution is found. // @copper_pku: the core idea of SGD is iterative training: every time a sample is received, the loss based on the current W(t) is computed and the next update W(t+1) is performed, where t denotes the t-th training step; this is analogous to parameter training in BP neural networks. "Sample by sample" means that only one sample is processed at a time, rather than a whole batch.
  9. Copper_PKU: From the loss-function point of view, the primal problem can be read as a regularization term plus a loss function, and solving it means striking a balance between the two; if one only emphasized minimizing the loss function, overfitting would occur, hence the C parameter. // @researcher July: SVM is indeed a problem of finding the optimum of an objective function under certain constraints; at the same time, to reduce the misclassification rate, the loss should be made as small as possible.
  10. .
I really enjoy this era in which everyone can debate: no one is necessarily right or wrong, everyone simply expresses their own understanding and views. Great!

References and recommended reading

  1. [English] Nello Cristianini/John Shawe-Taylor, Introduction to Support Vector Machines;
  2. The support website for the introduction to Support Vector Machines: www.support-vector.net/;
  3. Introduction to Data Mining, Pang-Ning Tan / Michael Steinbach / Vipin Kumar;
  4. Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber;
  5. A New Approach to Data Mining: Support Vector Machines, Deng Naiyang, Tian Yingjie;
  6. Support Vector Machines: Theory, Algorithm and Extension, Deng Naiyang, Tian Yingjie;
  7. Support Vector Machines, PlusKid: blog.pluskid.org/?page_id=68… ;
  8. www.360doc.com/content/07/… ;
  9. Research on ten classical algorithms of data mining;
  10. C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition;
  11. Statistical Learning Methods, Li Hang;
  12. Statistical natural Language Processing, zong Chengqing, chapter 12, text classification.
  13. Introduction to the SVM series, Jasper:www.blogjava.net/zhenandaci/… ;
  14. Implementation and comparison of nearest-neighbor and SVM digit recognition, author unknown;
  15. Original notes from Stanford's machine learning course: www.cnblogs.com/jerrylead/a… ;
  16. Stanford machine learning course notes: www.cnblogs.com/jerrylead/t… ;
  17. www.cnblogs.com/jerrylead/a… ;
  18. Mathematical derivation of the SMO algorithm: www.cnblogs.com/jerrylead/a… ;
  19. Knowledge of probability theory and mathematical statistics needed in data mining;
  20. Articles on machine learning worth reading: www.cnblogs.com/vivounicorn… ;
  21. Recommended mathematics textbooks: blog.sina.com.cn/s/blog_5e63… ;
  22. Neural Networks and Machine Learning (3rd Edition), Simon Haykin;
  23. The Past and Present of the Normal Distribution: t.cn/zlH3Ygc;
  24. A Brief History of Mathematical Statistics, by Academician Chen Xiru;
  25. Optimization Theory and Algorithms (2nd edition), Chen Baolin (ed.);
  26. A Gentle Introduction to Support Vector Machines in Biomedicine: www.nyuinformatics.org/downloads/s… ; this PPT is very good: apart from not going deep enough into the Lagrangian dual variables and the convex quadratic-programming problem, the rest is excellent, with wonderful figures; several figures in this article are quoted from it;
  27. An SVM tutorial PPT from Carnegie Mellon University (CMU): www.autonlab.org/tutorials/s… ;
  28. A machine-learning lecture handout on SVM: wenku.baidu.com/link?url=PW… ;
  29. staff.ustc.edu.cn/~ketang/PPT… ;
  30. Introduction to Support Vector Machines (SVM), By Debprakash Patnai M.E. (SSA), www.google.com.hk/url?sa=t&rc… ;
  31. Libsvm: www.csie.ntu.edu.tw/~cjlin/libs… ;
  32. Machine Learning in Action (Chinese version);
  33. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines: research.microsoft.com/en-us/um/pe… ;
  34. The Nature of Statistical Learning Theory, By Vladimir N. Vapnik. Very obscure, not recommended.
  35. Zhang Zhaoxiang, Machine Learning lecture 5: Support Vector Machines: irip.buaa.edu.cn/~zxzhang/co… ;
  36. A theoretical explanation of the VC dimension: www.svms.org/vc-dimensio… ; another explanation of the VC dimension: xiaoxia001.iteye.com/blog/116333… ;
  37. Jason Weston’s handout on SVM from NEC Labs America www.cs.columbia.edu/~kathy/cs47… ;
  38. SVM handouts from MIT: www.mit.edu/~9.520/spri… ;
  39. PAC questions: www.cs.huji.ac.il/~shashua/pa… ;
  40. Two papers by Zhang Tong (Baidu): Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization: home.olemiss.edu/~xdang/676/… , and Statistical Analysis of Some Multi-category Large Margin Classification Methods;
  41. jacoxu.com/?p=39;
  42. Matrix Analysis and Application, Zhang Xianda, Tsinghua university;
  43. An implementation of the SMO algorithm: blog.csdn.net/techq/artic… ;
  44. A brief overview of common ideas behind machine-learning interview questions: www.cnblogs.com/tornadomeet… ;
  45. Matrix’s Wikipedia page: zh.wikipedia.org/wiki/%E7%9F… ;
  46. The least squares method and its implementation: blog.csdn.net/qll12559671… ;
  47. An introduction to statistical learning methods: blog.csdn.net/qll12559671… ;
  48. www.csdn.net/article/201… ;
  49. A Tutorial on Support Vector Regression: alex.smola.org/papers/2003… ; a concise introduction to SVR: www.cmlab.csie.ntu.edu.tw/~cyy/learni… ;
  50. SVM Org: www.support-vector-machines.org/;
  51. R. Collobert, Large Scale Machine Learning, Université Paris VI, PhD thesis, 2004: ronan.collobert.com/pub/matos/2… ;
  52. Making Large-Scale SVM Learning Practical: www.cs.cornell.edu/people/tj/p… ;
  53. Text classification with SVM:blog.csdn.net/zhzhl202/ar… ;
  54. Working Set Selection Using Second Order Information for Training Support Vector Machines: www.csie.ntu.edu.tw/~cjlin/pape… ;
  55. The SVM Optimization: Inverse Dependence on Training Set Size: icml2008.cs.helsinki.fi/…/266… ;
  56. Large-scale Support Vector Machines: Algorithms and Theory: cseweb.ucsd.edu/~akmenon/Re… ;
  57. The concept of convex optimization: cs229.stanford.edu/section/cs2… ;
  58. Convex Optimization, by Stephen Boyd and Lieven Vandenberghe;
  59. Large-Scale Non-Linear Classification: Algorithms and Evaluations, Zhuang Wang (a survey of many new advances in SVM algorithms): ijcai13.org/files/tutor… ;
  60. SVM based on SMO algorithm: www.cs.iastate.edu/~honavar/sm… ;
  61. Some SVM-related papers recommended by Copper (many of which are already mentioned in the items above): c.blog.sina.com.cn/profile.php… ;
  62. Online LaTeX formula editor: www.codecogs.com/latex/eqned… .