1. About the SVM

1.1 A fairy tale about SVM

Support Vector Machine (SVM) is one of the most celebrated supervised learning methods and is introduced in almost every textbook on classical machine learning. There is a well-known story about SVM involving an angel and a devil.

Legend has it that the devil played a game with an angel: the devil put two kinds of colored balls on a table and asked the angel to separate them with a stick. It seemed too easy; without thinking, the angel placed the stick and completed the task. The devil then added more balls. As the number of balls grew, some of them could no longer be separated properly by the original stick, as shown in the picture below.

What SVM actually does is find the best place for the angel to put the stick, so that the balls on either side are as far as possible from the stick that separates them. With the stick positioned where SVM chose it, even when the devil kept adding new balls in the same way, the stick could still separate the two kinds of balls well.

Seeing that the angel had solved the linearly separable case with a stick, the devil gave the angel a new challenge, as shown below.

With the balls arranged this way, it seems that no stick in the world can separate them perfectly. But the angel had magic: with a slap of his hand he sent the balls flying into the air, then picked up a sheet of paper and slid it between the two kinds of balls. From the devil's point of view, the balls looked as if they had been cut perfectly by a curve.

Later, bored scientists called the balls "data", the stick the "classification surface", the process of finding the stick position with the greatest spacing "optimization", the trick of slapping the table to send the balls into the air "kernel mapping", and the sheet of paper separating the balls in the air the "classification hyperplane". This is the fairy tale of SVM.

1.2 Understanding SVM: Layer 1

The Support Vector Machine, generally abbreviated SVM, is a binary classification model. Its basic form is a linear classifier with the largest margin in the feature space.

**Linear classifiers:** Given some data points belonging to two different classes, find a linear classifier that divides the data into the two classes. If x denotes a data point and y its class label (y takes the value 1 or -1, one for each class), the learning goal of a linear classifier is to find a hyperplane in the n-dimensional data space. The hyperplane can be written as (the T in w^T denotes transpose):

$$w^T x + b = 0, \qquad f(x) = w^T x + b$$
For background, see the earlier chapter on logistic regression.

The hyperplane separates the two classes: when f(x) equals 0, x is a point on the hyperplane; points with f(x) greater than 0 correspond to data points with y = 1, and points with f(x) less than 0 correspond to points with y = -1, as shown in the figure below:
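
As a small illustration (not from the original article), the decision rule based on the sign of f(x) = w^T x + b can be written in a few lines of Python; the values of `w` and `b` below are made up.

```python
import numpy as np

def predict(w, b, x):
    """Classify a point by the sign of f(x) = w^T x + b: +1 above the hyperplane, -1 below."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Made-up example: the hyperplane x1 + x2 - 3 = 0
w, b = np.array([1.0, 1.0]), -3.0
print(predict(w, b, np.array([4.0, 2.0])))   # f(x) = 3 > 0  -> +1
print(predict(w, b, np.array([0.5, 0.5])))   # f(x) = -2 < 0 -> -1
```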

1.2.1 Functional margin and geometric margin

Given a hyperplane w^T x + b = 0, the quantity |w^T x + b| can relatively represent how far the point x is from the hyperplane, and whether the sign of w^T x + b agrees with the sign of the class label y tells us whether the classification is correct. So the sign of y(w^T x + b) can be used to judge the correctness of the classification. This leads to the notion of the functional margin.

Functional margin: $\hat{\gamma} = y(w^T x + b) = y f(x)$

The functional margin of the hyperplane (w, b) with respect to the training data set T is defined as the minimum of the functional margins over all sample points (x_i, y_i) in T (where x is the feature, y is the label, and i indexes the i-th sample):

$$\hat{\gamma} = \min_{i = 1, \dots, N} \hat{\gamma}_i$$
However, the functional margin defined this way has a problem: if w and b are scaled proportionally (say to 2w and 2b), the functional margin becomes twice its original value even though the hyperplane itself does not change. So the functional margin alone is not enough.

Geometric margin

In fact, we can impose a normalization constraint on the normal vector w, which leads to the quantity that truly measures the distance from a point to the hyperplane: the geometric margin. Suppose that for a point x, its perpendicular projection onto the hyperplane is x_0, w is a vector perpendicular to the hyperplane, and γ is the distance from the sample x to the hyperplane, as shown in the figure below:

The geometric margin formula is given directly here; for the detailed derivation, see the references at the end of this article.

Geometric margin:

$$\tilde{\gamma} = \frac{y(w^T x + b)}{\|w\|} = \frac{\hat{\gamma}}{\|w\|}$$

Comparing the two definitions: the geometric margin is the functional margin divided by ‖w‖. The functional margin y(w^T x + b) = y·f(x) is really just |f(x)|, a man-made measure of margin, whereas the geometric margin |f(x)|/‖w‖ is, intuitively, the actual distance from the point to the hyperplane.
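
To make the distinction concrete, here is a minimal sketch (my own illustration, with made-up numbers) that computes the functional margin y·f(x) and the geometric margin y·f(x)/‖w‖ for one point, and shows that scaling (w, b) changes the former but not the latter.

```python
import numpy as np

def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -3.0     # hyperplane x1 + x2 - 3 = 0
x, y = np.array([4.0, 2.0]), 1        # a positively labelled point

print(functional_margin(w, b, x, y))  # 3.0
print(geometric_margin(w, b, x, y))   # 3 / sqrt(2) ≈ 2.121

# Scaling w and b by 2 leaves the hyperplane (and geometric margin) unchanged,
# but doubles the functional margin -- the problem noted above.
print(functional_margin(2 * w, 2 * b, x, y))  # 6.0
print(geometric_margin(2 * w, 2 * b, x, y))   # still ≈ 2.121
```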

1.2.2 Definition of the maximum margin classifier

When classifying a data point, the farther the data point is from the hyperplane, the higher the confidence of the classification. So, to make the classification as confident as possible, the chosen hyperplane should maximize this "margin". The margin is half of the gap shown in the figure below.

From the previous analysis, the functional margin is not suitable for maximization: once the hyperplane is fixed, the length of w and the value of b can be scaled proportionally, which makes the functional margin arbitrarily large while the hyperplane stays the same. The geometric margin, because of the division by ‖w‖, does not change when w and b are scaled; it changes only with the hyperplane, so it is the more appropriate notion of margin. In other words, **the "margin" in the maximum-margin separating hyperplane we are looking for refers to the geometric margin.**

As shown in the figure below, the solid line in the middle is the Optimal Hyperplane that is found, and the distance between the two dashed lines equals twice the geometric margin. The points lying on the dashed lines are the support vectors. Since the support vectors lie exactly on the margin boundary, they satisfy y(w^T x + b) = 1 (taking the functional margin to be 1), while all points that are not support vectors obviously satisfy y(w^T x + b) > 1.

OK. This is the first layer of understanding of SVM; for readers who only care about how to use SVM it is enough, and there is no need to dig further into the deeper principles.

1.2.3 Maximum margin loss function: Hinge Loss

The standard SVM solution sets up the original problem as a quadratic program, introduces Lagrange multipliers, and then converts the problem into its dual form to solve; this approach has a very solid theory behind it. Here we look at it from another angle. In machine learning, the usual approach is empirical risk minimization (ERM): construct a hypothesis function as the mapping from input to output, then use a loss function to measure how good the model is; the model that minimizes the loss is the optimal hypothesis, and different loss functions yield different learning algorithms. SVM adopts the Hinge Loss, which is used for "max-margin" classification:

$$L_i = \sum_{j \neq y_i} \max\bigl(0,\; f(x_i, W)_j - f(x_i, W)_{y_i} + \Delta\bigr)$$
  • For the i-th data point x_i in the training set,
  • under the weights W there is a score vector f(x_i, W);
  • the j-th entry of that score vector is written f(x_i, W)_j.

To understand this formula, take a look at the following picture:

  1. In everyday life, we feel safest when nothing threatens us. For example, if we come first in an exam with 99 points and the runner-up has 98, we feel insecure: that person might overtake us at any time. If the runner-up scored 85, we would feel much safer, since catching up would take a lot of effort. The same idea applies to the picture above.
  2. In the diagram above, the red point a distance delta to the left marks a safety line. What does that mean? If the score of a wrong class crosses the safety line, a penalty is applied to push the wrong score back beyond the safety line, which keeps the correct prediction unambiguous.
  3. In the formula, f(x_i, W)_j is the score of a wrong class, and f(x_i, W)_{y_i} − Δ (correct score minus delta) is the safety-line value; the difference between the two terms is the penalty. The closer a wrong score gets to the correct score, the larger the penalty. When the wrong score is already outside the safety line, the difference is negative and the max with 0 gives a loss of 0, i.e. no loss. A code sketch of this loss follows the list.
  4. Train on the data repeatedly until the loss function is minimized, and you have found the separating hyperplane.
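
The description above corresponds to the multiclass hinge loss. Below is a minimal NumPy sketch of that formula (my own illustration; the scores and Δ = 1 are made up), computing L_i = Σ_{j≠y_i} max(0, f(x_i,W)_j − f(x_i,W)_{y_i} + Δ).

```python
import numpy as np

def hinge_loss_one_sample(scores, correct_class, delta=1.0):
    """Multiclass hinge loss for one sample; scores[j] plays the role of f(x_i, W)_j."""
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0      # the correct class itself contributes no loss
    return margins.sum()

# Made-up scores for 3 classes; class 0 is the true label.
print(hinge_loss_one_sample(np.array([3.2, 5.1, -1.7]), correct_class=0))  # 2.9
print(hinge_loss_one_sample(np.array([8.0, 5.1, -1.7]), correct_class=0))  # 0.0 -- wrong scores are beyond the safety line
```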

1.3 Deeper into SVM: the second layer

1.3.1 From linearly separable to linearly inseparable

Now consider the objective function obtained earlier (setting the functional margin to 1):

$$\max \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\; i = 1, \dots, n$$
We convert this into a dual problem. What is a dual problem? It is a pair of problems that are essentially the same but are posed as different propositions from different perspectives.

Since maximizing 1/‖w‖ is the same as minimizing (1/2)‖w‖², the objective above is equivalent to the following (w moves from the denominator to the numerator, so the original max problem becomes a min problem; clearly the two are equivalent):

$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\; i = 1, \dots, n$$
Because the objective is now quadratic and the constraints are linear, this is a convex quadratic programming problem, which can be solved with an off-the-shelf QP (Quadratic Programming) optimization package. In a word: subject to certain constraints, make the objective optimal and the loss minimal.

In addition, because of the special structure of this problem, it can also be transformed via Lagrange duality into an optimization over the dual variables: the optimal solution of the original problem is obtained by solving the equivalent dual problem. This is the dual algorithm for support vector machines under linear separability. The advantages are: dual problems are often easier to solve, and they naturally introduce the kernel function, which then extends SVM to nonlinear classification problems.
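
As a hedged illustration of "solve it with an off-the-shelf QP package", here is a sketch using the cvxopt library (not part of the original article; the toy data are made up and assumed linearly separable). The dual is min_α (1/2)α^T P α − 1^T α with P_ij = y_i y_j x_i·x_j, subject to α_i ≥ 0 and Σ α_i y_i = 0.

```python
import numpy as np
from cvxopt import matrix, solvers   # pip install cvxopt

solvers.options['show_progress'] = False

# Tiny, made-up linearly separable data set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j x_i . x_j
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                   # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # sum_i alpha_i y_i = 0
b = matrix(0.0)

alphas = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
w = ((alphas * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
sv = alphas > 1e-6                            # support vectors have alpha_i > 0
b0 = np.mean(y[sv] - X[sv] @ w)               # recover b from the support vectors
print("w =", w, "b =", b0)
```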

Please refer to the reference link at the end of this article for details.

1.3.2 Kernel function

In fact, most of the time the data are not linearly separable, and a hyperplane satisfying the constraints simply does not exist. We have seen above how SVM handles the linearly separable case; how does it deal with nonlinear data? For the nonlinear case, SVM chooses a kernel function κ(⋅,⋅) and resolves the inseparability in the original space by mapping the data to a higher-dimensional space.

Specifically, in the linearly inseparable case, the support vector machine first completes the computation in the low-dimensional space, then maps the input space to a high-dimensional feature space through the kernel function, and finally constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating nonlinear data that cannot be separated in the original plane. As shown in the figure, a set of data that cannot be split in 2-D space is mapped to 3-D space and split there:

In practice, people usually choose from a few common kernels (different parameters for a given kernel effectively give different kernels, chosen according to the problem and the data), for example the polynomial kernel, the Gaussian kernel, and the linear kernel.
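
For concreteness, here are the three kernels just mentioned written as plain Python functions (a sketch; the hyperparameters degree, c, and gamma are illustrative choices, not prescribed by the article).

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, gamma=0.5):        # also called the RBF kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
```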

The reader may still not understand what a kernel is. Let me briefly summarize the following three points:

  1. In practice we often encounter linearly inseparable samples. The usual remedy is to map the sample features into a higher-dimensional space (after the mapping, the relevant features become separable, which achieves the goal of classification).
  2. But if every linearly inseparable example were mapped explicitly to a high enough dimension, that dimension could become horribly large. So what do we do?
  3. This is where the kernel function comes in. Its value lies in the fact that, although it effectively takes features from a low dimension to a high dimension, the kernel does its computation beforehand in the low-dimensional space, while the substantive classification effect shows up in the high-dimensional space, avoiding the complex computation directly in high dimensions. A small demonstration follows this list.
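
The following sketch (my own illustration) makes point 3 concrete for the simple kernel κ(x, z) = (x·z)²: computing κ directly in the low-dimensional space gives exactly the inner product of the explicit high-dimensional mapping φ(x) = (x₁², √2·x₁x₂, x₂²), without ever constructing φ.

```python
import numpy as np

def phi(x):
    """Explicit mapping of a 2-D point into 3-D feature space for the kernel (x.z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, z):
    """The same quantity computed directly in the original 2-D space."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(phi(x), phi(z)))   # 121.0 -- inner product after explicit high-dimensional mapping
print(kernel(x, z))             # 121.0 -- identical, but computed cheaply in low dimension
```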

If there are outliers in the data, they can be handled with slack variables.

1.3.3 Summary

Stated somewhat loosely, SVM is essentially a classification method whose classification function has the form f(x) = w^T x + b. Finding w and b with the largest margin leads to minimizing (1/2)‖w‖²; introducing Lagrange multipliers turns this into solving for the multipliers α (the solution involves a series of optimizations such as convex quadratic programming). So finding w and b is equivalent to finding α, and α can be solved with the fast learning algorithm SMO. As for the kernel function, it handles nonlinear cases: mapping directly into high dimensions risks a dimension explosion, so the computation is done in low dimensions while achieving the equivalent high-dimensional effect.
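
In practice one rarely writes SMO by hand; libraries such as scikit-learn wrap an SMO-style solver. A minimal usage sketch with made-up toy data (not the original article's code):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data: two small clusters.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1.0)   # swap in kernel='rbf' for the Gaussian kernel
clf.fit(X, y)

print(clf.support_vectors_)         # the few points that determine the hyperplane
print(clf.predict([[2.8, 2.5]]))    # -> [1]
```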

OK, this second layer of understanding satisfies most people's curiosity about how SVM works, and it is enough for an interview.

1.4 Application of SVM

SVM is applied in many fields, such as text classification, image classification, biological sequence analysis, biological data mining, and handwritten character recognition, and it can be applied successfully in far more areas than those currently being explored.

2. Some questions about SVM

  1. Is there a set of parameters that makes the SVM training error zero?

    A: Yes, there is.

  2. Does an SVM classifier with a training error of 0 necessarily exist?

    A: Yes, one necessarily exists.

  3. Can the training error of an SVM with slack variables be 0?

    A: A linear classifier trained with the SMO algorithm does not necessarily produce a model whose training error is 0, because the optimization objective has changed: it is no longer to minimize the training error.

  4. Why can an SVM with a kernel classify nonlinear problems?

    Answer: The kernel function is essentially the inner product of two vectors after a feature mapping. Through the kernel function, the data are implicitly mapped into a high-dimensional space, where the nonlinear problem becomes a linear one. Viewed back in the low-dimensional space, the classification result is a nonlinear classification; that is why an SVM with a kernel can classify nonlinear problems.

  5. How do I choose the kernel function?

    • If the number of features is large, comparable to the number of samples, choose LR or an SVM with a linear kernel.
    • If the number of features is small and the number of samples is moderate, use an SVM with a Gaussian kernel.
    • If the number of features is small and the number of samples is large, manually add some features to reduce it to the first case. A code sketch of this heuristic follows the list.
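
A sketch of the heuristic above as code (my own illustration; the thresholds are illustrative guesses, not fixed rules):

```python
def choose_model(n_features, n_samples):
    """Rough kernel-selection heuristic from the list above; thresholds are illustrative."""
    if n_features >= n_samples:
        return "LR or linear-kernel SVM"               # many features: data are likely already separable
    if n_samples <= 10_000:
        return "SVM with Gaussian (RBF) kernel"        # few features, moderate number of samples
    return "add features first, then LR / linear SVM"  # few features, many samples

print(choose_model(n_features=20_000, n_samples=5_000))
print(choose_model(n_features=50, n_samples=3_000))
print(choose_model(n_features=50, n_samples=1_000_000))
```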

3. Connections and differences between LR and SVM

3.1 Similarities

  • Both are linear classifiers (in their basic, non-kernel form); both look for an optimal separating hyperplane.
  • Both are supervised learning algorithms.
  • Both are discriminative models. A discriminative model does not care how the data were generated; it only cares about the differences between signals, and uses those differences to classify a given signal. Common discriminative models include KNN, SVM and LR; common generative models include Naive Bayes and the hidden Markov model.

3.2 Differences

  • LR is a parametric model and SVM is a non-parametric model; the choice between a linear and an RBF kernel reflects whether the data are linearly separable or not.
  • In terms of the objective function, the difference is that logistic regression uses the logistic loss while SVM uses the hinge loss. Both loss functions aim to increase the weight of the data points that matter most for the classification and reduce the weight of points that are less relevant.
  • SVM considers only the support vectors, i.e. the few points most relevant to the classification, when learning the classifier. Logistic regression, via a nonlinear mapping, greatly reduces the weight of points far from the separating plane and relatively increases the weight of the data points most relevant to the classification.
  • Logistic regression models are relatively simple and easy to understand, which is especially convenient for large-scale linear classification. Understanding and optimizing SVM is comparatively more involved, but after SVM is transformed into its dual problem, classification only requires computing kernel values against a few support vectors; this is a clear advantage when complex kernel functions are involved and greatly simplifies the model and the computation.
  • What LR can do, SVM can generally also do, although there may be some gap in accuracy; some of what SVM can do (e.g. nonlinear separation via kernels), LR cannot. A small sketch comparing the two losses follows this list.
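
To make the objective-function difference concrete, here is a small sketch (my own illustration) computing the two losses for a single sample with label y ∈ {−1, +1} and raw score f(x): hinge loss max(0, 1 − y·f(x)) versus logistic loss log(1 + e^{−y·f(x)}).

```python
import numpy as np

def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)

def logistic_loss(y, fx):
    return np.log(1.0 + np.exp(-y * fx))

for fx in [3.0, 0.5, -1.0]:          # scores for a sample whose true label is y = +1
    print(fx, hinge_loss(1, fx), logistic_loss(1, fx))
# Hinge loss is exactly 0 once the point is beyond the margin (y*f(x) >= 1),
# while logistic loss keeps shrinking but never reaches 0.
```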

4. Differences between linear and nonlinear classifiers, and their advantages and disadvantages

Linearity and nonlinearity are judged with respect to the model parameters and the input features. For example, if the input is x and the model is y = ax + ax², the model is nonlinear in the input; but if the inputs are taken to be x and x², the model is linear in its parameters. A small sketch follows.
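
A tiny sketch of the point above (the data are made up): a model of the form y = a·x + b·x² is nonlinear in the raw input x, but after expanding the input into the features (x, x²) it is an ordinary linear model and can be fitted by linear least squares.

```python
import numpy as np

# Made-up data generated from y = 2*x + 0.5*x^2 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 * x + 0.5 * x ** 2 + rng.normal(scale=0.1, size=x.shape)

# Expand the single input x into the features (x, x^2): the model becomes
# linear in its parameters, and ordinary least squares recovers them.
features = np.column_stack([x, x ** 2])
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
print(coef)   # approximately [2.0, 0.5]
```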

  • A linear classifier has good interpretability and low computational complexity; its disadvantage is that its fitting ability is relatively weak.

    LR, Bayesian classification, single-layer perceptron, linear regression

  • A nonlinear classifier has strong fitting ability; its disadvantages are that it easily overfits when data are insufficient, its computational complexity is high, and its interpretability is poor.

    Decision tree, RF, GBDT, multilayer perceptron

SVM can be either, depending on the kernel (linear kernel vs. Gaussian kernel).

5. Code implementation

News classification example: see the project's GitHub repository.


6. References

Popular Introduction to Support Vector Machines (Understanding the three layers of SVM)

Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]