• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • Article address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author, and indicate the source.

Introduction

The model explained in this article is the well-known support vector machine (SVM), which once held a near-monopoly position in machine learning, and its influence lasted for many years. Even now, as deep neural networks grow in influence, SVM still delivers excellent results and model robustness that rival neural networks on small and medium-sized data sets.

Support vector machine learning methods cover a range of models, from simple to complex, for different situations:

  • Linearly separable support vector machine: when the training data is linearly separable, we learn a linear classifier through hard margin maximization, namely the linearly separable support vector machine (also known as the hard margin support vector machine).

  • Linear support vector machine: when the training data is approximately linearly separable, a linear classifier is learned through soft margin maximization, called the linear support vector machine (also known as the soft margin support vector machine).

  • Non-linear support vector machine: when the training data is linearly non-separable, a nonlinear classifier is learned by using the kernel trick together with soft margin maximization, called the nonlinear support vector machine.

Support vector machines can perform nonlinear classification in complex scenarios with the help of the kernel trick. When the input space is a Euclidean space or a discrete set and the feature space is a Hilbert space, the kernel function represents the inner product between the feature vectors obtained by mapping the inputs from the input space to the feature space.

A nonlinear support vector machine can be learned by using kernel functions, which is equivalent to implicitly learning a linear support vector machine in a higher-dimensional feature space. Such methods are called the kernel trick.

1. Maximum margin classifier

1) Classification problems and linear models

Classification is a core problem in supervised learning. In supervised learning, when output variables take finite discrete values, the prediction problem becomes a classification problem. In practical life, the essence of many problems is classification, such as recognition of certain patterns: text classification, word segmentation, part-of-speech tagging, image content recognition and target detection.

The mathematical understanding of a classification problem is space partitioning (or finding the decision boundaries between different categories). The figure below shows a simple linear classifier (for a more detailed explanation of this part, refer to the ShowMeAI articles Graphical Machine Learning | Machine Learning Basics and Graphical Machine Learning | Logistic Regression Algorithm).

2) Maximum margin classifier

When solving classification problems, different models will have different processing methods. Intuitively, we will use different decision boundaries to divide samples.

We classify the sample points of “Bing Dwen Dwen” and “Shuey Rhon Rhon” in the figure below; there are countless decision boundaries that can complete the classification task. However, the SVM model introduced here has higher requirements: it not only wants to distinguish the two types of sample points, but also wants to find the decision boundary with the highest robustness and best stability (corresponding to the black line in the figure).

This decision boundary has the “maximum” distance from the “nearest” data points on both sides, which means the decision boundary has the strongest fault tolerance and is least susceptible to interference from noisy data. Intuitively, even if the decision boundary wobbles a little, it is the least likely to “bump into” sample points or cause misclassification.

2. Support vector machine details

1) Linearly separable SVM and hard margin maximization

We need to find the optimal dividing line that separates the red and blue points in the figure below. The red points are labeled $y = +1$, the blue points are labeled $y = -1$, and the line is $f(x) = \boldsymbol{w} \cdot \boldsymbol{x} + b$, where $\boldsymbol{w}$ and $\boldsymbol{x}$ are vectors. The formula is equivalent to $f(x) = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b$.

  • When the vector $x$ is two-dimensional, $f(x)$ represents a straight line in two-dimensional space.

  • When the vector $x$ is three-dimensional, $f(x)$ represents a plane in three-dimensional space.

  • When the vector $x$ is $n$-dimensional ($n > 3$), $f(x)$ represents an $(n-1)$-dimensional hyperplane in $n$-dimensional space.

When a new point $x$ needs to be classified, we compute $\mathrm{sgn}(f(x))$ to predict which category it belongs to, where $\mathrm{sgn}$ is the sign function:

  • When $f(x) > 0$, $\mathrm{sgn}(f(x)) = +1$.

  • When $f(x) < 0$, $\mathrm{sgn}(f(x)) = -1$.
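To make this concrete, here is a minimal NumPy sketch (with a hypothetical weight vector and bias, chosen only for illustration) of evaluating $f(x) = w \cdot x + b$ and taking its sign to predict a class:

import numpy as np

# Hypothetical parameters of a linear decision function f(x) = w·x + b
w = np.array([2.0, -1.0])
b = -0.5

def predict(x):
    """Return +1 or -1 according to sgn(f(x))."""
    fx = np.dot(w, x) + b
    return 1 if fx > 0 else -1

print(predict(np.array([1.0, 0.2])))   # f(x) = 2.0 - 0.2 - 0.5 = 1.3 > 0, so +1
print(predict(np.array([0.0, 1.0])))   # f(x) = -1.0 - 0.5 = -1.5 < 0, so -1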

So back to the point: how do we get an optimal dividing line $f(x)$? The lines below represent several possible $f(x)$:

We want this line to satisfy the “maximum margin” principle, as shown below. The points on the red and blue lines are called support vectors.

The decision boundary is $f(x)$, the red and blue lines are the surfaces on which the support vectors lie, and the gap between the red and blue lines is the margin width $M$ between the categories that we want to maximize.

Here we directly give the formula for the margin: $M = \frac{2}{\left\|w\right\|}$.

SVM solves for the separating hyperplane that correctly partitions the training data set and has the maximum geometric margin.

As shown below, $\boldsymbol{w} \cdot \boldsymbol{x} + b = 0$ is the separating hyperplane. For a linearly separable data set there are infinitely many such hyperplanes (i.e. perceptrons), but the separating hyperplane with the largest geometric margin is unique.

Let’s start with some definitions. Suppose we have a training data set on the feature space: $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, and the training data set is linearly separable, where:

  • $x_i \in \mathbb{R}^{n}$, $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$; $x_i$ is the $i$-th feature vector.

  • $y_i$ is the class label: the sample is a positive example when it equals $+1$ and a negative example when it equals $-1$.

(1) Geometric margin

For a given dataset $T$ and the hyperplane $w \cdot x + b = 0$:

  • The geometric margin of the hyperplane with respect to a sample point $(x_i, y_i)$ is defined as: $\gamma_i = y_i\left(\frac{w}{\left\|w\right\|} \cdot x_i + \frac{b}{\left\|w\right\|}\right)$

  • The minimum geometric margin of the hyperplane with respect to all sample points is: $\gamma = \min_{i=1,2,\ldots,N} \gamma_i$. This distance is in fact what we call the distance from the support vectors to the hyperplane.
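As a quick numerical check of these definitions, the following sketch (with a hypothetical hyperplane $w, b$ and a tiny hand-made labeled data set) computes each $\gamma_i$ and the minimum $\gamma$:

import numpy as np

# Hypothetical hyperplane parameters and a tiny labeled data set
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# Geometric margin of each sample: gamma_i = y_i * (w·x_i + b) / ||w||
gamma = y * (X @ w + b) / np.linalg.norm(w)
print(gamma)          # per-sample geometric margins
print(gamma.min())    # gamma = min_i gamma_i, the geometric margin of the hyperplane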


① According to the above definitions, the SVM problem of “solving the maximum-margin separating hyperplane” can be expressed as the following constrained optimization problem:


$$\begin{aligned} & \max_{w,b} \ \gamma \\ & s.t. \quad y_{i}\left(\frac{w}{\left\|w\right\|} \cdot x_i + \frac{b}{\left\|w\right\|}\right) \geq \gamma, \quad i=1,2,\ldots,N \end{aligned}$$
  • In the above formula, s.t. is the abbreviation of “subject to”, which introduces the constraints.


② Dividing both sides of the constraint by $\gamma$ gives:


$$y_{i}\left(\frac{w}{\left\|w\right\|\gamma} \cdot x_i + \frac{b}{\left\|w\right\|\gamma}\right) \geq 1$$


③ Because $\left\|w\right\|$ and $\gamma$ are scalars, for conciseness of expression we let $w = \frac{w}{\left\|w\right\|\gamma}$ and $b = \frac{b}{\left\|w\right\|\gamma}$, obtaining:


$$y_i(w \cdot x_i + b) \geq 1, \quad i=1,2,\ldots,N$$


④ Because maximizing $\gamma$ is equivalent to maximizing $\frac{1}{\left\|w\right\|}$, which in turn is equivalent to minimizing $\frac{1}{2}\left\|w\right\|^{2}$, the SVM problem of solving the maximum-margin separating hyperplane can finally be expressed as the following constrained optimization problem:


$$\begin{aligned} & \min_{w,b} \ \frac{1}{2}\left\|w\right\|^{2} \\ & s.t. \quad y_{i}(w \cdot x_i + b) \geq 1, \quad i=1,2,\ldots,N \end{aligned}$$
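This primal problem is a small convex quadratic program, so any QP solver can handle it. As an illustration only (the article itself uses sklearn later; here cvxpy is assumed to be installed and the tiny data set is hand-made), a minimal sketch of solving it directly:

import numpy as np
import cvxpy as cp

# Tiny hand-made, linearly separable data set (hypothetical)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w·x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin separating hyperplane w·x + b = 0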

(2) Dual algorithm

To solve the optimization problem of the linearly separable support vector machine, we often transform it into a dual problem, that is, we obtain the optimal solution of the primal problem by applying Lagrangian duality and solving the dual problem. This is the dual algorithm of the linearly separable support vector machine.

This has some advantages:

  • Duality problems tend to be easier to solve.

  • The kernel function can be introduced naturally, which extends the method to nonlinear classification.


① We first construct the Lagrange function. To do so, we introduce a Lagrange multiplier $\alpha_i \geq 0$, $i = 1, 2, \ldots, N$ for each inequality constraint, and define the Lagrange function:


$$\begin{aligned} L(w, b, \alpha)=&\frac{1}{2}w^Tw+\sum_{i=1}^{N}\alpha_i\left(1-y_i(w^T x_i + b)\right) \\ =& \frac{1}{2}w^Tw-\sum_{i=1}^{N}\alpha_i y_i(w^T x_i + b) +\sum_{i=1}^{N}\alpha_i \end{aligned}$$

According to Lagrangian duality, the dual of the primal problem is the max-min problem: $\max_{\alpha}\min_{w,b}L(w, b, \alpha)$. Therefore, to obtain the solution of the dual problem, we first minimize $L(w, b, \alpha)$ with respect to $w, b$, and then maximize with respect to $\alpha$.


② Find the minimum of $L(w, b, \alpha)$ with respect to $w, b$

Take the partial derivatives of the Lagrange function $L(w, b, \alpha)$ with respect to $w$ and $b$ and set them to $0$ respectively:


$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0 \ \Rightarrow \ w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{N}\alpha_i y_i = 0 \ \Rightarrow \ \sum_{i=1}^{N}\alpha_i y_i = 0$$

Substitute the above formula into the Lagrange function, and obtain:


$$\begin{aligned} \min_{w,b} L(w, b,\alpha)=&\frac{1}{2} \left(\sum_{i=1}^{N}\alpha_i y_i x_i\right)^T \sum_{j=1}^{N}\alpha_j y_j x_j-\sum_{i=1}^{N}\alpha_i y_i\left(\left(\sum_{j=1}^{N}\alpha_j y_j x_j\right)^T x_i + b\right) +\sum_{i=1}^{N}\alpha_i \\ =&\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i\alpha_j y_jx_i^Tx_j-\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i\alpha_j y_jx_j^Tx_i+\sum_{i=1}^{N}\alpha_i \\ =& -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i\alpha_j y_jx_i^Tx_j+\sum_{i=1}^{N}\alpha_i \end{aligned}$$


③ Find the maximum of $\min_{w,b}L(w, b, \alpha)$ with respect to $\alpha$, which is the dual problem:


$$\begin{aligned} & \max_{\alpha} \ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i\alpha_j y_jx_i^Tx_j+\sum_{i=1}^{N}\alpha_i \\ & s.t. \quad \sum_{i=1}^{N}\alpha_iy_i=0 \\ & \qquad \ \ \alpha_i \geq 0, \quad i=1,2,\ldots,N \end{aligned}$$

By converting the objective function in the formula from maximization to minimization, the following equivalent dual optimization problem can be obtained:


$$\begin{aligned} & \min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i\alpha_j y_jx_i^Tx_j-\sum_{i=1}^{N}\alpha_i \\ & s.t. \quad \sum_{i=1}^{N}\alpha_iy_i=0 \\ & \qquad \ \ \alpha_i \geq 0, \quad i=1,2,\ldots,N \end{aligned}$$
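Continuing the illustrative cvxpy sketch from above (again an assumption, not part of the original article), the dual problem can be solved on the same tiny data set; the primal solution is then recovered from the multipliers using the formulas derived in the next step:

import numpy as np
import cvxpy as cp

# Same tiny hand-made, linearly separable data set as before (hypothetical)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Note: sum_ij alpha_i y_i alpha_j y_j x_i^T x_j = || sum_i alpha_i y_i x_i ||^2
G = y[:, None] * X                      # row i is y_i * x_i
alpha = cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(G.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                         # w* = sum_i alpha_i* y_i x_i
j = int(np.argmax(a))                   # an index with alpha_j* > 0 (a support vector)
b = y[j] - X[j] @ w                     # b* = y_j - sum_i alpha_i* y_i (x_i^T x_j)
print(w, b, np.sign(X @ w + b))         # predictions on the training points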


④ For a linearly separable training data set, suppose the solution of the dual optimization problem is $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$; then the solution $w^*, b^*$ of the primal optimization problem can be obtained from $\alpha^*$.

Suppose $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$ is the solution of the dual optimization problem. Then there exists an index $j$ such that $\alpha_j^* > 0$, and the solution $w^*, b^*$ of the primal optimization problem can be obtained as follows:


$$w^*=\sum_{i=1}^{N} \alpha_i^*y_ix_i$$


$$b^*=y_j-\sum_{i=1}^{N} \alpha_i^*y_i(x_i^Tx_j)$$

Proof: according to the theorem, the KKT conditions hold, so:


$$\begin{aligned} & \frac{\partial L(w^*,b^*,\alpha ^*)}{\partial w} = w^*-\sum_{i=1}^{N}\alpha_i^* y_i x_i=0 \\ & \frac{\partial L(w^*,b^*,\alpha ^*)}{\partial b} = -\sum_{i=1}^{N}\alpha_i^*y_i=0 \\ & \alpha_i^*\left(1-y_i(w^{*T}x_i+b^*)\right)=0, \quad i=1,2,\ldots,N \\ & \alpha_i^* \geq 0, \quad i=1,2,\ldots,N \\ & 1-y_i(w^{*T}x_i+b^*) \leq 0, \quad i=1,2,\ldots,N \end{aligned}$$

Therefore $w^* = \sum_{i=1}^{N}\alpha_i^* y_i x_i$, and there is at least one $\alpha_j^* > 0$ (by contradiction: if all $\alpha_j^* = 0$, then $w^* = 0$, and $w^* = 0$ is not a solution of the primal optimization problem, which is a contradiction). Thus, for this $j$: $y_j(w^* \cdot x_j + b^*) - 1 = 0$

Substituting into the formula above and noting that $y_j^2 = 1$, we get: $b^* = y_j - w^* \cdot x_j = y_j - \sum_{i=1}^{N}\alpha_i^* y_i (x_i \cdot x_j)$


⑤ According to the theorem, the separating hyperplane can be written as


$$\sum_{i=1}^{N} \alpha_i^*y_i(x\cdot x_i) + b^* = 0$$


⑥ The classification decision function can be written as


$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i^*y_i(x\cdot x_i) + b^*\right)$$

In other words, the decision function here depends only on the inner products between the input $x$ and the training sample inputs. The above equation is also called the dual form of the linearly separable SVM.

To sum up, for a given linearly separable training data set, we can first solve the dual problem for $\alpha^*$, then obtain the solution $w^*, b^*$ of the primal problem, and finally obtain the separating hyperplane and the classification decision function. This algorithm is called the dual learning algorithm of the linearly separable support vector machine, and it is the basic algorithm of linearly separable support vector machines.

2) Linear SVM and soft margin maximization

The case we discussed earlier is linearly separable, but in practice completely linearly separable data is very rare. The figure below shows a typical linearly non-separable classification (there is no way to find a straight line dividing the space into two regions, one containing only black points and the other only white points).

To separate them, there are two schemes:

Scheme 1: use a curve to completely separate them, i.e. a nonlinear decision boundary, which relates to the kernel function discussed later.

Scheme 2: still use a straight line, but do not pursue complete separability; instead, appropriately tolerate some misclassified cases. In this process we add a penalty function to the model, and try to keep the misclassified points from being too numerous or too far off. The penalty for a misclassified point is the distance from the point to its correct region, as shown below:

The yellow and blue lines in the figure are the boundaries on which the support vectors lie, the black line is the decision function, and the green lines represent the distances between the misclassified points and their corresponding decision surface. In this way, we can add a penalty function to the original objective, with its constraints becoming:


$$\begin{aligned} & \min \ \frac{1}{2}\|w\|^{2}+C \sum_{i=1}^{R} \varepsilon_{i}, \\ & s.t. \quad y_{i}\left(w^{T} x_{i}+b\right) \geq 1-\varepsilon_{i}, \quad \varepsilon_{i} \geq 0 \end{aligned}$$
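Like the hard-margin case, this soft-margin primal is a quadratic program. Here is a minimal cvxpy sketch (again an illustrative assumption, with a hand-made overlapping data set and an arbitrarily chosen C):

import numpy as np
import cvxpy as cp

# Hand-made data where the classes overlap (hypothetical), so some slack is needed
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0], [3.5, 3.2]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0                                  # penalty coefficient, chosen arbitrarily here

w = cp.Variable(2)
b = cp.Variable()
eps = cp.Variable(len(y), nonneg=True)   # slack variables epsilon_i >= 0

# min (1/2)||w||^2 + C * sum(eps)  s.t.  y_i (w^T x_i + b) >= 1 - eps_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + b) >= 1 - eps]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, eps.value)       # a nonzero eps_i marks a margin-violating point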

When $x_i$ is on the correct side, $\varepsilon = 0$. $R$ is the total number of points, and $C$ is a user-specified coefficient that indicates how much penalty to apply to misclassified points:

  • When C is large, there will be fewer misclassified points, but overfitting may be serious.

  • When C is small, there may be many misclassified points, and the resulting model may not be very accurate.

In practice, we tune C and choose an appropriate value.
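For some intuition about C, the following small experiment (not from the original article; it uses a hypothetical noisy toy data set) fits sklearn's SVC with a small and a large C and compares the number of support vectors:

from sklearn import svm, datasets

# Two noisy, overlapping blobs (hypothetical toy data)
X, y = datasets.make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100):
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    # A small C tolerates more margin violations, so more points end up as support vectors
    print(f"C={C}: support vectors per class = {clf.n_support_}")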

After this transformation, we can also solve a Lagrangian dual problem and obtain the expression of the dual problem of the original problem:


$$\begin{aligned} & \max _{\alpha} \ \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^{T} x_{j} \\ & s.t. \quad C \geq \alpha_{i} \geq 0, \quad i=1, \cdots, n , \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{aligned}$$

The difference is that the range of $\alpha$ changes from $[0, +\infty)$ to $[0, C]$. The added penalty term $\varepsilon$ does not add much complexity to the dual problem.

3) Nonlinear SVM and kernel function

What if we are dealing with a more complex classification problem that is not even approximately linearly separable, where the degree of misclassification under any straight line is too high to be acceptable?

For such problems, one solution is to map the samples from the original space to a higher-dimensional feature space to make the samples linearly separable in this feature space, and then apply SVM to solve the problem, as shown in the figure below:

For example, a typical linearly non-separable case is shown below:

When we use the mapping function $z_1 = x_1^2, z_2 = x_2^2, z_3 = x_2$ to map these points on the two ellipses into the three-dimensional space $(z_1, z_2, z_3)$, and then rotate the mapped coordinates, a linearly separable set of points is obtained.
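To check this numerically, here is a small sketch (an illustrative assumption: we sample points near two concentric ellipses ourselves) that applies the mapping $z_1 = x_1^2, z_2 = x_2^2, z_3 = x_2$ and verifies that a linear SVM separates the mapped points:

import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)

def ellipse(a, b, n=100, noise=0.05):
    """Sample n noisy points near the ellipse x1^2/a^2 + x2^2/b^2 = 1."""
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[a * np.cos(t), b * np.sin(t)] + rng.normal(0, noise, (n, 2))

X = np.vstack([ellipse(1.0, 1.5), ellipse(3.0, 4.0)])   # inner and outer ellipse
y = np.r_[np.zeros(100), np.ones(100)]

# The mapping (x1, x2) -> (z1, z2, z3) = (x1^2, x2^2, x2)
Z = np.c_[X[:, 0] ** 2, X[:, 1] ** 2, X[:, 1]]

clf = svm.SVC(kernel='linear', C=10).fit(Z, y)
print(clf.score(Z, y))   # close to 1.0: the mapped points are linearly separable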

Let’s recall our previous expression for the dual problem:


$$\begin{aligned} & \max _{\alpha} \ \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^{T} x_{j} \\ & s.t. \quad C \geq \alpha_{i} \geq 0, \quad i=1, \cdots, n, \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{aligned}$$

We make a small modification: let $x_{i}^{T} x_{j} = \kappa\left(x_{i}, x_{j}\right)$. All this does is map the linear space to a higher-dimensional space. There are several choices for $\kappa\left(x_{i}, x_{j}\right)$; here are two typical ones:


$$\kappa\left(x_{i}, x_{j}\right)=\left(x_{i} \cdot x_{j}+1\right)^{d}$$

$$\kappa\left(x_{i}, x_{j}\right)= \exp \left(-\frac{\left(x_{i}-x_{j}\right)^{2}}{2\sigma^{2}}\right)$$

The two kernel functions above are the polynomial kernel and the Gaussian kernel, respectively. The Gaussian kernel even maps the original space into an infinite-dimensional space. In addition, the kernel function has some nice properties; for example, compared with the linear case it does not add much extra computation. Different kernels may give different results on the same problem, and we need to make some attempts in order to choose one (see the Python implementation at the bottom of this article).
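Written out in NumPy, the two kernels look like this (a minimal sketch; the function names are ours, not part of any library):

import numpy as np

def polynomial_kernel(x_i, x_j, d=3):
    """Polynomial kernel: (x_i · x_j + 1)^d."""
    return (np.dot(x_i, x_j) + 1.0) ** d

def gaussian_kernel(x_i, x_j, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    diff = x_i - x_j
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(a, b), gaussian_kernel(a, b))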

3. SVM summary

1) Model summary

Support Vector Machine (SVM) is a binary classification model. Its basic model is the linear classifier with the largest margin defined in the feature space. Its learning strategy is margin maximization, which can be formalized as solving a convex quadratic programming problem.

SVM can be divided into three categories from simple to complex: the linearly separable support vector machine (hard-margin SVM), the linear support vector machine (soft-margin SVM), and the nonlinear support vector machine (kernel SVM).

There are three gems in SVM: the margin, the dual, and the kernel trick.

2) Advantages and disadvantages of the model

(1) Advantages of SVM model

  • SVM training is a convex optimization problem, so the solution found is guaranteed to be globally optimal rather than merely locally optimal.
  • It applies not only to linear problems but also to nonlinear problems (with the help of the kernel trick).
  • The model is robust: the decision boundary depends only on the support vectors rather than the whole data set.
  • The results are excellent on small and medium-sized data sets.
  • It does not need to rely on the entire data set.
  • Strong generalization ability.

(2) Disadvantages of SVM model

  • Solving the quadratic programming problem involves computing an $N$-th order matrix (where $N$ is the number of samples), and the amount of computation increases sharply with the sample size, so SVM is not suitable for very large data sets.

  • The original SVM is only suitable for binary classification problems. (Of course, SVM's generalization SVR is applicable to regression problems, and we can also solve multi-class problems by combining SVM with one-vs-one, one-vs-rest and similar strategies.)

  • There is no universal solution for nonlinear problems, and it is sometimes difficult to find a suitable kernel function.

  • The high-dimensional mapping induced by the kernel function is hard to interpret, especially for the radial basis function.

  • SVM is sensitive to missing data.

For summaries of more supervised learning algorithm models, see the ShowMeAI article AI Knowledge & Skills Quick Reference | Machine Learning - Supervised Learning.

4. SVM code practice based on Python

1) Algorithm package description

We directly use the Python machine learning toolkit sklearn to demonstrate the application of SVM. The SVM algorithms in sklearn are packaged under sklearn.svm (see the official sklearn documentation for details), where the SVC class is used for classification tasks and SVR is used for numerical regression tasks.

Different kernel functions need to specify different parameters.

  • For the linear kernel, you only need to specify the parameter C, which represents the penalty for samples that violate the maximum margin rule.
  • For polynomial kernel functions, in addition to the parameter C, you need to specify degree, which represents the degree of the polynomial.
  • For the Gaussian kernel, in addition to the parameter C, you need to specify the gamma value, which corresponds to $\frac{1}{2\sigma^2}$ in the Gaussian kernel formula (see the small sketch after this list).
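For example, a minimal sketch of the gamma–sigma correspondence (illustrative only; iris is used just as a readily available data set):

from sklearn import svm, datasets

X, y = datasets.load_iris(return_X_y=True)

sigma = 0.5
gamma = 1 / (2 * sigma ** 2)   # gamma corresponds to 1/(2*sigma^2)

svc = svm.SVC(kernel='rbf', C=1, gamma=gamma).fit(X, y)
print(svc.score(X, y))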

2) Use linear kernel functions

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # Take only the first two dimensions for easy visualization
y = iris.target

svc = svm.SVC(kernel='linear', C=1).fit(X, y)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100  # Grid step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()

3) Use polynomial kernel functions

Replace the line of code that initializes the SVM object with the following:

svc = svm.SVC(kernel='poly', degree=3).fit(X, y)

4) Use RBF kernel (Gaussian kernel)

Replace the line of code that initializes the SVM object with the following:

svc = svm.SVC(kernel='rbf', C=1).fit(X, y)

The larger the gamma value, the more exactly the SVM tends to fit every training point, which leads to larger generalization error and overfitting.

C: the penalty parameter for the error term. It controls the trade-off between a smooth decision boundary and correctly classifying the training points.
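Since different kernels and parameter values can give quite different results, a common way to make these attempts systematically is cross-validated grid search. Below is a minimal sketch with sklearn's GridSearchCV (the grid values are arbitrary choices for illustration):

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['poly'], 'C': [0.1, 1, 10], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)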

Reference links

  • Support Vector Machine (SVM)
  • Support vector machine (SVM) foundation
  • If you still don't understand SVM after reading this article, come and hit me

Video tutorial

You can click through to Bilibili to check out the [bilingual subtitles] version of the video

[Bilingual subtitles + data download] Stanford CS229 | Machine Learning - Andrew Ng (2018 · complete version)

www.bilibili.com/video/BV1TT…

ShowMeAI related articles recommended

  • 1. Machine learning basics
  • 2. Model evaluation methods and criteria
  • 3.KNN algorithm and its application
  • 4. Detailed explanation of logistic regression algorithm
  • 5. Detailed explanation of naive Bayes algorithm
  • 6. Detailed explanation of decision tree model
  • 7. Detailed explanation of random forest classification model
  • 8. Detailed analysis of regression tree model
  • 9. Detailed explanation of GBDT model
  • 10. The XGBoost model is fully resolved
  • 11. Detailed explanation of LightGBM model
  • 12. Detailed explanation of support vector machine model
  • 13. Detailed explanation of clustering algorithm
  • 14.PCA dimension reduction algorithm in detail

ShowMeAI series tutorials recommended

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials