Support Vector Machine (SVM) is a binary classification algorithm: it separates data described by multiple features (attributes) into two classes. Neural networks and other mainstream machine learning models can already handle binary and multi-class classification well, yet SVM remains well worth studying: the rich algorithmic ideas behind it pay dividends when researching other algorithms, and implementing SVM draws together one-dimensional search, the KKT conditions, penalty functions, and other prerequisite knowledge. This article first explains the principle of SVM in detail and then demonstrates the algorithm with MATLAB code.

To make things easy to visualize, assume each sample has two attributes, whose values correspond to the X and Y axes of a two-dimensional coordinate system, as shown in the figure below:

In the example, the samples clearly fall into two categories: the solid black points can be called category 1 and the hollow points category 2. In practice the categories are numerized, for example category 1 is represented by 1 and category 2 by -1; the numeric category is called the label. There is no hard and fast rule about which category gets label 1 and which gets -1; choose according to preference. Note, however, that a label cannot be set to 0, because in the SVM algorithm the labels participate directly in the mathematical calculations.

Looking at the figure again: if a straight line can be found that separates the solid points from the hollow points, then when a new sample arrives, its attribute values can be plotted as coordinates and its category determined from which side of the line it falls on. Countless lines satisfy this requirement; SVM looks for the most suitable one. In the figure, the green line is clearly unacceptable: it misclassifies points even without a validation set. Both the blue line and the red line classify correctly, but the blue line passes too close to the solid black points, so by comparison the red line is the 'fairer' one, and it is the classification line SVM seeks. The 'fairness' of the red line can be stated mathematically as follows: regard the black points and the hollow points as two sets, and find a line such that the black points nearest to it lie at distance d, and the hollow points nearest to it also lie at distance d. That line is the linear classifier we want, and the points in the two sets closest to it are called the support vectors, which is what gives the support vector machine its name.

Some algorithm books describe the SVM objective as: find a straight line that maximizes the margin between the line and the nearest points of the two sets. This agrees with the figure: the distance from a black point to the blue line is smaller than its distance to the red line (a leg of a right triangle is shorter than the hypotenuse). If the support vectors can be found, the classification line can be found, and vice versa. So far the behavior of the algorithm has been deduced by inspecting a two-dimensional plane with two attributes. When samples have many attributes, how can the regularity of the algorithm be generalized? The starting point is the separation theorem of convex sets, which is not only the core theoretical support of SVM but also a cornerstone of machine learning algorithms.

1. Separation theorem of convex sets

Taking two-dimensional space as an example, recall the equation of a line from middle school, such as $y = 2x - 1$, shown in the following figure:

Write the equation of the line $y = 2x - 1$ in inner-product form:

$$(-2,\ 1) \cdot (x,\ y) = -1$$

The vector $(-2, 1)$ corresponds to the vector OA in the figure above. Normalize OA into a unit vector OS, that is, a vector with the same direction as OA and modulus 1; the coordinates of S are $\left(-\tfrac{2}{\sqrt{5}},\ \tfrac{1}{\sqrt{5}}\right)$. Dividing both sides of the inner-product equation by $\|(-2, 1)\| = \sqrt{5}$ gives:

$$\left(-\tfrac{2}{\sqrt{5}},\ \tfrac{1}{\sqrt{5}}\right) \cdot (x,\ y) = -\tfrac{1}{\sqrt{5}}$$

Here $(x, y)$ is any point on the line $y = 2x - 1$, and the formula above says that the inner product of any such point with the unit vector $s = \left(-\tfrac{2}{\sqrt{5}},\ \tfrac{1}{\sqrt{5}}\right)$ is $-\tfrac{1}{\sqrt{5}}$. In the figure, the length of the vector OP is $\tfrac{1}{\sqrt{5}}$; the sign is negative because OP points opposite to OS. Likewise, the projections of the vectors v1 and v2 onto OS are both OP. The example illustrates that, by introducing the vector OS, the infinitely many points on the line $y = 2x - 1$ are all mapped to the single value $-\tfrac{1}{\sqrt{5}}$ on OS; equivalently, the line $y = 2x - 1$ is fully represented by the vector OS together with the coordinate $-\tfrac{1}{\sqrt{5}}$. Projection via the inner product turns high-dimensional data into a single real number along a vector; this is a linear functional. In mathematics the inner product is often used this way to reduce the dimension of data, since condensing multidimensional data into one real number makes later analysis and processing convenient.

Extended to an arbitrary dimension $n$, the equation above becomes the set $S = \{x \in \mathbb{R}^n \mid p^T x = \alpha\}$. Corresponding to the figure: the vector $p$ plays the role of OS, and $x$ is any point on the line. In the SVM algorithm, the set $S$ formed by the points $x$ is called a hyperplane. Sometimes the projection of a high-dimensional data set onto $p$ is restricted to a range of inner-product values, such as $\{x \mid p^T x \ge \alpha\}$ or $\{y \mid p^T y \le \alpha\}$; this projects the high-dimensional data onto two intervals of the vector $p$:
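As a quick numerical check of this idea (a minimal sketch added here for illustration, not part of the original derivation), the following MATLAB snippet samples points on $y = 2x - 1$ and verifies that their inner product with the unit vector $s$ is always $-\tfrac{1}{\sqrt{5}} \approx -0.4472$:

% Verify that every point on y = 2x - 1 projects to the same value on s
x = linspace(-3, 3, 7);          % sample x coordinates
P = [x; 2*x - 1];                % points on the line, one per column
s = [-2; 1] / norm([-2; 1]);     % unit vector s = (-2/sqrt(5), 1/sqrt(5))
proj = s' * P;                   % inner products of each point with s
disp(proj);                      % every entry is approximately -0.4472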

Next, the separation theorem of convex sets is introduced:

In the figure above, X and Y are convex sets in a higher-dimensional space with $X \cap Y = \varnothing$. Projecting X and Y onto a vector $p$ yields two convex sets X' and Y' that lie at opposite ends of $p$; that is, X' and Y' are separable.

Notice the two conditions in the theorem: X and Y are convex, and their intersection is empty. That X' and Y' are separable means the two can be distinguished on the vector $p$; since X ↔ X' and Y ↔ Y' are one-to-one mappings, separability of X' and Y' indirectly means that X and Y can also be distinguished. This 'separability' is exactly the dichotomy SVM has to realize: the separation theorem of convex sets says that two disjoint convex sets are linearly separable. Conversely, if two data sets cannot be separated linearly, a kernel function can map the data into a space where they become two disjoint convex sets, so the theorem also guides the design of kernel functions.
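To see the theorem in action, here is an illustrative MATLAB sketch with made-up data (not from the original text): project two disjoint point clouds onto a direction $p$ and observe that the projected intervals do not overlap.

% Two disjoint convex point clouds (their convex hulls do not intersect)
rng(0);                                 % reproducible toy data
X = 0.5 * randn(2, 50) + [2; 2];        % cloud around (2, 2)
Y = 0.5 * randn(2, 50) + [-2; -2];      % cloud around (-2, -2)
p = [1; 1] / sqrt(2);                   % unit direction to project onto
projX = p' * X;                         % projection of X: an interval X'
projY = p' * Y;                         % projection of Y: an interval Y'
fprintf('X'' in [%.2f, %.2f], Y'' in [%.2f, %.2f]\n', ...
        min(projX), max(projX), min(projY), max(projY));
% With this data the two intervals are disjoint: X' and Y' are separable.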

2. SVM algorithm

2.1 Hyperplane and maximum margin

As described above, the linear classifier that separates the data is called a hyperplane, and SVM needs the hyperplane with the maximum margin. Let the hyperplane be $S = \{x \mid p^T x = \alpha\}$. Because both sides of $p^T x = \alpha$ can be divided by any positive number, $p$ can always be rescaled into a unit vector, so without loss of generality we assume $p$ is a unit vector; this assumption does not affect the SVM algorithm. In the literature, hyperplanes are usually written as implicit functions:

$$S = \{x \in \mathbb{R}^n \mid p^T x - \alpha = 0\}, \quad p \in \mathbb{R}^n, \quad \|p\| = 1$$

By elementary geometry, and because $\|p\| = 1$, the distance between any point $x$ in the space and the hyperplane $S$ can be expressed as:

$$\mathrm{dist}(x, S) = |p^T x - \alpha|$$

Let the distance between the support vectors of the two categories and the hyperplane be $d$, with $d > 0$. Since the support vectors are, within their respective classes, the points closest to the hyperplane, the following inequality holds for all data points $x_i$:

$$|p^T x_i - \alpha| \ge d \qquad (1)$$

Dividing both sides of inequality (1) by $d$ gives $\left|\frac{p^T}{d} x_i - \frac{\alpha}{d}\right| \ge 1$. Using substitution, let:

$$\omega = \frac{p}{d}, \qquad b = -\frac{\alpha}{d} \qquad (2)$$

This yields the usual form of the constraints:

$$|\omega^T x_i + b| \ge 1 \qquad (3)$$

Next, the absolute value in formula (3) must be removed. SVM is a binary classification problem, so we can assign the label $y = 1$ to samples with $\omega^T x + b \ge 1$ and $y = -1$ to samples with $\omega^T x + b \le -1$. Then all data satisfy the single inequality:

$$y_i(\omega^T x_i + b) \ge 1 \qquad (4)$$

Looking back at the substitution (2), $\omega = \frac{p}{d}$; taking norms of both sides gives $\|\omega\| = \frac{\|p\|}{d} = \frac{1}{d}$, hence $d = \frac{1}{\|\omega\|}$. The SVM algorithm must maximize the distance $d$ subject to constraint (4). Assuming there are $m$ data points to classify, running SVM is equivalent to solving a nonlinear programming problem with inequality constraints:

$$\max_{\omega, b} \ \frac{1}{\|\omega\|} \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1, \quad i = 1, \dots, m$$

The above problem can also be expressed as a minimization problem (maximizing $\frac{1}{\|\omega\|}$ is equivalent to minimizing $\|\omega\|$; the square and the factor $\frac{1}{2}$ are added for computational convenience):

$$\min_{\omega, b} \ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1, \quad i = 1, \dots, m \qquad (5)$$

The classification achieved by the hyperplane and the relationship among the various parameters can be seen in the following figure:
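Problem (5) is a standard quadratic program, so it can be solved directly. The following MATLAB sketch (an illustrative example with toy data, assuming the Optimization Toolbox's quadprog is available and the data is linearly separable) solves the hard-margin problem for two-dimensional samples:

% Hard-margin SVM via quadprog: min 0.5*||w||^2  s.t.  y_i*(w'*x_i + b) >= 1
rng(1);
X = [randn(20, 2) + 3; randn(20, 2) - 3];  % toy samples, one per row, well separated
y = [ones(20, 1); -ones(20, 1)];           % labels +1 / -1
m = size(X, 1);
H = diag([1, 1, 0]);                       % quadratic term penalizes w only, not b
f = zeros(3, 1);                           % no linear term; variables z = [w; b]
A = -[y .* X, y];                          % y_i*(w'*x_i + b) >= 1  <=>  A*z <= c
c = -ones(m, 1);
z = quadprog(H, f, A, c);                  % solve the QP
w = z(1:2); b = z(3);
fprintf('w = (%.3f, %.3f), b = %.3f, margin d = %.3f\n', w(1), w(2), b, 1/norm(w));

The printed margin is $d = \frac{1}{\|\omega\|}$, matching the derivation above.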

2.2 Maximum soft margin

In practice, because of anomalous data, a hyperplane often cannot split the data perfectly into two parts, as shown in the following figure:

A small number of anomalous points are mixed into the two classes. If those few points are ignored, the hyperplane in the figure above can still separate most of the samples. After introducing slack variables, formula (5) can be made compatible with this situation:

$$\min_{\omega, b, \xi} \ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (5.1)$$

In the normal case $\xi_i = 0$, meaning the sample lies on the correct side of its support vector. $\xi_i > 0$ indicates anomalous data: when $0 < \xi_i < 1$, the sample sits between the hyperplane and the support vector, such as points 2 and 4 in the figure above; when $\xi_i > 1$, the sample has crossed into the other class's half-space, such as points 1 and 3. A margin satisfying formula (5.1) is called a soft margin. The soft-margin optimization not only asks for the maximum margin of the support vectors but also keeps each $\xi_i$ as small as possible; in other words, once the hyperplane is determined, the number of samples flagged as anomalous should be minimized.
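In MATLAB, the soft-margin formulation is what fitcsvm solves; its BoxConstraint parameter plays the role of the penalty $C$ on the slack sum in (5.1). A minimal sketch with toy overlapping data (illustrative only, assuming the Statistics and Machine Learning Toolbox):

% Soft-margin linear SVM: BoxConstraint is the penalty C in (5.1)
rng(2);
X = [randn(30, 2) + 1; randn(30, 2) - 1];          % overlapping clouds
y = [ones(30, 1); -ones(30, 1)];
mdl = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 1);
xi = max(0, 1 - y .* (X * mdl.Beta + mdl.Bias));   % slack of each sample
fprintf('%d samples with xi > 0 (margin violations)\n', nnz(xi > 1e-6));

Increasing BoxConstraint penalizes the slacks more heavily, shrinking the soft margin toward the hard-margin solution of (5).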

The theory above is exercised in the following MATLAB project: part 1 fits the data with polynomials of increasing degree, and part 2 fits SVM models with several kernels. The main script:

[x_train, y_train, x_test, y_test] = data();     % load data

% Project part 1
disp('--------Project part 1----------');
plot_alldata(x_train, y_train, x_test, y_test);  % plot all data
fit_data(x_train, y_train, x_test, y_test);      % fit data & plot the best model

% Project part 2
disp('--------Project part 2----------');
svm(x_train, y_train, x_test, y_test);           % SVM model fitting & plots
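The helper functions data, plot_alldata, fit_data, and svm are not shown in the original. From the output below one can infer that svm fits SVM regression models and reports mse and r2; a plausible sketch of its optimized case (a hypothetical reconstruction, assuming MATLAB's fitrsvm, whose 'OptimizeHyperparameters' option produces exactly the BoxConstraint/KernelScale/Epsilon search traces shown below) might look like this:

function svm_sketch(x_train, y_train, x_test, y_test)
% Hypothetical reconstruction of the optimized SVM-regression case.
mdl = fitrsvm(x_train, y_train, ...
    'KernelFunction', 'gaussian', ...
    'OptimizeHyperparameters', 'auto', ...   % tunes BoxConstraint, KernelScale, Epsilon
    'HyperparameterOptimizationOptions', ...
    struct('MaxObjectiveEvaluations', 30));  % 30 evaluations, as in the log below
y_pred = predict(mdl, x_test);
mse = mean((y_test - y_pred).^2);            % test mean squared error
r2 = 1 - sum((y_test - y_pred).^2) / sum((y_test - mean(y_test)).^2);
fprintf('mse = %.4e, r2 = %.4f\n', mse, r2);
end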

Running main_project produces the following output:

>> main_project

Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'PreserveVariableNames' to true to use the original column headers as table variable names.

--------Project part 1----------

| n | mse_train  | r2_train | mse_test   | r2_test |
| 1 | 2.9834e+04 | 0.1084   | 3.2158e+04 | 0.1305  |
| 2 | 1.3195e+04 | 0.6057   | 1.5085e+04 | 0.6028  |
| 3 | 1.2732e+04 | 0.6195   | 1.4313e+04 | 0.6237  |
| 4 | 1.2687e+04 | 0.6208   | 1.4152e+04 | 0.6283  |
| 5 | 1.1969e+04 | 0.6423   | 1.3453e+04 | 0.6470  |
| 6 | 6.3150e+03 | 0.8113   | 6.8526e+03 | 0.8285  |
| 7 | 6.2697e+03 | 0.8126   | 6.8162e+03 | 0.8296  |
| 8 | 3.0802e+03 | 0.9079   | 3.3461e+03 | 0.9258  |
| 9 | 2.9532e+03 | 0.9117   | 3.3297e+03 | 0.9261  |

For n = 7, 8, and 9, polyfit (line 79, called from fit_data line 5, main_project line 6) warns: "Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT."

The same badly-conditioned warning appears once more while fit_data (line 32) refits the best model via polyfit, after which the coefficients of the selected degree-9 polynomial are printed:

model coefficient:
0.0000 -0.0001 0.0074 -0.2230 3.8894 -39.3375 219.0309 -587.1015 589.8787 0.1253


--------Project part 2----------

-------Gaussian-------

CASE1 (default): mse = 1.4268e+03, r2 = 0.9771
CASE2: mse = 1.6445e+03, r2 = 0.9758
CASE3: mse = 1.4664e+03, r2 = 0.9772

OPTIMIZED:

|====================================================================================================================|

| Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon |

| | result | log(1+loss) | runtime | (observed) | (estim.) | | | |

|====================================================================================================================|

| 1 | Best | 10.431 | 0.89758 | 10.431 | 10.431 | 394.32 | 736.91 | 10.115 |

| 2 | Accept | 10.437 | 0.1401 | 10.431 | 10.432 | 0.0011055 | 0.053194 | 47.555 |

| 3 | Best | 10.426 | 0.071798 | 10.426 | 10.426 | 25.627 | 30.172 | 19150 |

| 4 | Accept | 10.426 | 0.054275 | 10.426 | 10.426 | 1.1377 | 0.0012314 | 14020 |

| 5 | Accept | 10.428 | 0.14271 | 10.426 | 10.426 | 470.09 | 836.88 | 0.41086 |

| 6 | Accept | 10.426 | 0.037576 | 10.426 | 10.426 | 2.6562 | 0.0010627 | 3310.4 |

| 7 | Accept | 10.426 | 0.034094 | 10.426 | 10.426 | 0.0010521 | 24.307 | 5816.8 |

| 8 | Best | 10.074 | 0.065287 | 10.074 | 10.074 | 17.708 | 0.056046 | 1.6934 |

| 9 | Accept | 10.426 | 0.038535 | 10.074 | 10.074 | 132.62 | 0.08944 | 3956.1 |

| 10 | Best | 9.9965 | 0.06095 | 9.9965 | 9.9965 | 21.302 | 0.036971 | 0.28347 |

| 11 | Best | 8.3865 | 0.07324 | 8.3865 | 8.3867 | 642.83 | 0.0044812 | 0.22288 |

| 12 | Accept | 9.5704 | 0.046863 | 8.3865 | 8.3867 | 993.59 | 0.0028159 | 156.21 |

| 13 | Best | 8.3864 | 0.071449 | 8.3864 | 8.3864 | 975.49 | 0.0012285 | 0.27834 |

| 14 | Accept | 8.3865 | 0.064198 | 8.3864 | 8.3862 | 507.99 | 0.0010299 | 0.24398 |

| 15 | Best | 8.3853 | 0.06352 | 8.3853 | 8.386 | 611.28 | 0.0019797 | 0.51237 |

| 16 | Accept | 8.3865 | 0.069143 | 8.3853 | 8.3751 | 660.76 | 0.0013634 | 0.22497 |

| 17 | Accept | 8.386 | 0.068376 | 8.3853 | 8.3773 | 828.03 | 0.0024918 | 0.35096 |

| 18 | Accept | 8.3861 | 0.067393 | 8.3853 | 8.3799 | 732.96 | 0.0014109 | 0.32464 |

| 19 | Accept | 8.3861 | 0.062025 | 8.3853 | 8.3797 | 520.66 | 0.0022738 | 0.30965 |

| 20 | Accept | 8.3863 | 0.066882 | 8.3853 | 8.3797 | 668.85 | 0.0018481 | 0.24784 |

|====================================================================================================================|

| Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon |

| | result | log(1+loss) | runtime | (observed) | (estim.) | | | |

|====================================================================================================================|

| 21 | Accept | 8.3866 | 0.063036 | 8.3853 | 8.3796 | 903.96 | 0.0023935 | 0.22712 |

| 22 | Accept | 8.3864 | 0.065384 | 8.3853 | 8.3807 | 988.43 | 0.0056305 | 0.26435 |

| 23 | Accept | 8.3857 | 0.080765 | 8.3853 | 8.3808 | 768.95 | 0.0028004 | 0.40427 |

| 24 | Accept | 10.434 | 0.047994 | 8.3853 | 8.3807 | 0.036485 | 962.56 | 0.22485 |

| 25 | Accept | 10.432 | 0.091279 | 8.3853 | 8.3805 | 0.065077 | 0.001002 | 0.22826 |

| 26 | Accept | 10.426 | 0.03629 | 8.3853 | 8.3804 | 0.15322 | 962.81 | 20662 |

| 27 | Accept | 8.3865 | 0.063816 | 8.3853 | 8.3815 | 648.08 | 0.0022537 | 0.24789 |

| 28 | Accept | 10.426 | 0.035603 | 8.3853 | 8.3814 | 0.0013648 | 0.0010192 | 21509 |

| 29 | Accept | 10.434 | 0.048371 | 8.3853 | 8.3813 | 0.0010406 | 767.16 | 0.32152 |

| 30 | Accept | 8.3866 | 0.079493 | 8.3853 | 8.3814 | 912.54 | 0.0035796 | 0.24067 |


__________________________________________________________

Optimization completed.
MaxObjectiveEvaluations of 30 reached.
Total function evaluations: 30
Total elapsed time: 26.8651 seconds.
Total objective function evaluation time: 2.808

Best observed feasible point:

BoxConstraint    KernelScale    Epsilon
_____________    ___________    _______
   611.28         0.0019797     0.51237

Observed objective function value = 8.3853
Estimated objective function value = 8.3814
Function evaluation time = 0.06352

Best estimated feasible point (according to models):

BoxConstraint    KernelScale    Epsilon
_____________    ___________    _______
   668.85         0.0018481     0.24784

Estimated objective function value = 8.3814
Estimated function evaluation time = 0.067649

mse = 2.0132e+03, r2 = 0.9671


-------RBF-------

CASE1 (default): mse = 1.4268e+03, r2 = 0.9771
CASE2: mse = 1.4664e+03, r2 = 0.9772
CASE3: mse = 1.4591e+03, r2 = 0.9766

(In MATLAB, 'gaussian' and 'rbf' are names for the same kernel, which is why the default case reproduces the Gaussian CASE1 numbers.)


OPTIMIZED:

|====================================================================================================================|

| Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon |

| | result | log(1+loss) | runtime | (observed) | (estim.) | | | |

|====================================================================================================================|

| 1 | Best | 10.437 | 0.064096 | 10.437 | 10.437 | 0.0024199 | 0.012131 | 0.82686 |

| 2 | Best | 10.424 | 0.035964 | 10.424 | 10.425 | 792.75 | 0.51879 | 2798.3 |

| 3 | Accept | 10.424 | 0.033958 | 10.424 | 10.424 | 0.0015712 | 1.7285 | 12257 |

| 4 | Best | 10.315 | 0.052699 | 10.315 | 10.315 | 64.709 | 60.677 | 0.43676 |

| 5 | Accept | 10.43 | 0.053824 | 10.315 | 10.341 | 598.05 | 816.9 | 0.22621 |

| 6 | Best | 10.315 | 0.049229 | 10.315 | 10.315 | 65.348 | 60.686 | 0.73483 |

| 7 | Accept | 10.424 | 0.032769 | 10.315 | 10.315 | 0.068404 | 38.268 | 382.69 |

| 8 | Best | 10.23 | 0.046858 | 10.23 | 10.23 | 163.2 | 61.687 | 38.56 |

| 9 | Accept | 10.424 | 0.03518 | 10.23 | 10.23 | 4.1446 | 62.152 | 2623.5 |

| 10 | Accept | 10.424 | 0.033162 | 10.23 | 10.23 | 199.56 | 62.459 | 2073.2 |

| 11 | Best | 10.113 | 0.048122 | 10.113 | 10.113 | 307.74 | 60.662 | 18.094 |

| 12 | Accept | 10.44 | 0.052386 | 10.113 | 10.113 | 0.0020345 | 55.312 | 10.057 |

| 13 | Best | 9.5718 | 0.048005 | 9.5718 | 10.002 | 694.03 | 28.973 | 12.634 |

| 14 | Best | 7.8382 | 0.057983 | 7.8382 | 7.8399 | 932.32 | 4.265 | 11.006 |

| 15 | Accept | 10.433 | 0.041807 | 7.8382 | 7.8384 | 0.39209 | 4.3466 | 348.78 |

| 16 | Accept | 8.111 | 0.055765 | 7.8382 | 7.8388 | 976.56 | 5.5148 | 1.4862 |

| 17 | Accept | 10.424 | 0.032728 | 7.8382 | 7.8386 | 992.59 | 4.0578 | 2128.8 |

| 18 | Best | 7.65 | 0.066125 | 7.65 | 7.65 | 819.88 | 2.352 | 4.6514 |

| 19 | Accept | 7.7333 | 0.062206 | 7.65 | 7.6576 | 912.6 | 3.2069 | 4.5506 |

| 20 | Best | 7.6008 | 0.071054 | 7.6008 | 7.6011 | 990.85 | 1.656 | 8.9775 |

|====================================================================================================================|

| Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon |

| | result | log(1+loss) | runtime | (observed) | (estim.) | | | |

|====================================================================================================================|

| 21 | Accept | 7.6387 | 0.098309 | 7.6008 | 7.6007 | 982.52 | 0.83259 | 2.7969 |

| 22 | Accept | 7.6275 | 0.076077 | 7.6008 | 7.6018 | 797.68 | 0.9112 | 0.29266 |

| 23 | Accept | 7.9363 | 0.0648 | 7.6008 | 7.6019 | 994.85 | 0.32932 | 0.23043 |

| 24 | Accept | 7.6935 | 0.045446 | 7.6008 | 7.6018 | 29.346 | 0.967 | 0.37377 |

| 25 | Accept | 7.6274 | 0.080696 | 7.6008 | 7.6017 | 910.2 | 0.9449 | 0.91925 |

| 26 | Accept | 7.6121 | 0.077758 | 7.6008 | 7.603 | 943 | 1.1589 | 5.0625 |

| 27 | Accept | 7.6153 | 0.062063 | 7.6008 | 7.6032 | 517.55 | 1.4588 | 0.25744 |

| 28 | Accept | 7.6051 | 0.058127 | 7.6008 | 7.6033 | 254.49 | 1.155 | 0.38512 |

| 29 | Accept | 7.6205 | 0.075103 | 7.6008 | 7.603 | 999.64 | 1.4418 | 1.9978 |

| 30 | Accept | 7.6088 | 0.051488 | 7.6008 | 7.604 | 227.74 | 1.2748 | 1.9489 |


__________________________________________________________

Optimization completed.
MaxObjectiveEvaluations of 30 reached.
Total function evaluations: 30
Total elapsed time: 20.9975 seconds.
Total objective function evaluation time: 1.6638

Best observed feasible point:

BoxConstraint    KernelScale    Epsilon
_____________    ___________    _______
   990.85          1.656         8.9775

Observed objective function value = 7.6008
Estimated objective function value = 7.604
Function evaluation time = 0.071054

Best estimated feasible point (according to models):

BoxConstraint    KernelScale    Epsilon
_____________    ___________    _______
   254.49          1.155        0.38512

Estimated objective function value = 7.604
Estimated function evaluation time = 0.057088

mse = 1.3855e+03, r2 = 0.9748


-------Polynomial-------

CASE1 (default): mse = 1.6898e+04, r2 = 0.6155
CASE2: mse = 1.5912e+04, r2 = 0.6422
CASE3: mse = 6.6300e+03, r2 = 0.8565


OPTIMIZED:

|====================================================================================================================|

| Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon |

| | result | log(1+loss) | runtime | (observed) | (estim.) | | | |

|====================================================================================================================|

| 1 | Best | 17.436 | 28.927 | 17.436 | 17.436 | 232.42 | 0.041278 | 0.77342 |

| 2 | Best | 10.42 | 0.071368 | 10.42 | 10.775 | 21.908 | 0.010392 | 12625 |

| 3 | Accept | 10.42 | 0.043085 | 10.42 | 10.42 | 0.0058184 | 18.398 | 3455.5 |

| 4 | Accept | 71.286 | 28.847 | 10.42 | 11.828 | 0.0059341 | 0.0013698 | 2.1953 |

| 5 | Accept | 10.42 | 0.047948 | 10.42 | 10.418 | 24.457 | 0.0030582 | 9739.2 |

| 6 | Accept | 10.42 | 0.037128 | 10.42 | 10.418 | 0.10555 | 1.1212 | 22115 |

| 7 | Accept | 42.845 | 28.004 | 10.42 | 10.415 | 998.17 | 0.0026635 | 83.309 |

| 8 | Accept | 10.42 | 0.038793 | 10.42 | 10.412 | 0.36885 | 0.0012841 | 5198.9 |

| 9 | Accept | 81.729 | 28.321 | 10.42 | 10.418 | 944.81 | 0.0043766 | 0.22306 |

| 10 | Best | 9.571 | 0.40222 | 9.571 | 9.5701 | 172.69 | 0.60399 | 0.77291 |

| 11 | Accept | 9.5906 | 0.17913 | 9.571 | 9.5705 | 0.021205 | 0.14439 | 154.14 |

| 12 | Accept | 10.435 | 0.045818 | 9.571 | 9.5705 | 42.148 | 256.32 | 0.22485 |

| 13 | Accept | 10.42 | 0.036242 | 9.571 | 9.5706 | 65.078 | 995.86 | 9943.1 |