Recently I have been learning the SVM algorithm, so I am recording my learning process in this post. Many parts of these notes draw on teacher Z's handout and Li Hang's Statistical Learning Methods; please excuse any shortcomings. In two dimensions, the SVM algorithm looks for a dividing line that separates the two categories. The problem is that, as shown in the figure below, lines of all three colors can separate the dots from the stars, but which line is optimal? That is the question we need to consider.

 

So the first step is to assume that the line w·x + b = 0 is the optimal dividing line separating the two categories, and then to work out how to obtain this optimal line, that is, the values of w and b. In SVM, the optimal separating surface (hyperplane) is the one that maximizes the minimum distance from the support vectors to the hyperplane.

Our goal is to find a hyperplane such that the points closest to it are as far from it as possible. In other words, we do not require every point to be far from the hyperplane; what we care about is finding the hyperplane whose distance to its nearest points is maximal.

Suppose, as in the figure above, that the blue star class has 5 samples with label y = +1 and the purple circle class has 5 samples with label y = -1, giving a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_10, y_10)} of 10 samples in total. The hyperplane (dividing line) is w·x + b = 0. The geometric distance from a sample point x_i to the hyperplane is:

d_i = |w·x_i + b| / ||w||

     

A note on the relationship between the functional distance and the geometric distance: |w·x_i + b| is defined as the functional distance of sample x_i, while the formula above is the geometric distance. Notice that when w and b are scaled by the same factor, the functional distance scales by that factor as well. A simple example: the functional distance from a sample x to 2w·x + 2b = 0 is twice its functional distance to w·x + b = 0, yet the geometric distance remains unchanged.
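To make this scaling behaviour concrete, here is a minimal MATLAB sketch; the hyperplane and the sample are made-up toy values, not taken from the figures above:

% Toy check: scaling (w, b) by the same factor scales the functional
% distance but leaves the geometric distance unchanged.
w = [2; -1]; b = 0.5;        % hypothetical hyperplane w'*x + b = 0
x = [1; 1];  y = 1;          % one hypothetical sample with label +1

func_dist  = y * (w' * x + b);          % functional distance
geom_dist  = func_dist / norm(w);       % geometric distance

func_dist2 = y * ((2*w)' * x + 2*b);    % scale (w, b) by 2: doubles
geom_dist2 = func_dist2 / norm(2*w);    % geometric distance is unchanged

fprintf('functional distance: %.4f -> %.4f\n', func_dist, func_dist2);
fprintf('geometric  distance: %.4f -> %.4f\n', geom_dist,  geom_dist2);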

Now let's talk about how to obtain the hyperplane.

The optimal hyperplane is the one for which the minimum distance from the support vectors to it is maximal, i.e. max[ minimum distance from the support vectors to the hyperplane ]. So we only need to compute the distance from the support vectors to the hyperplane, and the minimum distance from the support vectors to the hyperplane is simply the smallest geometric distance d_i over all samples.

Therefore, the final optimization formula is:
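In standard notation (the symbols here are my own, following the usual textbook presentation), this max-min objective reads:

\max_{w,\,b} \; \min_{i} \; \frac{y_i (w \cdot x_i + b)}{\lVert w \rVert}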

According to the relationship between the functional distance and the geometric distance, when w and b are scaled by the same factor the geometric distance remains unchanged, so such a rescaling does not affect the optimization above. We can therefore scale w and b so that the samples on the support vectors (the sample points closest to the hyperplane) satisfy y(w·x + b) = 1. In the figure above, the functional distance of r1 is 1 and the functional distance of K1 is 1, while the functional distance of every other sample point is greater than 1, i.e. y(w·x + b) >= 1. Substituting this condition into the optimization formula above yields a new optimization formula 1-3:
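Using the convention just described (the support vectors have functional distance 1), optimization formula 1-3 can be written in the standard form:

\max_{w,\,b} \; \frac{1}{\lVert w \rVert}
\quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; i = 1, \dots, N
\qquad (1\text{-}3)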

 


Formula 1-3 maximizes a fraction; this is converted into minimizing its denominator, which gives formula 1-4 and is more convenient to optimize:
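In standard form, formula 1-4 is the equivalent minimization (the square and the factor 1/2 are added purely for computational convenience):

\min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; i = 1, \dots, N
\qquad (1\text{-}4)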

In order to optimize the above formula, the Lagrange multiplier method and the KKT conditions are used to transform it into:
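In the usual notation, introducing one Lagrange multiplier α_i ≥ 0 for each constraint gives the Lagrangian and the equivalent min-max problem:

L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{N} \alpha_i \bigl[ y_i (w \cdot x_i + b) - 1 \bigr], \qquad \alpha_i \ge 0

\min_{w,\,b} \; \max_{\alpha_i \ge 0} \; L(w, b, \alpha)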

 

Let me explain the optimization formula above with a generic example. Suppose our target problem is to minimize f(x) subject to constraints g_i(x) <= 0. We can construct:
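In the standard textbook notation (f, g_i, α and θ_P below are generic symbols, not yet specific to SVM), the construction is:

L(x, \alpha) = f(x) + \sum_{i} \alpha_i \, g_i(x), \qquad \alpha_i \ge 0

\theta_P(x) = \max_{\alpha_i \ge 0} L(x, \alpha) =
\begin{cases}
f(x), & \text{if } g_i(x) \le 0 \text{ for all } i \\
+\infty, & \text{otherwise}
\end{cases}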

 

 

θ_P(x) and the original problem are equivalent: θ_P(x) equals f(x) only when all the constraints are satisfied (otherwise it is +∞), so minimizing θ_P(x) solves the original constrained problem.

So our target function can be written as the minimum over x of θ_P(x), i.e. a min-max of the Lagrangian. If we use the dual expression, the order of min and max is swapped:
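Written out, the primal (min-max) form and its dual (max-min) form are:

\min_{x} \theta_P(x) = \min_{x} \; \max_{\alpha_i \ge 0} \; L(x, \alpha)
\qquad\qquad
\max_{\alpha_i \ge 0} \theta_D(\alpha) = \max_{\alpha_i \ge 0} \; \min_{x} \; L(x, \alpha)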

Since our optimization satisfies strong duality (strong duality means that the optimal value of the dual problem equals the optimal value of the original problem), the following conditions, the KKT conditions, are satisfied at the optimum:
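For reference, the KKT conditions at the optimal point (x*, α*) of this generic problem are:

\nabla_x L(x^*, \alpha^*) = 0

\alpha_i^* \, g_i(x^*) = 0 \quad (\text{complementary slackness})

g_i(x^*) \le 0, \qquad \alpha_i^* \ge 0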


 

Combining this with the duality discussed above, our optimization function becomes the following, where α_i >= 0:
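Applied back to our SVM problem, this dual-order optimization is:

\max_{\alpha_i \ge 0} \; \min_{w,\,b} \; L(w, b, \alpha)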

Now we carry out the optimization from the inside out: first take the inner minimum over w and b, by taking the partial derivatives of L with respect to w and b and setting them to zero, which gives the following formulas:
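Setting these partial derivatives to zero gives the standard stationarity conditions:

\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0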

 

The expressions obtained above are substituted back into the formula, and we then optimize the outer max over α:
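Substituting these back into L eliminates w and b and leaves the standard dual problem in α alone:

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \alpha_i \ge 0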

In the last step, the optimal values of α can be obtained:


From the above, the hyperplane can be obtained!
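Concretely, once the optimal α* is known, w* and b* follow from the standard relations (b* can be computed from any support vector x_j with α_j* > 0), and the classifier is the sign of the resulting function:

w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i
\qquad
b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i \, (x_i \cdot x_j)

f(x) = \operatorname{sign}\bigl( w^* \cdot x + b^* \bigr)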

But in practice there may be some special points. After removing these particular points, most of the remaining points are linearly separable, while a few points cannot be separated, which means that their functional distance is not greater than or equal to 1 but less than 1. To solve this problem, we introduce a slack variable ξ_i >= 0, and the constraint becomes y_i(w·x_i + b) >= 1 - ξ_i.

Therefore, the original optimization function becomes:
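In standard form, the soft-margin problem with penalty parameter C > 0 reads:

\min_{w,\,b,\,\xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0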

After adding the slack variable, a few explanations follow. Let d denote the functional distance of a sample point to the hyperplane; the sample points lying between the green line (where the functional distance equals 1) and the hyperplane all have d < 1, and each of them incurs a loss ξ.

The relationship between the loss variable and the distance d is: when d >= 1, ξ = 0; when d < 1, ξ = 1 - d. A plot of this loss function can therefore be drawn, as shown in Figure 1-7; its shape looks like an opened book hinge, and we call this loss function the hinge loss.
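Written as a single formula, the hinge loss of a sample is:

\xi_i = \max\bigl( 0, \; 1 - y_i (w \cdot x_i + b) \bigr)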

Let's briefly discuss the kernel function. What a kernel function does is actually very simple: it maps the data from a low-dimensional space to a high-dimensional space, making it easier to classify. Kernel functions include the Gaussian (RBF) kernel and others. The influence of the parameters on the model is shown directly in the figure below: as C increases, the tolerance for misclassified points becomes smaller and the generalization ability decreases; when the Gaussian kernel is chosen, increasing the kernel parameter (γ, the R parameter in the figure) raises the training accuracy, but eventually brings a risk of overfitting.
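For reference, the Gaussian (RBF) kernel and the kernelized decision function are shown below; γ is the kernel width parameter, which corresponds to the -g option passed to libsvm in the code that follows:

K(x, z) = \exp\bigl( -\gamma \lVert x - z \rVert^2 \bigr)
\qquad
f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{N} \alpha_i^* y_i \, K(x_i, x) + b^* \Bigr)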

tic % timer
%% clear environment variables
clear; clc;
% format compact
load('issl-isomap.mat') % load CMPE raw data
% mappedX = X;
%% data extraction
zc   = mappedX(1:60,:);    % feature input
lie  = mappedX(61:120,:);
mo   = mappedX(121:180,:);
que  = mappedX(181:240,:);
duan = mappedX(241:300,:);
mm = size(zc,1);
nn = 20;
a = ones(mm,1);   % class labels, one value per class
b = 2*ones(mm,1);
c = 3*ones(mm,1);
d = 4*ones(mm,1);
f = 5*ones(mm,1);
n1 = randperm(size(zc,1));
n2 = randperm(size(lie,1));
n3 = randperm(size(mo,1));
n4 = randperm(size(que,1));
n5 = randperm(size(duan,1));
% the first nn shuffled samples of each class form the training set
train_wine = [zc(n1(1:nn),:);lie(n2(1:nn),:);mo(n3(1:nn),:);que(n4(1:nn),:);duan(n5(1:nn),:)];
% corresponding training set labels
train_wine_labels = [a(1:nn,:);b(1:nn,:);c(1:nn,:);d(1:nn,:);f(1:nn,:)];
% the remaining samples of each class form the test set
test_wine = [zc(n1((nn+1):mm),:);lie(n2((nn+1):mm),:);mo(n3((nn+1):mm),:);que(n4((nn+1):mm),:);duan(n5((nn+1):mm),:)];
% corresponding test set labels
test_wine_labels = [a((nn+1):mm,:);b((nn+1):mm,:);c((nn+1):mm,:);d((nn+1):mm,:);f((nn+1):mm,:)];
% data preprocessing: normalize training and test sets to the interval [0,1]
[mtrain,ntrain] = size(train_wine);
[mtest,ntest] = size(test_wine);
dataset = [train_wine;test_wine];
[dataset_scale,ps] = mapminmax(dataset',0,1);
dataset_scale = dataset_scale';
train_wine = dataset_scale(1:mtrain,:);
test_wine  = dataset_scale((mtrain+1):(mtrain+mtest),:);
%% default parameters of the optimizer
n = 10;      % Population size, typically 10 to 40
N_gen = 150; % Number of generations
A = 0.5;     % Loudness (constant or decreasing)
r = 0.5;     % Pulse rate (constant or decreasing)
% This frequency range determines the scalings
% You should change these values if necessary
Qmin = 0;    % Frequency minimum
Qmax = 2;    % Frequency maximum
% Iteration parameters
N_iter = 0;  % Total number of function evaluations
% Dimension of the search variables
d = 2;       % Number of dimensions (here: C and gamma)
% Lower/upper bounds, as vectors
Lb = [0.01,0.01]; % Lower bound of the values
Ub = [100,100];   % Upper bound of the values
% Initializing arrays
Q = zeros(n,1); % Frequency
v = zeros(n,d); % Velocities
% Initialize the population/solutions
% (the population initialization and the search loop that produce best,
%  fmin and the fitness curve AAA are not included in this excerpt)
% Output/display
disp(['Number of evaluations: ',num2str(N_iter)]);
disp(['Best = ',num2str(best),' fmin = ',num2str(fmin)]);
%% SVM training with the optimum parameters
cmd_gwosvm = ['-c ',num2str(best(:,1)),' -g ',num2str(best(:,2))];
model_gwosvm = svmtrain(train_wine_labels,train_wine,cmd_gwosvm);
%% SVM prediction
[predict_label] = svmpredict(test_wine_labels,test_wine,model_gwosvm);
total = length(test_wine_labels);
right = length(find(predict_label == test_wine_labels));
accuracy = right/total;
% disp('test set classification accuracy');
% str = sprintf('Accuracy = %g%% (%d/%d)',accuracy(1),right,total);
% disp(str);
%% actual vs predicted classification of the test set
figure;
hold on;
plot(test_wine_labels,'o');
plot(predict_label,'r*');
xlabel('test set sample','FontSize',12);
ylabel('category label','FontSize',12);
legend('Actual test set classification','Predicted test set classification');
title('Actual and predicted classification of the test set','FontSize',12);
grid on
snapnow
figure
plot(1:N_gen,AAA); % AAA: fitness curve from the omitted optimization loop