5.1 Linearly separable and linearly inseparable

Linear separability means that there exists a hyperplane (linear function) $w^{T}x + b = 0$ such that, for every sample point $x_i$ and its corresponding label $y_i$,


$$\begin{cases} w^{T}x_i+b>0, & y_i = +1 \\ w^{T}x_i+b<0, & y_i = -1 \end{cases} \tag{1}$$

Here’s what it looks like:

The figure above shows a linearly inseparable case, and the figure below a linearly separable one.
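
As a quick numerical companion to definition (1), the sketch below checks whether a given hyperplane separates a labeled sample. The helper name `is_separating` and the toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def is_separating(w, b, X, y):
    """Check condition (1): w^T x_i + b > 0 when y_i = +1 and < 0 when y_i = -1."""
    scores = X @ w + b          # signed value of w^T x_i + b for every sample
    return bool(np.all(y * scores > 0))

# Toy data: two points on opposite sides of the line x1 + x2 - 1 = 0
X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([+1, -1])
w = np.array([1.0, 1.0])
b = -1.0
print(is_separating(w, b, X, y))  # True: this hyperplane separates the two classes
```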

5.2 Support Vector Machine (SVM)

For a linearly separable problem, our goal is to find a hyperplane that correctly separates the samples. Since the sample size is finite, there are usually many hyperplanes that classify every sample correctly, as shown in the figure below:

So which hyperplane is the better choice? The support vector machine chooses the hyperplane with the largest margin as the separating hyperplane.

5.2.1 Margin

To define the margin, we first need a few concepts:

1. Support vector

In a linearly separable problem, the support vectors are the sample points closest to the separating hyperplane. We all know the formula for the distance from a point to a line; extending it to higher dimensions gives the distance from a point to a hyperplane:


$$r = \frac{|w^{T}x + b|}{||w||} \tag{2}$$
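
A minimal sketch of formula (2), assuming NumPy arrays; `point_to_hyperplane` is an illustrative name of my own, not one used in the text.

```python
import numpy as np

def point_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w^T x + b = 0, formula (2)."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Example: distance from the origin to the line x1 + x2 - 2 = 0 is 2 / sqrt(2) ≈ 1.414
w = np.array([1.0, 1.0])
b = -2.0
print(point_to_hyperplane(w, b, np.array([0.0, 0.0])))
```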

2. Properties of hyperplane equations

For the hyperplane $w^{T}x + b = 0$, scaling $w$ and $b$ by the same nonzero factor leaves the hyperplane unchanged. For example, $5x + 6 = 0$ and $10x + 12 = 0$ describe the same line.
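
This scaling property can be checked numerically: multiplying $w$ and $b$ by the same positive constant changes neither the side a point falls on nor its distance under (2). The snippet below is a small illustrative check with made-up numbers.

```python
import numpy as np

w, b = np.array([5.0]), 6.0          # the line 5x + 6 = 0
x = np.array([1.0])
for k in (1.0, 2.0, 10.0):           # scale (w, b) by k, e.g. k = 2 gives 10x + 12 = 0
    score = (k * w) @ x + k * b
    dist = abs(score) / np.linalg.norm(k * w)
    print(np.sign(score), dist)      # the sign and the distance are the same for every k
```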

The margin here is the sum of the distances from the support vectors of the two classes to the hyperplane. It is also called the hard margin; the corresponding soft margin will be described later.

5.2.2 Optimal hyperplane

However, we find that there may be more than one hyperplane with the maximum margin, or even infinitely many, as in the following case:

Given three different lines with the same margin, which one should we choose? Intuitively, most people would pick the middle one, and so does the support vector machine: it chooses the hyperplane that is equidistant from the support vectors of the two classes. The classification produced by this separating hyperplane is the most robust and generalizes best to unseen instances. Using the scaling property of the hyperplane equation mentioned above, (1) can be rewritten as:


$$\begin{cases} w^{T}x_i+b \ge +1, & y_i = +1 \\ w^{T}x_i+b \le -1, & y_i = -1 \end{cases} \tag{3}$$

Equality holds exactly at the support vectors: for a support vector $x_i$ we have $|w^{T}x_i + b| = 1$. By (2), the distance from a support vector $x_i$ to the hyperplane is then $\frac{1}{||w||}$, so the margin of the support vector machine is $\frac{2}{||w||}$. Our goal is to find the largest $\frac{2}{||w||}$, which is the same as finding the smallest $\frac{||w||}{2}$; since the square root is inconvenient to work with, we instead look for the smallest $\frac{||w||^2}{2}$.
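
A small numeric illustration of this step, assuming a hyperplane already scaled so that the support vectors satisfy $|w^{T}x_i + b| = 1$; the numbers are made up for illustration.

```python
import numpy as np

w = np.array([3.0, 4.0])             # ||w|| = 5
b = -1.0
x_sv = np.array([0.4, 0.2])          # chosen so that w^T x_sv + b = +1 (a support vector)

dist = abs(w @ x_sv + b) / np.linalg.norm(w)   # distance via formula (2)
print(dist, 1 / np.linalg.norm(w))             # both 0.2: the distance equals 1/||w||
print(2 / np.linalg.norm(w))                   # margin 2/||w|| = 0.4
```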

Constraint (3) can be written more compactly as:


$$y_i(w^{T}x_i + b) \ge 1 \tag{4}$$

To sum up, the problem to be solved is:


$$\begin{cases} \min\limits_{w,b} \ \frac{||w||^2}{2} \\ \text{s.t.} \quad y_i(w^{T}x_i + b) \ge 1 \end{cases} \tag{5}$$

With this, the problem that the support vector machine needs to solve is clear, but solving it directly is difficult. Later, the primal problem and the dual problem will be introduced to simplify the solution of (5).
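
Before turning to the dual, here is a hedged illustration of solving (5) directly on a tiny dataset with a generic convex solver. It assumes the third-party cvxpy package and toy data of my own; the original text does not prescribe any particular solver.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Problem (5): minimize ||w||^2 / 2 subject to y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 2 / np.linalg.norm(w.value))
```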