
7.1 The primal problem

There are roughly two cases of linear inseparability: only a few samples near the boundary are linearly inseparable (left), or most samples are linearly inseparable (right). For the case on the left we use the soft margin method.

Let’s review the steps used to find the classification hyperplane with a hard margin:

  1. Assume the hyperplane is $w^Tx+b=0$.
  2. Require $y_i(w^Tx_i+b) \ge 1$ (meaning the hyperplane classifies every sample correctly).
  3. Scale $w$ and $b$ so that the support vectors $x_i$ satisfy $y_i(w^Tx_i+b)=1$.
  4. The sum of the distances from the support vectors of the two classes to the hyperplane, $\frac{2}{||w||}$, is the hard margin; the hyperplane that maximizes it (which also guarantees the hyperplane is unique) is the classification hyperplane we are after (written out as an optimization problem below).
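
Since maximizing the hard margin $\frac{2}{||w||}$ is equivalent to minimizing $\frac{||w||^2}{2}$, step 4 can be spelled out as the following optimization problem (shown here for comparison with the soft margin version derived below):

$$\begin{cases} min \frac {||w||^2}{2}\\ s.t.{\quad} 1-y_i(w^Tx_i+b) \le 0{\quad}i=1..m \end{cases}$$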

In step 2, however, we may find that no hyperplane satisfies the constraint. In that case, finding a hyperplane that correctly classifies most of the samples is a good choice. After all, the absence of a perfect hyperplane is not necessarily a problem with the algorithm; it may simply be noise in the data itself.

Therefore, we relax the constraint to $y_i(w^Tx_i+b) \ge 1-\delta_i$, where $\delta_i$ is called the slack variable. With the slack variables added, our margin looks like this:

The region between the two dashed lines is called the soft margin. In this case, the support vectors are the vectors on or inside the margin boundaries. Each sample $x_i$ has a corresponding slack variable $\delta_i \ge 0$, which represents how much misclassification of that sample the hyperplane is allowed to tolerate. For example, suppose a sample point $x_i$ has label $y_i=+1$ but the hyperplane gives $w^Tx_i+b=-5$, which is clearly a misclassification. If $\delta_i=7$, then $y_i(w^Tx_i+b)=-5 \ge 1-\delta_i=-6$, so we can still accept this hyperplane as the classification hyperplane, even though it misclassifies that particular sample point.

However, our tolerance has a limit: the size of $\delta_i$ must be bounded, otherwise classification accuracy becomes too low. Even though the classification cannot be 100% accurate, we still want the hyperplane to classify as accurately as possible, that is, to give the hyperplane as little tolerance as possible:


$$min \sum^{m}_{i =1}\delta_i$$

Combining this with the hard margin problem gives the soft margin primal problem:


$$\begin{cases} min \frac {||w||^2}{2}+C\sum^{m}_{i =1}\delta_i\\ s.t.{\quad} 1-\delta_i-y_i(w^Tx_i+b) \le 0{\quad}i=1..m\\ s.t.{\quad} -\delta_i \le 0 \end{cases}$$
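
To make the notation concrete, here is a minimal NumPy sketch (my own illustration; the sample data and the candidate $w$, $b$, $\delta$ are made up) that evaluates the soft margin objective and checks its two constraints:

```python
import numpy as np

def soft_margin_objective(w, b, delta, X, y, C):
    """Evaluate ||w||^2 / 2 + C * sum(delta) and check the primal constraints."""
    objective = 0.5 * np.dot(w, w) + C * np.sum(delta)
    margins = y * (X @ w + b)                        # y_i (w^T x_i + b) for every sample
    feasible = np.all(1 - delta - margins <= 0) and np.all(delta >= 0)
    return objective, feasible

# Made-up example: four 2-D samples, the last one misclassified and "paid for" by its slack.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-2.0, -1.0], [1.0, 1.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -0.5
delta = np.array([0.0, 0.0, 0.0, 3.0])               # slack lets the last point satisfy its constraint
print(soft_margin_objective(w, b, delta, X, y, C=1.0))   # (4.0, True)
```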

Note that $||w||^2$ and $\delta_i$ trade off against each other: decreasing one increases the other, so they cannot reach their minimums at the same time. The hyperparameter $C$ controls the relative weight of the two parts. A larger $C$ makes shrinking $\sum^{m}_{i =1}\delta_i$ more important than shrinking $||w||^2$, which means we have low tolerance for classification errors and narrow the soft margin (reducing generalization ability) to improve classification accuracy. Conversely, a smaller $C$ means a high tolerance for classification errors, sacrificing classification accuracy to widen the soft margin.
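
To see the effect of $C$ concretely, here is a small scikit-learn sketch (my own illustration on synthetic toy data, not part of the original derivation) that trains a linear soft margin SVM for several values of $C$; `SVC`'s `C` parameter corresponds to the $C$ above:

```python
import numpy as np
from sklearn.svm import SVC

# Two slightly overlapping Gaussian blobs, so no hyperplane separates them perfectly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.2, (50, 2)), rng.normal([-2, -2], 1.2, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Larger C -> less tolerance for slack: narrower soft margin, usually fewer support vectors.
# Smaller C -> more tolerance: wider soft margin, better generalization on noisy data.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)           # soft margin width 2 / ||w||
    print(f"C={C:g}: margin width={margin:.3f}, #support vectors={len(clf.support_)}")
```

Running this, you should typically see the margin width shrink and the number of support vectors drop as $C$ grows.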

7.2 The dual problem

You can see that this problem still satisfies the strong duality theorem, so the derivation closely mirrors the hard margin case, just with more constraints and more variables. Note that $C$ is a hyperparameter fixed before training, not a variable, while $\delta_i$, like $w$ and $b$, is a variable we minimize over. The corresponding Lagrangian function is:


$$\begin{aligned} L(w,b,\delta,\alpha,\beta) &= \frac {||w||^2}{2} +C\sum^{m}_{i =1}\delta_i+\sum^{m}_{i =1}{\alpha_i (1-\delta_i- y_i(w^Tx_i+b))} -\sum^{m}_{j =1}\beta_j\delta_j\\ \theta(\alpha,\beta) &= min_{w,b,\delta}L(w,b,\delta,\alpha,\beta) \end{aligned}$$

The dual problem is:


$$\begin{cases} max_{\alpha,\beta}\ \theta(\alpha,\beta)\\ s.t.{\quad}\alpha_i \ge 0{\quad}i=1..m\\ s.t.{\quad}\beta_i \ge 0{\quad}i=1..m \end{cases}$$

Set the partial derivatives of $L$ with respect to $w$, $b$, and $\delta_i$ to 0:


$$\begin{aligned} \frac{\partial L}{\partial w}&=w-\sum^{m}_{i =1}\alpha_iy_ix_i=0\\ \frac{\partial L}{\partial b}&=-\sum^{m}_{i =1}\alpha_iy_i=0\\ \frac{\partial L}{\partial \delta_i}&=C -\alpha_i-\beta_i =0 \end{aligned}$$
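
The third condition is what allows $\beta$ to be eliminated below: it gives $\beta_i = C-\alpha_i$, which together with the dual constraint $\beta_i \ge 0$ also implies a box constraint on $\alpha_i$:

$$\beta_i = C-\alpha_i \ge 0 \quad\Rightarrow\quad 0 \le \alpha_i \le C$$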

Substituting these back into $\theta(\alpha,\beta)$ and eliminating $\beta$, we get:


$$\theta(\alpha) = \sum^{m}_{i =1}\alpha_i - \frac{1}{2}\sum^{m}_{i =1}\sum^{m}_{j =1}\alpha_i\alpha_jy_iy_jx_i^Tx_j$$

Combining this with the KKT conditions, we obtain the final problem we want to solve:


$$\begin{cases} max_\alpha (\sum^{m}_{i =1}\alpha_i - \frac{1}{2}\sum^{m}_{i =1}\sum^{m}_{j =1}\alpha_i\alpha_jy_iy_jx_i^Tx_j)\\ s.t.{\quad}\sum^{m}_{i =1}\alpha_iy_i=0\\ s.t.{\quad}\alpha_i \ge 0,\beta_i \ge 0\\ s.t.{\quad}\alpha_i(y_i(w^Tx_i+b)-1+\delta_i)=0,{\quad}\beta_i\delta_i=0\\ s.t.{\quad}y_i(w^Tx_i+b)-1+\delta_i \ge 0\\ s.t.{\quad}\delta_i \ge 0 \end{cases}$$

From $\alpha_i(y_i(w^Tx_i+b)-1+\delta_i)=0$ we can see that the choice of hyperplane depends only on the support vectors. The dual problem is still a quadratic programming problem, so it is not hard to solve.
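
To tie the formulas to code, here is a minimal NumPy sketch (my own illustration; in practice `alpha` would come from a QP solver rather than being hard-coded) that evaluates the dual objective $\theta(\alpha)$ and recovers $w=\sum^{m}_{i =1}\alpha_iy_ix_i$ from the multipliers:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """theta(alpha) = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    K = X @ X.T                       # Gram matrix of inner products x_i^T x_j
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def recover_w(alpha, X, y):
    """w = sum_i alpha_i y_i x_i; only support vectors (alpha_i > 0) contribute."""
    return (alpha * y) @ X

# Tiny made-up example; alpha satisfies the constraint sum_i alpha_i y_i = 0.
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.0, 0.5])
print(dual_objective(alpha, X, y), recover_w(alpha, X, y))
```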
