Background knowledge needed to read this article: hard-margin support vector machines, slack variables, and a little programming knowledge.

1. Introduction

In the previous article we introduced the most basic support vector machine model, the hard-margin support vector machine, which can classify linearly separable data sets. In practice, however, data sets are often not linearly separable, so in this article we introduce a second kind of support vector machine, the soft-margin support vector machine 1, to handle linearly inseparable data sets.

2. Model Introduction

Primal model

Let us first recall the primal model of the hard-margin support vector machine:


\begin{array}{l}
\underset{w, b}{\min} \quad \frac{1}{2} w^{T} w \\
\text { s.t. } \quad y_{i}\left(w^{T} x_{i}+b\right) \geq 1, \quad i=1,2, \cdots, N
\end{array}

Equation 2-1

The constraint y_i(w^T x_i + b) ≥ 1 forces every sample point to be classified correctly, which is why this is called a hard margin. When the data set cannot be completely separated by a hyperplane, some sample points must be allowed to violate this constraint. We therefore modify the cost function by adding a penalty term for misclassified sample points, where 1(x) is the indicator function 2 mentioned in the previous article: it returns 1 if the inequality inside it holds and 0 otherwise. The constant C > 0 controls the strength of the penalty; as C tends to infinity, every sample point is forced to be classified correctly, and the model reduces to the hard-margin support vector machine.


\begin{array}{l}
\underset{w, b}{\min} \quad \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} 1_{x_{i}}\left(y_{i}\left(w^{T} x_{i}+b\right)<1\right) \\
\text { s.t. } \quad C>0
\end{array}

Equation 2-2

The indicator function in Equation 2-2 is neither convex nor continuous, which makes it difficult to optimize, so it is replaced by the max function, giving the following form:


\begin{array}{l}
\underset{w, b}{\min} \quad \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \max \left(0,1-y_{i}\left(w^{T} x_{i}+b\right)\right) \\
\text { s.t. } \quad C>0
\end{array}

Equation 2-3
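
To make the difference between Equations 2-2 and 2-3 concrete, here is a small NumPy sketch (not part of the original article) that evaluates both objectives for hypothetical arrays X, y and given w, b, C:

import numpy as np

def soft_margin_objectives(X, y, w, b, C):
    # Functional margins y_i * (w^T x_i + b) for every sample
    margins = y * (X.dot(w) + b)
    # Equation 2-2: the 0-1 penalty counts every sample with margin < 1
    zero_one = 0.5 * w.dot(w) + C * np.sum(margins < 1)
    # Equation 2-3: the hinge penalty max(0, 1 - margin) grows with the size of the violation
    hinge = 0.5 * w.dot(w) + C * np.sum(np.maximum(0, 1 - margins))
    return zero_one, hinge

The hinge form penalizes a violation in proportion to its size, while the indicator form only counts it; the hinge form is also convex and continuous.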

Next, slack variables 3 ξ are introduced and the conditions below are added to eliminate the max function in Equation 2-3; Equation 2-4 is then the primal model of the soft-margin support vector machine:


\begin{array}{l}
\underset{w, b, \xi}{\min} \quad \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i} \\
\text { s.t. } \quad C>0, \quad \xi_{i} \geq 0, \quad \xi_{i} \geq 1-y_{i}\left(w^{T} x_{i}+b\right), \quad i=1,2, \cdots, N
\end{array}

Equation 2-4
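
To see why Equations 2-3 and 2-4 describe the same model: for fixed w and b, the smallest slack satisfying both constraints of Equation 2-4 is ξ_i = max(0, 1 - y_i(w^T x_i + b)), which is exactly the hinge term of Equation 2-3. A one-line sketch of this minimal slack (hypothetical arrays X, y, w and scalar b, not from the original article):

import numpy as np

def minimal_slack(X, y, w, b):
    # Smallest xi_i satisfying xi_i >= 0 and xi_i >= 1 - y_i * (w^T x_i + b)
    return np.maximum(0, 1 - y * (X.dot(w) + b))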

Dual model

As with the hard-margin support vector machine, the primal model is transformed using the Lagrange multiplier method: (1) introduce two kinds of Lagrange multipliers, λ and μ, to obtain the Lagrangian function; (2) take the partial derivative of the Lagrangian with respect to w and set it to zero; (3) obtain the same analytic solution for w as in the hard-margin support vector machine; (4) take the partial derivative of the Lagrangian with respect to b and set it to zero; (5) take the partial derivative of the Lagrangian with respect to ξ and set it to zero; (6) obtain C = λ + μ.


\begin{aligned}
L(w, b, \xi, \lambda, \mu) &=\frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i}+\sum_{i=1}^{N} \lambda_{i}\left(1-\xi_{i}-y_{i}\left(w^{T} x_{i}+b\right)\right)-\sum_{i=1}^{N} \mu_{i} \xi_{i} & (1)\\
\frac{\partial L}{\partial w} &=w-\sum_{i=1}^{N} \lambda_{i} y_{i} x_{i}=0 & (2)\\
w &=\sum_{i=1}^{N} \lambda_{i} y_{i} x_{i} & (3)\\
\frac{\partial L}{\partial b} &=\sum_{i=1}^{N} \lambda_{i} y_{i}=0 & (4)\\
\frac{\partial L}{\partial \xi_{i}} &=C-\lambda_{i}-\mu_{i}=0 & (5)\\
C &=\lambda_{i}+\mu_{i} & (6)
\end{aligned}

Equation 2-5
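
The stationarity condition (3) of Equation 2-5 is also how the weight vector is maintained in the implementation of Section 4. As a sketch (with hypothetical NumPy arrays lambdas, y, X that are not part of the original article), the recovered weight vector is:

import numpy as np

def weight_from_multipliers(lambdas, y, X):
    # w = sum_i lambda_i * y_i * x_i, from Equation 2-5 (3)
    # lambdas and y have shape (N,), X has shape (N, p)
    return np.sum((lambdas * y)[:, None] * X, axis=0)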

Substitute the results of Equation 2-5 back into the Lagrangian function: (1) the Lagrangian function; (2) substitute w from Equation 2-5 (3) and expand the parentheses; (3) the 2nd and 5th terms of (2) cancel, the 3rd and 8th terms cancel, the 7th term is zero by Equation 2-5 (4), and merging the 1st and 6th terms gives (3).


\begin{aligned}
L(w, b, \xi, \lambda, \mu) &=\frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i}+\sum_{i=1}^{N} \lambda_{i}\left(1-\xi_{i}-y_{i}\left(w^{T} x_{i}+b\right)\right)-\sum_{i=1}^{N} \mu_{i} \xi_{i} & (1)\\
&=\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} x_{i}^{T} x_{j}+\sum_{i=1}^{N} \lambda_{i} \xi_{i}+\sum_{i=1}^{N} \mu_{i} \xi_{i}+\sum_{i=1}^{N} \lambda_{i}-\sum_{i=1}^{N} \lambda_{i} \xi_{i}-\sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} x_{i}^{T} x_{j}-\sum_{i=1}^{N} \lambda_{i} y_{i} b-\sum_{i=1}^{N} \mu_{i} \xi_{i} & (2)\\
&=\sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} x_{i}^{T} x_{j} & (3)
\end{aligned}

Equation 2-6

Now consider the constraints on the Lagrange multipliers: (1) the constraint on the Lagrange multiplier μ; (2) add the Lagrange multiplier λ to both sides of (1); (3) substitute the result of Equation 2-5 (6), which shows that C is an upper bound on λ.


\begin{aligned}
\mu_{i} & \geq 0 & (1) \\
\lambda_{i}+\mu_{i} & \geq \lambda_{i} & (2) \\
C & \geq \lambda_{i} & (3)
\end{aligned}

Equation 2-7

As with the hard-margin support vector machine, the KKT conditions are introduced as follows:


\left\{\begin{aligned}
\nabla_{w, b, \xi} L(w, b, \xi, \lambda, \mu) &=0 \\
\lambda_{i} & \geq 0 \\
y_{i}\left(w^{T} x_{i}+b\right)-1+\xi_{i} & \geq 0 \\
\lambda_{i}\left(y_{i}\left(w^{T} x_{i}+b\right)-1+\xi_{i}\right) &=0 \\
\mu_{i} & \geq 0 \\
\xi_{i} & \geq 0 \\
\mu_{i} \xi_{i} &=0
\end{aligned}\right.

Equation 2-8

Applying the Lagrange multiplier method yields the Lagrangian dual of the primal model, subject to the KKT conditions above:


\begin{array}{l}
\underset{\lambda}{\max} \quad \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} x_{i}^{T} x_{j} \\
\text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0, \quad 0 \leq \lambda_{i} \leq C, \quad i=1,2, \cdots, N
\end{array}

Equation 2-9
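
For reference, the dual objective of Equation 2-9 can be evaluated directly; the sketch below mirrors the calcCost method in Section 4 and assumes hypothetical NumPy arrays lambdas, y of shape (N,) and X of shape (N, p):

import numpy as np

def dual_objective(lambdas, y, X):
    # Gram matrix of inner products x_i^T x_j
    K = X.dot(X.T)
    v = lambdas * y
    # sum_i lambda_i - 1/2 * sum_ij lambda_i lambda_j y_i y_j x_i^T x_j
    return np.sum(lambdas) - 0.5 * v.dot(K).dot(v)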

3. Algorithm Steps

Comparing the dual model in Equation 2-9 above with the dual model of the hard-margin support vector machine, the only difference lies in the constraint on the Lagrange multipliers: the soft-margin version adds an upper bound, the penalty factor C. The solution algorithm is therefore roughly the same as for the hard-margin support vector machine, with only the following two differences.

The first change is the computation of the upper and lower bounds of the new λ_b: (1), (2) every λ must be greater than or equal to zero and less than or equal to C; (3) rearrange (2) into bounds on the new λ_b, split by the sign of y_a y_b; (4) combine with (1) to obtain the final bounds on the new λ_b.


\begin{aligned}
&0 \leq \lambda_{b}^{\text {new }} \leq C & (1)\\
&0 \leq \lambda_{a}^{\text {old }}+y_{a} y_{b}\left(\lambda_{b}^{\text {old }}-\lambda_{b}^{\text {new }}\right) \leq C & (2)\\
\Rightarrow&\left\{\begin{array}{l}
\lambda_{a}^{\text {old }}+\lambda_{b}^{\text {old }}-C \leq \lambda_{b}^{\text {new }} \leq \lambda_{a}^{\text {old }}+\lambda_{b}^{\text {old }}, \quad y_{a} y_{b}=+1 \\
\lambda_{b}^{\text {old }}-\lambda_{a}^{\text {old }} \leq \lambda_{b}^{\text {new }} \leq C-\lambda_{a}^{\text {old }}+\lambda_{b}^{\text {old }}, \quad y_{a} y_{b}=-1
\end{array}\right. & (3)\\
\Rightarrow&\left\{\begin{array}{l}
\max \left(0, \lambda_{a}^{\text {old }}+\lambda_{b}^{\text {old }}-C\right) \leq \lambda_{b}^{\text {new }} \leq \min \left(C, \lambda_{a}^{\text {old }}+\lambda_{b}^{\text {old }}\right), \quad y_{a} y_{b}=+1 \\
\max \left(0, \lambda_{b}^{\text {old }}-\lambda_{a}^{\text {old }}\right) \leq \lambda_{b}^{\text {new }} \leq \min \left(C, C-\lambda_{a}^{\text {old }}+\lambda_{b}^{\text {old }}\right), \quad y_{a} y_{b}=-1
\end{array}\right. & (4)
\end{aligned}

Equation 3-1
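
Equation 3-1 (4) translates almost line by line into code; the sketch below (with hypothetical scalar inputs) corresponds to the calcU and calcV methods in Section 4:

def clip_bounds(alpha_a_old, alpha_b_old, y_a, y_b, C):
    # Bounds on the new lambda_b from Equation 3-1 (4)
    if y_a * y_b == 1:
        lower = max(0, alpha_a_old + alpha_b_old - C)
        upper = min(C, alpha_a_old + alpha_b_old)
    else:
        lower = max(0, alpha_b_old - alpha_a_old)
        upper = min(C, C - alpha_a_old + alpha_b_old)
    return lower, upper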

The second change is the test of the KKT conditions. (1) First consider the case λ = 0; (2) by the result of (6) in Equation 2-5, μ = C; (3) since μ = C > 0, the KKT conditions of Equation 2-8 give ξ = 0; (4) substituting ξ = 0 into the KKT conditions gives y_i(w^T x_i + b) - 1 ≥ 0.


\begin{aligned}
\lambda &=0 & (1)\\
\mu &=C & (2)\\
\xi &=0 & (3)\\
y_{i}\left(w^{T} x_{i}+b\right)-1 & \geq 0 & (4)
\end{aligned}

Equation 3-2

(1) Next consider the case λ = C; (2) by the result of (6) in Equation 2-5, μ = 0; (3) since λ ≠ 0, the KKT conditions of Equation 2-8 give y_i(w^T x_i + b) - 1 + ξ = 0; (4) since ξ ≥ 0, combining with (3) gives y_i(w^T x_i + b) - 1 ≤ 0.


\begin{aligned}
\lambda &=C & (1)\\
\mu &=0 & (2)\\
y_{i}\left(w^{T} x_{i}+b\right)-1+\xi &=0 & (3)\\
y_{i}\left(w^{T} x_{i}+b\right)-1 & \leq 0 & (4)
\end{aligned}

Equation 3-3

(1) Finally consider the case 0 < λ < C; (2) by the result of (6) in Equation 2-5, μ also lies strictly between 0 and C; (3) since μ ≠ 0, the KKT conditions of Equation 2-8 give ξ = 0; (4) since λ ≠ 0, the KKT conditions combined with (3) give y_i(w^T x_i + b) - 1 = 0.


\begin{aligned}
0<\lambda &<C & (1)\\
0<\mu &<C & (2)\\
\xi &=0 & (3)\\
y_{i}\left(w^{T} x_{i}+b\right)-1 &=0 & (4)
\end{aligned}

Equation 3-4

Combining Equations 3-2, 3-3 and 3-4 above gives the following new constraints:


\left\{\begin{array}{ll}
y_{i}\left(w^{T} x_{i}+b\right)-1 \geq 0, & \lambda=0 \\
y_{i}\left(w^{T} x_{i}+b\right)-1=0, & 0<\lambda<C \\
y_{i}\left(w^{T} x_{i}+b\right)-1 \leq 0, & \lambda=C
\end{array}\right.

Equation 3-5
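
Equation 3-5 is exactly the KKT test used by the checkKKT method in Section 4. As a standalone sketch with hypothetical inputs (alpha_i, the margin y_i(w^T x_i + b), the penalty factor C, and a small tolerance eps for floating-point comparison):

def satisfies_kkt(alpha_i, margin, C, eps=1e-8):
    # Equation 3-5: the required sign of y_i (w^T x_i + b) - 1 depends on where alpha_i lies in [0, C]
    r = margin - 1
    if alpha_i <= eps:
        return r >= -eps
    if alpha_i >= C - eps:
        return r <= eps
    return abs(r) <= eps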

For the detailed algorithm steps, please refer to Section 3 of the hard-margin support vector machine article and the code implementation below.

4. Code Implementation

Implementation in Python

import numpy as np

class SMO:
    """Sequential Minimal Optimization (SMO) algorithm for the soft-margin support vector machine."""

    def __init__(self, X, y):
        # Training sample feature matrix (N * p)
        self.X = X
        # Training sample label vector (N * 1)
        self.y = y
        # Lagrange multiplier vector (N * 1)
        self.alpha = np.zeros(X.shape[0])
        # Error vector, initialized to the negative label vector (N * 1)
        self.errors = -y
        # Bias
        self.b = 0
        # Weight vector (p * 1)
        self.w = np.zeros(X.shape[1])
        # Cost value
        self.cost = -np.inf

    def fit(self, C = 1, tol = 1e-4):
        """Algorithm from John C. Platt's paper https://www.microsoft.com/en-us/research/uploads/prod/1998/04/sequential-minimal-optimization.pdf"""
        X = self.X
        # Number of updates in the current pass
        numChanged = 0
        # Whether to examine all samples
        examineAll = True
        while numChanged > 0 or examineAll:
            numChanged = 0
            if examineAll:
                for idx in range(X.shape[0]):
                    numChanged += self.update(idx, C)
            else:
                for idx in range(X.shape[0]):
                    if self.alpha[idx] <= 0:
                        continue
                    numChanged += self.update(idx, C)
            if examineAll:
                examineAll = False
            elif numChanged == 0:
                examineAll = True
            # Compute the cost value
            cost = self.calcCost()
            if self.cost > 0:
                # Stop when the relative change of the cost value is within the tolerance
                rate = (cost - self.cost) / self.cost
                if rate <= tol:
                    break
            self.cost = cost

    def update(self, idx, C = 1):
        """Update the Lagrange multiplier with index idx."""
        X = self.X
        y = self.y
        alpha = self.alpha
        # If the current Lagrange multiplier already satisfies the KKT conditions, return 0 directly
        if self.checkKKT(idx, C):
            return 0
        if len(alpha[alpha != 0]) > 1:
            # Choose the second Lagrange multiplier by the maximum |E1 - E2| heuristic
            jdx = self.selectJdx(idx)
            # Update the pair of multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # If that fails, iterate over the non-zero Lagrange multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] == 0:
                continue
            # Update the pair of multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # If there is still no successful update, iterate over the zero Lagrange multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] != 0:
                continue
            # Update the pair of multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # Return 0 if no update succeeded
        return 0

    def selectJdx(self, idx):
        """Select the index of the second Lagrange multiplier to optimize."""
        errors = self.errors
        if errors[idx] > 0:
            # When the error term is positive, choose the index of the smallest error
            return np.argmin(errors)
        elif errors[idx] < 0:
            # When the error term is negative, choose the index of the largest error
            return np.argmax(errors)
        else:
            # When the error term is zero, choose whichever of the minimum and maximum errors has the larger absolute value
            minJdx = np.argmin(errors)
            maxJdx = np.argmax(errors)
            if np.abs(errors[minJdx]) > np.abs(errors[maxJdx]):
                return minJdx
            else:
                return maxJdx

    def calcB(self):
        """Compute the bias as the average over all samples with non-zero Lagrange multipliers."""
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_gt = alpha[alpha > 0]
        # Number of non-zero Lagrange multipliers
        alpha_gt_len = len(alpha_gt)
        # Return 0 if all multipliers are zero
        if alpha_gt_len == 0:
            return 0
        # b = y - w^T x; see the derivation in the article
        X_gt = X[alpha > 0]
        y_gt = y[alpha > 0]
        alpha_gt_y = np.array(np.multiply(alpha_gt, y_gt)).reshape(-1, 1)
        s = np.sum(np.multiply(alpha_gt_y, X_gt), axis=0)
        return np.sum(y_gt - X_gt.dot(s)) / alpha_gt_len

    def calcCost(self):
        """Compute the value of the dual objective (Equation 2-9), as in the paper."""
        X = self.X
        y = self.y
        alpha = self.alpha
        cost = 0
        for idx in range(X.shape[0]):
            for jdx in range(X.shape[0]):
                cost = cost + (y[idx] * y[jdx] * X[idx].dot(X[jdx]) * alpha[idx] * alpha[jdx])
        return np.sum(alpha) - cost / 2

    def checkKKT(self, idx, C = 1):
        """Check the KKT conditions of Equation 3-5 for the multiplier with index idx:
        1. alpha = 0      =>  y * f(x) - 1 >= 0
        2. 0 < alpha < C  =>  y * f(x) - 1 == 0
        3. alpha = C      =>  y * f(x) - 1 <= 0
        """
        y = self.y
        errors = self.errors
        alpha = self.alpha
        # r = E * y = (f(x) - y) * y = y * f(x) - 1
        r = errors[idx] * y[idx]
        if (alpha[idx] > 0 and alpha[idx] < C and r == 0) or (alpha[idx] == 0 and r >= 0) or (alpha[idx] == C and r <= 0):
            return True
        return False

    def calcE(self):
        """Compute the error vector E = f(x) - y."""
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_y = np.array(np.multiply(alpha, y)).reshape(-1, 1)
        errors = X.dot(X.T).dot(alpha_y).T[0] + self.b - y
        return errors

    def calcU(self, idx, jdx, C = 1):
        """Compute the lower bound of the new Lagrange multiplier (Equation 3-1 (4)) so that both optimized multipliers stay within [0, C]."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return max(0, alpha[jdx] + alpha[idx] - C)
        else:
            return max(0, alpha[jdx] - alpha[idx])

    def calcV(self, idx, jdx, C = 1):
        """Compute the upper bound of the new Lagrange multiplier (Equation 3-1 (4)) so that both optimized multipliers stay within [0, C]."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return min(C, alpha[jdx] + alpha[idx])
        else:
            return min(C, C + alpha[jdx] - alpha[idx])

    def updateAlpha(self, idx, jdx, C = 1):
        """Jointly update the Lagrange multipliers with indices idx and jdx, as described in the article."""
        if idx == jdx:
            return False
        X = self.X
        y = self.y
        alpha = self.alpha
        errors = self.errors
        # Error term of idx
        Ei = errors[idx]
        # Error term of jdx
        Ej = errors[jdx]
        Kii = X[idx].dot(X[idx])
        Kjj = X[jdx].dot(X[jdx])
        Kij = X[idx].dot(X[jdx])
        # Compute K
        K = Kii + Kjj - 2 * Kij
        if K <= 0:
            # Skip the degenerate case to avoid division by zero
            return False
        oldAlphaIdx = alpha[idx]
        oldAlphaJdx = alpha[jdx]
        # Compute the new Lagrange multiplier for jdx
        newAlphaJdx = oldAlphaJdx + y[jdx] * (Ei - Ej) / K
        U = self.calcU(idx, jdx, C)
        V = self.calcV(idx, jdx, C)
        if newAlphaJdx < U:
            # Clip to the lower bound when the new value falls below it
            newAlphaJdx = U
        if newAlphaJdx > V:
            # Clip to the upper bound when the new value exceeds it
            newAlphaJdx = V
        if oldAlphaJdx == newAlphaJdx:
            # No update when the new value equals the old value
            return False
        # Compute the new Lagrange multiplier for idx
        newAlphaIdx = oldAlphaIdx + y[idx] * y[jdx] * (oldAlphaJdx - newAlphaJdx)
        # Update the weight vector
        self.w = self.w + y[idx] * (newAlphaIdx - oldAlphaIdx) * X[idx] + y[jdx] * (newAlphaJdx - oldAlphaJdx) * X[jdx]
        # Update the Lagrange multiplier vector
        alpha[idx] = newAlphaIdx
        alpha[jdx] = newAlphaJdx
        oldB = self.b
        # Recompute the bias
        self.b = self.calcB()
        # Recompute the error vector
        eps = np.finfo(np.float32).eps
        newErrors = []
        for i in range(X.shape[0]):
            newError = errors[i] + y[idx] * (newAlphaIdx - oldAlphaIdx) * X[idx].dot(X[i]) \
                + y[jdx] * (newAlphaJdx - oldAlphaJdx) * X[jdx].dot(X[i]) - oldB + self.b
            if np.abs(newError) < eps:
                newError = 0
            newErrors.append(newError)
        self.errors = np.array(newErrors)
        return True
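
A minimal usage sketch of the SMO class above, on a small synthetic data set (the data-generation code below is illustrative and not part of the original article):

if __name__ == "__main__":
    np.random.seed(0)
    # Two overlapping Gaussian blobs labeled +1 and -1
    X = np.vstack([np.random.randn(20, 2) + [2, 2], np.random.randn(20, 2) - [2, 2]])
    y = np.hstack([np.ones(20), -np.ones(20)])
    smo = SMO(X, y)
    smo.fit(C=1)
    print("w", smo.w, "b", smo.b)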

5. Third-Party Library Implementation

Implementation with scikit-learn 4

from sklearn.svm import SVC

svc = SVC(kernel = "linear", C = 1)
# Fit data
svc.fit(X, y)
# Weight coefficient
w = svc.coef_
# intercept
b = svc.intercept_
print("w", w, "b", b)
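
The snippet above assumes X and y are already defined; a self-contained version on the same kind of toy data (illustrative, not from the original article) might look like:

import numpy as np
from sklearn.svm import SVC

np.random.seed(0)
X = np.vstack([np.random.randn(20, 2) + [2, 2], np.random.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

svc = SVC(kernel="linear", C=1)
svc.fit(X, y)
print("w", svc.coef_, "b", svc.intercept_)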

6. Animation Demonstration

The figure below shows a linearly inseparable data set, where red represents sample points with label -1 and blue represents sample points with label +1:

Figure 6-1

The figure below shows the result of fitting the data with the soft-margin support vector machine, where the light red region is the part predicted as -1 and the light blue region is the part predicted as +1:

Figure 6-2

7. Mind Map

Figure 7-1

8. References

  1. en.wikipedia.org/wiki/Suppor…
  2. en.wikipedia.org/wiki/Indica…
  3. en.wikipedia.org/wiki/Slack_…
  4. scikit-learn.org/stable/modu…

The full demo can be found here

Note: This article strives to be accurate and easy to understand, but since the author is also a beginner, his level is limited. If you find any errors or omissions in the article, please point them out in the comments.