1.5.1 Minimizing the misclassification rate

Suppose our goal is simply to make as few misclassifications as possible. We need a rule that assigns each value of $\mathbf x$ to one of the available classes. Such a rule divides the input space into regions $\mathcal R_k$ called decision regions, one for each class, such that all points in $\mathcal R_k$ are assigned to class $\mathcal C_k$. The boundaries between decision regions are called decision boundaries or decision surfaces. Note that each decision region need not be contiguous but may comprise several disjoint regions. We shall encounter examples of decision boundaries and decision regions in later sections. To find the optimal decision rule, consider first the case of two classes, as in the cancer problem. A mistake occurs when an input vector belonging to class $\mathcal C_1$ is assigned to class $\mathcal C_2$, or vice versa. The probability of this occurring is given by


$$
\begin{aligned}
p(\text{mistake}) &= p(\mathbf x\in\mathcal R_1,\mathcal C_2) + p(\mathbf x\in\mathcal R_2,\mathcal C_1)\\
&= \int_{\mathcal R_1} p(\mathbf x,\mathcal C_2)\,\mathrm d\mathbf x + \int_{\mathcal R_2} p(\mathbf x,\mathcal C_1)\,\mathrm d\mathbf x. \tag{1.78}
\end{aligned}
$$
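As a concrete illustration (not an example from the text), the sketch below evaluates (1.78) for a hypothetical one-dimensional problem with two Gaussian class-conditional densities and equal priors; the densities, priors, and the boundary parameter `b` are all illustrative assumptions. With a single boundary $b$, the regions become $\mathcal R_1 = (-\infty, b)$ and $\mathcal R_2 = [b, \infty)$, so the two integrals reduce to Gaussian tail probabilities.

```python
# Toy numerical check of Eq. (1.78); the two Gaussians and the equal
# priors are illustrative assumptions, not taken from the text.
from scipy.stats import norm

prior = {1: 0.5, 2: 0.5}                        # p(C_1), p(C_2)
pdf = {1: norm(0.0, 1.0), 2: norm(2.0, 1.0)}    # p(x | C_k)

def p_mistake(b):
    """Eq. (1.78) with R_1 = (-inf, b) and R_2 = [b, inf)."""
    # mass of class 2 falling in R_1, plus mass of class 1 falling in R_2
    return prior[2] * pdf[2].cdf(b) + prior[1] * (1.0 - pdf[1].cdf(b))

print(p_mistake(1.0))   # boundary at the midpoint: minimal error, ~0.159
print(p_mistake(0.0))   # a misplaced boundary gives a larger p(mistake)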

We are free to choose the decision rule that assigns each point $\mathbf x$ to one of the two classes. Clearly, to minimize $p(\text{mistake})$ we should arrange that each $\mathbf x$ is assigned to whichever class gives the smaller value of the integrand in (1.78). Thus, if $p(\mathbf x,\mathcal C_1) > p(\mathbf x,\mathcal C_2)$ for a given value of $\mathbf x$, then we should assign that $\mathbf x$ to class $\mathcal C_1$. From the product rule of probability we have $p(\mathbf x,\mathcal C_k) = p(\mathcal C_k\mid\mathbf x)\,p(\mathbf x)$. Because the factor $p(\mathbf x)$ is common to both terms, we can restate this result as follows: the probability of a mistake is minimized if each value of $\mathbf x$ is assigned to the class with the largest posterior probability $p(\mathcal C_k\mid\mathbf x)$. This result is illustrated for two classes and a single input variable $x$ in Figure 1.24. For the more general case of $K$ classes, it is slightly easier to maximize the probability of being correct, which is given by
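The sketch below (reusing the hypothetical two-Gaussian setup from above, again an assumed example) makes the cancellation of $p(\mathbf x)$ explicit: taking the argmax over the joint densities $p(x,\mathcal C_k)$ and over the posteriors $p(\mathcal C_k\mid x)$ selects the same class, because dividing by the common positive factor $p(x)$ cannot change the ordering.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: two Gaussian class-conditionals, equal priors.
prior = np.array([0.5, 0.5])
pdfs = [norm(0.0, 1.0), norm(2.0, 1.0)]

def decide(x):
    joint = np.array([p * d.pdf(x) for p, d in zip(prior, pdfs)])  # p(x, C_k)
    posterior = joint / joint.sum(axis=0)    # p(C_k | x) = p(x, C_k) / p(x)
    # argmax over the joint and over the posterior pick the same class
    assert np.array_equal(joint.argmax(axis=0), posterior.argmax(axis=0))
    return joint.argmax(axis=0) + 1          # class labels 1 and 2

xs = np.linspace(-3.0, 5.0, 9)
print(decide(xs))   # labels switch from 1 to 2 at the midpoint x = 1
```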


$$
\begin{aligned}
p(\text{correct}) &= \sum_{k=1}^{K} p(\mathbf x\in\mathcal R_k,\mathcal C_k)\\
&= \sum_{k=1}^{K} \int_{\mathcal R_k} p(\mathbf x,\mathcal C_k)\,\mathrm d\mathbf x, \tag{1.79}
\end{aligned}
$$

which is maximized when the regions $\mathcal R_k$ are chosen such that each $\mathbf x$ is assigned to the class for which $p(\mathbf x,\mathcal C_k)$ is largest. Once again using the product rule $p(\mathbf x,\mathcal C_k) = p(\mathcal C_k\mid\mathbf x)\,p(\mathbf x)$, and noting that the factor $p(\mathbf x)$ is common to all terms, we see that each $\mathbf x$ should be assigned to the class having the largest posterior probability $p(\mathcal C_k\mid\mathbf x)$.
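A minimal sketch of the $K$-class rule under assumed Gaussian class-conditionals (the means `mu` and priors `pi` below are made-up illustrative values): classification is an argmax over posteriors, and sampling from the joint distribution gives a Monte Carlo estimate of $p(\text{correct})$ in (1.79).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class example: Gaussian class-conditionals with means mu,
# shared unit variance, and priors pi (all illustrative assumptions).
mu = np.array([-2.0, 0.0, 2.0])
pi = np.array([0.3, 0.4, 0.3])

def posteriors(x):
    # p(x, C_k) = p(C_k) * N(x | mu_k, 1); normalizing by p(x) gives p(C_k | x)
    joint = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return joint / joint.sum(axis=1, keepdims=True)

# draw (class, x) pairs from the joint, classify by the largest posterior
k = rng.choice(3, size=100_000, p=pi)
x = rng.normal(mu[k], 1.0)
correct = posteriors(x).argmax(axis=1) == k
print(correct.mean())   # Monte Carlo estimate of p(correct) in Eq. (1.79)
```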

Figure 1.24

A schematic illustration of the joint probabilities $p(x,\mathcal C_k)$ for each of two classes plotted against $x$, together with the decision boundary $x = \widehat x$. Values of $x \ge \widehat x$ are classified as class $\mathcal C_2$ and hence belong to decision region $\mathcal R_2$, whereas values $x < \widehat x$ are classified as $\mathcal C_1$ and belong to $\mathcal R_1$.