Empirical error and overfitting

Error rate and accuracy

If $a$ of the $m$ samples are misclassified, the error rate is $E = a/m$ and the accuracy is $1-E$.

Training error (empirical error)

The error of the learner on the training set is called the training error (or empirical error).

Generalization error

The error of the learner on new samples (e.g., a test set) is called the generalization error.

Overfitting

When the learner fits the training samples "too well", it may treat peculiarities of the training set as general properties of all potential samples, which degrades generalization performance; this is called overfitting.

The opposite of overfitting is called underfitting.

Evaluation methods

For a data set containing $m$ samples, $D=\{(\pmb{x}_1,y_1),\dots,(\pmb{x}_m,y_m)\}$, how do we split it into a training set and a test set? Here are a few methods.

Hold-out

Directly split the data set into two mutually exclusive sets: a training set $S$ and a test set $T$, with $D=S \cup T$ and $S \cap T=\emptyset$.

Cross-validation

The data set is divided into $k$ mutually exclusive subsets of similar size (hence the name $k$-fold cross-validation). Each round, $k-1$ subsets are used for training and the remaining subset for testing, yielding $k$ training/test results; the final result is their mean. $k$ is commonly set to 10.
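As an illustration, here is a minimal NumPy sketch of $k$-fold cross-validation; the callables `train_fn` and `eval_fn` are placeholder assumptions standing in for any model's training and evaluation steps, not something defined in the text above.

```python
import numpy as np

def k_fold_cv(X, y, train_fn, eval_fn, k=10, seed=0):
    """Estimate performance by k-fold cross-validation.

    train_fn(X_train, y_train) -> model
    eval_fn(model, X_test, y_test) -> scalar score (e.g., error rate)
    X and y are assumed to be NumPy arrays.
    """
    m = len(X)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(m)
    folds = np.array_split(indices, k)          # k mutually exclusive subsets
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        scores.append(eval_fn(model, X[test_idx], y[test_idx]))
    return float(np.mean(scores))               # mean of the k test results
```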

Bootstrapping

Bootstrap sampling builds a data set $D'$ from a data set $D$ containing $m$ samples: each round, one sample is drawn at random from $D$, a copy is added to $D'$, and the sample is put back into $D$, so it may be drawn again in later rounds. After $m$ rounds we obtain $D'$; some samples of $D$ appear several times in $D'$, while the probability that a given sample is never drawn in $m$ rounds is $(1-\frac{1}{m})^m$. Taking the limit (by standard calculus):


$$\lim_{m \rightarrow \infty}\left(1-\frac{1}{m}\right)^m = \frac{1}{e} \approx 0.368$$

About 36.8% of the samples in $D$ do not appear in $D'$.
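A minimal sketch of bootstrap sampling with NumPy, also reporting the fraction of samples never drawn (the function name `bootstrap_split` and the index-based return convention are illustrative assumptions):

```python
import numpy as np

def bootstrap_split(m, seed=0):
    """Draw m samples with replacement; return in-bag and out-of-bag indices."""
    rng = np.random.default_rng(seed)
    in_bag = rng.integers(0, m, size=m)              # indices forming D'
    out_of_bag = np.setdiff1d(np.arange(m), in_bag)  # samples never drawn
    return in_bag, out_of_bag

in_bag, oob = bootstrap_split(m=10_000)
print(len(oob) / 10_000)   # roughly 0.368 for large m
```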

Performance measures

A performance measure reflects the task requirements and is used to quantify the model's generalization ability. Given a sample set $D=\{(\pmb{x}_1,y_1),\dots,(\pmb{x}_m,y_m)\}$, where $y_i$ is the true label of $\pmb{x}_i$, model performance is evaluated by comparing the predictions $f(\pmb{x})$ with the labels $y$.

Mean squared error

The most commonly used performance measure for regression tasks is the mean squared error:


$$E(f;D)=\frac{1}{m}\sum^m_{i=1}\big(f(\pmb{x}_i)-y_i\big)^2$$

More generally, for a data distribution $\mathcal{D}$ with probability density function $p(\cdot)$:


$$E(f;\mathcal{D})=\int_{\pmb{x} \sim \mathcal{D}}\big(f(\pmb{x})-y\big)^2\, p(\pmb{x})\,d\pmb{x}$$
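The empirical version of this measure is a one-line computation; a minimal sketch assuming NumPy arrays of predictions and true values:

```python
import numpy as np

def mean_squared_error(y_pred, y_true):
    """Empirical mean squared error E(f; D)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean((y_pred - y_true) ** 2))

print(mean_squared_error([1.2, 0.9, 2.1], [1.0, 1.0, 2.0]))
```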

Error rate and accuracy

Error rate and accuracy are the two most commonly used performance measures in classification tasks; they apply to both binary and multi-class classification. For a sample set $D$, the error rate is (where $\mathbb{I}(\cdot)$ is the indicator function, which evaluates to 0 or 1):


$$E(f;D)=\frac{1}{m}\sum^m_{i=1}\mathbb{I}\big(f(\pmb{x}_i)\neq y_i\big)$$

Accuracy is defined as:


$$acc(f;D)=\frac{1}{m}\sum^m_{i=1}\mathbb{I}\big(f(\pmb{x}_i)= y_i\big)=1-E(f;D)$$

More generally, for a data distribution $\mathcal{D}$ with probability density function $p(\cdot)$, these can be written as:


$$E(f;\mathcal{D})=\int_{\pmb{x} \sim \mathcal{D}}\mathbb{I}\big(f(\pmb{x})\neq y\big)\, p(\pmb{x})\,d\pmb{x}$$

$$acc(f;\mathcal{D})=\int_{\pmb{x} \sim \mathcal{D}}\mathbb{I}\big(f(\pmb{x})= y\big)\, p(\pmb{x})\,d\pmb{x}$$

Precision, recall and the F1 score

Combining the true class with the class predicted by the learner divides the samples into four categories: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), as shown in the following table:

Precision $P$ and recall $R$ are defined as:


$$P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN}$$
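A minimal sketch that counts TP/FP/FN from binary label arrays and computes precision and recall (the function name and the 0/1 label encoding are assumptions for illustration):

```python
import numpy as np

def precision_recall(y_pred, y_true):
    """Precision and recall for binary labels encoded as 0/1 (1 = positive)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```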

Precision and recall are a pair of conflicting measures: when one is high, the other is often low, as shown in the figure below.

Therefore, one usually looks for a point that balances the two measures.

The Break-Even Point (BEP) is the point where precision equals recall. In the figure above, the BEP of learner C is 0.64, while learner A has the highest BEP, so A can be considered the best of the three learners.

BEP is somewhat oversimplified; the F1 score, based on the harmonic mean of precision and recall, is a more commonly used measure ($m$ below denotes the total number of samples):


$$F1=\frac{2\times P\times R}{P+R}=\frac{2\times TP}{m + TP - TN}$$

In some applications users may have different preferences for precision versus recall, so the general form $F_\beta$ is defined:


$$F_\beta=\frac{(1+\beta^2)\times P\times R}{(\beta^2 \times P)+R}$$

When $\beta=1$, the two measures are weighted equally and $F_\beta$ reduces to F1; when $0<\beta<1$, precision has the greater influence; when $\beta>1$, recall matters more (because the harmonic mean emphasizes smaller values).
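A minimal sketch of the $F_\beta$ computation (the function name is illustrative):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.6))          # F1
print(f_beta(0.8, 0.6, beta=2))  # recall weighted more heavily
```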

ROC curve and AUC area

A binary classifier produces a real-valued or probabilistic prediction for each test sample and compares it with a classification threshold to decide the class. The choice of threshold affects the learner's generalization performance. In practice, the test samples are sorted by the predicted value, and a "cut point" (chosen according to whether precision or recall matters more for the task) splits them into two parts; the value at the cut point serves as the classification threshold.

The ROC (Receiver Operating Characteristic) curve is drawn from the sorted prediction results. It is constructed similarly to the P-R curve, but its horizontal axis is the false positive rate (FPR) and its vertical axis is the true positive rate (TPR):


$$TPR=\frac{TP}{TP+FN},\quad FPR=\frac{FP}{TN+FP}$$

If the ROC curve of learner A is completely enclosed by that of learner B, then B is better than A. If the two curves cross, the AUC (Area Under the ROC Curve) is generally compared. AUC is obtained by summing the areas of the parts under the ROC curve: assuming the curve passes through the points $\{(x_1,y_1),\dots,(x_m,y_m)\}$ with $x_1=0$ and $x_m=1$ (the points are discrete, so the ROC curve is actually formed by connecting them), as in panel (b) of the figure below, the AUC can be estimated as:


$$AUC=\frac{1}{2}\sum^{m-1}_{i=1}(x_{i+1}-x_i)\times (y_i+y_{i+1})$$
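A minimal sketch of this trapezoidal estimate from ROC curve points (it assumes `xs`/`ys` are FPR/TPR coordinates sorted by increasing FPR; the example values are made up):

```python
def auc_trapezoid(xs, ys):
    """Estimate AUC by summing trapezoids under the piecewise-linear ROC curve."""
    area = 0.0
    for i in range(len(xs) - 1):
        area += 0.5 * (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1])
    return area

# Example ROC points: (FPR, TPR) from (0, 0) to (1, 1)
print(auc_trapezoid([0.0, 0.2, 0.5, 1.0], [0.0, 0.6, 0.8, 1.0]))
```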

Cost-sensitive error rate and cost curve

Different errors can have unequal costs. The performance measures above treat all types of errors as equally costly and ignore that different mistakes may have different consequences, some "more costly" than others. In cost-sensitive settings we therefore aim to minimize the "total cost" rather than simply the error rate.

The following table shows a binary cost matrix, where $cost_{ij}$ denotes the cost of misclassifying a class-$i$ sample as class $j$:

Based on the cost matrix, the cost-sensitive error rate is ($D^+$ and $D^-$ denote the positive and negative subsets of the sample set):


$$E(f;D;cost)=\frac{1}{m}\left(\sum_{\pmb{x}_i \in D^+}\mathbb{I}\big(f(\pmb{x}_i)\neq y_i\big)\times cost_{01}+\sum_{\pmb{x}_i \in D^-}\mathbb{I}\big(f(\pmb{x}_i)\neq y_i\big)\times cost_{10}\right)$$
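A minimal sketch of this cost-sensitive error rate for 0/1-labelled data (treating class 1 as positive is an encoding assumption for illustration):

```python
import numpy as np

def cost_sensitive_error(y_pred, y_true, cost01, cost10):
    """Cost-weighted error rate: cost01 penalizes errors on positive samples,
    cost10 penalizes errors on negative samples (positives encoded as 1)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    m = len(y_true)
    err_pos = np.sum((y_true == 1) & (y_pred != y_true)) * cost01
    err_neg = np.sum((y_true == 0) & (y_pred != y_true)) * cost10
    return (err_pos + err_neg) / m
```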

Under unequal costs, the ROC curve no longer reflects the learner's expected total cost, so the cost curve is used instead. Its horizontal axis is the positive-example probability cost, which takes values in $[0,1]$:


$$P(+)_{cost}=\frac{p\times cost_{01}}{p\times cost_{01}+(1-p)\times cost_{10}}$$

where $p$ is the probability that a sample is positive. The vertical axis is the normalized cost, which also takes values in the interval $[0,1]$.

Comparison test

Given the evaluation methods and performance measures above, how do we actually compare the performance of learners? Several issues need to be addressed:

  • We want to compare generalization performance, but experiments only measure performance on a test set, and the two do not necessarily agree.
  • Performance on a test set depends heavily on which test set is chosen. How do we remove this influence?
  • Many machine learning algorithms are randomized: even with the same parameters and the same test set, different runs may give different results. How do we compare learners in this case?

The following discussion of comparison tests uses the error rate ($\epsilon$) as the performance measure.

Hypothesis test

Suppose we hypothesize that $\epsilon=\epsilon_0$. The generalization error rate $\epsilon$ of a learner cannot be observed directly; only the test error rate $\hat{\epsilon}$ on a test set is available. The two are not necessarily equal, but they tend to be close (a very large difference is unlikely), so $\epsilon$ can be inferred from $\hat{\epsilon}$.

For a test set containing $m$ samples, a test error rate of $\hat{\epsilon}$ means that $\hat{\epsilon}\times m$ samples are misclassified. If the generalization error rate is $\epsilon$, the probability that the learner misclassifies a particular set of $m'$ samples and classifies the remaining ones correctly is $\epsilon^{m'}(1-\epsilon)^{m-m'}$. Hence the probability of misclassifying exactly $\hat{\epsilon}\times m$ samples (i.e., the probability that a learner with generalization error rate $\epsilon$ is observed to have test error rate $\hat{\epsilon}$) is:


$$P(\hat{\epsilon};\epsilon)=\binom{m}{\hat{\epsilon}\times m}\epsilon^{\hat{\epsilon}\times m}(1-\epsilon)^{m-\hat{\epsilon}\times m}$$

Given $\hat{\epsilon}$, solving $\frac{\partial P(\hat{\epsilon};\epsilon)}{\partial \epsilon}=0$ shows that $P(\hat{\epsilon};\epsilon)$ is maximized when $\epsilon=\hat{\epsilon}$, consistent with the binomial distribution.
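A minimal sketch that evaluates this probability with SciPy (the sample size, hypothesized error rate and observed error count are made-up example values):

```python
from scipy.stats import binom

m = 50       # number of test samples
eps = 0.3    # hypothesized generalization error rate
k_err = 10   # observed misclassifications, i.e. eps_hat * m

# P(eps_hat; eps): probability of observing exactly k_err errors
print(binom.pmf(k_err, m, eps))
```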

Binomial test

The following figure shows the histogram of misclassification probability of a binomial distribution:

Consider the hypothesis $\epsilon \leq \epsilon_0$. The maximum error rate that can be observed with probability $1-\alpha$ is given by the following equation (s.t. means "subject to"):


$$\bar{\epsilon}=\max\ \epsilon \quad \text{s.t.} \quad \sum^m_{i=\epsilon_0\times m + 1}\binom{m}{i}\epsilon^i(1-\epsilon)^{m-i} < \alpha$$

$1-\alpha$ is the confidence level, i.e., the non-shaded region in Figure 2.6. If the test error rate $\hat{\epsilon}$ is below the critical value $\bar{\epsilon}$, the hypothesis $\epsilon \leq \epsilon_0$ cannot be rejected at significance level $\alpha$.

t-test

With repeated training and testing (e.g., cross-validation), multiple test error rates are obtained and the binomial test no longer applies; the t-test can be used instead. Suppose we have $k$ test error rates $\hat{\epsilon}_1,\dots,\hat{\epsilon}_k$, from which the mean error rate $\mu$ and variance $\sigma^2$ are computed.

These $k$ test error rates can be regarded as independent samples of the generalization error rate $\epsilon_0$, so the variable

$$\mathcal{T}_t=\frac{\sqrt{k}(\mu-\epsilon_0)}{\sigma}$$

follows a $t$-distribution with $k-1$ degrees of freedom, as shown in the figure below:

Cross-validated t-test

For learners A and B, $k$-fold cross-validation yields test error rates $\epsilon_1^A,\dots,\epsilon_k^A$ and $\epsilon_1^B,\dots,\epsilon_k^B$, which can be compared with a paired t-test.

First, compute the per-fold differences $\Delta_i=\epsilon_i^A-\epsilon_i^B$. If every $\Delta_i=0$, the two learners perform identically; otherwise, the differences $\Delta_1,\dots,\Delta_k$ are used to test the hypothesis that the two learners perform the same. Compute their mean $\mu$ and variance $\sigma^2$; at significance level $\alpha$, consider the variable:


$$\mathcal{T}_t=\frac{\sqrt{k}\,\mu}{\sigma}$$

If this value is less than the critical value $t_{\alpha/2,\,k-1}$, the hypothesis cannot be rejected, i.e., there is no significant difference between the two learners; otherwise, the learner with the lower mean error rate is considered significantly better.
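A minimal sketch of this paired comparison using SciPy (the per-fold error rates are made-up example values, and the sample standard deviation with `ddof=1` is a modelling choice for the sketch):

```python
import numpy as np
from scipy import stats

# Example per-fold error rates for learners A and B (illustrative values)
err_a = np.array([0.12, 0.15, 0.11, 0.14, 0.13])
err_b = np.array([0.14, 0.16, 0.13, 0.15, 0.16])

diff = err_a - err_b
k = len(diff)
t_stat = np.sqrt(k) * diff.mean() / diff.std(ddof=1)
crit = stats.t.ppf(1 - 0.05 / 2, df=k - 1)   # two-sided critical value, alpha=0.05

print(abs(t_stat) < crit)   # True -> cannot reject "same performance"
```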

McNemar test

For learners A and B, we can record not only the test error rates but also the disagreement between their classification results (both correct, both wrong, or one correct and one wrong), as shown in the following contingency table:

If the two learners perform the same, then $e_{01}=e_{10}$ (here $e_{01}$ and $e_{10}$ count the test samples on which exactly one of the two learners is wrong, one count for each direction), and the variable $|e_{01}-e_{10}|$ should follow a normal distribution with mean 1 and variance $e_{01}+e_{10}$. The variable:


$$\mathcal{T}_{\chi^2}=\frac{(|e_{01}-e_{10}|-1)^2}{e_{01}+e_{10}}$$

should follow a $\chi^2$ (chi-square) distribution with 1 degree of freedom.
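A minimal sketch of this statistic and its comparison with the chi-square critical value (the disagreement counts `e01` and `e10` are illustrative):

```python
from scipy.stats import chi2

e01, e10 = 12, 5   # illustrative disagreement counts between learners A and B

t_chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
crit = chi2.ppf(1 - 0.05, df=1)   # critical value at alpha = 0.05

print(t_chi2, crit, t_chi2 > crit)  # True -> performances differ significantly
```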

Friedman test and Nemenyi post-hoc test

The methods above all compare two algorithms. When multiple algorithms are compared, one option is to compare them pairwise; another is to use the Friedman test, which is based on algorithm rankings.

Suppose three learning algorithms A, B and C are compared on four data sets $D_1,D_2,D_3,D_4$, and on each data set they are assigned ranks $1, 2, \dots$ according to their test performance, as shown in the following table:

The Friedman test determines whether these algorithms perform the same; if they do, their average ranks should be similar. Suppose $k$ algorithms are compared on $N$ data sets and $r_i$ denotes the average rank of the $i$-th algorithm. Then $r_i$ approximately follows a normal distribution with mean $(k+1)/2$ and variance $(k^2-1)/12$. The variable:


$$\mathcal{T}_{\chi^2}=\frac{k-1}{k}\times \frac{12N}{k^2-1}\sum^k_{i=1}\left(r_i-\frac{k+1}{2}\right)^2$$

follows a $\chi^2$ distribution with $k-1$ degrees of freedom when $k$ and $N$ are large.
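A minimal sketch of this statistic computed from a rank table (rows are data sets, columns are algorithms; the rank values are illustrative):

```python
import numpy as np

# Ranks of k=3 algorithms on N=4 data sets (illustrative values)
ranks = np.array([
    [1, 2,   3  ],
    [1, 2.5, 2.5],
    [1, 2,   3  ],
    [1, 2,   3  ],
])
N, k = ranks.shape
r = ranks.mean(axis=0)   # average rank of each algorithm

t_chi2 = (k - 1) / k * 12 * N / (k**2 - 1) * np.sum((r - (k + 1) / 2) ** 2)
print(t_chi2)
```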

The Friedman test results can be shown intuitively in a Friedman test diagram: if the intervals of two algorithms overlap, their performance is not significantly different; if they do not overlap, the difference is significant.

Bias-variance decomposition

After estimating a learner's generalization performance, we often want to explain why it generalizes the way it does; the bias-variance decomposition is an important tool for this.

The bias-variance decomposition explains generalization error by decomposing the expected error of the learning algorithm. For a test sample $\pmb{x}$, let $y_D$ be the label of $\pmb{x}$ in the data set, $y$ its true label, and $f(\pmb{x};D)$ the output of the model $f$ trained on $D$ at $\pmb{x}$. The expected prediction of the learning algorithm is:


$$\bar{f}(\pmb{x})=\mathbb{E}_D[f(\pmb{x};D)]$$

The variance arising from using different training sets of the same size is:


$$var(\pmb{x})=\mathbb{E}_D\big[(f(\pmb{x};D)-\bar{f}(\pmb{x}))^2\big]$$

Noise is:


$$\varepsilon^2=\mathbb{E}_D\big[(y_D-y)^2\big]$$

The difference between the expected prediction and the true label is called the bias:


$$bias^2(\pmb{x})=\big(\bar{f}(\pmb{x})-y\big)^2$$

Assuming the noise has zero expectation (to simplify the derivation), the expected generalization error of the algorithm can be decomposed by expanding the squared error and regrouping terms.

Finally, the generalization error decomposes into the sum of bias, variance and noise:


$$E(f;D)=bias^2(\pmb{x})+var(\pmb{x})+\varepsilon^2$$
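As an illustration of the decomposition, here is a small NumPy simulation: many training sets are drawn, a deliberately simple estimator (the training-set mean used as a constant predictor, an assumption chosen purely for illustration) is fitted on each, and the estimated bias², variance and noise are compared with the expected squared error at a test point.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true, sigma, n_train, n_sets = 1.0, 0.5, 20, 20_000

# Draw many training sets; the "model" is a constant predictor: the training mean
preds = np.array([
    rng.normal(y_true, sigma, n_train).mean() for _ in range(n_sets)
])
test_labels = rng.normal(y_true, sigma, n_sets)   # fresh noisy labels y_D

f_bar = preds.mean()                               # expected prediction
bias2 = (f_bar - y_true) ** 2                      # squared bias
var = np.mean((preds - f_bar) ** 2)                # variance over training sets
noise = np.mean((test_labels - y_true) ** 2)       # noise term

expected_error = np.mean((preds - test_labels) ** 2)
print(expected_error, bias2 + var + noise)   # the two values should be close
```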