5 The Bias-Complexity Tradeoff

In Chapter 2 we saw that unless one is careful, the training data can mislead the learner and result in overfitting. To overcome this problem, we restricted the search space to some hypothesis class $\mathcal{H}$. Such a hypothesis class can be viewed as reflecting some prior knowledge that the learner has about the task, namely a belief that one of the members of the class $\mathcal{H}$ is a low-error model for the task. For example, in our papaya taste problem, on the basis of our previous experience with other fruits, we may assume that some rectangle in the color-hardness plane predicts (at least approximately) the papaya's tastiness.

Is such prior knowledge really necessary for successful learning? Perhaps there exists some kind of universal learner, that is, a learner that has no prior knowledge about a particular task and is ready to be challenged by any task? Let us elaborate on this point. A specific learning task is defined by an unknown distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, where the goal of the learner is to find a predictor $h:\mathcal{X}\rightarrow\mathcal{Y}$ whose risk, $L_{\mathcal{D}}(h)$, is small enough. The question is therefore whether there exist a learning algorithm $A$ and a training set size $m$ such that, for every distribution $\mathcal{D}$, if $A$ receives $m$ i.i.d. examples from $\mathcal{D}$, it is likely to output a predictor $h$ with low risk.
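
As a reminder of the notation used throughout (this restates the definition from the earlier chapters rather than adding anything new), for binary classification with the 0-1 loss the risk of a predictor $h$ is the probability that it errs on an example drawn from $\mathcal{D}$:

$$L_{\mathcal{D}}(h) \;=\; \underset{(x,y)\sim\mathcal{D}}{\mathbb{P}}\big[h(x)\neq y\big].$$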

The first part of this chapter addresses this question formally. The No-Free-Lunch theorem states that no such universal learner exists. More precisely, the theorem states that for binary classification prediction tasks, for every learner there exists a distribution on which it fails. We say that a learner fails if, upon receiving i.i.d. examples from that distribution, its output hypothesis is likely to have a large risk, say, $\geq 0.3$, whereas for the same distribution there exists another learner that will output a hypothesis with a small risk. In other words, the theorem states that no learner can succeed on all learnable tasks: every learner has tasks on which it fails while other learners succeed.

Therefore, when approaching a particular learning problem defined by some distribution $\mathcal{D}$, we should have some prior knowledge about $\mathcal{D}$. One type of such prior knowledge is that $\mathcal{D}$ comes from some specific parametric family of distributions. We will study learning under such assumptions later on, in Chapter 24. Another type of prior knowledge about $\mathcal{D}$, which we assumed when defining the PAC learning model, is that there exists an $h$ in some predefined hypothesis class $\mathcal{H}$ such that $L_{\mathcal{D}}(h)=0$. A softer type of prior knowledge about $\mathcal{D}$ is the assumption that $\min_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$ is small. In a sense, this weaker assumption on $\mathcal{D}$ is what underlies the agnostic PAC model, in which we require that the risk of the output hypothesis not be much larger than $\min_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$.

In the second part of this chapter we study the benefits and pitfalls of using a hypothesis class as a means of formalizing prior knowledge. We decompose the error of an ERM algorithm over a class $\mathcal{H}$ into two components. The first component reflects the quality of our prior knowledge, measured by the minimal risk of a hypothesis in our hypothesis class, $\min_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$. This component is also called the approximation error, or the bias of the algorithm toward choosing a hypothesis from $\mathcal{H}$. The second component is the error due to overfitting, which depends on the size, or the complexity, of the class $\mathcal{H}$ and is called the estimation error. These two terms imply a tradeoff between choosing a more complex $\mathcal{H}$, which can decrease the bias but increases the risk of overfitting, and a less complex $\mathcal{H}$, which might increase the bias but reduces the potential for overfitting.
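
In symbols, writing $h_S$ for the hypothesis returned by applying ERM over $\mathcal{H}$ to a training set $S$, this decomposition can be written as

$$L_{\mathcal{D}}(h_S) \;=\; \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}, \qquad \text{where}\quad \epsilon_{\mathrm{app}} = \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) \quad\text{and}\quad \epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}},$$

with $\epsilon_{\mathrm{app}}$ the approximation error and $\epsilon_{\mathrm{est}}$ the estimation error discussed above.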

5.1 The No-Free-Lunch Theorem

In this section we prove that there is no universal learner. We do this by showing that no learner can succeed on all learning tasks, as formalized in the following theorem:

Theorem 5.1 (No-Free-Lunch) Let $A$ be any learning algorithm for the task of binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}|/2$, representing a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ such that:

1. There exists a function $f:\mathcal{X}\rightarrow\{0,1\}$ with $L_{\mathcal{D}}(f)=0$.
2. With probability of at least $1/7$ over the choice of $S\sim\mathcal{D}^m$, we have that $L_{\mathcal{D}}(A(S))\geq 1/8$.

This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. Indeed, a trivial successful learner in this case would be an ERM learner with the hypothesis class $\mathcal{H}=\{f\}$, or, more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose size satisfies $m\geq 8\log(7|\mathcal{H}|/6)$ (see Corollary 2.3).
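
As a quick illustration of this bound, here is a small sketch (the helper name `sufficient_sample_size` is ours, and we assume the logarithm is the natural one):

```python
import math

def sufficient_sample_size(h_size: int) -> int:
    """Smallest integer m satisfying m >= 8*log(7*|H|/6) (natural log assumed)."""
    return math.ceil(8 * math.log(7 * h_size / 6))

print(sufficient_sample_size(2))     # |H| = 2    ->  7 examples suffice
print(sufficient_sample_size(1000))  # |H| = 1000 -> 57 examples suffice
```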

Proof: Let $C$ be a subset of $\mathcal{X}$ of size $2m$. The intuition behind the proof is that any learning algorithm that observes only half of the instances in $C$ has no information about the labels of the remaining instances in $C$. Therefore, there exists a "reality," that is, some target function $f$, that contradicts the labels that $A(S)$ predicts on the unobserved instances in $C$.

Note that there are $T=2^{2m}$ possible functions from $C$ to $\{0,1\}$. Denote these functions by $f_1,\ldots,f_T$. For each such function, let $D_i$ be the distribution over $C\times\{0,1\}$ defined by


$$D_i(\{(x,y)\}) \;=\; \begin{cases} 1/|C| & \text{if } y=f_i(x)\\ 0 & \text{otherwise.}\end{cases}$$

That is, the probability of choosing a pair $(x,y)$ is $1/|C|$ if the label $y$ is indeed the true label according to $f_i$, and the probability is $0$ if $y\neq f_i(x)$. Clearly, $L_{D_i}(f_i)=0$.
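
For concreteness, suppose $m=1$, $C=\{c_1,c_2\}$, and $f_i(c_1)=0,\ f_i(c_2)=1$. Then $D_i$ assigns probability $1/2$ to each of the pairs $(c_1,0)$ and $(c_2,1)$ and probability $0$ to $(c_1,1)$ and $(c_2,0)$; drawing from $D_i$ amounts to drawing an instance uniformly from $C$ and labeling it by $f_i$.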

We will show that for every algorithm $A$ that receives a training set of $m$ examples from $C\times\{0,1\}$ and returns a function $A(S):C\rightarrow\{0,1\}$, it holds that


$$\max_{i\in[T]}\ \underset{S\sim D_i^m}{\mathbb{E}}\big[L_{D_i}(A(S))\big] \;\geq\; 1/4. \qquad (5.1)$$

Clearly, this means that for every algorithm $A'$ that receives a training set of $m$ examples from $\mathcal{X}\times\{0,1\}$, there exist a function $f:\mathcal{X}\rightarrow\{0,1\}$ and a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ such that $L_{\mathcal{D}}(f)=0$ and


$$\underset{S\sim\mathcal{D}^m}{\mathbb{E}}\big[L_{\mathcal{D}}(A'(S))\big] \;\geq\; 1/4. \qquad (5.2)$$

It is easy to verify that the preceding suffices for showing that $\mathbb{P}[L_{\mathcal{D}}(A'(S))\geq 1/8]\geq 1/7$, which is what we need to prove (see Exercise 1).
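
To sketch why this suffices (this is essentially the content of Exercise 1): the 0-1 loss takes values in $[0,1]$, so writing $Z=L_{\mathcal{D}}(A'(S))$ and $P=\mathbb{P}[Z\geq 1/8]$, Equation (5.2) gives

$$\frac{1}{4} \;\leq\; \mathbb{E}[Z] \;\leq\; \frac{1}{8}\,(1-P) + 1\cdot P,$$

and solving for $P$ yields $P \geq (1/4-1/8)/(1-1/8) = 1/7$.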

We now turn to proving that Equation (5.1) holds. There are $k=(2m)^m$ possible sequences of $m$ examples from $C$. Denote these sequences by $S_1,\ldots,S_k$. Also, if $S_j=(x_1,\ldots,x_m)$, we denote by $S_j^i$ the sequence containing the instances in $S_j$ labeled by the function $f_i$, namely, $S_j^i=((x_1,f_i(x_1)),\ldots,(x_m,f_i(x_m)))$. If the distribution is $D_i$, then the possible training sets $A$ can receive are $S_1^i,\ldots,S_k^i$, and all of them have the same probability of being sampled. Therefore,


$$\underset{S\sim D_i^m}{\mathbb{E}}\big[L_{D_i}(A(S))\big] \;=\; \frac{1}{k}\sum_{j=1}^{k}L_{D_i}\big(A(S_j^i)\big). \qquad (5.3)$$

Using the fact that “maximum” is greater than “average” and “average” is greater than “minimum”, we have


$$\begin{aligned} \max_{i\in[T]}\frac{1}{k}\sum_{j=1}^{k}L_{D_i}\big(A(S_j^i)\big) & \geq \frac{1}{T}\sum_{i=1}^{T}\frac{1}{k}\sum_{j=1}^{k}L_{D_i}\big(A(S_j^i)\big)\\ &= \frac{1}{k}\sum_{j=1}^{k}\frac{1}{T}\sum_{i=1}^{T}L_{D_i}\big(A(S_j^i)\big)\\ &\geq \min_{j\in[k]}\frac{1}{T}\sum_{i=1}^{T}L_{D_i}\big(A(S_j^i)\big). \qquad (5.4) \end{aligned}$$

Next, fix some $j\in[k]$. Denote $S_j=(x_1,\ldots,x_m)$ and let $v_1,\ldots,v_p$ be the examples in $C$ that do not appear in $S_j$. Clearly, $p\geq m$. Therefore, for every function $h:C\rightarrow\{0,1\}$ and every $i$ we have


$$\begin{aligned} L_{D_i}(h) &= \frac{1}{2m}\sum_{x\in C}\mathbb{1}_{[h(x)\neq f_i(x)]}\\ &\geq \frac{1}{2m}\sum_{r=1}^{p}\mathbb{1}_{[h(v_r)\neq f_i(v_r)]}\\ &\geq \frac{1}{2p}\sum_{r=1}^{p}\mathbb{1}_{[h(v_r)\neq f_i(v_r)]}. \qquad (5.5) \end{aligned}$$

Hence,


$$\begin{aligned} \frac{1}{T}\sum_{i=1}^{T}L_{D_i}\big(A(S_j^i)\big) &\geq \frac{1}{T}\sum_{i=1}^{T}\frac{1}{2p}\sum_{r=1}^{p}\mathbb{1}_{[A(S_j^i)(v_r)\neq f_i(v_r)]}\\ &= \frac{1}{2p}\sum_{r=1}^{p}\frac{1}{T}\sum_{i=1}^{T}\mathbb{1}_{[A(S_j^i)(v_r)\neq f_i(v_r)]}\\ &\geq \frac{1}{2}\cdot\min_{r\in[p]}\frac{1}{T}\sum_{i=1}^{T}\mathbb{1}_{[A(S_j^i)(v_r)\neq f_i(v_r)]}. \qquad (5.6) \end{aligned}$$

Next, fix some $r\in[p]$. We can partition all the functions in $f_1,\ldots,f_T$ into $T/2$ disjoint pairs, where for a pair $(f_i,f_{i'})$ we have that for every $c\in C$, $f_i(c)\neq f_{i'}(c)$ if and only if $c=v_r$. Since for such a pair we must have $S_j^i=S_j^{i'}$, it follows that


$$\mathbb{1}_{[A(S_j^i)(v_r)\neq f_i(v_r)]} + \mathbb{1}_{[A(S_j^{i'})(v_r)\neq f_{i'}(v_r)]} \;=\; 1,$$

which yields


$$\frac{1}{T}\sum_{i=1}^{T}\mathbb{1}_{[A(S_j^i)(v_r)\neq f_i(v_r)]} \;=\; \frac{1}{2}.$$

Combining this equality with Equations (5.6), (5.4), and (5.3), we obtain that Equation (5.1) holds, which concludes our proof.
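
As an optional sanity check (not part of the original argument), the following sketch verifies Equation (5.1) by brute force for $m=2$, using one particular learner: a hypothetical memorizing rule that repeats the labels it has seen and predicts $0$ everywhere else. All names in the snippet are ours.

```python
import itertools

m = 2
C = list(range(2 * m))                                    # domain C with |C| = 2m
funcs = list(itertools.product([0, 1], repeat=len(C)))    # all T = 2^(2m) labelings f_i
seqs = list(itertools.product(C, repeat=m))               # all k = (2m)^m instance sequences

def memorizing_learner(S):
    """Hypothetical learner A: repeat observed labels, predict 0 on unseen points."""
    seen = {x: y for x, y in S}
    return lambda x: seen.get(x, 0)

def true_risk(h, f):
    """L_{D_i}(h), where D_i is uniform over C with labels given by f."""
    return sum(h(x) != f[x] for x in C) / len(C)

worst_expected_risk = 0.0
for f in funcs:                          # each f_i defines a distribution D_i
    # All k training sets S_j^i are equally likely under D_i^m (Equation (5.3)).
    avg = sum(true_risk(memorizing_learner([(x, f[x]) for x in xs]), f)
              for xs in seqs) / len(seqs)
    worst_expected_risk = max(worst_expected_risk, avg)

print(worst_expected_risk)               # >= 0.25, as Equation (5.1) guarantees
```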