2.2 Guarantees for finite hypothesis sets — consistent case

In the example of axis-aligned rectangles that we examined, the hypothesis $h_S$ returned by the algorithm is always consistent, that is, it admits no error on the training sample $S$. In this section, we present a general sample complexity bound, or equivalently a generalization bound, for consistent hypotheses, in the case where the cardinality $|H|$ of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept $c$ is in $H$.

Theorem 2.1 (Learning bound — finite $H$, consistent case)

Let $H$ be a finite set of functions mapping from $X$ to $Y$. Let $A$ be an algorithm that, for any target concept $c \in H$ and any i.i.d. sample $S$, returns a consistent hypothesis $h_S$: $\widehat R(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta$ holds if


$$
m \ge \frac{1}{\epsilon}\left(\log|H| + \log\frac{1}{\delta}\right). \qquad (2.8)
$$
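
As a quick numeric illustration of (2.8), the following Python sketch computes the smallest sample size satisfying the bound; the values of $|H|$, $\epsilon$, and $\delta$ are illustrative choices, not taken from the text.

```python
import math

def sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Smallest integer m satisfying (2.8): m >= (1/eps) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Illustrative values: a hypothesis set of size 2^20, accuracy eps = 0.01,
# confidence delta = 0.05.
print(sample_complexity(h_size=2**20, eps=0.01, delta=0.05))  # 1686
```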

This sample complexity result admits the following equivalent statement as a generalization bound: for any $\epsilon, \delta > 0$, with probability at least $1 - \delta$,


$$
R(h_S) \le \frac{1}{m}\left(\log|H| + \log\frac{1}{\delta}\right). \qquad (2.9)
$$
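
Read in the other direction, (2.9) turns a given sample size into an error guarantee. A minimal sketch under the same illustrative values as above:

```python
import math

def generalization_bound(h_size: int, m: int, delta: float) -> float:
    """Right-hand side of (2.9): (1/m) * (log|H| + log(1/delta))."""
    return (math.log(h_size) + math.log(1.0 / delta)) / m

# Doubling the sample size roughly halves the bound, reflecting the O(1/m) rate.
print(generalization_bound(h_size=2**20, m=1686, delta=0.05))  # ~0.0100
print(generalization_bound(h_size=2**20, m=3372, delta=0.05))  # ~0.0050
```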

Proof: Fix $\epsilon > 0$. We do not know which consistent hypothesis $h_S \in H$ is selected by the algorithm $A$; this hypothesis further depends on the training sample $S$. Therefore, we need to give a uniform convergence bound, that is, a bound that holds for the set of all consistent hypotheses, which in particular includes $h_S$. Thus, we will bound the probability that some $h \in H$ is consistent and has error greater than $\epsilon$:


$$
\begin{aligned}
\Pr[\exists h \in H : \widehat R(h) = 0 \land R(h) > \epsilon]
&= \Pr\bigl[(\widehat R(h_1) = 0 \land R(h_1) > \epsilon) \lor (\widehat R(h_2) = 0 \land R(h_2) > \epsilon) \lor \cdots\bigr]\\
&\le \sum_{h \in H} \Pr[\widehat R(h) = 0 \land R(h) > \epsilon] && \text{(union bound)}\\
&\le \sum_{h \in H} \Pr[\widehat R(h) = 0 \mid R(h) > \epsilon] && \text{(definition of conditional probability)}
\end{aligned}
$$

Now, consider any hypothesis $h \in H$ with $R(h) > \epsilon$. Then the probability that $h$ is consistent on a training sample $S$ drawn i.i.d., that is, that it makes no error on any point in $S$, can be bounded as follows:


$$
\Pr[\widehat R(h) = 0 \mid R(h) > \epsilon] \le (1 - \epsilon)^m.
$$
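
For reference, one way to spell out this step: since the sample is drawn i.i.d. and a hypothesis with $R(h) > \epsilon$ classifies a single random point correctly with probability less than $1 - \epsilon$, we have

$$
\Pr[\widehat R(h) = 0 \mid R(h) > \epsilon]
= \prod_{i=1}^{m} \Pr[h(x_i) = c(x_i) \mid R(h) > \epsilon]
\le (1 - \epsilon)^m.
$$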

The previous inequality means that


$$
\Pr[\exists h \in H : \widehat R(h) = 0 \land R(h) > \epsilon] \le |H|(1 - \epsilon)^m.
$$
Setting the right-hand side equal to $\delta$ and solving for $\epsilon$ (or for $m$) concludes the proof.
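
The omitted algebra, spelled out: using $1 - \epsilon \le e^{-\epsilon}$, the bound above gives

$$
|H|(1 - \epsilon)^m \le |H|\, e^{-m\epsilon},
\qquad
|H|\, e^{-m\epsilon} \le \delta
\;\Longleftrightarrow\;
m \ge \frac{1}{\epsilon}\left(\log|H| + \log\frac{1}{\delta}\right),
$$

which is exactly (2.8); solving $|H| e^{-m\epsilon} = \delta$ for $\epsilon$ instead of $m$ yields the generalization bound (2.9).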

The theorem shows that when the hypothesis set $H$ is finite, a consistent algorithm $A$ is a PAC-learning algorithm, since the sample complexity given by (2.8) is dominated by a polynomial in $1/\epsilon$ and $1/\delta$. As shown by (2.9), the generalization error of consistent hypotheses is upper bounded by a term that decreases as a function of the sample size $m$. This is a general fact: as expected, learning algorithms benefit from larger labeled training samples. The decrease rate of $O(1/m)$ guaranteed by this theorem, however, is particularly favorable. The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set $H$ containing the target concept. Of course, the upper bound (2.9) increases with $|H|$; however, that dependency is only logarithmic. Note that the term $\log|H|$, or the related term $\log_2|H|$ from which it differs only by a constant factor, can be interpreted as the number of bits needed to represent $H$. Thus, the generalization guarantee of the theorem is controlled by the ratio of this number of bits, $\log_2|H|$, and the sample size $m$.
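
As a final sanity check, the following toy simulation (an illustrative setup, not from the text) draws i.i.d. samples for a small finite class of threshold functions containing the target concept, picks a consistent hypothesis, and verifies that its true error rarely exceeds the bound (2.9):

```python
import math
import random

random.seed(0)

# Toy finite hypothesis set (illustrative): thresholds h_t(x) = 1[x >= t] for t on a
# grid of N + 1 points in [0, 1]; the target concept c = h_{0.5} belongs to H.
N = 100
H = [i / N for i in range(N + 1)]
target = 0.5
m, delta = 200, 0.05
bound = (math.log(len(H)) + math.log(1.0 / delta)) / m   # right-hand side of (2.9)

trials, failures = 500, 0
for _ in range(trials):
    sample = [random.random() for _ in range(m)]          # points drawn from the uniform distribution D
    labels = [x >= target for x in sample]                # labels given by the target concept
    # Any hypothesis with zero empirical error plays the role of h_S.
    consistent = [t for t in H
                  if all((x >= t) == y for x, y in zip(sample, labels))]
    t_hat = consistent[0]
    true_error = abs(t_hat - target)                      # exact error under the uniform distribution
    failures += true_error > bound

# The empirical failure rate should be well below delta = 0.05.
print(f"bound = {bound:.4f}, empirical failure rate = {failures / trials:.4f}")
```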