This is the 19th day of my participation in the August More-Text Challenge. More challenges in August.

Effective Number of Hypotheses

Let's recall where this M came from. Let B_m denote the event that the m-th hypothesis h_m encounters a bad sample. The probability that some hypothesis encounters a bad sample is then the probability of the union B_1 ∪ B_2 ∪ … ∪ B_M. By the union bound, this probability is at most the sum of the probabilities of the individual events, P[B_1] + P[B_2] + … + P[B_M].
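
The union bound is easy to check numerically. Below is a minimal sketch — the three overlapping "bad" events on the unit interval are invented purely for illustration — showing that the probability of the union never exceeds the sum of the individual probabilities, and has plenty of slack when the events overlap heavily:

```python
import random

random.seed(0)
TRIALS = 100_000
union_hits = 0
individual_hits = [0, 0, 0]

for _ in range(TRIALS):
    u = random.random()
    # Three heavily overlapping "bad" events (hypothetical, for illustration).
    events = [u < 0.10, u < 0.12, 0.05 < u < 0.15]
    union_hits += any(events)
    for m, happened in enumerate(events):
        individual_hits[m] += happened

p_union = union_hits / TRIALS          # estimates P[B1 or B2 or B3] = 0.15
p_sum = sum(individual_hits) / TRIALS  # estimates P[B1] + P[B2] + P[B3] = 0.32
print(p_union, p_sum)
```

Because the three events mostly coincide, the union's probability (about 0.15) sits far below the summed bound (about 0.32) — exactly the looseness discussed next.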

But in fact these bad events are far from disjoint. Imagine two very similar hypotheses h_1 and h_2, whose bad events are B_1 and B_2 respectively. Because the two hypotheses are so close to each other, B_1 and B_2 overlap heavily whenever they occur.

So we wonder whether hypotheses that are close to one another can be treated as a single class — for example, hypotheses whose predictions are always the same or nearly the same.

Suppose our algorithm picks a line in the plane as its hypothesis h. There are infinitely many lines, but acting on a single data point x_1 they fall into just two categories: those that classify x_1 as +1 (drawn as a circle) and those that classify it as -1 (drawn as a cross).

What if we have 2 data points, x_1 and x_2? The infinitely many lines now divide into four categories, and acting on (x_1, x_2) these four kinds of lines produce a total of four different prediction results.

What if we have 3 data points, x_1, x_2 and x_3?

There are at most 8 categories of lines, producing at most the 8 different results above.

What happens if we have four data points? In the previous examples we could essentially enumerate the results point by point by permutation and combination, but with four or more points it is no longer so easy.



Of these 16 combinations, there are two that no line can produce. So although the hypothesis set of all lines in the 2-dimensional plane appears to require choosing from infinitely many lines, most of those lines produce identical results. Grouping lines by their results, the examples above give an effective number of lines of 2, 4, 8 and 14 respectively. Lines belonging to the same class encounter (or avoid) the bad sample together. Since the earlier union bound simply added up the probabilities of all M events as if they never overlapped, it clearly exaggerates the probability of encountering the bad sample. So we should rewrite the inequality with the effective number in place of M.
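
These effective numbers can be checked by brute force. The sketch below (the point coordinates are arbitrary choices of mine, in general position) enumerates the dichotomies realizable by lines in the plane, using the standard observation that any separating line can be moved until it passes through two of the points without relabeling the others:

```python
from itertools import combinations, product

def cross(o, a, b):
    """Signed area: positive if b lies to the left of the directed line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def line_dichotomies(points):
    """All dichotomies realizable by lines, for points in general position."""
    n = len(points)
    dichos = {(1,) * n, (-1,) * n}            # lines that miss every point
    for i, j in combinations(range(n), 2):
        base = [0] * n
        for k in range(n):
            if k not in (i, j):
                base[k] = 1 if cross(points[i], points[j], points[k]) > 0 else -1
        # The two points on the line can be nudged to either side of it.
        for si, sj in product((1, -1), repeat=2):
            d = list(base)
            d[i], d[j] = si, sj
            dichos.add(tuple(d))
            dichos.add(tuple(-v for v in d))  # same line, sides swapped
    return dichos

for pts in ([(0, 0), (2, 1)],
            [(0, 0), (2, 1), (1, 3)],
            [(0, 0), (2, 1), (1, 3), (4, 4)]):
    print(len(pts), len(line_dichotomies(pts)))
# prints: 2 4 / 3 8 / 4 14
```

The counts 4, 8 and 14 match the effective numbers of lines quoted above.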

How many different outcomes can H produce? (Dichotomies)

Pick any hypothesis h, apply it to the data points, and output a result vector: for example, a prediction on 4 points outputs a vector such as (o, x, x, o). Such an output vector is called a dichotomy. It is not difficult to see that one line corresponds to exactly one dichotomy, while one dichotomy corresponds to at least one line. We regard all the lines corresponding to the same dichotomy as one class, so the effective number of lines equals the number of different dichotomies.

Clearly the number of dichotomies is at most the number of o/x combinations of the N data points, namely 2^N. Not every combination is a dichotomy: the combination marked with a big cross in the figure above cannot be generated by any line, so it is not a dichotomy of the line hypothesis set. (Of course, for a hypothesis set other than lines, that combination might well be a dichotomy.) So what we are looking for — how many different classes of lines there are in the plane — equals how many different dichotomies H can generate on x_1, …, x_N, and this number never exceeds 2^N. Note also that the number depends on the points themselves: in the example above, if the three points are collinear, fewer dichotomies arise. Hence what we really want is the number of different dichotomies H can generate "at most".

Growth Function

So, at most how many different dichotomies can H produce acting on N points? This quantity is related to H and also to the amount of data N. As a formula, it can be expressed as:

m_H(N) = max over (x_1, …, x_N) of |H(x_1, …, x_N)|

The quantity above is called the growth function. For a fixed hypothesis set H, the growth function is a function of N alone. The following are the growth functions of several common hypothesis sets.

  1. Positive Rays



    The input space is the one-dimensional real line. If the input is greater than the threshold a, the prediction is +1; otherwise it is -1, i.e. h(x) = sign(x - a). For example, when N = 4, positive rays acting on x_1 ~ x_4 can produce a total of 5 different dichotomies, as in the diagram below:



    It's not hard to see that 4 points leave 5 possible positions for the threshold, hence at most 5 kinds of dichotomies. In general, N points leave N + 1 positions, so the growth function of Positive Rays is m_H(N) = N + 1.
  2. Positive Intervals



    Positive Intervals are similar to the previous case, except that there are two thresholds l < r: the predictions sandwiched between the two thresholds are +1, and all other predictions are -1. For example, when N = 4, positive intervals acting on x_1 ~ x_4 can generate a total of 11 dichotomies, as in the diagram below:



    It is also not difficult to see where 11 comes from: choose two of the five possible cut positions as the thresholds, giving C(5, 2) = 10 dichotomies, plus the single all-minus dichotomy generated when the interval contains no point at all. In general, the growth function of Positive Intervals is m_H(N) = C(N+1, 2) + 1 = N²/2 + N/2 + 1.
  3. Convex Sets



    Select any k points; all points within the convex region determined by them are predicted +1, and all other points -1. We said earlier that the growth function counts the number of dichotomies that can be produced "at most". If our N inputs are placed on a circle, then for any subset of the points, the convex hull of that subset picks out exactly those points, so every one of the 2^N combinations is a dichotomy. Therefore the growth function of Convex Sets is m_H(N) = 2^N.
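
The first two growth functions above can be verified by brute force on concrete inputs. A minimal sketch (the helper names are mine) that enumerates the dichotomies of positive rays and positive intervals and checks the two formulas:

```python
from itertools import combinations

def cut_positions(xs):
    """Candidate thresholds: one gap before, between and after the sorted inputs."""
    xs = sorted(xs)
    return [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]

def ray_dichotomies(xs):
    """Positive rays: h(x) = +1 iff x > a."""
    return {tuple(1 if x > a else -1 for x in sorted(xs)) for a in cut_positions(xs)}

def interval_dichotomies(xs):
    """Positive intervals: h(x) = +1 iff l < x < r, plus the empty interval."""
    cuts = cut_positions(xs)
    dichos = {tuple(1 if l < x < r else -1 for x in sorted(xs))
              for l, r in combinations(cuts, 2)}
    dichos.add((-1,) * len(xs))   # interval containing no point: all -1
    return dichos

# Check m_H(N) = N + 1 and m_H(N) = N(N+1)/2 + 1 for several N.
for n in range(1, 7):
    xs = list(range(n))
    assert len(ray_dichotomies(xs)) == n + 1
    assert len(interval_dichotomies(xs)) == n * (n + 1) // 2 + 1

print(len(ray_dichotomies([1, 2, 3, 4])), len(interval_dichotomies([1, 2, 3, 4])))  # -> 5 11
```

The N = 4 case reproduces the 5 and 11 dichotomies from the two diagrams above.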

Shatter & Break Point

We say that H shatters the N inputs when the number of dichotomies it produces on them equals 2^N, the number of all o/x assignments of the N points. Equivalently, we say that the resulting dichotomies shatter this arrangement of N points.

The meaning of "shatter" can be hard to grasp at first. Here is Teacher Lin's reply in the course discussion forum:

“Shatter means ‘to break into pieces’; in this context it means that all possible (o/x) configurations of the N points have been generated — we have shattered this arrangement of points.”

I understand shatter from the perspective of clearing levels in a game:

“H is a shotgun. At each level (level N) it has its pellets (one pellet per dichotomy) and faces 2^N enemies; to clear the level it must hit every single one — shatter them all. This shotgun clears levels 1 and 2, but at level 3 its 6 pellets cannot hit all 8 enemies, so there it breaks.”

For a given growth function, if as N grows there comes a k at which no arrangement of k inputs can be shattered — that is, m_H(k) < 2^k — then k is called a break point of the growth function.

According to the growth functions above: the break point of Positive Rays is 2 (m_H(2) = 3 < 4); the break point of Positive Intervals is 3 (m_H(3) = 7 < 8); and no matter how large N is, Convex Sets can shatter the N points, so their growth function has no break point. The break point of 2D Perceptrons is 4: when N = 3 the perceptron can shatter the points and produce all 8 dichotomies, but when N = 4 it cannot, producing at most 14 (< 2^4 = 16) dichotomies. Therefore the break point of the 2D Perceptrons growth function is 4.
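
Given a growth function, finding the smallest break point is mechanical: scan N upward until m_H(N) first falls below 2^N. A quick sketch using the formulas derived above:

```python
def smallest_break_point(m_H, max_n=12):
    """First N with m_H(N) < 2^N, or None if no break point up to max_n."""
    for n in range(1, max_n + 1):
        if m_H(n) < 2 ** n:
            return n
    return None

positive_rays      = lambda n: n + 1                 # m_H(N) = N + 1
positive_intervals = lambda n: n * (n + 1) // 2 + 1  # m_H(N) = C(N+1, 2) + 1
convex_sets        = lambda n: 2 ** n                # shatters every N

print(smallest_break_point(positive_rays),      # -> 2
      smallest_break_point(positive_intervals), # -> 3
      smallest_break_point(convex_sets))        # -> None
```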

The Restriction of Break Point

Some growth functions are easy to find, such as those of Positive Rays, Positive Intervals and Convex Sets. Some are not so easy, such as that of 2D Perceptrons: we cannot directly see what the growth function is. Does that mean there is nothing we can do? Not exactly — at least we still have its break point. Can we do something with that? If we cannot get the growth function itself, getting an upper bound for the growth function would already be nice.

Let's see how many dichotomies can be produced "at most, at most" when we know nothing about H except its break point k. Notice that I use two "most"s here. Since we cannot know the growth function exactly, the number of dichotomies computed from the break point alone is still an over-estimate. This over-estimate is an upper bound on the number of dichotomies actually produced by any hypothesis set whose break point is k.

For example, suppose we do not know a certain growth function but know that its break point is k = 2. How many kinds of dichotomies can be produced "at most" when acting on N = 3 points?

From k = 2 we know that no two data points may be shattered. Remember the concept of shatter? It means that the dichotomies we generate must not contain all four o/x patterns on any 2 of the data points. Let's build up the dichotomies one at a time, starting with a single dichotomy.

1 dichotomy:

| x1 | x2 | x3 |
| :-: | :-: | :-: |
| o | o | o |

2 dichotomies:

| x1 | x2 | x3 |
| :-: | :-: | :-: |
| o | o | o |
| o | o | x |

3 dichotomies:

| x1 | x2 | x3 |
| :-: | :-: | :-: |
| o | o | o |
| o | o | x |
| o | x | o |

Look at the x2 and x3 columns: these three dichotomies already contain three of the four patterns on the pair (x2, x3). One more and we would shatter them.

4 dichotomies:

| x1 | x2 | x3 |
| :-: | :-: | :-: |
| o | o | o |
| o | o | x |
| o | x | o |
| o | x | x |

Look at the two columns on the right: (x2, x3) have been shattered. However, as stated before, k = 2 means that no two points may be shattered, so it is impossible to produce these four dichotomies. Let's swap out the fourth dichotomy and try another.

4 dichotomies:

| x1 | x2 | x3 |
| :-: | :-: | :-: |
| o | o | o |
| o | o | x |
| o | x | o |
| x | o | o |

After changing the fourth dichotomy, the two columns on the right contain only 3 of the 4 patterns, so (x2, x3) are not shattered. Continuing to check the other pairs, (x1, x2) and (x1, x3) have not been shattered either, so these four dichotomies are fine.

The five-dichotomy cases will not be drawn here, but it is easy to check that no matter which fifth dichotomy is added, some two points get shattered. So at most four dichotomies are possible, i.e. the upper bound for N = 3 with break point 2 is 4. We write B(N, k) for this upper limit ("at most") on the number of dichotomies that any hypothesis set with break point k can generate on N points. The conclusion just drawn can therefore be expressed as B(3, 2) = 4.
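
This "at most" value can also be found by exhaustive search over all sets of dichotomies. A minimal sketch, feasible only for tiny N because of the combinatorial explosion:

```python
from itertools import combinations, product

def shatters(dichos, idx):
    """True if the dichotomies, restricted to the index tuple idx, realize all patterns."""
    return len({tuple(d[i] for i in idx) for d in dichos}) == 2 ** len(idx)

def bound_B(n, k):
    """B(n, k): size of the largest set of dichotomies on n points shattering no k points."""
    all_d = list(product((1, -1), repeat=n))
    for size in range(2 ** n, 0, -1):          # try the biggest sets first
        for dichos in combinations(all_d, size):
            if not any(shatters(dichos, idx)
                       for idx in combinations(range(n), k)):
                return size
    return 0

print(bound_B(3, 2))  # -> 4
```

Running `bound_B(3, 2)` confirms the hand-built example above: four dichotomies at most.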

Now, the nice thing is this: although a lot of the time we cannot compute the growth function m_H(N) directly, if we know its break point k, it looks like we can compute the upper bound B(N, k). So we have a new goal: not to study m_H(N) directly, but to study B(N, k).

And we already know B(3, 2) = 4. A few more cases are not hard to see:

  • When k = 1, not even a single point may be shattered — its two patterns may not both appear — so B(N, 1) is identically 1.
  • When N < k, there are fewer than k points, so no k points can possibly be shattered; every assignment is allowed, and the number of dichotomies equals the number of o/x combinations of the N points: B(N, k) = 2^N.
  • When N = k, taking all 2^N combinations would shatter the N points, but removing any one of them leaves a legal set, so at most B(N, N) = 2^N - 1 dichotomies can be generated.

Therefore, we can get the following table (question marks are entries we cannot fill in yet):

| B(N, k) | k=1 | k=2 | k=3 | k=4 | k=5 |
| :-: | :-: | :-: | :-: | :-: | :-: |
| N=1 | 1 | 2 | 2 | 2 | 2 |
| N=2 | 1 | 3 | 4 | 4 | 4 |
| N=3 | 1 | 4 | 7 | 8 | 8 |
| N=4 | 1 | ? | ? | 15 | 16 |
| N=5 | 1 | ? | ? | ? | 31 |

How do we fill in the rest of the table? Could an entry be related to the entries already filled in? At this moment, the graduate student who helps the teacher flip coins in the xx lab takes the stage. He ran through all the possible dichotomies and found the following results:

Let’s put the results in order:

See, the orange parts come in pairs — dichotomies identical on x1, x2, x3 and differing only in x4 (say α pairs, 2α dichotomies in total) — while the purple parts come alone (say β dichotomies):

  1. If we remove x4 and look only at the three points x1, x2, x3:



    The distinct dichotomies on these three points (each orange pair counted once, plus each purple singleton) must not shatter any 3 points, because the break point is 3. Hence their number is at most B(3, 3): α + β ≤ B(3, 3).
  2. Let’s look at the remaining part:



    Note that this part exists in pairs on x4: these are the α paired dichotomies. Restricted to (x1, x2, x3), they must not shatter any two points — if some two points were shattered there, then pairing each of those dichotomies with both x4 = o and x4 = x would shatter three points, contradicting the break point of 3. So no two points may be shattered, and therefore α ≤ B(3, 2).

Combining the above two inequalities (the total number of dichotomies on the four points being 2α + β), we can get:

B(4, 3) = 2α + β ≤ B(3, 3) + B(3, 2)

and, by the same argument in general, B(N, k) ≤ B(N-1, k) + B(N-1, k-1).

This completes the previous table (entries computed with the recursion are upper bounds):

| B(N, k) | k=1 | k=2 | k=3 | k=4 | k=5 |
| :-: | :-: | :-: | :-: | :-: | :-: |
| N=1 | 1 | 2 | 2 | 2 | 2 |
| N=2 | 1 | 3 | 4 | 4 | 4 |
| N=3 | 1 | 4 | 7 | 8 | 8 |
| N=4 | 1 | ≤5 | ≤11 | 15 | 16 |
| N=5 | 1 | ≤6 | ≤16 | ≤26 | 31 |

It can be proved by mathematical induction that:

B(N, k) ≤ C(N, 0) + C(N, 1) + … + C(N, k-1)

The inequality holds trivially when k = 1, since B(N, 1) = 1 = C(N, 0), so we only discuss the case k ≥ 2. When N = 1, B(1, k) = 2 = C(1, 0) + C(1, 1), so the inequality is true. Now suppose the inequality is true for N - 1 and every k; we need to prove that it is also true for N. According to the previous conclusion:

B(N, k) ≤ B(N-1, k) + B(N-1, k-1)
≤ [C(N-1, 0) + … + C(N-1, k-1)] + [C(N-1, 0) + … + C(N-1, k-2)]
= C(N-1, 0) + [C(N-1, 1) + C(N-1, 0)] + … + [C(N-1, k-1) + C(N-1, k-2)]
= C(N, 0) + C(N, 1) + … + C(N, k-1)

where the last step uses Pascal's rule C(N-1, i) + C(N-1, i-1) = C(N, i). This completes the induction.
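
The boundary cases and the recursion can be turned into a few lines of code; the sketch below checks numerically that the recursively computed bound agrees term for term with the binomial-sum bound:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(n, k):
    """Upper bound on B(N, k) from the boundary cases plus
    B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    if k == 1:
        return 1                 # not even one point may be shattered
    if n < k:
        return 2 ** n            # too few points for any k of them to be shattered
    if n == k:
        return 2 ** n - 1        # all patterns but one
    return B_upper(n - 1, k) + B_upper(n - 1, k - 1)

# The recursion reproduces the bound sum_{i<k} C(N, i) exactly.
for n in range(1, 10):
    for k in range(1, n + 1):
        assert B_upper(n, k) == sum(comb(n, i) for i in range(k))

print(B_upper(4, 3))  # -> 11
```

In particular `B_upper(4, 3)` returns 11, matching the brute-force count of the table's (N=4, k=3) entry.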