1 Introduction

Today I am going to talk about a very important concept in data analysis and machine learning: confidence levels and confidence intervals. Why are they important? Let me give you an example.

Suppose we get a movie data set and want to pick out the top 10 comedy movies on Douban. This does not seem difficult: a few lines of code are enough to do the analysis and give a result.

However, when I went back and checked the result carefully, I found that five of the 10 movies had been rated by only one person each, and every one of those ratings was 10 points.

A top 10 selected on this basis is naturally not convincing.

What we would expect is for each movie to be rated by a large number of moviegoers, and then to pick the best-scoring ones from among those movies.

This leads to the concepts of confidence level and confidence interval.

If a movie is rated by many people and its average score is 8.5, we can say that its true score lies between 8.2 and 8.8 with a high level of confidence, let's say 90%.

On the other hand, if a movie is rated by only two people, then even though it ends up with an average of 9.5, the confidence that its true score lies in the range 9.2 to 9.8 may not be as high, perhaps an estimated 50%. In other words, the interval between 9.2 and 9.8 is much more likely to be wrong. After all, the confidence is only 50%.

2 Theoretical Explanation

If we ask an infinite number of moviegoers to rate a movie, the following graph shows the overall (population) distribution of the ratings, with mean μ and standard deviation σ:

If we have μ and σ, we can say that about 68% of the samples fall within one σ of the mean (the red region), and about 95% fall within two σ of the mean.
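The 68% and 95% figures can be checked numerically. The sketch below assumes the ratings follow a normal distribution and uses Python's standard `statistics.NormalDist`; the helper name `coverage_within` and the default μ = 0, σ = 1 are illustrative choices, not part of the original article.

```python
from statistics import NormalDist

def coverage_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal sample lies within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

one_sigma = coverage_within(1)   # roughly 0.68
two_sigma = coverage_within(2)   # roughly 0.95
print(f"within 1 sigma: {one_sigma:.4f}")
print(f"within 2 sigma: {two_sigma:.4f}")
```

The result does not depend on the particular μ and σ passed in, because the interval is expressed in units of σ.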

Suppose the sample is infinite, so that the distribution of scores for a given movie is the population distribution, with a mean of 0.65 (out of 1) and a standard deviation of 0.03.

Then about 95% of the scores fall between 0.59 and 0.71 (0.65 ± 2 × 0.03). The narrower range of 0.62 to 0.68 (one σ) covers only about 68%.

So, to make the results more convincing, we can first filter out films with too few ratings, and then rank the films that remain.

3. Calculate the number of samples corresponding to 95% confidence

Given the sample standard deviation, the Z value, and the width of the confidence interval, the required number of samples can be calculated from a formula. The formula is easy to look up, so it is not listed here.

As shown in the table above, with 95% confidence and a 5% margin of error, at least 385 samples are required.
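The article does not list the formula, so the following is only a sketch under an assumption: it uses the common textbook form n = (Z × σ / E)², with σ = 0.5 (the largest standard deviation a score scaled to the range 0 to 1 can have) and E = 0.05. The function name and parameters are illustrative. Under these assumptions the result agrees with the 385 quoted above.

```python
import math
from statistics import NormalDist

def required_sample_size(confidence=0.95, margin_of_error=0.05, sigma=0.5):
    """Smallest n with z * sigma / sqrt(n) <= margin_of_error.

    sigma=0.5 is an assumed worst case for scores scaled to [0, 1].
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided Z value
    n = (z * sigma / margin_of_error) ** 2
    return math.ceil(n)  # round up: a fraction of a rating is not possible

n_required = required_sample_size()
print(n_required)
```

A smaller assumed σ or a wider allowed error would lower the required number of ratings.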

So, our problem is solved. Find all the comedy movies that have been rated at least 385 times, and take the top 10 by average score.

Because this calculation uses the Z value, the next section describes how to obtain it, as supplementary knowledge.
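A minimal sketch of this filter-then-rank step is shown below. The data layout (a dictionary mapping a title to its list of ratings), the function name `top_movies`, and the sample ratings are all made up for illustration; the real Douban data set will look different.

```python
from statistics import mean

MIN_RATINGS = 385  # threshold from the text: 95% confidence, 5% error

def top_movies(ratings_by_movie, min_ratings=MIN_RATINGS, k=10):
    """Return up to k (title, average, count) tuples, highest average first."""
    eligible = [
        (title, mean(scores), len(scores))
        for title, scores in ratings_by_movie.items()
        if len(scores) >= min_ratings  # drop movies with too few ratings
    ]
    eligible.sort(key=lambda row: row[1], reverse=True)
    return eligible[:k]

# Made-up ratings, only to show the effect of the filter.
sample = {
    "Movie A": [10],            # a single rating of 10: excluded
    "Movie B": [8, 9] * 200,    # 400 ratings, average 8.5
    "Movie C": [7, 8] * 250,    # 500 ratings, average 7.5
}
ranking = top_movies(sample)
print(ranking)
```

Without the `min_ratings` check, "Movie A" would top the list on the strength of one rating, which is exactly the problem described in the introduction.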

4. Calculate the Z value corresponding to 95% confidence

At 95% confidence, the remaining 5% is split equally between the two tails of the distribution, so each tail has an area of 0.05/2 = 0.025. We therefore need the Z value for a tail area of 0.025, that is, a cumulative area of 1 - 0.025 = 0.975.

Look up the Z table and find 0.975 in the body of the table. Reading left along that row gives 1.9, and reading up that column gives 0.06. Adding the two numbers together gives Z = 1.96.
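The table lookup can also be done in code. This sketch uses the inverse cumulative distribution function of the standard normal distribution from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

confidence = 0.95
tail_area = (1 - confidence) / 2          # area in each tail
z = NormalDist().inv_cdf(1 - tail_area)   # Z value at the 0.975 cumulative area
print(round(z, 2))
```

Changing `confidence` to another level, such as 0.90 or 0.99, gives the corresponding Z value without a table.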

5. Calculate the confidence interval corresponding to 95% confidence

Calculate the confidence interval:

The first step is to calculate the mean, standard deviation, and standard error of the given sample. The standard error is the sample standard deviation divided by the square root of the sample size.

The second step is to choose the confidence level. The commonly used confidence level is 95%.

The third step is to calculate the lower and upper limits of the confidence interval [a, b]. The Z value is obtained as described above, so it is easy to get:

a = sample mean - Z × standard error

b = sample mean + Z × standard error
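The three steps can be sketched as follows. The interval is centred on the mean computed from the sample, and the standard error is taken as the sample standard deviation divided by the square root of the sample size. The function name and the ratings are made up for illustration, and for a sample this small a t value would often be used instead of Z, so treat the numbers as a demonstration only.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence=0.95):
    """Return (a, b), the Z-based confidence interval for the mean."""
    # Step 1: mean, standard deviation, standard error of the sample
    n = len(sample)
    sample_mean = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Step 2: Z value for the chosen confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Step 3: lower and upper limits
    return sample_mean - z * standard_error, sample_mean + z * standard_error

scores = [8, 9, 7, 8, 9, 8, 7, 9, 8, 8]  # made-up ratings
a, b = confidence_interval(scores)
print(f"[{a:.2f}, {b:.2f}]")
```

With more ratings the standard error shrinks, so the interval around the same mean becomes narrower.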

You can find all of these points on the Internet, but the most important thing in learning is to sort out the logic that connects them. Individual facts are like beads lying around one by one, and the logical system of knowledge is the string that threads them together. I hope that careful summarizing like this helps form such a string of beads. That is where the real value lies, because there is no shortage of ways to acquire the beads themselves.
