Several typical application scenarios and calculation methods of hypothesis testing

For more articles, follow the author's WeChat official account: Code worker's note.

Today we will review several typical application scenarios and calculation methods of hypothesis testing.

I. Concept of hypothesis testing

Hypothesis testing is the analysis of sample data to test whether an assertion about a population is true.

The statement about the population that is to be tested is called the null hypothesis, generally written $H_0$.
The hypothesis opposed to the null hypothesis is called the alternative hypothesis, usually written $H_a$.

Generally speaking, hypothesis testing starts by presuming that $H_0$ holds. The analysis proceeds from that assumption, and $H_0$ is rejected only when the evidence (sample data and sample statistics) shows that it is unlikely to hold.

II. General steps of hypothesis testing

Establish the null hypothesis $H_0$ and the alternative hypothesis $H_a$.
Randomly select a sample set from the population and calculate sample statistics (such as the mean and the standard deviation) from it.
Depending on the scenario, calculate the test value from the sample statistics with the corresponding formula (see the next section for the specific formulas).
- The test value is a standardized quantity: it indicates how many standard deviations the current sample statistic lies from the target value.
- The farther the test value is from the center, the less likely it is to occur, as shown in the figure below (here $\pm 2$ is the test value):
Convert the test value to a p value, for example by looking it up in a table.
- The p value is the probability, if $H_0$ is true, of obtaining a test value at least as extreme as the one observed.
Decide whether to reject $H_0$ based on the p value.
- If the p value is less than a chosen threshold, $H_0$ is rejected, on the grounds that something unlikely under $H_0$ has happened.
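To make the remark about distance from the center concrete, here is a minimal sketch that assumes the test value follows a standard normal distribution under $H_0$. The helper name `two_sided_tail` is made up for this illustration; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist()


def two_sided_tail(z: float) -> float:
    """Probability that a standard normal value lies at least |z| away from 0."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))


# The tail probability shrinks quickly as the test value moves away from 0.
for z in (1.0, 2.0, 3.0):
    print(f"|Z| >= {z}: {two_sided_tail(z):.4f}")
```

Under that assumption, a test value of $\pm 2$ corresponds to a two-sided tail probability of roughly 0.046, compared with roughly 0.32 for $\pm 1$.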

III. Application scenarios and calculation of the test value

1. Scenario 1: Test the mean value of a population

For example, someone claims that the average weight of adults is 70 kg. Now we are going to test this statement.

Hypotheses:
- $H_0: \mu=\mu_0$
- $H_a: \mu>\mu_0$
Where:
- $\mu$ represents the true population mean (the average weight of all adults)
- $\mu_0$ represents the population mean assumed in $H_0$ ($\mu_0 = 70$)
Formula for the test value:

$Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}$

Where:
- $n$ represents the number of samples;
- $\bar{x}$ represents the mean of the sample;
- $\sigma$ represents the standard deviation of the sample.
If 100 samples $x_1, x_2, \ldots, x_{100}$ are taken, then:
- $n=100$
- $\bar{x} = \frac{x_1 + x_2 + \cdots + x_{100}}{100}$
- $\sigma = \sqrt{\frac{\sum_{i=1}^{100}(x_i - \bar{x})^2}{n-1}}$
The $Z$ obtained from the formula is the test value.
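The calculation for Scenario 1 can be sketched in Python as follows. The weights are simulated, made-up numbers, and the function name `one_sample_z` is an assumption of this example. The p value is read from the upper tail only, because $H_a$ here is $\mu > \mu_0$.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def one_sample_z(sample, mu0):
    """Test value Z = (x_bar - mu0) / (sigma / sqrt(n)) for a single sample."""
    n = len(sample)
    x_bar = mean(sample)
    sigma = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    return (x_bar - mu0) / (sigma / math.sqrt(n))


# Simulated weights in kg; these are made-up numbers, not real measurements.
random.seed(42)
weights = [random.gauss(71.0, 8.0) for _ in range(100)]

z = one_sample_z(weights, mu0=70.0)
# H_a is mu > mu0, so only the upper tail counts toward the p value.
p_value = 1.0 - NormalDist().cdf(z)
print(f"Z = {z:.3f}, p value = {p_value:.4f}")
```

`statistics.stdev` divides by $n-1$, which matches the formula for $\sigma$ above.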

2. Scenario 2: Test the proportion of a population that meets a certain condition

For example, we want to test the claim that 50% of all adults weigh more than 70kg.

Hypotheses:
- $H_0: p = p_0$
- $H_a: p \neq p_0$
Where:
- $p_0$ represents the proportion assumed in the claim, which in this case is 50%
Formula for the test value:

$Z = \frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$

Where:
- $\hat{p}$ represents the actual proportion of individuals in the sample set who meet the condition (i.e. weigh more than 70 kg);
- $n$ is the number of samples
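A short sketch of the Scenario 2 calculation, using hypothetical counts (58 of 100 sampled adults weighing more than 70 kg). The function name `one_proportion_z` is made up for this example, and the p value is two-sided because $H_a$ is $p \neq p_0$.

```python
import math
from statistics import NormalDist


def one_proportion_z(successes, n, p0):
    """Test value (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    p_hat = successes / n  # proportion of the sample that meets the condition
    return (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)


# Hypothetical sample: 58 of 100 adults weigh more than 70 kg.
z = one_proportion_z(successes=58, n=100, p0=0.5)
# H_a is p != p0, so both tails count toward the p value.
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
print(f"Z = {z:.2f}, p value = {p_value:.4f}")
```

With these made-up counts the test value is 1.6 and the p value is about 0.11.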

3. Scenario 3: Compare the average values of two populations

For example, someone claims that adults who smoke weigh the same as non-smokers on average.

Hypotheses:
- $H_0: \mu_x - \mu_y = 0$
- $H_a: \mu_x - \mu_y \neq 0$
Where:
- $\mu_x$ and $\mu_y$ represent the average weight of population $X$ (smokers) and population $Y$ (non-smokers) respectively
Formula for the test value:

$Z = \frac{(\bar{x}-\bar{y})-0}{\sqrt{\frac{s_x^2}{n_1}+\frac{s_y^2}{n_2}}}$

Where:
- $\bar{x}$ denotes the mean (average weight) of the sample taken from population $X$ (smokers)
- $\bar{y}$ denotes the mean (average weight) of the sample taken from population $Y$ (non-smokers)
- $s_x^2$ represents the variance of the sample taken from population $X$ (smokers)
- $s_y^2$ represents the variance of the sample taken from population $Y$ (non-smokers)
- $n_1$ represents the number of samples taken from population $X$ (smokers)
- $n_2$ represents the number of samples taken from population $Y$ (non-smokers)
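The Scenario 3 formula can be sketched as below. Both samples are simulated, made-up data, and `two_sample_z` is a name chosen for this example. `statistics.variance` is the sample variance (it divides by $n-1$), which is assumed here to be what $s_x^2$ and $s_y^2$ stand for.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_z(x, y):
    """Test value ((x_bar - y_bar) - 0) / sqrt(s_x^2 / n_1 + s_y^2 / n_2)."""
    n1, n2 = len(x), len(y)
    standard_error = math.sqrt(variance(x) / n1 + variance(y) / n2)
    return ((mean(x) - mean(y)) - 0.0) / standard_error


# Simulated weights in kg for the two groups; made-up numbers for illustration.
random.seed(0)
smokers = [random.gauss(69.0, 9.0) for _ in range(80)]
non_smokers = [random.gauss(71.0, 9.0) for _ in range(120)]

z = two_sample_z(smokers, non_smokers)
# H_a is mu_x - mu_y != 0, so the p value is two-sided.
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
print(f"Z = {z:.3f}, p value = {p_value:.4f}")
```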

4. Scenario 4: Test the mean difference between two paired variables

For example, someone claims that adults weigh the same when they wake up in the morning as they did before they went to bed at night.

When paired data are compared, the two values of each pair are first subtracted to form a new sample set, and all subsequent analysis is based on this new sample set. In this example, the new sample set contains, for each person in the original sample, the difference between their morning and evening weight.

Hypotheses:
- $H_0: \mu_d = 0$
- $H_a: \mu_d \neq 0$
Where:
- $\mu_d$ represents the average weight difference between morning and evening for all adults
Formula for the test value:

$Z = \frac{\bar{d}-\mu_d}{\frac{s_d}{\sqrt{n}}}$

Where:
- $\bar{d}$ and $s_d$ are the mean and standard deviation of the morning-evening weight differences in the sample set
- $\mu_d$ takes the value assumed in $H_0$, which is 0 here
- $n$ is the number of samples (pairs)
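A minimal sketch of the paired calculation in Scenario 4. The morning and evening weights are simulated, made-up data, and the function name `paired_test_value` is an assumption of this example. With 50 pairs, the sample is treated as large enough to read the p value from the normal distribution.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_test_value(first, second, mu_d=0.0):
    """Test value (d_bar - mu_d) / (s_d / sqrt(n)) computed on pairwise differences."""
    if len(first) != len(second):
        raise ValueError("paired data must have the same length")
    # Reduce the paired data to a single sample of differences.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    return (mean(diffs) - mu_d) / (stdev(diffs) / math.sqrt(n))


# Simulated morning and evening weights (kg) for 50 people; made-up numbers.
random.seed(1)
evening = [random.gauss(70.0, 8.0) for _ in range(50)]
morning = [w - random.gauss(0.4, 0.5) for w in evening]

test_value = paired_test_value(morning, evening)
# H_a is mu_d != 0, so the p value is two-sided.
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(test_value)))
print(f"test value = {test_value:.3f}, p value = {p_value:.4f}")
```

Once the differences are formed, the calculation is the same as the single-sample test on the mean, applied to the differences.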

5. Scenario 5: Test the difference between the proportions of two populations that meet a certain condition

For example, someone claims that the proportion of adult men who smoke is the same as adult women.

There are two populations, men and women, and correspondingly two sample sets, one of men and one of women.

Hypotheses:
- $H_0: p_1 - p_2 = 0$
- $H_a: p_1 - p_2 \neq 0$
Formula for the test value:

$Z = \frac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}$

Where:
- $\hat{p}$ is the proportion of smokers in the pooled sample, i.e. when the male and female samples are combined
- $\hat{p}_1$ denotes the proportion of smokers in the male sample
- $\hat{p}_2$ denotes the proportion of smokers in the female sample
- $n_1$ is the number of men in the sample
- $n_2$ is the number of women in the sample
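The pooled-proportion calculation for Scenario 5 can be sketched as follows. The counts are hypothetical, and `two_proportion_z` is a name made up for this example.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Test value ((p1_hat - p2_hat) - 0) / sqrt(p_hat * (1 - p_hat) * (1/n1 + 1/n2))."""
    p1_hat = x1 / n1               # proportion in the first sample
    p2_hat = x2 / n2               # proportion in the second sample
    p_pool = (x1 + x2) / (n1 + n2)  # proportion when both samples are combined
    standard_error = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n1 + 1.0 / n2))
    return ((p1_hat - p2_hat) - 0.0) / standard_error


# Hypothetical counts: 60 smokers among 200 men, 40 smokers among 200 women.
z = two_proportion_z(x1=60, n1=200, x2=40, n2=200)
# H_a is p1 - p2 != 0, so the p value is two-sided.
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
print(f"Z = {z:.3f}, p value = {p_value:.4f}")
```

Here the pooled proportion is $(60 + 40) / (200 + 200) = 0.25$, which plays the role of $\hat{p}$ in the formula.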

IV. Test method

With the test value calculated in the previous step, the p value can be looked up in a table:

When the sample size $n$ is large, look up the test value in the Z-distribution table to obtain the p value.
When the sample size is small ($n < 30$), look up the test value in the t-distribution table with $n-1$ degrees of freedom to obtain the p value.

The p value represents the probability, when $H_0$ is true, of obtaining a result at least as extreme as the current sample:

If the p value is smaller than the significance level (the target threshold), a low-probability event has occurred under the assumption that $H_0$ holds, so $H_0$ can be rejected.
If the p value is greater than the significance level, the event represented by the sample is not improbable under $H_0$, and there is not enough evidence to reject $H_0$.
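As a sketch of the lookup and the decision rule for the large-sample case, the Z-table lookup can be replaced by the standard normal CDF. The function names `z_p_value` and `decide`, the `alternative` labels, and the default significance level of 0.05 are all assumptions of this example. Python's standard library has no t-distribution, so the small-sample case is not covered here; a statistics package such as SciPy (`scipy.stats.t.sf`) would be one option for it.

```python
from statistics import NormalDist


def z_p_value(z, alternative="two-sided"):
    """Convert a Z test value to a p value using the standard normal distribution."""
    cdf = NormalDist().cdf(z)
    if alternative == "greater":    # H_a: parameter > hypothesized value
        return 1.0 - cdf
    if alternative == "less":       # H_a: parameter < hypothesized value
        return cdf
    if alternative == "two-sided":  # H_a: parameter != hypothesized value
        return 2.0 * min(cdf, 1.0 - cdf)
    raise ValueError(f"unknown alternative: {alternative!r}")


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


for z in (1.0, 2.5):
    p = z_p_value(z)
    print(f"Z = {z}: p value = {p:.4f} -> {decide(p)}")
```

With a threshold of 0.05, a test value of 1.0 does not lead to rejection, while a test value of 2.5 does.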

Z distribution table:

T distribution table: