This article comes from my personal WeChat public account, TechFlow. Original writing isn't easy, so please consider following.


Today, in the sixth installment of our probability and statistics series, we'll look at variance and the concepts related to it.


Definition of variance


Variance comes up constantly in daily life; it is mainly used to describe how spread out a sample is. As a simple example, suppose we buy a bag of crisps. Each bag is supposed to contain a fixed number of crisps; let's assume the average is 50 per bag. Even with machine filling, it's impossible for every bag to contain exactly 50 crisps; there will always be some small deviation. The mean alone cannot measure this deviation.

Now suppose there are two brands of crisps. They taste the same, and both average 50 crisps per bag. But half of brand A's bags contain 80 crisps and the other half contain 20, while 99% of brand B's bags contain between 45 and 55. Which brand would you buy (assuming you can't weigh the bags)?

In modern industry, the concept of variance is basically inseparable from products coming off the factory line. The lower the variance, the stronger the factory's production capability, and the more consistent every product is. Conversely, a high variance means many defects and poor consistency. In other words, variance measures the expected deviation of a sample from its mean.

Intuitively, the definition should read:

Var(X) = E( |X − E(X)| )

But because of the absolute value, we usually square the deviation instead to get rid of it. Written out:

Var(X) = E[ (X − E(X))² ]


The E here stands for expectation; that is how it is written in statistics. If that notation is unfamiliar, you can expand it as:

Var(X) = (1/N) · Σ (x_i − μ)²

Here N is the number of samples and μ is the sample mean. Var is short for Variance, and the variance can also be written as D(X).

Since the variance is obtained by squaring, we can also take its square root, which gives the standard deviation σ. We can write it as:

σ = sqrt(Var(X)) = sqrt(D(X))
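To make these formulas concrete, here is a minimal Python sketch (using NumPy and made-up numbers) that computes the variance and standard deviation from the expanded formula and checks them against NumPy's built-ins:

```python
import numpy as np

x = np.array([47, 52, 50, 49, 53, 48, 51, 50], dtype=float)  # e.g. crisps per bag

mu = x.mean()                    # sample mean
var = np.mean((x - mu) ** 2)     # Var(X) = (1/N) * sum((x_i - mu)^2)
std = np.sqrt(var)               # standard deviation = sqrt(Var(X))

# NumPy's built-ins use the same 1/N (population) formula by default
assert np.isclose(var, x.var())
assert np.isclose(std, x.std())
print(var, std)
```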


Properties of variance


There are some well-known properties of variance. If X is a random variable and C is a constant, then:

Var(C·X) = C² · Var(X)

In other words, if we multiply every value in the sample by a constant C, the variance is multiplied by C². This makes sense: each deviation from the mean is scaled by a factor of C, and since the variance squares those deviations, it grows by a factor of C². We can easily prove this by substituting into the formula above.
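A quick numeric check of this property (just a sketch with arbitrary numbers):

```python
import numpy as np

x = np.array([47, 52, 50, 49, 53, 48], dtype=float)
C = 3.0

# Var(C * X) = C^2 * Var(X)
assert np.isclose((C * x).var(), C ** 2 * x.var())
```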

The next property is:

Var(X + C) = Var(X)

That is, adding a constant to every value in the sample leaves the variance unchanged. If our samples are not scalars but vectors, the formula extends naturally: adding a constant vector to every sample leaves the variance unchanged. This also makes intuitive sense: adding a constant vector simply shifts the whole sample some distance in the direction of that vector, which does not affect how the data is distributed around its mean.

If the variance of a sample X is 0, then every value in the sample is the same; the sample is effectively a constant.
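These two facts are just as easy to check numerically (again, only a sketch):

```python
import numpy as np

x = np.array([47, 52, 50, 49, 53, 48], dtype=float)

# Adding a constant shifts the sample but leaves the variance unchanged
assert np.isclose((x + 100.0).var(), x.var())

# A sample whose values are all identical has variance 0
assert np.full(10, 50.0).var() == 0.0
```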

The following property is slightly more complicated:

Var(X) = E(X²) − (E(X))²

In other words, the variance equals the expectation of the square of the variable minus the square of its expectation. This is hard to see directly from the definition and needs a short derivation:

Var(X) = E[ (X − E(X))² ]
       = E[ X² − 2·X·E(X) + (E(X))² ]
       = E(X²) − 2·E(X)·E(X) + (E(X))²
       = E(X²) − (E(X))²
In some cases it is inconvenient to compute the variance of a sample directly, but the expectation of its square is easy to compute; then we can use this formula to substitute.
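The identity is also easy to verify numerically; a small sketch:

```python
import numpy as np

x = np.array([47, 52, 50, 49, 53, 48], dtype=float)

# Var(X) = E(X^2) - E(X)^2
assert np.isclose(x.var(), np.mean(x ** 2) - x.mean() ** 2)
```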


Variance and covariance


In machine learning, variance is rarely used directly; it appears more often in feature analysis. By looking at the variance of a feature, we can get a sense of how dispersed it is and decide whether the feature needs further processing. For some models, features with too large a variance can make training hard to converge, or hurt the quality of the converged result; in those cases we often need to standardize the feature values.
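For example, here is a minimal z-score standardization sketch (the function name and data are made up for illustration; `features` is assumed to be a NumPy array of shape (n_samples, n_features)):

```python
import numpy as np

def standardize(features: np.ndarray) -> np.ndarray:
    """Scale each feature (column) to zero mean and unit variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns
    return (features - mu) / sigma

features = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(standardize(features).std(axis=0))  # each column now has std 1
```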

Besides variance, there is a closely related concept, covariance, which is often used to measure the relationship between two variables.

In fact, the formula for covariance is closely tied to that of variance, so let's derive it briefly.

First, let's look at D(X + Y), where X and Y are two variables. D(X + Y) is simply the variance of X + Y; we want to see how it relates to D(X) and D(Y).

We can derive it from the definition of variance:

D(X + Y) = (1/N) · Σ [ (x_i + y_i) − (μ_x + μ_y) ]²

The 1/N here is a constant, so we can ignore it and just look at the sum. Let's expand it:

Σ [ (x_i − μ_x) + (y_i − μ_y) ]²

Expanding the square and simplifying gives:

Σ (x_i − μ_x)² + Σ (y_i − μ_y)² + 2 · Σ (x_i − μ_x)(y_i − μ_y)

In this expression, the first two terms, Σ (x_i − μ_x)² and Σ (y_i − μ_y)², are fixed: they stay the same whether or not X and Y are related. The last term is different; it depends on the correlation between X and Y.

We can use this term to measure the correlation between X and Y, and that is the formula for covariance:

Cov(X, Y) = E[ (X − E(X)) · (Y − E(Y)) ] = (1/N) · Σ (x_i − μ_x)(y_i − μ_y)


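Here is a small sketch that computes the covariance from this definition and checks the decomposition of D(X + Y) we just derived (numbers are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Cov(X, Y) = (1/N) * sum((x_i - mu_x) * (y_i - mu_y))
cov = np.mean((x - x.mean()) * (y - y.mean()))

# D(X + Y) = D(X) + D(Y) + 2 * Cov(X, Y)
assert np.isclose((x + y).var(), x.var() + y.var() + 2 * cov)
print(cov)
```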
So covariance reflects not the dispersion of a single variable, but the correlation between two variables. If that is still hard to see, no matter; let's apply a simple transformation and divide it by the standard deviations of both variables:

ρ(X, Y) = Cov(X, Y) / (σ_X · σ_Y)


This form looks very much like the cosine of the angle between two vectors, and it is known as the Pearson correlation coefficient. Like the cosine, it reflects the correlation between the two distributions: if ρ is greater than 0, the two variables are positively correlated; if it is less than 0, they are negatively correlated. One can show that ρ always lies between −1 and 1.

If ρ equals 0, then X and Y are uncorrelated, meaning there is no linear relationship between them (which does not necessarily make them independent). If ρ equals 1, then we can find coefficients w > 0 and b such that Y = wX + b.
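A minimal sketch computing the Pearson coefficient from the formula and checking it against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / (x.std() * y.std())   # rho = Cov(X, Y) / (sigma_x * sigma_y)

assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(rho)  # always a value between -1 and 1
```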


In closing


In machine learning, measuring the correlation between two sets of variables is very important. Essentially, what a machine learning model does is make predictions by mining the correlation between the features and the target; if a feature is completely uncorrelated with the target, it is useless to the model, no matter which model we choose.

Therefore, we often measure the importance of features by computing the Pearson correlation between each feature and the label, and use it to select features or decide how to reprocess them. If we only look at the Pearson formula on its own, it is hard to fully understand and remember; it becomes much easier once we trace the whole chain starting from variance. Even if we forget the formula later, we can re-derive it from these relationships.
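As an illustration, here is a rough sketch of ranking features by the absolute Pearson correlation with the label (the data here is synthetic and the variable names are made up; in practice X would be your feature matrix and y the label):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # three hypothetical features
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)   # label mostly driven by feature 0

# Pearson correlation between each feature column and the label
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(-np.abs(pearson))           # most relevant features first
print(pearson, ranking)
```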

That's all for today's article. Original writing isn't easy; scan the QR code to follow me for more articles.