Background

In work and study, we often need to evaluate data along certain dimensions. In the example below, we want to compare the per-unit profit of goods on two platforms (for instance, to tell merchants which platform is more profitable). The example is simplified for convenience and is not fully rigorous.

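| Goods | Profit per unit | Order quantity | Platform |
|---|---|---|---|
| A | 10 | 10 | A treasure |
| B | 5 | 30 | A treasure |
| C | 30 | 20 | A treasure |
| D | 100 | 1 | A treasure |
| B | 10 | 1 | A east |
| C | 35 | 20 | A east |
| D | 100 | 40 | A east |
| E | 40 | 80 | A east |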

Comparing the two platforms directly runs into problems. For example, product A is sold on A treasure but not on A east, while product E is sold on A east but not on A treasure. In other words, the data we get does not let us compare under perfectly controlled variables (the data distributions are inconsistent), so we need some way to keep the inconsistent distributions from distorting the evaluation metric.
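The mismatch between the two catalogues can be checked mechanically. The following is a minimal sketch; the variable names are illustrative, and the product sets are copied from the data table.

```python
# Goods listed on each platform, copied from the data table.
treasure_goods = {"A", "B", "C", "D"}
east_goods = {"B", "C", "D", "E"}

only_treasure = treasure_goods - east_goods  # sold on A treasure only
only_east = east_goods - treasure_goods      # sold on A east only
shared = treasure_goods & east_goods         # sold on both platforms

print(sorted(only_treasure), sorted(only_east), sorted(shared))
# ['A'] ['E'] ['B', 'C', 'D']
```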

Several ways to compute the evaluation metric

Simple evaluation method

Once we have the data, the most intuitive approach is to compute the per-unit profit directly as total profit divided by total quantity sold, and then tell the merchant which platform has the higher per-unit profit. Following this line of thinking, the results are:

A treasure

$$\frac{10 \times 10 + 5 \times 30 + 30 \times 20 + 100 \times 1}{10 + 30 + 20 + 1} = \frac{950}{61} \approx 15.57$$

A east

$$\frac{10 \times 1 + 35 \times 20 + 100 \times 20 + 40 \times 80}{1 + 20 + 20 + 80} = \frac{5910}{121} \approx 48.84$$

So the conclusion is that the per-unit profit on A east is 213.7% higher than on A treasure!
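The simple method can be sketched in a few lines of Python. This is only an illustration: the function and variable names are made up here, and the quantities are the ones that appear in the formulas above (in particular, D on A east is taken as 20 orders, which is an assumption of this sketch). The last digit of the percentage may differ slightly from the text, which rounds before dividing.

```python
# product -> (profit per unit, order quantity), per platform.
# ASSUMPTION: D on A east uses quantity 20, as in the worked formula.
treasure = {"A": (10, 10), "B": (5, 30), "C": (30, 20), "D": (100, 1)}
east = {"B": (10, 1), "C": (35, 20), "D": (100, 20), "E": (40, 80)}

def unit_profit(sales):
    """Total profit divided by total quantity sold."""
    total_profit = sum(profit * qty for profit, qty in sales.values())
    total_qty = sum(qty for _, qty in sales.values())
    return total_profit / total_qty

treasure_unit = unit_profit(treasure)  # 950 / 61
east_unit = unit_profit(east)          # 5910 / 121

# Relative lift of A east over A treasure, in percent.
lift_pct = 100 * (east_unit / treasure_unit - 1)
print(f"{treasure_unit:.2f} {east_unit:.2f} {lift_pct:.1f}%")
```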

Why is it so much higher? A east sells a lot of products D and E, while A treasure mostly sells products A and B. The two platforms differ, so their data distributions differ considerably, and comparing them directly is not objective: a merchant wants to know on which platform the same goods earn more.

Why not compare each item individually?

We could, but with many goods it takes much more time to analyze them one by one. Below we discuss how to describe the comparison more objectively with a single metric.

A reasonable method to control variables: taking the intersection

Building on the comparison above, a slightly more reasonable approach is to compare only the goods sold on both platforms, i.e., B, C, and D.

A treasure

$$\frac{5 \times 30 + 30 \times 20 + 100 \times 1}{30 + 20 + 1} = \frac{850}{51} \approx 16.67$$

A east

$$\frac{10 \times 1 + 35 \times 20 + 100 \times 20}{1 + 20 + 20} = \frac{2710}{41} \approx 66.10$$

So the conclusion is that the per-unit profit on A east is 296.5% higher than on A treasure!

The result looks even more extreme.

Simply taking the intersection does not solve the inconsistent data distribution: product D still has a large influence on the result, because A east sells many units of it while A treasure sells very few. What we would rather compare is product C, which sells well on both platforms.
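The intersection-based comparison, and the outsized role of product D, can be sketched as follows. The names are illustrative, and the quantities are again those used in the formulas above (D on A east taken as 20 orders, an assumption of this sketch).

```python
# product -> (profit per unit, order quantity), per platform.
# ASSUMPTION: D on A east uses quantity 20, as in the worked formula.
treasure = {"A": (10, 10), "B": (5, 30), "C": (30, 20), "D": (100, 1)}
east = {"B": (10, 1), "C": (35, 20), "D": (100, 20), "E": (40, 80)}

def unit_profit_on(sales, goods):
    """Per-unit profit restricted to the given goods."""
    total_profit = sum(sales[g][0] * sales[g][1] for g in goods)
    total_qty = sum(sales[g][1] for g in goods)
    return total_profit / total_qty

# Goods sold on both platforms.
shared = sorted(treasure.keys() & east.keys())  # ['B', 'C', 'D']

treasure_unit = unit_profit_on(treasure, shared)  # 850 / 51
east_unit = unit_profit_on(east, shared)          # 2710 / 41

# Share of each platform's shared-goods units that come from product D.
d_share_treasure = treasure["D"][1] / sum(treasure[g][1] for g in shared)  # 1 / 51
d_share_east = east["D"][1] / sum(east[g][1] for g in shared)              # 20 / 41

print(f"{treasure_unit:.2f} {east_unit:.2f}")
print(f"D share: {d_share_treasure:.1%} vs {d_share_east:.1%}")
```

Product D accounts for about 2% of the shared units on A treasure but nearly half on A east, which is why the high-margin product dominates one average and barely registers in the other.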

A reasonable method to control variables: taking the intersection and weighting samples

So, after taking the intersection, we also need to assign each sample a weight that says how much we trust it. The principle is:

  • When product X sells a lot on both platforms, we consider the sample highly credible
  • When product X sells a lot on only one platform, we consider the sample less credible

So the weight can be defined as follows (for reference only):


$$w_x = s_x^{\text{east}} \times s_x^{\text{treasure}}$$

Here $s_x^{\text{east}}$ is the quantity of product x sold on A east, and $s_x^{\text{treasure}}$ is the quantity of product x sold on A treasure.

This gives the following weights for B, C, and D:

| Goods | Weight w | Normalized weight |
|---|---|---|
| B | 30 | 0.0638 |
| C | 400 | 0.8511 |
| D | 40 | 0.0851 |
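The weights in the table can be reproduced with a short sketch. It assumes the weight of a product is the product of its order quantities on the two platforms, which is what matches the tabulated values; the function names are illustrative, and the quantities are taken from the data table.

```python
# product -> (quantity on A treasure, quantity on A east),
# for the goods sold on both platforms (from the data table).
quantities = {"B": (30, 1), "C": (20, 20), "D": (1, 40)}

def product_weights(quantities):
    """Weight of each product: product of its quantities on both platforms."""
    return {g: q_treasure * q_east for g, (q_treasure, q_east) in quantities.items()}

def normalize(weights):
    """Scale the weights so that they sum to 1."""
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

weights = product_weights(quantities)
normalized = normalize(weights)

for g in weights:
    print(g, weights[g], round(normalized[g], 4))
# B 30 0.0638
# C 400 0.8511
# D 40 0.0851
```

A product that sells well on only one platform gets a small weight because the other factor is small, while a product that sells well on both gets a large one.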

Weighting each product by these weights, the results are as follows:

A treasure

$$\frac{5 \times 30 \times 30 + 30 \times 20 \times 400 + 100 \times 1 \times 40}{30 \times 30 + 20 \times 400 + 1 \times 40} = \frac{248500}{8940} \approx 27.80$$

A east

$$\frac{10 \times 1 \times 30 + 35 \times 20 \times 400 + 100 \times 20 \times 40}{1 \times 30 + 20 \times 400 + 20 \times 40} = \frac{360300}{8830} \approx 40.80$$

So the conclusion is that the per-unit profit on A east is 46.8% higher than on A treasure!

Sounds a lot more reasonable.
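The weighted comparison can be sketched as below. This is an illustration under stated assumptions: the weights are copied from the weight table, the quantities are the ones that appear in the weighted formulas (D on A east taken as 20 orders), and the function name is made up for this example.

```python
# product -> (profit per unit, order quantity), shared goods only.
# ASSUMPTION: D on A east uses quantity 20, as in the worked formula.
treasure = {"B": (5, 30), "C": (30, 20), "D": (100, 1)}
east = {"B": (10, 1), "C": (35, 20), "D": (100, 20)}

# Per-product weights, copied from the weight table.
weights = {"B": 30, "C": 400, "D": 40}

def weighted_unit_profit(sales, weights):
    """Weighted total profit divided by weighted total quantity."""
    numerator = sum(p * q * weights[g] for g, (p, q) in sales.items())
    denominator = sum(q * weights[g] for g, (p, q) in sales.items())
    return numerator / denominator

treasure_unit = weighted_unit_profit(treasure, weights)  # 248500 / 8940
east_unit = weighted_unit_profit(east, weights)          # 360300 / 8830

lift_pct = 100 * (east_unit / treasure_unit - 1)
print(f"{treasure_unit:.2f} {east_unit:.2f} {lift_pct:.1f}%")
```

Using the normalized weights instead would give the same per-unit profits, because the normalization constant appears in both the numerator and the denominator and cancels.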

Conclusion

The above is only a simple weighting method for obtaining a more objective evaluation when sample distributions are inconsistent. What deserves more attention is:

  • Does the evaluation suffer from inconsistent sample distributions? If so, how do we handle it?
  • Are the evaluation metrics credible?

How should evaluation under inconsistent sample distributions be handled in general? Is there an approach comparable to meta-analysis?