• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • Article address: www.showmeai.tech/article-det…
  • Statement: all rights reserved; for reprints, please contact the platform and the author, and cite the source

Some people summarize the core of data analysis in six characters: comparison, subdivision, and traceability. Also known as the “three axes” of data analysis, these support its core applications. Specifically:

Comparison: setting two things against each other.

  • Horizontal comparison: comparing with “others”, for example, the turnover rates of two companies.
  • Vertical comparison: comparing a company with itself over time, for example, last year’s turnover rate versus this year’s.
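As a sketch of the two directions, using pandas and hypothetical turnover-rate figures (the companies and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical turnover rates for two companies over two years (illustrative).
df = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "year": [2021, 2022, 2021, 2022],
    "turnover_rate": [0.12, 0.15, 0.10, 0.09],
})

# Horizontal comparison: company A vs. company B in the same year.
horizontal = df[df["year"] == 2022].set_index("company")["turnover_rate"]

# Vertical comparison: the same company against itself across years.
vertical = df[df["company"] == "A"].set_index("year")["turnover_rate"]

print(horizontal.to_dict())   # turnover by company in 2022
print(vertical.diff())        # year-over-year change for company A
```

The same tidy long format supports both directions: filter on one column and compare along the other.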

Subdivision: analyzing data by adding dimensions and reducing granularity.

  • Adding dimensions: for example, analyzing the turnover rate by the department dimension.
  • Reducing granularity: reducing the degree of data aggregation, for example, analyzing by month rather than by year.

Traceability: when comparison and subdivision have locked in a specific dimension and granularity but still yield no conclusion, we need to go back to the original data, examine it closely, and find inspiration in it.

I. Data “comparison”

Data sitting on its own is meaningless; the value of data analysis only emerges when data are compared. Comparison is simple: just set A against B. However, a comparison without comparability is worthless.

1.1 Comparability of indicators

The comparability of indicators can be evaluated based on four “consistency” principles: consistency of objects, consistency of time attributes, consistency of definition and algorithm, and consistency of data sources.

(1) Consistent comparison objects

Object consistency is the most basic principle of comparison: tomato sales and pork sales are not comparable, precisely because the objects being compared are not the same.

(2) Consistent time attributes

The time attributes of the indicators should be consistent. Time is special: the season, month, and other time attributes of the objects must match. For example, a convenience store’s winter ice cream sales are not comparable with its summer sales, because the time attributes differ; a year-on-year comparison of winter sales, however, is valid.

(3) The definition is consistent with the algorithm

The definitions and calculation methods of the analysis objects should be consistent. For example, “youth” is defined differently by the National Bureau of Statistics of China (15–34 years old) and the Communist Youth League of China (14–28 years old), so the two resulting proportions of youth in the total population will certainly differ.

(4) Consistent data sources

The statistical samples should come from consistent data sources.

1.2 “Three Essentials” of data comparison

There are three “essentials” to keep in mind when doing comparative analysis: the comparison should be comparable, the differences should be significant, and the description should be comprehensive.

(1) The comparison should be comparable

The objects of a comparative analysis must satisfy the four “consistency” principles above.

(2) The difference should be significant

The differences between groups should be significant, while the differences within groups should be small. Commonly used significance tests are the t-test and analysis of variance (ANOVA).
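As an illustrative sketch (with synthetic data, not figures from the tutorial), a two-sample t-test and a one-way ANOVA can be run with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user order values under two strategies (illustrative only).
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=106, scale=10, size=50)

# Two-sample t-test: is the between-group difference in means significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# One-way ANOVA generalizes the comparison to three or more groups.
group_c = rng.normal(loc=103, scale=10, size=50)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```

A small p-value (commonly below 0.05) indicates that the between-group difference is unlikely to be pure chance.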

(3) The description should be comprehensive

When describing a set of data, report not only its general level (the mean) but also its fluctuation (e.g., the standard deviation). If the fluctuation is large, the mean represents the data as a whole poorly; quoting an average without its spread makes the conclusion much less reliable.
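To make this concrete, a minimal sketch with made-up numbers: two teams share the same mean but differ greatly in fluctuation, measured here by the sample standard deviation and the coefficient of variation:

```python
import numpy as np

# Two hypothetical teams with identical average daily sales (illustrative).
team_a = np.array([100, 101, 99, 100, 100])
team_b = np.array([60, 140, 80, 120, 100])

for name, x in [("A", team_a), ("B", team_b)]:
    mean = x.mean()
    std = x.std(ddof=1)          # sample standard deviation
    cv = std / mean              # coefficient of variation
    print(f"team {name}: mean={mean:.0f}, std={std:.2f}, cv={cv:.3f}")
```

Despite identical means, team B fluctuates far more, so the mean alone describes team B poorly.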

II. Data “subdivision”

Subdividing data by adding dimensions and reducing granularity lets us dig deeper into the data and uncover the patterns hidden in it.

2.1 Adding dimensions

A dimension is a column of a data table. In general, dimensions are qualitative attributes, such as the type of service a product provides or the geographical distribution of its users. By adding analysis dimensions, we change the angle from which we look at a problem, analyze the data at a more fine-grained level, gain more insight, and deepen the analysis.

For example, by adding the acquisition-source dimension to new-user retention, we can monitor the retention rate of new users from each source and spend a limited budget where it actually makes a difference.
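A minimal sketch of that idea with pandas; the column names and user records are assumptions for illustration, not part of the tutorial:

```python
import pandas as pd

# Hypothetical new users: acquisition channel and a 7-day retention flag.
users = pd.DataFrame({
    "channel":     ["ads", "ads", "ads", "organic", "organic", "referral"],
    "retained_d7": [0,     0,     1,     1,         1,         1],
})

# The overall retention rate hides the differences between channels.
overall = users["retained_d7"].mean()

# Adding the channel dimension reveals which sources actually retain users.
by_channel = users.groupby("channel")["retained_d7"].mean()
print(f"overall: {overall:.2f}")
print(by_channel)
```

One `groupby` on the new dimension turns a single blended number into a per-channel breakdown that can guide spending.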

2.2 Reducing granularity

Granularity is the degree of aggregation of data. The finest-grained data is the raw, unaggregated data.

For example, daily figures may be the raw data: the granularity is one day and the volume is large. Weekly statistics aggregate the daily data; the granularity becomes one week, and the number of rows drops to 1/7 of the original.
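The daily-to-weekly aggregation can be sketched with pandas’ `resample` (the dates and values are made up):

```python
import numpy as np
import pandas as pd

# Daily sales: the finest granularity available here (28 days, illustrative).
days = pd.date_range("2022-01-03", periods=28, freq="D")  # starts on a Monday
daily = pd.Series(np.arange(1, 29), index=days)

# Aggregating to weekly granularity: 28 rows collapse to 4 (1/7 of the data).
weekly = daily.resample("W").sum()
print(len(daily), "->", len(weekly))
```

The totals are preserved; only the level of aggregation changes, which is exactly why moving to finer granularity can expose patterns that the coarser view averaged away.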

III. Data “traceability”

Traceability means going into the details: looking at the original data and reflecting on user behavior. When doing data analysis, it is important to know whether the data you are analyzing is primary or secondary.

  • Primary data is the original, unprocessed data; it is the richest in content, but it may not be standardized.

  • Secondary data has been processed, or even pre-analyzed; it can be one-sided, stripped down, or shaped toward a particular topic, so analysis built on it may be biased.


Information and code download

The code for this tutorial series can be downloaded from ShowMeAI on GitHub and run locally in Python, or run on Google Colab.

The cheat sheets covered in this tutorial series can be downloaded at:

  • Pandas cheat sheet
  • Matplotlib cheat sheet
  • Seaborn cheat sheet

Expanded Reference materials

  • Python for Data Analysis, 2nd Edition
  • w3schools pandas tutorial
  • Kaggle’s Pandas

ShowMeAI related articles recommended

  • Introduction to Data Analysis
  • Analytical thinking of data
  • Business cognition and data exploration
  • Data cleaning and preprocessing
  • Business analysis and data mining
  • Data analysis tool map
  • Introduction to the statistical and data science computing tool library NumPy
  • NumPy and 1-dimensional array manipulation
  • NumPy and 2-dimensional array manipulation
  • NumPy and high-dimensional array manipulation
  • Introduction to the data analysis tool library Pandas
  • Pandas core operation functions
  • Pandas advanced functions for data transformation
  • Grouping and aggregating data with Pandas
  • Data visualization principles and methods
  • Data visualization based on Pandas
  • Seaborn tools and data visualization

ShowMeAI series tutorials recommended

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • Mathematical Foundations of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials