• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • Article address: www.showmeai.tech/article-det…
  • Statement: all rights reserved; for reprints, please contact the platform and the author, and cite the source

Some people summarize the core of data analysis in six characters: comparison, subdivision, and traceability. Also known as the “three axes” of data analysis, these support its core applications. Specifically:

Comparison: setting two things against each other.

  • Horizontal comparison: comparing with “others”, for example, the turnover rates of two companies.
  • Vertical comparison: comparing a company with itself over time, for example, last year’s turnover rate versus this year’s.
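As a sketch of the two directions, using pandas and hypothetical turnover-rate figures (the companies and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical turnover rates for two companies over two years (illustrative).
df = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "year": [2021, 2022, 2021, 2022],
    "turnover_rate": [0.12, 0.15, 0.10, 0.09],
})

# Horizontal comparison: company A vs. company B in the same year.
horizontal = df[df["year"] == 2022].set_index("company")["turnover_rate"]

# Vertical comparison: the same company against itself across years.
vertical = df[df["company"] == "A"].set_index("year")["turnover_rate"]

print(horizontal.to_dict())   # turnover by company in 2022
print(vertical.diff())        # year-over-year change for company A
```

The same tidy long format supports both directions: filter on one column and compare along the other.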

Subdivision: analyzing data by adding dimensions and reducing granularity.

  • Adding dimensions: for example, analyzing the turnover rate by the department dimension.
  • Reducing granularity: reducing the degree of data aggregation, for example, analyzing by month rather than by year.

Traceability: when comparison and subdivision have locked in a specific dimension and granularity but still yield no conclusion, we need to go back to the original data, examine it closely, and find inspiration in it.

I. Data “comparison”

Data sitting on its own is meaningless; the value of data analysis only emerges when data are compared. Comparison is simple: just set A against B. However, a comparison without comparability is worthless.

1.1 Comparability of indicators

The comparability of indicators can be evaluated based on four “consistency” principles: consistency of objects, consistency of time attributes, consistency of definition and algorithm, and consistency of data sources.

(1) Consistent comparison objects

Object consistency is the most basic principle of comparison: tomato sales and pork sales are not comparable, precisely because the objects being compared are not the same.

(2) Consistent time attributes

The time attributes of the indicators should be consistent. Time is special: the season, month, and other time attributes of the objects must match. For example, a convenience store’s winter ice cream sales are not comparable with its summer sales, because the time attributes differ; a year-on-year comparison of winter sales, however, is valid.

(3) The definition is consistent with the algorithm

The definitions and calculation methods of the analysis objects should be consistent. For example, “youth” is defined differently by the National Bureau of Statistics of China (15–34 years old) and the Communist Youth League of China (14–28 years old), so the two resulting proportions of youth in the total population will certainly differ.

(4) Consistent data sources

The statistical samples should come from consistent data sources.

1.2 “Three Essentials” of data comparison

There are three “essentials” to keep in mind when doing comparative analysis: the comparison should be comparable, the differences should be significant, and the description should be comprehensive.

(1) The comparison should be comparable

The objects of a comparative analysis must satisfy the four “consistency” principles above.

(2) The difference should be significant

The differences between groups should be significant, while the differences within groups should be small. Commonly used significance tests are the t-test and analysis of variance (ANOVA).
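As an illustrative sketch (with synthetic data, not figures from the tutorial), a two-sample t-test and a one-way ANOVA can be run with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user order values under two strategies (illustrative only).
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=106, scale=10, size=50)

# Two-sample t-test: is the between-group difference in means significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# One-way ANOVA generalizes the comparison to three or more groups.
group_c = rng.normal(loc=103, scale=10, size=50)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```

A small p-value (commonly below 0.05) indicates that the between-group difference is unlikely to be pure chance.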

(3) The description should be comprehensive

When describing a set of data, report not only its general level (the mean) but also its fluctuation (e.g., the standard deviation). If the fluctuation is large, the mean represents the data as a whole poorly; quoting an average without its spread makes the conclusion much less reliable.
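To make this concrete, a minimal sketch with made-up numbers: two teams share the same mean but differ greatly in fluctuation, measured here by the sample standard deviation and the coefficient of variation:

```python
import numpy as np

# Two hypothetical teams with identical average daily sales (illustrative).
team_a = np.array([100, 101, 99, 100, 100])
team_b = np.array([60, 140, 80, 120, 100])

for name, x in [("A", team_a), ("B", team_b)]:
    mean = x.mean()
    std = x.std(ddof=1)          # sample standard deviation
    cv = std / mean              # coefficient of variation
    print(f"team {name}: mean={mean:.0f}, std={std:.2f}, cv={cv:.3f}")
```

Despite identical means, team B fluctuates far more, so the mean alone describes team B poorly.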

II. Data “subdivision”

Subdividing data by adding dimensions and reducing granularity lets us dig deeper into the data and uncover the patterns hidden in it.

2.1 Adding dimensions

A dimension is a column of a data table. In general, dimensions are qualitative attributes, such as the type of service a product provides or the geographical distribution of its users. By adding analysis dimensions, we change the angle from which we look at a problem, analyze the data at a more fine-grained level, gain more insight, and deepen the analysis.

For example, by adding the acquisition-source dimension to new-user retention, we can monitor the retention rate of new users from each source and spend a limited budget where it actually makes a difference.
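A minimal sketch of that idea with pandas; the column names and user records are assumptions for illustration, not part of the tutorial:

```python
import pandas as pd

# Hypothetical new users: acquisition channel and a 7-day retention flag.
users = pd.DataFrame({
    "channel":     ["ads", "ads", "ads", "organic", "organic", "referral"],
    "retained_d7": [0,     0,     1,     1,         1,         1],
})

# The overall retention rate hides the differences between channels.
overall = users["retained_d7"].mean()

# Adding the channel dimension reveals which sources actually retain users.
by_channel = users.groupby("channel")["retained_d7"].mean()
print(f"overall: {overall:.2f}")
print(by_channel)
```

One `groupby` on the new dimension turns a single blended number into a per-channel breakdown that can guide spending.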

2.2 Reducing granularity

Granularity is the degree of aggregation of data. The finest-grained data is the raw, unaggregated data.

For example, daily figures may be the raw data: the granularity is one day and the volume is large. Weekly statistics aggregate the daily data; the granularity becomes one week, and the number of rows drops to 1/7 of the original.
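The daily-to-weekly aggregation can be sketched with pandas’ `resample` (the dates and values are made up):

```python
import numpy as np
import pandas as pd

# Daily sales: the finest granularity available here (28 days, illustrative).
days = pd.date_range("2022-01-03", periods=28, freq="D")  # starts on a Monday
daily = pd.Series(np.arange(1, 29), index=days)

# Aggregating to weekly granularity: 28 rows collapse to 4 (1/7 of the data).
weekly = daily.resample("W").sum()
print(len(daily), "->", len(weekly))
```

The totals are preserved; only the level of aggregation changes, which is exactly why moving to finer granularity can expose patterns that the coarser view averaged away.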

III. Data “traceability”

Traceability means going into the details: looking at the original data and reflecting on user behavior. When doing data analysis, it is important to know whether the data you are analyzing is primary or secondary.

  • Primary data is the original, unprocessed data; it is the richest in content, but it may not be standardized.

  • Secondary data has been processed, or even pre-analyzed; it can be one-sided, stripped down, or shaped toward a particular topic, so analysis built on it may be biased.


Information and code download

The code for this tutorial series can be downloaded from ShowMeAI on GitHub and run locally in Python, or run on Google Colab.

The cheat sheets covered in this tutorial series can be downloaded at:

  • Pandas cheat sheet
  • Matplotlib cheat sheet
  • Seaborn cheat sheet

Expanded Reference materials

  • Python for Data Analysis, 2nd Edition
  • w3schools pandas tutorial
  • Kaggle’s Pandas

ShowMeAI related articles recommended

  • Introduction to Data Analysis
  • Analytical thinking of data
  • Business cognition and data exploration
  • Data cleaning and preprocessing
  • Business analysis and data mining
  • Data analysis tool map
  • Introduction to the statistical and data science computing tool library NumPy
  • NumPy and 1-dimensional array manipulation
  • NumPy and 2-dimensional array manipulation
  • NumPy and high-dimensional array manipulation
  • Introduction to the data analysis tool library Pandas
  • Pandas core operation functions
  • Pandas advanced functions for data transformation
  • Grouping and aggregating data with Pandas
  • Data visualization principles and methods
  • Data visualization based on Pandas
  • Seaborn tools and data visualization

ShowMeAI series tutorials recommended

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • Mathematical Foundations of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials