Besides cleaning and filtering, data processing also involves statistical calculation. This article introduces some common statistical functions in pandas.

1. Preview data

The demonstration data for this article comes from the national data center: GDP figures for each region over the last 5 years. Reply "GDP" in the account backend to get the data file, so you can try things out yourself.

In [1]: df.head()  # preview the first 5 rows
Out[1]:
                             Region     2020     2019     2018     2017     2016
0                           Beijing  36102.6  35445.1  33106.0  29883.0  27041.2
1                           Tianjin  14083.7  14055.5  13362.9  12450.6  11477.2
2                    Hebei Province  36206.9  34978.6  32494.6  30640.8  28474.1
3                   Shanxi Province  17651.9  16961.6  15958.1  14484.3  11946.4
4  Inner Mongolia Autonomous Region  17359.8  17212.5  16140.8  14898.1  13789.3

In [2]: df.info()  # check each field's data type, the number of rows, and the non-null count
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Region  32 non-null     object
 1   2020    31 non-null     float64
 2   2019    31 non-null     float64
 3   2018    31 non-null     float64
 4   2017    31 non-null     float64
 5   2016    31 non-null     float64
dtypes: float64(5), object(1)
memory usage: 1.6+ KB
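
To follow along without the data file, a minimal sketch is to rebuild the five previewed rows by hand. The column names used here (Region, 2020, ...) are assumed English renderings of the file's headers, and the variable name sample is just for illustration; the values are copied from the preview in this section.

```python
import pandas as pd

# Five rows copied from the df.head() preview; the column names are assumptions.
sample = pd.DataFrame(
    {
        "Region": [
            "Beijing",
            "Tianjin",
            "Hebei Province",
            "Shanxi Province",
            "Inner Mongolia Autonomous Region",
        ],
        "2020": [36102.6, 14083.7, 36206.9, 17651.9, 17359.8],
        "2019": [35445.1, 14055.5, 34978.6, 16961.6, 17212.5],
        "2018": [33106.0, 13362.9, 32494.6, 15958.1, 16140.8],
        "2017": [29883.0, 12450.6, 30640.8, 14484.3, 14898.1],
        "2016": [27041.2, 11477.2, 28474.1, 11946.4, 13789.3],
    }
)

print(sample.head())  # first rows of the frame
sample.info()  # dtypes and non-null counts per column

# An explicit count of missing values per column
null_counts = sample.isna().sum()
print(null_counts)
```

The full file has 32 rows, one of which has no values, so its null counts will differ from this small sample.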

2. Descriptive statistics

The describe method returns descriptive statistics for a data set.

Signature: df.describe(percentiles=None, include=None, exclude=None,
    datetime_is_numeric=False) -> 'FrameOrSeries'
Generate descriptive statistics.
For a DataFrame, each row of the result corresponds to one statistic: count, mean, standard deviation, minimum, quartiles (25%, 50%, and 75% by default), and maximum.

In [3]: df.describe()
Out[3]:
                2020           2019          2018          2017          2016
count      31.000000      31.000000     31.000000     31.000000     31.000000
mean    32658.551613   31687.758065  29487.661290  26841.819355  24224.148387
std     26661.811640   25848.652250  24136.181387  22161.575235  20008.278500
min      1902.700000    1697.800000   1548.400000   1349.000000   1173.000000
25%     13940.650000   13826.300000  13104.700000  12381.800000  11634.800000
50%     25115.000000   24667.300000  22716.500000  20210.800000  18388.600000
75%     42612.500000   41110.350000  37508.750000  33835.250000  30370.250000
max    110760.900000  107986.900000  99945.200000  91648.700000  82163.200000

From the descriptive statistics above, the 2020 data covers 31 regions, with an average GDP of about 3.27 trillion yuan, a maximum of about 11.08 trillion yuan, and a minimum of about 0.19 trillion yuan (the table values are in units of 100 million yuan).
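
Because describe returns an ordinary DataFrame, individual statistics can be read back with .loc instead of being copied from the printout. The sketch below uses made-up numbers; the column name gdp and the values are hypothetical, and the conversion assumes figures in units of 100 million yuan.

```python
import pandas as pd

# Hypothetical values, assumed to be in units of 100 million yuan.
toy = pd.DataFrame({"gdp": [10000.0, 20000.0, 30000.0, 40000.0]})

stats = toy.describe()

# Rows of the result are statistics, columns are the original fields.
mean_gdp = stats.loc["mean", "gdp"]
max_gdp = stats.loc["max", "gdp"]

# 10,000 x 100 million yuan = 1 trillion yuan
print(f"mean: {mean_gdp / 10000:.2f} trillion yuan")
print(f"max:  {max_gdp / 10000:.2f} trillion yuan")
```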

As the signature shows, several parameters can be customized:

percentiles specifies which quantiles to report.

In [4]: df.describe(percentiles=[.2, .4, .6, .8])
Out[4]:
                2020           2019          2018          2017          2016
count      31.000000      31.000000     31.000000     31.000000     31.000000
mean    32658.551613   31687.758065  29487.661290  26841.819355  24224.148387
std     26661.811640   25848.652250  24136.181387  22161.575235  20008.278500
min      1902.700000    1697.800000   1548.400000   1349.000000   1173.000000
20%     13698.500000   13544.400000  12809.400000  11159.900000  10427.000000
40%     22156.700000   21237.100000  19627.800000  17790.700000  16116.600000
50%     25115.000000   24667.300000  22716.500000  20210.800000  18388.600000
60%     36102.600000   34978.600000  32494.600000  29676.200000  26307.700000
80%     43903.900000   45429.000000  42022.000000  37235.000000  33138.500000
max    110760.900000  107986.900000  99945.200000  91648.700000  82163.200000
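
The quantile rows produced by percentiles are the same numbers that quantile returns directly. A small sketch on toy data (the series below is invented for illustration) makes the correspondence visible:

```python
import pandas as pd

s = pd.Series(range(1, 11), dtype="float64")  # toy data: 1.0 through 10.0

# Ask describe for the 20/40/60/80% quantiles
desc = s.describe(percentiles=[0.2, 0.4, 0.6, 0.8])
print(desc)

# The same cut points computed directly with quantile
q = s.quantile([0.2, 0.4, 0.6, 0.8])
print(q)
```

Note that in Out[4] the 50% row is reported even though it was not requested.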

include and exclude select which data types to include in, or exclude from, the result. For example:

df.describe(include=[np.number])  # only numeric fields
df.describe(exclude=[np.float64])  # exclude floating-point fields (np.float was removed in NumPy 1.24)

By default, describe covers only the numeric columns, so the Region field is left out. To include every column, pass include='all'.
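
As a rough illustration of how the type filters change the columns in the result, here is a sketch on a small made-up frame. The column names region, gdp, and position are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one text, one float, and one integer column (all invented).
toy = pd.DataFrame(
    {"region": ["A", "B", "C"], "gdp": [1.0, 2.0, 3.0], "position": [3, 2, 1]}
)

numeric = toy.describe(include=[np.number])  # float and integer columns
no_float = toy.describe(exclude=[np.float64])  # everything except float columns
everything = toy.describe(include="all")  # all columns, NaN where a statistic does not apply

print(list(numeric.columns))
print(list(no_float.columns))
print(list(everything.columns))
```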

In [5]: df.describe(include='all')
Out[5]:
       Region       2020       2019      2018      2017      2016
count      32      31.00      31.00     31.00     31.00     31.00
unique     32        NaN        NaN       NaN       NaN       NaN
top       ...        NaN        NaN       NaN       NaN       NaN
freq        1        NaN        NaN       NaN       NaN       NaN
...       ...        ...        ...       ...       ...       ...
25%       NaN   13940.65   13826.30  13104.70  12381.80  11634.80
50%       NaN   25115.00   24667.30  22716.50  20210.80  18388.60
75%       NaN   42612.50   41110.35  37508.75  33835.25  30370.25
max       NaN  110760.90  107986.90  99945.20  91648.70  82163.20

[11 rows x 6 columns]

In the example data, the Region field is of object type rather than numeric. In the descriptive statistics it therefore adds three new indicators, unique, top, and freq, which are not available for purely numeric columns. They are the number of distinct values, the most frequent value, and the frequency of that value, as in the following standalone example:

In [6]: s = pd.Series(['red', 'blue', 'black', 'grey', 'red', 'grey'])

In [7]: s.describe()
count       6
unique      4
top       red
freq        2
dtype: object
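
One way to see where unique, top, and freq come from is to compare them with nunique and value_counts on the same series. This is only a cross-check sketch, not how describe is implemented internally:

```python
import pandas as pd

s = pd.Series(["red", "blue", "black", "grey", "red", "grey"])

desc = s.describe()
counts = s.value_counts()  # distinct values with their counts, most frequent first

print(desc)
print(counts)

# unique matches the number of distinct values,
# freq matches the largest count in value_counts.
n_distinct = s.nunique()
largest_count = counts.iloc[0]
```

Here red and grey both occur twice, so either one is a valid most frequent value; the output above reports red.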
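
describe has one more parameter, datetime_is_numeric, which must be set to True to get numeric-style statistics for datetime data. (This applies to pandas 1.x; pandas 2.0 removed the parameter and always summarizes datetime data as numeric.)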

In [8]: s = pd.Series([np.datetime64("2000-01-01"),
    ...:                np.datetime64("2010-01-01"),
    ...:                np.datetime64("2010-01-01")
    ...:                ])

In [9]: s.describe()
FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

In [10]: s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
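
Since the datetime_is_numeric parameter only exists in some pandas versions, a defensive sketch is to try it first and fall back to a plain describe call. The try/except pattern here is a suggestion, not something from the original article:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2000-01-01", "2010-01-01", "2010-01-01"]))

try:
    # pandas 1.1 to 1.5: opt in to numeric-style statistics
    desc = s.describe(datetime_is_numeric=True)
except TypeError:
    # Versions without the parameter reject the unknown keyword
    desc = s.describe()

print(desc)
```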

In day-to-day data processing, besides the statistics that describe reports, we also use other statistical calculations, such as the variance and the mode.

3. Statistical calculation

Here we demonstrate how the common statistical functions are called. By default they compute by column; you can also compute by row, as shown below.

# maximum
In [11]: df.max(numeric_only=True)
Out[11]:
2020    110760.9
2019    107986.9
2018     99945.2
2017     91648.7
2016     82163.2
dtype: float64

# minimum
In [12]: df.min(numeric_only=True)
Out[12]:
2020    1902.7
2019    1697.8
2018    1548.4
2017    1349.0
2016    1173.0
dtype: float64

# mean (for statistical calculations, restricting to numeric data with numeric_only=True is recommended; axis selects the direction: by column by default, axis=1 for rows)
In [13]: df.mean(axis=1, numeric_only=True)
Out[13]:
0     32315.58
1     13085.98
2     32559.00
3     15400.46
        ...
28     2683.66
29     3432.18
30    12198.96
31         NaN
Length: 32, dtype: float64
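
The difference between the two directions is easiest to see on a tiny frame. The sketch below uses invented values and hypothetical column names:

```python
import pandas as pd

# Toy frame: one text column and two numeric columns (values invented).
toy = pd.DataFrame(
    {"region": ["A", "B"], "2020": [4.0, 10.0], "2019": [2.0, 6.0]}
)

# Default direction: one result per column
col_means = toy.mean(numeric_only=True)

# axis=1: one result per row, averaging across the numeric columns
row_means = toy.mean(axis=1, numeric_only=True)

print(col_means)
print(row_means)
```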

The functions below are listed without individual demonstrations, only to introduce what each one does. When using them, pay attention to the data types of the original data, since non-numeric columns may cause errors.

df.sum()  # sum
df.corr()  # correlation coefficient
df.cov()  # covariance
df.count()  # non-null count
df.abs()  # absolute value
df.median()  # median
df.mode()  # mode
df.std()  # standard deviation
df.var()  # unbiased variance
df.sem()  # standard error of the mean
df.mad()  # mean absolute deviation (deprecated in pandas 1.5, removed in 2.0)
df.cumprod()  # cumulative product
df.cumsum()  # cumulative sum
df.nunique()  # number of distinct values
df.idxmax()  # index label of the maximum (argmax)
df.idxmin()  # index label of the minimum
df.sample(5)  # randomly sample 5 rows
df.skew()  # sample skewness (third moment)
df.kurt()  # sample kurtosis (fourth moment)
df.quantile()  # sample quantile
df.rank()  # rank
df.pct_change()  # rate of change
df.value_counts()  # distinct values and their counts
s.argmax()  # integer position of the maximum; not available on DataFrame
s.argmin()  # integer position of the minimum; not available on DataFrame
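
Two items in this list are easy to mix up or may be missing in newer pandas, so here is a small sketch on toy data: idxmax returns an index label while argmax returns an integer position, and the mean absolute deviation can be written out by hand when mad is not available. The series values and labels are invented.

```python
import pandas as pd

s = pd.Series([3.0, 9.0, 6.0], index=["a", "b", "c"])  # toy data

label_of_max = s.idxmax()  # index label of the largest value
position_of_max = s.argmax()  # integer position of the largest value

# Mean absolute deviation written out by hand:
# average distance of each value from the mean.
mad = (s - s.mean()).abs().mean()

print(label_of_max, position_of_max, mad)
```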

Each of these functions takes further parameters that make it more powerful, so try them out on your own. Here are a few examples.

>>> s = pd.Series([90, 91, 85])
>>> s
0    90
1    91
2    85
dtype: int64

>>> s.pct_change()
0         NaN
1    0.011111
2   -0.065934
dtype: float64

>>> s.pct_change(periods=2)  # rate of change over 2 rows (default: 1 row)
0         NaN
1         NaN
2   -0.055556
dtype: float64
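
The numbers above follow from the definition: each value is divided by the value periods rows earlier, minus 1. A short sketch reproduces them by hand with shift (the names by_hand and by_hand_2 are just for illustration):

```python
import pandas as pd

s = pd.Series([90, 91, 85])

# Current value divided by the value one row earlier, minus 1
by_hand = s / s.shift(1) - 1

# Same idea, comparing with the value two rows earlier
by_hand_2 = s / s.shift(2) - 1

print(by_hand)
print(by_hand_2)
```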

Besides these, the following functions are also commonly used.

# the 5 rows with the largest values in a column
In [14]: df.nlargest(5,columns='2020')
Out[14]:
                Region      2020      2019     2018     2017     2016
18  Guangdong Province  110760.9  107986.9  99945.2  91648.7  82163.2
9     Jiangsu Province  102719.0   98656.8  93207.6  85869.8  77350.9
14   Shandong Province   73129.0   70540.5  66648.9  63012.1  58762.5
10   Zhejiang Province   64613.3   62462.0  58002.8  52403.1  47254.0
15      Henan Province   54997.1   53717.8  49935.9  44824.9  40249.3

# the 5 rows with the smallest values in a column
In [15]: df.nsmallest(5,columns='2020')
Out[15]:
                           Region    2020    2019    2018    2017    2016
25        Tibet Autonomous Region  1902.7  1697.8  1548.4  1349.0  1173.0
28               Qinghai Province  3005.9  2941.1  2748.0  2465.1  2258.2
29  Ningxia Hui Autonomous Region  3920.5  3748.5  3510.2  3200.3  2781.4
20                Hainan Province  5532.4  5330.8  4910.7  4497.5  4090.2
27                 Gansu Province  9016.7  8718.3  8104.1  7336.7  6907.9
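
For intuition, nlargest gives the same rows as sorting in descending order and taking the head. The sketch below checks this on a small invented frame with no tied values:

```python
import pandas as pd

# Toy data with hypothetical region names and values.
toy = pd.DataFrame(
    {"region": ["A", "B", "C", "D"], "2020": [5.0, 40.0, 10.0, 25.0]}
)

top2 = toy.nlargest(2, columns="2020")
same = toy.sort_values("2020", ascending=False).head(2)  # equivalent here

bottom2 = toy.nsmallest(2, columns="2020")

print(top2)
print(bottom2)
```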

The four arithmetic operations: addition, subtraction, multiplication, and division

You can use operator notation or the equivalent methods, and the other operand can be a scalar, a DataFrame, or a Series.

# The flexible wrappers (`add`, `sub`, `mul`, `div`, `mod`, `pow`) correspond to the arithmetic operators `+`, `-`, `*`, `/`, `%`, `**`.
df + 1
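
A brief sketch of the operator and method forms side by side, on a small all-numeric frame invented for the purpose (the names nums, a, and b are hypothetical). The method form accepts extras such as fill_value and axis that the operator cannot express:

```python
import pandas as pd

# Toy numeric frame with one missing value.
nums = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, None]})

plus_op = nums + 1  # operator form; the missing value stays NaN
plus_fn = nums.add(1)  # method form, same result

# The method form can replace missing values before adding
filled = nums.add(1, fill_value=0)

# Multiply by a Series, matched against the column labels
factors = pd.Series({"a": 10, "b": 100})
scaled = nums.mul(factors, axis="columns")

print(plus_op)
print(filled)
print(scaled)
```

On the GDP data, the text Region column would need to be set aside first, since adding a number to a string column raises an error.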

That is all for this article. If you are interested, run the code and try it yourself, or add the author on WeChat to discuss.