
Introduction

Data analysis relies on many statistical methods. This article describes the ones built into Pandas.

Percent change

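Both Series and DataFrame have a pct_change() method that computes the fractional change between each element and the one before it. The first element has no predecessor, so its result is NaN. In older pandas versions, NaN values in the data are filled (see the fill_method parameter) before the change is computed.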

ser = pd.Series(np.random.randn(8))

ser.pct_change()
Out[45]:
0         NaN
1   -1.264716
2    4.125006
3   -1.159092
4   -0.091292
5    4.837752
6   -1.182146
7   -8.721482
dtype: float64

ser
Out[46]:
0   -0.950515
1    0.251617
2    1.289537
3   -0.205155
4   -0.186426
5   -1.088310
6    0.198231
7   -1.530635
dtype: float64
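To make the arithmetic easier to follow than with random data, here is a minimal sketch on a small hand-picked Series. The name `prices` and its values are made up for illustration; the point is only that pct_change() matches (current - previous) / previous.

```python
import numpy as np
import pandas as pd

# Hypothetical, hand-picked values so the result can be checked by eye.
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

# Built-in percentage change between consecutive elements.
changes = prices.pct_change()

# The same quantity computed by hand: (current - previous) / previous.
previous = prices.shift(1)
manual = (prices - previous) / previous

print(changes)
print(np.allclose(changes, manual, equal_nan=True))
```

The first entry is NaN, followed by roughly 0.10, -0.10 and 0.0 (up to floating-point rounding).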

pct_change() also takes a periods parameter, which sets how many positions apart the two compared elements are:

In [3]: df = pd.DataFrame(np.random.randn(10, 4))

In [4]: df.pct_change(periods=3)
Out[4]:
          0         1         2         3
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2       NaN       NaN       NaN       NaN
3  0.218320 -1.054001  1.987147 -0.510183
4 -0.439121 -1.816454  0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7  0.117826 -2.169058  0.036094 -0.067696
8  2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977  2.324558 -1.003744 -0.371806
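As a rough illustration of what periods does, the sketch below compares pct_change(periods=3) with dividing each row by the row three positions earlier. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is easy to divide mentally.
df = pd.DataFrame({
    "x": [1.0, 2.0, 4.0, 8.0, 16.0],
    "y": [10.0, 10.0, 20.0, 20.0, 40.0],
})

# Change relative to the value three rows earlier.
by_three = df.pct_change(periods=3)

# Equivalent manual form: current / value-three-rows-back - 1.
manual = df / df.shift(3) - 1

print(by_three)
```

The first three rows are NaN because no row exists three positions before them; row 3 compares against row 0, and row 4 against row 1.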

Covariance

Series.cov() computes the covariance between two Series, excluding missing (NaN) values.

In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.0006801088174310875
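To see what "excluding NaN values" means in practice, here is a small sketch that reproduces Series.cov() by hand. The two Series are invented; the manual version keeps only the positions where both values are present and uses the usual sample covariance with an n - 1 denominator.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each Series.
s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

pandas_cov = s1.cov(s2)

# By hand: drop every position where either value is missing ...
both = s1.notna() & s2.notna()
x, y = s1[both], s2[both]

# ... then apply the sample covariance formula (divide by n - 1).
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(pandas_cov, manual_cov)
```

Only three pairs survive here, (1, 2), (2, 4) and (5, 10), so both numbers describe those three observations.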

Similarly, DataFrame.cov() computes the pairwise covariances between the columns of a DataFrame, again excluding NaN values.

In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [9]: frame.cov()
Out[9]:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.001857
c -0.002698  0.000191       ... -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.001857 -0.005087 -0.047952  1.042487

DataFrame.cov() takes a min_periods parameter: the minimum number of observations each pair of columns must share to produce a result. Pairs with fewer shared observations yield NaN, so a covariance is never reported from too little data.

In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [11]: frame.loc[frame.index[:5], "a"] = np.nan

In [12]: frame.loc[frame.index[5:10], "b"] = np.nan

In [13]: frame.cov()
Out[13]:
          a         b         c
a  1.123670 -0.412851  0.018169
b -0.412851  1.154141  0.305260
c  0.018169  0.305260  1.301149

In [14]: frame.cov(min_periods=12)
Out[14]:
          a         b         c
a  1.123670       NaN  0.018169
b       NaN  1.154141  0.305260
c  0.018169  0.305260  1.301149
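The effect of min_periods is easier to verify on a tiny frame. In this sketch (values invented for illustration) columns a and b share only two complete rows, so asking for at least three shared observations turns that one entry into NaN while leaving the others untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "a" and "b" overlap only in rows 2 and 3.
frame = pd.DataFrame({
    "a": [np.nan, np.nan, 3.0, 4.0, 5.0, 7.0],
    "b": [1.0, 2.0, 2.0, 5.0, np.nan, np.nan],
    "c": [6.0, 5.0, 4.0, 4.0, 2.0, 1.0],
})

# Default: any pair with enough data for the formula gets a value.
loose = frame.cov()

# Require at least 3 shared observations per pair of columns.
strict = frame.cov(min_periods=3)

print(loose)
print(strict)
```

Pairs involving c still have four shared rows, so they are the same in both results.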

Correlation coefficient

The corr() method computes correlation coefficients. Its method parameter selects one of three coefficients:

Method name          Description
pearson (default)    Standard correlation coefficient
kendall              Kendall Tau correlation coefficient
spearman             Spearman rank correlation coefficient

In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [16]: frame.iloc[::2] = np.nan

# Series with Series
In [17]: frame["a"].corr(frame["b"])
Out[17]: ...

In [18]: frame["a"].corr(frame["b"], method="spearman")
Out[18]: -0.007289885159540637

# Pairwise correlation of DataFrame columns
In [19]: frame.corr()
Out[19]:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000  0.017060
e -0.028525  0.005654 -0.054269  0.017060  1.000000
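One way to get a feel for the difference between pearson and spearman is a relationship that is monotonic but not linear. The sketch below uses invented data where y is the cube of x, and also checks that the Spearman coefficient behaves like a Pearson coefficient computed on ranks.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = x**3, strictly increasing but not a straight line.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 8.0, 27.0, 64.0, 125.0],
})

# Pearson measures linear association, so it falls short of 1 here.
pearson = df.corr().loc["x", "y"]

# Spearman only looks at the ordering, which is identical for x and y.
spearman = df.corr(method="spearman").loc["x", "y"]

# Same idea by hand: rank each column, then take the Pearson correlation.
rank_pearson = df.rank().corr().loc["x", "y"]

print(pearson, spearman, rank_pearson)
```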

corr() also supports min_periods:

In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [21]: frame.loc[frame.index[:5], "a"] = np.nan

In [22]: frame.loc[frame.index[5:10], "b"] = np.nan

In [23]: frame.corr()
Out[23]:
          a         b         c
a  1.000000 -0.121111  0.069544
b -0.121111  1.000000  0.051742
c  0.069544  0.051742  1.000000

In [24]: frame.corr(min_periods=12)
Out[24]:
          a         b         c
a  1.000000       NaN  0.069544
b       NaN  1.000000  0.051742
c  0.069544  0.051742  1.000000

corrwith() computes the correlation between like-labeled columns (or rows, with axis=1) of two different DataFrames.

In [27]: index = ["a", "b", "c", "d", "e"]

In [28]: columns = ["one", "two", "three", "four"]

In [29]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In [30]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In [31]: df1.corrwith(df2)
Out[31]:
one     -0.125501
two     -0.493244
three    0.344056
four     0.004183
dtype: float64

In [32]: df2.corrwith(df1, axis=1)
Out[32]:
a   -0.675817
b    0.458296
c    0.190809
d   -0.186275
e         NaN
dtype: float64
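As a sanity check on what corrwith() returns, this sketch compares it with calling Series.corr() column by column. The frames are invented; df2 deliberately lacks row "e", so only the four shared rows take part.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with matching column labels; df2 has no row "e".
df1 = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0, 5.0], "two": [2.0, 1.0, 4.0, 3.0, 6.0]},
    index=list("abcde"),
)
df2 = pd.DataFrame(
    {"one": [2.0, 4.0, 6.0, 9.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=list("abcd"),
)

# One coefficient per column label shared by both frames.
by_column = df1.corrwith(df2)

# The same numbers, one column pair at a time (Series.corr aligns on index).
manual = pd.Series({col: df1[col].corr(df2[col]) for col in df1.columns})

print(by_column)
```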

Ranking

The rank() method assigns a rank to each value in a Series. What is a rank? Let’s take an example:

s = pd.Series(np.random.randn(5), index=list("abcde"))

s
Out[51]:
a    0.336259
b    1.073116
c   -0.402291
d         ...
e   -0.422478
dtype: float64

s["d"] = s["b"]  # so there's a tie

s
Out[53]:
a    0.336259
b    1.073116
c   -0.402291
d    1.073116
e   -0.422478
dtype: float64

s.rank()
Out[54]:
a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64

Sorted from smallest to largest, the values of the Series above are:

-0.422478 < -0.402291 < 0.336259 < 1.073116 = 1.073116

So the corresponding positions are 1, 2, 3, 4 and 5.

Since two values are the same, the default is to give each of them the average of the positions they occupy: (4 + 5) / 2 = 4.5.
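The averaging rule can be reproduced by hand. The sketch below uses rounded stand-in values (not the random ones from the session above) and rebuilds the default ranks by sorting, numbering the positions from 1, and averaging the positions of equal values.

```python
import numpy as np
import pandas as pd

# Hypothetical values with the same ordering and tie as the example.
s = pd.Series([0.3, 1.1, -0.4, 1.1, -0.5], index=list("abcde"))

# Default ranking: ties share the average of their positions.
ranks = s.rank()

# By hand: sort, number the positions 1..n, average within equal values.
ordered = s.sort_values()
positions = pd.Series(
    np.arange(1, len(ordered) + 1, dtype=float), index=ordered.index
)
manual = positions.groupby(ordered.values).transform("mean").reindex(s.index)

print(ranks)
```

Labels b and d occupy positions 4 and 5 after sorting, so both end up with 4.5.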

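Besides the default behaviour (method="average", the default_rank column in the example below), you can pass method="max" so that tied values all receive the highest rank in their group (the max_rank column). In the Series above, b and d would both get 5.

You can also pass na_option="bottom" (the NA_bottom column). NaN values are then ranked as well and placed at the bottom, which means they receive the largest rank. By default they are left as NaN.

Finally, pct=True (the pct_rank column) returns the ranks as fractions between 0 and 1 instead of positions.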

>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
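The pct_rank numbers can look arbitrary at first. As far as this example goes, they are simply the default ranks divided by the number of non-missing values (four here), which the following sketch checks on the same leg counts. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same leg counts as in the animal example; the last one is missing.
legs = pd.Series([4, 2, 4, 8, np.nan])

# Ranks expressed as a fraction of the valid values.
pct = legs.rank(pct=True)

# By hand: default rank divided by the count of non-NaN entries.
manual = legs.rank() / legs.count()

# With na_option="bottom" the missing value gets the last rank instead of NaN.
bottom = legs.rank(na_option="bottom")

print(pct)
print(bottom)
```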

rank() is also available on a DataFrame, where axis selects the direction: axis=0 (the default) ranks the values within each column, and axis=1 ranks the values within each row.

In [36]: df = pd.DataFrame(np.random.randn(10, 6))

In [37]: df[4] = df[2][:5]  # some ties

In [38]: df
Out[38]:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

In [39]: df.rank(axis=1)
Out[39]:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  2.0  3.0  1.0  4.0  NaN  5.0
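With ten random rows it is hard to see at a glance which direction was ranked. Here is a smaller sketch with invented values that ranks the same frame both ways.

```python
import pandas as pd

# Hypothetical 3x3 frame; column "z" contains a tie in its first two rows.
df = pd.DataFrame({
    "x": [3.0, 1.0, 2.0],
    "y": [10.0, 30.0, 20.0],
    "z": [5.0, 5.0, 50.0],
})

# axis=0 (default): each column is ranked on its own, top to bottom.
down = df.rank(axis=0)

# axis=1: each row is ranked on its own, left to right.
across = df.rank(axis=1)

print(down)
print(across)
```

In `down`, the tie in column z gives its first two rows rank 1.5 each. In `across`, every row is ranked over just its three values, so the ranks in each row are 1, 2 and 3 in some order.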

This article is available at www.flydean.com/10-python-p…
