Introduction to statistical methods in Pandas

In this article, we will introduce the statistical methods provided by Pandas that are commonly used in data analysis.

Percentage change

Both Series and DataFrame have a pct_change() method that calculates the percentage change between the current and a prior element. NaN values can be filled before the change is computed (controlled by the fill_method parameter).

In [44]: ser = pd.Series(np.random.randn(8))

In [45]: ser.pct_change()
Out[45]:
0         NaN
1   -1.264716
2    4.125006
3   -1.159092
4   -0.091292
5    4.837752
6   -1.182146
7   -8.721482
dtype: float64

In [46]: ser
Out[46]:
0   -0.950515
1    0.251617
2    1.289537
3   -0.205155
4   -0.186426
5   -1.088310
6    0.198231
7   -1.530635
dtype: float64

pct_change() also takes a periods parameter that specifies how many elements apart the two compared values are:

In [3]: df = pd.DataFrame(np.random.randn(10, 4))

In [4]: df.pct_change(periods=3)
Out[4]:
          0         1         2         3
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2       NaN       NaN       NaN       NaN
3 -0.218320 -1.054001  1.987147 -0.510183
4 -0.439121 -1.816454  0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7 -0.117826 -2.169058  0.036094 -0.067696
8  2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977  2.324558 -1.003744 -0.371806
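The arithmetic behind pct_change() can be checked with deterministic data. This is a small sketch with made-up values:

```python
import pandas as pd

# pct_change(periods=n) computes (x[t] - x[t-n]) / x[t-n]
ser = pd.Series([100.0, 110.0, 121.0, 60.5])

one_step = ser.pct_change()           # NaN, 0.10, 0.10, -0.50
two_step = ser.pct_change(periods=2)  # NaN, NaN, 0.21, -0.45
print(one_step)
print(two_step)
```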

Covariance

Series.cov() calculates the covariance between two Series, ignoring NaN data.

In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.0006801088174310875
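Under the hood this is the sample covariance, Σ(x − x̄)(y − ȳ) / (n − 1). A quick sketch with fixed, made-up data confirming the formula:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0])

# Sample covariance computed by hand, with the n - 1 denominator
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(x.cov(y))  # 3.333...
print(manual)    # identical
```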

Similarly, DataFrame.cov() calculates the pairwise covariance between the DataFrame's columns, again ignoring NaN data.

In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [9]: frame.cov()
Out[9]:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487

DataFrame.cov() also takes a min_periods parameter, which specifies the minimum number of valid (non-NaN) observations required for each column pair, to guard against covariances computed from too few data points.

In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [11]: frame.loc[frame.index[:5], "a"] = np.nan

In [12]: frame.loc[frame.index[5:10], "b"] = np.nan

In [13]: frame.cov()
Out[13]:
          a         b         c
a  1.123670 -0.412851  0.018169
b -0.412851  1.154141  0.305260
c  0.018169  0.305260  1.301149

In [14]: frame.cov(min_periods=12)
Out[14]:
          a         b         c
a  1.123670       NaN  0.018169
b       NaN  1.154141  0.305260
c  0.018169  0.305260  1.301149
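The effect of min_periods is easiest to see on a small deterministic frame (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "b": [2.0, 1.0, 4.0, 3.0, 5.0]})

# Every pair involving column "a" has only 4 valid observations,
# so requiring at least 5 turns those entries into NaN
print(df.cov())               # all entries computed
print(df.cov(min_periods=5))  # entries involving "a" become NaN
```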

Correlation coefficient

The corr() method calculates the correlation coefficient. Three calculation methods are supported:

Method name        Description
pearson (default)  Standard correlation coefficient
kendall            Kendall Tau correlation coefficient
spearman           Spearman rank correlation coefficient
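The difference between the methods is easiest to see on data that is monotonic but not linear. Spearman correlates the ranks rather than the raw values, so it reports a perfect relationship where Pearson does not. A small sketch (Spearman's coefficient equals Pearson's applied to the ranks, which is how it is computed here):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # monotonic but non-linear

pearson = x.corr(y)
spearman = x.rank().corr(y.rank())  # Spearman = Pearson on the ranks

print(pearson)   # high, but below 1 (the relationship is curved)
print(spearman)  # 1.0 (the orderings match perfectly)
```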
In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [16]: frame.iloc[::2] = np.nan

# Series with Series
In [17]: frame["a"].corr(frame["b"])
Out[17]: 0.013479

In [18]: frame["a"].corr(frame["b"], method="spearman")
Out[18]: ...

# Pairwise correlation of DataFrame columns
In [19]: frame.corr()
Out[19]:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000

corr() also supports min_periods:

In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [21]: frame.loc[frame.index[:5], "a"] = np.nan

In [22]: frame.loc[frame.index[5:10], "b"] = np.nan

In [23]: frame.corr()
Out[23]:
          a         b         c
a  1.000000 -0.121111  0.069544
b -0.121111  1.000000  0.051742
c  0.069544  0.051742  1.000000

In [24]: frame.corr(min_periods=12)
Out[24]:
          a         b         c
a  1.000000       NaN  0.069544
b       NaN  1.000000  0.051742
c  0.069544  0.051742  1.000000

DataFrame.corrwith() calculates the correlation coefficient between the matching rows or columns of two different DataFrames.

In [27]: index = ["a", "b", "c", "d", "e"]

In [28]: columns = ["one", "two", "three", "four"]

In [29]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In [30]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In [31]: df1.corrwith(df2)
Out[31]:
one     -0.125501
two     -0.493244
three    0.344056
four     0.004183
dtype: float64

In [32]: df2.corrwith(df1, axis=1)
Out[32]:
a   -0.675817
b    0.458296
c    0.190809
d    0.186275
e         NaN
dtype: float64
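corrwith() aligns the two frames on their labels before correlating; labels present in only one frame come back as NaN. A deterministic sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [3.0, 2.0, 1.0]})
df2 = pd.DataFrame({"x": [2.0, 4.0, 6.0], "y": [1.0, 2.0, 3.0]})

# Correlate matching columns: x moves together (+1), y moves oppositely (-1)
result = df1.corrwith(df2)
print(result)  # x: 1.0, y: -1.0
```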

Ranking

The rank() method ranks the data in a Series. What does a rank mean? Let's look at an example:

In [50]: s = pd.Series(np.random.randn(5), index=list("abcde"))

In [51]: s
Out[51]:
a    0.336259
b    1.073116
c   -0.402291
d    0.624186
e   -0.422478
dtype: float64

In [52]: s["d"] = s["b"]  # so there's a tie

In [53]: s
Out[53]:
a    0.336259
b    1.073116
c   -0.402291
d    1.073116
e   -0.422478
dtype: float64

In [54]: s.rank()
Out[54]:
a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64

Sorting the updated Series values from smallest to largest gives:

-0.422478 < -0.402291 <  0.336259 <  1.073116 < 1.073116

So the corresponding rank is 1, 2, 3, 4, 5.

Since two of the values are tied, the default method averages their ranks (4 and 5), giving each 4.5.
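This tie-averaging is easy to verify with a tiny deterministic series (values made up):

```python
import pandas as pd

s = pd.Series([10.0, 30.0, 20.0, 30.0])

# 30.0 appears twice, occupying positions 3 and 4 in sorted order,
# so the default 'average' method assigns both the mean rank, 3.5
print(s.rank())  # 1.0, 3.5, 2.0, 3.5
```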

In addition to the default method ('average'), you can specify method='max', where tied values are all assigned the highest rank in their group.

You can also specify na_option='bottom', which includes NaN values in the ranking and places them at the bottom, i.e. with the largest ranks.

You can also specify pct=True, which expresses each rank as a fraction of the total number of values.

>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN

rank() is also a DataFrame method, ranking either down the rows (axis=0) or across the columns (axis=1).

In [36]: df = pd.DataFrame(np.random.randn(10, 6))

In [37]: df[4] = df[2][:5]  # some ties

In [38]: df
Out[38]:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN -1.375064

In [39]: df.rank(1)
Out[39]:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  3.0  4.0  2.0  5.0  NaN  1.0
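A minimal deterministic sketch of the axis difference (values made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [3.0, 1.0], "b": [2.0, 4.0]})

# axis=0 ranks down each column; axis=1 ranks across each row
print(df.rank(axis=0))  # column a: [2, 1], column b: [1, 2]
print(df.rank(axis=1))  # row 0: a=2, b=1; row 1: a=1, b=2
```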

This article has been included in http://www.flydean.com/10-python-pandas-statistical/

The clearest explanations, the most practical insights, the most concise tutorials, and plenty of tips you never knew, all waiting for you to discover!