Pandas Advanced: Simple operations, statistics, and sorting for DataFrame

In the previous section, we discussed calculations on Series and how Pandas aligns data automatically. DataFrame supports calculations as well, and this is one of its most commonly used features.

Because a DataFrame contains multiple rows and columns, calculations and statistics can be performed along either the rows or the columns. For convenience, Pandas provides the common computational and statistical methods:

Operation	Method	Operation	Method
Sum	sum	Maximum	max
Mean	mean	Minimum	min
Variance	var	Standard deviation	std
Median	median	Mode	mode
Quantile	quantile

1. Operations

Suppose we have the math, Chinese, and English scores of N students, and we want to calculate each student's total score. We can do it in either of the following ways:

# Two ways to compute each student's total score.
# Method 1: select the score columns (leaving out the name column), then call sum.
# Method 2: select the columns one by one and add them with the + operator.
# Both give the same result: the sum of each row in the DataFrame.

df['sum'] = df[['chinese', 'math', 'english']].sum(1)  # Method 1

df['sum'] = df['chinese'] + df['math'] + df['english']  # Method 2

Output：
        name  chinese  english  math  sum
0   XiaoMing       99      100    80  279
1      LiHua      102       79    92  273
2  HanMeiNei      111      130   104  345
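
The article does not show how df is created, so here is a minimal sketch that rebuilds it from the output table above and checks that the two methods agree. The names score_cols, total_by_sum, and total_by_add are our own, chosen only for illustration.

```python
import pandas as pd

# Sample data copied from the output table above.
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})

# Hypothetical helper list of the columns that hold scores.
score_cols = ["chinese", "math", "english"]

# Method 1: select the score columns, then sum across each row.
total_by_sum = df[score_cols].sum(axis=1)

# Method 2: add the columns one by one with the + operator.
total_by_add = df["chinese"] + df["math"] + df["english"]

df["sum"] = total_by_sum
print(df)
```

Method 1 is handier when there are many score columns, because only the list of column names has to change.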

In the sum method we pass the argument 1 (the axis), which tells Pandas to add up the values across the columns within each row. To calculate the sum of each column instead, we simply pass 0:

df[['chinese', 'math', 'english']].sum(0)

Output:
chinese    312
math       276
english    309
dtype: int64
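
To make the two axis values easier to compare, here is a small sketch that runs both on the same sample data. It also uses the string aliases "index" and "columns", which Pandas accepts in place of 0 and 1. The variable names are our own.

```python
import pandas as pd

# Sample data copied from the tables in this article.
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})
scores = df[["chinese", "math", "english"]]

# axis=1: collapse the columns, leaving one total per row (per student).
per_student = scores.sum(axis=1)

# axis=0: collapse the rows, leaving one total per column (per subject).
per_subject = scores.sum(axis=0)

# The string aliases mean the same thing as the numbers.
per_student_alias = scores.sum(axis="columns")
per_subject_alias = scores.sum(axis="index")

print(per_student)
print(per_subject)
```

A handy way to remember it: the axis you pass is the one that disappears from the result.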

Now that we have the total scores, the math teacher or the Chinese teacher will want to know the class average for their subject. Again, we can calculate it very quickly:

df['math'].mean()  # Method 1: use the mean method provided by Pandas

df['math'].sum() / df.shape[0]  # Method 2: compute the sum, then divide by the number of rows

Output:
92.0

The example above uses the DataFrame's shape attribute. It is a tuple holding the number of rows and columns of the DataFrame: position 0 is the number of rows and position 1 is the number of columns. Note that the column count does not include the index.

The example above only calculates the average math score. Interested readers can calculate the English and Chinese averages on their own.
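
As one possible answer to the exercise, the sketch below reads shape and then takes the mean of all three subjects in one call. It rebuilds the sample data without the sum column, so the column count here is 4. The variable names are our own.

```python
import pandas as pd

# Sample data copied from the tables in this article (no 'sum' column here).
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})

# shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = df.shape

# mean() on several columns returns one average per column.
averages = df[["chinese", "math", "english"]].mean()

# The manual calculation from the article, for comparison.
manual_math_mean = df["math"].sum() / n_rows

print(n_rows, n_cols)
print(averages)
```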

2. Statistics

Next, suppose we want to see how the math scores are distributed in the class. With Pandas we can easily get the lowest and highest math scores, along with other statistics:

df['math'].min()  # minimum of the math column
Output: 80

df['math'].max()  # maximum of the math column
Output: 104

df['math'].quantile([0.3, 0.4, 0.5])  # 30%, 40%, and 50% quantiles of the math column
Output:
0.3    87.2
0.4    89.6
0.5    92.0
Name: math, dtype: float64

df['math'].std()  # standard deviation of the math column
Output: 12.0

df['math'].var()  # variance of the math column
Output: 144.0

df['math'].mean()  # mean of the math column
Output: 92.0

df['math'].median()  # median of the math column
Output: 92.0

df['math'].mode()  # mode of the math column; returns a Series because ties are possible (here every score appears exactly once, so all of them are returned)
Output:
0     80
1     92
2    104
dtype: int64
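
A few of these numbers are easier to trust once we reproduce them by hand. The sketch below relies on two Pandas defaults: std and var use the sample formula (ddof=1), and quantile uses linear interpolation. The second mode call uses made-up scores with a repeated 92, just to show what happens when there is no tie.

```python
import pandas as pd

math_scores = pd.Series([80, 92, 104], name="math")

# std() divides by n - 1 by default (ddof=1); ddof=0 gives the population value.
sample_std = math_scores.std()
population_std = math_scores.std(ddof=0)

# Linear interpolation: the q-quantile sits at position q * (n - 1)
# in the sorted values, and is interpolated between its two neighbours.
q = 0.3
ordered = math_scores.sort_values().to_list()
position = q * (len(ordered) - 1)   # 0.6
low = int(position)                 # index of the lower neighbour
fraction = position - low
manual_q30 = ordered[low] + fraction * (ordered[low + 1] - ordered[low])

# mode() returns every value that shares the highest count.
tied_mode = math_scores.mode()                        # all three scores
single_mode = pd.Series([80, 92, 92, 104]).mode()     # hypothetical data

print(sample_std, population_std)
print(manual_q30, math_scores.quantile(q))
print(tied_mode.tolist(), single_mode.tolist())
```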

We can also use the DataFrame's describe method to view basic statistics for every numeric column at once:

df.describe()

Output:
          chinese     english   math         sum
count    3.000000    3.000000    3.0    3.000000
mean   104.000000  103.000000   92.0  299.000000
std      6.244998   25.632011   12.0   39.949969
min     99.000000   79.000000   80.0  273.000000
25%    100.500000   89.500000   86.0  276.000000
50%    102.000000  100.000000   92.0  279.000000
75%    106.500000  115.000000   98.0  312.000000
max    111.000000  130.000000  104.0  345.000000
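
The result of describe is itself a DataFrame, so single values can be read from it with loc. As a small sketch, we can also pass the percentiles argument to replace the default 25%, 50%, and 75% rows with the quantiles used earlier. The variable names are our own.

```python
import pandas as pd

# Sample data copied from the tables in this article.
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})

# Default summary: count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
math_mean = summary.loc["mean", "math"]
math_q25 = summary.loc["25%", "math"]

# Custom percentiles: the rows are labelled '30%', '40%', '50%'.
custom = df.describe(percentiles=[0.3, 0.4, 0.5])
math_q30 = custom.loc["30%", "math"]

print(summary)
print(custom)
```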

3. Sorting

A score table is usually ordered by total score, from highest to lowest:

df = df.sort_values(by='sum', ascending=False)


Output:
        name  chinese  english  math  sum
2  HanMeiNei      111      130   104  345
0   XiaoMing       99      100    80  279
1      LiHua      102       79    92  273

As you can see, we sort the DataFrame with the sort_values method. The by argument is set to 'sum', so the rows are sorted by the 'sum' column. The ascending argument controls the direction: False for descending, True (the default) for ascending. By default sort_values returns a new DataFrame and leaves the original untouched, which is why the example above assigns the result back to df. If we do not want to create a new DataFrame, we can pass inplace=True to modify the original directly:

df.sort_values(by='sum', ascending=False, inplace=True)
print(df)

Output:
        name  chinese  english  math  sum
2  HanMeiNei      111      130   104  345
0   XiaoMing       99      100    80  279
1      LiHua      102       79    92  273
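
The difference between the two styles is easy to check. In this sketch the plain call leaves df in its original order, while the inplace=True call reorders df itself and returns None, so its result should not be assigned back. The variable names are our own.

```python
import pandas as pd

# Sample data copied from the tables in this article.
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})
df["sum"] = df[["chinese", "math", "english"]].sum(axis=1)

# Default behaviour: a new, sorted DataFrame is returned.
sorted_copy = df.sort_values(by="sum", ascending=False)
order_after_copy = df["name"].tolist()   # df itself is still unsorted

# inplace=True: df is reordered and the call returns None.
result = df.sort_values(by="sum", ascending=False, inplace=True)
order_after_inplace = df["name"].tolist()

print(order_after_copy)
print(order_after_inplace)
print(result)
```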

Careful readers may have noticed that sorting moves the rows around but leaves each row's index value unchanged. Because we are using the default ascending integer index, the index looks out of order after sorting. Don't worry, we can reset the index:

df = df.sort_values(by='sum', ascending=False).reset_index()

Output:
   index       name  chinese  english  math  sum
0      2  HanMeiNei      111      130   104  345
1      0   XiaoMing       99      100    80  279
2      1      LiHua      102       79    92  273

After calling reset_index, the index of our DataFrame is indeed reset to an increasing sequence, but the old index has also been kept as an extra column named index. If we do not need it, we can pass drop=True to discard the original index instead of inserting it as a column:

df = df.sort_values(by='sum', ascending=False).reset_index(drop=True)

Output:
        name  chinese  english  math  sum
0  HanMeiNei      111      130   104  345
1   XiaoMing       99      100    80  279
2      LiHua      102       79    92  273
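
Here is a side-by-side sketch of the two reset_index calls. It also shows a one-step alternative that the article does not use: the ignore_index=True argument of sort_values, which assumes pandas 1.0 or later. The variable names are our own.

```python
import pandas as pd

# Sample data copied from the tables in this article.
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})
df["sum"] = df[["chinese", "math", "english"]].sum(axis=1)

ranked = df.sort_values(by="sum", ascending=False)

# Without drop: the old index is kept as a new column named 'index'.
kept = ranked.reset_index()

# With drop=True: the old index is thrown away.
dropped = ranked.reset_index(drop=True)

# Assumption: pandas >= 1.0, where sort_values accepts ignore_index.
one_step = df.sort_values(by="sum", ascending=False, ignore_index=True)

print(kept)
print(dropped)
```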

To make the ranking easier to read, we can add 1 to the index so that it shows each student's rank:

df.index += 1

Output:
        name  chinese  english  math  sum
1  HanMeiNei      111      130   104  345
2   XiaoMing       99      100    80  279
3      LiHua      102       79    92  273
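
Shifting the index works well when every total is different. If two students could share a total, one alternative (not used in this article) is the rank method of Series, sketched below with method="min" so that tied students would get the same rank. The rank column name is our own choice.

```python
import pandas as pd

# Sample data copied from the tables in this article.
df = pd.DataFrame({
    "name": ["XiaoMing", "LiHua", "HanMeiNei"],
    "chinese": [99, 102, 111],
    "english": [100, 79, 130],
    "math": [80, 92, 104],
})
df["sum"] = df[["chinese", "math", "english"]].sum(axis=1)

# The article's approach: sort, reset the index, then shift it to start at 1.
ranked = df.sort_values(by="sum", ascending=False).reset_index(drop=True)
ranked.index += 1

# Alternative: store the rank in a column without reordering the rows.
# rank() returns floats, so we convert to int for display.
df["rank"] = df["sum"].rank(ascending=False, method="min").astype(int)

print(ranked)
print(df)
```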
