Public account: Youerhuts author: Peter editor: Peter

Hello, I’m Peter

In our life, we often encounter various ranking problems: student performance ranking, sales performance ranking, various competition ranking and so on. In a previous ARTICLE on SQL, “Interview Essentials: SQL Rankings and Window Functions,” I mentioned how to use SQL to implement three main ranking methods: sequential ranking, jump ranking, and dense ranking.

Pandas is a powerful library that allows you to quickly implement a variety of rankings, primarily through the Rank function, which is shown in several examples.

Rank parameters

Here are the main arguments to the rank function:

DataFrame.rank(axis=0, 
               method='average', 
               numeric_only=None, 
               na_option='keep', 
               ascending=True, 
               pct=False)
Copy the code

The specific explanation of parameters is as follows:

  • Axis: indicates the axis on which the ranking is based. Axis =0 indicates the horizontal axis and axis=1 indicates the vertical axis
  • Method: The value can be ‘average’, ‘first’, ‘min’, ‘Max’, ‘dense’; The default value is Average
  • Numeric_only: specifies whether to calculate only numeric columns
  • Na_optiaon: Whether the NaN value participates in the ranking and how to rank it. The value can be keep, top, or bottom
  • Ascending: ascending or descending; The default is ascending
  • PCT: Whether to display the ranking as a percentage of the ranking; Percentage of all rankings and maximum rankings

This article will show you how to use the rank function for both Series and DataFrame data types.

The Series number

import pandas as pd
import numpy as np
Copy the code

First let’s simulate a simple piece of data:

Parameter method

Method =”average”;

2, the method = “first”

The values are ranked according to the order in which they appear in the original data, and the same values are ranked by one:

Explain the above two results:

  • First: Directly rank the values in order of size
  • Average: Indicates that if two values are the same, the ranking is the average of them

We see the use of first as the natural order in which the values appear; In the case of the use of Average explained as follows:

-5 is ranked 1.0, 0 is ranked 2.0, 3 is ranked 3.0, 5(index 3 position) is ranked 4.0, 5(index 6 position) is ranked 5.0, 8(index 0 position) is ranked 6.0, and 8(index 2 position) is ranked 7.0

Through the use of average, the rank average of the same value will be taken out. The rank of 5 is unified into 4.5, and the rank of 8 is unified into 6.5

3. Use of Max and min

For example: method= “Max” : If the values are the same, the highest value is used. For example, the largest ranking of 5 is 5, so the ranking of both 5’s in the original data is 5; Both 8’s are ranked 7 (the two 8’s are ranked 6 and 7, so take the highest value 7)

4, method = “dense”

The same number ranks the same, and the next number does not jump

This time when the ranking is not a jump

Parameters of the ascending

The default is ascending, but you can use descending order: The larger the value, the higher the rank:

If method= “first”, the ranking is 1 and 2. If average is used, the ranking will become 1.5. Other values rank similarly. Take a look at Max:

Parameters of PCT

Whether to display the ranking as a percentage of the ranking; Percentage of all rankings and maximum rankings

How was the above ranking calculated? Our biggest ranking is 7:

For example, the PCT parameter in dense situation is similar:

Parameter na_option

This parameter indicates whether null values participate in the ranking. The value can be keep, top, or bottom. Let’s simulate another data with null values:

Look at three different scenarios:

DataFrame ranking

Simulated data

Let’s start with a simulation:

df0 = pd.DataFrame({"Subject": ["Chinese"."Chinese"."Chinese"."Chinese"."Chinese"."Mathematics"."Mathematics"."Mathematics"."Mathematics"."Mathematics"]."Name": ["Xiao Ming"."Little su"."Chou"."Note"."Wang"."Xiao Ming"."Little su"."Chou"."Note"."Wang"]."Score": [137.125.125.115.115.80.111.130.130.140]})

df = df0.copy()   # Generate a copy of df
df
Copy the code

Individual Subject Ranking

For example, we want to see the ranking of Chinese and take out students’ Chinese scores:

Use sequential ranking, jump ranking, and dense ranking to show the ranking:

# Default ranking

df1["Mean ranking _ Default"] = df1["Score"].rank(ascending=False)
df1["Jump rank _min"] = df1["Score"].rank(method="min",ascending=False)
df1["Jumping _max"] = df1["Score"].rank(method="max",ascending=False)
df1["Dense"] = df1["Score"].rank(method="dense",ascending=False)

df1
Copy the code

Total ranking of students

First, generate the total score of each student through the transform:

df["Total"] = df.groupby("Name") ["Score"].transform("sum")
df
Copy the code

We ranked the total scores using an intensive ranking method:

Group retrieves the specified ranking

We now see the second place students in each subject, if they have the same score and the same ranking (no jumps), we use dense ranking:

# Define a second function

def rank_second(x) :
    return x[x["Score"].rank(method="dense",ascending=False) = =2]
Copy the code

Let’s look at the real data for the second place students in each subject:

The second function above is a two-step custom;

1. Implement dense ranking first

2. Specify a rank equal to 2

When we use this custom function, we need to first group according to the subject, and then use this custom function in each group separately, to get the second place under each subject.

conclusion

After explaining the use of rank function, can be compared to SQL window functions:

  • Row_number: order ranking, method=first in the rank function
  • Rank: jump rank (method=min
  • Dense_rank: dense rank, method=dense

Rank function (rank function) :

Pandas.pydata.org/docs/refere…