Author | B. Chen, Compiled by | VK, Source | Towards Data Science

The pandas DataFrame has a built-in method, sort_values(), that sorts rows by the values of a given column. The method itself is fairly simple to use, but on its own it doesn't handle custom orderings such as:

  • T-shirt sizes: XS, S, M, L and XL

  • Months: January, February, March, April, etc

  • Days of the week: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday.

In this article, you'll learn how to apply a custom sort order to a pandas DataFrame.

Check out my GitHub repo for the source code: github.com/BindiChen/m…

The problem

Suppose we have a data set about clothing stores:

import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

We can see that each item of clothing has a size value, and the data should be sorted in the following order:

  • XS stands for extra small

  • S stands for small

  • M stands for medium

  • L is for large

  • XL stands for extra large

However, when sort_values('size') is called, the rows come back in alphabetical order of size: L, M, S, S, XL, XS.

The output is not what we want, but it is technically correct. In fact, sort_values() sorts numeric data in numeric order and object (string) data in alphabetical order.
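The default behavior described above can be reproduced with a minimal sketch. It reuses the clothing data from this article; the variable name default_sorted is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# With plain string values, sort_values() compares the labels as text,
# so the result is alphabetical rather than ordered by actual size.
default_sorted = df.sort_values('size')
print(default_sorted['size'].tolist())
# ['L', 'M', 'S', 'S', 'XL', 'XS']
```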

Here are two common solutions:

  1. Create a new column for a custom sort

  2. Use CategoricalDtype to cast data to an ordered category type

Create a new column for a custom sort

In this solution, we need a mapping DataFrame to represent the custom sort order. We then create a new column based on the mapping, and finally sort the data by the new column. Let's go through an example to see how this works.

First, let's create a mapping DataFrame to represent the custom sort order.

df_mapping = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
})

# Turn the row position into an 'index' column, then key the table by size
sort_mapping = df_mapping.reset_index().set_index('size')

After that, a new column, size_num, is created using the mapping values in sort_mapping.

df['size_num'] = df['size'].map(sort_mapping['index'])

Finally, sort the values by the new column size_num.

df.sort_values('size_num')

That certainly does the job. But it creates an extra column, which can be inefficient when working with large data sets.
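As a variation on the steps above, here is a hedged sketch that swaps the mapping DataFrame for a plain dictionary and drops the helper column after sorting. The names size_rank and sorted_df are hypothetical and do not come from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Hypothetical helper: each size label mapped to its rank in the custom order
size_rank = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

sorted_df = (
    df.assign(size_num=df['size'].map(size_rank))  # temporary sort key
      .sort_values('size_num')                     # sort by the numeric rank
      .drop(columns='size_num')                    # remove the helper column
)
print(sorted_df['size'].tolist())
# ['XS', 'S', 'S', 'M', 'L', 'XL']
```

Because assign() returns a new DataFrame, the original df is left without the extra column, although the intermediate copy still costs memory.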

We can solve this problem more effectively by using a CategoricalDtype.

Use CategoricalDtype to cast data to an ordered category type

CategoricalDtype is a type for categorical data that carries the list of categories and whether they are ordered [1]. It is very useful for creating custom sorts [2]. Let's go through an example to see how this works.

First, let’s import a CategoricalDtype.

from pandas.api.types import CategoricalDtype

Then, create a custom category type, cat_size_order:

  • The first argument is set to ['XS', 'S', 'M', 'L', 'XL'], the unique values of size in the desired order.

  • The second argument, ordered=True, tells pandas to treat the categories as ordered.

cat_size_order = CategoricalDtype(
    ['XS', 'S', 'M', 'L', 'XL'],
    ordered=True
)

astype(cat_size_order) is then called to cast the size data to the custom category type. By running df['size'], we can see that the size column has been converted to a category type with the order [XS < S < M < L < XL].

>>> df['size'] = df['size'].astype(cat_size_order)
>>> df['size']

0     S
1    XL
2     M
3    XS
4     L
5     S
Name: size, dtype: category
Categories (5, object): [XS < S < M < L < XL]

Finally, we can call the same method to sort the values.

df.sort_values('size')
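Putting the steps of this section together, a minimal end-to-end sketch might look like the following. The variable names ascending_sizes and descending_sizes are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Declare the custom order once, then cast the column to it
cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)
df['size'] = df['size'].astype(cat_size_order)

# sort_values() now follows the declared category order in both directions
ascending_sizes = df.sort_values('size')['size'].tolist()
descending_sizes = df.sort_values('size', ascending=False)['size'].tolist()
print(ascending_sizes)   # ['XS', 'S', 'S', 'M', 'L', 'XL']
print(descending_sizes)  # ['XL', 'L', 'M', 'S', 'S', 'XS']
```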

This works better. Let's look at how it works under the hood.

Access the codes with the cat accessor

Now that the size column has been converted to a category type, we can use the .cat accessor to inspect its categorical properties. Behind the scenes, pandas uses the codes attribute to represent each value's position in the ordered categories.

Let's create a new column, codes, so that we can compare the size and code values side by side.

df['codes'] = df['size'].cat.codes
df

We can see that the codes for XS, S, M, L, and XL are 0, 1, 2, 3, and 4, respectively. The codes are the actual values stored for the categories. By running df.info(), we can see that the codes column is actually int8.
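The relationship between codes and categories can be checked with a small sketch. It uses a standalone Series rather than the article's DataFrame, and the names sizes and lookup are illustrative assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)
sizes = pd.Series(['S', 'XL', 'M', 'XS', 'L', 'S']).astype(cat_size_order)

# Each code is the position of the value in the category list
codes = sizes.cat.codes
print(codes.tolist())  # [1, 4, 2, 0, 3, 1]
print(codes.dtype)     # int8

# Rebuild the code -> label table from the categories themselves
lookup = dict(enumerate(sizes.cat.categories))
print(lookup)
# {0: 'XS', 1: 'S', 2: 'M', 3: 'L', 4: 'XL'}
```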

>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----   
 0   cloth_id  6 non-null      int64   
 1   size      6 non-null      category
 2   codes     6 non-null      int8    
dtypes: category(1), int64(1), int8(1)
memory usage: 388.0 bytes
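One consequence of an ordered category type is that comparisons and min/max also follow the declared order, not the alphabet. The sketch below illustrates this on a standalone Series; the names sizes and larger_than_m are illustrative assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)
sizes = pd.Series(['S', 'XL', 'M', 'XS', 'L', 'S']).astype(cat_size_order)

# min() and max() respect the category order
print(sizes.min(), sizes.max())  # XS XL

# Comparing against a label that is one of the categories uses the same order
larger_than_m = sizes > 'M'
print(larger_than_m.tolist())
# [False, True, False, False, True, False]
```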

Sort by multiple variables

Now, let's make things a little more complicated. Here, we will sort the DataFrame by multiple variables.

df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})

Similarly, let's create two custom category types, cat_day_of_week and cat_month, and pass them to astype().

cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    ordered=True
)

cat_month = CategoricalDtype(
    ['Jan', 'Feb', 'Mar', 'Apr'],
    ordered=True,
)

# Cast both columns to their ordered category types
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)

To sort by multiple variables, we simply pass a list of column names to sort_values(). For example, sort by month and day_of_week.

df.sort_values(['month', 'day_of_week'])

Sort by customer_id, month, and day_of_week.

df.sort_values(['customer_id', 'month', 'day_of_week'])
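The multi-column sorts above can be checked with the following sketch, which also shows one possible extension that is not in the original notebook: passing a list to the ascending parameter of sort_values() to mix directions per column. The result variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})

cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True
)
cat_month = CategoricalDtype(['Jan', 'Feb', 'Mar', 'Apr'], ordered=True)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)

# Month first, then day of week, both in their custom order
by_month_day = df.sort_values(['month', 'day_of_week'])
print(by_month_day['order_id'].tolist())
# [1006, 1002, 1003, 1001, 1004, 1007, 1005]

# Customer first, then month and day of week
by_customer = df.sort_values(['customer_id', 'month', 'day_of_week'])
print(by_customer['order_id'].tolist())
# [1006, 1001, 1007, 1005, 1002, 1003, 1004]

# Assumed extension: month ascending, day of week descending
mixed = df.sort_values(['month', 'day_of_week'], ascending=[True, False])
print(mixed['order_id'].tolist())
# [1003, 1002, 1006, 1005, 1007, 1004, 1001]
```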

That’s it. Thank you for reading.

Check out the notebook on my GitHub for the source code: github.com/BindiChen/m…

References

  • [1] pandas.CategoricalDtype API (pandas.pydata.org/pandas-docs…)
  • [2] Pandas Categorical CategoricalDtype tutorial (pandas.pydata.org/pandas-docs…)

Original link: towardsdatascience.com/how-to-do-a…
