Introduction

Pandas has a special data type called category. It represents categorical values, typically used for statistical classifications such as sex, blood type, class, or rank. It is somewhat like an Enum in Java.

Today I’m going to explain the use of category in detail.

Create a category

Create using Series

To create a category, pass dtype="category" when creating a Series. A category is described by two things: its order (whether the categories are ordered) and its literals (the category values themselves):

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
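
To make the two parts more concrete, here is a small sketch that looks inside the Series created above. It only uses the .cat accessor; the variable names are mine and not part of the original example.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct literals are stored once, as an Index
categories = list(s.cat.categories)

# Each element is stored as an integer position into `categories`
codes = s.cat.codes.tolist()

# Without further options, the category is unordered
ordered = s.cat.ordered

print(categories)  # ['a', 'b', 'c']
print(codes)       # [0, 1, 2, 0]
print(ordered)     # False
```

Storing small integer codes instead of repeated strings is why a category column is usually more compact than an object column with few distinct values.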

You can also convert an existing column of a DataFrame to a category:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df["B"]
Out[5]: 
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

We can also create a pandas.Categorical and pass it to the Series constructor. Values that are not in the given categories become NaN:

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
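
The NaN rows above come from values ("a") that are not in the declared categories. A minimal sketch of how this shows up in the underlying codes, with variable names chosen only for illustration:

```python
import pandas as pd

raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)

# Values outside the declared categories get the code -1
codes = raw_cat.codes.tolist()

# ... and they are reported as missing once wrapped in a Series
missing = pd.Series(raw_cat).isna().tolist()

print(codes)    # [-1, 0, 1, -1]
print(missing)  # [True, False, False, True]
```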

Create using DF

When creating a DataFrame, you can also pass dtype="category":

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]: 
A    category
B    category
dtype: object

Both A and B in the DataFrame are categories, and each column infers its own set of categories:

In [19]: df["A"]
Out[19]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Or use df.astype("category") to convert every column of an existing DataFrame to a category:

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]: 
A    category
B    category
dtype: object

Controlling creation

By default, passing dtype="category" creates a category with the following defaults:

  1. Categories are inferred from the data.
  2. Categories are unordered.

You can modify the above two defaults by creating a CategoricalDtype:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']
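
A CategoricalDtype is an ordinary object that can be compared and inspected. The sketch below checks the dtype of the Series from the example above; unordered_type is a hypothetical second dtype added only for the comparison.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = pd.Series(["a", "b", "c", "a"]).astype(cat_type)

# The Series carries exactly the dtype we passed in
same_dtype = s_cat.dtype == cat_type

# Every CategoricalDtype also compares equal to the string "category"
is_category = s_cat.dtype == "category"

# Same categories but a different ordered flag is a different dtype
unordered_type = CategoricalDtype(categories=["b", "c", "d"], ordered=False)
differs = s_cat.dtype != unordered_type

print(same_dtype, is_category, differs)  # True True True
```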

A CategoricalDtype can be applied to a whole DataFrame in the same way, so that every column shares the same categories:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
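
If only some columns should become categories, astype also accepts a dict that maps column names to dtypes. This is a small sketch of that variant; df_partial and the two flags are names I made up for the illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

# Convert column A only; column B keeps its original dtype
df_partial = df.astype({"A": cat_type})

a_is_cat = isinstance(df_partial["A"].dtype, CategoricalDtype)
b_is_cat = isinstance(df_partial["B"].dtype, CategoricalDtype)

print(a_is_cat, b_is_cat)  # True False
```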

Convert to the original type

To convert a category back to its original type, use Series.astype(original_dtype) or np.asarray(categorical):

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]: 
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]: 
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
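
The same round trip works for numeric categories. A minimal sketch, assuming the category values are integers:

```python
import numpy as np
import pandas as pd

s_num = pd.Series([1, 2, 3, 1], dtype="category")

# Back to a plain integer Series
back_to_int = s_num.astype("int64")

# Or to a NumPy array holding the category values themselves
as_array = np.asarray(s_num)

print(back_to_int.dtype)   # int64
print(as_array.tolist())   # [1, 2, 3, 1]
```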

Working with categories

Getting category properties

Categorical data has categories and ordered properties, available through s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

The order of the categories can be set explicitly when the categorical is created:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

Rename categories

In older pandas versions, such as the one used here, we can rename categories by assigning to s.cat.categories; newer releases expect rename_categories instead:

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [68]: s
Out[68]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

Use rename_categories to achieve the same effect:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or use a dictionary object:

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
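
Besides a list or a dict, rename_categories also accepts a callable that is applied to each existing category. The sketch below reproduces the "Group ..." renaming from earlier without assigning to s.cat.categories; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The callable receives each old category and returns the new one
renamed = s.cat.rename_categories(lambda c: f"Group {c}")

print(list(renamed.cat.categories))  # ['Group a', 'Group b', 'Group c']
print(renamed.tolist())              # ['Group a', 'Group b', 'Group c', 'Group a']
```

Renaming only changes the labels: the underlying integer codes stay the same.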

Use add_categories to add categories

You can add categories using add_categories:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]
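
One practical reason to add a category first is that a categorical refuses values it does not know. This sketch works on a pandas.Categorical directly and uses a made-up new value "w"; the exact exception type has varied between pandas versions, so both likely ones are caught.

```python
import pandas as pd

cat = pd.Categorical(["x", "y", "z", "x"])

# Assigning a value that is not a category is rejected
try:
    cat[0] = "w"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# After adding the category, the same assignment is allowed
cat = cat.add_categories(["w"])
cat[0] = "w"
values = list(cat)

print(rejected)  # True
print(values)    # ['w', 'y', 'z', 'x']
```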

Use remove_categories to delete categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Delete unused categories

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]: 
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
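
Unused categories typically appear after filtering, because filtering rows does not touch the list of categories. A short sketch with made-up data:

```python
import pandas as pd

s = pd.Series(list("aabbc"), dtype="category")

# Dropping the "c" rows keeps "c" as a (now unused) category
filtered = s[s != "c"]
before = list(filtered.cat.categories)

# remove_unused_categories returns a new Series without it
trimmed = filtered.cat.remove_unused_categories()
after = list(trimmed.cat.categories)

print(before)  # ['a', 'b', 'c']
print(after)   # ['a', 'b']
```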

Reset categories

Use set_categories() to add and remove categories at the same time:

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

Sorting categories

If a category was created with ordered=True, its values are sorted by category order, and min() and max() are available:

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')
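
In the example above the category order happens to match alphabetical order, so the effect is easy to miss. Here is a sketch with hypothetical size labels where the two orders differ:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_type)

# Sorting follows the declared category order, not alphabetical order
sorted_sizes = sizes.sort_values().tolist()
smallest, largest = sizes.min(), sizes.max()

# On an unordered category, min() is not defined
try:
    sizes.cat.as_unordered().min()
    min_allowed = True
except TypeError:
    min_allowed = False

print(sorted_sizes)       # ['small', 'small', 'medium', 'large']
print(smallest, largest)  # small large
print(min_allowed)        # False
```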

You can use as_ordered() or as_unordered() to switch a category between ordered and unordered:

In [95]: s.cat.as_ordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Reordering

Categorical.reorder_categories() can be used to reorder existing categories:

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
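
Note that reorder_categories expects exactly the existing categories, only in a different order. A minimal sketch of what happens when one is left out, compared with set_categories:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Leaving out category 1 is an error for reorder_categories
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    reorder_ok = True
except ValueError:
    reorder_ok = False

# set_categories accepts the shorter list and turns the dropped values into NaN
reset = s.cat.set_categories([2, 3], ordered=True)
dropped = reset.isna().tolist()

print(reorder_ok)  # False
print(dropped)     # [True, False, False, True]
```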

Multi-column sorting

sort_values supports sorting by multiple columns, and a categorical column is sorted by its category order:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....: 

In [110]: dfs.sort_values(by=["A", "B"])
Out[110]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparison operation

If ordered=True is set, categories can be compared with each other. The supported operators are ==, !=, >, >=, <, and <=:

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]: 
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]: 
0     True
1    False
2    False
dtype: bool
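
cat_base2 is defined above but never compared. Its categories are inferred from the data (only 2), so they differ from those of cat, and pandas refuses the comparison. A sketch of both cases, with flag names of my own:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Same categories and order: element-wise comparison in category order (3 < 2 < 1)
same = (cat > cat_base).tolist()

# Different categories: the comparison raises instead of guessing
try:
    cat > cat_base2
    comparable = True
except TypeError:
    comparable = False

print(same)        # [True, False, False]
print(comparable)  # False
```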

Other operations

A category column is essentially a Series, so Series operations such as Series.min(), Series.max(), and Series.mode() work on it.

value_counts:

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]: 
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum():

In [133]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )
   .....: 

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: )
   .....: 

In [135]: df.sum(axis=1, level=1)
Out[135]: 
   One  Two  Three
0    3    3      0
1    9    6      0
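
The level argument of sum() belongs to older pandas versions; as far as I know it is no longer available in pandas 2.x. A sketch of an equivalent that groups the transposed frame by the second column level (the name summed is mine):

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Transpose, group the rows by the categorical level, sum, and transpose back.
# observed=False asks groupby to keep categories that have no data.
summed = df.T.groupby(level=1, observed=False).sum().T

print(summed["One"].tolist())  # [3, 9]
print(summed["Two"].tolist())  # [3, 6]
```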

Groupby:

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....: 

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....: 

In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
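
The NaN rows in these results come from categories that have no data. groupby has an observed parameter that controls whether such groups are kept; passing it explicitly also avoids depending on the default, which has changed between pandas versions. A minimal sketch using the first DataFrame:

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every category, including the unused "d"
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True keeps only the categories that occur in the data
seen_only = df.groupby("cats", observed=True)["values"].mean()

print(all_groups.index.tolist())  # ['a', 'b', 'c', 'd']
print(seen_only.index.tolist())   # ['a', 'b', 'c']
```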

Pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4

This article is available at www.flydean.com/08-python-p…
