This article is participating in Python Theme Month. See the link for details

Introduction

The DataFrame type in Pandas supports groupby operations similar to those on database tables. Generally speaking, a groupby operation can be divided into three parts: splitting the data, applying a function, and combining the results.

This article explains the groupby operation in Pandas in detail.

Splitting the data

The purpose of splitting is to divide the DataFrame into groups. To perform a groupby operation, specify the corresponding label columns when creating the DataFrame:

In [60]: df = pd.DataFrame(
   ....:     {
   ....:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ....:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ....:         "C": np.random.randn(8),
   ....:         "D": np.random.randn(8),
   ....:     }
   ....: )

In [61]: df
Out[61]: 
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380  0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, groupby operates along axis 0, i.e. it groups rows. A group can be defined by a single column or by multiple columns:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

Multiple indexes

Since version 0.24, if we have a multiple index, we can select specific index levels to group by:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938
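For a single level, passing the level name to groupby directly gives the same grouping as the `index.names.difference(...)` expression above. A minimal self-contained sketch (the frame and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical two-level frame (values invented)
df2 = pd.DataFrame(
    {"C": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [("bar", "one"), ("bar", "two"), ("foo", "one"), ("foo", "two")],
        names=["A", "B"],
    ),
)

# Group by the "A" level only, ignoring "B"
by_a = df2.groupby(level="A").sum()
```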

get_group

The get_group method retrieves the data of a single group:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2
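When grouping by multiple columns, get_group expects a tuple as the key. A small sketch, rebuilding df3 so the snippet is self-contained:

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# Single grouping column: the key is a scalar
group_a = df3.groupby("X").get_group("A")

# Multiple grouping columns: the key is a tuple
group_b4 = df3.groupby(["X", "Y"]).get_group(("B", 4))
```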

dropna

NaN values in the group keys are excluded by groupby by default. Setting dropna=False keeps them:

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

The groups attribute

The groupby object has a groups attribute, a dictionary whose keys are the computed group labels and whose values are the row labels belonging to each group.

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

Hierarchical indexes

For objects with a multilevel index, groupby can group by a specific index level:

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group by the first level (level=0):

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Group by the second level:

In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

Iterating over groups

Once we have the groupby object, we can iterate over the groups with a for loop:

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by multiple columns, the group name is a tuple:

In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652

Aggregation operations

After grouping, you can aggregate:

In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

For multi-column groups, the result has a multi-level index by default. If you want a fresh, flat index instead, pass as_index=False:

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

The above is equivalent to calling reset_index on the result:

In [74]: df.groupby(["A", "B"]).sum().reset_index()

grouped.size() counts the size of each group:

In [75]: grouped.size()
Out[75]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2
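Note that size() counts every row in the group, NaN included, while count() counts only non-NaN values per column. A sketch with an invented frame to show the difference:

```python
import numpy as np
import pandas as pd

# Illustrative frame (invented): one NaN in the "val" column
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, np.nan, 3.0]})

sizes = df.groupby("key").size()           # all rows, NaN included
counts = df.groupby("key")["val"].count()  # non-NaN values only
```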

grouped.describe() produces descriptive statistics for each group:

In [76]: grouped.describe()
Out[76]: 
      C                                          ...         D                                                  
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...

[6 rows x 16 columns]

Common aggregation methods

The following are commonly used aggregation methods:

function     description
mean()       mean of group values
sum()        sum of group values
size()       group sizes
count()      count of values in each group
std()        standard deviation of group values
var()        variance of group values
sem()        standard error of the mean of group values
describe()   descriptive statistics of group values
first()      first of group values
last()       last of group values
nth()        nth group value
min()        minimum of group values
max()        maximum of group values
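A few of these methods behave differently from ordinary aggregations: first(), last() and nth() select values rather than reduce them. A sketch with an invented frame:

```python
import pandas as pd

# Invented frame to demonstrate first(), last() and nth()
df = pd.DataFrame({"g": ["x", "x", "y", "y", "y"], "v": [1, 2, 3, 4, 5]})
gv = df.groupby("g")["v"]

first_vals = gv.first()   # first value in each group
last_vals = gv.last()     # last value in each group
second_vals = gv.nth(1)   # the row at position 1 within each group
```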

Use multiple aggregation methods simultaneously

Multiple aggregation methods can be specified simultaneously:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

The resulting columns can be renamed:

In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
Copy the code

NamedAgg

NamedAgg allows aggregations to be defined more precisely: its two fields, column and aggfunc, specify the column to aggregate and the function to apply.

In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Or just use a tuple:

In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Different aggregation methods for different columns

By passing a dictionary to the agg method, you can apply different aggregations to different columns:

In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

Transformation operations

A transformation converts an object into an object of the same size, aligned with the original rows. Data transformation is often needed in the course of data analysis.

The transform method can take a lambda expression:

In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
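The ts example above assumes a time series defined earlier; an equivalent self-contained sketch on an invented frame shows the key property of transform, namely that the result is aligned with the original rows:

```python
import pandas as pd

# Invented frame: two groups of two values each
df = pd.DataFrame({"year": [2020, 2020, 2021, 2021],
                   "v": [1.0, 5.0, 2.0, 10.0]})

# every row receives its own group's max - min
spread = df.groupby("year")["v"].transform(lambda x: x.max() - x.min())
```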

Filling NA values:

In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

Filtering operation

The filter method can use a lambda expression to filter out unwanted groups:

In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]: 
3    3
4    3
5    3
dtype: int64
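filter works on DataFrames as well: groups for which the function returns False are dropped wholesale. A sketch with an invented frame:

```python
import pandas as pd

dff = pd.DataFrame({"A": ["x", "x", "y", "y", "y"], "B": [1, 2, 3, 4, 5]})

# keep only the groups that have more than two rows
big = dff.groupby("A").filter(lambda g: len(g) > 2)
```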

The apply operation

Some operations don't fit into either aggregation or transformation. Pandas provides the apply method for such flexible, general-purpose operations.

In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

apply can also take an externally defined function:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

This article is available at www.flydean.com/11-python-p…

The most accessible explanations, the most in-depth material, and the most concise tutorials, with many tips you didn't know waiting for you to discover!