This article introduces a very useful method of the Pandas library: assign.

assign is very convenient when we need to compute a new column from an existing one for later use. The following examples illustrate how the function is used.

Pandas articles

This is the 21st article in the Pandas series, which is divided into three categories:

Basic operations in Pandas (chapters 1 to 16): the basic and common operations, such as creating data, retrieving and querying data, ranking and sorting, and handling missing or duplicate values

Advanced operations in Pandas (from chapter 17 onward)

Learning Pandas by comparing its operations with SQL

Parameters

assign takes only one parameter: DataFrame.assign(**kwargs).

**kwargs: dict of {str: callable or Series}

A few notes on the parameters:

  • The keywords are the column names
  • If a value is callable, it is evaluated on the DataFrame and the result is assigned to the new column
  • If a value is not callable (for example, a Series, a scalar, or an array), it is assigned directly

Finally, the function returns a new DataFrame containing all the existing columns plus the newly generated ones.
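The notes above can be sketched in one small example. The column names from_callable, from_series, from_scalar, from_list, and "col 3" are made up for illustration, and the dict-unpacking form at the end is a general Python idiom for keyword names that are not valid identifiers rather than something specific to this article.

```python
import pandas as pd

df = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

result = df.assign(
    from_callable=lambda x: x["col1"] * 2,  # callable: evaluated on the DataFrame
    from_series=df["col1"] + 1,             # Series: assigned as-is
    from_scalar=0,                          # scalar: repeated on every row
    from_list=[1, 2, 3],                    # list: length must match the row count
)

# Column names that are not valid Python identifiers can be passed by unpacking a dict.
result2 = df.assign(**{"col 3": lambda x: x["col1"] / 2})

print(result)
print(list(result2.columns))
print(list(df.columns))  # the original DataFrame still has only col1 and col2
```

Every call returns a new DataFrame, so result, result2, and df are three separate objects.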

Import libraries

import pandas as pd
import numpy as np
# Sample data

df = pd.DataFrame({
    "col1": [12, 16, 18],
    "col2": ["xiaoming", "peter", "mike"]})

df

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike

Examples

When the value is callable, it is evaluated directly on the DataFrame:

Method 1: Pass a callable that is evaluated on the DataFrame

# Method 1: pass a callable that is evaluated on the DataFrame df
# Generate col3 from the col1 column of df

df.assign(col3=lambda x: x.col1 / 2 + 20)  

col1 col2 col3
0 12 xiaoming 26.0
1 16 peter 28.0
2 18 mike 29.0

Looking at the original df, we can see that it is unchanged:

df  # the original DataFrame is unchanged

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike

Manipulating string data:

df.assign(col3=df["col2"].str.upper())

Method 2: Pass an existing Series

The same result can be achieved by referring directly to an existing Series:

# Method 2: compute the new column from an existing Series

df.assign(col4=df["col1"] * 3 / 4 + 25)

df  # the original data remains unchanged

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike

In Python 3.6+, we can create multiple columns in a single assign call, and a column can depend on another column defined earlier in the same call. The intermediate columns can be used directly:

df.assign(
    col5=lambda x: x["col1"] / 2 + 10,
    col6=lambda x: x["col5"] * 5,       # col5 is used directly in the col6 calculation
    col7=lambda x: x.col2.str.upper(),
    col8=lambda x: x.col7.str.title()   # col7 is used in col8
)

df  # the original data remains unchanged

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike
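One detail worth spelling out: a column created earlier in the same call is only visible to a callable. The sketch below, which rebuilds the sample df so that it runs on its own, contrasts the lambda form with a direct reference to df["col5"]; the variable names ok and failed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

# Works: the lambda receives the intermediate DataFrame, which already contains col5.
ok = df.assign(
    col5=lambda x: x["col1"] / 2 + 10,
    col6=lambda x: x["col5"] * 5,
)

# Fails: df["col5"] is evaluated before assign runs, and df has no col5 column.
try:
    df.assign(
        col5=lambda x: x["col1"] / 2 + 10,
        col6=df["col5"] * 5,
    )
    failed = False
except KeyError:
    failed = True

print(ok)
print(failed)
```

This is why the dependent columns in the example above are all written as lambdas.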

If we assign to an existing column name, the values of that column are overwritten in the returned DataFrame:

df.assign(col1=df["col1"] / 2)  # col1 is directly overwritten

col1 col2
0 6.0 xiaoming
1 8.0 peter
2 9.0 mike
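Because assign returns a new DataFrame, the overwritten column is lost unless the result is bound to a name. A minimal sketch, using the same sample data and an illustrative variable name halved:

```python
import pandas as pd

df = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

# A new DataFrame in which col1 is overwritten; the column order is preserved.
halved = df.assign(col1=lambda x: x["col1"] / 2)

print(halved["col1"].tolist())
print(df["col1"].tolist())  # the original values are still there

# To keep the change under the same name, rebind the variable explicitly.
df = df.assign(col1=lambda x: x["col1"] / 2)
print(df["col1"].tolist())
```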

Comparison with apply

We can achieve the same result with the apply function in pandas:

df  # the original data

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike

We first create copies so that we can operate on them directly:

df1 = df.copy()  # create a copy and operate directly on the copy
df2 = df.copy()

df1

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike
df1.assign(col3=lambda x: x.col1 / 2 + 20)  

col1 col2 col3
0 12 xiaoming 26.0
1 16 peter 28.0
2 18 mike 29.0
df1  # df1 remains the same

col1 col2
0 12 xiaoming
1 16 peter
2 18 mike
df1["col3"] = df1["col1"].apply(lambda x: x / 2 + 20)

df1  # df1 has changed

col1 col2 col3
0 12 xiaoming 26.0
1 16 peter 28.0
2 18 mike 29.0

We find that with assign the original data remains the same, whereas writing the result of apply back with df1["col3"] = ... changes the DataFrame in place.
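To make the comparison more precise, the sketch below separates the two steps of the apply approach. It rebuilds the sample df so that it runs on its own, and the names col3, columns_after_apply, and with_col3 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

# apply on its own returns a new Series; df is untouched at this point.
col3 = df["col1"].apply(lambda v: v / 2 + 20)
columns_after_apply = list(df.columns)

# The two can be combined: assign puts the result of apply into a new DataFrame.
with_col3 = df.assign(col3=lambda x: x["col1"].apply(lambda v: v / 2 + 20))

# Only this bracket assignment modifies df in place.
df["col3"] = col3

print(columns_after_apply)
print(list(with_col3.columns))
print(list(df.columns))
```

In other words, the in-place change comes from the bracket assignment, not from apply itself.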

BMI

Finally, we use simulated data to calculate each person's BMI.

Body mass index, or BMI, is an internationally used measure of how fat or thin a person is and whether their weight is healthy.


$$\mathrm{BMI} = \frac{\text{weight}}{\text{height}^2}$$

Weight is measured in kg and height in m.
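Before applying the formula to a whole DataFrame, it can help to check it on a single value. The function bmi below is a hypothetical helper written for this illustration, and 78 kg and 1.82 m are sample inputs.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return the body mass index for a weight in kilograms and a height in meters."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


value = bmi(78, 1.82)
print(round(value, 2))  # 23.55
```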

df2 = pd.DataFrame({
    "name": ["xiaoming", "xiaohong", "xiaosu"],
    "weight": [78, 65, 87],
    "height": [1.82, 1.75, 1.89]
})

df2

name weight height
0 xiaoming 78 1.82
1 xiaohong 65 1.75
2 xiaosu 87 1.89
Use assign

df2.assign(BMI=df2["weight"] / (df2["height"] ** 2))

df2 # stays the same

name weight height
0 xiaoming 78 1.82
1 xiaohong 65 1.75
2 xiaosu 87 1.89
df2["BMI"] = df2["weight"] / (df2["height"] ** 2)

df2  # df2 generates a new column: BMI

Conclusion

From the examples above, we find that:

  1. assign does not change the original DataFrame; it returns a new DataFrame
  2. assign can create multiple columns at the same time, and intermediate columns can be used directly by later ones
  3. The main difference between the assign approach and the apply approach shown here is that assign leaves the original data unchanged, whereas writing the result of apply back with df["col"] = ... adds the new column to the original data