16 Pandas functions that can boost your data-cleaning skills 100-fold

Overview of this article

Have you ever wondered why the data you are handed is always so messy?

As a data analyst, you cannot avoid data cleaning. Sometimes the data is so messy that processing it takes a lot of time, so the more cleaning techniques you know, the more efficient you become.

This article uses the vectorized string functions that Pandas exposes through the str accessor. They are very useful for cleaning data.

One data set, 16 Pandas functions

Huang put this data set together by hand, purely as a learning aid. The data set is as follows:

import pandas as pd

# The first name begins with a space; 千 in the Income column means "thousand"
data = {
    'Name': [' Classmate Huang', 'Huang Zhi Zun', 'Huang Lao Xie', 'Chen Da-mei', 'Sun Shangxiang'],
    'English name': ['Huang tong_xue', 'huang zhi_zun', 'Huang Lao_xie', 'Chen Da_mei', 'sun shang_xiang'],
    'Gender': ['male', 'women', 'men', 'woman', 'male'],
    'Identity card': ['463895200003128433', '429475199912122345', '420934199110102311', '431085200005230122', '420953199509082345'],
    'Height': ['mid:175_good', 'low:165_bad', 'low:159_bad', 'high:180_verygood', 'low:172_bad'],
    'Home address': ['Guangshui, Hubei', 'Xinyang, Henan', 'Guilin, Guangxi', 'Xiaogan hubei', 'Guangzhou, Guangdong'],
    'Phone number': ['13434813546', '19748672895', '16728613064', '14561586431', '19384683910'],
    'Income': ['11000', '8.5千', '09000', '6.5千', '20000'],
}
df = pd.DataFrame(data)
df

The results are as follows:

As you can see, this data set is messy. We will use 16 Pandas functions to clean it.

① cat: Concatenates strings

df["Name"].str.cat(df["Home address"], sep='-'*3)

The results are as follows:
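As a minimal sketch on two small stand-alone Series (the sample values and variable names are assumptions, not the article's full data set), the example below shows str.cat joining values element by element. It also shows one detail worth knowing: a missing value on either side makes the result missing unless na_rep is given.

```python
import pandas as pd

# Hypothetical sample values for illustration only
names = pd.Series(["Huang Lao Xie", "Chen Da-mei"])
addresses = pd.Series(["Guilin, Guangxi", None])

# Element-wise concatenation with a three-dash separator
joined = names.str.cat(addresses, sep="-" * 3)

# na_rep substitutes a placeholder for missing values instead of propagating them
filled = names.str.cat(addresses, sep="-" * 3, na_rep="unknown")

print(joined.tolist())
print(filled.tolist())
```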

② contains: Checks whether a string contains a given substring

df["Home address"].str.contains("Guang")

The results are as follows:
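The boolean result of str.contains is most useful as a row filter. The sketch below assumes a small stand-alone Series of addresses (hypothetical variable names) and shows a literal match, a filter, and a case-insensitive match.

```python
import pandas as pd

# Hypothetical sample values for illustration only
addresses = pd.Series(["Guangshui, Hubei", "Xinyang, Henan", "Guilin, Guangxi"])

# regex=False treats the pattern as a plain substring rather than a regular expression
mask = addresses.str.contains("Guang", regex=False)

# The boolean mask can be used directly to keep only the matching rows
matches = addresses[mask]

# case=False ignores upper/lower case differences
in_hubei = addresses.str.contains("hubei", case=False)

print(mask.tolist())
print(matches.tolist())
```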

③ startswith/endswith: Checks whether a string starts or ends with a given substring

# Note: the first name in the data set begins with a space
df["Name"].str.startswith("Huang")
df["English name"].str.endswith("e")

The results are as follows:
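A leading space is enough to make startswith return False, which is why whitespace matters here. The following sketch uses made-up names (an assumption for illustration) to show the effect and one way around it.

```python
import pandas as pd

# Hypothetical sample values; the first one starts with a space on purpose
names = pd.Series([" Huang Tongxue", "Huang Zhi Zun", "Chen Da-mei"])

# The leading space means the first value does not start with "Huang"
starts = names.str.startswith("Huang")

# Stripping whitespace first gives the result most readers would expect
starts_clean = names.str.strip().str.startswith("Huang")

ends = names.str.endswith("mei")

print(starts.tolist())
print(starts_clean.tolist())
```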

④ count: Counts the number of occurrences of a given character in a string

df["Phone number"].str.count("3")

The results are as follows:

⑤ get: Gets the element at the specified position (a character of a string, or an item of a list produced by split)

df["Name"].str.get(-1)
df["Height"].str.split(":")
df["Height"].str.split(":").str.get(0)

The results are as follows:
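To make the two uses of str.get concrete, here is a short sketch on a stand-alone Series (variable names are assumptions). It indexes into strings, then into the lists produced by split, and shows that an out-of-range position yields a missing value rather than an error.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# On strings, get returns the character at that position (negative indexes count from the end)
last_char = heights.str.get(-1)

# On lists (here produced by split), get returns the list item at that position
parts = heights.str.split(":")
labels = parts.str.get(0)

# A position that does not exist gives NaN instead of raising an IndexError
missing = parts.str.get(5)

print(last_char.tolist())
print(labels.tolist())
```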

⑥ len: Calculates the length of the string

df["Gender"].str.len()

The results are as follows:

⑦ upper/lower: Converts English text to upper or lower case

df["English name"].str.upper()
df["English name"].str.lower()

The results are as follows:

⑧ pad (with the side parameter)/center: Pads a string with a given character on the left, the right, or both sides

df["Home address"].str.pad(10, fillchar="*")      # side="left" is the default; equivalent to rjust()
df["Home address"].str.pad(10, side="right", fillchar="*")    # equivalent to ljust()
df["Home address"].str.center(10, fillchar="*")

The results are as follows:
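The naming is easy to mix up: padding on the left right-justifies the text, so pad with side="left" matches rjust, and side="right" matches ljust. The sketch below checks this on two made-up city names (an assumption for illustration).

```python
import pandas as pd

# Hypothetical sample values for illustration only
cities = pd.Series(["Guilin", "Wuhu"])

# Default side="left": fill characters go on the left, text ends up right-justified
left = cities.str.pad(8, fillchar="*")

# side="right": fill characters go on the right, text ends up left-justified
right = cities.str.pad(8, side="right", fillchar="*")

# center splits the fill characters between both sides
centered = cities.str.center(8, fillchar="*")

print(left.tolist())
print(right.tolist())
print(centered.tolist())
```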

⑨ repeat: Repeats the string a given number of times

df["Gender"].str.repeat(3)

The results are as follows:

⑩ slice_replace: Replaces the characters in a given slice with the given string

df["Phone number"].str.slice_replace(4, 8, "*"*4)

The results are as follows:
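Masking the middle digits of a phone number is a typical use of slice_replace. The sketch below applies it to two of the phone numbers from the data set, held in a stand-alone Series (the variable names are assumptions).

```python
import pandas as pd

phones = pd.Series(["13434813546", "19748672895"])

# Replace positions 4 up to (but not including) 8 with four asterisks
masked = phones.str.slice_replace(4, 8, "*" * 4)

print(masked.tolist())
```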

⑪ replace: Replaces a given substring with another string

df["Height"].str.replace(":", "-")

The results are as follows:

⑫ replace: Replaces text that matches a regular expression with a given string

replace also works well with regular expressions.
Whether or not the example below is practical, the point is that regular expressions make data cleaning much easier.

# regex=True is passed explicitly because str.replace defaults to regex=False since pandas 2.0
df["Income"].str.replace(r"\d+\.\d+", "Regular", regex=True)

The results are as follows:
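For a more practical variant, one possible approach (a sketch, not the article's method) is to pass a function as the replacement so that values such as 8.5千 are converted to plain digits before turning the column into numbers. The pattern and the helper name thousands_to_digits are assumptions introduced for illustration.

```python
import pandas as pd

incomes = pd.Series(["11000", "8.5千", "09000"])

# The article's example: swap any decimal number for a fixed string
replaced = incomes.str.replace(r"\d+\.\d+", "Regular", regex=True)

# Hypothetical helper: turn a match such as "8.5千" (8.5 thousand) into "8500"
def thousands_to_digits(match):
    return str(int(float(match.group(1)) * 1000))

# A callable replacement receives each regex match object
cleaned = incomes.str.replace(r"(\d+(?:\.\d+)?)千", thousands_to_digits, regex=True)

# Once every value is digits only, the column can be converted to numbers
numbers = pd.to_numeric(cleaned)

print(replaced.tolist())
print(numbers.tolist())
```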

⑬ split (with the expand parameter): Very powerful, especially when combined with the join method

# Common usage
df["Height"].str.split(":")
# split with the expand parameter spreads the pieces across columns
df[["Description of height", "Final height"]] = df["Height"].str.split(":", expand=True)
df
# split followed by join
df["Height"].str.split(":").str.join("?"*5)

The results are as follows:
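The difference between the two forms of split is the shape of the result: without expand you get one list per row, with expand=True you get a DataFrame with one column per piece. A minimal sketch, using assumed column names label and detail:

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# Without expand: a Series whose values are Python lists
as_lists = heights.str.split(":")

# With expand=True: a DataFrame with one column per piece (columns 0 and 1)
parts = heights.str.split(":", expand=True)
parts.columns = ["label", "detail"]  # hypothetical column names

# join glues the list items back together with a new separator
rejoined = as_lists.str.join("?" * 5)

print(parts)
print(rejoined.tolist())
```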

⑭ strip/rstrip/lstrip: Removes leading and trailing whitespace, including newlines

df["Name"].str.len()
df["Name"] = df["Name"].str.strip()
df["Name"].str.len()

The results are as follows:
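Comparing lengths before and after is a quick way to confirm that hidden whitespace was actually removed. The sketch below uses two made-up values (an assumption), one with a leading space and one with a trailing newline.

```python
import pandas as pd

# Hypothetical sample values with hidden whitespace
names = pd.Series([" Chen Da-mei", "Sun Shangxiang\n"])

before = names.str.len()

# strip removes whitespace on both ends; lstrip and rstrip handle one end each
stripped = names.str.strip()
left_only = names.str.lstrip()
right_only = names.str.rstrip()

after = stripped.str.len()

print(before.tolist(), after.tolist())
```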

⑮ findall: Uses a regular expression to find every match in a string and returns the matches as a list

findall lets you use regular expressions to pull out just the pieces of text you need.

df["Height"]
df["Height"].str.findall("[a-zA-Z]+")

The results are as follows:
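As a small illustration on a stand-alone Series (variable names are assumptions), findall returns one list per row, containing every non-overlapping match of the pattern:

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# Every run of letters in each string
words = heights.str.findall(r"[a-zA-Z]+")

# Every run of digits in each string
digits = heights.str.findall(r"\d+")

print(words.tolist())
print(digits.tolist())
```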

⑯ extract/extractall: Accepts a regular expression and extracts the matching text (the pattern must contain a capture group in parentheses)

df["Height"].str.extract("([a-zA-Z]+)")
# extractall returns every match, indexed by a MultiIndex
df["Height"].str.extract("([a-zA-Z]+)")  # first match only, for comparison
df["Height"].str.extractall("([a-zA-Z]+)")
# extract with the expand parameter
df["Height"].str.extract("([a-zA-Z]+).*?([a-zA-Z]+)", expand=True)

The results are as follows:
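The sketch below contrasts the return shapes of extract and extractall on a stand-alone Series. The named groups label, cm, and rating are assumptions introduced for illustration; named groups simply become the column names of the result.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# expand=True (the default) returns a DataFrame, even for a single group
first_word = heights.str.extract(r"([a-zA-Z]+)")

# expand=False with a single group returns a Series instead
as_series = heights.str.extract(r"([a-zA-Z]+)", expand=False)

# Named groups (hypothetical names) become column names
named = heights.str.extract(r"(?P<label>[a-zA-Z]+):(?P<cm>\d+)_(?P<rating>[a-zA-Z]+)")

# extractall returns every match, one row per match, indexed by (row, match number)
all_words = heights.str.extractall(r"([a-zA-Z]+)")

print(named)
print(all_words)
```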
