Have you ever wondered why the data you are handed is always so messy?

Data cleaning is an essential step for any data analyst. Sometimes the data is so cluttered that processing it takes a lot of time, so the more you know about cleaning data, the more productive you become.

Pandas provides vectorized string functions through the str accessor (Series.str). They are very useful for cleaning data in Pandas.

1 data set, 16 Pandas functions

Huang prepared this data set carefully, purely to help everyone learn. The data set is as follows:

import pandas as pd

# The first name deliberately begins with a space; it is removed later with str.strip()
df = {
    'Name': [' Huang Tongxue', 'Huang Zhi Zun', 'Huang Lao Xie', 'Chen Da-mei', 'Sun Shangxiang'],
    'English name': ['Huang tong_xue', 'huang zhi_zun', 'Huang Lao_xie', 'Chen Da_mei', 'sun shang_xiang'],
    'Gender': ['male', 'women', 'men', 'woman', 'male'],
    'Identity card': ['463895200003128433', '429475199912122345', '420934199110102311', '431085200005230122', '420953199509082345'],
    'Height': ['mid:175_good', 'low:165_bad', 'low:159_bad', 'high:180_verygood', 'low:172_bad'],
    'Home address': ['Guangshui, Hubei', 'Xinyang, Henan', 'Guilin, Guangxi', 'Xiaogan, Hubei', 'Guangzhou, Guangdong'],
    'Phone number': ['13434813546', '19748672895', '16728613064', '14561586431', '19384683910'],
    # '千' means "thousand", so '8.5千' stands for 8500
    'Income': ['11000', '8.5千', '09000', '6.5千', '20000'],
}
df = pd.DataFrame(df)
df

The results are as follows:

Looking at the data above, the data set is messy. We will use 16 Pandas functions to clean it.

① cat: concatenates strings
df["Name"].str.cat(df["Home address"], sep="-"*3)

The results are as follows:
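
As a minimal, self-contained sketch of what cat does, the snippet below uses a hypothetical two-row sample (not the full data set above) and joins two Series element by element with a three-dash separator.

```python
import pandas as pd

# Hypothetical two-row sample, chosen only for illustration
names = pd.Series(["Huang Zhi Zun", "Chen Da-mei"])
addresses = pd.Series(["Xinyang, Henan", "Xiaogan, Hubei"])

# Element-wise concatenation: row i of names is joined with row i of addresses
joined = names.str.cat(addresses, sep="-" * 3)
print(joined.tolist())
# ['Huang Zhi Zun---Xinyang, Henan', 'Chen Da-mei---Xiaogan, Hubei']
```

The two Series are aligned on their index, so they should share the same index for the rows to pair up as expected.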

② contains: checks whether each string contains a given substring
df["Home address"].str.contains("Guang")

The results are as follows:
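
The boolean Series returned by contains is typically used as a filter. The sketch below assumes a hypothetical three-row sample and passes regex=False so that the search term is treated as plain text rather than as a regular expression.

```python
import pandas as pd

# Hypothetical sample of addresses
addresses = pd.Series(["Guangshui, Hubei", "Xinyang, Henan", "Guilin, Guangxi"])

# One boolean per row: does the address contain the text "Guang"?
mask = addresses.str.contains("Guang", regex=False)
print(mask.tolist())  # [True, False, True]

# Use the mask to keep only the matching rows
matching = addresses[mask]
print(matching.tolist())  # ['Guangshui, Hubei', 'Guilin, Guangxi']
```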

③ startswith/endswith: check whether each string starts or ends with a given substring
# The first name begins with a space, so startswith returns False for it
df["Name"].str.startswith("Huang")
df["English name"].str.endswith("e")

The results are as follows:
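
Leading whitespace is an easy way to get a surprising False from startswith. A small sketch, assuming a hypothetical sample in which the first name begins with a space, shows the check before and after stripping:

```python
import pandas as pd

# Hypothetical sample; the first value has a leading space
names = pd.Series([" Huang Tongxue", "Huang Zhi Zun", "Chen Da-mei"])

# The leading space makes the first row fail the prefix check
raw = names.str.startswith("Huang")
print(raw.tolist())  # [False, True, False]

# Stripping whitespace first gives the intended answer
cleaned = names.str.strip().str.startswith("Huang")
print(cleaned.tolist())  # [True, True, False]
```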

④ count: Counts the number of occurrences of a given character in a string
df["Phone number"].str.count("3")

The results are as follows:
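
One detail worth knowing is that count interprets its argument as a regular expression. The sketch below uses hypothetical sample values to show a plain digit count, and then the difference between an unescaped and an escaped dot.

```python
import pandas as pd

# Hypothetical sample values
phones = pd.Series(["13434813546", "19748672895"])
threes = phones.str.count("3")
print(threes.tolist())  # [3, 0]

income = pd.Series(["8.5千"])
# An unescaped "." matches any character, so every character is counted
any_char = income.str.count(".")
# An escaped dot counts only literal dots
literal_dot = income.str.count(r"\.")
print(any_char.tolist(), literal_dot.tolist())  # [4] [1]
```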

⑤ get: gets the element at the specified position (a character of a string, or an item of a list)
df["Name"].str.get(-1)
df["Height"].str.split(":")
df["Height"].str.split(":").str.get(0)

The results are as follows:
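
To make the two uses of get concrete, here is a short sketch on a hypothetical two-row sample: on plain strings it picks a character, and after split it picks an item from each list. A position that does not exist yields a missing value instead of raising an error.

```python
import pandas as pd

# Hypothetical sample in the same "label:number_rating" shape as the Height column
heights = pd.Series(["mid:175_good", "low:165_bad"])

# On strings, get returns the character at that position
last_char = heights.str.get(-1)
print(last_char.tolist())  # ['d', 'd']

# After split, each row holds a list, and get returns the list item
labels = heights.str.split(":").str.get(0)
print(labels.tolist())  # ['mid', 'low']

# An out-of-range position gives NaN rather than an IndexError
out_of_range = heights.str.get(50)
print(out_of_range.isna().all())  # True
```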

⑥ len: Calculates the length of the string
df["Gender"].str.len()

The results are as follows:

⑦ upper/lower: English case conversion
df["English name"].str.upper()
df["English name"].str.lower()

The results are as follows:
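
Case conversion is often a preparation step for comparisons. The sketch below, on a hypothetical two-row sample, lowercases the values first so that a prefix check no longer depends on how each name was capitalized.

```python
import pandas as pd

# Hypothetical sample with inconsistent capitalization
english = pd.Series(["Huang tong_xue", "huang zhi_zun"])

upper = english.str.upper()
print(upper.tolist())  # ['HUANG TONG_XUE', 'HUANG ZHI_ZUN']

# Case-sensitive check misses the second row
strict = english.str.startswith("Huang")
# Lowercasing first makes the check case-insensitive
relaxed = english.str.lower().str.startswith("huang")
print(strict.tolist(), relaxed.tolist())  # [True, False] [True, True]
```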

⑧ pad + side parameter / center: add the specified character to the left, right, or both sides of the string
df["Home address"].str.pad(10, fillchar="*")      # default side="left" pads on the left: equivalent to rjust()
df["Home address"].str.pad(10, side="right", fillchar="*")    # pads on the right: equivalent to ljust()
df["Home address"].str.center(10, fillchar="*")

The results are as follows:
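
The naming is easy to mix up: padding on the left pushes the text to the right, so pad with its default side="left" matches rjust, while side="right" matches ljust. A minimal sketch on a hypothetical one-row sample:

```python
import pandas as pd

# Hypothetical one-row sample
s = pd.Series(["abc"])

left_padded = s.str.pad(7, fillchar="*")                 # default side="left"
right_padded = s.str.pad(7, side="right", fillchar="*")
centered = s.str.center(7, fillchar="*")

print(left_padded[0])   # ****abc
print(right_padded[0])  # abc****
print(centered[0])      # **abc**

# pad(side="left") gives the same result as rjust, pad(side="right") as ljust
same_as_rjust = left_padded.equals(s.str.rjust(7, fillchar="*"))
same_as_ljust = right_padded.equals(s.str.ljust(7, fillchar="*"))
print(same_as_rjust, same_as_ljust)  # True True
```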

⑨ repeat: Repeat the string several times
df["Gender"].str.repeat(3)

The results are as follows:

⑩ slice_replace: replaces the characters in the specified position range with the given string
df["Phone number"].str.slice_replace(4, 8, "*"*4)

The results are as follows:
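
A common use of slice_replace is masking part of a sensitive value. The sketch below takes one phone number as a sample and replaces positions 4 to 7 (the stop position 8 is exclusive) with asterisks.

```python
import pandas as pd

# One sample phone number
phones = pd.Series(["13434813546"])

# Characters at positions 4, 5, 6 and 7 are replaced; position 8 onward is kept
masked = phones.str.slice_replace(4, 8, "*" * 4)
print(masked[0])  # 1343****546
```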

⑪ replace: replaces occurrences of a given substring with another string
df["Height"].str.replace(":", "-")

The results are as follows:

⑫ replace: replaces text that matches a pattern with a given string (accepts regular expressions)
  • replace works well with regular expressions;
  • Whether or not the example below is practical, the point is to see how convenient regular expressions are for data cleaning;
# regex=True is needed in pandas 2.0+, where patterns are treated as literal text by default
df["Income"].str.replace(r"\d+\.\d+", "Regular", regex=True)

The results are as follows:
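
To contrast the two modes of replace side by side, the following sketch uses hypothetical sample values and sets the regex parameter explicitly in both calls, which avoids depending on the default of the installed pandas version.

```python
import pandas as pd

# Hypothetical sample values
heights = pd.Series(["mid:175_good", "low:165_bad"])
income = pd.Series(["11000", "8.5千", "09000"])

# Literal replacement: ":" is treated as plain text
dashed = heights.str.replace(":", "-", regex=False)
print(dashed.tolist())  # ['mid-175_good', 'low-165_bad']

# Regex replacement: only decimal numbers such as "8.5" match the pattern
flagged = income.str.replace(r"\d+\.\d+", "Regular", regex=True)
print(flagged.tolist())  # ['11000', 'Regular千', '09000']
```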

⑬ split + expand parameter: splits each string on a separator; very powerful together with the join method
# Common usage
df["Height"].str.split(":")
# split method with expand parameter
df[["Description of height", "Final height"]] = df["Height"].str.split(":", expand=True)
df
# split method followed by the join method
df["Height"].str.split(":").str.join("?"*5)

The results are as follows:
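
The difference between plain split and split with expand=True is the shape of the result: a Series of lists versus a DataFrame with one column per piece. A minimal sketch on a hypothetical two-row frame, reusing the column names from the snippet above:

```python
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({"Height": ["mid:175_good", "low:165_bad"]})

# expand=True returns a DataFrame, so the pieces can be assigned to two new columns
df[["Description of height", "Final height"]] = df["Height"].str.split(":", expand=True)
print(df.columns.tolist())
# ['Height', 'Description of height', 'Final height']
print(df["Final height"].tolist())  # ['175_good', '165_bad']

# Without expand, each row is a list, which join can glue back together
rejoined = df["Height"].str.split(":").str.join("?" * 5)
print(rejoined.tolist())  # ['mid?????175_good', 'low?????165_bad']
```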

⑭ strip/rstrip/lstrip: remove whitespace characters and newlines
df["Name"].str.len()
df["Name"] = df["Name"].str.strip()
df["Name"].str.len()

The results are as follows:
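
Comparing lengths before and after is a quick way to confirm that stripping removed something. The sketch below assumes a hypothetical sample with one leading space and one trailing newline, and shows which side each variant cleans.

```python
import pandas as pd

# Hypothetical sample: a leading space in the first value, a trailing newline in the second
names = pd.Series([" Huang Tongxue", "Huang Lao Xie\n"])
print(names.str.len().tolist())  # [14, 14]

both_sides = names.str.strip()    # removes whitespace on both ends
left_only = names.str.lstrip()    # removes leading whitespace only
right_only = names.str.rstrip()   # removes trailing whitespace only

print(both_sides.str.len().tolist())  # [13, 13]
print(left_only.str.len().tolist())   # [13, 14]
print(right_only.str.len().tolist())  # [14, 13]
```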

⑮ findall: returns, for each string, a list of all matches of a regular expression
  • findall is another way to clean data with regular expressions.
df["Height"]
df["Height"].str.findall("[a-zA-Z]+")

The results are as follows:
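
As a small illustration, the sketch below runs findall with two different patterns on a hypothetical two-row sample: one collects the words, the other the digits. Each row of the result is a Python list.

```python
import pandas as pd

# Hypothetical sample in the same shape as the Height column
heights = pd.Series(["mid:175_good", "low:165_bad"])

# All runs of letters in each string
words = heights.str.findall(r"[a-zA-Z]+")
print(words.tolist())  # [['mid', 'good'], ['low', 'bad']]

# All runs of digits in each string
numbers = heights.str.findall(r"\d+")
print(numbers.tolist())  # [['175'], ['165']]
```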

⑯ extract/extractall: accept a regular expression and extract the matching strings (the pattern must contain a capture group in parentheses)
df["Height"].str.extract("([a-zA-Z]+)")
# extractall returns every match, indexed by a MultiIndex
df["Height"].str.extractall("([a-zA-Z]+)")
# extract with the expand parameter
df["Height"].str.extract("([a-zA-Z]+).*?([a-zA-Z]+)", expand=True)

The results are as follows:
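
The three calls differ mainly in how many matches they keep and how the result is indexed. A minimal sketch, on a hypothetical two-row sample, makes the shapes visible: extract keeps the first match per row, extractall keeps every match under a two-level index, and a pattern with two groups yields two columns.

```python
import pandas as pd

# Hypothetical sample in the same shape as the Height column
heights = pd.Series(["mid:175_good", "low:165_bad"])

# First match only; the result is a DataFrame with one column per capture group
first = heights.str.extract(r"([a-zA-Z]+)")
print(first[0].tolist())  # ['mid', 'low']

# Every match; rows are indexed by (original row, match number)
every = heights.str.extractall(r"([a-zA-Z]+)")
print(every.index.nlevels)  # 2
print(every[0].tolist())    # ['mid', 'good', 'low', 'bad']

# Two capture groups produce two columns
pair = heights.str.extract(r"([a-zA-Z]+).*?([a-zA-Z]+)", expand=True)
print(pair.values.tolist())  # [['mid', 'good'], ['low', 'bad']]
```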
