Public account: You and the cabin. Author: Peter. Editor: Peter

Hello, I’m Peter

pandas provides two string methods for extracting information from text: Series.str.extract and Series.str.extractall

The extract function

Syntax

The extract function takes three arguments:

Series.str.extract(pat, flags=0, expand=True)

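The parameters are:

  • pat: a regular expression pattern containing at least one capture group
  • flags: an integer combining flags from the re module, such as re.IGNORECASE; the default 0 means no flags
  • expand: a Boolean; True returns a DataFrame with one column per capture group, False returns a Series or Index when there is a single capture group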

Sample data

Let's start with a simple example from the official documentation. The sample data is a Series:
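The original screenshot is not reproduced here. As a minimal sketch, assuming the Series is the one used in the pandas documentation example (the values a1, b2, c3 are inferred from the discussion of c3 further on), it could be built like this:

```python
import pandas as pd

# Assumed sample data: one letter followed by one digit per element.
s = pd.Series(["a1", "b2", "c3"])
print(s)
```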

Match 1

In the following example, the pattern contains two capture groups; each pair of parentheses () defines one group:

  • [ab] : matches either the letter a or the letter b
  • \d : matches a single digit

The result shows two things:

  1. With several capture groups, a row that does not match is filled with NaN
  2. The pattern must match as a whole: if the first group fails, the second group's result is NaN as well

For c3, \d would match the digit 3, but [ab] does not match the letter c, so the pattern fails and the whole row is NaN
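A minimal sketch of this match, assuming the sample Series holds a1, b2, and c3 as in the pandas documentation example:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# Two capture groups: a letter from [ab], then one digit.
result = s.str.extract(r"([ab])(\d)")
print(result)
#      0    1
# 0    a    1
# 1    b    2
# 2  NaN  NaN
```

The row for c3 is NaN in both columns because the pattern as a whole did not match.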

Match 2

The only difference from the previous pattern is a question mark (?) added after the first group, which changes the result.

In a regular expression, the question mark matches the preceding element zero or one time. For c3, the group [ab] matches zero times, which still counts as a match: the first column is NaN and the second column holds the digit 3
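The effect of the optional group can be sketched as follows, again assuming the sample Series a1, b2, c3:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# The ? makes the first group optional, so "c3" still matches on its digit.
result = s.str.extract(r"([ab])?(\d)")
print(result)
#      0  1
# 0    a  1
# 1    b  2
# 2  NaN  3
```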

Match 3

Name the capture groups to set the column names of the resulting DataFrame:

Column names are specified with the named-group syntax (?P<name>...)
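A short sketch of named groups, assuming the same sample Series; the group names letter and digit are chosen here for illustration:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# Each (?P<name>...) group becomes a column with that name.
result = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")
print(result)
#   letter digit
# 0      a     1
# 1      b     2
# 2    NaN   NaN
```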

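The expand parameter

The expand parameter controls the return type:

  • expand=True: always returns a DataFrame, with one column per capture group
  • expand=False: returns a Series (or Index) when the pattern has a single capture group, and a DataFrame when it has several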

By comparing the following two examples, we can see expand in action:
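The two calls can be sketched like this, assuming the sample Series a1, b2, c3 and a pattern with a single capture group (the case where expand changes the return type):

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# One capture group, expand=True: a DataFrame with a single column named 0.
as_frame = s.str.extract(r"[ab](\d)", expand=True)

# One capture group, expand=False: a Series instead.
as_series = s.str.extract(r"[ab](\d)", expand=False)

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```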

The extractall function

Syntax

extract returns only the first match in each string; extractall returns every match.

Series.str.extractall(pat, flags=0)

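The parameters are:

  • pat: a regular expression pattern containing at least one capture group
  • flags: an integer combining flags from the re module; the default 0 means no flags

The return value is always a DataFrame, with one row per match and one column per capture group. Its index is a MultiIndex whose last level, named match, numbers the matches within each string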

Sample data

Here is a new sample Series:
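The original screenshot is missing. A minimal sketch, assuming the Series from the pandas documentation example for extractall (the values and index labels are an assumption, not recovered from the original):

```python
import pandas as pd

# Assumed sample data: the first element contains two letter+digit pairs.
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
print(s)
```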

Here are three examples to compare the two functions:

Comparison 1

Matching with a single capture group:
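As a sketch of this comparison, assuming a sample Series a1a2, b1, c1 indexed by A, B, C:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # assumed data

# extract keeps only the first match per element; "c1" has no match (NaN).
first = s.str.extract(r"[ab](\d)")

# extractall keeps every match; unmatched elements are dropped, and the
# index gains a second level named "match".
every = s.str.extractall(r"[ab](\d)")

print(first)
print(every)
```

Element A contributes two rows to the extractall result (digits 1 and 2), while extract reports only the first digit.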

Comparison 2

Matching with multiple capture groups:
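A sketch with two capture groups, under the same assumption about the sample Series (a1a2, b1, c1 indexed by A, B, C):

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # assumed data

pattern = r"([ab])(\d)"  # group 0: the letter, group 1: the digit

first = s.str.extract(pattern)     # one row per element, first match only
every = s.str.extractall(pattern)  # one row per match

print(first)
print(every)
```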

Comparison 3

Matching with multiple capture groups, plus column names:
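The named-group variant might look like this; the group names letter and digit are illustrative, and the sample Series is assumed as before:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # assumed data

pattern = r"(?P<letter>[ab])(?P<digit>\d)"  # named groups become column names

first = s.str.extract(pattern)
every = s.str.extractall(pattern)

print(first)
print(every)
```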

Practical example

Here is a practical example of using the extract function:

Sample data

The name field contains both name and gender, and the address field contains both province and city:

import pandas as pd

df = pd.DataFrame({
    "name": ["Tom-male", "Peter-male", "Jimmy-female", "Mike-male", "John-female"],
    "address": ["Shenzhen, Guangdong Province", "Guangzhou, Guangdong Province", "Hangzhou, Zhejiang Province", "Nanjing, Jiangsu Province", "Changsha, Hunan Province"],
})
df

Extract the province

Quickly extract the province from address; in the pattern, .*? lazily matches any run of characters
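The original screenshot for this step is not reproduced here. As a minimal sketch, assuming each address is written as "City, X Province" (both this layout and the exact pattern are assumptions of the example), the province could be captured like this:

```python
import pandas as pd

# Assumed sample data: each address is written as "City, X Province".
df = pd.DataFrame({
    "address": ["Shenzhen, Guangdong Province", "Hangzhou, Zhejiang Province"],
})

# ,\s* skips the comma and space; (.*? Province) lazily captures
# everything up to and including the word "Province".
province = df["address"].str.extract(r",\s*(.*? Province)")
print(province)
```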

Extract province + city

Extract the province and the city in one pass; the columns can be named at the same time:
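A sketch of extracting both parts with named groups, assuming the "City, X Province" address layout; the column names city and province are illustrative:

```python
import pandas as pd

# Assumed sample data: each address is written as "City, X Province".
df = pd.DataFrame({
    "address": ["Shenzhen, Guangdong Province", "Changsha, Hunan Province"],
})

# Lazy .*? stops at the first comma; the rest of the string is the province.
parts = df["address"].str.extract(r"(?P<city>.*?),\s*(?P<province>.*)")
print(parts)
```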

Extract name + gender

Extract both the name and the gender from the name field; \w matches a single word character (letter, digit, or underscore), and + repeats the preceding element one or more times
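A sketch of this extraction, assuming the name values are written as name-gender with a hyphen separator; the column names are illustrative:

```python
import pandas as pd

# Assumed sample data: name and gender joined by a hyphen.
df = pd.DataFrame({"name": ["Tom-male", "Jimmy-female", "Mike-male"]})

# \w+ captures a run of word characters on each side of the hyphen.
parts = df["name"].str.extract(r"(?P<name>\w+)-(?P<gender>\w+)")
print(parts)
```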

Regular expression basics

Here is a quick primer on regular expressions, based on the Google Analytics documentation:

Wildcards

  • . : matches any single character (letter, digit, or symbol). 1. matches 10 and 1A; 1.1 matches 111 and 1A1
  • ? : matches the preceding character 0 or 1 times. 10? matches 1 and 10
  • + : matches the preceding character 1 or more times. 10+ matches 10 and 100
  • * : matches the preceding character 0 or more times. 10* matches 1, 10, and 100
  • | : creates an OR match; do not use it at the end of an expression. 1|10 matches 1 and 10
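The dot and OR examples can be checked with Python's re module. This is a small sketch; the helper name matches is hypothetical, and it tests whole-string matches with re.fullmatch:

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"1.", "10"), matches(r"1.", "1A"))      # True True
print(matches(r"1.1", "111"), matches(r"1.1", "1A1"))  # True True
print(matches(r"1|10", "1"), matches(r"1|10", "10"))   # True True
print(matches(r"1.", "1"))                             # False: . needs a character
```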

Anchors

  • ^ : matches the adjacent characters at the beginning of the string. ^10 matches 10, 100, and 10x; it does not match 110 or 110x
  • $ : matches the adjacent characters at the end of the string. 10$ matches 110 and 1010; it does not match 100 or 10x
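The anchor examples can be verified with re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. A minimal sketch (the helper name found is hypothetical):

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs somewhere in text."""
    return re.search(pattern, text) is not None

starts = {t: found(r"^10", t) for t in ["10", "100", "10x", "110", "110x"]}
ends = {t: found(r"10$", t) for t in ["110", "1010", "100", "10x"]}

print(starts)  # {'10': True, '100': True, '10x': True, '110': False, '110x': False}
print(ends)    # {'110': True, '1010': True, '100': False, '10x': False}
```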

Question mark (?)

The question mark (?) matches the preceding character 0 or 1 times. For example, 10? matches:

  • 1: the 0 before the question mark matches 0 times
  • 10: the 0 before the question mark matches once

Plus sign (+)

The plus sign (+) matches the preceding character 1 or more times. For example, 10+ matches:

  • 10: the 0 matches once
  • 100: the 0 matches twice
  • 1000: the 0 matches three times

Asterisk (*)

The asterisk (*) matches the preceding character 0 or more times. For example, 10* matches:

  • 1: the 0 matches 0 times
  • 10: the 0 matches once
  • 100: the 0 matches twice
  • 1000: the 0 matches three times
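The three quantifiers can be compared side by side with re.fullmatch. This is a small sketch; the helper name matches and the dictionary layout are choices made for the example:

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["1", "10", "100", "1000"]

optional = [t for t in samples if matches(r"10?", t)]     # 0 or 1 zeros
one_or_more = [t for t in samples if matches(r"10+", t)]  # at least 1 zero
zero_or_more = [t for t in samples if matches(r"10*", t)] # any number of zeros

print(optional)      # ['1', '10']
print(one_or_more)   # ['10', '100', '1000']
print(zero_or_more)  # ['1', '10', '100', '1000']
```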

I will write a detailed article on regular expressions based on Python's re module