Python regular expression module (re) in detail

This article is published with authorization from the WeChat official account. If you have any suggestions, please leave a message on the official account.

In the previous article on Python's basic regular expression syntax, we looked briefly at how regular expressions are written. In this article, we will look at how Python uses regular expressions and which functions and methods the re module provides.

Contents of the regular expression module (re)

The re module is part of Python's standard library and provides many functions for working with regular expressions.

re.search(pattern, string, flags=0)

Parameter	Description
pattern	Regular expression
string	String to match
flags	Flag value used to change the behavior of the regular expression

The possible values for flags are:

Flag	Meaning
re.S (DOTALL)	Makes `.` match all characters, including newlines
re.I (IGNORECASE)	Makes the match case-insensitive
re.L (LOCALE)	Locale-dependent matching (for French text, for example); in `python3` it can only be used with bytes patterns
re.M (MULTILINE)	Multi-line matching; affects `^` and `$`
re.X (VERBOSE)	Allows whitespace and comments inside the pattern, so regular expressions can be laid out more readably
re.U (UNICODE)	Matches according to the Unicode character set. Redundant in `python3`, where `str` patterns are `Unicode` by default
re.A (ASCII)	Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S` perform `ASCII`-only matching instead of full `Unicode` matching
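
As a rough illustration of how some of these flags change matching, the following sketch applies them to small made-up strings. The sample strings and variable names are assumptions chosen for illustration, not part of the original article.

```python
import re

# re.I: case-insensitive matching
ignore_case = re.search(r'codeid', 'Welcome to CodeId', re.I)

# re.S: '.' also matches a newline
without_dotall = re.search(r'a.b', 'a\nb')
with_dotall = re.search(r'a.b', 'a\nb', re.S)

# re.M: '^' and '$' also match at the start and end of each line
line_starts = re.findall(r'^\w+', 'first line\nsecond line', re.M)

# re.X: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)

# Flags can be combined with the | operator
combined = re.findall(r'^code\w*', 'CodeId\ncodeid', re.I | re.M)

print(ignore_case.group())        # CodeId
print(without_dotall)             # None
print(repr(with_dotall.group()))  # 'a\nb'
print(line_starts)                # ['first', 'second']
print(number.search('costs 3.5 dollars').group())  # 3.5
print(combined)                   # ['CodeId', 'codeid']
```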

Note: parameters with the same name in the following functions have the same meaning as described above and will not be explained again.

Function: Scans the string from left to right and returns a match object for the first location where the regular expression matches, or None if no match is found.

For example, find the first occurrence of CodeId in the string:

    import re
    text = 'Welcome to codeid. CodeId'
    result = re.search(r'CodeId', text)
    print(result.start())
    # Output: 19

result.start() returns the index at which the match starts. From this we can guess that result.end() returns the index just past the end of the match.

re.match(pattern, string, flags=0)

Function: Checks whether zero or more characters at the beginning of the string match the regular expression. Returns the corresponding match object if they do, and None otherwise.

For example, determine whether a Python variable name starts with a digit:

    import re
    result = re.match(r'\d+', '123CodeId')
    if result:
        print("python variable contains a number at the beginning")

Note: re.match() only matches at the beginning of the string, not at the beginning of every line, even in MULTILINE mode.

    import re
    result = re.match(r'a', 'b\na', re.MULTILINE)
    print(result)
    # Output: None
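
To contrast the two functions, here is a small sketch using the same string as in the note above. The extra re.search call with `^` is added only for comparison and is not from the original article.

```python
import re

text = 'b\na'

# re.match is anchored at the very start of the string, even with re.MULTILINE
anchored = re.match(r'a', text, re.MULTILINE)

# re.search with '^' and re.MULTILINE can match at the start of any line
per_line = re.search(r'^a', text, re.MULTILINE)

print(anchored)          # None
print(per_line.start())  # 2
```

If line-by-line anchoring is what is needed, re.search() with `^` and re.MULTILINE is usually the function to reach for.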

re.fullmatch(pattern, string, flags=0)

Function: Returns a match object if the entire string matches the regular expression, otherwise returns None.

For example, check whether an entered email address is valid:

    import re
    email = 'user@example.com'  # placeholder address; the original one was obscured in the source
    result = re.fullmatch(r'^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$', email)
    if result:
        print("your email address:", result.group(), "is valid")

result.group() returns the matched substring. More on that later.

re.split(pattern, string, maxsplit=0, flags=0)

Function: Splits a string, using the substrings that match the regular expression as delimiters, and returns the resulting list. For example, split an English sentence into a list of words:

    import re
    text = 'Welcome to CodeId'
    result = re.split(r'\W+', text)
    print(result)
    # Output: ['Welcome', 'to', 'CodeId']

The maxsplit parameter is the maximum number of splits; the default of 0 means no limit.

    import re
    text = 'Welcome to CodeId'
    result = re.split(r'\W+', text, maxsplit=1)
    print(result)
    # Output: ['Welcome', 'to CodeId']
    # After one split, the remainder of the string becomes the last element of the list

Note: If capturing parentheses are used in the regular expression, the text of all groups is also returned as part of the result list.

    import re
    text = 'Welcome to CodeId'
    result = re.split(r'(\W+)', text)
    print(result)
    # Output: ['Welcome', ' ', 'to', ' ', 'CodeId']

re.findall(pattern, string, flags=0)

Function: Scans the string from left to right and returns all non-overlapping substrings matching the regular expression as a list. For example, match all numbers with a decimal point in a sentence:

    import re
    text = "I have $3.5, here's $1.5, I have $2 left"
    result = re.findall(r'-?\d+\.\d+', text)
    print(result)
    # Output: ['3.5', '1.5']

Note: If the regular expression contains one group, a list of that group's matches is returned; if it contains multiple groups, the result is a list of tuples.

    import re
    text = "I have $3.5, here's $1.5, I have $2 left"
    result = re.findall(r'(-?\d+\.\d+)', text)    # one group
    print(result)
    # Output: ['3.5', '1.5']
    result = re.findall(r'(-?\d+)\.(\d+)', text)  # multiple groups
    print(result)
    # Output: [('3', '5'), ('1', '5')]
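
When parentheses are needed only for structure and the full match is what should end up in the list, a non-capturing group `(?:...)` can be used instead. The following is a minimal sketch with a made-up sample sentence, added here for illustration:

```python
import re

text = "I have $3.5, here's $1.5, I have $2 left"

# A capturing group makes findall return only the group's text
# (an empty string when the optional group did not take part in the match)
captured = re.findall(r'\d+(\.\d+)?', text)

# A non-capturing group (?:...) keeps the full match in the result
whole = re.findall(r'\d+(?:\.\d+)?', text)

print(captured)  # ['.5', '.5', '']
print(whole)     # ['3.5', '1.5', '2']
```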

re.finditer(pattern, string, flags=0)

Function: Scans the string from left to right and returns an iterator that yields a match object for every non-overlapping match of the regular expression. For example, the example above can be written with an iterator:

    import re
    text = 'I have $3.50, give you $1.50, I have $2 left'
    result = re.finditer(r'-?\d+\.\d+', text)  # returns an iterator
    print(result)
    # Output: something like <callable_iterator object at 0x0000025D2B81B320>
    for s in result:
        print(s.group())
    # Output:
    # 3.50
    # 1.50
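
Because re.finditer() yields match objects rather than plain strings, the position of each match is available as well. A small sketch, assuming the same sample sentence as above (the `positions` name is just for illustration):

```python
import re

text = 'I have $3.50, give you $1.50, I have $2 left'

# Collect each matched number together with its (start, end) span
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+\.\d+', text)]

for value, (start, end) in positions:
    # Slicing the original text with the span gives back the matched value
    print(value, start, end, text[start:end])

print(positions)  # [('3.50', (8, 12)), ('1.50', (24, 28))]
```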

re.sub(pattern, repl, string, count=0, flags=0)

Function: Replaces the substrings of string that match the regular expression with repl. repl can be either a string or a function.

When repl is a string, any backslash escapes in it are processed. For example:

Working with ordinary strings

    import re
    # Replace key information with ****, such as a mobile phone number
    text = 'Please call: 15589878888'
    result = re.sub(r'(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}', '****', text)
    print(result)
    # Output: Please call: ****

Backreferences

    import re
    # Hide the middle four digits of the mobile phone number
    text = 'Please call: 15589878888'
    # In the replacement string r'\1****\3', \1 and \3 are the contents of group 1 and group 3
    result = re.sub(r'(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])(\d{4})(\d{4})', r'\1****\3', text)
    print(result)
    # Output: Please call: 155****8888

Note: In addition to the character escapes and backreferences above, \g<name> can be used to refer to groups defined with the (?P<name>...) syntax. Likewise, \g<number> is equivalent to \number, but \g<number> is less ambiguous inside a replacement string. The backreference \g<0> stands for the entire substring matched by the regular expression.
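
A brief sketch of the `\g<...>` forms mentioned in this note. The group names `prefix`, `middle` and `suffix` and the simplified digit patterns are assumptions made for this illustration only:

```python
import re

text = 'Please call: 15589878888'
pattern = r'(?P<prefix>\d{3})(?P<middle>\d{4})(?P<suffix>\d{4})'

# \g<name> refers to a group defined with (?P<name>...)
by_name = re.sub(pattern, r'\g<prefix>****\g<suffix>', text)

# \g<number> is equivalent to \number
by_number = re.sub(pattern, r'\g<1>****\g<3>', text)

# \g<1>0 means "group 1 followed by a literal 0"; r'\10' would mean group 10
followed_by_zero = re.sub(r'(\d{3})\d{8}', r'\g<1>0', text)

# \g<0> stands for the entire match
whole_match = re.sub(r'\d{11}', r'[\g<0>]', text)

print(by_name)           # Please call: 155****8888
print(by_number)         # Please call: 155****8888
print(followed_by_zero)  # Please call: 1550
print(whole_match)       # Please call: [15589878888]
```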

When repl is a function, it is called for each match with a single match object as its argument and returns the replacement string. For example:

    import re

    def dashrepl(matchobj):
        if matchobj.group(0) == '-':
            return ' '
        else:
            return '*'

    result = re.sub(r'-{1,3}', dashrepl, 'pro--a--gram-files')
    print(result)
    # Output: pro*a*gram files

    text = "JGood is a handsome boy, he is cool, clever, and so on..."
    print(re.sub(r'\S+', lambda m: '[' + m.group(0) + ']', text, count=0))
    # Output: [JGood] [is] [a] [handsome] [boy,] [he] [is] [cool,] [clever,] [and] [so] [on...]

The count argument controls the maximum number of substitutions. count must be a non-negative integer. If it is omitted or zero, all matches are replaced. For example:

    import re
    result = re.sub(r'-', r'*', '------------------', count=4)
    print(result)
    # Output: ****--------------
    result = re.sub(r'-', r'*', '------------------', count=8)
    print(result)
    # Output: ********----------

re.subn(pattern, repl, string, count=0, flags=0)

Function: Same as re.sub(), except that re.subn() returns a tuple containing the string after substitution and the number of substitutions made. For example:

    import re
    # Replace some private information with ****, such as mobile phone numbers
    text = 'Please call: 15589878888, previous mobile phone number: 15888886666 not used'
    result = re.subn(r'(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}', '****', text)
    print(result)
    # Output: ('Please call: ****, previous mobile phone number: **** not used', 2)

re.escape(pattern)

Function: Escapes the special characters in pattern, that is, puts a \ in front of each of them.

For example:

    import re
    result = re.escape(r'CodeId*&@_.\.')
    print(result)
    # Output (Python 3.7+): CodeId\*\&@_\.\\\.
    # Before Python 3.7, characters such as @ were escaped as well
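
One common reason to call re.escape() is to search for literal text that happens to contain regex metacharacters. The sketch below uses a hypothetical literal string and sentence, chosen only for illustration:

```python
import re

# Hypothetical literal text that contains regex metacharacters
literal = '1+1=2 (really?)'
text = 'They say 1+1=2 (really?) quite often.'

# Without escaping, '+', '(', ')' and '?' are interpreted as regex syntax
unescaped = re.search(literal, text)

# With re.escape the text is matched literally
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2 (really?)
```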

re.purge()

Function: Clears the regular expression cache.

re.compile(pattern, flags=0)

Function: Compiles a regular expression into a regular expression object, whose match(), search() and other methods can then be used for matching. For example:

    import re
    pattern = re.compile(r'CodeId')
    result = pattern.search('Welcome to codeid. CodeId')
    print(result.start())
    # Output: 19

re.compile() compiles a regular expression once and keeps the result for later reuse, which suits regular expressions that are used many times in one program. The methods of the compiled regular expression object differ slightly in their parameters from the functions above. For example, Pattern.match(string[, pos[, endpos]]).

The pos argument sets the index in the string at which matching starts.
The endpos argument sets the index at which matching stops; only the characters before endpos are considered. The other methods are similar to the functions above and are not covered in detail here; see https://docs.python.org/3/library/re.html for more details.
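
A minimal sketch of how pos and endpos behave on a compiled pattern. The sample string and variable names are made up for this illustration:

```python
import re

pattern = re.compile(r'CodeId')
text = 'CodeId welcomes CodeId'

# The compiled object can be reused for several calls
first = pattern.search(text)

# pos=1: searching starts at index 1, so the first occurrence is skipped
second = pattern.search(text, 1)

# endpos=10: only text[1:10] is searched, so the second occurrence is out of reach
limited = pattern.search(text, 1, 10)

# Pattern.match with pos tries to match exactly at that position
at_pos = pattern.match(text, 16)

print(first.start())   # 0
print(second.start())  # 16
print(limited)         # None
print(at_pos.span())   # (16, 22)
```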

Match object

A match object always has a boolean value of True. Since match() and search() return None when there is no match, we can test for a match with an if statement. For example:

    import re
    result = re.search(r"CodeId", "Welcome to CodeId")
    if result:
        print('Yes')

Match.expand(template)

Function: Replaces the group references in the template string with the contents of the corresponding groups. In template, you can use \id, \g<id>, or \g<name> to reference groups, but the number 0 cannot be used in the \id form. \id is equivalent to \g<id>; if you want to express that \1 is followed by the character '0', you can only use \g<1>0, because \10 would be treated as the 10th group.

For example, match a date:

    import re
    # Match a date
    data = '2018-8-9'
    result = re.fullmatch(r'^(\d{4})-(\d{1,2})-(?P<day>\d{1,2})$', data)
    expand = result.expand(r'today is \1 year \g<2> month \g<day> day')
    print(expand)
    # Output: today is 2018 year 8 month 9 day
    print(result.group())
    # Output: 2018-8-9

Match.group([group1, ...])

Function: Returns one or more subgroups of the match. With a single argument, the result is a single string; with multiple arguments, the result is a tuple containing the group content for each argument. With no arguments, group1 defaults to 0 (the entire match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string. For example:

    import re
    data = '2018-8-9'
    result = re.fullmatch(r'^(\d{4})-(\d{1,2})-(?P<day>\d{1,2})$', data)
    print(result.group())            # 2018-8-9
    print(result.group(1))           # 2018
    print(result.group(1, 2, 3, 0))  # ('2018', '8', '9', '2018-8-9')

If the group number is negative or greater than the number of groups defined in the regular expression, an IndexError exception is raised.

    print(result.group(4))  # raises IndexError: no such group

If the regular expression uses the (?P<name>...) syntax, a groupN argument can also be a group name, which accesses the group's content by name.

    print(result.group('day'))  # 9

If a group matches multiple times, only the last match can be accessed. Note: Results can also be accessed with index syntax, thanks to the Match.__getitem__(g) method, which was added in version 3.6.

    import re
    data = '2018-8-9'
    result = re.fullmatch(r'^(\d{4})-(\d{1,2})-(?P<day>\d{1,2})$', data)
    print(result[0])
    # Output: 2018-8-9
    print(result[1])
    # Output: 2018
    print(result['day'])
    # Output: 9
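
The remark that only the last match of a repeated group can be accessed may be easier to see with a small sketch. The pattern and sample string here are illustrative assumptions, not from the original article:

```python
import re

# The group (\w\d) is repeated three times; only the last repetition is kept
result = re.match(r'(\w\d)+', 'a1b2c3')

print(result.group(0))  # a1b2c3
print(result.group(1))  # c3
print(result[1])        # c3 (index access, Python 3.6+)
```

If every repetition is needed, one option is to match the repeated unit on its own with re.findall() or re.finditer().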

Match.groups(default=None)

Function: Returns a tuple containing all subgroups of the match, starting from group 1. The default parameter sets the value used for subgroups that did not take part in the match; it is None if not given. For example, match floating-point numbers:

    import re
    result = re.match(r"(\d+)\.?(\d+)?", "24")
    print(result.groups())
    # Output: ('24', None)
    print(result.groups('0'))
    # Output: ('24', '0')

Match.groupdict(default=None)

Function: Returns all subgroups defined with the (?P<name>...) syntax as a dictionary keyed by group name. The default parameter is used for subgroups that did not take part in the match; it defaults to None.

    import re
    result = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(result.groupdict())
    # Output: {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
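
To show what the default parameter does, here is a small sketch in which the last name is made optional. The modified pattern and the 'unknown' placeholder value are assumptions for illustration only:

```python
import re

# The optional group 'last_name' does not take part in this match
result = re.match(r'(?P<first_name>\w+)(?: (?P<last_name>\w+))?', 'Malcolm')

without_default = result.groupdict()
with_default = result.groupdict('unknown')

print(without_default)  # {'first_name': 'Malcolm', 'last_name': None}
print(with_default)     # {'first_name': 'Malcolm', 'last_name': 'unknown'}
```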

Match.start([group]) and Match.end([group])

Function: Return the start and end indexes of the substring matched by group; group defaults to zero (meaning the entire matched substring). If the group exists but did not take part in the match, -1 is returned. For example:

    import re
    result = re.match(r"(\d+)\.?(\d+)?", "24")
    print(result.start())
    # Output: 0
    print(result.end())
    # Output: 2
    print(result.end(2))
    # Output: -1 (group 2 did not take part in the match)

Match.span([group])

Function: Returns a tuple consisting of the start and end indexes of the matched subgroup. If the group did not take part in the match, (-1, -1) is returned. For example:

    import re
    result = re.match(r"(\d+)\.?(\d+)?", "24.59")
    print(result.span())
    # Output: (0, 5)
    print(result.span(2))
    # Output: (3, 5)
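
The (-1, -1) case can be seen by matching a string without a fractional part. This is a minimal sketch using the same pattern; the input "24" and the slicing step are added for illustration:

```python
import re

source = '24'
result = re.match(r'(\d+)\.?(\d+)?', source)

matched_span = result.span(1)
unmatched_span = result.span(2)

# A span can be used directly to slice the original string
start, end = matched_span
integer_part = source[start:end]

print(matched_span)    # (0, 2)
print(unmatched_span)  # (-1, -1)
print(integer_part)    # 24
```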
