This article explores regular expressions and Python's re module in detail.

  • What is a regular expression
  • Regular expression function
  • Metacharacters and their meanings
  • re module details
  • Regular expression modifier
  • Regular expression instances


What is a regular expression

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace matching substrings, or to extract substrings that satisfy a condition.

Regular expression function

By using a regular expression, you can:

  1. Test for a pattern within a string. For example, you can test an input string to see whether a phone number pattern or a credit card number pattern appears in it. This is called data validation.

  2. Replace text. You can use regular expressions to identify specific text in a document and then remove it completely or replace it with other text.

  3. Extract substrings from a string based on pattern matching. You can find specific text within a document or input field, such as the pieces a crawler needs from the content of a web page.
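
The three uses above can be sketched with Python's re module. This is only a minimal illustration: the sample text and the simplified phone pattern (three digits, a hyphen, four digits) are assumptions made for the example, not a real-world validator.

```python
import re

# Hypothetical sample text and a deliberately simplified phone pattern.
text = "Call 555-1234 or 555-9876 for details."
phone = r"\d{3}-\d{4}"

# 1. Test for a pattern (data validation).
has_phone = re.search(phone, text) is not None

# 2. Replace text: mask every phone number.
masked = re.sub(phone, "XXX-XXXX", text)

# 3. Extract substrings: collect every phone number.
numbers = re.findall(phone, text)

print(has_phone)  # True
print(masked)     # Call XXX-XXXX or XXX-XXXX for details.
print(numbers)    # ['555-1234', '555-9876']
```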

Metacharacters and their meanings

Common metacharacters

Symbol    Meaning
.         Matches any character except the newline character
*         Matches the preceding item zero or more times
?         Matches the preceding item zero or one time; placed after another quantifier, it makes that quantifier non-greedy
^         Matches the starting position
$         Matches the end position
\s        Matches any whitespace character
\S        Matches any non-whitespace character
\d        Matches a digit
\D        Matches a non-digit
\w        Matches a word character (a letter, a digit, or an underscore)
\W        Matches a non-word character
[abcd]    Matches any one of the characters a, b, c, d
[^abcd]   Matches any character other than a, b, c, d
+         Matches the preceding item one or more times
{n}       Matches the preceding item exactly n times
{n,}      Matches the preceding item at least n times
{n,m}     Matches the preceding item n to m times
x|y       Matches either x or y
(a)       Groups the pattern in parentheses and captures what it matches
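
To see several of these symbols in action, here is a small sketch using re.findall. The sample string is made up for the illustration.

```python
import re

# Hypothetical sample string for trying out the symbols above.
s = "cat bat 42 rats"

digits = re.findall(r"\d", s)            # one digit at a time
digit_runs = re.findall(r"\d+", s)       # + repeats the preceding item
cat_or_bat = re.findall(r"[cb]at", s)    # character set
words = re.findall(r"\w+", s)            # runs of word characters
at_start = re.findall(r"^cat", s)        # ^ anchors to the start
at_end = re.findall(r"rats?$", s)        # ? makes the "s" optional, $ anchors to the end
either = re.findall(r"cat|rat", s)       # alternation

print(digits)      # ['4', '2']
print(digit_runs)  # ['42']
print(cat_or_bat)  # ['cat', 'bat']
print(words)       # ['cat', 'bat', '42', 'rats']
print(at_start)    # ['cat']
print(at_end)      # ['rats']
print(either)      # ['cat', 'rat']
```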

Metacharacters

Here is a more complete metacharacter table:

Metacharacter  Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline. The sequence "\\" matches "\" and "\(" matches "(". This is equivalent to the concept of "escape characters" found in many programming languages.
^  Matches the start of the input string. If the multi-line flag is set (re.M in Python), ^ also matches the position right after each newline "\n".
$  Matches the end of the input string. If the multi-line flag is set (re.M in Python), $ also matches the position right before each newline "\n".
*  Matches the preceding subexpression zero or more times. For example, "zo*" matches "z" as well as "zo" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but it matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood" as one match and the last three as another. "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers.
?  When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, on the string "oooo", "o+" matches as many o's as possible and gives ['oooo'], while "o+?" matches as few as possible and gives ['o', 'o', 'o', 'o'].
.  Matches any single character except the newline "\n". To match any character including "\n", use a pattern like "[\s\S]" (or, in Python, set the re.S flag).
(pattern)  Matches pattern and captures the match. The captured text can be retrieved afterwards from the result (through the SubMatches collection in VBScript, the $0…$9 properties in JScript, or the match object's group() method in Python). To match the parenthesis characters themselves, use "\(" or "\)".
(?:pattern)  Non-capturing match: matches pattern but does not capture the result, so nothing is stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".
(?=pattern)  Non-capturing, positive lookahead: matches at any position where a string matching pattern begins. The lookahead itself is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000", but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters examined by the lookahead.
(?!pattern)  Non-capturing, negative lookahead: matches at any position where a string matching pattern does not begin. The lookahead itself is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1", but not the "Windows" in "Windows2000".
(?<=pattern)  Non-capturing, positive lookbehind: similar to positive lookahead, but in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches the "Windows" in "2000Windows", but not the "Windows" in "3.1Windows". Note: Python's re module does not implement every regex feature. In particular, it requires lookbehind patterns to have a fixed width, so this example, whose alternatives differ in length, does not compile in Python. For such advanced features, consider an engine that supports them, such as the ones in Java or Scala.
(?<!pattern)  Non-capturing, negative lookbehind: similar to negative lookahead, but in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches the "Windows" in "3.1Windows", but not the "Windows" in "2000Windows". The fixed-width restriction on lookbehind in Python's re module applies here as well.
x|y  Matches x or y. For example, "z|food" matches "z" or "food" (be careful here: the alternation covers the whole of "food", not just its first letter). "(z|f)ood" matches "zood" or "food".
[xyz]  Character set. Matches any one of the contained characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]  Negated character set. Matches any character that is not contained. For example, "[^abc]" matches each of "p", "l", "i", "n" in "plain".
[a-z]  Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z". Note: a hyphen denotes a range only when it is inside a character set and sits between two characters. At the beginning of a character set it stands for the hyphen itself.
[^a-z]  Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".
\b  Matches a word boundary, that is, the position between a word character and a non-word character (such as a space). For example, "er\b" matches the "er" in "never" but not the "er" in "verb"; "\b1_" matches the "1_" in "1_23", but not the "1_" in "21_3".
\B  Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never".
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. The value of x must be one of A-Z or a-z; otherwise, c is treated as a literal "c" character. Note: Python's re module does not support this escape.
\d  Matches a digit character. Equivalent to [0-9] for ASCII text. (With grep, add the -P option to enable Perl-style regex support.)
\D  Matches a non-digit character. Equivalent to [^0-9] for ASCII text. (With grep, add the -P option to enable Perl-style regex support.)
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline character. Equivalent to \x0a and \cJ.
\r  Matches a carriage return character. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab character. Equivalent to \x09 and \cI.
\v  Matches a vertical tab character. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Similar but not equivalent to "[A-Za-z0-9_]", because "word" characters are drawn from the Unicode character set.
\W  Matches any non-word character. Equivalent to "[^A-Za-z0-9_]" for ASCII text.
\xn  Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer: a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n  Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml  If n is an octal digit (0-7) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un  Matches n, where n is a Unicode character written as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
\p{P}  The lowercase p stands for "property": it denotes a Unicode property and is the prefix of Unicode property expressions. The "P" in braces is one of the seven character properties of the Unicode character set: punctuation. The other six are L: letters; M: marks (these usually do not appear alone); Z: separators (such as spaces and newlines); S: symbols (such as mathematical and currency symbols); N: numbers (such as Arabic and Roman numerals); C: other characters. Note: some languages and engines do not support this syntax, and Python's built-in re module is one of them.
\< \>  Match the beginning (\<) and the end (\>) of a word. For example, the regular expression \<the\> matches the "the" in the string "for the wise", but not the "the" in the string "otherwise". Note: this metacharacter is not supported by all software.
( )  The expression between ( and ) is defined as a "group", and the characters it matches are saved in a temporary area (up to nine per regular expression) that can be referenced with \1 through \9.
|  Performs a logical OR between two match conditions. For example, the regular expression (him|her) matches "it belongs to him" and "it belongs to her", but not "it belongs to them". Note: this metacharacter is not supported by all software.
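
A few of the less obvious rows, tried out in Python. This is a sketch rather than a complete tour: the sample strings are made up, and the lookbehind example deliberately uses alternatives of equal length, on the assumption that it runs under Python's re module with its fixed-width restriction.

```python
import re

# Non-capturing group: alternation without creating a capture group.
industries = re.findall(r"industr(?:y|ies)", "industry and industries")

# Positive lookahead: "Windows" only when one of the versions follows.
ahead = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")

# Negative lookahead: "Windows" only when none of the versions follows.
not_ahead = re.search(r"Windows(?!95|98|NT|2000)", "Windows3.1")
blocked = re.search(r"Windows(?!95|98|NT|2000)", "Windows2000")

# Lookbehind with equal-length alternatives is accepted by Python's re.
behind = re.search(r"(?<=95|98|NT)Windows", "98Windows")

# Alternatives of different lengths inside a lookbehind are rejected.
try:
    re.compile(r"(?<=95|98|NT|2000)Windows")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Backreference: \1 repeats whatever group 1 captured.
doubled = re.findall(r"(.)\1", "bookkeeper")

print(industries)         # ['industry', 'industries']
print(ahead.group())      # Windows
print(not_ahead.group())  # Windows
print(blocked)            # None
print(behind.span())      # (2, 9)
print(variable_width_ok)  # False
print(doubled)            # ['o', 'k', 'e']
```

The lookahead matches return only "Windows" because lookarounds check the surrounding text without consuming it.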

re module details

Python provides the re module for working with regular expressions. Here are some common methods.

re.match

re.match tries to match a pattern at the start of the string. If the pattern only matches somewhere other than the start, match() returns None.

On success, this method returns a match object.

Syntax
import re
re.match(pattern, string, flags=0)
Parameters
Parameter  Description
pattern    The regular expression to match
string     The string to match against
flags      Flag bits that control how the regular expression matches, such as case sensitivity and multi-line matching
Demo
  • Use group() to get the matched content
  • Use span() to get the start and end positions of the match
import re

# Most common kind of match
content = "Hello 1234567 World_This is a Regex Demo"
print(len(content))
result = re.match(r"^Hello\s\d+\s\w{10}.*?Demo$", content)   # must match from the starting position
# result = re.match(r"^Hello\s\d{7}\s\w{10}.*?Demo$", content)
print(result)
print(result.group())
print(result.span())

If the content contains newlines, pass the re.S flag so that . also matches the newline character.

content = """Hello 1234567 World_This is a Regex Demo.
My name is Peter
I am from shenzhen"""
print(len(content))
result = re.match(r"^Hello\s\d+\s.*?shenzhen$", content, re.S)
# result = re.match(r"^Hello\s\d{7}\s\w{10}.*?Peter$", content)
print(result)
print(result.group())
print(result.span())

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())    # returns the entire match
    print("matchObj.group(1) : ", matchObj.group(1))  # returns the contents of the first ()
    print("matchObj.group(2) : ", matchObj.group(2))  # returns the contents of the second ()
else:
    print("No match!!")

Use re.match as little as possible: it only succeeds when the pattern matches at the very start of the string.


re.search

re.search scans the entire string and returns the first successful match, or None if there is no match. Unlike re.match, it does not require the match to begin at the start of the string. The search stops as soon as the first match is found.

You can use the match object's group(num) or groups() methods to get the text matched by the expression.

Syntax
re.search(pattern, string, flags=0)
Parameters
Parameter  Description
pattern    The regular expression to match
string     The string to search
flags      Flag bits that control how the regular expression matches, such as case sensitivity and multi-line matching
Demo
  1. re.search returns only the first successful match
  2. The index passed to group() cannot exceed the number of capturing groups (pairs of parentheses) in the pattern
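
The two points above can be tried out with a short sketch. The sample string is an assumption: it is the earlier demo content with some extra words in front, so that the pattern no longer matches at the start.

```python
import re

# Hypothetical content: the pattern does not start at position 0.
content = "Extra words Hello 1234567 World_This is a Regex Demo"
pattern = r"Hello\s(\d+)\s(\w+)"

matched = re.match(pattern, content)   # only tries the start of the string
found = re.search(pattern, content)    # scans forward to the first match

print(matched)         # None
print(found.group())   # Hello 1234567 World_This
print(found.group(1))  # 1234567
print(found.group(2))  # World_This
print(found.groups())  # ('1234567', 'World_This')
print(found.span())    # (12, 36)

# The pattern has two pairs of parentheses, so group(3) does not exist.
try:
    found.group(3)
    third_group = "exists"
except IndexError:
    third_group = "no such group"
print(third_group)     # no such group
```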

re.findall

re.findall scans the entire string and returns all non-overlapping matches as a list.

Syntax
re.findall(pattern, string, flags=0)
Parameters
Parameter  Description
pattern    The regular expression to match
string     The string to search
flags      Flag bits that control how the regular expression matches, such as case sensitivity and multi-line matching
Demo

The result is returned as a list.

If the pattern contains more than one capturing group, such as several (.*?), the result is still a list, but each element is a tuple.
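
A small sketch of both cases. The HTML fragment is invented for the example; any string with repeated structure would do.

```python
import re

# Hypothetical fragment of page content.
html = '<li data-id="1">Apple</li><li data-id="2">Banana</li>'

# One capturing group: a list of strings.
names = re.findall(r"<li.*?>(.*?)</li>", html)

# Two capturing groups: a list of tuples, one tuple per match.
pairs = re.findall(r'<li data-id="(.*?)">(.*?)</li>', html)

print(names)  # ['Apple', 'Banana']
print(pairs)  # [('1', 'Apple'), ('2', 'Banana')]
```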

re.sub

Replaces the parts of a string that match a regular expression.

Syntax
re.sub(pattern, repl, string, count=0)
Parameters

The meanings of the parameters are as follows:

  • pattern: the regular expression
  • repl: the replacement content
  • string: the original string
  • count: the maximum number of substitutions; the default, 0, replaces all matches
Demo
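
A minimal sketch of re.sub, assuming a made-up string that contains two dates. It shows replacing every match, limiting the number of substitutions with count, and reusing captured groups in the replacement.

```python
import re

# Hypothetical input string.
text = "2023-01-15 and 2024-12-31"

# Replace every run of digits.
all_replaced = re.sub(r"\d+", "N", text)

# Replace only the first two runs of digits.
first_two = re.sub(r"\d+", "N", text, count=2)

# Reorder captured groups with \1, \2, \3 in the replacement.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(all_replaced)  # N-N-N and N-N-N
print(first_two)     # N-N-15 and 2024-12-31
print(swapped)       # 15/01/2023 and 31/12/2024
```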

Using a function with sub

re.sub also accepts a function as the replacement: the function is called with each match object and returns the replacement string.
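
As a sketch of this, the hypothetical helper double below doubles every number found in a made-up string.

```python
import re

def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)

result = re.sub(r"\d+", double, "A23G4HFD567")
print(result)  # A46G8HFD1134
```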

Two modes

The two modes are greedy mode and non-greedy mode.

Three symbols

We often use three symbols in regular expressions:

  • Dot . : matches any character except the newline character
  • Question mark ? : matches the preceding item 0 or 1 times
  • Asterisk * : matches the preceding item 0 or more times

demo
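
The original demo is not preserved here, so the following is a reconstruction under an assumption: the input string was chosen so that the results agree with the explanation that follows, and it may differ from the string the author used.

```python
import re

# Assumed input, picked to be consistent with the explanation in this section.
s = "aaaacbabadceb"

# Non-greedy: stop at the first "b" that completes a match, then start again.
non_greedy = re.findall(r"a.*?b", s)

# Greedy: extend to the last "b" that still allows a match.
greedy = re.findall(r"a.*b", s)

# .? allows at most one character between "a" and "b".
at_most_one = re.findall(r"a.?b", s)

print(non_greedy)   # ['aaaacb', 'ab', 'adceb']
print(greedy)       # ['aaaacbabadceb']
print(at_most_one)  # ['acb', 'ab']
```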

Explanation

  1. In the non-greedy example above, the question mark ? switches on non-greedy mode: aaaacb already satisfies the pattern, so it is returned as soon as it is found. Matching then starts again and finds ab, and then once more, finding adceb.
  2. In the greedy example, the program finds the longest string that satisfies the pattern.
  3. In the last example, .? means that there can be only 0 or 1 characters between a and b, so the result contains only two matches.

Regular expression modifiers: optional flags

Regular expressions can take optional flag modifiers that control how matching works. Multiple flags can be combined with the bitwise OR operator (|). For example, re.I | re.M sets both the I and M flags:

Modifier  Description
re.I      Makes matching case-insensitive
re.L      Makes matching locale-aware
re.M      Multi-line matching; affects ^ and $
re.S      Makes . match every character, including newlines
re.U      Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, and \B.
re.X      Allows a more flexible layout (whitespace and comments inside the pattern), so that regular expressions are easier to read.
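
The effect of the most common flags can be sketched as follows. The sample text and the phone-like pattern in the re.X example are assumptions made for the illustration.

```python
import re

# Hypothetical three-line sample text.
text = "Hello\nhello world\nHELLO"

# re.I: ignore case.
case_insensitive = re.findall(r"hello", text, re.I)

# Without re.M, ^ matches only at the start of the whole string.
only_first = re.findall(r"^hello", text, re.I)

# re.I | re.M: ^ now also matches at the start of every line.
line_starts = re.findall(r"^hello", text, re.I | re.M)

# Without re.S, . stops at newlines; with re.S it crosses them.
dot_default = re.search(r"Hello.*HELLO", text)
dot_all = re.search(r"Hello.*HELLO", text, re.S)

# re.X: whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    (\d{3})   # first part
    -         # literal hyphen
    (\d{4})   # second part
""", re.X)
parts = verbose.search("555-1234").groups()

print(case_insensitive)         # ['Hello', 'hello', 'HELLO']
print(only_first)               # ['Hello']
print(line_starts)              # ['Hello', 'hello', 'HELLO']
print(dot_default)              # None
print(dot_all.group() == text)  # True
print(parts)                    # ('555', '1234')
```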

Regular expression instances

Character match

Instance  Description
python    Matches "python".

Character classes

Instance     Description
[Pp]ython    Matches "Python" or "python"; [Pp] matches one of the two letters
rub[ye]      Matches "ruby" or "rube"; [ye] matches one of the two letters
[aeiou]      Matches any one of the letters in the brackets
[0-9]        Matches any single digit; the same as [0123456789]
[a-z]        Matches any lowercase letter
[A-Z]        Matches any uppercase letter
[a-zA-Z0-9]  Matches any letter or digit
[^aeiou]     Matches any character other than a, e, i, o, u; ^ negates the set
[^0-9]       Matches any character other than a digit
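
The character classes above can be checked quickly with re.findall. The sample strings are made up for the example.

```python
import re

# Hypothetical sample string.
s = "Python ruby rube 2024"

pythons = re.findall(r"[Pp]ython", "Python and python")
rubies = re.findall(r"rub[ye]", s)
vowels = re.findall(r"[aeiou]", "ruby rube")
digits = re.findall(r"[0-9]", s)
uppercase = re.findall(r"[A-Z]", s)
non_digit_runs = re.findall(r"[^0-9]+", "ab12cd")

print(pythons)         # ['Python', 'python']
print(rubies)          # ['ruby', 'rube']
print(vowels)          # ['u', 'u', 'e']
print(digits)          # ['2', '0', '2', '4']
print(uppercase)       # ['P']
print(non_digit_runs)  # ['ab', 'cd']
```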

Special character classes

Instance  Description
.   Matches any single character other than "\n". To match any character including "\n", use a pattern like "[\s\S]" or set the re.S flag.
\d  Matches a digit character. Equivalent to [0-9] for ASCII text.
\D  Matches a non-digit character. Equivalent to [^0-9] for ASCII text.
\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w  Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]" for ASCII text.
\W  Matches any non-word character. Equivalent to "[^A-Za-z0-9_]" for ASCII text.
