Regular expression

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace matched substrings, or to extract substrings that satisfy some condition. A regular expression is built from single characters, sets of characters, ranges of characters, alternatives between characters, or any combination of these.
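
As a rough illustration of those three uses (checking, replacing, extracting), here is a minimal sketch with Python's re module. The sample string and the date pattern are made up for the example.

```python
import re

text = "Order 66 shipped on 2024-05-17"  # hypothetical sample string

# Check: does the string contain a date-like substring?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# Replace: mask every run of digits with "#"
masked = re.sub(r"\d+", "#", text)

# Extract: pull out the date-like substring
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

print(has_date)  # True
print(masked)    # Order # shipped on #-#-#
print(date)      # 2024-05-17
```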

Regular expression pattern

  1. Letters and digits match themselves.
  2. Most letters and digits take on a different meaning when preceded by a backslash (for example \d or \w).
  3. Many punctuation marks are metacharacters with a special meaning; they match themselves only when escaped with a backslash.
  4. A literal backslash must itself be escaped with a backslash.
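
The escaping rules above can be tried out with a short sketch. The sample strings are invented for the example, and raw strings (r"...") are used so that Python itself does not interpret the backslashes.

```python
import re

# An unescaped dot is a metacharacter and matches any character
any_char = re.search(r"a.c", "abc") is not None

# An escaped dot matches only a literal "."
literal_miss = re.search(r"a\.c", "abc") is None
literal_hit = re.search(r"a\.c", "a.c") is not None

# A literal backslash in the text needs an escaped backslash in the pattern
path = "C:\\temp"  # the string C:\temp
backslash_hit = re.search(r"\\temp", path) is not None

# re.escape adds the backslashes needed to match a string literally
escaped = re.escape("a.c")

print(any_char, literal_miss, literal_hit, backslash_hit)  # True True True True
print(escaped)  # a\.c
```
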
Pattern (re stands for any expression)    Description
^          Matches the beginning of the string.
$          Matches the end of the string.
.          Matches any character except newline. When the re.DOTALL flag is specified, it matches any character including newline.
[…]        Represents a set of characters, listed individually: [amk] matches 'a', 'm', or 'k'.
[^…]       Matches any character not in the set: [^abc] matches any character other than a, b, and c.
re*        Matches zero or more occurrences of the preceding expression.
re+        Matches one or more occurrences of the preceding expression.
re?        Matches zero or one occurrence of the preceding expression (greedy). A ? placed after another quantifier, as in *?, +?, or {n,m}?, makes that quantifier non-greedy.
re{n}      Matches exactly n occurrences of the preceding expression. For example, "o{2}" does not match the "o" in "Bob", but does match the two o's in "food".
re{n,}     Matches at least n occurrences of the preceding expression. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
re{n,m}    Matches the preceding expression at least n and at most m times, greedily.
a|b        Matches either a or b.
(re)       Matches the expression in parentheses and also marks it as a group.
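
To make the difference between greedy and non-greedy quantifiers concrete, here is a small sketch; it also exercises counted repetition and alternation. The sample strings are hypothetical.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Greedy: .* grabs as much as possible, up to the last ">"
greedy = re.search(r"<.*>", html).group()
# Non-greedy: .*? stops at the first ">" that lets the pattern succeed
lazy = re.search(r"<.*?>", html).group()

# Counted repetition: runs of at least two o's
counted = re.findall(r"o{2,}", "food foooood Bob")
# Alternation: either "cat" or "dog"
either = re.findall(r"cat|dog", "cat, dog, bird")

print(greedy)   # <b>bold</b> and <i>italic</i>
print(lazy)     # <b>
print(counted)  # ['oo', 'ooooo']
print(either)   # ['cat', 'dog']
```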

Special character class

Pattern    Description
.          Matches any single character except "\n". To match any character including "\n", use the re.DOTALL flag or a pattern like '[\s\S]'.
\d         Matches a digit. For ASCII text, equivalent to [0-9].
\D         Matches a non-digit character. For ASCII text, equivalent to [^0-9].
\s         Matches any whitespace character, including spaces, tabs, form feeds, and so on. For ASCII text, equivalent to [ \f\n\r\t\v].
\S         Matches any non-whitespace character. For ASCII text, equivalent to [^ \f\n\r\t\v].
\w         Matches any word character, including the underscore. For ASCII text, equivalent to '[a-zA-Z0-9_]'.
\W         Matches any non-word character. For ASCII text, equivalent to '[^a-zA-Z0-9_]'.
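
A quick sketch of these classes in action, using an invented sample string. The last two lines show how the dot treats a newline with and without re.DOTALL.

```python
import re

s = "Room 42,\tfloor_3\nEnd"  # hypothetical sample string

digits = re.findall(r"\d+", s)  # runs of digits
words = re.findall(r"\w+", s)   # runs of word characters
spaces = re.findall(r"\s", s)   # individual whitespace characters

print(digits)  # ['42', '3']
print(words)   # ['Room', '42', 'floor_3', 'End']
print(spaces)  # [' ', '\t', '\n']

# By default the dot stops at a newline; re.DOTALL lets it cross
per_line = re.findall(r".+", "a\nb")
whole = re.findall(r".+", "a\nb", flags=re.DOTALL)

print(per_line)  # ['a', 'b']
print(whole)     # ['a\nb']
```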

Common regular expression functions

  • re.match(pattern, string, flags=0): attempts to match the pattern at the start of the string. Use the group(num) or groups() methods of the returned match object to get the matched text.
import re

print(re.match("www", "www.runoob.com"))
# <re.Match object; span=(0, 3), match='www'>
print(re.match("www", "www.runoob.com").group())
# www
print(re.match("www", "www.runoob.com").span())
# (0, 3)
print(re.match("com", "www.runoob.com"))
# None

line = "Cats are smarter than dogs"
# (.*) is greedy; (.*?) is non-greedy and captures only up to the
# first point where the rest of the pattern can match
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group(): ", matchObj.group())
    print("matchObj.group(1): ", matchObj.group(1))
    print("matchObj.group(2): ", matchObj.group(2))
else:
    print("No match!")
# matchObj.group():  Cats are smarter than dogs
# matchObj.group(1):  Cats
# matchObj.group(2):  smarter
  • re.search(pattern, string, flags=0): scans the entire string and returns the first successful match, or None if there is no match.
print(re.search('www', 'www.runoob.com').span())
# (0, 3)
print(re.search('com', 'www.runoob.com').span())
# (11, 14)

re.match only matches at the beginning of the string: if the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.
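
The contrast shows up most clearly when the same pattern is tried both ways on one string. A minimal sketch follows; the third call is an added illustration showing that anchoring re.search with ^ makes it behave like re.match on a single-line string.

```python
import re

s = "www.runoob.com"

at_start = re.match(r"runoob", s)    # "runoob" is not at the start
anywhere = re.search(r"runoob", s)   # found further into the string
anchored = re.search(r"^runoob", s)  # ^ pins the search to the start

print(at_start)         # None
print(anywhere.span())  # (4, 10)
print(anchored)         # None
```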

  • re.sub(pattern, repl, string, count=0, flags=0): replaces matches of the pattern in the string. The first three parameters are mandatory:
    • pattern (mandatory): the regular expression pattern string.
    • repl (mandatory): the replacement string; it can also be a function.
    • string (mandatory): the original string to be searched and replaced.
    • count (optional): the maximum number of matches to replace. The default, 0, replaces all matches.
    • flags (optional): the matching mode applied when the pattern is compiled, given as an integer flag value (for example re.I | re.M).
phone = "2004-959-559 # this is a phone number"
num = re.sub(r'#.*$', "", phone)
print("phone number : ", num)
# phone number :  2004-959-559 
num = re.sub(r'\D', "", phone)
print("phone number : ", num)
# phone number :  2004959559
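
The parameter list mentions that repl can be a function and that count limits the number of replacements. A small sketch of both follows; the helper name double and the sample string are made up for the example.

```python
import re

def double(match):
    # A repl function receives a match object and returns the replacement string
    return str(int(match.group()) * 2)

s = "A23G4HFD567"  # hypothetical sample string

doubled = re.sub(r"\d+", double, s)        # every number is doubled
limited = re.sub(r"\d", "*", s, count=2)   # only the first two digits are replaced

print(doubled)  # A46G8HFD1134
print(limited)  # A**G4HFD567
```
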
  • re.compile(pattern[, flags]): compiles a regular expression into a pattern object, whose methods such as match() and search() can then be used.
pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I: ignore case
m = pattern.match('Hello World Wide Web')
print(m)           # <re.Match object; span=(0, 11), match='Hello World'>
print(m.group(0))  # Hello World
print(m.group(1))  # Hello
print(m.span(1))   # (0, 5)
print(m.group(2))  # World
print(m.span(2))   # (6, 11)
print(m.groups())  # ('Hello', 'World'), equivalent to (m.group(1), m.group(2), ...)
  • re.findall(pattern, string, flags=0) or pattern.findall(string[, pos[, endpos]]): finds all substrings matched by the regular expression and returns them as a list, or an empty list if nothing matches. Note: match and search return a single match; findall returns all of them.
results1 = re.findall(r'\d+', 'runoob 123 google 456')
# ['123', '456']
pattern = re.compile(r'\d+')
results2 = pattern.findall('runoob 123 google 456')
# ['123', '456']
# pattern.findall(string[, pos[, endpos]])
# pos: start position in the string; the default is 0.
# endpos: end position in the string; the default is the length of the string.
results3 = pattern.findall('run88oob123google456', 0, 10)
# ['88', '12']
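
One detail worth knowing alongside the examples above: when the pattern contains groups, findall returns the group contents rather than the whole match. A brief sketch, with a made-up key=value string:

```python
import re

s = "a=1, b=22"  # hypothetical sample string

# Two groups: each result is a tuple of the group contents
pairs = re.findall(r"(\w+)=(\d+)", s)
# One group: each result is that group's content, not the whole match
values = re.findall(r"\w+=(\d+)", s)

print(pairs)   # [('a', '1'), ('b', '22')]
print(values)  # ['1', '22']
```
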
  • re.finditer(pattern, string, flags=0): similar to findall, but returns the matches as an iterator of match objects.
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
	print(match.group())
# 12
# 32
# 43
# 3
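
Because finditer yields match objects rather than plain strings, each result also carries its position. A short sketch on the same input, collecting the text and span of every match:

```python
import re

# Each item pairs the matched text with its (start, end) position
hits = [(m.group(), m.span()) for m in re.finditer(r"\d+", "12a32bc43jf3")]

for text, span in hits:
    print(text, span)
# 12 (0, 2)
# 32 (3, 5)
# 43 (7, 9)
# 3 (11, 12)
```
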
  • re.split(pattern, string[, maxsplit=0, flags=0]): splits the string at each match of the pattern and returns the resulting list of substrings. maxsplit is the maximum number of splits: maxsplit=1 splits once; the default, 0, means no limit.
print(re.split(r'\W+', 'runoob, runoob, runoob.'))
# \W+ matches one or more non-word characters, which act as separators
# ['runoob', 'runoob', 'runoob', '']
print(re.split(r'(\W+)', ' runoob, runoob, runoob.'))
# With a capturing group, the matched separators are also returned in the list
# ['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
print(re.split(r'\W+', ' runoob, runoob, runoob.', 1))
# ['', 'runoob, runoob, runoob.']