Regular expressions (RE)

  • A concept from computer science
  • Uses a single pattern string to describe and match strings that follow a rule
  • Often used to search for and replace text that fits a given pattern

Regular expression syntax

  • . (dot): matches any single character except \n

  • []: matches any one of the characters listed inside the brackets; for example, [LY0] matches each L, Y, or 0 in LLY, Y0, and LIU (a comma inside the brackets would itself be matched literally)

  • \d: any digit

  • \D: anything except a digit

  • \s: any whitespace character, such as a space or a tab

  • \S: anything except whitespace

  • \w: word characters, namely a-z, A-Z, 0-9, and _

  • \W: anything except what \w matches

  • *: the preceding item is repeated zero or more times, e.g. \w*

  • +: the preceding item appears at least once

  • ?: the preceding item appears zero times or once

  • {m,n}: the preceding item appears at least m times and at most n times

  • ^: matches the beginning of the string

  • $: matches the end of the string

  • \b: matches a word boundary

  • (): groups part of the regular expression; groups are numbered from 1 in the order of their opening parentheses

    Exactly one digit: ^\d$
    Digits only, at least one digit: ^\d+$
    Digits only, 5 to 10 digits long: ^\d{5,10}$
    Registration age from 16 to 99: ^(1[6-9]|[2-9]\d)$ (not ^[16,99]$, which is a character class matching a single 1, 6, 9, or comma)
    Only English letters and digits: ^[a-zA-Z0-9]+$
    QQ number: [0-9]{5,12}

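  • \A: matches only at the beginning of the string; \Aabcd matches abcd at the start

  • \Z: matches only at the end of the string; abcd\Z matches abcd at the end

  • |: alternation; matches either the expression on its left or the one on its right

  • (?P<name>...): named group; gives the group an alias in addition to its number, e.g. (?P<name>12345){2} matches 1234512345

  • (?P=name): backreference to the named group; matches the same text that the group matched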
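
The notation above can be tried out with a short sketch. The dictionary keys, the helper is_valid, and the sample inputs are made up for illustration; the age pattern uses alternation because a character class cannot express a numeric range.

```python
import re

# Validation patterns built from the notation above.
# [16,99] would be a character class, so the age rule uses alternation.
PATTERNS = {
    "one_digit": r"^\d$",
    "digits_only": r"^\d+$",
    "five_to_ten_digits": r"^\d{5,10}$",
    "age_16_to_99": r"^(1[6-9]|[2-9]\d)$",
    "alphanumeric": r"^[a-zA-Z0-9]+$",
}


def is_valid(kind, text):
    """Hypothetical helper: True if text matches the named pattern."""
    return re.search(PATTERNS[kind], text) is not None


print(is_valid("age_16_to_99", "15"))   # False
print(is_valid("age_16_to_99", "16"))   # True
print(is_valid("alphanumeric", "abc123"))  # True

# A named group and a backreference to it: the same word twice in a row.
repeated = re.compile(r"(?P<word>\w+) (?P=word)")
m = repeated.search("it is is fine")
print(m.group("word"))  # is
```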

Rough steps for using RE

  1. Use compile to turn the string that represents the regular expression into a pattern object
  2. Use the methods of the pattern object to search the text; a successful search returns the result as a Match object
  3. Finally, use the attributes and methods of the Match object to read the result and act on it as needed
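
A minimal sketch of the three steps, using a made-up sample string:

```python
import re

# Step 1: compile the pattern string into a pattern object.
pattern = re.compile(r"\d+")

# Step 2: use a pattern method to search the text; the result is a
# Match object, or None when nothing matches.
match = pattern.search("order 66 shipped")

# Step 3: read the result through the Match object's methods.
if match is not None:
    print(match.group())  # the matched text: 66
    print(match.span())   # (start, end) of the match: (6, 8)
```

Checking for None before calling group() matters, because a failed search returns None rather than an empty Match object.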

Common Match object methods

  • group(): returns one or more matched substrings; group() or group(0) returns the whole match
  • start([group]): returns the start position, within the whole string, of the substring matched by the group; the argument defaults to 0
  • end([group]): returns the end position, within the whole string, of the substring matched by the group; the argument defaults to 0
  • span([group]): returns the tuple (start(group), end(group))

# Import the required package
import re

# Find a number
# The r prefix marks a raw string, so backslashes are not treated as escapes
p = re.compile(r'\d+')
# Look in the string "one12twothree33456four78" using the rule compiled into p
# match() returns a Match object on success, or None if nothing matches
# match() only tries at the start of the string, which here is a letter
m = p.match("one12twothree33456four78")

print(m)

Output:

None

# Import the required package
import re

# Find a number
# The r prefix marks a raw string, so backslashes are not treated as escapes
p = re.compile(r'\d+')
# Look in the string "one12twothree33456four78" using the rule compiled into p
# match() returns a Match object on success, or None if nothing matches
# The arguments 3 and 26 give the range of the string to search
m = p.match("one12twothree33456four78", 3, 26)

print(m)

# Notes on the code above
# 1. match() accepts arguments that set the start and end positions
# 2. Only one result is returned: the first successful match

Output:

<_sre.SRE_Match object; span=(3, 5), match='12'>

print(m[0])
print(m.start(0))
print(m.end(0))

Output:

12
3
5

import re
# re.I makes the match case-insensitive
p = re.compile(r'([a-z]+) ([a-z]+)', re.I)

m = p.match("I am really love you")
print(m)

Output:

<_sre.SRE_Match object; span=(0, 4), match='I am'>

print(m.group(0))
print(m.start(0))
print(m.end(0))

Output:

I am
0
4

print(m.group(1))
print(m.start(1))
print(m.end(1))

Output:

I
0
1

print(m.group(2))
print(m.start(2))
print(m.end(2))

Output:

am
2
4

print(m.groups())

Output:

('I', 'am')

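Searching

  • search(string[, pos[, endpos]]): looks for the first match anywhere in the string; pos and endpos give the start and end positions of the range to search
  • findall: finds all non-overlapping matches and returns them as a list
  • finditer: finds all matches and returns an iterator of Match objects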

import re

p = re.compile(r'\d+')

m = p.search("one12two34three567four")

print(m.group())

Output:

12

rst = p.findall("one12two34three567four")
print(type(rst))

print(rst)

Output:

<class 'list'>
['12', '34', '567']
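
finditer and the pos argument of search are not shown above. A small sketch on the same sample string, where the variable names are chosen only for illustration:

```python
import re

p = re.compile(r"\d+")
text = "one12two34three567four"

# finditer yields one Match object per match, so positions are available too.
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)  # [('12', (3, 5)), ('34', (8, 10)), ('567', (15, 18))]

# search with pos=5 starts looking from index 5, skipping the first number.
later = p.search(text, 5)
print(later.group())  # 34
```

finditer is the better choice over findall when the position of each match is needed, or when the text is large and the matches should be consumed one at a time.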

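Replacing with sub

  • sub(repl, string[, count]): replaces each match of the pattern in string with repl; count limits the number of replacements, and by default all matches are replaced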

# sub replacement example
import re

# \w matches letters, digits, and the underscore
p = re.compile(r'(\w+) (\w+)')

s = "hello 123 wang 456, i love you"

rst = p.sub(r'Hello world', s)
print(rst)

Output:

Hello world Hello world, Hello world you
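
As a further sketch on the same pattern and string, the replacement can refer back to the captured groups, and count can limit how many matches are replaced. The variable names here are illustrative.

```python
import re

p = re.compile(r"(\w+) (\w+)")
s = "hello 123 wang 456, i love you"

# \1 and \2 in the replacement refer to the captured groups,
# so this swaps the two words of every matched pair.
swapped = p.sub(r"\2 \1", s)
print(swapped)  # 123 hello 456 wang, love i you

# count=1 limits the replacement to the first match only.
first_only = p.sub("Hello world", s, count=1)
print(first_only)  # Hello world wang 456, i love you
```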

Matching Chinese

  • Most Chinese characters fall in the range [\u4e00-\u9fa5], which does not include full-width punctuation

import re

# A string that mixes Chinese and English text
title = '世界 你好, hello moto'

p = re.compile(r'[\u4e00-\u9fa5]+')
rst = p.findall(title)

print(rst)

Output:

['世界', '你好']
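
A small sketch of the point about full-width punctuation, using an assumed sample string in which \uff0c and \uff01 are the full-width comma and exclamation mark:

```python
import re

# Assumed sample: two Chinese words separated by full-width punctuation,
# followed by an English word.
text = "你好\uff0c世界\uff01hello"

p = re.compile(r"[\u4e00-\u9fa5]+")
words = p.findall(text)
print(words)  # ['你好', '世界']
```

The punctuation and the English word fall outside the range, so they split the matches instead of appearing in them.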

Greedy and non-greedy matching

  • Greedy: matches as much as possible; quantifiers such as * are greedy
  • Non-greedy: matches the smallest amount of text that satisfies the pattern; adding ? after a quantifier (for example *?) makes it non-greedy
  • RE uses greedy matching by default

import re

title = u'<div>name</div><div>age</div>'

p1 = re.compile(r'<div>.*</div>')
p2 = re.compile(r'<div>.*?</div>')

m1 = p1.search(title)
print(m1.group())

m2 = p2.search(title)
print(m2.group())

Output:

<div>name</div><div>age</div>
<div>name</div>
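
The difference is easier to see with findall and a capturing group. This is a sketch on the same sample markup, with variable names chosen for illustration:

```python
import re

html = "<div>name</div><div>age</div>"

# Greedy: .* runs to the last </div>, so there is a single long match.
greedy = re.findall(r"<div>(.*)</div>", html)
print(greedy)  # ['name</div><div>age']

# Non-greedy: .*? stops at the first </div>, so each element matches on its own.
lazy = re.findall(r"<div>(.*?)</div>", html)
print(lazy)  # ['name', 'age']
```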