This is the sixth day of my participation in the August Text Challenge.

In solitude, you can gain everything except character. — Stendhal, The Red and the Black

Overview

In the previous article, I used Python's re module only to extract the value of a particular field from a JSON string with a regular expression, and I was not familiar with its other methods. To understand and use re more thoroughly, I have documented my learning process here.

When a crawler retrieves data from web pages, tools such as etree, BeautifulSoup, and Scrapy are commonly used to parse the page content. Regular expressions, however, are arguably the most flexible of these tools. The rest of this article summarizes the methods of Python's re module.

Python supports regular expressions through the re module. The general steps for using re are:

  1. Use re.compile(pattern) to compile the string form of the regular expression into a Pattern instance.
  2. Use the methods of the Pattern instance to process the text and get the matching result (a Match instance).
  3. Use the Match instance to get the information you need and carry out further operations.

A simple example:

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'hello')

    # Use the Pattern to match the text; returns a Match object, or None if there is no match
    match = pattern.match('hello world!')

    if match:
        # Use the Match object to get the matched text
        print(match.group())  # Output: hello

Defining regular expressions with raw strings is a convenient way to avoid problems with escape characters.

A raw string is written with an r prefix: r"..."

In a raw string, Python does not treat the backslash as an escape character, so you do not need to double every backslash by hand, and the expression reads more intuitively.
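To make the point about raw strings concrete, here is a small sketch. The sample strings and variable names are my own and are only for illustration.

```python
import re

# Without a raw string, every backslash meant for the regex engine must be doubled.
escaped = '\\d+\\.\\d+'
raw = r'\d+\.\d+'

# Both literals produce exactly the same pattern text.
same_text = (escaped == raw)

# Matching one literal backslash: the regex itself needs \\, so a normal
# string literal needs four backslashes while a raw string needs two.
path = 'C:\\temp'  # the actual text is C:\temp
backslash_normal = re.search('\\\\', path)
backslash_raw = re.search(r'\\', path)

print(same_text)                             # True
print(re.search(raw, 'pi is 3.14').group())  # 3.14
print(backslash_raw.start())                 # 2
```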

1. Using re

re.compile(strPattern[, flag]):

This is a factory method of the Pattern class that compiles a regular expression given as a string into a Pattern object.

The first argument is the regular expression string.

The second argument (optional) is the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M.

Possible values are as follows:

  • re.I (re.IGNORECASE): ignore case

  • re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line

  • re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline

  • re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 it can only be used with bytes patterns)

  • re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (in Python 3 this is already the default for str patterns)

  • re.X (re.VERBOSE): verbose mode. In this mode a regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:

    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
    b = re.compile(r"\d+\.\d*")
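As a rough illustration of how the matching modes combine, the sketch below applies re.I | re.M and re.S to a two-line string. The sample text is made up for this example.

```python
import re

text = 'Hello world\nhello Python'

# Without flags, '^' anchors only at the very start and case matters.
plain = re.findall(r'^hello', text)

# re.I ignores case; re.M lets '^' match at the start of every line.
combined = re.findall(r'^hello', text, re.I | re.M)

# re.S lets '.' match the newline between the two lines.
no_dotall = re.search(r'world.hello', text)
dotall = re.search(r'world.hello', text, re.S)

print(plain)       # []
print(combined)    # ['Hello', 'hello']
print(no_dotall)   # None
print(dotall.group() == 'world\nhello')  # True
```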

The re module also provides a number of module-level functions that offer the same functionality. Each of them can be replaced by the corresponding method of a Pattern instance; their only benefit is saving the one line of re.compile() code, at the cost of not being able to reuse the compiled Pattern object. These functions are described together with the instance methods of the Pattern class below. Using them, the example above can be abbreviated as:

m = re.match(r'hello', 'hello world!')
print(m.group())

2. Using Pattern

A Pattern object is a compiled regular expression that provides a series of methods for matching and searching text.

The Pattern object cannot be instantiated directly and must be obtained using re.compile().

2.1 Attributes of Pattern objects

Pattern provides several read-only attributes for getting information about an expression:

  1. pattern: the expression string used at compile time.

  2. flags: the matching mode used at compile time, as a number.

  3. groups: the number of groups in the expression.

  4. groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers. Groups without names are not included.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    text = 'hello world'
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

    print("p.pattern:", p.pattern)
    print("p.flags:", p.flags)
    print("p.groups:", p.groups)
    # groupindex is a read-only mapping; dict() gives the plain dictionary form
    print("p.groupindex:", dict(p.groupindex))

The following output is displayed:

p.pattern: (\w+) (\w+)(?P<sign>.*)
p.flags: 48
p.groups: 3
p.groupindex: {'sign': 3}

2.2 Methods of the Pattern object

1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

Returns a Match object if the pattern matches at the beginning of the string (at position pos).

Returns None if the pattern fails to match at that position, or if endpos is reached before the match is complete.

The default values of pos and endpos are 0 and len(string), respectively.

re.match() cannot specify these two parameters; its flags argument specifies the matching mode used when compiling the pattern.

Note: this method does not require the whole string to match. If characters remain in the string after the pattern has finished matching, the match is still considered successful. For a complete match, add the boundary anchor '$' to the end of the expression.
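The behavior of pos, endpos, and the trailing '$' can be sketched as follows. The sample strings are my own, and re.fullmatch() is not covered in this article: it is a standard library function (Python 3.4 and later) that I include here only as an alternative way to require a complete match.

```python
import re

pattern = re.compile(r'\d+')

# match() only tries at the start position, which defaults to 0.
at_start = pattern.match('abc123def')            # None: 'a' is not a digit

# pos moves the start position; endpos limits how far the match may read.
with_pos = pattern.match('abc123def', 3)         # matches '123'
with_endpos = pattern.match('abc123def', 3, 5)   # matches '12'

# match() tolerates trailing text; '$' or fullmatch() demands the whole string.
prefix_ok = re.match(r'\d+', '123abc')           # matches '123'
anchored = re.match(r'\d+$', '123abc')           # None
full = re.fullmatch(r'\d+', '123')               # matches '123'

print(at_start, with_pos.group(), with_endpos.group())
print(prefix_ok.group(), anchored, full.group())
```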

2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method searches the string for a substring that matches the pattern.

It tries to match the pattern starting at index pos of the string and returns a Match object if the pattern can be matched completely.

If no match is possible there, pos is increased by 1 and the attempt is repeated; None is returned if no match has been found by the time pos reaches endpos.

The default values of pos and endpos are 0 and len(string), respectively.

re.search() cannot specify these two parameters; its flags argument specifies the matching mode used when compiling the pattern.

A simple example:

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')

    # Use search() to find a matching substring; returns None if there is none
    # This example would not match with match()
    match = pattern.search('hello world!')

    if match:
        # Use the Match object to get the matched text
        print(match.group())  # Output: world

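Note the difference between match() and search(): match() only tries the pattern at the start position, while search() scans forward through the string until it finds a match.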
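A side-by-side sketch of the two methods on the same text may help; the variable names are only illustrative.

```python
import re

pattern = re.compile(r'world')
text = 'hello world!'

# match() tries only at index 0, where the text starts with 'hello'.
by_match = pattern.match(text)

# search() keeps moving forward until the pattern fits.
by_search = pattern.search(text)

# match() succeeds once pos points directly at 'world'.
by_match_pos = pattern.match(text, 6)

print(by_match)             # None
print(by_search.span())     # (6, 11)
print(by_match_pos.group()) # world
```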

3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits the string at each matching substring and returns the resulting list.

maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Split the string at each run of digits
    print(p.split('one1two2three3four4'))  # Output: ['one', 'two', 'three', 'four', '']
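The effect of maxsplit is easier to see with a concrete call. The sketch below also shows one behavior this article does not describe but which the standard library documents: when the pattern contains a capturing group, the separators are kept in the result.

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3four4'

# Stop after two splits; the rest of the string is returned unsplit.
limited = p.split(text, maxsplit=2)

# With a capturing group, the matched separators appear in the list too.
with_separators = re.split(r'(\d+)', 'one1two2three')

print(limited)          # ['one', 'two', 'three3four4']
print(with_separators)  # ['one', '1', 'two', '2', 'three']
```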

4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches the string and returns all matching substrings as a list.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Find all the numbers and return them as a list
    print(p.findall('one1two2three3four4'))  # Output: ['1', '2', '3', '4']
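One detail worth knowing, which comes from the standard library documentation rather than from the text above: when the pattern contains groups, findall() returns the group contents instead of the whole match. A minimal sketch with made-up sample text:

```python
import re

text = 'a=1, b=22, c=333'

# No groups: each item is the whole match.
no_group = re.findall(r'\w=\d+', text)

# One group: each item is that group's text.
one_group = re.findall(r'\w=(\d+)', text)

# Several groups: each item is a tuple of the groups.
two_groups = re.findall(r'(\w)=(\d+)', text)

print(no_group)    # ['a=1', 'b=22', 'c=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```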

5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches the string and returns an iterator that yields each match result (a Match object) in order.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Iterate over the match results (Match objects) in order
    for m in p.finditer('one1two2three3four4'):
        print(m.group())  # Output: 1, 2, 3, 4 (one per line)

6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces each matched substring in the string with repl and returns the resulting string. When repl is a string, it can use \id, \g<id>, or \g<name> to reference groups; \0 is not a group reference, so use \g<0> to refer to the whole match. When repl is a function, it should take a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded). count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(p.sub(r'\1 \2 hi', s))  # Output: i say hi, hello world hi!

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.sub(func, s))  # Output: I Say, Hello World!
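The other forms of group reference and the count parameter can be sketched like this. The group names first and second are hypothetical names I chose for the example.

```python
import re

p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world!'

# Reference groups by name with \g<name>.
swapped = p.sub(r'\g<second> \g<first>', s)

# count=1 replaces only the first match.
once = p.sub(r'\2 \1', s, count=1)

# \g<0> stands for the whole match.
bracketed = p.sub(r'[\g<0>]', s)

# \g<1>0 means group 1 followed by a literal '0' (\10 would mean group 10).
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

print(swapped)    # say i, world hello!
print(once)       # say i, hello world!
print(bracketed)  # [i say], [hello world]!
print(padded)     # a10b20
```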

7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

subn() differs from sub() only in what it returns:

subn() returns a tuple: (string after substitution, number of substitutions)

sub() returns a string: the string after substitution

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(p.subn(r'\1 \2 hi', s))  # Output: ('i say hi, hello world hi!', 2)

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.subn(func, s))  # Output: ('I Say, Hello World!', 2)

3. Using Match

A Match object is the result of a match and contains a lot of information about that match, which can be retrieved through the read-only attributes and methods that Match provides.

3.1 Attributes of the Match Object

  1. string: the text used for matching.
  2. re: the Pattern object used for matching.
  3. pos: the index in the text at which the regular expression starts searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
  4. endpos: the index in the text at which the regular expression stops searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
  5. lastindex: the index of the last captured group. None if no group was captured.
  6. lastgroup: the name of the last captured group. None if that group has no name or no group was captured.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    text = 'hello world'
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
    match = p.match(text)
    if match:
        print("match.re:", match.re)
        print("match.string:", match.string)
        print("match.endpos:", match.endpos)
        print("match.pos:", match.pos)
        print("match.lastgroup:", match.lastgroup)
        print("match.lastindex:", match.lastindex)

# Output:
# match.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)', re.DOTALL)
# match.string: hello world
# match.endpos: 11
# match.pos: 0
# match.lastgroup: sign
# match.lastindex: 3

3.2 Methods of the Match object

1. group([group1, ...]):

Returns the strings captured by one or more groups; when several arguments are given, the result is returned as a tuple.

Groups can be referred to by number or by name.

The number 0 stands for the entire matched substring.

If no argument is given, group(0) is returned.

A group that did not capture anything returns None.
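A short sketch of these rules, using a made-up key=value pattern in which the value group is optional:

```python
import re

# The 'value' group is optional, so it captures nothing for 'width='.
m = re.match(r'(?P<key>\w+)=(?P<value>\d+)?', 'width=')

whole = m.group()          # same as m.group(0)
by_name = m.group('key')   # groups can be referenced by name
several = m.group(1, 2)    # several arguments give a tuple
missing = m.group('value') # a group that captured nothing gives None

print(whole)    # width=
print(by_name)  # width
print(several)  # ('width', None)
print(missing)  # None
```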

2. groups([default]):

Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last).

default is the value used in place of groups that did not capture anything; it defaults to None.

3. groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; groups without names are not included. default has the same meaning as above.
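The default argument of groups() and groupdict() can be illustrated with a pattern whose second group is optional. The pattern and the '0' placeholder are assumptions made for this sketch.

```python
import re

# The 'value' group is optional, so it captures nothing for 'width='.
m = re.match(r'(?P<key>\w+)=(?P<value>\d+)?', 'width=')

groups_plain = m.groups()               # missing group shows up as None
groups_filled = m.groups('0')           # missing group replaced by '0'
named_plain = m.groupdict()
named_filled = m.groupdict(default='0')

print(groups_plain)   # ('width', None)
print(groups_filled)  # ('width', '0')
print(named_plain)    # {'key': 'width', 'value': None}
print(named_filled)   # {'key': 'width', 'value': '0'}
```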

4. start([group]):

Returns the start index in the string of the substring captured by the specified group (the index of the first character of the substring). group defaults to 0.

5. end([group]):

Returns the end index in the string of the substring captured by the specified group (the index of the last character of the substring, plus 1). group defaults to 0.

6. span([group]):

Returns (start(group), end(group)).
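Because end() is one past the last character, the values from start(), end(), and span() can be used directly as slice bounds. The sketch below shows this, plus one detail from the standard library documentation that is not mentioned above: a group that took no part in the match reports -1.

```python
import re

text = 'hello world!'
m = re.search(r'(\w+) (\w+)', text)

# Slicing with span() gives back exactly what group() returns.
start, end = m.span(2)
sliced = text[start:end]

# With no argument, the whole match (group 0) is used.
whole_span = m.span()

# An optional group that matched nothing has start/end of -1.
m2 = re.match(r'(a)?b', 'b')
unused = m2.span(1)

print(sliced)      # world
print(whole_span)  # (0, 11)
print(unused)      # (-1, -1)
```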

7. expand(template):

Substitutes the matched groups into template and returns the result. In template, you can use \id, \g<id>, or \g<name> to reference groups; \0 is not a group reference, so use \g<0> for the whole match. \id is equivalent to \g<id>, but \10 is interpreted as group 10, so to express \1 followed by the character '0' you must use \g<1>0.


# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
    print("m.group(0, 1, 2, 3):", m.group(0, 1, 2, 3))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))


# Output:
# m.group(0, 1, 2, 3): ('hello world!', 'hello', 'world', '!')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
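The difference between \10 and \g<1>0 in expand() templates is easy to get wrong, so here is a small sketch. The pattern, the group name second, and the outcome variable are hypothetical and only for illustration.

```python
import re

m = re.match(r'(\w+) (?P<second>\w+)', 'hello world')

by_number = m.expand(r'\2, \1')      # numbered references
by_name = m.expand(r'\g<second>!')   # named reference
one_then_zero = m.expand(r'\g<1>0')  # group 1 followed by a literal '0'

# r'\10' is read as group 10, which this pattern does not have,
# so the template is rejected with re.error.
try:
    m.expand(r'\10')
    outcome = 'expanded'
except re.error:
    outcome = 'rejected'

print(by_number)      # world, hello
print(by_name)        # world!
print(one_then_zero)  # hello0
print(outcome)        # rejected
```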

References

Python Official Documentation

www.cnblogs.com/huxi/archiv…