Using regular expressions

Regular expression knowledge

When writing programs or web pages that process strings, we often need to find strings that conform to some complex rule. A regular expression is a tool for describing such rules: it defines a string-matching pattern that can be used to check whether a string contains a part that matches the pattern, or to extract or replace the matching parts. If you have searched for files on Windows using the wildcards * and ? in file names, you have already used a similar text-matching tool. Regular expressions are more powerful than wildcards and can describe what you need more precisely. The price is that a regular expression is considerably harder to write than a wildcard; every benefit comes at a cost, just as with learning a programming language. For example, you can write a regular expression that finds every string that starts with 0, followed by 2 to 3 digits, a hyphen "-", and finally 7 or 8 digits (such as 028-12345678 or 0813-7654321), which is the format of domestic landline numbers.

Computers were originally built for mathematical computation, and the information they handled was mostly numeric. Today, most of the information we handle in daily work is text. We want computers to recognize and process text that follows certain patterns, which is why regular expressions matter so much. Almost all programming languages today support regular expressions, and Python supports them through the re module in the standard library.
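
As a minimal sketch of the landline rule just described (a 0, then 2 to 3 digits, a hyphen, then 7 or 8 digits), the check could look like this in Python. The names LANDLINE and is_landline are made up for illustration.

```python
import re

# Pattern from the description: 0, then 2-3 digits, a hyphen, then 7-8 digits.
LANDLINE = re.compile(r'0\d{2,3}-\d{7,8}')


def is_landline(text):
    """Return True if the whole string has the landline format."""
    return LANDLINE.fullmatch(text) is not None


print(is_landline('028-12345678'))   # True
print(is_landline('0813-7654321'))   # True
print(is_landline('28-12345678'))    # False: does not start with 0
```

fullmatch is used here so that extra characters before or after the number make the check fail.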

Consider the following problem: we get a string from somewhere (perhaps a text file, perhaps a news article on the Internet) and want to find the mobile and landline numbers in it. We could describe a mobile number as 11 digits (though not any 11 digits, since you have never seen a number like "25012345678") and a landline number with the pattern from the previous paragraph, but finding them without regular expressions would be very tedious.

To learn more about regular expressions, you can read a popular blog post called the 30-Minute Tutorial on Regular Expressions. After reading it, you should be able to follow the table below, which summarizes some of the basic symbols in regular expressions.

Symbol        Description                                                   Example             Notes
.             Matches any character                                         b.t                 Matches bat, but, b#t, b1t, etc.
\w            Matches a letter, digit, or underscore                        b\wt                Matches bat, b1t, b_t, etc., but not b#t
\s            Matches a whitespace character (including \r, \n, \t, etc.)   love\syou           Matches "love you"
\d            Matches a digit                                               \d\d                Matches 01, 23, 99, etc.
\b            Matches a word boundary                                       \bThe\b
^             Matches the start of the string                               ^The                Matches a string that starts with The
$             Matches the end of the string                                 .exe$               Matches a string that ends with .exe
\W            Matches any character except a letter, digit, or underscore   b\Wt                Matches b#t, b@t, etc., but not but, b1t, b_t
\S            Matches a non-whitespace character                            love\Syou           Matches love#you etc., but not "love you"
\D            Matches a non-digit                                           \d\D                Matches 9A, 3#, 0F, etc.
\B            Matches a non-word boundary                                   \Bio\B
[]            Matches any single character in the set                       [aeiou]             Matches any vowel
[^]           Matches any single character not in the set                   [^aeiou]            Matches any non-vowel character
*             Matches 0 or more times                                       \w*
+             Matches 1 or more times                                       \w+
?             Matches 0 or 1 times                                          \w?
{N}           Matches exactly N times                                       \w{3}
{M,}          Matches at least M times                                      \w{3,}
{M,N}         Matches at least M and at most N times                        \w{3,6}
|             Alternation (branch)                                          foo|bar             Matches foo or bar
(?#)          Comment
(exp)         Matches exp and captures it in an auto-numbered group
(?<name>exp)  Matches exp and captures it in a group named name                                 Python spells this (?P<name>exp)
(?:exp)       Matches exp without capturing the matched text
(?=exp)       Matches the position before exp                               \b\w+(?=ing)        Matches danc in "I'm dancing"
(?<=exp)      Matches the position after exp                                (?<=\bdanc)\w+\b    Matches the first ing in "I love dancing and reading"
(?!exp)       Matches a position not followed by exp
(?<!exp)      Matches a position not preceded by exp
*?            Repeats any number of times, but as few as possible           a.*b and a.*?b      Applied to aabab, the former matches the whole string aabab; the latter matches aab and ab
+?            Repeats 1 or more times, but as few as possible
??            Repeats 0 or 1 times, but as few as possible
{M,N}?        Repeats M to N times, but as few as possible
{M,}?         Repeats M or more times, but as few as possible
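
A few rows of the table can be tried directly in Python. The sketch below reuses the table's own examples for greedy and lazy repetition and for lookahead and lookbehind. The named-group example, with its group names area and number, is an added illustration; Python writes named groups as (?P<name>exp).

```python
import re

# Greedy vs. lazy repetition on the string from the table.
greedy = re.findall(r'a.*b', 'aabab')
lazy = re.findall(r'a.*?b', 'aabab')
print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']

# Lookahead and lookbehind match positions; "ing" and "danc" are not consumed.
before_ing = re.findall(r'\b\w+(?=ing)', "I'm dancing")
after_danc = re.findall(r'(?<=\bdanc)\w+\b', 'I love dancing and reading')
print(before_ing)  # ['danc']
print(after_danc)  # ['ing']

# Named groups in Python's spelling: (?P<name>exp).
m = re.search(r'(?P<area>0\d{2,3})-(?P<number>\d{7,8})', 'Call 028-12345678 today')
print(m.group('area'), m.group('number'))  # 028 12345678
```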

Note: If the character to be matched is a special character in regular expressions, escape it with \. For example, to match a decimal point, write \. because a bare . matches any character. Similarly, to match parentheses, write \( and \); otherwise the parentheses are treated as a group in the regular expression.
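
The effect of escaping can be seen in a short sketch. The sample strings are made up for illustration; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

# An unescaped dot matches any character, so '3x14' is also found.
unescaped = re.findall(r'3.14', '3.14 3x14')
# An escaped dot matches only a literal '.'.
escaped = re.findall(r'3\.14', '3.14 3x14')
print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']

# Escaped parentheses match literal parentheses instead of forming a group.
codes = re.findall(r'\(\d+\)', 'call (010) or (021)')
print(codes)      # ['(010)', '(021)']

# re.escape builds a pattern that matches the given text literally.
literal = re.escape('(1.5)')
print(re.fullmatch(literal, '(1.5)') is not None)  # True
```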

Python’s support for regular expressions

Python provides the re module to support regular expression operations. The following are the core functions in the re module.

Function                                      Description
compile(pattern, flags=0)                     Compiles a regular expression and returns a regular expression object
match(pattern, string, flags=0)               Matches the pattern at the start of the string; returns a match object on success, otherwise None
search(pattern, string, flags=0)              Searches the string for the first occurrence of the pattern; returns a match object on success, otherwise None
split(pattern, string, maxsplit=0, flags=0)   Splits the string using the pattern as the delimiter and returns a list
sub(pattern, repl, string, count=0, flags=0)  Replaces the parts of the string that match the pattern with repl; count limits the number of replacements
fullmatch(pattern, string, flags=0)           The full-match version of match (the pattern must match from the start to the end of the string)
findall(pattern, string, flags=0)             Finds all matches of the pattern in the string and returns them as a list
finditer(pattern, string, flags=0)            Finds all matches of the pattern in the string and returns an iterator
purge()                                       Clears the cache of implicitly compiled regular expressions
re.I / re.IGNORECASE                          Flag: ignore case when matching
re.M / re.MULTILINE                           Flag: multi-line matching
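
The differences between these functions are easiest to see side by side. The following is a small sketch on a made-up sample string.

```python
import re

text = 'tel: 010-1234567'

# match only succeeds at the start of the string; search scans the whole string.
at_start = re.match(r'\d+', text)
first = re.search(r'\d+', text)
print(at_start)        # None
print(first.group())   # 010

# fullmatch requires the entire string to match.
whole = re.fullmatch(r'\d+', '12345')
print(whole is not None)  # True

# findall, sub, and split operate on every match.
numbers = re.findall(r'\d+', text)
masked = re.sub(r'\d', '#', text)
parts = re.split(r'[:\-]\s*', text)
print(numbers)  # ['010', '1234567']
print(masked)   # tel: ###-#######
print(parts)    # ['tel', '010', '1234567']
```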

Note: In practice, the functions of the re module listed above can be replaced by the methods of a regular expression object. If a regular expression is going to be used repeatedly, compiling it with the compile function to create a regular expression object is the wiser choice.
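
As a rough illustration of this note, the sketch below compiles a pattern once and reuses the resulting object. The name WORD and the sample lines are assumptions made for this example.

```python
import re

# Compile once, then reuse the regular expression object.
WORD = re.compile(r'[a-z]+', re.IGNORECASE)

lines = ['Hello world', '42', 'Regex in Python']
counts = [len(WORD.findall(line)) for line in lines]
print(counts)  # [2, 0, 3]

# The module-level function gives the same result for a single call.
same = re.findall(r'[a-z]+', lines[0], flags=re.IGNORECASE) == WORD.findall(lines[0])
print(same)  # True
```

The methods of a regular expression object mirror the module-level functions, but take the string as their first argument because the pattern is already fixed.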

Here’s a series of examples to show you how to use regular expressions in Python.

Example 1: Verify that the input user name and QQ number are valid and provide the corresponding prompt message.

Requirements: the user name must be 6 to 20 characters long and consist of letters, digits, or underscores; the QQ number must be 5 to 12 digits long and cannot start with 0.
import re


def main():
    username = input('Please enter user name: ')
    qq = input('Please enter QQ number: ')
    # The first argument of match is a regular expression string or a regular expression object
    # The second argument is the string to match against the regular expression
    m1 = re.match(r'^[0-9a-zA-Z_]{6,20}$', username)
    if not m1:
        print('Please enter a valid username.')
    m2 = re.match(r'^[1-9]\d{4,11}$', qq)
    if not m2:
        print('Please enter a valid QQ number.')
    if m1 and m2:
        print('Your input is valid!')


if __name__ == '__main__':
    main()

Note: The regular expressions above are written as raw strings (prefixed with r), in which every character keeps its literal meaning; put more directly, a raw string contains no escape sequences. Because regular expressions use many metacharacters and backslashes, without raw strings every backslash would have to be written as \\. For example, \d, which matches a digit, would have to be written as \\d, which is both inconvenient to write and hard to read.
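
The difference between raw and ordinary strings can be checked with a short sketch. The sample text is made up for illustration.

```python
import re

# A doubled backslash in an ordinary string equals a single one in a raw string.
same = '\\d{3}' == r'\d{3}'
print(same)  # True

# In an ordinary string '\n' is one newline character; in a raw string it is two characters.
lengths = (len('\n'), len(r'\n'))
print(lengths)  # (1, 2)

# r'\b' is a word boundary for the regex engine; '\b' in an ordinary string
# is the backspace character, so the second pattern finds nothing here.
raw_hits = re.findall(r'\bcat\b', 'cat concat cat')
plain_hits = re.findall('\bcat\b', 'cat concat cat')
print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```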

Example 2: Extract a domestic mobile phone number from a paragraph of text.

The chart below shows the mobile phone number segments launched by three Chinese carriers by the end of 2017.

import re


def main():
    # Create a regular expression object; the lookbehind and lookahead ensure
    # that no digit appears immediately before or after the phone number
    pattern = re.compile(r'(?<=\D)1[34578]\d{9}(?=\D)')
    sentence = ("Important thing to say 8130123456789 times, my mobile phone number "
                "is 13512346789 this beautiful number, not 15600998765, also 110 or 119, "
                "Wang Dahammer's mobile phone number is 15600998765.")
    # Find all matches and save them to a list
    mylist = re.findall(pattern, sentence)
    print(mylist)
    print('-------- gorgeous divider --------')
    # Fetch the match objects through an iterator and get the matched content
    for temp in pattern.finditer(sentence):
        print(temp.group())
    print('-------- gorgeous divider --------')
    # Find all matches by specifying the position at which each search starts
    m = pattern.search(sentence)
    while m:
        print(m.group())
        m = pattern.search(sentence, m.end())


if __name__ == '__main__':
    main()

Note: The regular expression above for domestic mobile numbers is not good enough, because the only prefixes starting with 14 are 145 and 147, and the regular expression does not take this into account. A better regular expression would be: (?<=\D)(1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?=\D). Mobile numbers starting with 19 and 16 also seem to have appeared in China recently, but we do not consider them for now.
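
The stricter pattern from this note can be tried as follows. This is only a sketch: the name MOBILE and the sample numbers are made up to exercise the 14x, 15x, and 17x branches.

```python
import re

# The stricter pattern from the note, compiled once.
MOBILE = re.compile(
    r'(?<=\D)(1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?=\D)'
)

text = 'Try 14512345678, 14112345678, 15412345678 and 17612345678.'
found = MOBILE.findall(text)
print(found)  # ['14512345678', '17612345678']
```

Because the lookbehind (?<=\D) and lookahead (?=\D) each require a non-digit character to be present, a number at the very start or end of the string is not matched. Replacing them with the negative forms (?<!\d) and (?!\d) is one way to cover that case.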

Example 3: Replace bad content in a string

import re


def main():
    sentence = 'Are you stupid? Fuck you.'
    purified = re.sub('fuck|shit|stupid', '*', sentence, flags=re.IGNORECASE)
    print(purified)  # Are you *? * you.


if __name__ == '__main__':
    main()

The regular expression functions in the re module take a flags parameter that specifies matching flags. These flags control whether to ignore case, whether to perform multi-line matching, whether to display debugging information, and so on. To specify several flags at once, combine them with the bitwise OR operator, for example flags=re.I | re.M.
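
A small sketch of combining flags with the bitwise OR operator follows. The sample text is made up for illustration.

```python
import re

text = 'first line\nSecond line\nFIRST again'

# Without flags, ^ matches only at the very start of the string.
default = re.findall(r'^first', text)
# re.M lets ^ match at the start of every line; re.I ignores case.
combined = re.findall(r'^first', text, flags=re.I | re.M)
print(default)   # ['first']
print(combined)  # ['first', 'FIRST']
```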

Example 4: Splitting long strings

import re


def main():
    poem = 'The moon was shining before the window, and I thought it was frost on the ground. Looking up the bright moon, lower the head to think of home.'
    # Split on commas and periods, also consuming any whitespace that follows them
    sentence_list = re.split(r'[,.]\s*', poem)
    # The trailing period leaves an empty string at the end of the list; remove it
    while '' in sentence_list:
        sentence_list.remove('')
    print(sentence_list)  # ['The moon was shining before the window', 'and I thought it was frost on the ground', 'Looking up the bright moon', 'lower the head to think of home']


if __name__ == '__main__':
    main()

Afterword

If you want to develop crawler applications, regular expressions are an excellent helper, because they let us quickly find the patterns we specify in a page's source code and extract the information we need. For beginners, though, writing a correct regular expression may not be easy (some commonly used regular expressions can simply be found on the Internet). For this reason, many people developing real crawler applications choose Beautiful Soup or lxml for matching and information extraction. The former is simple and convenient but performs worse; the latter is easy to use and performs well but is a bit cumbersome to install. We will cover both in the later crawler topics.