Today I came across a crawler example in which the author used regular expressions to pick the desired data out of the captured pages. That got me interested, so I read through the related documentation carefully until I had a general understanding of it. I am writing this article to introduce the use of regular expressions in Python, so that I can refer back to it later!

Introduction

Regular expressions are a compact, precise pattern language. A pattern written as a special string lets us filter, replace, and find the data we need in other text. In Python, a regular expression is compiled into a series of bytecodes that are executed by a matching engine written in C. This is often more efficient than hand-written string-processing code, but the patterns are harder to read. You can also refer to the official documentation.

Related library

In Python, regular expressions are handled by the re module. Because it is part of the standard library, we simply import re to use its functionality.

Common functions of the re module

The functions described below are the ones used most often, so we explain each of them in detail!

1. The compile function

    re.compile(pattern, flags=0)

The compile function works like a factory for regular expressions. It returns a pattern object that we can reuse on any string we need to process. The first argument is the regular expression string, and the second argument is the flags value that sets the matching mode.
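
A minimal sketch of the factory idea: the pattern object is compiled once and then reused on several strings, with a flag passed as the second argument. The pattern 'python', the flag re.IGNORECASE, and the sample strings are illustrative choices, not taken from the crawler example that prompted this article.

```python
import re

# Compile once; the second argument sets the matching mode.
# re.IGNORECASE makes the pattern ignore letter case.
word = re.compile('python', re.IGNORECASE)

samples = ['Python is fun', 'I like PYTHON', 'no snakes here']

# The same pattern object is reused for every string.
found = [word.search(s) is not None for s in samples]
print(found)  # [True, True, False]
```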

2. The match function

    pattern = re.compile('regular expression string')
    pattern.match('string to match')

The match function applies the pattern object produced by compile to the string being processed. If the string matches, it returns a match object; otherwise it returns None. Note that match only tries to match at the beginning of the string. If the pattern does not match there, it does not search any further! It also returns only one match: even if several parts of the string could match, you only get the one at the start. Here's an example:

    import re

    pattm = re.compile('a')         # pattern: the single character a
    pstr = pattm.match('aabcad')    # try to match at the start of the string
    if pstr is not None:
        print(pstr.group())
    else:
        print('No match!')
    # Output: a

Since the string 'aabcad' begins with the character a that the pattern asks for, we have successfully matched a!

    import re

    pattm = re.compile('b')         # pattern: the single character b
    pstr = pattm.match('abcd')      # the string starts with a, not b
    if pstr is not None:
        print(pstr.group())
    else:
        print('No match!')
    # Output: No match!

With this pattern, the first character of the string must be b. Since 'abcd' starts with a, the match is unsuccessful.

3. The search function

    pattern = re.compile('regular expression string')
    pattern.search('string to match')

The search function has one thing in common with the match function: it returns only the first match in the string! But unlike match, it does not look only at the beginning of the string. It scans through the whole string until it finds a match!

    import re

    pattm = re.compile('a')
    pstr = pattm.search('baabcad')
    if pstr is not None:
        print(pstr.group())
    else:
        print('No match!')
    # Output: a  (this is the first a in the string)

If nothing in the string matches, search returns None:

    import re

    pattm = re.compile('f')
    pstr = pattm.search('baabcad')
    if pstr is not None:
        print(pstr.group())
    else:
        print('No match!')
    # Output: No match!  (there is no f anywhere in the string)
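
The difference between match and search is easiest to see by running the same pattern through both. This is a small sketch with a made-up pattern and input; start() is the match-object method that reports where the match began.

```python
import re

pattern = re.compile('b')
text = 'abcd'

m = pattern.match(text)    # only tries position 0, where 'a' is
s = pattern.search(text)   # scans forward until it finds 'b'

print(m)           # None
print(s.group())   # b
print(s.start())   # 1
```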

4. The findall function

The findall function also scans the entire string, but it returns every match, not just the first one! It collects all the matched substrings into a list and returns that list.

    import re

    pattm = re.compile('a')
    pstr = pattm.findall('baabcad')
    if pstr:                        # findall returns a list, never None
        print(pstr)
    else:
        print('No match!')
    # Output: ['a', 'a', 'a']  (every matched a is returned)

If no match is found, findall returns an empty list.
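
Because findall always returns a list, an ordinary truth test is enough to tell whether anything matched. A short sketch, assuming the same pattern 'a' and an extra made-up input 'xyz' that contains no a:

```python
import re

pattern = re.compile('a')

hits = pattern.findall('baabcad')
misses = pattern.findall('xyz')

print(hits)    # ['a', 'a', 'a']
print(misses)  # []

# An empty list is falsy, so no comparison with None is needed.
for result in (hits, misses):
    print('matched' if result else 'no match')
```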

5. The split function

The split function is similar to the split method of strings. It splits the string we pass in wherever the regular expression matches.

    import re

    pattm = re.compile(':')
    pstr = pattm.split('baa:bcad')
    print(pstr)
    # Output: ['baa', 'bcad']  (split into two strings at the colon)

If the pattern does not match anywhere, split returns a list whose only element is the original string.
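
To see what split returns when the separator is absent, and how to limit the number of splits, here is a minimal sketch. The inputs are made up for illustration; maxsplit is the optional second parameter of the pattern object's split method.

```python
import re

pattern = re.compile(':')

with_sep = pattern.split('baa:bcad')
without_sep = pattern.split('baabcad')
limited = pattern.split('a:b:c', maxsplit=1)

print(with_sep)     # ['baa', 'bcad']
print(without_sep)  # ['baabcad']  (the whole string, not an empty list)
print(limited)      # ['a', 'b:c'] (only the first colon is used)
```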

Special sequences commonly used in regular expressions

Symbol    Description
\d        Matches any decimal digit
\D        Matches any non-digit character
\s        Matches any whitespace character (space, tab, newline, and so on)
\S        Matches any non-whitespace character
\w        Matches any alphanumeric character or the underscore
\W        Matches any character that is not alphanumeric or the underscore

The above are 6 commonly used special sequences. They cover most everyday string-filtering needs, and they can be mixed in one pattern. Here's a simple example:

    import re

    pattm = re.compile(r'\d')       # any single decimal digit
    pstr = pattm.findall('abcd1234')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['1', '2', '3', '4']  (returned as a list)

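A mixed example:

    import re

    # The original pattern was lost; \D\d (a non-digit followed by a digit)
    # is a reconstruction that produces the output shown.
    pattm = re.compile(r'\D\d')
    pstr = pattm.findall('abcd1234')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['d1']  (d1 is matched successfully)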
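
A few more small checks of the special sequences, sketched with made-up inputs. They show that \s covers more than the space character, that \w accepts the underscore, and that sequences can be mixed with literal characters in one pattern. The raw-string prefix r keeps Python from interpreting the backslashes itself.

```python
import re

# \s matches space, tab and newline alike
whitespace = re.findall(r'\s', 'a b\tc\n')
print(whitespace)   # [' ', '\t', '\n']

# \w matches letters, digits and the underscore
word_chars = re.findall(r'\w', 'a_1 -')
print(word_chars)   # ['a', '_', '1']

# Sequences mixed with a literal '-' to pick out a date-like token
dates = re.findall(r'\d\d\d\d-\d\d-\d\d', 'shipped on 2024-01-15')
print(dates)        # ['2024-01-15']
```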

Metacharacters commonly used in regular expressions

Metacharacters are the most frequently used and the hardest to understand part of regular expressions. Here we introduce the most common ones.

The following metacharacters are commonly used:

    . ^ $ * + ? {} [] \ | ()

Let's take a look at some examples, starting with the . metacharacter:

    import re

    pattm = re.compile('.', re.S)   # re.S lets . match the newline as well
    pstr = pattm.findall('abcd1234\n')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['a', 'b', 'c', 'd', '1', '2', '3', '4', '\n']

. represents any single character. By default it matches every character except the newline character, but with the re.S (DOTALL) flag it matches newlines too, which is why '\n' appears in the result. Since findall is used, every match is returned!
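
The effect of the matching mode on . can be checked directly. This sketch uses a made-up two-line string and the module-level re.findall function, which accepts the flags as its third argument:

```python
import re

text = 'ab\ncd'

default_dot = re.findall('.', text)        # newline is skipped
dotall_dot = re.findall('.', text, re.S)   # newline is matched too

print(default_dot)  # ['a', 'b', 'c', 'd']
print(dotall_dot)   # ['a', 'b', '\n', 'c', 'd']
```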

^ metacharacter

    import re

    pattm = re.compile('^abc', re.S)
    pstr = pattm.findall('abcd1234\n')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['abc']

The ^ symbol anchors the match at the beginning of the string. In this example the pattern matches only the abc at the very start. It returns that abc, not the whole string that begins with it.

$ metacharacter

    import re

    pattm = re.compile('abc$', re.S)
    pstr = pattm.findall('abcd1234abc')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['abc']

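The $ symbol anchors the match at the end of the string. In this example the pattern matches only the abc at the very end, so the abc at the beginning of the string is not returned.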
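
By default ^ and $ look at the whole string. With the re.M (MULTILINE) flag they also match at the start and end of every line. A small sketch with a made-up two-line input:

```python
import re

text = 'abc one\nabc two'

start_default = re.findall('^abc', text)        # start of the string only
start_multi = re.findall('^abc', text, re.M)    # start of every line

end_default = re.findall('one$', text)          # 'one' is not at the end of the string
end_multi = re.findall('one$', text, re.M)      # but it is at the end of a line

print(start_default)  # ['abc']
print(start_multi)    # ['abc', 'abc']
print(end_default)    # []
print(end_multi)      # ['one']
```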

* metacharacter

    import re

    pattm = re.compile('a*', re.S)
    pstr = pattm.findall('bcaacaaab')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['', '', 'aa', '', 'aaa', '', '']

* means repetition and applies only to the character immediately before it. That character may appear 0 times or any number of times, so the result above contains empty strings: wherever there is no a, zero repetitions match the empty string!

+ metacharacter

    import re

    pattm = re.compile('a+', re.S)
    pstr = pattm.findall('bcaacaaab')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['aa', 'aaa']

The metacharacter + is similar to *. It also applies only to the character before it, but + requires at least one occurrence, so it never matches the empty string.
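
Putting * and + side by side on the same input makes the difference visible. In this sketch (pattern and input are made up) the repetition applies to b only, because b is the character directly before the symbol:

```python
import re

text = 'a ab abb'

star = re.findall('ab*', text)   # a followed by zero or more b
plus = re.findall('ab+', text)   # a followed by one or more b

print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']
```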

? metacharacter

    import re

    pattm = re.compile('ca?t', re.S)
    pstr = pattm.findall('catdddct')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['cat', 'ct']

The ? metacharacter is also a repetition symbol, and it marks the preceding character as optional. In the example above, a is optional: it may appear once or not at all, so both cat and ct are matched!

{} metacharacter

The metacharacter {} is also a repetition symbol and applies only to the character before it. It is more flexible than + and * because you choose the repetition counts yourself.

    import re

    pattm = re.compile('a{1,2}', re.S)   # one or two consecutive a characters
    pstr = pattm.findall('fcabcdaaaef^')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['a', 'aa', 'a']

{} can take two numbers, {m,n}, where m is the minimum number of repetitions and n is the maximum. You can also write a single number, {n}, which means exactly n repetitions!
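
The forms of {} can be compared on one input. This is a sketch on a made-up string of five a characters; {m,} with the upper bound left out means "at least m" and is standard re syntax.

```python
import re

text = 'aaaaa'

exactly_two = re.findall('a{2}', text)     # exactly 2 at a time
two_to_three = re.findall('a{2,3}', text)  # between 2 and 3, as many as possible
at_least_two = re.findall('a{2,}', text)   # 2 or more

print(exactly_two)   # ['aa', 'aa']  (the fifth a is left over)
print(two_to_three)  # ['aaa', 'aa']
print(at_least_two)  # ['aaaaa']
```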

[] metacharacter

    import re

    pattm = re.compile('[abc]', re.S)
    pstr = pattm.findall('abcdef')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['a', 'b', 'c']

The metacharacter [] defines a character class: it matches any one character from the set listed inside. The class in the example above can also be written as a range, [a-c], with the same effect. Patterns for mobile phone numbers often use this. For example, [0-9] matches one digit in the range 0 to 9!

It is also important to note that most other metacharacters lose their special meaning inside []!

Here’s an example:

    import re

    pattm = re.compile('[abc^]', re.S)
    pstr = pattm.findall('fcabcdef^')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['c', 'a', 'b', 'c', '^']
    # Inside [], ^ (when it is not the first character) loses its special
    # meaning and is treated as an ordinary character, so findall returns
    # every character that falls in the class.
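
The position of ^ inside [] matters. As a sketch on the same input string: when ^ is the first character of the class it negates the class, and when it appears anywhere else it is a literal caret. The range form [a-c] is included for comparison.

```python
import re

text = 'fcabcdef^'

literal_caret = re.findall('[abc^]', text)  # a, b, c or a literal ^
negated = re.findall('[^abc]', text)        # anything except a, b, c
ranged = re.findall('[a-c]', text)          # same set as [abc]

print(literal_caret)  # ['c', 'a', 'b', 'c', '^']
print(negated)        # ['f', 'd', 'e', 'f', '^']
print(ranged)         # ['c', 'a', 'b', 'c']
```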

\ metacharacter

The \ character is an interesting one. It has two main functions.

The first is escaping:

    import re

    pattm = re.compile(r'\{', re.S)   # \{ matches a literal { character
    pstr = pattm.findall('fc{aa{ef[')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['{', '{']

By putting \ in front of a metacharacter, we can match it as an ordinary character!

The second is forming special sequences: followed by certain characters, \ produces a sequence with a specific function, such as the \s and \w sequences mentioned above.
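
Two practical notes on escaping, sketched with a made-up input. Writing patterns as raw strings (the r prefix) avoids having to double every backslash, and re.escape from the standard library escapes a whole literal string for you when you do not want to do it by hand.

```python
import re

text = 'price: $5.00 (approx.)'

# Escaped metacharacters match themselves
dots = re.findall(r'\.', text)
dollar = re.findall(r'\$\d', text)

# re.escape builds a pattern that matches the given text literally
literal = re.escape('(approx.)')
found = re.findall(literal, text)

print(dots)    # ['.', '.']
print(dollar)  # ['$5']
print(found)   # ['(approx.)']
```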

| metacharacter

    import re

    pattm = re.compile('a|b', re.S)
    pstr = pattm.findall('abcdbcda')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['a', 'b', 'b', 'a']

The | metacharacter is similar to "or" in Java: it matches either the part before it or the part after it. It is important to note that these parts are the whole expressions on each side! If the pattern above were abc|a, it would match abc or a. It would not match ab first and then choose between c and a. That reading is wrong!
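
The scope of | can be verified with a short sketch on a made-up input. To get the "ab followed by c or a" reading, the alternatives have to be wrapped in a group. The form (?:...) used here is a non-capturing group, chosen so that findall still returns the whole match.

```python
import re

text = 'abc aba a'

whole_sides = re.findall('abc|a', text)      # 'abc' or 'a'
grouped = re.findall('ab(?:c|a)', text)      # 'ab' followed by 'c' or 'a'

print(whole_sides)  # ['abc', 'a', 'a', 'a']
print(grouped)      # ['abc', 'aba']
```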

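() metacharacter

The () metacharacter represents a group, which is treated as a whole.

    import re

    # The original pattern was lost; '(abc)' is a reconstruction that
    # produces the output shown.
    pattm = re.compile('(abc)', re.S)
    pstr = pattm.findall('abcccababab')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['abc']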

Of course, () can do far more than this simple case. We can put any matching rule inside () to form a group, which opens up many more possibilities.

For example:

    import re

    pattm = re.compile('(^abc.+)', re.S)
    pstr = pattm.findall('abcccababab')
    if pstr:
        print(pstr)
    else:
        print('No match!')
    # Output: ['abcccababab']

The pattern '(^abc.+)' should be easy to understand if you have followed all the metacharacters above. It matches a string that begins with abc followed by at least one more character, so here it returns the entire string.

One more thing to note: when we used match and search above, we printed the result with the group function.

    import re

    pattm = re.compile('(^abc.+)', re.S)
    pstr = pattm.search('abcccababab')
    if pstr is not None:
        print(pstr.group())
    else:
        print('No match!')

Called with no arguments, group() returns the whole match. group(1, 3) returns the text matched by the first () group and the third () group.
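
The different ways of calling group can be sketched on a pattern with three groups. The date-like pattern and input are made up for illustration; groups() is the match-object method that returns all groups at once.

```python
import re

pattern = re.compile(r'(\d+)-(\d+)-(\d+)')
m = pattern.search('date: 2024-01-15')

whole = m.group()              # the entire match
first = m.group(1)             # text of the first () group
first_third = m.group(1, 3)    # a tuple with groups 1 and 3
everything = m.groups()        # a tuple with all groups

print(whole)        # 2024-01-15
print(first)        # 2024
print(first_third)  # ('2024', '15')
print(everything)   # ('2024', '01', '15')
```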

Closing remarks

That covers the basics. Regular expressions are a very large topic, with far more to them than the pieces introduced in this article. Still, we need to know the fundamentals, so that when we run into a complex regular expression in the future we can at least understand its general shape, instead of knowing nothing!