Preface

If there’s one thing I have used the most while learning programming, it’s regular expressions. Strictly speaking, regular expressions are not a programming language, nor are they tied to any particular programming language. But they are useful enough to be worth learning on their own.

For example, strings that follow a rule, such as phone numbers and email addresses, can be extracted from text. Wildcards in Office are a form of regular expression too, so using them for search and replace in Office can greatly improve work efficiency.

Regular expressions are also often used in crawlers; for example, everything inside the h1 tags of a page can be retrieved with a few simple lines of code.

    import re

    html = '''
    <h1> test1 </h1>
    <h1> test2 </h1>
    <h1> test3 </h1>
    '''

    # (.*?) lazily captures everything between <h1> and </h1>
    content = re.findall('<h1>(.*?)</h1>', html)
    print(content)
    # result: [' test1 ', ' test2 ', ' test3 ']

So what exactly is a regular expression, how is it used, and why does (.*?) show up so often in crawlers? This article explains the role each piece plays.

What is a regular expression

A regular expression describes a pattern for matching strings, which doesn’t sound very intuitive.

Let’s pull three keywords out of this definition:

  • String: this defines what regular expressions operate on, namely text.
  • Match: this defines the purpose, which is to find and locate.
  • Pattern: a pattern is a rule, and this is the heart of regular expressions. The rules are built from predefined symbols, digits, and letters.

So in plain English, a regular expression is a set of human-defined rules that are combined to match strings quickly.

Metacharacters

A regular expression is a set of defined rules, and metacharacters are what the rules are built from.

There are many metacharacters; for easier understanding and use, this article divides them into five categories by purpose.

  • Set: []
  • Quantifiers: * + ? {}
  • Alternation: |
  • Extraction: ()
  • Special-meaning symbols: . ^ $ \b \B

The examples in this article were verified online at regex101.com.

(1) Set ([])

[] matches any single character listed inside it. For example, [Pp]ython matches both Python and python.

Use - inside a set to match a range of characters. For example, [a-z] matches any lowercase letter from a to z.

Use ^ at the start of a set to match its complement. For example, [^p]ython matches ython preceded by any character other than p.
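The three set forms above can be tried out with Python’s re module. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen so that each pattern behaves differently.
text = "Python python Jython cython"

# [Pp] matches exactly one character: either "P" or "p".
either_case = re.findall(r"[Pp]ython", text)
print(either_case)      # ['Python', 'python']

# [a-z] matches one lowercase letter, so "Python" and "Jython" are skipped.
lowercase_first = re.findall(r"[a-z]ython", text)
print(lowercase_first)  # ['python', 'cython']

# [^p] matches any single character except a lowercase "p".
not_p = re.findall(r"[^p]ython", text)
print(not_p)            # ['Python', 'Jython', 'cython']
```

Note that [^p] still consumes one character, so the matched string always has a first letter; it is just never a lowercase p.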

(2) Quantifiers (* + ? {})

Each expression above matches only a single character. To match more than one, you need quantifiers.

  • * matches the preceding element zero or more times
  • + matches the preceding element one or more times
  • ? matches the preceding element zero or one time
  • {n,m} matches the preceding element between n and m times

For example, [0-9]{11} matches an 11-digit phone number; {11} is shorthand for exactly 11 repetitions.
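As a rough illustration of the quantifiers, here is a short Python sketch. The sample strings and the phone number are invented, and [0-9]{11} is only one plausible way to write an 11-digit check.

```python
import re

# * : zero or more "b" after "a"
star = re.findall(r"ab*", "a ab abbb")            # ['a', 'ab', 'abbb']
# + : one or more "b" after "a"
plus = re.findall(r"ab+", "a ab abbb")            # ['ab', 'abbb']
# ? : the "u" is optional
optional = re.findall(r"colou?r", "color colour")  # ['color', 'colour']
# {n,m} : between 2 and 3 digits in a row
bounded = re.findall(r"[0-9]{2,3}", "1 22 333 4444")  # ['22', '333', '444']

# Exactly 11 digits; fullmatch requires the whole string to fit the pattern.
phone_ok = re.fullmatch(r"[0-9]{11}", "13812345678") is not None    # True
phone_short = re.fullmatch(r"[0-9]{11}", "1381234567") is not None  # False

print(star, plus, optional, bounded, phone_ok, phone_short)
```

In the {2,3} case the four-digit run yields '444' and leaves a lone '4' behind, because a quantifier takes as much as it is allowed to.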

Quantifiers are simple to use and easy to pick up with a little practice. But there is one very important point to cover: greedy mode and non-greedy mode.

Take * as an example. It can match zero or more characters, so how many does it actually match? In greedy mode, the quantifier matches as many characters as possible while still letting the overall match succeed; non-greedy mode is the reverse and matches as few as possible. Greedy mode is the default. To switch to non-greedy mode, add a ? after the quantifier.

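Take <h1>test</h1> as an example. If the pattern is the greedy <.*>, it matches the whole string <h1>test</h1>. If the pattern is the non-greedy <.*?>, it matches <h1> and </h1> separately.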
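The difference between the two modes is easiest to see side by side. The following Python sketch assumes the sample string <h1>test</h1>; the variable names are illustrative only.

```python
import re

html = "<h1>test</h1>"  # assumed sample input

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<h1>test</h1>']

# Non-greedy: .*? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)
print(lazy)    # ['<h1>', '</h1>']
```

This is why (.*?) is so common in crawler code: it keeps each match confined to a single pair of tags instead of swallowing everything up to the last closing tag on the line.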

(3) Alternation (|)

Alternation is easy to understand: when you need to match either one of two patterns, use |. With A|B, once A matches, B is not tried.

For example, c|Python matches either c or Python.
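A small Python sketch of alternation follows. The sample sentence is made up; the second half shows the "once A matches, B is not tried" behavior by putting a shorter alternative first.

```python
import re

# Either alternative can match at any position in the text.
found = re.findall(r"c|Python", "I like c and Python")
print(found)  # ['c', 'Python']

# Alternatives are tried left to right; the first one that succeeds wins.
first_wins = re.search(r"Py|Python", "Python").group()
print(first_wins)    # 'Py'

# Putting the longer alternative first yields the longer match.
longer_first = re.search(r"Python|Py", "Python").group()
print(longer_first)  # 'Python'
```

When one alternative is a prefix of another, the order in which they are written decides which one is returned.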

(4) Extraction (())

If you want to extract part of the matched string, you need parentheses. This is mainly used in programming, for data extraction. As in the crawler code from the preface, parentheses let you extract the contents of the h1 tags.

    import re

    html = '''
    <h1> test1 </h1>
    <h1> test2 </h1>
    <h1> test3 </h1>
    '''

    # the parentheses capture only the text between the tags
    content = re.findall('<h1>(.*?)</h1>', html)
    print(content)
    # result: [' test1 ', ' test2 ', ' test3 ']

Adding ?: at the start of the parentheses makes the group non-capturing: the group still takes part in matching, but its content is not extracted.

    import re

    html = '''
    <h1> test1 </h1>
    <h1> test2 </h1>
    <h1> test3 </h1>
    '''

    # (?:...) groups without capturing, so findall returns the whole match
    content = re.findall('<h1>(?:.*?)</h1>', html)
    print(content)
    # result: ['<h1> test1 </h1>', '<h1> test2 </h1>', '<h1> test3 </h1>']

In fact, ?: is only one of the non-capturing constructs. Two others are ?= and ?!: the former is a positive lookahead, the latter a negative lookahead. These two in turn give rise to ?<= and ?<!, the positive and negative lookbehind. Let’s take a look at how they are used.

A(?=B) matches an A that is followed by B; (?<=B)A matches an A that is preceded by B. In the former the matched A sits in front of the parentheses, in the latter it sits after them. In both cases B is only checked and is not part of the match.

For example, windows(?=7|10|2000|xp) matches the windows at the front of windows7, windows10, windows2000, and windowsxp.
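To see that a positive lookahead or lookbehind checks its condition without consuming it, here is a hedged Python sketch. The sample strings (including windows98 and the price text) are assumptions added for contrast.

```python
import re

text = "windows7 windows10 windows98 windowsxp"  # assumed sample input

# Positive lookahead: replace "windows" only when 7, 10, 2000 or xp follows.
# The version suffix is checked but not consumed, so it survives the sub().
marked = re.sub(r"windows(?=7|10|2000|xp)", "WINDOWS", text)
print(marked)   # 'WINDOWS7 WINDOWS10 windows98 WINDOWSxp'

# Positive lookbehind: digits that come right after a dollar sign.
# The "$" is required but is not included in the result.
amounts = re.findall(r"(?<=\$)[0-9]+", "cost $30, weight 40kg")
print(amounts)  # ['30']
```

One practical caveat: Python’s re module requires the pattern inside a lookbehind to have a fixed width, which the single \$ here satisfies.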

A(?!B) matches an A that is not followed by B; (?<!B)A matches an A that is not preceded by B. Again, in the former the matched A sits in front of the parentheses, and in the latter it sits after them.
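The negative forms can be sketched the same way in Python. The sample strings are invented for illustration, and the extra digit class inside the lookbehind is a small addition to keep the example clean.

```python
import re

text = "windows7 windows10 windows98 windowsxp"  # assumed sample input

# Negative lookahead: replace "windows" only when it is NOT followed
# by 7, 10, 2000 or xp.
marked_neg = re.sub(r"windows(?!7|10|2000|xp)", "WINDOWS", text)
print(marked_neg)     # 'windows7 windows10 WINDOWS98 windowsxp'

# Negative lookbehind: numbers NOT preceded by a dollar sign.
# Also excluding a preceding digit stops the regex from matching
# the tail of "$30" (the "0" alone).
plain_numbers = re.findall(r"(?<![$0-9])[0-9]+", "cost $30, weight 40kg")
print(plain_numbers)  # ['40']
```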

(5) Special-meaning symbols

Some symbols carry a predefined meaning. For example, \d matches a digit and is equivalent to [0-9].

The following are commonly used special-meaning symbols:

Symbol  Meaning
^       Matches the start of the input string.
$       Matches the end of the input string.
.       Matches any single character except newline characters (\n, \r).
\b      Matches a word boundary, that is, the position between a word and a space. For example, ‘er\b’ matches the ‘er’ in “never” but not the ‘er’ in “verb”.
\B      Matches a non-word boundary. ‘er\B’ matches the ‘er’ in “verb” but not the ‘er’ in “never”.
\d      Matches a digit. Equivalent to [0-9].
\D      Matches a non-digit character. Equivalent to [^0-9].
\f      Matches a form feed character.
\n      Matches a newline character.
\r      Matches a carriage return.
\t      Matches a tab character.
\v      Matches a vertical tab character.
\s      Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S      Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w      Matches a letter, digit, or underscore. Equivalent to [A-Za-z0-9_].
\W      Matches any character that is not a letter, digit, or underscore. Equivalent to [^A-Za-z0-9_].

\ is also the escape character. For example, \* matches a literal *.
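A few of the symbols from the table, tried in Python. This is only a sketch with made-up inputs; the never/verb pair reuses the example from the table.

```python
import re

# \b is a word boundary: "er" must end a word, so only "never" qualifies.
ends_word = [m.start() for m in re.finditer(r"er\b", "never verb")]
# \B is the opposite: "er" must be followed by another word character.
inside_word = [m.start() for m in re.finditer(r"er\B", "never verb")]
print(ends_word, inside_word)  # [3] [7]

digits = re.findall(r"\d+", "Room 101, floor 3")  # ['101', '3']
words = re.findall(r"\w+", "user_name = 42")      # ['user_name', '42']
fields = re.split(r"\s+", "a b\tc\nd")            # ['a', 'b', 'c', 'd']

# ^ and $ anchor the pattern to the start and end of the string.
all_digits = re.search(r"^\d+$", "12345") is not None      # True
not_all_digits = re.search(r"^\d+$", "123a5") is not None  # False

# A backslash turns the metacharacter * into a literal asterisk.
stars = re.findall(r"\*", "2*3*4")                # ['*', '*']
print(digits, words, fields, all_digits, not_all_digits, stars)
```

One Python-specific detail: for str patterns, \d and \w also match non-ASCII digits and letters by default, so the [0-9] and [A-Za-z0-9_] equivalences hold exactly only when the re.ASCII flag is passed.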

Modifiers (optional tags)

Once you’ve learned most of the metacharacters, you can use regular expressions for everyday tasks. In the earlier screenshots, the gm shown next to the expression are in fact modifiers.

Modifiers, also called flags, are written outside the regular expression rather than inside it. Let’s look at what they stand for.

Modifier  Meaning         Explanation
i         ignore case     Matching is case-insensitive.
g         global          Global match: find all matches instead of stopping at the first.
m         multi-line      Multi-line matching: the anchors ^ and $ match the start and end of each line.
s         dot matches \n  By default, . matches any character except the newline \n; with the s modifier, . matches \n as well.
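In Python the same modifiers are passed as flags to the re functions rather than written after the pattern. The sketch below maps i, m, and s to re.I, re.M, and re.S; Python has no g flag, so the choice between re.search (first match) and re.findall (all matches) plays that role. The sample strings are assumptions.

```python
import re

text = "Python\npython\nPYTHON"  # assumed three-line sample

# i -> re.I: ignore case
any_case = re.findall(r"python", text, flags=re.I)
print(any_case)     # ['Python', 'python', 'PYTHON']

# m -> re.M: ^ and $ match at the start and end of every line
single_line = re.findall(r"^p\w+$", text)
multi_line = re.findall(r"^p\w+$", text, flags=re.M)
print(single_line)  # []
print(multi_line)   # ['python']

# s -> re.S: the dot also matches the newline character
para = "<p>line one\nline two</p>"
no_dotall = re.findall(r"<p>(.*?)</p>", para)
dotall = re.findall(r"<p>(.*?)</p>", para, flags=re.S)
print(no_dotall)    # []
print(dotall)       # ['line one\nline two']

# g has no flag in Python: search() stops at the first match,
# findall() returns every match.
first_only = re.search(r"python", text, flags=re.I).group()
print(first_only)   # 'Python'
```

The re.S case matters for crawlers in particular, since the content of an HTML tag often spans several lines.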

That’s it. Next time, we’ll look at regular expressions in everyday work.