0 x0, introduction

Regular expressions – no developer will be unfamiliar with this term, will they? If you don’t remember, think about how you judge whether your ID card or mobile phone number is legitimate.

Soul Searching: Have you ever written your own regular expressions?

Possible answers:

No, the syntax looks very difficult, the string like alien text, can not understand, the regular expression used by the regular business, there is no need to write.

Have written good direct Copy, no problem, after all, the author is also so, but the same is change, the problem comes, if:

When the matching result of the regular expression on the Internet is not satisfactory, and there is a special string matching requirement in the project

There’s no more to copy. What do we do? You have to write by yourself. At the thought of the seemingly boring and difficult grammar, you:

Don’t panic, regular expressions are really easy, you can always trust Jack, let this short and concise article to help you learn regular expressions, and then kick the block.

This section of code is based on Python’s re library. Although most of the re libraries are Perl learners, the syntax is basically the same, but there may be some differences

0x1. Introduction and Hundred million learning experiences

Regular Expression → a search and replace tool for string pattern matching.

To put it simply:

  • Appeal: find or replace a substring in a string that meets a condition;
  • Regular expressions: a tool used to describe this condition;

Talk is cheap, the regular expression, string matching magic device, exactly how god, write a simple example to experience ~

Given a string like this, what would you do if you were asked to find all the numbers in it?

SDFK seems to understand 123 but does not understand 35 frame 89 path test 669 consider the path

Without the re, a common simple idea is to traverse each character, perform a judgment, number concatenation, non-number skip, the following code example:

sentence = "SDFK seems to understand 123 or 35 frame 89 Path test 669 consider the path"
number_dict = {}  The # key is a numeric string and the value is a subscript
temp_str = ""
for index, character in enumerate(sentence):
    # check whether it is a number
    if character.isnumeric():
        temp_str += character
        else:
            if len(temp_str) > 0:
                number_dict[temp_str] = index
                temp_str = ""
                continue
print(number_dict)
# run the following output:
{'123': 9.'35': 14.'89': 19.'669': 25}
Copy the code

With re, you only need to write a matching expression \d+ to achieve the same matching result, the code example is as follows:

import re
sentence = "SDFK seems to understand 123 or 35 frame 89 Path test 669 consider the path"
results = re.finditer(r'\d+', sentence)
if results is not None:
    for result in results:
        print(result)
        
# run the following output:
<re.Match object; span=(6.9), match='123'>
<re.Match object; span=(12.14), match='35'>
<re.Match object; span=(17.19), match='89'>
<re.Match object; span=(22.25), match='669'>
Copy the code

Don’t be too simple, at this time to demand that the letter should also match, traversal method to change the loop logic, and the regular expression can be directly changed ~

It’s nice to write, but it probably doesn’t perform as well as traversal

Before I start to explain the syntax of regular expressions, I will share some of my own regular expressions. After all, I am afraid of regular expressions

  • (1) Don’t be afraid of difficult emotions, the more afraid of the more you can’t learn, are the combination of dead knowledge, far less difficult than the algorithm!
  • (2) Lower your expectations. Don’t try to write regular expressions that are awesome. Write regular expressions that work well first, and then slowly optimize them.
  • ③ Practice repeatedly to understand grammar, who can not practice, to deepen the impression, if there is no chance to practice, you can find two ways to practice, Amway:

Direction one: search for web page nodes or text that meet the conditions

Open developer tools → Switch to the Source TAB → Ctrl + Shift + F and enter the regular expression as follows:

Direction 2: Review regular expression templates commonly used on the Internet

Analyze why others write the way they do, get some test samples, and try to write it yourself. Write it several times.

Don’t miss every opportunity to practice writing regular expressions. After a few days, your regular posture will be returned to Jack. In addition, Amway has a great tool for visualizing regular expressions: Regexper

In addition, temporary re verification tool, direct search engine search key online regular expressions, a bunch of online.

All right, start to explain the regular grammar posture, go!!

0x2 basic Syntax

A complete regular expression consists of two types of characters:

  • Ordinary characters;
  • Special characters (metacharacters) → represent special semantic characters, the smallest unit of regular expression functions, such as *** ^ $\d**, etc.;

① The simplest match

For example, if you want to find “tho” in a “Hello Python” string, use the regular “tho”.

② How to match special characters?

The dot. In the re is a special character used to match any character (except \n). If you want to treat it like a normal string, that is, to match a dot, you need to use the escape character → backslash. Put it before a special character, so that it loses its original special semantics, such as: \. Instead, it simply matches the dot (.). A;

List a wave of special character metacharacters:

  • .→ Match any character (except newline \n)
  • \d → matching digits, 0-9;
  • \D → match non-digit;
  • \s → Match whitespace, i.e. whitespace and TAB indentation;
  • \S → Match non-blank;
  • \w → Match letters, digits or underscores: A-z, A-z, 0-9, _;
  • \W → Matches non-letters, numbers, or underscores;
  • \x → Matches hexadecimal numbers;
  • \O → Matching octal digits;
  • \n → Match newline character;
  • \r → Match carriage return;
  • \v → Match vertical tabs;
  • \t → Match tabs;
  • \f → Match page feed;
  • [\b] → match backspace character, add [] to distinguish from \b;

The number 3.

Only the above metacharacters can be used to match a single character. To match multiple characters, you must match the following metacharacters:

  • | – logic or the operator, such as “ab” and “ac” all want to match to, can use the regular: “ab | c”
  • * → The previous character (subexpression) appears 0 or infinite times, that is, optional;
  • + → The previous character (subexpression) occurs 1 or infinitely, i.e., at least once;
  • ? → The previous character (subexpression) appears 0 or 1 times, that is, either does not appear, or only appears once;
  • {m} → The previous character (subexpression) occurs m times;
  • {m,} → The previous character (subexpression) occurs at least m times;
  • {m,n} → The previous character (subexpression) occurs m to n times;

Tips: Use *+ as much as possible? Because of the optimization, it is faster than {m}{m,}{m,n}.

④ Set and interval

What do we say when we want to match one of more than one character, say 26 lowercase letters? Is it like this down here?

a|b|c|d|e|f|g|h|... Too long to omitCopy the code

What if I had to match 26 capital letters? Concatenating a bunch of or operators is tedious and can be simplified using sets and interval metacharacters, such as the following:

  • [] → Match the characters listed in the character set [], such as [ABC], which matches a character in ABC.
  • [^] → Matches characters that are not listed in the character set [], such as [^ ABC].
  • – → Define an interval, or match a range, e.g. [a-z] matches one of the 26 lower-case letters;

⑤ Position boundary

Sometimes we have to restrict the position of the query, for example, we only want to look for characters at the beginning or end of a word. We can use these metacharacters:

  • ^ → Match the beginning of a line, multi-line pattern can be recognized \n;
  • $→ Matches the end of a line, multi-line pattern can recognize \n;
  • \A → Match the beginning of the string, single-line mode effect is the same as ^, multi-line mode is not recognized \n;
  • \Z → Matching the end of the string, single line mode effect is the same as $, multi-line mode can not recognize \n;
  • \b → Matches the word boundary, that is, the position between the word and the space. For example, er\b can match er in never, but not er in verb;
  • \B → Match non-word boundaries;
  • < → Match the beginning of the word;
  • > → Match word end;

0x3, simple practice → Hand tear to verify the regular format of the mobile phone number

Light say don’t practice false handle type, light practice don’t say silly handle type, practice and say true handle type, haste makes waste, after learning the basic grammar, first write an example to practice hand, find the feeling ~

Such as the title, tear to verify the mobile phone number format of the re, how to play?

A normal mobile phone number can be broken into three parts to write the re, and finally put together

① Number Prefix

Either one or the other if you have one, use? , and then the value can be one of the following three:

  • 0 (Long distance)
  • 86 (China International Code)
  • 17951 (International Call)

It’s not hard to write such a regular: (0 | 86 | 17951)? The parentheses represent subexpressions, which you can think of as groups, but more on that later

(2) 1 xx

Then to 1xx, there are several kinds: 13x, 14x, 15x, 17x, 18x, and then the corresponding value range of the following x is different:

  • 13 x: 0123456789
  • 14 x: 579
  • 15 x: 012356789
  • 17 x: 01678
  • 18 x: 0123456789

Looks complicated, right? Actually otherwise, the use of good | and collection and interval metacharacters, according to regular rules is not hard to write:

(13 14 [579] [0-9] | | 15 [9] 0 to 35-17 [01678] | | 18 [0-9])

③ The remaining eight numbers

Write (\d{8}) without even thinking about it, and then concatenate the next three re’s:

(0 | 86 | 17951)? (13[0-9]|14[579]|15[0-35-9]|17[01678]|18[0-9])(\d{8})Copy the code

Then search engine search under mobile phone number daquan, readily open a site, F12 → Source input the above regular check to see:

Yes, there is a match, but it also seems to match some strange things, such as the above 921-xxx. HTML, simple, add $to match the end of the line

Example brush up, simple bar! Move on to the high-level syntax of re ~

0x4 advanced Syntax

It doesn’t mean you have to think about it a little bit

1) group

A subexpression wrapped around a pair of parentheses () is the basis of an advanced regular expression. If you don’t use the following syntax, it doesn’t matter whether you group or not

② Back reference

Also known as backtracking reference, refers to the following content of the pattern, referring to the previously matched substring. Mentally? To help you understand this, we have a string like this:

pineapple peach plum watermelon durian durian grape mango strawberry strawberry chestnut
Copy the code

To match two consecutive words that are the same, write re like this:

So the syntax for backreferencing is: \ number of substrings. Backreferencing is often used in substitution string scenarios.

Substitutions are syntactically slightly different, with \g< number of strings > being used to refer to the string to be replaced. Take the substitution of string parts in the previous article as an example ~

The Markdown image URL is split into two lines by a newline character \n, and cannot be replaced globally because it will affect the formatting of other text. Only the abnormal image URL can be replaced by a backreference.

error_pic_pattern_n = re.compile(r'(http.*?) (\n)(.*? \.\w+\))', re.M)

# divide into three groups (subexpressions), and concatenate the results of 1 and 3 groups as replacement results
new_content = error_pic_pattern_n.sub(r"\g<1>\g<3>", new_content)
Copy the code

Backreference, that’s it? Just messing around with the match results of the subexpression, right? Yes, it’s that simple

Alternatively, if you don’t want the subexpression to be referenced, you can use the uncaptured re → (? : subexpression), this can also avoid wasting memory ~

③ Look forward/backward

Also known as the forward/backward paragraph, sequence/reverse look around, look forward and backward, is some of the professional terms?

Write a simple example to help you understand how it works. Go to the Apple website and identify your iPhone model. Now there is a need like this:

To see which iPhone 11, 12 and 13 models are available, write a regular screen

Let’s write a re that can extract all models:

<h2>iPhone .*? </h2>Copy the code

You can get the model, but the original demand only wants to screen 11, 12, 13 models, what XS, XR is not interested in, this time you can forward affirmation (? =). Regex after modification:

<h2>iPhone (? 13 12 | | = 11). *? </h2>Copy the code

Yes, except 11, 12, 13, the other models of cat and dog are not matched, it is not difficult to see the forward affirmative assertion (? =re) is used here:

Find meet the first assertion Expression – | | 11 12 13 position, and then in the position to find matching forward < h2 > iPhone string;

This order is very important!! Find the subexpression in the assertion first, then find the preceding expression, this is not clear in many tutorials!!

Now that we know how to play forward affirmative assertions, the other three assertions make a little bit of sense:

  • (? ! Re) → forward negation assertion, first match does not meet the position of RE, and then forward match;
  • (? <=re) → backward affirmative assertion, first match to meet the position of re, and then backward match;
  • (?

Readers can try the following examples to understand:

# forward negative assertion, look at models 11, 12, 13<h2>iPhone (? !11|12|13). *? </h2>Let's see what models are available in the Plus series<h2>iPhone .*? (? <=Plus)</h2># Backward negative assertion, see what models do not have the Plus series<h2>iPhone .*? (? <! Plus)</h2>Copy the code

Due to space limitation, I will not post the result graph, and there are two points to pay attention to:

  • ① The subexpression in the assertion does not consume characters! Does not affect your external regular expression matching!!
  • In some languages or environments may not be supported, use caution to verify support ha!!

(4) Greed and non-greed

Regular matches default to greedy matches, that is, matches as many characters ~ as possible

It’s very simple, write an example you will understand:

re.match(r'^(\d+)(0*)$'.'12345000').groups()

'12345000', '12345000', '12345000'

= 0* = 0* = 0* = 0*

If you want as few matches as possible, you can add a question mark after \d+. , adopt non-greedy matching, and change to medium sample:

re.match(r'^(\d+?) (0 *) $'.'12345000').groups()

# output result :('12345', '000')
Copy the code

*0x5

I want to understand the principle behind regular expressions:

There are a lot of articles on the web about regular expressions having a terrible performance problem, but few people say exactly what the problem is. How did it happen? How to avoid optimization ~

The following images and content are mostly from: How the Regular Expression Engine works — Never been Clearer!

① Regular expression workflow

Precompilation refers to pre-initialization of regular objects, such as the Python re library.

  • A call to re.compile(patter) returns a Pattern object, which is referenced directly wherever the regex is used.

Precompiled, matching scenes with the same re in a loop works wonders (avoiding duplicate creation and compilation).

(2) engine

The program syntactic analyzes the regular expression, establishes the syntax analysis tree, and then generates the execution program (state machine, also known as state automata) based on the analysis tree and the regular expression engine for character matching.

The engine here is a set of core algorithms for building state machines, which fall into the following two categories:

  • DFA (Deterministic Finite automaton) → Deterministic finite automaton;
  • NFA (Non-Deterministic Finite Automaton) → Non-deterministic finite automaton;

Down the words:

  • Deterministic and nondeterministic → String Without writing regular expression, can directly determine the character matching order? Yes is deterministic, and no is indeterminate;
  • Finite → results can be obtained in a finite number of times;
  • Automata → after setting up the matching rules, it is completed automatically by the engine without human intervention;

Contrast:

  • The cost of constructing DFA (more memory and slower compilation speed) is much higher than that of NFA, but the execution efficiency of DFA is higher than that of NFA.
  • Assume that the string length is N, and the matching time complexity of DFA is O(n), while the matching time complexity of NFA is O(ns) due to the large number of branches and backtracking in the matching process. Assume that the number of states is S, and the matching time complexity of NFA is O(ns).
  • The advantage of NFA is that it supports more functions, such as capturing Group, loop, quantifier and other advanced functions. These functions are matched independently based on subexpression. Therefore, in programming languages, libraries using regular expressions are realized based on NFA.

How is the DFA automaton matched?

Key points:

  • Text-led → execute in text order, stable;
  • Records of all possible – such as executing the currently valid (d | b), at the same time comparative expressions of a | b, so need more memory;
  • Each character is checked only once → the execution efficiency is improved, and the speed is independent of regular expression;
  • Can not use backreference and other functions → each character only check once, the position only record the current value, so can not use backreference, look around and other functions;

How is the NFA automaton matched?

Key points:

  • Expression dominant → according to the expression of part of the execution, do not match for other parts to continue to match, until the expression matching completed;
  • Will record a location – such as executing the | b (d), record the position of the character, and then select one of the first match;
  • Each character can check many times to perform to | b (d), found that do not match, after compare d with the expression of another branch b, text position back at the same time, to match the characters of the b. This is also the reason why NFA may not be as uncertain and efficient as DFA.
  • Can use backreference and other functions → because there is a backreference function, so it is easy to achieve backreference, look around and other functions;

③ Backtracking of NFA automata

The technical term for this kind of backtracking is backtracking, and it’s similar to setting a marker on the path you took in a maze and changing it to a different path instead of going back.

The backtracking mechanism not only recalculates the position of the regular expression and text, but also maintains the text state matched by the parenthesis subexpression in a numbered group in memory, also known as a capture group.

The capture group holds the matching results in parentheses, which can be used later in regular expressions, such as the backreference mentioned above.

Don’t quite get it? Draw a simple diagram of the backtracking process

content_str = "HELLO"
content_regex = "HEL+O"
Copy the code

It is not hard to see that the backtracking problem is triggered by greedy matching, eating more, vomiting more, if the length of the matched text is a few W, the engine may need to backtrack several W times.

If you encounter a complex regular expression with multiple parts to trace, the number of tracebacks is exponential. For example, if the text is 500, and the expression has two parts to go back, maybe 500^2= 25W times. It will…

(4) optimization

Direction of optimization: reduce engine backtracking + faster and more direct matching to results

1) Use less greed mode

Can use non-greedy mode (plus? , will first select the minimum matching range) and exclusive mode (plus +, no match ends match, no trace) to avoid backtracking.

2) Reduce branch selection

Use less, and if you must use it, you can optimize it in the following ways:

  • Order of selection – more common options are placed first so that they can be matched more quickly;
  • Extract the common patterns, such as: the (abcd | abef) substitute – > ab | ef (CD);
  • Simple branch selection → do not use the regular, direct string search function (such as index()) to find, the efficiency is higher;

3) Use non-capture brackets

In general, a () is a capture group. If you do not need to reference the text in parentheses, you can use a non-capture group (? :er), which can reduce both the capture time and the number of states used backtracking.

4) Some fragmentary optimization points

  • Length judgment optimization → Matching string length is not the length of the regular length, there is no need to match;
  • Pre-check must characters → pre-scan must characters, find can not find, there is no need to match;
  • Start and end with a good line and string– the use of good^$\A\ZMatches the beginning, end, beginning, and end of strings more accurately;
  • Don’t abuse parentheses → use parentheses only when needed; using them at other times will prevent some optimizations;
  • Don’t abuse the * → dot notation to match any string, but greedy matching leads to a lot of backtracking;
  • Equivalent quantifier substitution → A {3} is faster than AAA;
  • Split expressions → Sometimes, multiple small regular expressions are much faster than one large regular expression;

0 x6, summary

This section, the system has a wave of regular expressions, from know what is, to the grammar, and then to the principle, the content is little, the five organs are complete, I hope to help friends who want to learn regular expressions, do not next time certain, this time certain, learn to waste regular expressions ~

There are a lot of regular expression templates on the web, but there are also quite a lot of rookie tools

Any-rule is also a plugin for amway

Last but not least, this is a quick reference to the Python re module. If you have any questions, please feel free to comment

This is a common function of the re module in Python

import re

Compile regular expressions into Pattern objects
test_pattern = re.compile(Regular expression string, flag bit modifier)# marks a modifier (flags) is used to control the matching mode, support select multiple (|) at the same time, there are the following:
# 
# re.i IGNORECASE → IGNORECASE
# re.m MULTILINE → MULTILINE matching affects ^ and $
# re.s DOTALL → make Matches all characters, including newlines
# re.x VERBOSE → Ignores whitespace and comments, and allows the use of '#' to boot a comment
# re.u UNICODE → Parse characters according to the UNICODE character set, affecting \w, \w, \b and \b
Locale-aware = locale-aware = locale-aware = locale-aware

# Match: Attempts to match from the beginning of the string, returns the match object on success, or None otherwise
re.match(pattern, string, flags=0)

# match: scans the entire string, returning the first match, otherwise returning None;
re.search(pattern, string, flags=0)

# retrieve: scan the entire string, match all matched objects, return as a list;
re.findall(pattern, string, flags=0)

As with FindAll, match all matched objects, but return them as iterators;
re.finditer(pattern, string, flags=0) parameter# replace: Replace the matched string with another string, count is the maximum number of substitutions, default is 0, replace all.
re.sub(pattern, repl, string, count=0, flags=0)

# split: Split the matched string, return the list, maxsplit is the maximum number of splits, default is 0, split all.
re.split(pattern, string, maxsplit=0, flags=0)

# Group: In the process of obtaining matching results, the matching content of each group can be passed in the group number. Instead of transmitting the whole matching result, the corresponding group content can be transmitted
pattern_result.group()

# group: all values from group(1) onwards, return a tuple
pattern_result.groups()

# match the start and end positions
pattern_result.start() # return the starting position of the match
pattern_result.end()  # return the end of the match
pattern_result.span() # return a tuple representing the matching position (start, end)

Re.compile (r' XXX ');
# tell the compiler that STR is a raw string and do not escape backslashes, such as r'\n' is two characters
# backslash + n instead of newline!
Copy the code

References:

  • Python crawler | 0 x9 – string parsing artifact: regular expressions

  • A brief introduction to the principle of regular expression

  • How the regular expression engine works has never been clearer!

  • Why is it slow to use regular expressions?

  • Exploration of regular expression performance optimization