The secret of regular expressions (5) Recursive re

This is the third day of my participation in the More text Challenge. For details, see more text Challenge

After a few days of preparation, today is a small climax, recursive re is probably the most difficult piece of re, everyone is the same, don’t be afraid, follow me.

Recursively regular (? R)

First of all, why do we have recursively regular methods?

Two purposes: one, because we often encounter countless nested structures when crawling data from a website, if the depth of nesting is not fixed, then besides writing a regular for each format, is there a good way to automatically adapt the format? That’s where recursion comes in. Use circular references to match nested structures of different depths and styles. 2. You can greatly improve the conciseness of the re by recursively referring to the previous expression in the re instead of copying it.

Recursion is a reference to itself, or in re, to the entire expression (? R) or specify subgroups (? n)

Start with the simplest recursive subgroup and pass the method (? N),n is an integer. This is exactly the same as the backreference in the previous section, with one key difference:

G {n} refers to the captured value of the subgroup, whereas the recursive reference (? N), which refers to an expression of a subgroup!

Look at the difference, after the reference (sens | respons) e and \ {n} g ibility match “sense and responsibility” and “the response and responsibility”, But it doesn’t match sense and responsibility.

If using a recursive reference (sens | respons) and e (? 1) This will match sense and responsibility.

Okay, that’s all warmed up. Let’s get down to business.

There are several basic ways to use recursive regularization:

We have to understand this first to help us understand recursion:

(? R) or (? N) References are expressions, not captured values. This is different from backward references.
(? R) in translation, can be understood as a “placeholder”. Its position determines where the captured value is inserted in the next recursion.
(? R) recurses the entire expression, no matter where it is. If there is an anchor in the expression (such as ^$), the anchor will also participate in the recursion, which will cause an error; The solution is to take (? R) (instead? N), the overall reference is changed to specify a recursive subgroup reference.
(? R) When used with quantifiers, improper use of quantifiers can lead to an infinite loop. For example, (? R)+ does not mean recursion at least once, it means (? R)+ will always remain in the final round result because (? R) always refers to the entire expression, thus leading to infinite recursion. In the same way (? R){3} this way of expressing strict recursion three times is also incorrect, because after the third time of recursion, (? R) placeholder and its quantifier {3}, which will also recurse indefinitely.
Which brings me to my fourth point, recursion has to have an Ending. So, only the quantifier * or? And {0} are the correct recursive regular expression methods that represent zero recursion in the number of repetitions. Because no matter how many times recursion, the last placeholder quantifier can be 0 times, so as to reach the end of recursion, that is, stop recursion.
The recursion matches from the outside to the inside, the 0th recursive call is the outermost, the first recursive call goes in one level, and so on;

Ok, let me give you a couple of intuitive examples to give you an intuition:

Expression ` (ABC) (123 | (def)? R)? What string can (xyz) 'match? Look we apart: first round of matching, (ABC) 123 | (def) can match the corresponding string abc123 or abcdef, (? R)? If you take 0 times, skip to the end xyz, then abc123xyz or abcdefxyz must match and exit successfully; If (? R)? Take 1, recursion, for the first time at this moment expressions can be translated into: (ABC) 123 | (def) (ABC) (123 | (def)? R)? (xyz)(xyz), then (? R)? If the value can be 0 again, the expression can match abc123abcdefxyzxyz, abc123abc123xyzxyz, abcdefabcdefxyzxyz, abcdefabcdefxyzxyz, abcdefabc123xyzxyz, and successfully exit. In the same way, we can also take (? R)? Take 1 again, so, in fact, this expression can match (ABC (123 | def) n + (xyz) \ \ * * n this string;Copy the code

To sum up, first (? R), which determines the insertion position of the reference expression during recursion; Second (? R) Refers to an expression, not the result of the previous round of matches; (again? R)? It must be 0 to exit, otherwise it will go on forever and the match will fail; Finally, when the feature matches the string, it always matches from outside to inside.

Now that we haven’t given up yet, we’ve mastered 60% of recursion regularization, so let’s look at some more advanced uses.

Let’s look at another recurrence with anchor points, the expression ^$((? R)? * \ | [^ () “+) $, can match (z) (y) (x) y z? Let’s just assume that (? R)? At 1, the expression becomes ^\(^\((? R)? \ | [^ () “+ *) $[^ ()] +) * $ $, 2 pair has occurred in the expression of the beginning and the end of the anchor, impossible to match any string that matches a failure.

So how do you change it? You can change the expression to ^(\((? 1)? * \ | [^ ()] +))) $, 1, (\ (((reference? 1)? * \ | [^ ()] +))) the subgroups, does not include the anchor points at both ends.

Take a look at the core recursive subgroup inside ((? 1)? | [^ ()] +) *, it shows that either 0 or 1 times, recursive, either match any characters in brackets. This branch can be repeated many times at the same time ((? 1)? | [^ () “+ ((? 1)? | [^ () “+ ((? 1)? | [^ () “+ ((? 1)? | [^ ()] +)…

Now translate how this expression matches:

Focus on step 6, which also bothered me for a long time. We first recall (a | b) * the meaning of this expression, the expression is equivalent to (a | b) (a | b) (a | b) (a | b)…

So ((? 1)? | [^ ()]) * means, at the same time recursion, we can go to repeat the subgroups, so we in the first and second recursion, it repeated 3 times, and in different subgroups (? 1)? Given a different value. When? When =0, only characters are matched. When? We recurse once when theta is equal to 1.

So the 3rd order recursive expansion of the regular expression in the above example can be viewed as the structure (AB(AB(ABC)BC)BC), where B refers to ((? 1)? [^()]) A refers to (, C refers to), so it perfectly matches (z(y(x)y)z). Of course, if you generize it, this expression can actually match any nested combination of ABC structures, such as (AB(AB(ABC)BC(ABC))… You can try it out.

In addition, do not ignore () the middle branch symbol |, greater flexibility, it gave expression (? 1)? | [^ ()] + can be viewed as a | b mode; Without it, our feature becomes ab | b, this is not the same matching characteristics, matching flexibility also reduced a lot.

What I want to show from the above example is that the combination of recursive subgroups and branching structures + quantifiers can achieve very flexible matching.

In conclusion, this time, we introduce recursive subgroups, the difference between backward references, and the basic use of recursive re, and through a case, we explain recursion and branching structure, the combination of quantifiers, and finally, we add a point, recursive reference (? 0) the meaning of the expression and (? R) is the same.

Thank you for reading, if there are inaccuracies and errors, please leave a message to correct, I will correct in time, thanks!

Welcome those who like to study technology to communicate with me, learn from each other and grow together 🙂

Summary is not easy, please do not reprint privately, otherwise don’t blame the old man you are welcome

References:

www.php.net Official Documents

Blog.csdn.net/technofiend… Understand the (? R) recursion

www.cnblogs.com/f-ck-need-u… Master recursive regular expressions step by step

The secret of regular expressions (5) Recursive re

Recursively regular (? R)

There are several basic ways to use recursive regularization:

Related Posts

Directory traversal of Web security vulnerabilities

Spring+ Mybatis

14. What are type assertions in Go?