Zhiyuan Wang, Wedoctor Front-end Technology Department

Preface

Regular expressions are a familiar stranger. We see them in form validation and in the HTML-to-AST parsers inside framework source code. We run into them often, search for a ready-made pattern when we need one, and can make it work, but we panic as soon as we have to write one ourselves. The reason is simple: who can read this Martian script?

The goal of this article is to take you into the world of regular expressions. As a serious and responsible primer, it aims to make you actually understand them, so that you do not have to look things up again and again.

History of regular expressions

Regular expressions (regex) are rules used to validate or extract information.

It all started with neural networks

In the 1940s, two neurophysiologists developed a way to describe neural networks mathematically. Building on this, in 1956 the mathematician Stephen Kleene published the paper "Representation of Events in Nerve Nets and Finite Automata", which described a notation called regular sets.

Shining in the computer world

More than a decade later, in 1968, Ken Thompson, one of the fathers of UNIX, published his article "Regular Expression Search Algorithm", and regular expressions then made their way into the well-known text search tool grep.

Why regular expressions exist

Don't start by memorizing application-level APIs. The human brain is built to remember what it understands, so understand first. Don't waste your talent.

As we said, regular expressions are rules used to validate or extract information. The crudest way to match is one-to-one correspondence: a matches a, b matches b, and only identical text matches. That is far too inefficient. Things of one kind usually share common features, and those features define a set, such as an 11-digit mobile phone number or an email address containing @. Regular expressions let us match or extract the members of such rule-defined sets more efficiently.

How regular expressions do it

Regular expressions work by defining sub-rules: some symbols no longer stand for themselves but for a sub-rule. It is like constructing a building. To express the rule we want, we pick the right smallest bricks, decide how many of each brick we need, and the building is complete.

These sub-rules are the metacharacters. For example, the common \d stands for a single digit from 0 to 9, and \s stands for a single whitespace character such as a newline, tab, or space. Note that we keep saying "single". That makes sense: for text, the smallest unit is a single character, and with the smallest unit plus a repetition count we can build our own building.

At this point our basic picture of the regex world is in place, and we can get more down to earth.

Bricks: metacharacters

There are four dimensions, namely [character groups], [negated character groups], [common character groups], and [whitespace characters], which can be remembered as [3 + 1].

Character groups

Basic usage

For choosing among single characters, the regex term is "character group" (also called a character set), and we will use that term. But don't be fooled: it does not match a group of characters, it matches just one character out of the group. This is critical.

Syntax: [xxx]

Matching rule: the target text must contain at least one of the elements inside the square brackets.

Extraction rule: returns the first character that is one of the elements inside the square brackets.

For example, /[abc]/ matches one and only one of a, b, or c. By default, it returns the first match.
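The rule above can be tried out with a small sketch. The sample strings and variable names below are illustrative assumptions, not taken from the original test platform.

```javascript
// /[abc]/ matches exactly one character out of the set a, b, c.
const charGroup = /[abc]/;

// test() answers "does the text contain at least one such character?"
const hasAbc = charGroup.test('xyzb'); // 'b' is in the set
const noAbc = charGroup.test('xyz');   // none of a, b, c is present

// match() without the g flag returns only the first matching character.
const firstHit = 'cab'.match(charGroup);

console.log(hasAbc, noAbc, firstHit[0], firstHit.index); // true false c 0
```

Even though 'cab' consists entirely of characters from the group, only the single character 'c' is returned, which is the "one character out of the group" behaviour described above.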

Negated character groups

Negation logic

Normally a character group selects a range, but sometimes we need "anything except these". That is when negation is used.

Syntax: [^xxx]

Matching rule: the target text must contain at least one character that is not listed inside the square brackets.

Extraction rule: returns the first character that is not listed inside the square brackets.

For example, /[^abc]/ matches one and only one character that is not a, b, or c. By default, it returns the first match.
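A minimal sketch of the negated group, again with made-up sample strings:

```javascript
// /[^abc]/ matches one character that is NOT a, b, or c.
const notAbc = /[^abc]/;

// The first character outside the set is 'x' at index 2.
const firstOther = 'abxc'.match(notAbc);

// A text made only of a, b, c contains nothing the pattern can match.
const onlyAbc = notAbc.test('abcabc');

console.log(firstOther[0], firstOther.index, onlyAbc); // x 2 false
```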

Common character groups

The 80/20 rule also applies to character groups. Many matching rules are so common that there is no need to spell them out every time, so regex gives each of these special character groups its own name.

They follow the same [3 + 1] idea: three pairs plus one.

Symbol | Rule | Equivalent character group
\d | A single digit | [0-9]
\D | A single non-digit | [^0-9]
\w | A single digit, letter, or underscore (word character) | [0-9a-zA-Z_]
\W | A single character that is not a digit, letter, or underscore (non-word character) | [^0-9a-zA-Z_]
\s | A single whitespace character (space, horizontal tab, vertical tab, line feed, carriage return, form feed) | [ \t\v\n\r\f]
\S | A single non-whitespace character | [^ \t\v\n\r\f]
. | Wildcard: any single character except line feed, carriage return, line separator, and paragraph separator | [^\n\r\u2028\u2029]

Other common character groups

Equivalent character group | Rule
[\d\D], [\s\S], [\w\W], [^] | Any character
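To see the shorthand groups side by side, here is a small sketch that runs each of them over one illustrative string (the string itself is an assumption chosen to contain a letter, a digit, a space, an underscore, and a newline):

```javascript
// Five characters: 'a', '1', ' ', '_', '\n'
const sample = 'a1 _\n';

const digits = sample.match(/\d/g);       // digits only
const wordChars = sample.match(/\w/g);    // digits, letters, underscore
const spaces = sample.match(/\s/g);       // the space and the newline
const dotChars = sample.match(/./g);      // everything except the newline
const anyChars = sample.match(/[\s\S]/g); // truly every character

console.log(digits.length, wordChars.length, spaces.length,
            dotChars.length, anyChars.length); // 1 3 2 4 5
```

The difference between the last two counts is the newline, which . skips and [\s\S] includes.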

Whitespace characters

These characters are special, and each of them (for example \t, \n, \r) can also be used directly on its own.

At this point we know how to match a single character, which is our "smallest unit". But matching one character at a time does not scale, and the regex would become unbearably long, so we need repetition. That is what quantifiers are for.

How many bricks: quantifiers

Once we understand character groups, we need to understand how many of them to use. The regex term for this is the quantifier, and again there are 3 + 1 of them.

This is not very complicated. The only thing to note is that, by default, an element is matched exactly once.

Let's look at the cases, still using [abc] as the smallest unit.

*: 0 to N times

Regex: /[abc]*/

Matching rule: the target text does not need to contain any element from the square brackets, so the match always succeeds.

Extraction rule: returns the first run of consecutive characters that all belong to the square brackets (possibly empty).

+: 1 to N times

Regex: /[abc]+/

Matching rule: the target text must contain at least one element from the square brackets.

Extraction rule: returns the first run of consecutive characters that all belong to the square brackets.

?: 0 or 1 times

Regex: /[abc]?/

Matching rule: the match always succeeds.

Extraction rule: returns the first character that belongs to the square brackets, or an empty match.
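The three quantifiers can be compared with a short sketch. The two sample strings are assumptions picked to show one subtlety: because * and ? accept zero repetitions, they succeed immediately at index 0 with an empty match when the text does not start with a character from the group.

```javascript
const startsWithX = 'xxabcab';
const startsWithA = 'abcab';

// * and ? may match zero times, so they match '' at index 0 of 'xxabcab'.
const star = startsWithX.match(/[abc]*/);
const optional = startsWithX.match(/[abc]?/);

// + needs at least one character, so it skips ahead to index 2.
const plus = startsWithX.match(/[abc]+/);

// When the text starts with a group character, the runs are non-empty.
const starRun = startsWithA.match(/[abc]*/)[0];
const optionalOne = startsWithA.match(/[abc]?/)[0];

console.log(JSON.stringify(star[0]), JSON.stringify(optional[0])); // "" ""
console.log(plus[0], plus.index);                                   // abcab 2
console.log(starRun, optionalOne);                                  // abcab a
```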

{}: precise control of the count

All of the above are special cases. With {} we can control the number of repetitions exactly. There are three main forms:

  • {m}: must occur exactly m times
  • {m,n}: occurs m to n times
  • {m,}: occurs at least m times
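A minimal sketch of the three {} forms, using an assumed string of seven digits:

```javascript
const sevenDigits = '1234567';

// {3}: exactly three digits
const exactly3 = sevenDigits.match(/\d{3}/)[0];

// {2,4}: between two and four digits; greedy, so it takes four
const twoToFour = sevenDigits.match(/\d{2,4}/)[0];

// {5,}: at least five digits; greedy, so it takes all seven
const atLeast5 = sevenDigits.match(/\d{5,}/)[0];

// Four digits are not enough for {5,}
const tooShort = /\d{5,}/.test('1234');

console.log(exactly3, twoToFour, atLeast5, tooShort); // 123 1234 1234567 false
```

Note that in JavaScript the braces must not contain a space: /\d{2, 4}/ is treated as literal text rather than a quantifier.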


Quantifier modes

Quantifiers also have modes. A quantifier describes a range, so it could take more or take less, but a computer cannot tolerate ambiguity. That is why quantifiers have three modes:

  • Greedy mode: the default; matches as much as possible
  • Lazy mode: append ? to the quantifier; matches as little as possible
  • Possessive mode: append + to the quantifier; never backtracks (not supported in JavaScript)

The examples below show the difference between the modes.

Test case: aaabb

Test regexes:

  • Greedy mode: /a*/
  • Lazy mode: /a*?/

Greedy mode: /a*/

Output: ['aaa', '', '', '']

Lazy mode: /a*?/

Output (as reported by the test platform): ['', 'a', '', 'a', '', 'a', '', '', '']
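The same contrast can be sketched in JavaScript by looking only at the first match, which avoids differences in how individual engines continue after an empty match. The HTML-like string in the second half is an illustrative assumption.

```javascript
// First match only: greedy takes every 'a', lazy takes none.
const greedy = 'aaabb'.match(/a*/)[0];
const lazy = 'aaabb'.match(/a*?/)[0];

// With +, lazy must still take the minimum of one character.
const lazyPlus = 'aaabb'.match(/a+?/)[0];

// A more practical case: stop at the first closing tag instead of the last.
const html = '<b>one</b><b>two</b>';
const greedyTag = html.match(/<b>.*<\/b>/)[0];
const lazyTag = html.match(/<b>.*?<\/b>/)[0];

console.log(JSON.stringify(greedy), JSON.stringify(lazy), lazyPlus); // "aaa" "" a
console.log(greedyTag); // <b>one</b><b>two</b>
console.log(lazyTag);   // <b>one</b>
```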

Supplementary case

At this point, we have covered the rules for quantifiers.

Regex modes

Since quantifiers have modes, the regex itself naturally has modes too. There are [3 + 1] of them: [case, multi-line, dot matching, comments].

  • Case-insensitive mode
  • Dot-matches-all mode
  • Multi-line mode
  • Comment mode

Let's take a look at each one.

Case-insensitive mode

Syntax: /(?i)reg/; the JS equivalent is /reg/i

Note:

  1. Case-insensitive mode is specified with the mode modifier (?i);
  2. If the modifier is inside parentheses, it applies only to the regex inside those parentheses, not to the whole regex.

Effect: ignores case when matching

Regex: /(?i)(cat) \1/; the JS equivalent is /(cat) \1/i
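A quick sketch of the JS form, /(cat) \1/i, with assumed sample inputs. It shows that under the i flag the backreference \1 is also compared without regard to case.

```javascript
const repeatedCat = /(cat) \1/i;   // i flag: ignore case everywhere
const strictCat = /(cat) \1/;      // no flag: lowercase 'cat cat' only

const sameCase = repeatedCat.test('cat cat');
const mixedCase = repeatedCat.test('Cat CAT');
const strictMixed = strictCat.test('Cat CAT');

console.log(sameCase, mixedCase, strictMixed); // true true false
```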

If the repeated word must have exactly the same case both times, we can use the following regex:

Regex: /((?i)cat) \1/; there was no JS equivalent at the time of writing

Dot-matches-all mode (single-line mode)

Syntax: /(?s)reg/; the JS equivalent is /reg/s (the dotAll flag, added in ES2018)

Note:

Effect: lets the . metacharacter match every character, including newlines
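As a sketch of the effect in JavaScript, assuming an engine with ES2018 support for the s flag, and with [\s\S] shown as a fallback that works without the flag:

```javascript
const twoLines = 'first\nsecond';

// By default . refuses to match the newline between the two words.
const withoutS = twoLines.match(/first.second/);

// With the s (dotAll) flag, . matches the newline too.
const withS = twoLines.match(/first.second/s);

// [\s\S] matches any character in every mode.
const portable = twoLines.match(/first[\s\S]second/);

console.log(withoutS === null, withS !== null, portable !== null); // true true true
```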

Multi-line mode

Syntax: /(?m)reg/; the JS equivalent is /reg/m

Note:

Effect: makes ^ and $ match the beginning and end of each line

Without the mode: /^the|cat$/

With the mode: /(?m)^the|cat$/; the JS equivalent is /^the|cat$/m
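A small sketch of the before/after behaviour, using an assumed two-line text in which both lines start with "the" and end with "cat":

```javascript
const twoLineText = 'the cat\nthe cat';

// Without m: ^ is only the start of the text, $ only the end of the text.
const singleLine = twoLineText.match(/^the|cat$/g);

// With m: ^ and $ also match at the start and end of every line.
const multiLine = twoLineText.match(/^the|cat$/gm);

console.log(singleLine); // [ 'the', 'cat' ]
console.log(multiLine);  // [ 'the', 'cat', 'the', 'cat' ]
```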

Comment mode

Syntax: /(?#comment)reg/; not available in JS

Note:

Effect: allows comments to be added inside the regex

Example regex: /(\w+)(?#word) \1(?#word repeat again)/

Position information in regex

Matching is like looking for a person: besides staring at a portrait or photo of them, we can also describe what is in front of them, what is beside them, and what is behind them. Regular expressions need this kind of position information too, and there is a special term for it: assertions.

An assertion determines the position of the matched text. There are three kinds: word boundaries, line start/end, and lookaround.

Start/end of line

We have already touched on this: use ^ and $ when a match must appear at the beginning or end of a line of text.

Combined with the multi-line mode described earlier: by default the regex treats the text as a single line, regardless of any newlines in it, so the start and end of the line are the start and end of the whole text. To handle multiple lines, just switch to multi-line mode. The JS syntax is /reg/m.

Word boundary

Multi-line mode with ^ and $ handles boundaries at the line level, but not at the word level. Suppose we want to replace the name tom with jerry in the following text:

tom asked me if I would go fishing with him tomorrow.

If the regex is /tom/, a false substitution occurs: the "tom" inside "tomorrow" is replaced as well.

Obviously, what we want is the word tom, not every piece of text that contains tom. We can use word boundaries to pin down the start and end and avoid the ambiguity.

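Basic concept

Syntax: \b

\b matches a position, not a character: the boundary between a word character \w ([a-zA-Z0-9_]) and a non-word character, or the start or end of the text.

Example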
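The tom-to-jerry replacement can be sketched as follows, using the sentence from above in lowercase (the replacement name and variable names are illustrative):

```javascript
const sentence = 'tom asked me if I would go fishing with him tomorrow.';

// Without boundaries, the "tom" inside "tomorrow" is replaced too.
const naive = sentence.replace(/tom/g, 'jerry');

// \b on both sides restricts the match to the whole word "tom".
const bounded = sentence.replace(/\btom\b/g, 'jerry');

console.log(naive);   // jerry asked me if I would go fishing with him jerryorrow.
console.log(bounded); // jerry asked me if I would go fishing with him tomorrow.
```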

Lookaround

We have just covered boundaries, both word and line boundaries. A boundary really says that the matched text must have certain content before or after it; that content is fixed as ^ and $ for lines and \b for words.

Lookaround makes this content flexible. For any piece of text there are two directions and two conditions (must match or must not match), which gives four cases: (?=x) means followed by x, (?!x) means not followed by x, (?<=x) means preceded by x, and (?<!x) means not preceded by x.

A mnemonic sums it up: an angle bracket means look left, an equals sign means must, an exclamation mark means must not.
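The four cases can be sketched like this. The sample strings are assumptions, and lookbehind requires an engine with ES2018 support. The extra \d and \b in the negative patterns are there to stop the engine from matching just part of a number.

```javascript
const sizes = '10px 20em 30px';
const prices = 'cost $30, tax 5, total $35';

// (?=x)  lookahead: digits that are followed by "px"
const followedByPx = sizes.match(/\d+(?=px)/g);

// (?!x)  negative lookahead: whole numbers that are not followed by "px"
const notFollowedByPx = sizes.match(/\d+(?!px|\d)/g);

// (?<=x) lookbehind: digits that are preceded by "$"
const afterDollar = prices.match(/(?<=\$)\d+/g);

// (?<!x) negative lookbehind: whole numbers that are not preceded by "$"
const notAfterDollar = prices.match(/(?<!\$)\b\d+/g);

console.log(followedByPx, notFollowedByPx); // [ '10', '30' ] [ '20' ]
console.log(afterDollar, notAfterDollar);   // [ '30', '35' ] [ '5' ]
```

In every case the lookaround itself consumes nothing: only the digits end up in the result.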

Logic in regular expressions

We have learned about character groups and quantifiers. As in a programming language, components alone are not enough; we also need logical decisions. Regex has logical metacharacters for this.

Logical metacharacters

|: OR logic

A resource may start with http://, https://, or ftp://, so the protocol part can be written as (https?|ftp)://.
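A minimal sketch of the protocol pattern in JavaScript. The ^ anchor and the example URLs are additions for illustration; inside a regex literal the slashes must be escaped as \/.

```javascript
// https? covers http and https; | adds ftp as the alternative.
const protocol = /^(https?|ftp):\/\//;

const okHttp = protocol.test('http://example.com');
const okHttps = protocol.test('https://example.com');
const okFtp = protocol.test('ftp://example.com');
const badFile = protocol.test('file://example.com');

// The group captures which alternative actually matched.
const scheme = 'https://example.com'.match(protocol)[1];

console.log(okHttp, okHttps, okFtp, badFile, scheme); // true true true false https
```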

Grouping: raising precedence

Regex has the concept of grouping, which serves two main purposes: treating a sub-pattern as a whole, and reuse.

As a whole

Take /\d{15}\d{3}?/. The ? after \d{3} only switches that quantifier to lazy mode; {3} still demands exactly three digits, so this regex matches 18 digits and never just 15.

Test regex: /\d{15}\d{3}?/

To make \d{3} optional as a whole, wrap it in a group:

Test regex: /\d{15}(\d{3})?/
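The difference between the two test regexes can be checked with a short sketch. The ^ and $ anchors are an addition here so that the whole input must be exactly 15 or 18 digits, and the inputs are dummy strings of repeated digits.

```javascript
const lazyThree = /^\d{15}\d{3}?$/;       // ? only makes {3} lazy
const optionalThree = /^\d{15}(\d{3})?$/; // ? makes the whole group optional

const id15 = '1'.repeat(15);
const id16 = '1'.repeat(16);
const id18 = '1'.repeat(18);

console.log(lazyThree.test(id15), lazyThree.test(id18));         // false true
console.log(optionalThree.test(id15), optionalThree.test(id18)); // true true
console.log(optionalThree.test(id16));                           // false
```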

Reuse

Sometimes we also need the result of an earlier match, for example to [check the text for repeated words]. The solution becomes:

  1. Write a regex that matches a single word

  2. Match again using the previous result

    The second step is done with grouping. Let's start with the basics.

Basic concept

Syntax: use () to define a group, \number to reference it inside the regex, and $number to reference it in methods

By default, each parenthesized part is saved as a subgroup that can be accessed by its number, counting up from 1. The (?:) syntax can be used instead so that the subgroup is not saved and does not take up a number.

Group reference syntax in detail

Group reference

If a group has the number n, it can be referenced with \n.

Multiple groups

Groups are numbered by their opening parentheses: the nth opening parenthesis starts group n.

Non-capturing groups

A group written with the (?:) syntax is not assigned a number.
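The three points above can be sketched together. The sample sentence and date are assumptions made for illustration.

```javascript
// Backreference: \1 must repeat whatever group 1 matched.
const repeatedWord = /\b(\w+) \1\b/;
const dup = 'this is is a test'.match(repeatedWord);

// Numbering follows the opening parentheses, left to right.
const nested = '2021-12-21'.match(/((\d{4})-(\d{2}))-(\d{2})/);

// (?:...) groups without capturing, so the month becomes group 1.
const nonCapturing = '2021-12'.match(/(?:\d{4})-(\d{2})/);

console.log(dup[0], '|', dup[1]);                      // is is | is
console.log(nested[1], nested[2], nested[3], nested[4]); // 2021-12 2021 12 21
console.log(nonCapturing.length, nonCapturing[1]);     // 2 12
```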

Replace feature

Named groups

V8 has now fully implemented the proposal tc39.github.io/proposal-re…

Basic concept

Syntax: define a group with (?<name>...), reference it with \k<name> inside the regex and with $<name> in methods.

Effect: gives the group a name, so it is accessed directly by name instead of by number, which is less error-prone.
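A minimal sketch of the three pieces of syntax, (?<name>...), \k<name>, and $<name>, assuming an engine with ES2018 named-group support. The group name word and the sample strings are illustrative.

```javascript
// \k<word> repeats whatever the group named "word" matched.
const doubled = /\b(?<word>\w+) \k<word>\b/;
const doubledHit = 'it was was late'.match(doubled);

// $<name> refers to a named group inside a replacement string.
const isoDate = '04-25-2017'.replace(
  /(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})/,
  '$<year>-$<month>-$<day>'
);

console.log(doubledHit.groups.word, isoDate); // was 2017-04-25
```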

APIs combined with destructuring assignment

In the JS regex methods, when a regex has named groups, the result carries a groups property that stores the name of each named group and the value it matched. Combined with destructuring assignment, this works very nicely.

Use in exec() and match():

The exec() and match() methods return an array of match results with a groups property that holds the name of each named group and the value it matched

const {day, month, year} = "04-25-2017".match(/(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})/).groups

In replace(/…/, replacement):

When replacement is a function, one more groups object is passed at the end of the argument list

"04-25-2017".replace(/(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})/, (...args) => {
  // The groups object is the last argument passed to the callback
  const groups = args.slice(-1)[0]
  const {day, month, year} = groups
  return `${day}-${month}-${year}`
})

Programming with regular expressions

This is the most important part: learning to actually use regex. Here we cover how regex is applied in front-end programming.

Regular expressions are ultimately used from within a programming language, so let's look at how to program with them.

Regex processing falls into four categories:

  • Validating text
  • Extracting text
  • Replacing text
  • Splitting text

Let's take them one by one.

Validating text

Note on lastIndex: lastIndex is the position at which the regex starts its next match. With the four string methods, every match starts from 0, so lastIndex stays the same. The regex methods exec and test update lastIndex after each match when the g flag is set.

var regex = new RegExp(/^\d{4}-\d{2}-\d{2}/, 'g')
regex.test('2021-12-21') // true
console.log(regex.lastIndex) // 10
regex.test('2021-12-21') // false
console.log(regex.lastIndex) // 0

Since validation does not need to find every match, it is recommended not to set the g flag on a RegExp used for validation.

String method: search

search converts a string argument to a regex

Regex method: test
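A short sketch of both methods with assumed sample strings. The 'a.b' case shows what "converts a string argument to a regex" means in practice: the '.' is interpreted as the wildcard, so the result differs from indexOf.

```javascript
const dueText = 'due 2021-12-21';

// search returns the index of the first match, or -1.
const dateAt = dueText.search(/\d{4}-\d{2}-\d{2}/);
const missing = dueText.search(/\d{5}/);

// A string argument becomes a regex, so '.' means "any character".
const dotAsRegex = 'a.b'.search('.');
const dotAsText = 'a.b'.indexOf('.');

// test returns a plain boolean, which is all that validation needs.
const isDate = /^\d{4}-\d{2}-\d{2}$/.test('2021-12-21');

console.log(dateAt, missing, dotAsRegex, dotAsText, isDate); // 4 -1 0 1 true
```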

Extracting text

String method: match

match converts a string argument to a regex

Note: the return value of match depends on the g flag (it is null if there is no match).

  • Without g: returns the standard match format. The first element of the array is the whole match, followed by the contents of the captured groups; the array also carries the index of the whole match and the target string

  • With g: returns an array containing all matches
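The two return shapes can be compared with a small sketch; the version-like sample text is an assumption.

```javascript
const versions = 'v1.2 then v3.4';

// Without g: whole match, then groups, plus index and input properties.
const firstVersion = versions.match(/v(\d)\.(\d)/);

// With g: just the list of whole matches, no groups and no index.
const allVersions = versions.match(/v(\d)\.(\d)/g);

// No match at all gives null in both forms.
const noVersion = versions.match(/x/g);

console.log(firstVersion[0], firstVersion[1], firstVersion[2], firstVersion.index); // v1.2 1 2 0
console.log(allVersions, noVersion); // [ 'v1.2', 'v3.4' ] null
```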

Regex method: exec

exec is more powerful than match: it solves the problem that match with the g flag returns no index information. With exec, the regex stores the position where the next match starts in its lastIndex property
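A minimal sketch of the usual exec loop, which collects every match together with its index. The sample text and the shape of the collected objects are illustrative choices.

```javascript
const versionText = 'v1.2 then v3.4';
const versionRe = /v(\d)\.(\d)/g; // g is required so lastIndex advances

const found = [];
let hit;
// Each call resumes at versionRe.lastIndex; null ends the loop.
while ((hit = versionRe.exec(versionText)) !== null) {
  found.push({ text: hit[0], major: hit[1], index: hit.index, next: versionRe.lastIndex });
}

console.log(found);
// [ { text: 'v1.2', major: '1', index: 0, next: 4 },
//   { text: 'v3.4', major: '3', index: 10, next: 14 } ]
console.log(versionRe.lastIndex); // 0, reset after the failed final attempt
```

Without the g flag, lastIndex would never advance and this loop would not terminate, so the flag is essential here even though it is discouraged for validation.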

Replace text content

String method: replace
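As a sketch of replace with a regex, here are the two common forms: a replacement string that uses $number group references, and a replacement function. The dates and names are made-up sample data.

```javascript
// $1, $2, $3 refer to the numbered groups of the match.
const usDate = '2021-12-21'.replace(/(\d{4})-(\d{2})-(\d{2})/, '$2/$3/$1');

// A function receives the matched text and returns its replacement.
const shouted = 'tom and tim'.replace(/\bt\w+/g, (word) => word.toUpperCase());

// Without the g flag only the first match is replaced.
const firstOnly = 'tom and tim'.replace(/\bt\w+/, 'X');

console.log(usDate);    // 12/21/2021
console.log(shouted);   // TOM and TIM
console.log(firstOnly); // X and tim
```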

Splitting text

String method: split
  • It accepts a second argument that sets the maximum length of the result array
  • If the regex uses capturing groups, the result array also contains the delimiters
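Both points can be sketched with an assumed comma-separated string:

```javascript
const csv = 'a, b,c';

// A regex delimiter absorbs the optional spaces around each comma.
const parts = csv.split(/\s*,\s*/);

// The second argument caps the length of the result array.
const firstTwo = csv.split(/\s*,\s*/, 2);

// A capturing group makes split keep what the group matched.
const withCommas = csv.split(/\s*(,)\s*/);

console.log(parts);      // [ 'a', 'b', 'c' ]
console.log(firstTwo);   // [ 'a', 'b' ]
console.log(withCommas); // [ 'a', ',', 'b', ',', 'c' ]
```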

Summary of front-end related apis

  • string
    • match
    • split
    • search
    • replace
  • RegExp
    • test
    • exec

Conclusion

Let's summarize and sketch a knowledge map. It is convenient for our own memory and convenient to share with others.

3 + 1 metacharacters; 3 + 1 common character groups; 3 + 1 quantifiers; 3 quantifier modes; 3 + 1 regex modes; 3 + 1 pieces of regex logic

First, the building material. Metacharacters have four dimensions, namely [character groups], [negated character groups], [common character groups], and [whitespace characters], which can be remembered as [3 + 1].

Once we understand character groups, we need to understand how many to use. The regex term is quantifier, and there are 3 + 1 of them.

Quantifiers also have modes. A quantifier describes a range, so it could take more or take less, but a computer cannot tolerate ambiguity, so quantifiers have three modes.

Since quantifiers have modes, the regex itself naturally has modes too. There are [3 + 1] of them: [case, multi-line, dot matching, comments].

Locating something requires not only its own absolute information but also relative position information, which regex calls assertions. There are three kinds, [line start/end, word boundary, and lookaround], and lookaround itself has four cases.

As with a programming language, loose materials are not enough; we also need logic. Regex has the branch operator | and grouping for precedence. Groups come in three kinds: default (numbered) groups, named groups, and non-capturing groups.

With that, we have summarized the overall structure of regular expressions in a few concise statements.

The end

I hope this lets you understand and remember regular expressions in one pass. To echo the opening: try things out for yourself. For some requirements, you will find that approaching them from the regex angle opens up many surprisingly neat solutions.