Writing in the front

A brief history and genre

In 1956, a mathematician published a paper called “Neural Network Event Representation and Finite Automata,” describing a notation called “regular sets.”

Then, Ken Thompson, the famous father of Unix, published the article “Regular expression Search Algorithm” in 1968, and introduced the regular into his editor QED, and the editor Ed after that, and later transplanted into the famous text search tool grep. Since then, regular expressions have been widely used in various tools of Unix or UniX-like systems (such as maxOS and Linux).

Later, due to the power of regular expressions, more and more platform tools are introduced. At the time, however, there were no clear standards, and the details of how regular expressions were used varied from platform to platform. There is an urgent need for a unified standard to regulate. Thus, genre emerged.

In 1986, POSIX was born and attempts to standardize began. Most tools on Unix systems or UniX-like systems, such as grep, sed, AWk, etc., follow this standard.

In December 1987, Larry Wall released the first version of Perl, which became popular for its powerful, regular-expression capabilities. Since then, pCL-Perl-compatible regular expressions have been improved and introduced in 1997. At this point, POSIX and PCRE two schools basically formed. As for the two schools under the corresponding regular expression syntax characteristics such as specific do not say, we have to understand is that the development of PCRE and rise closer to us, so we now used by most programming languages in the regular expression follows the PCRE genre, the POSIX grammar more primitive, concision, basic is still used in early in the system.

Basic usage

An expression describing the composition of text content

  • Verify text content
    • test
    • search
  • Extract text information
    • match
  • Replace text content
    • replace
  • Cut text content
    • split

metacharacters

  • Special single character
    • . Any character, excluding newline
    • \d any digit, \d any non-digit
    • \w any alphanumeric underscore, \D any non-alphanumeric underscore
    • \s any whitespace character, \s any non-whitespace character
  • Whitespace characters
    • \ r a carriage return
    • \ n a newline character
    • \ f form-feed character
    • \t TAB, \ V vertical TAB
  • The scope of
    • Ab | BC ab or BC
    • [1, 2, 3… Take any
    • [1-9A-z] take any of the range
    • [^…]. The not
  • quantifiers
    • (*) 0 to multiple times
    • (+) 1 to multiple times
    • (?) 0 or 1
    • {m} m times
    • {m,} at least m times
    • {m, n} m to n times

Quantifiers and greed

Re matching defaults to greedy mode

/a*? /Copy the code

In both greedy mode and non-greedy mode, one-off matching fails. In this case, it is necessary to backtrack, adjust the number of repeated matches and continue matching until the complete matching succeeds or fails.

There is another common mode, exclusive mode, that ECMAScript does not support. Its rule is to add a plus sign + after the quantifier, as follows:

/a*+/
Copy the code

Its matching follows the greedy pattern and does not backtrack.

Grouping and referencing

/ (? :123) /Copy the code

In ECMAScript, references are given by \ numbers, such as \1

Example: I hava a cat cat. ‘I hava a cat cat.’.replace(/(cat)(\s\1)/g, ‘$1’)

Do not save subgroups Groups in parentheses are saved by default. If you only need to use it for matching and do not reference it, then you do not need to save subgroups. : Can be, such as /(? :abc)/

Parenthesis nesting numbering problem

After grouping

Match the pattern

The global model

This is the most common matching pattern. Look for all matches, not stop at the first match.

Case insensitive mode

/[a-zA-Z]/ 
// Can be written as/ [a-z] / I or/a-z/ICopy the code

Multiline mode

The result is that ^ and $match the beginning and end of each line, i.e. each line is a matching unit.

Unicode mode

There are many Chinese names: wanguo code, International code, unicode code, etc. It is an industry standard in the field of computer science. It arranges and encodes most of the world’s text. Unicode is still evolving, with the latest version, 13.0.0, released on March 10, 2020, containing more than 140,000 characters. Unicode characters are now arranged into 17 groups, each consisting of a Plane with 65536 (2 ^ 16) code values per Plane. At present, most of the planes used belong to the no. 0 plane, that is, the BMP plane, in addition to the BMP plane, the other planes are called supplementary planes.

So much for the basics of Unicode, what is the Use of the Unicode schema? It is useful for matching correctly when we use characters outside the BMP plane, but a character may be split if Unicode is still not in use. For example, the music character ∮, standard matching will be interpreted as two characters 0xD834 and 0xDD1E, and /u mode is interpreted as a single character.

Dot pattern

Here’s an example:

var reg = /abc/
var str = "abcdefg"
reg.test(str) // true
reg.lastIndex / / 0
reg.test(str) // true
Copy the code

So this is in nonfixed-point mode, lastIndex is zero; For fixed-point mode, the following is used:

var reg = /abc/
var str = "abcdefg"
reg.test(str) // true
reg.lastIndex // 3
reg.test(str) // false
Copy the code

In fixed-point mode, lastIndex will move the index based on the result of the last match, and the next match will start from lastIndex. If there is a match, lastIndex will continue to move to the right. If the match fails, false will be returned, and lastIndex will be set to 0

assertions

There are three common assertions:

  • Word boundaries
  • Start or end of a line
  • Circumferential (zero-width assertion)

Word boundaries


If the re/Tom/match produces two results, both Tom and tomorrow’s Tom will be matched, and if Tom needs to be matched exactly, word boundaries can be handled efficiently: /\btom\b/ matches only Tom. Note that:

Word boundaries appear in grouping parentheses; \1 is not referenced.

Start or end of a line

Circumferential (zero-width assertion)

Of course, just looking at this chart is confusing. So let me explain, a zero-width assertion is really just a restriction on the left and right of the items that you want to match, and that corresponds to these four cases. Just remember: the left Angle bracket means look left, if you don’t, look right, and the exclamation mark means not.

escape

\ Transposed character

Principle of regular matching

Finite state automata

Regular engine
DFA
NFA
Traditional NFA
POSIX NFA

const reg = /a(? :bb)+c/Copy the code

The compilation of this regular expression is essentially the automaton generation process, and the regular engine matches the string based on the generated automaton. A generated automaton might look like this:

When matching to S3 state, it does not matter whether bb is in S3 state or S1 state before further matching, because after bb matches 1 group, the number of groups in the following state is the same for both states. This state machine is called a nondeterministic finite state automata (NFA). And NFA and DFA can be transformed. Let’s look at deterministic finite state automata (DFA) below:

To optimize the

01 | good compile regular ahead of time

02 | accurate representation matching range as far as possible

03 | extract the public section

const reg = /abcd|abxy/ // We need to extract the common part ab
Copy the code

04 | match likely part of the left

05 | careful subgroups

06 | avoid nested repeated subgroups

const reg = * / / (. *) // There are serious nested duplicates
Copy the code

07 | to avoid different branches to repeat

const reg = /[1-9]+|\w+/ // \w includes the preceding number
Copy the code

tool

Regular test