Regular expression

Regular expressions are patterns used to match character combinations in strings.

In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.

Regular expressions – JavaScript | MDN (mozilla.org)

Two forms of creation

Basic form (literal)

The regular expression content is wrapped with a pair of slashes

/abc/ // Matches the substring "abc"

Use constructors to create regular expressions

let re = new RegExp('abc')

Regular expression literals are compiled when the script is loaded. When the regular expression does not change, using a literal gives better performance. Regular expressions created with the constructor are compiled at run time. If the pattern changes dynamically or comes from another source such as user input, the constructor is needed

Arguments in a regular expression

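Because the pattern of a literal is fixed, a variable is inserted by building the pattern as a string (for example with a template string) and passing it to the RegExp constructor. A template string such as `/${a}/` on its own only produces the string "/abc/", not a regular expression

let a = 'abc'
new RegExp(`${a}`) // In this case "abc" is matched, equivalent to /abc/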
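When the variable comes from user input, any character in it that is special in a pattern changes what the regular expression means. A minimal sketch of building a pattern from a variable follows; escapeRegExp is a hypothetical helper written for this example (it is not a built-in), and keyword is an assumed input:

```javascript
// Hypothetical helper (an assumption for this example): escapes every
// character that is special in a pattern so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const keyword = "a.c"; // assumed input for illustration

const loose = new RegExp(keyword);                // "." matches any character
const strict = new RegExp(escapeRegExp(keyword)); // "." must be a literal dot

console.log(loose.test("abc"));  // true
console.log(strict.test("abc")); // false
console.log(strict.test("a.c")); // true
```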

Pattern matching of regular expressions

Simple patterns (literals)

Consists of basic characters that directly match character content (same characters and order)

/abc/ // Matches the substring "abc" exactly

Special characters

Special characters are used when the match needs to be more than a fixed sequence of characters, such as matching any number of b characters (/b*/) or a string beginning with a (/^a/)

The backslash \

A backslash before a non-special character indicates that the next character is special and is not to be interpreted literally. For example, /\d/ does not match the letter d; it matches any digit from 0 to 9. Conversely, a backslash before a special character indicates that the character is to be matched literally: /a\*/ matches "a*"

Note that \ is also an escape character in string literals, so to put a backslash into a pattern string you need to escape it in the string, as in new RegExp("\\b")

To match a literal backslash, the backslash must be escaped in the pattern itself and escaped again when the pattern is written as a string: the literal /\b\\/ corresponds to new RegExp("\\b\\\\")
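The double escaping is easy to get wrong, so a short sketch comparing the literal and the string form may help. The sample strings are assumptions chosen for illustration:

```javascript
const fromLiteral = /\bcat\b/;
const fromString = new RegExp("\\bcat\\b"); // each backslash is doubled in the string

console.log(fromLiteral.source === fromString.source); // true
console.log(fromString.test("a cat sat"));             // true

// With single backslashes, "\b" inside a string literal is a backspace
// character, so the pattern no longer contains a word-boundary assertion.
const broken = new RegExp("\bcat\b");
console.log(broken.test("a cat sat")); // false

// Matching one literal backslash: two in a literal, four in a string.
const path = "C:\\temp"; // the actual text is C:\temp
console.log(/\\/.test(path));               // true
console.log(new RegExp("\\\\").test(path)); // true
```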

Character classes

Use character classes to specify character types (such as numbers, letters, control characters)

The dot .

  • By default, matches any single character other than a line terminator (\n, \r, \u2028 or \u2029)
  • In a character class, the dot loses its special meaning and only matches a literal dot

/.n/
// Will match 'an' and 'on' in "nay, an apple is on the tree", but will not match 'nay'

Since ES2018 (ES9), if the s ("dotAll") flag is set, the dot also matches line terminators
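A small sketch of the three behaviours of the dot described above; the sample text is an assumption for illustration:

```javascript
const text = "line1\nline2";

console.log(/line1.line2/.test(text));  // false: "." does not match "\n"
console.log(/line1.line2/s.test(text)); // true: with the s flag "." also matches "\n"

console.log(/[.]n/.test("an")); // false: inside [] the dot is a literal dot
console.log(/[.]n/.test(".n")); // true
```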

\d

Matches any single character that is an Arabic digit

\D

Matches any single character that is not an Arabic digit

\w

Matches any single character that is a basic Latin letter (uppercase or lowercase), a digit, or an underscore; equivalent to [A-Za-z0-9_]

\W

Matches any single character that is not a basic Latin letter, a digit, or an underscore; equivalent to [^A-Za-z0-9_]

\s

Matches any single whitespace character, including spaces, tabs, form feeds, line feeds, and other Unicode spaces

\S

Matches any single character that is not a whitespace character
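The character classes above can be tried on one string. A minimal sketch; the sample string is an assumption for illustration:

```javascript
const sample = "Room 42_b, floor-3";

console.log(sample.match(/\d/g));        // ["4", "2", "3"]
console.log(sample.match(/\w+/g));       // ["Room", "42_b", "floor", "3"]
console.log(sample.match(/\W/g));        // [" ", ",", " ", "-"]
console.log(sample.match(/\s/g).length); // 2
console.log(sample.match(/\S+/g));       // ["Room", "42_b,", "floor-3"]
```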

Matching escape characters

\t

Matches a horizontal tab

\v

Matches a vertical tab

\r

Matches a carriage return

\n

Matches a line feed

\f

Matches a form feed

[\b]

Matches a backspace character (note the difference from \b, the word-boundary assertion)

\0

Matches a NUL character (U+0000)

\cX

Matches a control character using caret notation (^A to ^Z), where X is a letter from A to Z

Matching a specific value

\xhh

Matches the character with the code hh (two hexadecimal digits). Note that the x here is the literal letter x, not a placeholder for a pattern item

\uhhhh

Matches the UTF-16 code unit with the value hhhh (four hexadecimal digits)

\u{hhhh} or \u{hhhhh}

(Only when the u flag is set) matches the character with the Unicode code point value given in hexadecimal

\p{UnicodeProperty}, \P{UnicodeProperty}

Matches a character (such as an emoji, a Katakana character, or a Han character) based on its Unicode character properties
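The code-based escapes can be checked with a few one-liners. This is a sketch only; the emoji U+1F600 is an assumed sample character, chosen because it lies outside the range that a single UTF-16 code unit can represent:

```javascript
const grin = "\u{1F600}"; // one code point, stored as two UTF-16 code units

console.log(/\x41/.test("A"));           // true: 0x41 is "A"
console.log(/\u0041/.test("A"));         // true
console.log(/^\u{1F600}$/u.test(grin));  // true: \u{...} requires the u flag
console.log(/^.$/.test(grin));           // false: without u the dot sees one code unit
console.log(/^.$/u.test(grin));          // true: with u the dot sees one code point
```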

Quantifiers

Used to specify how many times a character or expression should be matched

Note: The X below represents a regular expression item, which can be not only an individual character but also a character class, a Unicode property escape, or a group or range

X*

Matches X 0 or more times

X+

Matches X 1 or more times

X?

  • When it does not follow a quantifier, matches X zero or one time
  • When it immediately follows a quantifier (*, +, ?, {}), it makes that quantifier non-greedy, matching as few characters as possible. By default, a quantifier is greedy and matches as many characters as possible

X{n}

n is a non-negative integer; matches X exactly n times

X{n,}

n is a non-negative integer; matches X at least n times

X{n,m}

n and m are non-negative integers with m >= n; matches X at least n times and at most m times
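A short sketch of the quantifiers listed above, using assumed sample words:

```javascript
console.log("bd".match(/bo*/)[0]);           // "b": zero "o" is allowed
console.log(/bo+/.test("bd"));               // false: at least one "o" is required
console.log("angel".match(/e?le?/)[0]);      // "el": each "e" is optional
console.log("caaandy".match(/a{2}/)[0]);     // "aa"
console.log("caaandy".match(/a{2,}/)[0]);    // "aaa"
console.log("caaaaandy".match(/a{1,3}/)[0]); // "aaa": at most three
```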

Greedy versus non-greedy

By default all quantifiers are greedy, which means they try to match as many characters as possible. Adding a question mark ? after the quantifier makes it non-greedy, which means it stops as soon as a match is found (matching as few characters as possible)

"some <foo> <bar> new </bar> </foo> thing";
/<.*>/; // will match "<foo> <bar> new </bar> </foo>"
/<.*?>/; // will match "<foo>"
// Examples from MDN documentation

Groups and ranges

x|y

Matches x or y

[xyz], [a-c]

Defines a character class that matches any one of the enclosed characters. A hyphen - can be used to specify a range of characters, so [abcd] can be written as [a-d]. If the hyphen is the first or last character in the brackets, it is treated as a literal hyphen

[^xyz], [^a-c]

Defines a negated character class (think of it as the complement of [xyz] or [a-c]) that matches any character not enclosed in the brackets

(x)

Capturing group: matches x and remembers the match. An expression can contain multiple capturing groups. Their matches are stored in the result array in the order of their opening parentheses and can be read by index ([1], ..., [n]) or through the legacy static properties of the RegExp constructor (RegExp.$1, ..., RegExp.$9)

A capturing group always records the captured substring so that it can be used again. Use a non-capturing group if the match does not need to be recorded

(?:x)

Non-capturing group: matches x but does not remember the match, so the matched content cannot be referenced later

\num

num is a positive integer, and \num is a back reference to the last substring matched by the num-th capturing group (counting opening parentheses). For example, /(foo) (bar) \1 \2/ matches "foo bar foo bar"

(?<name>x)

Named capturing group, where name is the group name (the angle brackets <> are required) and x is the pattern to match. Matches are stored in the groups property of the result, so with the result stored in a variable called matches, the substring is read as matches.groups.name

\k<name>

A back reference to the last substring matched by the named capturing group called name (the angle brackets <> are required). The k is the literal letter k, which marks the named back reference and cannot be omitted
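The group constructs above can be seen side by side in a small sketch. The date format, the group names year, month and q, and the sample strings are all assumptions for illustration:

```javascript
// Capturing and named capturing groups; named groups are numbered as well.
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(\d{2})/;
const found = datePattern.exec("Released 2021-09-30");
console.log(found[0]);           // "2021-09-30"
console.log(found[1], found[3]); // "2021" "30"
console.log(found.groups.month); // "09"

// A non-capturing group groups without recording anything.
console.log(/(?:ab)+(c)/.exec("ababc").length); // 2: the full match and "c"

// Numbered and named back references.
console.log(/(\w)\1/.exec("hello")[0]);                     // "ll"
console.log(/(?<q>['"]).*?\k<q>/.exec('say "hi" now')[0]);  // "\"hi\""
```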

Assertions

Assertions include assertions about boundaries and character sequences.

Caret ^

Matches the beginning of the input. If the multiline flag is set, it also matches the position immediately after a line break

/^A/
// Does not match the "A" in "an A", but matches the first "A" in "An A"

Dollar sign $

Matches the end of the input. If the multiline flag is set, it also matches the position immediately before a line break

/t$/
// Does not match the "t" in "eater", but matches the "t" in "eat"

\b

Matches a word boundary. This is a position where a word character is not followed, or not preceded, by another word character

"moon";
/\bm/; // matches "m"
/oon\b/; // matches "oon"
/oo\b/; // does not match anything, because "oo" is followed by the word character "n"
/\w\b\w/; // can never match anything, because a word character cannot be followed by both a word boundary and another word character

\B

Matches a non-word boundary. This is a position where the previous and next characters are of the same type: either both are word characters or both are non-word characters, such as between two letters or between two spaces

The beginning and end of a string are treated as non-word characters

"noon";
/\Bon/; // matches "on"
/no\B/; // matches "no"

x(?=y)

Lookahead assertion: matches "x" only if "x" is followed by "y"

/Jack(?=Sprat)/ // matches "Jack" only if it is followed by "Sprat"
// But "Sprat" is not part of the match result

x(?!y)

Negative lookahead assertion: matches "x" only if "x" is not followed by "y"

(?<=y)x

Lookbehind assertion: matches "x" only if "x" is preceded by "y"

(?<!y)x

Negative lookbehind assertion: matches "x" only if "x" is not preceded by "y"
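The four lookaround forms can be compared in a short sketch; the sample strings are assumptions for illustration. In every case the text inside the assertion is checked but is not part of the returned match:

```javascript
// Lookahead and negative lookahead
console.log("JackSprat".match(/Jack(?=Sprat)/)[0]); // "Jack"
console.log(/Jack(?=Sprat)/.test("JackFrost"));     // false
console.log("3.141".match(/\d+(?!\.)/)[0]);         // "141": digits not followed by "."

// Lookbehind and negative lookbehind
const prices = "$42 and 17";
console.log(prices.match(/(?<=\$)\d+/)[0]);   // "42": digits preceded by "$"
console.log(prices.match(/(?<!\$)\b\d+/)[0]); // "17": a number not preceded by "$"
```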

Unicode Property Escapes

Unicode property escapes allow matching characters (emoji, letters, and so on) based on their Unicode properties

For Unicode property escapes to work, the u flag must be set, which indicates that the string is to be treated as a sequence of Unicode code points.

Syntax

// Non-binary values
\p{UnicodePropertyValue}
\p{UnicodePropertyName=UnicodePropertyValue}

// Binary and non-binary values
\p{UnicodeBinaryPropertyName}

// Negation: \P is negated \p
\P{UnicodePropertyValue}
\P{UnicodeBinaryPropertyName}

Unicode property escapes – JavaScript | MDN (mozilla.org)
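A minimal sketch of the syntax forms above. The properties used here (Script=Han, the general category Lu for uppercase letters, and L for letters) are examples picked for illustration, and the sample string is an assumption:

```javascript
const mixed = "Hello 世界 123";

// \p{Name=Value}: characters whose Script property is Han
console.log(mixed.match(/\p{Script=Han}+/u)[0]); // "世界"

// \p{Value}: a General_Category value can be written without the property name
console.log(mixed.match(/\p{Lu}/gu)); // ["H"]

// \P{...} negates: runs of characters that are not letters
console.log(mixed.match(/\P{L}+/gu)); // [" ", " 123"]
```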

Advanced search with flags

Regular expressions support optional flags, which can be used individually or in combination; seven of them are described here

Note that flags are part of a regular expression and cannot be added or removed once the regular expression has been created

Flags are added when creating a regular expression literal or object

/pattern/flags;

const re = new RegExp("pattern", "flags");

The available flags, their meanings, and the corresponding properties are as follows

  • d (RegExp.prototype.hasIndices): generates the start and end indices of matched substrings, available from the indices array of the RegExpArray
  • g (RegExp.prototype.global): global search. For String methods it means searching through the entire string; for RegExp methods it makes the search iterative, so calling the method repeatedly finds successive matches in the string
  • i (RegExp.prototype.ignoreCase): case-insensitive search
  • m (RegExp.prototype.multiline): multi-line search. In strings containing more than one line, the assertions ^ and $ match the beginning and end of each line instead of only the beginning and end of the entire string
  • s (RegExp.prototype.dotAll): allows . to match line terminators
  • u (RegExp.prototype.unicode): "unicode"; treats the pattern as a sequence of Unicode code points
  • y (RegExp.prototype.sticky): sticky search. The match must start exactly at the index given by the lastIndex property of the regular expression object. If no match starts at that index, the search fails and lastIndex is reset to 0; if one does, the match is returned and lastIndex is updated to the index just past the end of the match, ready for the next attempt. "Sticky" means that the search continues from the current index only as long as the matches are contiguous; otherwise lastIndex returns to 0 and the search starts over. When y and g are used together, exec() and test() ignore the g flag

Each of the flags above corresponds to a read-only Boolean property of the RegExp instance

Note that the start position of each search with the g or y flag is determined by the lastIndex property of the regular expression object. This property is writable, so the start position of the search can be customized by setting lastIndex
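The interaction between the y flag and lastIndex is easier to follow when traced step by step. A sketch with assumed sample strings, which also shows the read-only flag properties and the effect of the m flag:

```javascript
const sticky = /foo/y;
const str = "foofoo bar foo";

console.log(sticky.test(str), sticky.lastIndex); // true 3
console.log(sticky.test(str), sticky.lastIndex); // true 6
console.log(sticky.test(str), sticky.lastIndex); // false 0 (index 6 is a space)

sticky.lastIndex = 11;                           // lastIndex is writable
console.log(sticky.test(str), sticky.lastIndex); // true 14

// Flags are exposed as read-only properties.
const re = /a/gi;
console.log(re.flags, re.global, re.ignoreCase, re.multiline); // "gi" true true false

// With m, ^ also matches right after a line break.
console.log("first\nsecond".match(/^\w+/g));  // ["first"]
console.log("first\nsecond".match(/^\w+/gm)); // ["first", "second"]
```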

Use regular expressions in JavaScript

Regular expressions are mainly used through the methods of RegExp and several methods of String that accept a regular expression argument

RegExp related methods

RegExp.prototype.exec(str)

This method performs a match search in the specified string and returns either a result array (RegExpArray, an array with additional properties) or null

The string to search is passed in as the method argument

When the flags of a regular expression include g or y, the regular expression object is stateful. The state is stored in its lastIndex property, which is the index at which the next search starts, that is, the position just past the previous match. For a regular expression with the g flag, each successful exec() updates lastIndex so that the next search starts from the new position. When the string is exhausted, exec() returns null and lastIndex returns to 0, so the following call starts from the beginning again. For a regular expression with the y flag, following the sticky search rule, the search continues from the current position only if the match is contiguous; otherwise lastIndex returns to 0 and the search starts over.

Note that lastIndex is not reset when a later exec() call is given a different string; it returns to 0 only when a call fails to match and returns null

RegExpArray

When an exec() match succeeds, a RegExpArray is returned, which is an array with additional properties. With the g or y flag set, each successful match also updates the lastIndex property of the RegExp object. If the match fails, null is returned and lastIndex is set to 0

The returned RegExpArray has the full match as its first element [0], followed by the match of each capture group (in the order in which the opening parentheses of the groups appear). The array has these additional properties:

  • input: the original input string
  • index: the start index of the full match in the string
  • groups: an object holding the named capture groups, where each key is a group name and each value is that group's match, or undefined if the group did not take part in the match; with the result stored in a variable called result, a group is read as result.groups.name
  • indices: present when the d flag is set; a two-dimensional array in which each inner array holds the start and end index of one match (the full match first, then each capture group). It also has its own groups property, which records the start and end indices of the named capture groups
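The stateful behaviour of exec() and the extra properties of its result can be traced in a short sketch. The pattern, the group name digit and the input are assumptions for illustration, and the d flag needs an engine with ES2022 support:

```javascript
const pairRe = /(\w)(\d)/g;
const input = "a1 b2";
const seen = [];
let result;

// Each successful exec() moves lastIndex; null ends the loop and resets it.
while ((result = pairRe.exec(input)) !== null) {
  seen.push({
    match: result[0],
    groups: [result[1], result[2]],
    index: result.index,
    next: pairRe.lastIndex,
  });
}
console.log(seen.map((s) => `${s.match}@${s.index}->${s.next}`)); // ["a1@0->2", "b2@3->5"]
console.log(pairRe.lastIndex); // 0

// With the d flag the result also carries start/end indices.
const withIndices = /b(?<digit>\d)/d.exec(input);
console.log(withIndices.indices[0]);           // [3, 5]
console.log(withIndices.indices.groups.digit); // [4, 5]
console.log(withIndices.input === input);      // true
```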

RegExp.prototype.test(str)

This method tests whether the string contains a match for the regular expression and returns a Boolean value

As with exec(), when the regular expression has the g or y flag set, each call updates the lastIndex property of the RegExp object. If a different string is passed on a later call, lastIndex is not reset; it returns to 0 only when a call returns false
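This statefulness is a common source of surprising results when one global regular expression is reused. A sketch with assumed sample strings:

```javascript
const globalRe = /foo/g;

const r1 = globalRe.test("foo bar"); // true, lastIndex becomes 3
const r2 = globalRe.test("foo bar"); // false: the search started at index 3
console.log(r1, r2, globalRe.lastIndex); // true false 0

// lastIndex is carried over even when the string changes.
const r3 = globalRe.test("foo");    // true, lastIndex becomes 3
const r4 = globalRe.test("foobar"); // false: "foo" sits before index 3
console.log(r3, r4); // true false

// Without g or y, test() is stateless.
const plainRe = /foo/;
console.log(plainRe.test("foo"), plainRe.test("foo")); // true true
```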

String related methods

String.prototype.match(regexp)

Parameter

Searches the string for matches of the regular expression argument and returns either an array of matches or null. If no argument is passed, an array containing an empty string is returned

When the argument is not a RegExp object, it is implicitly converted to one using new RegExp(). Note that if a positive number with an explicit sign is passed in, the + is dropped, so +10086 results in /10086/

Return value

Returns null if there is no match. If there is a match, what is returned depends on whether the g flag is set

If the g flag is set, all full matches are returned, without the matches of capture groups

If the g flag is not set, the result is the same as that of RegExp.prototype.exec(str): an array with additional properties whose first element is the full match, followed by the matches of the capture groups
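The difference between the two return shapes is shown in this sketch; the input string is an assumption for illustration:

```javascript
const s = "2021-09 and 2022-10";

// With g: every full match, no capture groups.
console.log(s.match(/(\d{4})-(\d{2})/g)); // ["2021-09", "2022-10"]

// Without g: the first match only, with capture groups and extra properties.
const first = s.match(/(\d{4})-(\d{2})/);
console.log(first[0], first[1], first[2]); // "2021-09" "2021" "09"
console.log(first.index, first.input === s); // 0 true

console.log("abc".match(/x/)); // null
console.log("abc".match());    // [""]
```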

String.prototype.matchAll(regexp)

Returns an iterator over all matches; the iterator cannot be restarted. Each iteration yields a RegExpArray (the same type that RegExp.prototype.exec() returns).

Characteristics of the method

  • The argument must be a regular expression object with the g flag set. An argument that is not a RegExp object is implicitly converted to a RegExp with the g flag; a RegExp without the g flag causes a TypeError to be thrown
  • Before matchAll() existed, getting all matches and their details (capture group matches, start indices, and so on) required a regular expression with the g flag, RegExp.prototype.exec(), and a loop. Now the iterator returned by matchAll() can be consumed with for...of, Array.from(), or the spread operator ... to get everything more simply
  • matchAll() internally creates a clone of the regular expression passed in, so the lastIndex of the original regular expression object is not changed (unlike exec(), which updates lastIndex on every call)
  • One advantage of matchAll() over match() is better access to capture groups. As described above, when the g flag is set match() returns only the full matches, and without the g flag it returns the capture groups of the first match only. The iterator returned by matchAll() gives the capture group information for every match
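The points in the list above can be checked with a brief sketch; the pattern and input are assumptions for illustration:

```javascript
const dates = "2021-09 and 2022-10";
const dateRe = /(\d{4})-(\d{2})/g;

// Every match keeps its capture groups and index.
const all = [...dates.matchAll(dateRe)];
console.log(all.map((m) => [m[1], m[2], m.index])); // [["2021", "09", 0], ["2022", "10", 12]]

// The original regular expression is cloned, so its lastIndex is untouched.
console.log(dateRe.lastIndex); // 0

// A RegExp without the g flag is rejected.
let errorName = null;
try {
  dates.matchAll(/\d/);
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // "TypeError"
```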

String.prototype.search(regexp)

Performs a match search in the string

Returns the start index of the matching item if the full regular expression is found, or -1 otherwise

String.prototype.replace(regexp, newSubstr|replacerFunction)

Returns a new string with the matches replaced; the original string is left unchanged

Parameters

The method can also be called as replace(substr, newSubstr|replacerFunction). When the first argument is a string, only the first occurrence of the substring substr is replaced

To replace by pattern, the first argument must be a regular expression object or literal (for example, new RegExp("abc") or /abc/). A string that merely looks like a regular expression, such as "/abc/", is treated as a plain substring. When the regular expression has the g flag set, all full matches are replaced

The second argument can be a string that replaces each full match, or a function that generates that string.

  • When a string is used for the replacement, it can be a plain string, or it can contain the special replacement patterns below

    Note that these patterns are written inside the quoted replacement string

    Pattern     Inserts
    $$          Inserts a "$"
    $&          Inserts the substring matched by the full regular expression
    $`          Inserts the part of the original string before the matched substring
    $'          Inserts the part of the original string after the matched substring
    $n          n is a positive integer less than 100; inserts the match of the n-th capture group. If that group does not exist or the first argument is not a RegExp object, it is treated as the literal text $n
    $<name>     name is the name of a named capture group; inserts the match of that group. If the first argument is not a RegExp object or the pattern has no named groups, it is treated as the literal text $<name>
  • When a function is used, it is called after the match has been performed, and its return value is used as the replacement string. If the first argument is a RegExp object with the g flag set, the function is called once for each full match

    The function takes the following arguments:

    Possible name   Supplied value
    match           The substring matched by the full regular expression (corresponds to $&)
    p1, p2, …       The match of the n-th capture group (corresponds to $n)
    offset          The start index of the current match within the whole string
    string          The string being searched, that is, the string on which the method was called
    groups          An object whose keys are the names of the named capture groups and whose values are the corresponding matches

When the second argument is an empty string, the matches are removed
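A sketch of the replacement forms described above; all sample strings and the group names y, m and d are assumptions for illustration:

```javascript
// String first argument: only the first occurrence is replaced.
console.log("a-b-c".replace("-", "+"));  // "a+b-c"
// RegExp with g: every match is replaced.
console.log("a-b-c".replace(/-/g, "+")); // "a+b+c"

// Special replacement patterns.
console.log("John Smith".replace(/(\w+)\s(\w+)/, "$2, $1")); // "Smith, John"
console.log("price: 5".replace(/\d+/, "$$$&"));               // "price: $5"
console.log(
  "2021-09-30".replace(/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/, "$<d>/$<m>/$<y>")
); // "30/09/2021"

// Replacer function: called once per match with (match, ..., offset, string).
const kebab = "borderTopWidth".replace(
  /[A-Z]/g,
  (match, offset) => (offset > 0 ? "-" : "") + match.toLowerCase()
);
console.log(kebab); // "border-top-width"

// An empty replacement removes the matches.
console.log("a1b2c3".replace(/\d/g, "")); // "abc"
```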

String.prototype.replaceAll()

The difference between replaceAll() and replace() is that replaceAll() replaces all matches, and if the first argument is a RegExp it must have the g flag set, otherwise a TypeError is thrown
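A short sketch of this difference, with an assumed sample string:

```javascript
const dashed = "a-b-c";

console.log(dashed.replace("-", "+"));    // "a+b-c": first occurrence only
console.log(dashed.replaceAll("-", "+")); // "a+b+c": every occurrence
console.log(dashed.replaceAll(/-/g, "+")); // "a+b+c"

// A RegExp without the g flag is rejected by replaceAll().
let thrown = null;
try {
  dashed.replaceAll(/-/, "+");
} catch (err) {
  thrown = err.name;
}
console.log(thrown); // "TypeError"
```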

String.prototype.split(separator?, limit?)

Splits a string at each occurrence of a separator and returns an array containing the resulting substrings

Parameters

  • separator: can be a plain string or a regular expression. If the regular expression contains capture groups, the matches of the capture groups are also included in the returned array, at the position of each separator. If the argument is an array, it is converted to a string
  • limit: a non-negative integer that limits the number of elements in the returned array

Return value

If the string is empty but the separator is not an empty string, an array containing one empty string is returned. If both the string and the separator are empty strings, an empty array is returned. If the separator is not found or no separator is passed in, an array containing the entire string is returned. Otherwise, an array of the substrings between the separators is returned
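The parameter and return value rules can be checked case by case. A sketch with assumed inputs:

```javascript
// Regular expression separator, with and without a capture group.
console.log("a, b ,c".split(/\s*,\s*/)); // ["a", "b", "c"]
console.log("a1b2c".split(/(\d)/));      // ["a", "1", "b", "2", "c"]

// limit caps the number of elements.
console.log("a,b,c".split(",", 2)); // ["a", "b"]

// Edge cases for the return value.
console.log("".split(","));    // [""]
console.log("".split(""));     // []
console.log("abc".split());    // ["abc"]
console.log("abc".split("-")); // ["abc"]
```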

The problem with using split("") to turn a string into an array of characters

In this case the string is split by UTF-16 code units, which breaks surrogate pairs. Some characters are encoded not as one code unit but as two (a pair made of a leading and a trailing surrogate). Splitting by code unit puts the two halves of such a character into the array as separate elements, so the returned result contains garbled characters

So to turn a string into an array of characters, use the spread operator [...str], Array.from(str), str.split(/(?=[\s\S])/u) or str.split(/(?=.)/us) (these use the lookahead assertion x(?=y) with x being the empty string ""), or loop over the characters one by one and push them into a new array
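The alternatives listed above can be compared on a string that contains one surrogate pair. The emoji U+1F600 is an assumed sample character:

```javascript
const face = "a\u{1F600}b"; // "a", one emoji (two code units), "b"

// split("") cuts the emoji into two lone surrogates.
console.log(face.split("").length); // 4

// These approaches keep the emoji intact.
console.log([...face].length);                 // 3
console.log(Array.from(face).length);          // 3
console.log(face.split(/(?=[\s\S])/u).length); // 3
console.log(face.split(/(?=.)/us).length);     // 3
console.log([...face][1] === "\u{1F600}");     // true
```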

Common properties of RegExp and RegExpArray

Using the RegExp related methods can change properties of the RegExp instance, notably lastIndex

RegExp

  • source: the pattern text of the regular expression
  • lastIndex: the index at which the next match starts (the index of the last character of the previous match plus 1)
  • hasIndices
  • global
  • dotAll
  • sticky
  • unicode
  • multiline
  • ignoreCase

RegExpArray

  • input: the original input string
  • index: the start index of the full match in the string
  • groups: an object holding the named capture groups, where each key is a group name and each value is that group's match, or undefined if the group did not take part in the match
  • indices: present when the d flag is set; a two-dimensional array in which each inner array holds the start and end index of one match. It also has its own groups property, which records the start and end indices of the named capture groups

Reference

Regular expressions – JavaScript | MDN (mozilla.org)