Introduction

Have you ever searched for text, tried expression after expression, and still failed to get a match?

Have you ever validated a form by just going through the motions (as long as the field is not empty), and then burned incense and prayed that nothing would go wrong?

Have you ever used the sed or grep commands and wondered why metacharacters that should be supported did not match?

Or maybe none of that applies, and you simply call replace over and over again, telling yourself “replace is great” while other people write succinct and efficient expressions.

Why should you learn regular expressions? As one netizen put it: in the legends of the jianghu (the martial-arts world), the programmer's regular expression, the doctor's prescription, and the Taoist priest's talisman are equally famous as the three magical objects that ordinary people cannot understand. This legend tells us at least two things. First, regular expressions are impressive: they share a reputation with the doctor's prescription and the Taoist talisman, and everyone talks about them, which shows their standing. Second, regular expressions are hard, which also means that if you master them and use them fluently, you will be well on your way to success (don't ask me who said that…)!

Clearly, I don't need to say much more in praise of regular expressions. Here is a passage from the introduction to Jeffrey Friedl's Mastering Regular Expressions:

“If there were a list of great inventions in computer software, I believe there would be no more than twenty, and in that list, of course, there would be packet-based networks, the Web, Lisp, hashing, UNIX, compiler technology, relational models, object-oriented, XML. Regular expressions should definitely not be left out.

Regular expressions are a panacea for many practical tasks, improving development efficiency and program quality a hundredfold, and the key role they play in bioinformatics and human genome mapping is well known. Mr. Jiang Tao, the founder of CSDN, experienced the power of this tool in his early years when he was developing professional software products and was always impressed.”

So there’s no reason not to understand regular expressions, or even master them and use them.

This article starts with regular expression syntax and then explains, step by step and with concrete examples, how regex matching works. The code examples are in JS, PHP, Python, and Java (some matching patterns are not supported in JS, so other languages are needed). The content covers both beginner and advanced skills, so it suits readers who are learning or leveling up. The article tries to be simple and easy to understand, but to be comprehensive it covers a lot of ground, about 12K words in total. It is long, so please read patiently, and if you have difficulty reading it, please contact me.

Review of history

The origins of regular expressions can be traced back to early studies of how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way to describe these neural networks.

In 1956, the mathematician Stephen Kleene, building on McCulloch and Pitts' earlier work, published a paper entitled “Representation of Events in Nerve Nets and Finite Automata” that introduced the concept of regular expressions.

Later, Ken Thompson, who went on to become a principal inventor of Unix, applied this work to a computational search algorithm. The QED editor he worked on (QED came out in 1966, half a century ago) became the first application to use regular expressions.

Since then, regular expressions have been a staple of text processing tools, and almost every programming language, including JavaScript, provides built-in support for them.

Definition of a regular expression

A regular expression is a text template made up of ordinary characters and special characters (metacharacters and quantifiers). Here are two simple regular expressions that both match consecutive digits:

/[0-9]+/
/\d+/

Here “\d” is a metacharacter and “+” is a quantifier.

Metacharacters

Metacharacter  Description
.   Matches any character except a newline
\d  Matches a digit, equivalent to the character group [0-9]
\w  Matches a letter, digit, or underscore (some Unicode-aware engines also match Chinese characters; JavaScript does not)
\s  Matches any whitespace character (tabs, spaces, newlines, etc.)
\b  Matches the beginning or end of a word
^   Matches the beginning of a line
$   Matches the end of a line

Negated metacharacters

Metacharacter  Description
\D    Matches any character that is not a digit, equivalent to [^0-9]
\W    Matches any character that \w does not match (anything other than a letter, digit, or underscore)
\S    Matches any character that is not whitespace
\B    Matches a position that is not the beginning or end of a word
[^x]  Matches any character except x

As you can see, regular expressions are strictly case-sensitive: \d and \D have opposite meanings.

Repetition quantifiers

There are six quantifiers. If x is the number of repetitions, the following rules apply:

Quantifier  Description
* x>=0
+ x>=1
? x=0 or x=1
{n} x=n
{n,} x>=n
{n,m} n<=x<=m
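
As a quick illustration of the six quantifiers, the following sketch checks each rule against runs of the letter a. The helper name matchesWhole and the sample strings are assumptions made for this example; they are not part of the original text.

```javascript
// Hypothetical helper: true when the pattern matches the whole string.
function matchesWhole(pattern, text) {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const quantifierChecks = {
  star: matchesWhole("a*", ""),            // x >= 0, so zero a's is accepted
  plus: matchesWhole("a+", ""),            // x >= 1, so zero a's is rejected
  optional: matchesWhole("a?", "aa"),      // x = 0 or 1, so two a's is rejected
  exact: matchesWhole("a{3}", "aaa"),      // x = 3
  atLeast: matchesWhole("a{2,}", "aaaa"),  // x >= 2
  between: matchesWhole("a{2,3}", "aaaa")  // 2 <= x <= 3, so four a's is rejected
};

console.log(quantifierChecks);
```

Wrapping the pattern in ^(?:…)$ forces a whole-string match, which makes the repetition limits visible; without the anchors, a{2,3} would still find a match inside "aaaa".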

Character groups

[…] matches any one of the characters in the brackets. For example, [xyz] matches the character x, y, or z. Most metacharacters inside the brackets lose their special meaning and are treated as ordinary characters. For example, [+.?] matches a plus sign, a dot, or a question mark.

Negated character groups

[^…] matches any one character that is not listed. For example, [^x] matches any character except x.

Alternation

| means “or”: either side may match. For example, a|b matches the character a or b.

Parentheses

Parentheses are often used to delimit the scope of a quantifier and to group characters. For example, (ab)+ can match ab, abab, and so on, where ab is a group.

Escape character

\ is the escape character. The characters that usually need to be escaped are \ * + ? | { [ ( ) ] } ^ $ . # and whitespace.
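
When a pattern is built from arbitrary text, escaping by hand is error-prone. The sketch below shows one common way to escape a string before handing it to the RegExp constructor. The helper name escapeRegExp and the sample text are assumptions for illustration, and the character list covers the JavaScript metacharacters only.

```javascript
// Hypothetical helper: put a backslash before every character that has a
// special meaning in a JavaScript regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literal = "1+1=2 (really?)";
const escaped = escapeRegExp(literal);
const exactPattern = new RegExp("^" + escaped + "$");

console.log(escaped);                     // the text with metacharacters escaped
console.log(exactPattern.test(literal));  // the escaped pattern matches the text itself
```

In the replacement string, $& stands for the matched character, so each metacharacter is kept and simply prefixed with a backslash.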

Operator precedence

  1. \ (escape character)
  2. (), (?:), (?=), [] (parentheses and square brackets)
  3. *, +, ?, {n}, {n,}, {n,m} (quantifiers)
  4. ^ and $ (position anchors)
  5. | (the “or” operation)
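
The precedence order is easiest to see by comparing patterns that differ only in grouping. The following sketch uses made-up sample strings; the variable names are illustrative only.

```javascript
// A quantifier binds to the single item before it: + applies only to "b".
const quantifierOnB = "ababbb".match(/ab+/g);
// Parentheses bind first, so here + applies to the whole group "ab".
const quantifierOnGroup = "ababbb".match(/(?:ab)+/g);

// | has the lowest precedence: this reads as (^ab) or (cd$).
const looseAlternation = /^ab|cd$/.test("abXYZ");
// Grouping the alternation makes both anchors apply to each branch.
const groupedAlternation = /^(?:ab|cd)$/.test("abXYZ");

console.log(quantifierOnB, quantifierOnGroup);
console.log(looseAlternation, groupedAlternation);
```

Because | binds last, an anchored alternation almost always needs a group around the branches.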

A quick test

As a test, here is a regular expression that matches a mobile phone number:

(\+86)?1\d{10}

① “\+86” matches the text “+86”. The question mark metacharacter after the group means it may match 0 or 1 times, so “(\+86)?” matches “+86” or the empty string “”.

② The ordinary character “1” matches the text “1”.

③ The metacharacter “\d” matches a digit from 0 to 9, and the interval quantifier “{10}” means 10 repetitions, so “\d{10}” matches 10 consecutive digits.

Taken together, the expression matches text such as “13800138000” or “+8613800138000”.
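
The mobile number pattern can be checked with a few lines of JavaScript. This is a minimal sketch: the ^ and $ anchors and the sample numbers are assumptions added here so that test() checks the whole string rather than a substring.

```javascript
// Optional "+86" prefix, then a 1 followed by exactly ten digits.
const mobile = /^(\+86)?1\d{10}$/;

const samples = ["13800138000", "+8613800138000", "1380013800", "23800138000"];
const results = samples.map(function (sample) {
  return mobile.test(sample);
});

console.log(results);
```

The third sample is one digit short and the fourth does not start with 1, so only the first two are accepted.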

Modifiers

As of ES6, regular expressions in javaScript support the following five modifiers:

  • g (global search: find every match in the text instead of stopping at the first one)
  • i (case-insensitive search)
  • m (multi-line search)
  • y (sticky modifier, added in ES6)
  • u (Unicode mode, added in ES6)
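
To make the effect of each modifier concrete, here is a small sketch that runs the same text through patterns with different modifiers. The sample strings and variable names are assumptions chosen for illustration.

```javascript
const text = "Cat\ncat";

// No modifier: only the first, case-sensitive match is returned.
const plain = text.match(/cat/);
// g + i: every match, ignoring case.
const globalIgnoreCase = text.match(/cat/gi);
// Without m, ^ matches only at the very start of the text.
const startOnly = text.match(/^cat/gi);
// With m, ^ also matches at the start of each line.
const everyLine = text.match(/^cat/gim);

// u: a character outside the Basic Multilingual Plane counts as one character.
const emoji = "\u{1F600}";
const withoutU = /^.$/.test(emoji);
const withU = /^.$/u.test(emoji);

// y: the match must start exactly at lastIndex.
const sticky = /cat/y;
sticky.lastIndex = 1;
const stickyAtOne = sticky.test("a cat");
sticky.lastIndex = 2;
const stickyAtTwo = sticky.test("a cat");

console.log(plain.index, globalIgnoreCase, startOnly, everyLine);
console.log(withoutU, withU, stickyAtOne, stickyAtTwo);
```

The sticky pattern fails at index 1 because “cat” begins at index 2; a g pattern started at the same lastIndex would have scanned forward and found it.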

Common regular expressions

  1. Chinese characters: ^[\u4e00-\u9fa5]{0,}$
  2. Email: ^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
  3. URL: ^https?://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?$
  4. Mobile phone number: ^1\d{10}$
  5. ID number: ^(\d{15}|\d{17}(\d|X))$
  6. Zip code (6 digits): [1-9]\d{5}(?!\d)

Password validation

Password validation is a common requirement. In general, a password must meet the following rules: it is 6 to 16 characters long, is made up of digits, letters, and symbols, contains at least two of those types of characters, and cannot contain spaces or Chinese characters. Here is a regular expression for this kind of password validation:

var reg = /(?!^[0-9]+$)(?!^[A-Za-z]+$)(?!^[^A-Za-z0-9]+$)^[^\s\u4e00-\u9fa5]{6,16}$/;
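
The following sketch exercises a password pattern of this shape against a handful of sample passwords. It is one reading of the rule, not a definitive implementation: the letter class is written as [A-Za-z] (the shorter [A-z] would also cover the punctuation characters that sit between Z and a in ASCII), and the sample passwords are made up for the example.

```javascript
// Three negative lookaheads reject passwords made of a single character type
// (digits only, letters only, symbols only); the final part enforces
// 6 to 16 characters with no whitespace and no Chinese characters.
const passwordRule = /(?!^[0-9]+$)(?!^[A-Za-z]+$)(?!^[^A-Za-z0-9]+$)^[^\s\u4e00-\u9fa5]{6,16}$/;

const passwordChecks = {
  lettersAndDigits: passwordRule.test("abc123"),
  lettersAndSymbols: passwordRule.test("abc!@#"),
  digitsOnly: passwordRule.test("123456"),
  lettersOnly: passwordRule.test("abcdef"),
  symbolsOnly: passwordRule.test("!@#$%^"),
  containsSpace: passwordRule.test("abc 123"),
  tooShort: passwordRule.test("ab1"),
  containsChinese: passwordRule.test("abc123\u4e2d")
};

console.log(passwordChecks);
```

Only the first two samples mix two character types within the length limit, so they are the only ones accepted.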

The major regex families

Regular expression classification

On Linux and OSX, there are at least three common regular expression flavors:

  • Basic Regular Expressions (Basic RegEx, or BREs for short)
  • Extended Regular Expressions (Extended RegEx, or EREs for short)
  • Perl Regular Expressions (Perl RegEx, or PREs for short)

Regular expression comparison

Each row lists the character, its description, and then the equivalent in Basic RegEx (BRE), Extended RegEx (ERE), Python RegEx, and Perl RegEx.

\  Escape character. BRE: \; ERE: \; Python: \; Perl: \
^  Matches the beginning of a line. For example, ‘^dog’ matches a line that begins with the string dog. (Note: in awk, ‘^’ matches the beginning of the string.) BRE: ^; ERE: ^; Python: ^; Perl: ^
$  Matches the end of a line. For example, ‘dog$’ matches a line that ends with the string dog. (Note: in awk, ‘$’ matches the end of the string.) BRE: $; ERE: $; Python: $; Perl: $
^$  Matches a blank line. BRE: ^$; ERE: ^$; Python: ^$; Perl: ^$
^string$  Matches a whole line. For example, ‘^dog$’ matches a line that contains only the string dog. BRE: ^string$; ERE: ^string$; Python: ^string$; Perl: ^string$
\<  Matches the beginning of a word. For example, ‘\<frog’ (equivalent to ‘\bfrog’) matches words that start with frog. BRE: \<; ERE: \<; Python: not supported (use \b, e.g. ‘\bfrog’); Perl: not supported (use \b, e.g. ‘\bfrog’)
\>  Matches the end of a word. For example, ‘frog\>’ (equivalent to ‘frog\b’) matches words that end with frog. BRE: \>; ERE: \>; Python: not supported (use \b, e.g. ‘frog\b’); Perl: not supported (use \b, e.g. ‘frog\b’)
\<x\>  Matches a whole word or a specific character. For example, ‘\<frog\>’ (equivalent to ‘\bfrog\b’). BRE: \<x\>; ERE: \<x\>; Python: not supported (use \b, e.g. ‘\bfrog\b’); Perl: not supported (use \b, e.g. ‘\bfrog\b’)
()  Matches a subexpression. For example, ‘(frog)’. BRE: not supported (use \(\) instead, e.g. ‘\(dog\)’); ERE: (); Python: (); Perl: ()
\(\)  Matches a subexpression. For example, ‘\(frog\)’. BRE: \(\); ERE: not supported (same as ()); Python: not supported (same as ()); Perl: not supported (same as ())
?  Matches the previous subexpression 0 or 1 times (equivalent to {0,1}). For example, ‘where(is)?’ matches “where” and “whereis”. BRE: not supported (same as \?); ERE: ?; Python: ?; Perl: ?
\?  Matches the previous subexpression 0 or 1 times (equivalent to ‘\{0,1\}’). For example, ‘where\(is\)\?’ matches “where” and “whereis”. BRE: \?; ERE: not supported (same as ?); Python: not supported (same as ?); Perl: not supported (same as ?)
? (after another quantifier)  When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, for the string “oooo”, ‘o+?’ matches a single ‘o’ and ‘o+’ matches all the ‘o’s. BRE: not supported; ERE: not supported; Python: supported; Perl: supported
.  Matches any single character other than a newline (‘\n’). (Note: in awk, the period can also match a newline.) To match any character including ‘\n’ in Python or Perl, use ‘[\s\S]’. BRE: .; ERE: .; Python: .; Perl: .
*  Matches the previous subexpression 0 or more times (equivalent to {0,}). For example, zo* matches “z” and “zoo”. BRE: *; ERE: *; Python: *; Perl: *
\+  Matches the previous subexpression one or more times (equivalent to ‘\{1,\}’). For example, ‘where\(is\)\+’ matches “whereis” and “whereisis”. BRE: \+; ERE: not supported (same as +); Python: not supported (same as +); Perl: not supported (same as +)
+  Matches the previous subexpression one or more times (equivalent to {1,}). For example, zo+ matches “zo” and “zoo”, but not “z”. BRE: not supported (same as \+); ERE: +; Python: +; Perl: +
{n}  n must be 0 or a positive integer; matches the subexpression exactly n times. For example: zo{2}. BRE: not supported (same as \{n\}); ERE: {n}; Python: {n}; Perl: {n}
{n,}  n must be 0 or a positive integer; matches the subexpression at least n times. For example: go{2,}. BRE: not supported (same as \{n,\}); ERE: {n,}; Python: {n,}; Perl: {n,}
{n,m}  m and n are non-negative integers with n <= m; matches at least n and at most m times. For example, o{1,3} matches the first three o's in “fooooood” (note that there is no space between the comma and the two numbers). BRE: not supported (same as \{n,m\}); ERE: {n,m}; Python: {n,m}; Perl: {n,m}
x|y  Matches x or y. BRE: not supported (same as x\|y); ERE: x|y; Python: x|y; Perl: x|y
[0-9]  Matches any digit from 0 to 9 (note: the range must be written in ascending order). BRE: [0-9]; ERE: [0-9]; Python: [0-9]; Perl: [0-9]
[xyz]  Character set; matches any one of the contained characters. For example, ‘[abc]’ matches the ‘a’ in ‘lay’ (note: metacharacters such as . and * become ordinary characters when placed inside []). BRE: [xyz]; ERE: [xyz]; Python: [xyz]; Perl: [xyz]
[^xyz]  Negated character set; matches any character that is not contained (note: newlines excluded). For example, ‘[^abc]’ matches the ‘L’ in ‘Lay’ (note: in awk, [^xyz] also matches newlines). BRE: [^xyz]; ERE: [^xyz]; Python: [^xyz]; Perl: [^xyz]
[A-Za-z]  Matches any uppercase or lowercase letter (note: the range must be written in ascending order). BRE: [A-Za-z]; ERE: [A-Za-z]; Python: [A-Za-z]; Perl: [A-Za-z]
[^A-Za-z]  Matches any character except uppercase and lowercase letters (note: the range must be written in ascending order). BRE: [^A-Za-z]; ERE: [^A-Za-z]; Python: [^A-Za-z]; Perl: [^A-Za-z]
\d  Matches any digit from 0 to 9 (equivalent to [0-9]). BRE: not supported; ERE: not supported; Python: \d; Perl: \d
\D  Matches any non-digit character (equivalent to [^0-9]). BRE: not supported; ERE: not supported; Python: \D; Perl: \D
\S  Matches any non-whitespace character (equivalent to [^ \f\n\r\t\v]). BRE: not supported; ERE: not supported; Python: \S; Perl: \S
\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on (equivalent to [ \f\n\r\t\v]). BRE: not supported; ERE: not supported; Python: \s; Perl: \s
\W  Matches any non-word character (equivalent to [^A-Za-z0-9_]). BRE: \W; ERE: \W; Python: \W; Perl: \W
\w  Matches any word character, including the underscore (equivalent to [A-Za-z0-9_]). BRE: \w; ERE: \w; Python: \w; Perl: \w
\B  Matches a non-word boundary. For example, ‘er\B’ matches the ‘er’ in ‘verb’, but not the ‘er’ in ‘never’. BRE: \B; ERE: \B; Python: \B; Perl: \B
\b  Matches a word boundary, that is, the position between a word and a space. For example, ‘er\b’ matches the ‘er’ in ‘never’, but not the ‘er’ in ‘verb’. BRE: \b; ERE: \b; Python: \b; Perl: \b
\t  Matches a horizontal tab (equivalent to \x09 and \cI). BRE: not supported; ERE: not supported; Python: \t; Perl: \t
\v  Matches a vertical tab (equivalent to \x0b and \cK). BRE: not supported; ERE: not supported; Python: \v; Perl: \v
\n  Matches a newline character (equivalent to \x0a and \cJ). BRE: not supported; ERE: not supported; Python: \n; Perl: \n
\f  Matches a form feed character (equivalent to \x0c and \cL). BRE: not supported; ERE: not supported; Python: \f; Perl: \f
\r  Matches a carriage return character (equivalent to \x0d and \cM). BRE: not supported; ERE: not supported; Python: \r; Perl: \r
\\  Matches the escape character “\” itself. BRE: \\; ERE: \\; Python: \\; Perl: \\
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in A-Z or a-z; otherwise c is treated as a literal ‘c’ character. BRE: not supported; ERE: not supported; Python: not supported; Perl: \cx
\xn  Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, ‘\x41’ matches ‘A’, and ‘\x041’ is equivalent to ‘\x04’ followed by ‘1’. This makes it possible to use ASCII codes in regular expressions. BRE: not supported; ERE: not supported; Python: \xn; Perl: \xn
\num  Matches num, where num is a positive integer; it is a backreference to the match captured by group num. BRE: \num; ERE: \num (GNU tools); Python: \num; Perl: \num
[:alnum:]  Matches any letter or digit ([A-Za-z0-9]). For example: ‘[[:alnum:]]’. BRE: [:alnum:]; ERE: [:alnum:]; Python: not supported; Perl: [:alnum:]
[:alpha:]  Matches any letter ([A-Za-z]). For example: ‘[[:alpha:]]’. BRE: [:alpha:]; ERE: [:alpha:]; Python: not supported; Perl: [:alpha:]
[:digit:]  Matches any digit ([0-9]). For example: ‘[[:digit:]]’. BRE: [:digit:]; ERE: [:digit:]; Python: not supported; Perl: [:digit:]
[:lower:]  Matches any lowercase letter ([a-z]). For example: ‘[[:lower:]]’. BRE: [:lower:]; ERE: [:lower:]; Python: not supported; Perl: [:lower:]
[:upper:]  Matches any uppercase letter ([A-Z]). BRE: [:upper:]; ERE: [:upper:]; Python: not supported; Perl: [:upper:]
[:space:]  Matches any whitespace character, including tabs and spaces. For example: ‘[[:space:]]’. BRE: [:space:]; ERE: [:space:]; Python: not supported; Perl: [:space:]
[:blank:]  Matches a space or a tab. For example: ‘[[:blank:]]’, equivalent to ‘[ \t]’. BRE: [:blank:]; ERE: [:blank:]; Python: not supported; Perl: [:blank:]
[:graph:]  Matches any visible, printable character (note: spaces and newlines are not included). For example: ‘[[:graph:]]’. BRE: [:graph:]; ERE: [:graph:]; Python: not supported; Perl: [:graph:]
[:print:]  Matches any printable character (note: does not include [:cntrl:], the end-of-string ‘\0’, or the EOF marker (-1), but does include the space). For example: ‘[[:print:]]’. BRE: [:print:]; ERE: [:print:]; Python: not supported; Perl: [:print:]
[:cntrl:]  Matches any control character (the first 32 characters of the ASCII character set, that is, decimal 0 through 31, such as newlines and tabs). For example: ‘[[:cntrl:]]’. BRE: [:cntrl:]; ERE: [:cntrl:]; Python: not supported; Perl: [:cntrl:]
[:punct:]  Matches any punctuation mark (excluding the [:alnum:], [:cntrl:], and [:space:] character sets). BRE: [:punct:]; ERE: [:punct:]; Python: not supported; Perl: [:punct:]
[:xdigit:]  Matches any hexadecimal digit (that is, 0-9, a-f, A-F). BRE: [:xdigit:]; ERE: [:xdigit:]; Python: not supported; Perl: [:xdigit:]

Notes

  • JS supports EREs-style syntax.
  • When using BREs (basic regular expressions), the symbols ?, +, |, {, }, (, and ) must be preceded by the escape character \.
  • Regular expressions of the form [[:xxxx:]] are built into PHP and are not supported in JS.

Regular expressions and common commands on Linux or OSX

I have tried writing regular expressions in grep and sed commands and often found that metacharacters did not work: sometimes they need to be escaped and sometimes they don't, and I could never figure out the rule. If you have run into the same problem, read on; you should get something out of it.

Regular expression features of grep, egrep, sed, and awk

  1. grep supports BREs, EREs, and PREs

    grep with no option uses BREs

    grep with the “-E” option uses EREs

    grep with the “-P” option uses PREs

  2. egrep supports EREs and PREs

    egrep with no option uses EREs

    egrep with the “-P” option uses PREs

  3. sed supports BREs and EREs

    sed uses BREs by default

    sed with the “-r” option uses EREs

  4. awk supports EREs and uses them by default.

Beginner regular expression skills

Greedy and non-greedy modes

By default, all quantifiers are greedy, meaning they capture as many characters as possible. Adding ? after a quantifier switches it to non-greedy mode, which captures as few characters as possible. For example:

var str = "aaab",
    reg1 = /a+/,  // greedy mode
    reg2 = /a+?/; // non-greedy mode
console.log(str.match(reg1)); // ["aaa"], all the a's are captured because of greedy mode
console.log(str.match(reg2)); // ["a"], only the first a is captured because of non-greedy mode

In fact, non-greedy mode is very useful, especially when matching HTML tags. For example, to match a paired div, pattern 1 may swallow several div tag pairs, while pattern 2 stops at the first closing div tag.

var str = "<div class='v1'><div class='v2'>test</div><input type='text'/></div>";
var reg1 = /<div.*<\/div>/;  // pattern 1: greedy
var reg2 = /<div.*?<\/div>/; // pattern 2: non-greedy
console.log(str.match(reg1)); // "<div class='v1'><div class='v2'>test</div><input type='text'/></div>"
console.log(str.match(reg2)); // "<div class='v1'><div class='v2'>test</div>"

Non-greedy mode for interval quantifiers

We usually write non-greedy mode as “*?” or “+?”. There is one more form: “{n,m}?”.

The interval quantifier “{n,m}” is also greedy by default: although the number of repetitions has an upper limit, it still matches as many times as possible until that limit is reached, while “{n,m}?” matches as few times as possible within the range.
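
A short sketch makes the difference between “{n,m}” and “{n,m}?” visible. The sample strings and variable names are assumptions for illustration.

```javascript
// Greedy: takes as many a's as the upper limit allows.
const greedyInterval = "aaaa".match(/a{1,3}/)[0];
// Non-greedy: stops as soon as the lower limit is satisfied.
const lazyInterval = "aaaa".match(/a{1,3}?/)[0];
// Non-greedy still expands when the rest of the pattern requires it:
// here three a's are needed before the b can match.
const lazyButForced = "aaaab".match(/a{1,3}?b/)[0];

console.log(greedyInterval, lazyInterval, lazyButForced);
```

The last case shows that “as few as possible” is a preference, not a limit: the engine adds repetitions one at a time until the remainder of the pattern can match.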

Note that:

  • When greedy and non-greedy modes can produce the same matching result, greedy mode is usually the more efficient of the two.
  • Any non-greedy pattern can be converted into a greedy one by changing the subexpression that the quantifier modifies.
  • Greedy mode can be combined with atomic groups (more on this later) to improve matching efficiency; non-greedy mode cannot.

Grouping

Grouping in regex is done mainly with parentheses: they enclose a subexpression as a group, and the group can be followed by a quantifier giving the number of repetitions. In the following example, the abc wrapped in parentheses is a group:

/(abc)+/.test("abc123") == true

So what are groups for? In general, a group makes it convenient to express repetition, but it is also used for capturing, as described below.

Capturing groups

A capturing group usually consists of a pair of parentheses around a subexpression. A capturing group creates a backreference, each identified by a number or a name; in js, backreferences are written mainly as $ + number or \ + number. The following is an example of a capturing group.

var color = "#808080";
var output = color.replace(/#(\d+)/, "$1" + "~~");
console.log(RegExp.$1); // 808080
console.log(output); // 808080~~

Above, (\d+) is a capturing group, and RegExp.$1 refers to what the group captured. The $ + number form is usually used outside the regular expression. The \ + number form can be used inside the regular expression itself, to match the same substring again at a different place.

var url = "www.google.google.com";
var re = /([a-z]+)\.\1/;
console.log(url.replace(re, "$1")); // "www.google.com"

Above, the repeated “google” part of the string is collapsed so that it appears only once.

Non-capturing groups

A non-capturing group is usually written as a pair of parentheses containing “?:” followed by a subexpression. A non-capturing group does not create a backreference, just as if there were no parentheses. For example:

var color = "#808080";
var output = color.replace(/#(?:\d+)/, "$1" + "~~");
console.log(RegExp.$1); // ""
console.log(output); // $1~~

Above, (?:\d+) is a non-capturing group. Since the group captures nothing, RegExp.$1 refers to the empty string.

Also, since the $1 backreference does not exist, “$1” ends up in the replacement as an ordinary string.

In fact, there is no difference in search efficiency between capturing and non-capturing groups; neither is faster than the other.
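
The difference between the two kinds of group also shows up directly in the array returned by match(), without going through the legacy RegExp.$1 property. The following is a minimal sketch; the variable names are illustrative.

```javascript
const color = "#808080";

// Capturing group: index 1 of the result holds what the group captured.
const withCapture = color.match(/#(\d+)/);
// Non-capturing group: the result holds only the overall match.
const withoutCapture = color.match(/#(?:\d+)/);

// A replacement function receives the captured text as its second argument.
const replaced = color.replace(/#(\d+)/, function (whole, digits) {
  return digits + "~~";
});

console.log(withCapture.length, withCapture[1]);
console.log(withoutCapture.length);
console.log(replaced);
```

Reading captures from the match result or from a replacement function keeps them tied to one specific match, whereas RegExp.$1 is global state that changes with every successful match.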

Named groups

Syntax: (?<name>…)

A named group is also a capturing group. It captures the matched string under a group name as well as a number, so after matching, the result can be retrieved by group name. Here is an example of a named group in Python.

import re
data = "#808080"
regExp = r"#(?P<one>\d+)"
replaceString = r"\g<one>" + "~~"
print(re.sub(regExp, replaceString, data)) # 808080~~

Compared with the standard format, Python's named group syntax has an additional uppercase P after the ?, and Python refers to the group with the “\g<name>” notation (for numbered capturing groups, Python uses the “\g<number>” notation).

Unlike Python, javaScript did not support named groups until ES2018, which added the (?<name>…) syntax.
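
In JavaScript engines that implement ES2018, the same example can be written with a named group. This is a sketch under that assumption; older engines reject the (?<name>…) syntax with a SyntaxError.

```javascript
const data = "#808080";
const namedPattern = /#(?<one>\d+)/;

// In the replacement string, $<one> refers to the group named "one".
const namedResult = data.replace(namedPattern, "$<one>~~");
// The match result exposes named captures on its groups property.
const namedGroups = data.match(namedPattern).groups;

console.log(namedResult);
console.log(namedGroups.one);
```

The group is still numbered as well, so "$1" would work in the replacement string too.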

Atomic groups

An atomic group is also called a solidified (or curing) group.

Syntax: (?>…)

As mentioned above, a match may involve a lot of backtracking (with the non-greedy pattern, for example), and the more backtracking there is, the less efficiently the regular expression runs. Atomic groups are designed to reduce backtracking.

In fact, an atomic group (?>…) matches the same way as a normal group and does not by itself change what the subexpression matches. The only difference is that once the atomic group has finished matching, the text it matched is solidified into a single unit that can only be kept or given up as a whole. Any untried alternative states saved inside the parentheses are discarded, so backtracking can never choose one of them. Let's look at an example to understand atomic groups better.

Suppose we need to process a batch of data whose original format is 123.456. Because of floating-point display problems, some values have turned into something like 123.456000000789. We now want to keep only 2 to 3 digits after the decimal point, and the last digit must not be 0. How do we write this regex?

var str = "123.456000000789";
str = str.replace(/(\.\d\d[1-9]?)\d*/, "$1"); // 123.456

For data already in the “123.456” format, the regex above does needless work: it matches and then replaces the text with itself. To improve efficiency, we change the last “*” to a “+”, as follows:

var str = "123.456";
str = str.replace(/(\.\d\d[1-9]?)\d+/, "$1"); // 123.45

Now the subexpression “\d\d[1-9]?” matches “45” instead of “456”. Because the regular expression ends with “+”, at least one digit must be matched at the end, so “[1-9]?” gives back the “6” and the final subexpression “\d+” matches it. Obviously “123.45” is not the result we expect. So what can we do? Can we make “[1-9]?” refuse to backtrack once it has matched? This is where the atomic group mentioned above comes in.

“(\.\d\d(?>[1-9]?))\d+” is the atomic-group form of the regex above. Since the string “123.456” does not satisfy this regex, the match fails, as expected.

Let's analyze why the atomic-group regex (\.\d\d(?>[1-9]?))\d+ does not match the string “123.456”.

Clearly, the atomic group above can end in only two ways.

Case ①: [1-9] fails to match. The regex then falls back to the saved state left by “?”, the match leaves the atomic group, and it proceeds to \d+. When control leaves the atomic group, there is no saved state to discard (none remains from inside the group).

Case ②: [1-9] matches successfully. When the match leaves the atomic group, the saved state left by “?” still exists, but it is discarded because it belongs to an atomic group that has ended.

For the string “123.456”, [1-9] matches successfully, so case ② applies. Let's replay what happens in case ②.

  1. State of the match: the atomic group has consumed the “6”, and the match moves on;
  2. The subexpression \d+ finds nothing left to match, so the regex engine tries to backtrack;
  3. Is there a saved state to backtrack to?
  4. The state saved by “?” belongs to the atomic group that has ended, so it has already been discarded;
  5. The “6” matched by the atomic group therefore cannot be given back through backtracking;
  6. The attempt to backtrack fails;
  7. The regex match fails;
  8. The text “123.456” is not matched by the regular expression, which is what we expected.

The corresponding flow chart is as follows:

Unfortunately, the atomic group syntax is not supported in javaScript. It is available in PHP, .NET, and Java, and in Python since version 3.11. A PHP version of the atomic-group regex is provided below for you to try.

$str = "123.456";
echo preg_replace("/(\.\d\d(?>[1-9]?))\d+/", "\\1", $str); // atomic group

In addition, PHP provides possessive quantifier syntax, as follows:

$str = "123.456";
echo preg_replace("/(\.\d\d[1-9]?+)\d+/", "\\1", $str); // possessive quantifier

Java likewise provides possessive quantifier syntax, which also avoids backtracking, as follows:

String str = "123.456";
System.out.println(str.replaceAll("(\\.\\d\\d[1-9]?+)\\d+", "$1")); // 123.456

Note that in a Java string literal, each backslash of the regex must itself be escaped.
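
Although JavaScript has no atomic group syntax, a commonly used workaround emulates one with a lookahead plus a backreference, because the engine never backtracks into a lookahead once it has succeeded. The sketch below applies that workaround to the example above; it is a swapped-in technique rather than the atomic group itself, and the variable names are assumptions.

```javascript
// Plain version: [1-9]? can give back the "6" so that \d+ can match it.
const plainPattern = /(\.\d\d[1-9]?)\d+/;
// Emulated atomic group: the lookahead captures [1-9]? into group 2,
// and \2 then consumes exactly that text with no alternative to fall back on.
const atomicLikePattern = /(\.\d\d(?=([1-9]?))\2)\d+/;

const plainResult = "123.456".replace(plainPattern, "$1");
const atomicShort = "123.456".replace(atomicLikePattern, "$1");
const atomicLong = "123.456000000789".replace(atomicLikePattern, "$1");

console.log(plainResult);
console.log(atomicShort);
console.log(atomicLong);
```

With the emulation, “123.456” is left untouched because the match fails, while the long value is still trimmed to three decimal places.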

Advanced regex skills: zero-width assertions

If regex grouping is an eye technique, then the zero-width assertion is its kaleidoscope form, the ultimate secret. Used properly, zero-width assertions can do what groups can do and also what groups cannot, which greatly strengthens regex matching; they can even help you locate text quickly when the matching conditions are very vague.

A zero-width assertion is also called lookaround. Lookaround only tests whether its subexpression matches; the matched content is not included in the final result. Because the match is zero-width, what is finally matched is just a position.

By direction, lookaround is either sequential or reverse (also called lookahead and lookbehind); by whether the subexpression must match, it is either positive or negative. Combining the two gives four kinds of lookaround. They are not complicated, and they are described below.

Character  Description  Example
(?:pattern)  Non-capturing group. Matches pattern but does not capture the result; that is, no backreference is created, just as if there were no parentheses.  ‘abcd(?:e)’ matches “abcde”
(?=pattern)  Positive lookahead. Matches a position that is followed by pattern; the result is not captured.  ‘Windows(?=2000)’ matches the “Windows” in “Windows2000” but not the “Windows” in “Windows3.1”
(?!pattern)  Negative lookahead. Matches a position that is not followed by pattern; the result is not captured.  ‘Windows(?!2000)’ matches the “Windows” in “Windows3.1” but not the “Windows” in “Windows2000”
(?<=pattern)  Positive lookbehind. Matches a position that is preceded by pattern; the result is not captured.  ‘(?<=Office)2000’ matches the “2000” in “Office2000” but not the “2000” in “Windows2000”
(?<!pattern)  Negative lookbehind. Matches a position that is not preceded by pattern; the result is not captured.  ‘(?<!Office)2000’ matches the “2000” in “Windows2000” but not the “2000” in “Office2000”

Non-capturing groups are listed in the table only for comparison, because their structure is similar to lookaround. Of the four kinds of lookaround, javaScript originally supported only the first two, positive lookahead and negative lookahead; lookbehind was added in ES2018. Let's use examples to help understand:

var str = "123abc789", s;
// no lookaround: abc is replaced directly
s = str.replace(/abc/, 456);
console.log(s); // 123456789
// positive lookahead matches the 3 right before abc, so abc is not replaced; only 3 is replaced with 3456
s = str.replace(/3(?=abc)/, 3456);
console.log(s); // 123456abc789
// negative lookahead: the 3 is followed by abc, so nothing is replaced
s = str.replace(/3(?!abc)/, 3456);
console.log(s); // 123abc789

The following uses Python to demonstrate positive lookbehind and negative lookbehind.

import re
data = "123abc789"
# positive lookbehind: [a-z]+ must be preceded by 123, so abc is replaced with 456
regExp = r"(?<=123)[a-z]+"
replaceString = "456"
print(re.sub(regExp, replaceString, data)) # 123456789
# negative lookbehind: [a-z]+ must not be preceded by 123, so the subexpression [a-z]+ captures bc, and bc is replaced with 456
regExp = r"(?<!123)[a-z]+"
replaceString = "456"
print(re.sub(regExp, replaceString, data)) # 123a456789

Note that in Python and Perl, the subexpression of a lookbehind must have a fixed width. For example, if the “(?<=123)” subexpression above is written as “(?<=[0-9]+)”, the Python interpreter reports: “error: look-behind requires fixed-width pattern”.
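
For comparison, JavaScript engines that implement the ES2018 lookbehind syntax can run the same two replacements. This is a sketch under that assumption; on older engines the patterns below are syntax errors.

```javascript
const lookbehindData = "123abc789";

// Positive lookbehind: the letters must be preceded by 123.
const afterDigits = lookbehindData.replace(/(?<=123)[a-z]+/, "456");
// Negative lookbehind: the letters must not be preceded by 123,
// so the match starts at "b" and covers "bc".
const notAfterDigits = lookbehindData.replace(/(?<!123)[a-z]+/, "456");
// Unlike Python and Perl, these engines also accept a variable-width lookbehind.
const variableWidth = "12abc".replace(/(?<=[0-9]+)[a-z]+/, "X");

console.log(afterDigits);
console.log(notAfterDigits);
console.log(variableWidth);
```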

Scenario review

Getting an HTML fragment

Now let’s say that js uses Ajax to get a piece of HTML code like this:

var responseText = "<div data='dev.xxx.txt'></div><img src='dev.xxx.png' />";

Now we need to replace the “dev” string in the src attribute of the img tag with the “test” string.

① Since the responseText string above contains at least two “dev” substrings, we obviously cannot simply replace the string “dev” with “test”.

② We could check for the prefix “src='” and then replace the “dev” after it, but that requires lookbehind, which JS did not support before ES2018.

③ We notice that the src attribute of the img tag ends in “.png”. Based on this, we can use a positive lookahead, as follows:

var reg = /dev(?=[^']*png)/;
var str = responseText.replace(reg, "test");
console.log(str); // <div data='dev.xxx.txt'></div><img src='test.xxx.png' />

Of course, positive lookahead is not the only solution; capturing groups can do the job too. So what makes lookaround more advanced? It can pin down a position with a single match, which is often useful in complex text-replacement scenarios, whereas grouping requires more steps. Please read on.

Thousands separator

The thousands separator is, as the name suggests, the comma in a number. Refer to the western custom of adding a symbol to the number to avoid the fact that the number is too long to see its value intuitively. Therefore, add a comma every three digits to the number, which is the thousands separator.

So how do you convert a string of numbers into thousands?

var str = "1234567890"; (+str).toLocaleString(); / / "1234567890"

As above, toLocaleString() returns a “localized” string form of the current object.

  • If the object is a Number, it returns the number as a string split by a locale-specific separator.
  • If the object is an Array, each item is first converted to a string, and the strings are then joined with a locale-specific separator and returned.

Because toLocaleString is locale-dependent, its output varies. For Chinese, the default separator is the English comma, so it can be used to convert a number directly into thousands-separated form. Once internationalization is taken into account, however, this approach may fail.
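One way to reduce the dependence on the machine's default locale is to pass a locale tag explicitly. A small sketch, assuming the runtime ships locale data for “en-US” (the variable names are illustrative):

```javascript
var n = 1234567890;

// With an explicit locale, the grouping no longer depends on the
// default locale of the machine that runs the code.
var enUS = n.toLocaleString("en-US");
console.log(enUS); // "1,234,567,890"

// Other locales may group differently. The exact output depends on the
// locale data available in the runtime, so it is not shown here.
console.log(n.toLocaleString("de-DE"));
```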

Let’s try to solve this with lookaround instead.

function thousand(str){
  return str.replace(/(?!^)(?=([0-9]{3})+$)/g, ',');
}
console.log(thousand(str)); // "1,234,567,890"
console.log(thousand("123456")); // "123,456"
console.log(thousand("1234567879876543210")); // "1,234,567,879,876,543,210"

The regex used above consists of two parts: (?!^) and (?=([0-9]{3})+$). Let’s look at the second part first.

  1. “[0-9]{3}” matches three consecutive digits.
  2. “([0-9]{3})+” matches one or more consecutive groups of three digits.
  3. “([0-9]{3})+$” matches a run of digits whose length is a positive multiple of 3 and which extends to the end of the string.
  4. So (?=([0-9]{3})+$) matches a zero-width position such that the number of digits between that position and the end of the string is a positive multiple of 3.
  5. The regex uses the global flag g, which means that after one match is found, matching continues until no further match exists.
  6. Replacing each such position with a comma effectively inserts a comma every three digits, counting from the right.
  7. Of course, for a string such as “123456”, whose length is itself a positive multiple of 3, no comma may be placed at the very front. (?!^) therefore specifies that the replacement position cannot be the start of the string.

The thousands-separator example shows the power of lookaround: the whole job is done in a single step.
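To make the zero-width positions visible, the following sketch lists the indexes at which each pattern matches. The helper positions is hypothetical, and String.prototype.matchAll is assumed to be available (it was standardized in ES2020, after this article was written).

```javascript
var digits = "123456";

// Collect the index of every zero-width position matched by a pattern.
function positions(re, s) {
  return Array.from(s.matchAll(re), function (m) { return m.index; });
}

// Lookahead alone: position 0 also qualifies, because six digits follow it.
var withoutGuard = positions(/(?=([0-9]{3})+$)/g, digits);
console.log(withoutGuard); // [0, 3]

// Adding (?!^) rules out the start of the string.
var withGuard = positions(/(?!^)(?=([0-9]{3})+$)/g, digits);
console.log(withGuard); // [3]
```

Position 3 is exactly where the comma in “123,456” goes; position 0 is the one that (?!^) exists to exclude.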

The application of regular expressions in JS

ES6 extensions to regular expressions

ES6 extends regular expressions with two more modifiers (which other languages may not support):

  • The y (sticky) modifier is similar to g: it is also a global match, and each match starts at the position right after the previous successful match. The difference is that with g a match may be found anywhere in the remaining text, whereas with y the match must begin at the very first remaining position.
var s = "abc_ab_a";
var r1 = /[a-z]+/g;
var r2 = /[a-z]+/y;
console.log(r1.exec(s), r1.lastIndex); // ["abc", index: 0, input: "abc_ab_a"] 3
console.log(r2.exec(s), r2.lastIndex); // ["abc", index: 0, input: "abc_ab_a"] 3
console.log(r1.exec(s), r1.lastIndex); // ["ab", index: 4, input: "abc_ab_a"] 6
console.log(r2.exec(s), r2.lastIndex); // null 0

As shown above, the second match starts at index 3, where the character is “_”. The regex object r2 uses the y modifier and must match starting at that first remaining position, so the match fails and null is returned.

The sticky property of a regex object indicates whether the y modifier is set. We’ll come back to it later.
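A typical use of the y modifier is tokenizing, where every token has to start exactly where the previous one ended. The sketch below is only an illustration under that assumption: the function tokenize and its token pattern are hypothetical and not part of the original article.

```javascript
// Hypothetical tokenizer: numbers, lowercase words and a few operators,
// each optionally surrounded by whitespace.
function tokenize(input) {
  var tokenRe = /\s*([0-9]+|[a-z]+|[-+*\/()])\s*/y;
  var tokens = [];
  var pos = 0;
  while (pos < input.length) {
    // With y, exec only tries to match at lastIndex, never further right.
    tokenRe.lastIndex = pos;
    var m = tokenRe.exec(input);
    if (m === null) {
      throw new Error("Unexpected character at index " + pos);
    }
    tokens.push(m[1]);
    pos = tokenRe.lastIndex;
  }
  return tokens;
}

var tokens = tokenize("12 + ab*(3-x)");
console.log(tokens); // ["12", "+", "ab", "*", "(", "3", "-", "x", ")"]
```

With g instead of y, an unexpected character such as “$” would be skipped silently; with y the match fails at that position and the error is reported.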

  • The u modifier adds support for 4-byte code points (code points above U+FFFF). For example, “𝌆” is a 4-byte character. Matching it with a plain regex fails, while adding the u modifier gives the correct result.
var s = "𝌆";
console.log(/^.$/.test(s)); //false
console.log(/^.$/u.test(s)); //true

UCS-2 code points

A word about code points: JavaScript can only handle UCS-2 encoding (JS was designed in 10 days by Brendan Eich in May 1995, more than a year before the UTF-16 encoding specification was published in July 1996, so only UCS-2 was available at the time). Because of this congenital limitation of UCS-2, every character in JS is 2 bytes. A 4-byte character is treated by default as two 2-byte characters, so the character-handling functions in JS cannot return the correct result for it. As follows:

var s = "𝌆";
console.log(s == "\uD834\uDF06"); //true, 𝌆 is encoded as 0xD834 0xDF06 in UTF-16
console.log(s.length); //2, the length is 2, which shows this is a 4-byte character

Fortunately, ES6 recognizes 4-byte characters automatically, so a for...of loop can be used directly to traverse a string. Likewise, when a Unicode character is written directly as a code point escape, ES5 cannot recognize 4-byte characters. ES6 fixes this: just put the code point in curly braces.

console.log(s === "\u1D306"); //false, ES5 cannot recognize 𝌆 written this way
console.log(s === "\u{1D306}"); //true, ES6 recognizes 𝌆 with braces
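
Addendum: functions added in ES6 for handling 4-byte code points

  • String.fromCodePoint(): returns the character for a given Unicode code point
  • String.prototype.codePointAt(): returns the code point of the character at a given position
  • String.prototype.at(): proposed to return the character at a given position in the string, but it was not included in ES6; the at() standardized later (ES2022) indexes UTF-16 code units, so it does not return a 4-byte character whole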
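The difference between code units and code points can be checked with a short sketch (the variable names are illustrative):

```javascript
var s = "𝌆";

// for...of iterates by code point, so the surrogate pair stays together.
var chars = [];
for (var ch of s) {
  chars.push(ch);
}
console.log(s.length);     // 2 (UTF-16 code units)
console.log(chars.length); // 1 (code points)

// codePointAt reads the full code point; charCodeAt only sees the first unit.
var cp = s.codePointAt(0);
console.log(cp.toString(16));              // "1d306"
console.log(s.charCodeAt(0).toString(16)); // "d834"

// fromCodePoint goes the other way.
var roundTrip = String.fromCodePoint(0x1D306);
console.log(roundTrip === s); // true
```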

For the Unicode character set in JS, see Ruan Yifeng’s Unicode and JavaScript in Detail.

Those are the ES6 extensions to regular expressions. As for methods, the regex-related methods in JavaScript are:

Method name | compile | test | exec | match | search | replace | split
Belongs to | RegExp | RegExp | RegExp | String | String | String | String

In total, seven regex-related methods in JS come from the RegExp and String objects. First, let’s look at RegExp, the regex class in JS.

RegExp

The RegExp object represents a regular expression that is used primarily to perform pattern matching on strings.

Syntax: new RegExp(pattern[, flags])

The pattern argument is a string specifying the regular expression pattern, or another regular expression object.

The flags argument is an optional string containing the attributes “g”, “i”, and “m”, which specify global, case-insensitive, and multi-line matching, respectively. In ES5, if pattern is a regular expression rather than a string, this argument must be omitted; ES6 allows it, in which case the given flags replace those of the original regex.

var pattern = "[0-9]";
var reg = new RegExp(pattern, "g");
// the above is equivalent to the literal form below
var reg = /[0-9]/g;

A short aside on creating regular expressions with literals versus constructors:

“For regex literals, ECMAScript 3 specifies that the same RegExp object is returned every time, so regular expressions created from a literal share a single instance. It was not until ECMAScript 5 that a new instance was required each time.”

So in practice we rarely need to worry about this today; just prefer the constructor when targeting old versions of non-IE browsers (IE has always followed the ES5 rule here, while old versions of other browsers followed the ES3 rule).
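The constructor form is also the natural choice when the pattern is only known at run time. Two details are easy to get wrong: backslashes must be doubled inside a string pattern, and text that should match literally needs its special characters escaped. A minimal sketch, where escapeRegExp is a hypothetical helper defined here for illustration:

```javascript
// In a string pattern every backslash must itself be escaped,
// so the literal /\d+/g becomes "\\d+" here.
var digitsRe = new RegExp("\\d+", "g");
var numbers = "a1b22c333".match(digitsRe);
console.log(numbers); // ["1", "22", "333"]

// Hypothetical helper: escape characters that are special in a regex,
// so that run-time text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var keywordRe = new RegExp(escapeRegExp("1+1=2"));
console.log(keywordRe.source);        // 1\+1=2
console.log(keywordRe.test("1+1=2")); // true
console.log(keywordRe.test("11=2"));  // false
```

Without the escaping step, “1+1=2” would be read as the pattern /1+1=2/, which matches “11=2” but not the literal text.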

The RegExp instance object contains the following properties:

Instance property | Description
global | Whether the global flag g is set (true/false)
ignoreCase | Whether the case-insensitive flag i is set (true/false)
multiline | Whether the multi-line flag m is set (true/false)
source | The pattern text specified when the RegExp instance was created
lastIndex | The position in the original string right after the end of the last match; defaults to 0
flags (ES6) | The modifiers of the regular expression
sticky (ES6) | Whether the y (sticky) modifier is set (true/false)

compile

The compile method changes and recompiles a regular expression during execution. (It is deprecated in current standards and is described here for completeness.)

Syntax: compile(pattern[, flags])

For the parameters, see the RegExp constructor above. Usage:

var reg = new RegExp("abc", "gi");
var reg2 = reg.compile("new abc", "g");
console.log(reg); // /new abc/g
console.log(reg2); // undefined in older engines; engines following ES2015 Annex B return reg itself

As you can see, compile modifies the original regex object in place and recompiles it. Its return value carries no new information: older engines return undefined, while engines that follow the current specification return the same regex object.

test

The test method checks whether a string matches a regular rule, returning true if the string contains text that matches the regular rule, and false otherwise.

Syntax: test(string)

console.log(/[0-9]+/.test("abc123")); //true
console.log(/[0-9]+/.test("abc")); //false

Above, the string “abc123” contains digits, so test returns true; the string “abc” contains no digits, so it returns false.

To check whether the entire string matches a pattern, add the start (^) and end ($) metacharacters to the regex, as follows:

console.log(/^[0-9]+$/.test("abc123")); //false

Above, the string “abc123” does not consist solely of digits from start to end (it begins with letters), so test returns false.

In fact, if the regular expression has the global flag g, the test method is also affected by the lastIndex property of the regex object, as follows:

var reg = /[a-z]+/; // no global flag
console.log(reg.test("abc")); //true
console.log(reg.test("de")); //true
var reg = /[a-z]+/g; // with the global flag g
console.log(reg.test("abc")); //true
console.log(reg.lastIndex); //3, the next test will start searching at index 3
console.log(reg.test("de")); //false

This effect is analyzed in the explanation of the exec method below.

exec

The exec method searches a string for a match of the regular expression, returning a result array if matching text is found and null otherwise.

Syntax: exec(string)

The array returned by exec carries two extra properties, index and input, and has the following characteristics:

  • Item 0 is the text matched by the whole regular expression
  • Items 1 to n are the backreferences: the text captured by groups 1 to n, in order. The same text is also available through RegExp.$1 to RegExp.$9 (a legacy feature limited to nine groups)
  • index is the position at which the matched text starts
  • input is the string that was searched

Whether or not the regex has the global flag “g”, exec returns the same kind of result: a single match. The regex object itself, however, behaves slightly differently. Let’s look at that difference in detail.

Assume that the regex object is reg, the string being searched is string, and the return value of reg.exec(string) is array.

If reg has the global flag “g”, the reg.lastIndex property holds the position in the original string right after the end of the matched text, which is where the next match begins. At that point reg.lastIndex == array.index (where the match started) + array[0].length (the length of the matched text). As follows:

var reg = /([a-z]+)/gi,
    string = "World Internet Conference";
var array = reg.exec(string);
console.log(array); //["World", "World", index: 0, input: "World Internet Conference"]
console.log(RegExp.$1); //World
console.log(reg.lastIndex); //5, which equals array.index + array[0].length

As the search continues, array.index increases, and reg.lastIndex increases along with it. We can therefore iterate over all matching text in a string by calling exec repeatedly. When exec finds no more matching text, it returns null and resets reg.lastIndex to 0.

Continuing the example above, let’s keep executing the code to check that this is true:

array = reg.exec(string);
console.log(array); //["Internet", "Internet", index: 6, input: "World Internet Conference"]
console.log(reg.lastIndex); //14
array = reg.exec(string);
console.log(array); //["Conference", "Conference", index: 15, input: "World Internet Conference"]
console.log(reg.lastIndex); //25
array = reg.exec(string);
console.log(array); //null
console.log(reg.lastIndex); //0

In the code above, after repeated calls to exec, reg.lastIndex is eventually reset to 0.
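The repeated calls are usually written as a loop. A minimal sketch using the same regex and string (the names found and match are illustrative):

```javascript
var reg = /([a-z]+)/gi;
var string = "World Internet Conference";
var found = [];
var match;

// exec returns null once no further match exists, which ends the loop
// and also resets reg.lastIndex to 0.
while ((match = reg.exec(string)) !== null) {
  found.push(match[0] + "@" + match.index);
}

console.log(found);         // ["World@0", "Internet@6", "Conference@15"]
console.log(reg.lastIndex); // 0
```

This loop relies on the g flag: without it, lastIndex never advances and the first match would be returned forever.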

Issue review

When explaining the test method, we left a question open. If the regex has the global flag g, the result of test is affected by reg.lastIndex, and so is exec. Because reg.lastIndex is not always 0 and determines where the next match begins, you must manually reset lastIndex to 0 before searching a new string with a regex that has already matched in another string. Otherwise errors like this occur:

var reg = /[0-9]+/g,
    str1 = "123abc",
    str2 = "123456";
reg.exec(str1);
console.log(reg.lastIndex); //3
var array = reg.exec(str2);
console.log(array); //["456", index: 3, input: "123456"]

The correct result should be “123456”, so it is recommended to add “reg.lastIndex = 0;” before calling exec the second time.
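The pitfall and the fix can be put side by side when one global regex is reused across several strings. A small sketch (the inputs array and the variable names are illustrative):

```javascript
var reg = /[0-9]+/g;
var inputs = ["123abc", "123456", "789"];

// Pitfall: the shared lastIndex carries over from one string to the next.
var carried = inputs.map(function (s) {
  var m = reg.exec(s);
  return m ? m[0] : null;
});
console.log(carried); // ["123", "456", null]

// Fix: reset lastIndex before searching each new string.
var fresh = inputs.map(function (s) {
  reg.lastIndex = 0;
  var m = reg.exec(s);
  return m ? m[0] : null;
});
console.log(fresh); // ["123", "123456", "789"]
```

In the first pass the third string yields null because the search starts at index 6, beyond the end of “789”.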

If reg does not have the global flag “g”, the result of exec (array) is exactly the same as the result of string.match(reg).

String

The match, search, replace, and split methods are described in the article on common string methods.

The following shows how to process a text template with a capture group to produce a complete string:

var tmp = "An ${a} a ${b} keeps the ${c} away";
var obj = { a:"apple", b:"day", c:"doctor" };
function tmpl(t, o){
  return t.replace(/\$\{(.)\}/g, function(m, p){
    console.log('m:' + m + ' p:' + p);
    return o[p];
  });
}
tmpl(tmp, obj); // "An apple a day keeps the doctor away"

The same result can be achieved “with” ES6 template strings, as follows (note that the with statement is not allowed in strict mode):

var obj = { a:"apple", b:"day", c:"doctor" };
with(obj){
  console.log(`An ${a} a ${b} keeps the ${c} away`);
}
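The tmpl function shown earlier only accepts one-character keys, because its group is “(.)”. As a sketch of how the capture group could be widened, the hypothetical render function below (not part of the original article) accepts word-like keys and leaves unknown placeholders untouched:

```javascript
// Hypothetical variant of tmpl: keys may be longer than one character,
// and unknown keys are left in place instead of becoming "undefined".
function render(template, data) {
  return template.replace(/\$\{(\w+)\}/g, function (whole, key) {
    return Object.prototype.hasOwnProperty.call(data, key)
      ? String(data[key])
      : whole;
  });
}

var out = render("Hello ${name}, you are ${age}; ${missing} stays", {
  name: "Ann",
  age: 30
});
console.log(out); // "Hello Ann, you are 30; ${missing} stays"
```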

Application of regular expressions in H5

The pattern attribute introduced in HTML5 specifies a pattern for validating an input field, and the pattern is written in regular expression syntax. By default the pattern is matched against the entire value: the whole text must match, whether or not the regex contains the “^” and “$” metacharacters.

Note: pattern applies to the following input types: text, search, url, tel, email, and password. To disable form validation, add the novalidate attribute to the form tag.
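The whole-value behaviour can be approximated in plain JS by wrapping the pattern in “^(?:” and “)$”. The helper matchesPattern below is hypothetical, and the “u” flag is an assumption made for this sketch; browsers may compile the attribute with different flags.

```javascript
// Hypothetical helper that approximates how a browser checks a value
// against the pattern attribute: the whole value has to match.
function matchesPattern(pattern, value) {
  var anchored = new RegExp("^(?:" + pattern + ")$", "u");
  return anchored.test(value);
}

console.log(matchesPattern("[0-9]{3}", "123"));  // true
console.log(matchesPattern("[0-9]{3}", "1234")); // false, although /[0-9]{3}/ finds a match inside it
console.log(matchesPattern("a|b", "ab"));        // false
```

The non-capturing group matters for alternations: without it, “^a|b$” would accept any value that merely starts with “a” or ends with “b”.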

Regular engine

There are two types of regex engine, DFA and NFA. NFA engines can be further divided into traditional NFA and POSIX NFA.

  • DFA: Deterministic Finite Automaton
  • NFA: Non-deterministic Finite Automaton
      • Traditional NFA
      • POSIX NFA

A DFA engine does not backtrack, so it matches quickly; it also does not support capture groups, and therefore does not support backreferences. The awk and egrep commands mentioned earlier use DFA engines.

POSIX NFA refers to an NFA engine that complies with the POSIX standard. Languages such as JavaScript, Java, PHP, Python, and C# use NFA engines.
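One observable trait of a traditional NFA engine, as described in Mastering Regular Expressions, is that alternatives are tried from left to right and the first one that works wins, even if a later alternative would match more text. A small sketch in JS (the variable names are illustrative):

```javascript
// The first alternative that leads to an overall match is taken.
var first = "top".match(/to|top/)[0];
console.log(first); // "to"

// Reordering the alternatives changes the result.
var reordered = "top".match(/top|to/)[0];
console.log(reordered); // "top"
```

A DFA or POSIX NFA engine would report the longest leftmost match, “top”, in both cases.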

I have not found a suitable article online on the detailed matching principles of regular expressions. I suggest reading Chapter 4, on the mechanics of expression matching (P143-P183), of Jeffrey Friedl’s Mastering Regular Expressions [third edition]. Jeffrey Friedl has a deep understanding of regular expressions, and I believe the book can help you learn them better.

For a simple implementation of an NFA engine, see the article “A regular expression engine based on ε-NFA” by Twoon.

conclusion

In the initial stage of learning regular expressions, focus on understanding ① greedy and non-greedy modes, ② grouping, ③ capturing and non-capturing groups, ④ named groups, and ⑤ atomic (fixed) groups, and appreciate the subtlety of their design. The advanced stage mainly consists of using ⑥ zero-width assertions (lookaround) fluently to solve problems, and becoming familiar with how regex matching works.

In fact, regular expressions in JavaScript are not especially powerful. As of ES6, JavaScript supports only ① greedy and non-greedy modes, ② grouping, ③ capturing and non-capturing groups, and, among the ⑥ zero-width assertions, lookahead (named groups and lookbehind were only added later, in ES2018). If you also become familiar with the seven regex-related methods in JS (compile, test, exec, match, search, replace, split), you will be able to handle text and strings with ease.

Regular expressions have a natural gift for text processing, and they are so powerful that in many cases they are the only practical solution. They are not limited to JS: popular editors (such as Sublime and Atom) and IDEs (such as WebStorm and IntelliJ IDEA) support them too. In any language, at any time, you can try a regular expression on a problem that once seemed unsolvable and may now find it easy.

Regular expression resources for other languages:

  • Python Regular expression Guide
  • Java regular expressions

That’s all for this article. If you have any questions or good ideas, please feel free to leave a comment below.

Louis

About this article: it took two months to write and runs to about 12K words. To present the front-end use cases of regular expressions concisely and comprehensively, I collected a large amount of material on the subject and cut many redundant words. Writing is hard work; if you like the article, please give it a 👍 or bookmark it. I will keep it updated.

Link to this article: louiszhai.github.io/2016/06/13/…

References

  • Jeffrey Friedl, Mastering Regular Expressions [third edition]
  • Comparison of Linux shell regular expressions (BREs, EREs, PREs)
  • Introduction to capture groups and non-capture groups in regular expressions - Regular expressions - Home of Scripts
  • Regular expressions (1): metacharacters - inverse - Blog Garden
  • Regular expressions in detail - Guisu, program life. He who does not advance loses ground. - Blog channel - CSDN.NET
  • Regular expressions: fixed grouping - Taek - Blog Garden
  • Regular expressions: greedy and non-greedy patterns in detail (overview) - Regular expressions - Home of Scripts
  • JavaScript regular expression learning: basics and zero-width assertions - Feather in the Wind - Blog channel - CSDN.NET
  • Unicode and JavaScript in Detail - Ruan Yifeng’s web log