Overview of new developments in regular expressions in ES6-ES2019

In the previous article, “JS Regular Expressions: From Getting Started to Getting Refined”, we covered the use of regular expressions in JavaScript fairly thoroughly. Since ES6 (ES2015), with the help of iconic tools such as Babel, JS development has avoided repeating the stagnation of the Flash era and moved onto the fast track of one small release per year. Along the way, regular expressions have gained some new features.

This article tries to take a quick look at these features and see if they can simplify our development.

ECMAScript and TC39

These terms come up repeatedly, so although they may be common knowledge, a brief explanation is in order first.

ECMA stands for the European Computer Manufacturers Association. Now known as Ecma International, it positions itself as an information and telecommunications standards organization with an international membership.

In November 1996, Netscape, the creator of JavaScript, decided to submit JavaScript to ECMA in the hope that the language would become an international standard. The following year, ECMA released the first version of the standard document 262 (ECMA-262), which defined the browser scripting language as ECMAScript (ES).

Since then, the standard has been widely adopted; JavaScript, JScript, ActionScript and others are implementations and extensions of the ECMA-262 standard.

From 1997 to 2009, ES1 through ES5 appeared in succession, with one exception: ES4. ES4 was designed with a bias towards Adobe’s ActionScript style, but it was eventually abandoned because other vendors pushed back against Flash and considered the changes too drastic. Some of its features were inherited by the later ES6.

In 2015, ES6, also known as ES2015, was released; it is arguably the most important release to date.

ECMA Technical Committee 39 (TC39), composed of representatives from major browser vendors, is responsible for developing the new ECMAScript standard.

New syntax goes through five stages from proposal to formal standard. Changes at each stage need to be approved by the TC39 committee:

Stage 0: Strawman
Stage 1: Proposal
Stage 2: Draft
Stage 3: Candidate
Stage 4: Finished

These stages are what we were pulling in when we used presets such as babel-preset-stage-0 with the Babel transpiler. With the release of Babel 7, of course, standardized features are handled uniformly through @babel/preset-env.

Regular expression features in ES6

The following features are first available in ES6:

“Sticky” modifier /y
Unicode modifier /u
A new property on regular expression objects: flags
Copying regular expressions with the RegExp() constructor

“Sticky” modifier `/y`

The /y modifier anchors each match of the regular expression to the end of the previous match.

In a nutshell, this has to do with the lastIndex property on regular expression objects: its effect differs depending on whether it is paired with /g or /y.

When no modifier is set, or only the /g modifier is set, a match succeeds as long as one exists anywhere in the target string (or, with /g, anywhere in the remainder after the previous match).

The /y modifier tells the regular expression to match only at the exact lastIndex position of the string, which is what “sticky” means.

Take the use of exec() as an example:

// No modifiers set
const re1 = /a/;
re1.lastIndex = 7; // setting lastIndex has no effect without /g or /y
const match1 = re1.exec('haha');
console.log(match1.index); // 1
console.log(re1.lastIndex); // 7 (no change)

// Set the '/g' modifier
const re2 = /a/g;
re2.lastIndex = 2;
const match2 = re2.exec('haha');
console.log(match2.index); // 3 (this time affected by lastIndex)
console.log(re2.lastIndex); // 4 (updated to the position after the match)
console.log(re2.exec('xaxa')); // null (no match from position 4 onwards)

// Set the '/y' modifier
const re3 = /a/y;
re3.lastIndex = 2;
console.log(re3.exec('haha')); // null (no match exactly at position 2)
re3.lastIndex = 3;
const match3 = re3.exec('haha');
console.log(match3.index); // 3 (position 3 matches exactly)
console.log(re3.lastIndex); // 4 (also updated)
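One place where this kind of anchoring tends to be useful is tokenizing, where each token has to begin exactly where the previous one ended. The following is only a minimal sketch: the `tokenize` function and its token pattern are hypothetical names invented for this illustration and do not come from the original article.

```javascript
// Hypothetical helper: split a small arithmetic expression into tokens.
// The sticky flag forces every match to start exactly at lastIndex,
// so unexpected characters are detected instead of being skipped over.
function tokenize(input) {
  // optional whitespace, one token (number or operator), optional whitespace
  const TOKEN = /\s*(\d+|[-+*\/()])\s*/y;
  const tokens = [];
  while (TOKEN.lastIndex < input.length) {
    const start = TOKEN.lastIndex; // remember it: a failed exec() resets lastIndex to 0
    const match = TOKEN.exec(input);
    if (match === null) {
      throw new SyntaxError(`Unexpected input at position ${start}`);
    }
    tokens.push(match[1]);
  }
  return tokens;
}

console.log(tokenize('12 + 3*(4 - 1)'));
// [ '12', '+', '3', '*', '(', '4', '-', '1', ')' ]
```

With /g instead of /y, the same loop would quietly jump over characters such as `$` rather than reporting them.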

Of course, in normal cases (for example when the match runs for the first time, or when lastIndex has not been set explicitly), /y behaves almost like a leading ^, since lastIndex is 0:

const re1 = /^a/g;
const re2 = /a/y;
console.log(re1.test('haha')); // false
console.log(re2.test('haha')); // false

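Note that if both /g and /y are set, matching is still sticky: every match must start exactly at lastIndex. In that case /g only lets methods such as match() and replace() continue on to further adjacent matches.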
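A quick way to see how the two modifiers interact is to compare results with and without /y. This is a small sketch; the variable names are made up for the illustration.

```javascript
// /g alone skips over the 'x' and finds all three 'a's.
const globalOnly = 'aaxa'.match(/a/g);
console.log(globalOnly); // [ 'a', 'a', 'a' ]

// With /g and /y together, matching stops at the first position that fails.
const globalSticky = 'aaxa'.match(/a/gy);
console.log(globalSticky); // [ 'a', 'a' ]

// replace() likewise only touches the leading run of adjacent matches.
const replaced = 'aaxa'.replace(/a/gy, '-');
console.log(replaced); // --xa
```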

`sticky` property

To go with the /y modifier, ES6 regular expression objects have a sticky property that indicates whether the /y modifier is set:

var r = /hello\d/y;
r.sticky // true

Unicode modifier `/u`

To briefly explain Unicode: its goal is to provide a unique identifier for every character in the world, called a code point.

Before ES6, JS strings were based on a 16-bit character encoding (UTF-16). Each 16-bit sequence (two bytes) is a code unit used to represent a character. All string properties and methods, such as the length property and the charAt() method, operate on these 16-bit sequences.

Originally, JS allows a common Unicode character to be represented in \uxxxx form, where the four hexadecimal digits represent the character’s Unicode code point:

console.log("\u0061"); // "a"

However, this notation can only represent characters whose code points lie in the range 0x0000 to 0xFFFF. A character beyond this range must be written as a pair of surrogates (a surrogate pair):

console.log("\uD842\uDFB7"); // "𠮷"

This leads to a problem: for Unicode characters beyond the 16-bit limit of 0xFFFF, the traditional notation goes wrong. For example, JS reads \u20BB7 as \u20BB followed by 7, so it shows up as a special character followed by a 7.

ES6 improves on this by putting the code point in braces (this syntax is called a Unicode code point escape) so that such characters are read correctly:

console.log("\u{20BB7}"); // "𠮷"

Another example:

'\u{1F680}' === '\uD83D\uDE80' // true

console.log('\u{1F680}') // 🚀
console.log('\uD83D\uDE80') // 🚀

The conversion between code points and surrogate pairs is explored clearly in this article (blog.csdn.net/hherima/art…). If you are interested, you can dig further using the references at the end of this article. We will not go into the details here; the examples above are enough to get a feel for it.
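For readers who want to try the correspondence themselves, the standard UTF-16 rule can be sketched in a few lines. The `toSurrogatePair` function is a hypothetical helper written for this illustration; it is not part of the language or of the article referenced above.

```javascript
// Hypothetical helper: convert a supplementary code point (above 0xFFFF)
// into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only code points from 0x10000 to 0x10FFFF use surrogate pairs');
  }
  const offset = codePoint - 0x10000;     // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F680);
console.log(high.toString(16), low.toString(16));            // d83d de80
console.log(String.fromCharCode(high, low) === '\u{1F680}'); // true
console.log('\uD83D\uDE80'.codePointAt(0).toString(16));     // 1f680
```

The built-in `codePointAt()` performs the reverse direction, as the last line shows.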

Back to the main topic. In ES6 regular expressions:

The /u modifier switches the regular expression to a special Unicode mode.

In Unicode mode, you can either use the new braced code point escapes to represent a wider range of characters or continue to use UTF-16 code units. This mode has the following characteristics:

Lone surrogates are treated as atomic code points:

// Traditional non-Unicode mode
/\uD83D/.test('\uD83D\uDC2A') // true, matched per 16-bit code unit

// Unicode mode
/\uD83D/u.test('\uD83D\uDC2A') // false, a surrogate pair is an indivisible code point in this mode
/\uD83D/u.test('\uD83D \uD83D\uDC2A') // true
/\uD83D/u.test('\uD83D\uDC2A \uD83D') // true

Code points can be placed in a character class:

/^[\uD83D\uDC2A]$/.test('\uD83D\uDC2A') //false
/^[\uD83D\uDC2A]$/u.test('\uD83D\uDC2A') //true

/^[\uD83D\uDC2A]$/.test('\uD83D') //true
/^[\uD83D\uDC2A]$/u.test('\uD83D') //false

The dot operator matches code points, not code units:

'\uD83D\uDE80'.match(/./gu).length // 1
'\uD83D\uDE80'.match(/./g).length // 2

Quantifiers also apply to whole code points:

/\uD83D\uDE80{2}/u.test('\uD83D\uDE80\uD83D\uDE80') //true
/\uD83D\uDE80{2}/.test('\uD83D\uDE80\uD83D\uDE80') //false
/\uD83D\uDE80{2}/.test('\uD83D\uDE80\uDE80') //true
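The code point behaviour above can be put to practical use, for instance to count characters the way a reader would. This is a minimal sketch; `countCodePoints` is a hypothetical helper name, and it reuses the [\s\S] idiom so that line breaks are counted as well.

```javascript
// Hypothetical helper: count code points rather than UTF-16 code units.
function countCodePoints(str) {
  // With /u, each [\s\S] consumes one whole code point, including astral ones.
  const matches = str.match(/[\s\S]/gu);
  return matches === null ? 0 : matches.length;
}

const text = 'I \u{1F680} JS'; // contains one astral character (the rocket emoji)
console.log(text.length);           // 7 (the emoji takes two code units)
console.log(countCodePoints(text)); // 6
```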

A new property on regular expression objects: `flags`

The flags property returns the modifiers of a regular expression:

const re = /abc/ig;
console.log( re.source ); //'abc'
console.log( re.flags ); //'gi'

Copying regular expressions with `RegExp()`

The traditional signature of the regular expression constructor is new RegExp(pattern: string, flags = ''), for example:

const re1 = new RegExp("^a\\d{3}", 'gi');
// is equivalent to: /^a\d{3}/gi

In ES6, the new usage is new RegExp(regex: RegExp, flags = regex.flags):

var re2 = new RegExp(re1, "yi");
// => /^a\d{3}/iy

This provides a way to copy an existing regular expression or change its modifiers.
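As an illustration of changing modifiers through a copy, the following sketch combines the constructor with the flags property. The `withFlag` helper is a hypothetical name introduced here, not an existing API.

```javascript
// Hypothetical helper: return a copy of `regex` with one additional flag.
function withFlag(regex, flag) {
  const flags = regex.flags.includes(flag) ? regex.flags : regex.flags + flag;
  return new RegExp(regex, flags); // ES6: a RegExp plus new flags is allowed
}

const original = /a\d{3}/i;
const globalCopy = withFlag(original, 'g');

console.log(globalCopy.source); // a\d{3}
console.log(globalCopy.flags);  // gi
console.log(original.flags);    // i (the original is left untouched)
```

Copying instead of mutating also means the copy starts with its own lastIndex, which avoids surprises when a /g or /y expression is shared.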

New features in ES2018/ES2019

In ES2018-ES2019, some additional features have been added:

Named capture groups
Backreferences
Lookbehind assertions
Unicode property escapes
dotAll modifier /s

Named capture group

Previously, regular expressions relied on numbered capture groups to match and group strings, such as:

const RE_DATE = /([0-9]{4})-([0-9]{2})-([0-9]{2})/; // The order of the parentheses determines their numbering
const matchObj = RE_DATE.exec('1999-12-31');

const year = matchObj[1]; // 1999
const month = matchObj[2]; // 12
const day = matchObj[3]; // 31

This approach is not ideal in terms of ease of use or complexity, especially when there are too many fields, nesting, etc. If you change the regular expression, it’s easy to forget to synchronize the scattered numbers.

With ES2018 named capture groups, you can identify captured groups by name:

The syntax looks like (?<year>[0-9]{4})
The captured result is read from the groups.year property
undefined is returned for any named group that fails to match

const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;

const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj.groups.year; // 1999
const month = matchObj.groups.month; // 12
const day = matchObj.groups.day; // 31

// The numbered access is also retained
const year2 = matchObj[1]; // 1999
const month2 = matchObj[2]; // 12
const day2 = matchObj[3]; // 31
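Named groups also combine well with destructuring, and an optional group shows the undefined behaviour mentioned in the list above. This is a small sketch; the optional day part is an assumption added for the illustration and is not in the original pattern.

```javascript
// The day part is wrapped in an optional non-capturing group for this demo.
const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})(?:-(?<day>[0-9]{2}))?/;

// Destructure the named groups directly from the match result.
const { groups: { year, month, day } } = RE_DATE.exec('1999-12');

console.log(year);  // 1999
console.log(month); // 12
console.log(day);   // undefined (the optional group did not take part in the match)
```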

This also provides an additional convenience for the replace() method. Note the syntax:

const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
console.log('2018-04-30'.replace(RE_DATE, '$<month>-$<day>-$<year>')); // 04-30-2018

// Using the capture group numbers:
console.log('2018-04-30'.replace(RE_DATE, "$2-$3-$1")); // 04-30-2018
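When the replacement is a function rather than a string, the named groups arrive as an object in the last argument. The sketch below assumes that behaviour; `toUsDate` is a hypothetical function name chosen for the example.

```javascript
const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;

// The replacer receives: the full match, each capture, the offset, the input
// string, and (because the pattern has named groups) the groups object last.
function toUsDate(...args) {
  const { year, month, day } = args[args.length - 1];
  return `${month}/${day}/${year}`;
}

console.log('Released on 2018-04-30.'.replace(RE_DATE, toUsDate));
// Released on 04/30/2018.
```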

Backreferences

In a regular expression, \k<name> matches the same string that the named capture group name matched earlier, for example:

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;
RE_TWICE.test('abc!abc'); // true
RE_TWICE.test('abc!ab'); // false

This syntax, called backreferencing, also applies to numbered capture groups:

const RE_TWICE = /^(?<word>[a-z]+)!\1$/;
RE_TWICE.test('abc!abc'); // true
RE_TWICE.test('abc!ab'); // false

The two forms can also be mixed without any problem:

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>!\1$/;
RE_TWICE.test('abc!abc!abc'); // true
RE_TWICE.test('abc!abc!ab'); // false
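A common practical use of a named backreference is requiring that a closing delimiter equals the opening one. The following is a minimal sketch; the pattern and the names `RE_QUOTED`, `quote` and `body` are invented for the illustration.

```javascript
// The closing quote must be the same character that the `quote` group captured.
const RE_QUOTED = /(?<quote>['"])(?<body>.*?)\k<quote>/;

const quoted = RE_QUOTED.exec(`say "it's fine" now`);
console.log(quoted.groups.quote); // "
console.log(quoted.groups.body);  // it's fine

// An opening double quote cannot be closed by a single quote.
console.log(RE_QUOTED.test(`"mismatched'`)); // false
```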

Lookbehind assertions

As described in the previous article, lookahead assertions are already supported in JS:

x(?=y) matches ‘x’ only if ‘x’ is followed by ‘y’. This is called a positive lookahead.
x(?!y) matches ‘x’ only if ‘x’ is not followed by ‘y’. This is called a negative lookahead.

ES2018 introduces lookbehind assertions, which work in the same way as lookahead assertions, only in the opposite direction.

There are also two subtypes:

(?<=y)x matches ‘x’ only if ‘x’ is preceded by ‘y’. This is called a positive lookbehind.

// The traditional way:
const RE_DOLLAR_PREFIX_OLD = /(\$)foo/g;
'$foo %foo foo'.replace(RE_DOLLAR_PREFIX_OLD, '$1bar'); // '$bar %foo foo'

// Using a positive lookbehind:
const RE_DOLLAR_PREFIX = /(?<=\$)foo/g;
'$foo %foo foo'.replace(RE_DOLLAR_PREFIX, 'bar'); // '$bar %foo foo'

(?<!y)x matches ‘x’ only if ‘x’ is not preceded by ‘y’. This is called a negative lookbehind.

// The traditional way:
const RE_NO_DOLLAR_PREFIX_OLD = /([^\$])foo/g;
'$foo %foo *foo'.replace(RE_NO_DOLLAR_PREFIX_OLD, '$1bar'); // '$foo %bar *bar'

// Using a negative lookbehind:
const RE_NO_DOLLAR_PREFIX = /(?<!\$)foo/g;
'$foo %foo foo'.replace(RE_NO_DOLLAR_PREFIX, 'bar'); // '$foo %bar bar'
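Lookbehind is also handy for extraction, because the prefix is checked without becoming part of the match. The following sketch uses a made-up sample sentence and variable names purely for illustration.

```javascript
const text = 'Item A costs $42.50, item B costs 17 EUR, item C costs $8';

// Positive lookbehind: numbers directly preceded by a dollar sign.
const dollarAmounts = text.match(/(?<=\$)\d+(?:\.\d+)?/g);
console.log(dollarAmounts); // [ '42.50', '8' ]

// Negative lookbehind: numbers NOT preceded by a dollar sign. Digits and dots
// are excluded too, so that a match cannot start in the middle of a number.
const otherAmounts = text.match(/(?<![\$\d.])\d+(?:\.\d+)?/g);
console.log(otherAmounts); // [ '17' ]
```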

Unicode property escapes

ES2018 added Unicode property escapes to the /u mode introduced in ES6. They take the form \p{…} and \P{…}, meaning “has the property” and “does not have the property”, respectively.

In purpose and form this is very similar to using \s to match whitespace. The part inside the braces of \p{} and \P{} is called a “Unicode character property”, and it makes regular expressions more readable.

/^\p{Script=Greek}+$/u.test('μετά') // true, matches Greek letters, in the form prop=value

/^\p{White_Space}+$/u.test('\t \n\r') // true, matches all whitespace, in the form bin_prop
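The negated form \P{…} can be illustrated in the same way. The sketch below is only an example under assumed inputs: the sample string and the name `RE_NON_LETTERS` are invented, and \p{L} is the short alias for General_Category=Letter.

```javascript
// \p{L} matches a letter in any script; \P{L} matches everything else.
const RE_NON_LETTERS = /\P{L}+/gu;

// Split on runs of non-letters to extract words regardless of script.
const words = 'αβγ, мир & 日本語!'
  .split(RE_NON_LETTERS)
  .filter(word => word.length > 0);
console.log(words); // [ 'αβγ', 'мир', '日本語' ]

console.log(/^\P{Script=Greek}+$/u.test('abc')); // true, no Greek characters at all
console.log(/^\P{Script=Greek}+$/u.test('abγ')); // false, contains a Greek letter
```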

Unicode character properties come from the Unicode standard, in which each character has metadata that describes it, such as:

Name: A unique name consisting of uppercase letters, digits, hyphens, and spaces, such as:
- A: Name = LATIN CAPITAL LETTER A
- 😀: Name = GRINNING FACE
General_Category: Categorizes characters, such as:
- x: General_Category = Lowercase_Letter
- $: General_Category = Currency_Symbol
White_Space: Marks invisible spaces, tabs, newlines, etc., such as:
- \t: White_Space = True
- π: White_Space = False
Age: The version of the Unicode standard in which the character was introduced, for example:
- €: Age = 2.1
Block: A contiguous range of code points; blocks do not overlap and their names are unique, as in:
- S: Block = Basic_Latin (range U+0000..U+007F)
- 😀: Block = Emoticons (range U+1F600..U+1F64F)
Script: A set of characters used in one or more writing systems
- Some scripts support multiple writing systems, such as the Latin script, which supports English, French, German, Latin, etc.
- Some languages can be written in multiple alternative writing systems backed by different scripts. For example, Turkish used the Arabic script before it switched to the Latin script in the early 20th century.
- For example:
  - α: Script = Greek
  - Д: Script = Cyrillic

A few more examples:

"AaBbCcDD".split("").filter(letter => { return /\p{Lower}/u.test(letter); }).join(""); // "abc"

const regex = /^\p{Number}+$/u;
regex.test('𝟏𝟞𝟩㉝ⅠⅡⅻⅼⅽⅾⅿ'); // true

/\p{Currency_Symbol}+/u.test("$€"); // true

dotAll modifier `/s`

A [\s\S] matching trick often appears in regular expressions. This seemingly superfluous notation compensates for a shortcoming of the . token: it does not match line terminators, so it cannot match correctly across multiple lines.

The /s modifier solves this problem by making . match every character, which is why it is also called the dotAll modifier.

console.log(/hello.world/.test('hello\nworld'));  // false 
console.log(/hello[\s\S]world/.test('hello\nworld'));  // true
console.log(/hello.world/s.test('hello\nworld')); // true

One more note about the . token, by the way:

/^.$/.test('😀') //false, does not recognize emoji as a character

/^.$/u.test('😀') //true, corrected by u
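To round off the /s modifier, here is a small sketch of a multi-line use case together with the matching dotAll property. The sample source text and the name `RE_COMMENT` are assumptions made for the illustration.

```javascript
const source = `let a = 1; /* first
multi-line comment */ let b = 2; /* second */`;

// With /s the dot also crosses line breaks, so both comments are found.
const RE_COMMENT = /\/\*.*?\*\//gs;
console.log(RE_COMMENT.dotAll);               // true
console.log(RE_COMMENT.flags);                // gs
console.log(source.match(RE_COMMENT).length); // 2

// Without /s the first comment is missed, because . stops at the line break.
const withoutDotAll = source.match(/\/\*.*?\*\//g);
console.log(withoutDotAll.length);            // 1
```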

References:

JS Regular Expressions: From Getting Started to Getting Refined: mp.weixin.qq.com/s?__biz=MzI…
exploringjs.com/es6/ch_rege…
exploringjs.com/es2018-es20…
www.cnblogs.com/dandelion-d…
www.appui.org/2496.html
stackoverflow.com/questions/4…
www.cnblogs.com/detanx/p/es…
www.cnblogs.com/xiaohuochai…
www.princeton.edu/~mlovett/re…
unicodebook.readthedocs.io/unicode_enc…
docs.microsoft.com/zh-cn/previ…
blog.csdn.net/hherima/art…
arui.tech/es2018-new-…
github.com/tc39/propos…
caibaojian.com/es6/
zhuanlan.zhihu.com/p/27762556
babeljs.io/docs/en/bab…

