Regular expressions in the post-ES6 era

Translator: smartsrh

The original link

In this article, we'll take a look at regular expressions in ES6 and beyond. ES6 introduced two new regular expression flags: the sticky matching flag /y and the Unicode flag /u. We then discuss five proposals that are moving through TC39's ECMAScript specification development process.

The sticky matching flag `/y`

The sticky matching flag y introduced in ES6 is similar to the global flag g. Like global regular expressions, sticky ones are typically used to match several times until the input string is exhausted. Sticky regular expressions move lastIndex to the position after the last match, just like global regular expressions. The only difference is that a sticky regular expression must start matching exactly where the previous match left off, whereas a global regular expression moves on through the rest of the input string when it fails to match at a given position.

The following example illustrates the difference between the two. Given the input string 'haha haha haha' and the regular expression /ha/, the global flag matches every occurrence of 'ha', while the sticky flag matches only the first two, because the third occurrence does not start at index 4 but at index 5.

function matcher (regex, input) {
    return () => {
        const match = regex.exec(input) 
        const lastIndex = regex.lastIndex 
        return { lastIndex, match } 
    }
}
const input = 'haha haha haha'
const nextGlobal = matcher(/ha/g, input) 
console.log(nextGlobal()) // <- { lastIndex: 2, match: ['ha'] }
console.log(nextGlobal()) // <- { lastIndex: 4, match: ['ha'] } 
console.log(nextGlobal()) // <- { lastIndex: 7, match: ['ha'] } 
const nextSticky = matcher(/ha/y, input) 
console.log(nextSticky()) // <- { lastIndex: 2, match: ['ha'] } 
console.log(nextSticky()) // <- { lastIndex: 4, match: ['ha'] } 
console.log(nextSticky()) // <- { lastIndex: 0, match: null }

If we forcibly move lastIndex, as in the following code, we can verify that the sticky matcher picks up the third occurrence.

const rsticky = /ha/y 
const nextSticky = matcher(rsticky, input) 
console.log(nextSticky()) // <- { lastIndex: 2, match: ['ha'] } 
console.log(nextSticky()) // <- { lastIndex: 4, match: ['ha'] } 
rsticky.lastIndex = 5
console.log(nextSticky()) // <- { lastIndex: 7, match: ['ha'] }

Sticky matching was added to JavaScript to improve the performance of lexers in compilers, which rely heavily on regular expressions.
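To make the lexer use case concrete, here is a minimal sketch of a tokenizer built on sticky regular expressions. The `rules` table and the `tokenize` function are hypothetical names invented for this illustration; they are not part of any library.

```javascript
// Hypothetical token rules for a tiny arithmetic language.
// Every pattern carries the y flag, so it can only match exactly at lastIndex.
const rules = [
    { type: 'number', regex: /\d+/y },
    { type: 'operator', regex: /[+\-*\/]/y },
    { type: 'space', regex: /\s+/y }
]

function tokenize(input) {
    const tokens = []
    let position = 0
    while (position < input.length) {
        let matched = false
        for (const { type, regex } of rules) {
            regex.lastIndex = position // try this rule at the cursor only
            const match = regex.exec(input)
            if (match !== null) {
                if (type !== 'space') {
                    tokens.push({ type, value: match[0] })
                }
                position = regex.lastIndex // advance past the token
                matched = true
                break
            }
        }
        if (!matched) {
            throw new SyntaxError(`Unexpected character at index ${ position }`)
        }
    }
    return tokens
}

console.log(tokenize('12 + 34*5').map(token => token.value))
// <- ['12', '+', '34', '*', '5']
```

Because a sticky pattern never scans ahead, each rule either matches at the cursor or fails immediately, which is what lets the loop report the exact index of an unexpected character.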

The Unicode flag `/u`

ES6 also introduced a u flag. The u stands for Unicode, but this flag can also be thought of as a stricter mode for regular expressions.

Without the u flag, the following snippet is a regular expression containing an unnecessarily escaped 'a' character literal.

/\a/.test('ab') // <- true

Escaping a non-reserved character such as a in a regular expression with the u flag results in an error, as shown in the following bit of code.

/\a/u.test('ab') // <- SyntaxError: Invalid regular expression: /\a/: Invalid escape

Strings such as '\u{1f40e}' were added in ES6. The following example attempts to embed a horse emoji in a regular expression using \u{1f40e}, but the regular expression does not match the horse emoji. Without the u flag, the \u{…} pattern is interpreted as an unnecessarily escaped u character followed by the rest of the sequence.

/\u{1f40e}/.test('🐎') // <- false
/\u{1f40e}/.test('u{1f40e}') // <- true

The u flag adds support for Unicode code point escapes in regular expressions, such as the \u{1f40e} horse emoji.

/\u{1f40e}/u.test('🐎') // <- true

Without the u flag, the . pattern matches any BMP symbol except line terminators; astral symbols are not matched. The following example uses the musical treble clef (U+1D11E), an astral symbol that . does not match in regular expressions without the u flag.

const rdot = /^.$/
rdot.test('a') // <- true
rdot.test('\n') // <- false
rdot.test('\u{1d11e}') // <- false

Unicode symbols outside the BMP are also matched when the u flag is used. The next snippet shows how the same symbols are matched once the flag is set.

const rdot = /^.$/u 
rdot.test('a') // <- true 
rdot.test('\n') // <- false 
rdot.test('\u{1d11e}') // <- true

When the u flag is set, similar Unicode awareness can be found in quantifiers and character classes, both of which treat each Unicode code point as a single symbol instead of matching only on the first code unit. Case-insensitive matching with the i flag performs Unicode case folding when the u flag is also set, which normalizes the code points in both the input string and the regular expression.
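The following sketch illustrates those three behaviors side by side. The variable names are made up for this example, and the patterns are built with the RegExp constructor only so that the same source can be compiled with and without the u flag.

```javascript
const clef = '\u{1d11e}' // treble clef: one code point, two UTF-16 code units
const twice = clef + clef

// Quantifiers: without u, {2} repeats only the trailing surrogate
const quantifierPlain = new RegExp(`^${ clef }{2}$`).test(twice)
const quantifierUnicode = new RegExp(`^${ clef }{2}$`, 'u').test(twice)
console.log(quantifierPlain, quantifierUnicode) // <- false true

// Character classes: without u, the class holds two separate surrogates
const classPlain = new RegExp(`^[${ clef }]$`).test(clef)
const classUnicode = new RegExp(`^[${ clef }]$`, 'u').test(clef)
console.log(classPlain, classUnicode) // <- false true

// Case folding: U+212A KELVIN SIGN folds to 'k' only when u and i are combined
const foldPlain = /\u212a/i.test('k')
const foldUnicode = /\u212a/iu.test('k')
console.log(foldPlain, foldUnicode) // <- false true
```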

For more details on the U flag in regular expressions, see Mathias Bynens’ article.

Named capture group

So far, JavaScript regular expressions have been able to group matches in numbered capture groups and non-capturing groups. In the next snippet, we use capture groups to extract the key and the value from an input string containing a key-value pair separated by '='.

function parseKeyValuePair(input) { 
    const rattribute = /([a-z]+)=([a-z]+)/ 
    const [, key, value] = rattribute.exec(input) 
    return { key, value } 
} 
parseKeyValuePair( 'strong=true' ) 
// <- { key: 'strong', value: 'true' }

There are also non-capturing groups, which are discarded from the final result but still take part in matching. The following example supports key-value pairs separated by ' is ' as well as by '='.

function parseKeyValuePair(input) {
    const rattribute = /([a-z]+)(?:=|\sis\s)([a-z]+)/
    const [, key, value] = rattribute.exec(input)
    return { key, value }
}
parseKeyValuePair('strong is true') // <- { key: 'strong', value: 'true' }
parseKeyValuePair('flexible=too') // <- { key: 'flexible', value: 'too' }

Although array destructuring in the previous example hides our code's reliance on magic array indices, the fact remains that matches are placed in an ordered array. The named capture groups proposal (at stage 3 at the time of this writing) adds syntax like (?<groupName>…), where we can name capture groups, which are then returned in the groups property of the match object. The groups property can be destructured from the resulting object when calling RegExp#exec or String#match.

function parseKeyValuePair(input) {
    const rattribute = /(?<key>[a-z]+)(?:=|\sis\s)(?<value>[a-z]+)/u
    const { groups } = rattribute.exec(input)
    return groups
}
parseKeyValuePair('strong=true') // <- { key: 'strong', value: 'true' }
parseKeyValuePair('flexible=too') // <- { key: 'flexible', value: 'too' }
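As a further sketch of how the groups property behaves, the hypothetical `parseVersion` helper below reads named groups through String#match, guards against a failed match, and relies on the fact that a named group which did not participate in the match is reported as undefined, so a destructuring default can fill it in.

```javascript
// The patch segment is optional, so its named group may not participate
const rversion = /^(?<major>\d+)\.(?<minor>\d+)(?:\.(?<patch>\d+))?$/u

function parseVersion(input) {
    const match = input.match(rversion)
    if (match === null) {
        return null // nothing to destructure when the input does not match
    }
    // An unmatched named group is undefined, which triggers the default value
    const { major, minor, patch = '0' } = match.groups
    return { major, minor, patch }
}

console.log(parseVersion('2.10.3')) // <- { major: '2', minor: '10', patch: '3' }
console.log(parseVersion('1.4')) // <- { major: '1', minor: '4', patch: '0' }
console.log(parseVersion('latest')) // <- null
```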

JavaScript regular expressions support backreferences, where captured groups can be reused to look for duplicates. The following snippet uses a backreference to the first capture group to identify cases where the username is the same as the password in a 'user:password' input.

function hasSameUserAndPassword(input) {
    const rduplicate = /([^:]+):\1/
    return rduplicate.exec(input) !== null
}
hasSameUserAndPassword('root:root') // <- true
hasSameUserAndPassword('root:pF6GGlyPhoy1!9i') // <- false

The named capture groups proposal adds support for named backreferences.

function hasSameUserAndPassword(input) {
    const rduplicate = /(?<user>[^:]+):\k<user>/u
    return rduplicate.exec(input) !== null
}
hasSameUserAndPassword('root:root') // <- true
hasSameUserAndPassword('root:pF6GGlyPhoy1!9i') // <- false

\k<groupName> references can be used alongside numbered references, but the latter are best avoided when named references are already in use.

Finally, named groups can be referenced in the replacement string passed to String#replace. In the next snippet, we use String#replace and named groups to convert a US date string to the Hungarian format.

function americanDateToHungarianFormat(input) {
    const ramerican = /(?<month>\d{2})\/(?<day>\d{2})\/(?<year>\d{4})/u
    const hungarian = input.replace(ramerican, '$<year>-$<month>-$<day>')
    return hungarian
}
americanDateToHungarianFormat('06/09/1988') // <- '1988-06-09'

If the second argument to String#replace is a function, the named groups can be accessed through a groups object at the very end of the argument list. The signature of that function is now (match, ...captures, offset, string, groups). In the following example, notice how we use a template literal that looks much like the replacement string from the previous example. The fact that replacement strings follow the $<groupName> syntax rather than the ${groupName} syntax means that we can name groups in a replacement string written as a template literal without resorting to escape codes.

function americanDateToHungarianFormat(input) {
    const ramerican = /(?<month>\d{2})\/(?<day>\d{2})\/(?<year>\d{4})/u
    const hungarian = input.replace(ramerican, (...args) => {
        // groups is always the last argument, after the offset and the input string
        const { month, day, year } = args[args.length - 1]
        return `${ year }-${ month }-${ day }`
    })
    return hungarian
}
americanDateToHungarianFormat('06/09/1988') // <- '1988-06-09'

Unicode property escapes

The Unicode property escapes proposal _(currently at stage 3)_ adds a new kind of escape sequence that is available in regular expressions with the u flag. The proposal adds an escape of the form \p{LoneUnicodePropertyNameOrValue} for binary Unicode properties and \p{UnicodePropertyName=UnicodePropertyValue} for non-binary Unicode properties. In addition, \P is the negated version of the \p escape sequence.

The Unicode standard defines properties for every symbol. With these properties, we can run advanced queries against Unicode characters. For example, symbols in the Greek alphabet have their Script property set to Greek. We can use the new escapes to match any Greek Unicode symbol.

function isGreekSymbol(input) { 
    const rgreek = /^\p{Script=Greek}$/u 
    return rgreek.test(input) 
} 
isGreekSymbol('π') // <- true

Alternatively, using \P, we can match non-Greek Unicode symbols.

function isNonGreekSymbol(input) {
    const rgreek = /^\P{Script=Greek}$/u 
    return rgreek.test(input) 
} 
isNonGreekSymbol('π') // <- false

When we need to match every Unicode decimal digit symbol, not just [0-9] like \d, we can use \p{Decimal_Number} as shown below.

function isDecimalNumber(input) { 
    const rdigits = /^\p{Decimal_Number}+$/u 
    return rdigits.test(input) 
} 
isDecimalNumber('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼') // <- true

The link below is a complete list of supported Unicode attributes and values.
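As an additional sketch, the hypothetical helpers below combine a General_Category escape (\p{L}, any letter) with a Script escape to process mixed-script text. The function names and sample strings are made up for this illustration.

```javascript
// \p{L} matches a letter in any script, unlike [a-z] or \w
function extractWords(input) {
    return input.match(/\p{L}+/gu) || []
}

// Script=Cyrillic restricts the match to a single writing system
function isCyrillicWord(input) {
    return /^\p{Script=Cyrillic}+$/u.test(input)
}

console.log(extractWords('hello мир 世界 123')) // <- ['hello', 'мир', '世界']
console.log(isCyrillicWord('мир')) // <- true
console.log(isCyrillicWord('mir')) // <- false
```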

Lookbehind assertion

JavaScript has long had positive lookahead assertions. This feature allows us to match an expression only when it is followed by another expression. These assertions are written as (?=…). Whether or not the lookahead assertion matches, the result of that match is discarded and no characters of the input string are consumed.

The following example uses a positive lookahead assertion to test whether the input string ends in .js, in which case it returns the filename without the .js part.

function getJavaScriptFilename(input) {
    const rfile = /^(?<filename>[a-z]+)(?=\.js)\.[a-z]+$/u
    const match = rfile.exec(input)
    if (match === null) {
        return null
    }
    return match.groups.filename
}
getJavaScriptFilename('index.js') // <- 'index'
getJavaScriptFilename('index.php') // <- null

There are also negative lookahead assertions, which are written as (?!…) instead of the (?=…) used for positive lookahead. In this case, the assertion succeeds only if the lookahead pattern does not match. The following code uses a negative lookahead assertion, and we can see how the results are flipped: any extension other than '.js' now results in a successful match.

function getNonJavaScriptFilename(input) {
    const rfile = /^(?<filename>[a-z]+)(?!\.js)\.[a-z]+$/u
    const match = rfile.exec(input)
    if (match === null) {
        return null
    }
    return match.groups.filename
}
getNonJavaScriptFilename('index.js') // <- null
getNonJavaScriptFilename('index.php') // <- 'index'

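The lookbehind assertions proposal (stage 3) introduces positive and negative lookbehind assertions, written as (?<=…) and (?<!…) respectively. These assertions ensure that the pattern we want to match is, or is not, preceded by another given pattern. The following snippet uses a positive lookbehind assertion to match the digits in dollar amounts, but not in euro amounts.

function getDollarAmount(input) {
    // No ^ anchor here: at index 0 there is nothing for the lookbehind to inspect
    const rdollars = /(?<=\$)(?<amount>\d+(?:\.\d+)?)$/u
    const match = rdollars.exec(input)
    if (match === null) {
        return null
    }
    return match.groups.amount
}
getDollarAmount('$12.34') // <- '12.34'
getDollarAmount('€12.34') // <- null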

On the other hand, a negative lookbehind can be used to match numbers that are not preceded by a dollar sign.

function getNonDollarAmount(input) {
    // The lookbehind also rejects digits and dots, so a match cannot begin partway through a number
    const rnumbers = /(?<![$\d.])(?<amount>\d+(?:\.\d+)?)$/u
    const match = rnumbers.exec(input)
    if (match === null) {
        return null
    }
    return match.groups.amount
}
getNonDollarAmount('$12.34') // <- null
getNonDollarAmount('€12.34') // <- '12.34'
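Lookbehind also combines well with the global flag. The sketch below, with a hypothetical `getDollarAmounts` helper and a made-up sample sentence, collects every amount that directly follows a dollar sign while leaving the sign itself out of the results.

```javascript
function getDollarAmounts(input) {
    // The lookbehind checks for the $ without making it part of the match
    const rdollars = /(?<=\$)\d+(?:\.\d+)?/gu
    return input.match(rdollars) || []
}

console.log(getDollarAmounts('Coffee $3.50, bagel $2, tip €1'))
// <- ['3.50', '2']
```

Without lookbehind, the same result would need a capture group around the digits plus a loop over the matches, since String#match with the g flag returns whole matches only.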

A new `/s` _(`dotAll`)_ flag

When we use ., we typically expect it to match every single character. In JavaScript, however, the . expression does not match astral symbols _(which can be fixed by adding the u flag)_, nor does it match line terminators.

const rcharacter = /^.$/
rcharacter.test('a') // <- true
rcharacter.test('\t') // <- true
rcharacter.test('\n') // <- false

This sometimes forces developers to write other kinds of expressions to synthesize a pattern that matches any character. The expression in the next snippet matches any character that is either a whitespace character or a non-whitespace character, which delivers the behavior we would expect from the . pattern.

const rcharacter = /^[\s\S]$/ 
rcharacter.test('a') // <- true 
rcharacter.test('\t') // <- true 
rcharacter.test('\n') // <- true

The dotAll proposal _(stage 3)_ adds an s flag, which changes the behavior of . in JavaScript regular expressions so that it matches any single character.

const rcharacter = /^.$/s
rcharacter.test('a') // <- true
rcharacter.test('\t') // <- true
rcharacter.test('\n') // <- true
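A typical situation where this matters is a pattern that has to span several lines. The sketch below uses a made-up `source` string containing a block comment that crosses a line break, and compares the same pattern with and without the s flag.

```javascript
const source = 'const a = 1 /* first\n   second */ + 2'

// Without s, .*? stops at the line break, so the comment is never closed
const withoutS = /\/\*.*?\*\//.test(source)

// With s, . also matches '\n' and the whole comment is found
const rcomment = /\/\*.*?\*\//s
const withS = rcomment.test(source)

console.log(withoutS, withS) // <- false true
console.log(rcomment.dotAll) // <- true
console.log(source.replace(rcomment, '')) // <- 'const a = 1  + 2'
```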

`String#matchAll`

Often, when we have a regular expression with a global or sticky flag, we want to iterate over the set of captured groups for every match. Currently, producing that list of matches can be a bit cumbersome: we need to collect the captured groups using String#match or RegExp#exec in a loop, until the regular expression no longer matches the input starting at the lastIndex position. In the snippet below, the parseAttributes generator function does just that for one specific regular expression.

function *parseAttributes(input) {
    const rattributes = /(\w+)="([^"]+)"\s/ig
    while (true) {
        const match = rattributes.exec(input)
        if (match === null) {
            break
        }
        const [ , key, value] = match
        yield [key, value]
    }
}
const html = '<input type="email" placeholder="[email protected]" />'
console.log(...parseAttributes(html)) // <- ['type', 'email'] ['placeholder', '[email protected]']

One problem with this approach is that it is tailor-made for our regular expression and its capture groups. We can fix that by creating a matchAll generator that is only concerned with looping over the matches and collecting sets of captured groups, as shown in the following snippet.

function *matchAll(regex, input) {
    while (true) {
        const match = regex.exec(input)
        if (match === null) {
            break
        }
        const [ , ...captures] = match
        yield captures
    }
}
function *parseAttributes(input) {
    const rattributes = /(\w+)="([^"]+)"\s/ig
    yield *matchAll(rattributes, input)
}
const html = '<input type="email" placeholder="[email protected]" />'
console.log(...parseAttributes(html)) // <- ['type', 'email'] ['placeholder', '[email protected]']

A bigger source of confusion is that rattributes mutates its lastIndex property on every call to RegExp#exec, which is how it keeps track of the position after the last match. When there are no matches left, lastIndex is reset to 0. A problem arises when we do not iterate over all possible matches for one input in a single pass (which would reset lastIndex to 0) and then use the same regular expression on a second input, obtaining unexpected results.

While it does not look like our matchAll implementation would fall victim to this problem, the fact that a generator can be stepped through manually means that we run into trouble if we reuse the same regular expression, as shown in the code below. Notice how the second matcher should report ['type', 'text'], but instead starts matching at an index well past 0 and even misreports the 'placeholder' key as 'laceholder'.

const rattributes = /(\w+)="([^"]+)"\s/ig
const email = '<input type="email" placeholder="[email protected]" />'
const emailMatcher = matchAll(rattributes, email)
const address = '<input type="text" placeholder="Enter your business address" />'
const addressMatcher = matchAll(rattributes, address)
console.log(emailMatcher.next().value) // <- ['type', 'email']
console.log(addressMatcher.next().value) // <- ['laceholder', 'Enter your business address']

One solution is to change matchAll so that lastIndex is always 0 when we yield back to the consumer code, while keeping track of lastIndex internally so that we can pick up where we left off at each step of the sequence.

The following code shows that this does indeed fix the problem we observed. Reusing global regular expressions is usually avoided for this very reason: that way we do not have to worry about resetting lastIndex after every use.

function *matchAll(regex, input) { 
    let lastIndex = 0 
    while (true) { 
        regex.lastIndex = lastIndex 
        const match = regex.exec(input) 
        if (match === null) { 
            break 
        } 
        lastIndex = regex.lastIndex 
        regex.lastIndex = 0 
        const [ , ...captures] = match 
        yield captures 
    } 
} 
const rattributes = /(\w+)="([^"]+)"\s/ig
const email = '<input type="email" placeholder="[email protected]" />'
const emailMatcher = matchAll(rattributes, email)
const address = '<input type="text" placeholder="Enter your business address" />'
const addressMatcher = matchAll(rattributes, address)
console.log(emailMatcher.next().value) // <- ['type', 'email']
console.log(addressMatcher.next().value) // <- ['type', 'text']
console.log(emailMatcher.next().value) // <- ['placeholder', '[email protected]']
console.log(addressMatcher.next().value) // <- ['placeholder', 'Enter your business address']

The String#matchAll proposal _(at stage 1 at the time of this writing)_ introduces a new method on the string prototype that behaves in a similar way to our matchAll implementation, except that the returned iterable is a sequence of match objects rather than just the captures from the example above. Since the String#matchAll sequence contains entire match objects, and not only numbered captures, we can access the named captures of each match in the sequence through match.groups.

const rattributes = /(?<key>\w+)="(?<value>[^"]+)"\s/igu
const email = '<input type="email" placeholder="[email protected]" />'
for (const { groups: { key, value } } of email.matchAll(rattributes)) {
    console.log(`${ key }: ${ value }`)
}
// <- type: email
// <- placeholder: [email protected]
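The sketch below shows two practical consequences of iterating over full match objects, assuming an engine that ships String#matchAll as it was eventually standardized. The `rpairs` pattern and the sample string are made up for this illustration: each match exposes its index alongside the named groups, and the regular expression that was passed in keeps its lastIndex at 0, because the method works on an internal copy.

```javascript
const rpairs = /(?<key>\w+)=(?<value>\w+)/gu
const input = 'width=100 height=50'

// Each item is a full match object, so index and groups are both available
const pairs = [...input.matchAll(rpairs)].map(match => ({
    ...match.groups,
    index: match.index
}))

console.log(pairs)
// <- [ { key: 'width', value: '100', index: 0 },
//      { key: 'height', value: '50', index: 10 } ]

// The original regular expression was not mutated by the iteration
console.log(rpairs.lastIndex) // <- 0
```

This is the same guarantee our hand-written matchAll had to provide by saving and restoring lastIndex around every step.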
