Regular expression optimization - capture and non-capture groups

Regular expression series summary:

Regular expressions – Tools section
Regular expression optimization – capture and non-capture groups
Regular expression optimization – Avoid catastrophic backtracking

preface

I recently saw a project in which some regular expressions included some useless capture groups, so I wanted to investigate the difference in program execution when using capture groups versus non-capture groups, other things being equal.

The basic grammar

JavaScript syntax MDN document directions: Groups and Ranges

Capturing groups Capturing groups

Regular expressions can be used to group and capture text with a pair of parentheses (x),

Capture group usage scenarios:

Extract string
backreferences
To make the parts in parentheses a whole, specifying multiple choice structures or acting on quantifiers

A simple example:

var r = /a(.*)e/
var s = 'abcde fghi'

console.log(s.match(r))
Copy the code

The result is shown below:

With this pair of brackets, you can extract the characters between the letters A and E.

Named Capture Group

You can name the capture group to make it easier to read the extracted content.

x),

A simple example:

To extract the contents between div tags from the string

hello regexp

, you can use the expression /

(?

.*)<\/div>/, where message is the group name to extract this part of the content.

var s = '<div>hello regexp</div>';
var r = /
      
       (? 
       
        .*)<\/div>/
       
      

console.log(s.match(r))
Copy the code

The results are shown below:

The match() method of String returns an array of matched results:

The element with subscript 0 is the full character matched by the regular expression
If a capture group is set, the following entries are the characters extracted from the capture group starting with the subscript 1, and the order in which the left parentheses appear is the order in which the results are sorted
If the capture group name is also set, the result will be extracted by group name under the Groups field of the result array

Back reference

The content matched by the capture parentheses can be referenced in the program in the following scenarios:

Reference in a regular expression
Referenced when replacing a String, as in the second argument to String’s replace() method in javascript
Referenced by the Regexp object

Reference method: By the serial number of the capture group, or by the name of the capture group

Reference in an expression:

// Use the \n form n for the left to right capture group number
var r = /<(\w+)[^>]*>(.*? < 1 > \ / \)? /g 

// use \k
      
        by grouping name reference
      
// var r = /<(? 
      
       \w+)[^>]*>(.*? <\/\k
       
        >)? /g
       
      

var s = '
      
       div content
      
 some words '

var s1 = s.replace(r, ' ') // Can filter HTML tags
Copy the code

The above re is taken from the any-rule mentioned in the previous article and is used to loosely match tags in HTML,

This expression contains two pairs of capture parentheses:

The second pair of capture parentheses has one\ 1, is a reference to the contents of the first capture group, indicating matching pairs of tags. The first pair of parentheses says I now want to matchscriptThe label,< 1 > \The combination matches its corresponding closed part</script>
The effect of the second pair of capture parentheses is to make it inside the character into a whole, so that the following?Quantifiers work on the whole, indicating that HTML tags can be matched with or without closing tags.

References when replacing characters:

When you replace a String with the replace method of String, if the first argument to replace() is a regular expression, the second argument can be either a String or a function.

When a string is used as the second argument (representing the text to be replaced with), several special variables can be included:

$n: indicates the NTH capture parenthesis matching character in the insert expression, empty string if there is no capture group, n is 1-99, 01,02… Can also be
$: represents the character matched by the named capture group in the insert expression, an empty string if no capture group name is specified
$$: Simply replace with the $sign
$&: matches the substring of the entire expression
$’ : the substring following the matched substring
$’: the substring before the matched substring

var r = / (? 
      
       hello regexp)/
      
var s = '<div>hello regexp<\/div>'

var s1 = s.replace(r, '<em>$<text></em>')
// "<div><em>hello regexp</em></div>"
Copy the code

The above example references and replaces the capture group name

Reference by Regexp object:

The captured content can also be accessed using Regexp.$n, for example, Regexp.$1

var r = /(hello regexp)/
var s = '<div>hello regexp<\/div>'
var s1 = s.match(r)

console.log(Regexp.$1) // "hello regexp"
Copy the code

Non-capturing groups Non-capturing groups

Non-capture parentheses (? :x) cannot be used to extract text, as the Book Mastering Regular Expressions says:

Non-capturing parentheses help improve efficiency, and if the regex engine does not need to record the contents of capturing parentheses matches, it will be faster and use less memory.

var r = /<(\w+)[^>]*>(.*? < 1 > \ / \)? /g
var s = '
      
       div content
      
 some words '

var s1 = s.replace(r, ' ') // Can filter HTML tags
Copy the code

Or the previous example of matching HTML tags, where the second group does not need to be backreferenced, just to allow the part in parentheses to be used? The whole function of a quantifier, so it is entirely possible to replace it with non-capture parentheses.

var r = /<(\w+)[^>]*>(? :. *? < 1 > \ / \)? /g
Copy the code

The performance test

After summarizing the above usage, the most important thing to verify is the difference in performance between capture and non-capture groups. Here are my discard tests and summaries.

Round 1

In the first round of testing, I found a piece of text with a size of about 63KB and compared the execution of programs with three non-capture groups and three capture groups in the same regular expression respectively:

Text size	Number of capture groups	Program execution time	Program execution times per second
63kb	0	0.02 ms	About 28,000 OPS /s
63kb	3	0.03 ms	About 27,000 OPS /s

About 10 attempts, rough visual estimation, did not calculate the exact value.

The conclusion is: when the text content is not particularly large, the difference in execution efficiency between the capture group and the non-capture group is not obvious, and the difference is about 0.01ms in most cases after several times of direct execution.

Round 2

In the second round of testing, we found a section of dictionary data with a size of about 297KB and randomly extracted four fields in batches with regular expressions.

Compare the program execution with 4 non-capture groups, 2 capture groups +2 non-capture groups and 4 capture groups in the regular expression respectively:


console.time('0 capture groups')
var r3 = /"code":(? :. *?) ,"name":"(? :. *?) ","parentCode":(? :. *?) ,"shortName":(? :. *?) /g
var list3 = [...s.matchAll(r3)]
console.timeEnd('0 capture groups')

console.time('2 capture groups')
var r2 = /"code":(? .*?) ,"name":"(? 
       
        .*?) ","parentCode":(? :. *?) ,"shortName":(? :. *?) /g
       
var list2 = [...s.matchAll(r2)]
console.timeEnd('2 capture groups')

console.time('4 capture groups')
var r = /"code":(? .*?) ,"name":"(? 
       
        .*?) ","parentCode":(? 
        
         .*?) ,"shortName":(? 
         
          .*?) /g
         
        
       
var list = [...s.matchAll(r)]
console.timeEnd('4 capture groups')
Copy the code

Execute JS directly and print 5 groups of execution time statistics:

Text size	Four non-capture groups	2 Capture group + 2 Non-capture group	Four named capture groups
297kb	2.94 ms	4.76 ms	7.74 ms
297kb	0.92 ms	1.81 ms	8.48 ms
297kb	3.02 ms	3.64 ms	19.12 ms
297kb	0.57 ms	2.61 ms	12.53 ms
297kb	0.53 ms	4.73 ms	9.78 ms

Results of 10 performance tests using JSBench:

Text size	Four non-capture groups	2 capture groups +2 non-capture groups	Four named capture groups
297kb	1398.44 ops/s	Slow 954.12 ops/s (31.77%)	Slow 796.08 ops/s (43.07%)
297kb	1438.64 ops/s	Slow 946.43 ops/s (34.33%)	Slow 744.85 ops/s (48.32%)
297kb	1457.66 ops/s	Slow 887.82 ops/s (29.52%)	Slow 777.46 ops/s (38.28%)
297kb	1318.97 ops/s	Slow 930.34 ops/s (29.46%)	Slow 734.85 ops/s (44.29%)
297kb	1468.66 ops/s	Slow 950.73 ops/s (35.27%)	Slow 749.94 ops/s (48.94%)
297kb	1431.64 ops/s	Slow 959.95 ops/s (32.95%)	Slow 758.47 ops/s (45.13%)
297kb	1355.06 ops/s	Slow 946.32 ops/s (30.16%)	Slow 752.23 ops/s (44.49%)
297kb	1428.12 ops/s	Slow 925.76 ops/s (35.18%)	Slow 695.8 ops/s (51.28%)

Conclusion:

Put together two simple tests and the verdict is: the book’s right!

Although the difference in program execution speed between using capture groups and non-capture groups may seem small in the case of small text, the more capture groups are used, the slower the program is as the text gets larger and the expressions get more complex.

Therefore, during actual development, it is necessary to specify whether the grouping content in the regular expression needs to be captured. If not, use non-capturing parentheses

conclusion

One small step for optimization, one big step for civilization. While regular expression optimization focuses on reducing backtracking, I think it’s also necessary to use capture groups and non-capture groups correctly.

Take a look back at some of the regular expressions I’ve written and see if there’s some room to optimize the capture and non-capture groups

Finally, if there are any mistakes or omissions in the text, please correct them and bow.

Regular expression optimization – capture and non-capture groups

preface

The basic grammar

Capturing groups Capturing groups

Named Capture Group

Back reference

Non-capturing groups Non-capturing groups

The performance test

conclusion

Related Posts

Talk about readable and writable streams in Node

Back to the front

Create a React application from scratch