Background

In business development, we need to monitor error logs. Currently, our error logs have the format error…… So how do we match an error log?

We can do this with the ^error regular expression.
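As a minimal sketch of that idea in Go's standard regexp package (the helper name isErrorLog is hypothetical and only for illustration), a prefix match could look like this:

```go
package main

import (
	"fmt"
	"regexp"
)

// errorLine matches any line that begins with the literal text "error".
var errorLine = regexp.MustCompile(`^error`)

// isErrorLog reports whether a log line starts with "error".
// The name is hypothetical; it is not part of any library.
func isErrorLog(line string) bool {
	return errorLine.MatchString(line)
}

func main() {
	fmt.Println(isErrorLog("error1 test normal err")) // true
	fmt.Println(isErrorLog("info: all good"))         // false
	fmt.Println(isErrorLog("an error occurred"))      // false: "error" is not at the start
}
```

Without the ^ anchor, the third line would match as well, because MatchString looks for the pattern anywhere in the input.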

Now suppose we want to filter out occasional timeout errors, so that error logs containing timeout are not matched. How do we implement this?

Background knowledge

Basic syntax

Since we’re going to filter logs using regular expressions, we should know a little bit about the syntax of regular expressions.

character | description
^ | Matches the start of the input string. When used at the start of a bracket expression, it instead negates the set: the characters listed in the brackets are not accepted.
$ | Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r. To match the $ character itself, use \$.
( ) | Marks the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
* | Matches the preceding subexpression zero or more times. To match the * character, use \*.
+ | Matches the preceding subexpression one or more times. To match the + character, use \+.
. | Matches any single character except the newline \n. To match ., use \..
? | Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
| | Indicates a choice between two items. To match |, use \|.
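The metacharacters in the table can be tried out with a short Go program. This is only a sketch using the standard regexp package; the helper check is a hypothetical name, and the sample patterns are illustrative rather than taken from the log-filtering task.

```go
package main

import (
	"fmt"
	"regexp"
)

// check reports whether pattern matches anywhere in s.
// Hypothetical helper; it panics if the pattern is invalid.
func check(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	fmt.Println(check(`^err`, "error"))      // true: ^ anchors at the start
	fmt.Println(check(`err$`, "error"))      // false: the string does not end with "err"
	fmt.Println(check(`^a.c$`, "abc"))       // true: . matches any one character
	fmt.Println(check(`^ab*c$`, "ac"))       // true: * allows zero b characters
	fmt.Println(check(`^ab+c$`, "ac"))       // false: + needs at least one b
	fmt.Println(check(`^colou?r$`, "color")) // true: ? makes the u optional
	fmt.Println(check(`^(cat|dog)$`, "dog")) // true: | chooses between alternatives
	fmt.Println(check(`^1\+1$`, "1+1"))      // true: \+ is a literal plus sign

	// QuoteMeta escapes every metacharacter in a literal string.
	fmt.Println(regexp.QuoteMeta("a.b*c")) // a\.b\*c
}
```

Raw string literals (backquotes) keep the backslashes readable, since Go does not process escapes inside them.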

Zero-width assertions

This section draws on the reference article on regular expression zero-width assertions.

Applicable scenarios

Before introducing the concept of zero-width assertions, let’s first look at a scenario in which they are used.

Zero-width assertions come into play when we need to match text that appears before or after some specific content, without including that specific content in the match.

For example, take two strings: abcdefg and bcdefg. We want to find the de that is preceded by abc. That is where a zero-width assertion comes in.

Concept

A zero-width assertion is, as its name suggests, a match of zero width. What it matches is not saved in the match result; the result is just a position.

It attaches a condition to a position: the text before or after that position must satisfy the condition for the subexpression in the regular expression to match.

Note: a subexpression here is not only an expression enclosed in parentheses, but any matching unit in a regular expression.

// Illustrative only: Go's standard regexp rejects (?=...), as shown later in this article.
str := "abZW863"
pattern := "ab(?=[A-Z])"
regexp.MatchString(pattern, str)

In the code above, the regular expression means: match the string ab when it is followed by an uppercase letter. The match result is ab, because the zero-width assertion (?=[A-Z]) does not consume any characters; it only specifies that the current position must be followed by an uppercase letter.

// Illustrative only: Go's standard regexp rejects (?!...), as shown later in this article.
str := "abZW863"
pattern := "ab(?![A-Z])"
regexp.MatchString(pattern, str)

In the code above, the regular expression means: match the string ab when it is not followed by an uppercase letter. The regular expression fails to match anything, because in this string ab is followed by an uppercase letter.

Zero-width assertions are used to find what comes before or after certain content, without including that content. Like \b, ^, and $, they specify a position that must satisfy a condition (the assertion), which is why they are called zero-width assertions. An assertion declares a fact that should be true, and the regular expression continues to match only if the assertion holds.
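To make the "position, not characters" idea concrete, the following sketch spells out by hand, with no regex engine, what ab(?=[A-Z]) asks for. It is not from the original article: the function name findABBeforeUpper is hypothetical, and the code assumes ASCII input.

```go
package main

import (
	"fmt"
	"strings"
)

// findABBeforeUpper mimics the idea behind ab(?=[A-Z]) by hand: it looks for
// "ab" and then inspects, without consuming, the byte that follows.
// It returns the matched text and whether a match was found.
// Assumption: ASCII input only; the name is hypothetical.
func findABBeforeUpper(s string) (string, bool) {
	for start := 0; ; {
		i := strings.Index(s[start:], "ab")
		if i < 0 {
			return "", false
		}
		end := start + i + len("ab")
		// The "assertion": peek at the next byte, but leave it out of the result.
		if end < len(s) && s[end] >= 'A' && s[end] <= 'Z' {
			return s[start+i : end], true
		}
		start += i + 1 // try the next candidate position
	}
}

func main() {
	fmt.Println(findABBeforeUpper("abZW863")) // ab true
	fmt.Println(findABBeforeUpper("abcW863")) // prints an empty match and false
}
```

The returned text is only ab: the uppercase letter is checked but never becomes part of the match, which is what "zero width" means.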

(?=exp), also known as a zero-width positive lookahead assertion, asserts that the text following its position matches the expression exp.

(?<=exp), also known as a zero-width positive lookbehind assertion, asserts that the text preceding its position matches the expression exp.

Negative zero-width assertions

What if we want to make sure some text does not appear, without matching it? That is the question raised at the beginning. This is where negative zero-width assertions come in.

The zero-width negative lookahead assertion (?!exp) asserts that this position cannot be followed by text matching the expression exp.

In the same way, just as there is a "not followed by" assertion, there is a "not preceded by" assertion: the zero-width negative lookbehind assertion (?<!exp).

Summary

  • (?=exp): zero-width positive lookahead assertion, which asserts that the text following its position matches the expression exp.

  • (?<=exp): zero-width positive lookbehind assertion, which asserts that the text preceding its position matches the expression exp.

  • (?!exp): zero-width negative lookahead assertion, which asserts that this position cannot be followed by text matching the expression exp.

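  • (?<!exp): zero-width negative lookbehind assertion, which asserts that this position cannot be preceded by text matching the expression exp.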

Implementing the “does not contain” feature

From the zero-width assertions above, we know that if we want to find logs that start with error but do not contain timeout, we can use the following regular expression: ^error((?!timeout).)*$

Let’s walk through the regular expression step by step:

  1. (?!timeout): a zero-width negative lookahead assertion, asserting that this position cannot be followed by timeout.

  2. (?!timeout).: looks ahead to check that the string timeout does not start at the current position; if it does not, the . (dot) matches the next character. The lookahead itself consumes no characters; it only makes a judgment.

  3. ((?!timeout).)*: the expression (?!timeout). matches only a single character, so we wrap it in parentheses as a group and apply * (asterisk) to repeat it zero or more times.
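The three steps above can be sketched as a hand-written loop in plain Go. This is an illustration of how the pattern proceeds, not how any regex engine is implemented; the function name matchesErrorWithoutTimeout is hypothetical, and the sketch assumes each log entry is a single line with no newline characters.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesErrorWithoutTimeout mirrors ^error((?!timeout).)*$ step by step.
// Assumption: line contains no newline characters. The name is hypothetical.
func matchesErrorWithoutTimeout(line string) bool {
	const prefix, banned = "error", "timeout"

	// ^error : the line must start with the literal prefix.
	if !strings.HasPrefix(line, prefix) {
		return false
	}

	// ((?!timeout).)* : before consuming each character, check that
	// "timeout" does not start at the current position.
	for pos := len(prefix); pos < len(line); pos++ {
		if strings.HasPrefix(line[pos:], banned) {
			return false // the lookahead fails, so $ can never be reached
		}
		// The dot consumes line[pos]; the loop then moves to the next position.
	}

	// $ : every character up to the end of the line was consumed.
	return true
}

func main() {
	fmt.Println(matchesErrorWithoutTimeout("error1 test normal err"))       // true
	fmt.Println(matchesErrorWithoutTimeout("error2 test with timeout err")) // false
	fmt.Println(matchesErrorWithoutTimeout("warn: something else"))         // false
}
```

The check runs once per position, which is why the lookahead has to sit inside the repeated group rather than in front of it.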

Regular expression testing

Let’s now test if this regular expression works.

error1 test normal err: tests whether an error log without timeout matches correctly. We found that it matches as expected.

error2 test with timeout err: tests whether an error log containing timeout is excluded. It does not match, which is exactly what we need.

Go code implementation

So can we implement this in Go code? Let’s try it out.


package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Starts with "error" and contains no "timeout" (uses a negative lookahead).
    pattern := "^error((?!timeout).)*$"
    error1 := "error1 test normal err"
    error2 := "error2 test with timeout err"
    match1, err1 := regexp.MatchString(pattern, error1)
    match2, err2 := regexp.MatchString(pattern, error2)
    fmt.Printf("match: %v, err: %v\n", match1, err1)
    fmt.Printf("match: %v, err: %v\n", match2, err2)
}

The running results are as follows:

match: false, err: error parsing regexp: invalid or unsupported Perl syntax: `(?!`
match: false, err: error parsing regexp: invalid or unsupported Perl syntax: `(?!`

The error is: invalid or unsupported Perl syntax: `(?!`. This shows that Go’s standard regexp package does not support these zero-width assertions.

We can also see from the documentation that zero-width assertions are not supported.

syntax | description
(?=re) | before text matching re (NOT SUPPORTED)
(?!re) | before text not matching re (NOT SUPPORTED)
(?<=re) | after text matching re (NOT SUPPORTED)
(?<!re) | after text not matching re (NOT SUPPORTED)
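Because unsupported syntax is reported when the pattern is compiled, a program that receives patterns from configuration can check them up front. The sketch below assumes a hypothetical helper named compiles; it uses regexp.Compile, which returns an error, rather than regexp.MustCompile, which panics on an invalid pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether the standard regexp package accepts the pattern.
// Hypothetical helper for illustration.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(compiles(`^error`))                 // true
	fmt.Println(compiles(`^error((?!timeout).)*$`)) // false: lookahead is rejected
	fmt.Println(compiles(`(?<=abc)de`))             // false: lookbehind is rejected
}
```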

So what can we do? Don’t panic: someone has already built this wheel, namely regexp2.

Here is how regexp2 describes itself:

Regexp2 is a feature-rich RegExp engine for Go. It doesn’t have constant time guarantees like the built-in regexp package, but it allows backtracking and is compatible with Perl5 and .NET. You’ll likely be better off with the RE2 engine from the regexp package and should only use this if you need to write very complex patterns or require compatibility with .NET.

A partial comparison of regexp and regexp2 is as follows:

Category | regexp | regexp2
Catastrophic backtracking possible | no, constant execution time guarantees | yes, if your pattern is at risk you can use the re.MatchTimeout field
Python-style capture groups (?P<name>re) | yes | no (yes in RE2 compat mode)
.NET-style capture groups (?<name>re) or (?'name're) | no | yes
comments (?#comment) | no | yes
branch numbering reset (?|a|b) | no | no
possessive match (?>re) | no | yes
positive lookahead (?=re) | no | yes
negative lookahead (?!re) | no | yes
positive lookbehind (?<=re) | no | yes
negative lookbehind (?<!re) | no | yes
back reference \1 | no | yes
named back reference \k'name' | no | yes
named ascii character class [[:foo:]] | yes | no (yes in RE2 compat mode)
conditionals (?(expr)yes|no) | no | yes

From regexp2’s introduction and its comparison with regexp, we can see that regexp2 supports zero-width assertions. So let’s try it out, following the documentation.


package main

import (
    "fmt"
    "regexp"

    "github.com/dlclark/regexp2"
)

func main() {
    pattern := "^error((?!timeout).)*$"
    error1 := "error1 test normal err"
    error2 := "error2 test with timeout err"

    // The standard library rejects the lookahead when compiling the pattern.
    match1, err1 := regexp.MatchString(pattern, error1)
    match2, err2 := regexp.MatchString(pattern, error2)
    fmt.Printf("match1: %v, err: %v\n", match1, err1)
    fmt.Printf("match2: %v, err: %v\n", match2, err2)

    // regexp2 accepts the same pattern; 0 means no options.
    reg, err := regexp2.Compile(pattern, 0)
    if err != nil {
        fmt.Printf("reg: %v, err: %v\n", reg, err)
        return
    }

    // FindStringMatch returns a nil match when nothing matches.
    match3, err3 := reg.FindStringMatch(error1)
    match4, err4 := reg.FindStringMatch(error2)
    fmt.Printf("match3: %v, err: %v\n", match3, err3)
    fmt.Printf("match4: %v, err: %v\n", match4, err4)
}

The running results are as follows:


match1: false, err: error parsing regexp: invalid or unsupported Perl syntax: `(?!`
match2: false, err: error parsing regexp: invalid or unsupported Perl syntax: `(?!`
match3: error1 test normal err, err: <nil>
match4: <nil>, err: <nil>

You can see that match3 matches normally and match4 is nil, so the log containing timeout is excluded. The test succeeded.

Conclusion

As you can see from the above, regexp2 is quite powerful and is worth considering if you need to implement complex regular expressions.

One thing to note, however, is that the time complexity of regexp2 is not guaranteed. Its own introduction says so: “It doesn’t have constant time guarantees like the built-in regexp package”. In other words, unlike the official package, it backtracks and a risky pattern can take a very long time to run. Therefore, be careful when using it in a production environment!
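Where that risk matters, one alternative is to stay with the standard library and split the rule into two simple checks. This is a sketch of a swapped-in technique, not the approach the article tested: the function name filterErrors and the sample log lines are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// errorPrefix uses only syntax that the standard regexp package supports.
var errorPrefix = regexp.MustCompile(`^error`)

// filterErrors keeps lines that start with "error" and contain none of the
// excluded keywords. Hypothetical helper; both checks scan the line once.
func filterErrors(lines []string, excluded ...string) []string {
	var kept []string
	for _, line := range lines {
		if !errorPrefix.MatchString(line) {
			continue
		}
		skip := false
		for _, word := range excluded {
			if strings.Contains(line, word) {
				skip = true
				break
			}
		}
		if !skip {
			kept = append(kept, line)
		}
	}
	return kept
}

func main() {
	logs := []string{
		"error1 test normal err",
		"error2 test with timeout err",
		"info: started",
	}
	fmt.Println(filterErrors(logs, "timeout")) // [error1 test normal err]
}
```

The trade-off is that the rule now lives in code instead of a single pattern string, so it suits cases where the exclusions are known in advance rather than supplied as arbitrary regular expressions.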

References

  • Solving Golang regular expressions’ lack of support for complex patterns and lookahead

  • Regular expression zero-width assertions explained