The author:Chen Shuyi| this article from theTencent Cloud + community Chen ShuyiThe column of

A few days ago, an online project monitoring information was suddenly abnormal. After checking the usage of related resources on the machine, it was found that the CPU usage was nearly 100%. Using Java’s own thread Dump tool, we exported the stack information in question.

We can see that all the stacks point to a method called validateUrl, and there are more than 100 such error messages on the stack. By examining the code, we know that the main function of this method is to verify that the URL is valid.

It is strange how a regular expression can cause high CPU utilization. In order to understand the recurrence problem, we copied the key code and did a simple unit test.

public static void main(String[] args) {
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~\\\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAf Y1Vw__%5EDadIfJgiEf";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!"); }}Copy the code

When we run the above example, we can see from the resource monitor that the CPU utilization of one process named Java skyrocketed to 91.4%.

At this point, we can almost conclude that this regular expression is responsible for the high CPU utilization!

So we focus on that regular expression:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~\\/])+$Copy the code

This regular expression looks fine and can be broken down into three parts:

The first part matches HTTP and HTTPS protocols, and the second part matches WWW. Character, the third part matches many characters. I stared at this expression for a long time, but I didn’t see any major problems.

The key reason for the high CPU usage here is that the Java regular expression engine implementation is NFA automata, which backtracks when matching characters. ** Once a backtrace occurs, it can take a long time, ranging from minutes to hours, depending on the number and complexity of the backtrace.

See here, maybe you are not very clear what is backtracking, there is still a little muddled. It doesn’t matter. Let’s start with the basics of regular expressions.

Regular expression engine

A regular expression is a convenient symbol for matching, but to implement such a complex and powerful matching syntax, you need a set of algorithms to implement it, and the thing that implements this set of algorithms is called a regular expression engine. In short, there are two ways to implement a regular expression engine: DFA (Deterministic Final Automata) and NFA (Non Deterministic Finite Automaton).

For these two kinds of automata, they have their own differences, here do not intend to go into their principle. In short, the time complexity of DFA automata is linear and more stable, but its functionality is limited. The time complexity of the NFA is quite variable, sometimes very good, sometimes not so good, depending on the regular expression you write. However, NFA is more powerful, so Java,.NET, Perl, Python, Ruby, PHP and other languages use NFA to implement their regular expressions.

How does the NFA auto-add match? Let’s illustrate the following characters and expressions.

text="Today is a nice day."
regex="day"
Copy the code

It is important to remember that the NFA matches against regular expressions. In other words, the NFA automatically reads each character of the regular expression and matches it with the target string. If the match is successful, the next character of the regular expression is replaced. Otherwise, the comparison continues with the next character of the target string. Maybe you don’t understand that, ok, let’s go through the example above step by step.

  • First, get the first match of the regular expression: d. So compare that to the character in the string, the first character in the string is T, it doesn’t match, replace. The second one is o, it doesn’t match, so let’s do the next one. The third character is D. If it matches, the second character of the regular expression, a, is read.
  • Read the second match of the regular expression: a. Then it goes on to the fourth character of the string, a, and matches again. Then read the third character of the regular expression: y.
  • Reads the third match of the regular expression, y. Then it matches the fifth character of the string, y, again. Try to read the next character in the regular expression and find that there is none, then the match ends.

The above matching process is the matching process of NFA automata, but the actual matching process is much more complicated than this, but its principle remains the same.

Backtracking of NFA automata

Now that you know how the NFA does string matching, you can move on to the main point of this article: backtracking. To better explain backtracking, let’s do the same with the following example.

text="abbc"
regex=Ab {1, 3} "c"
Copy the code

The purpose of the above example is simple, matching a string that starts with a, ends with c, and has 1-3 b characters in between. The process by which the NFA resolves it looks like this:

  • First, we read the first match a of the regular expression and compare it to the first character a of the string, and it matches. The second character of the regular expression is read.
  • B {1,3} matches b{1,3} and b{1,3}. However, because b{1,3} represents 1-3 b strings, and because of the greedy nature of NFA automata (that is, to match as many as possible), the next match of the regular expression is not read. Instead, b{1,3} is compared with the third character b of the string, and the match is found. B {1,3} is compared with c, the fourth character of the string. This is where backtracking occurs.
  • How does backtracking work when it happens? After backtracking occurs, the fourth character c of the string that we have read is spat out, and the pointer returns to the third string. After that, the program reads the next operator c of the regular expression, reads the next character C of the current pointer, compares it, and finds a match. So we read the next operator, but we’re done here.

Let’s go back to the previous regular expression that validates urls:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~\\/])+$Copy the code

The URL in question is:

http://www.fapiao.com/dzfp-web/pdf/download?request=6e7JGm38jfjghVrv4ILd-kEn64HcUX4qL4a4qJ4-CHLmqVnenXC692m74H5oxkjgdsYa zxcUmfcOH2fAfY1Vw__%5EDadIfJgiEfCopy the code

We divide this regular expression into three parts:

  • Part ONE: Verification protocol.^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://).
  • Part two: Verify the domain name.(([A-Za-z0-9-~]+).) +.
  • Part three: verification parameters.([A-Za-z0-9-~\\/])+$.

We can find that the regular expression verification protocol http://part is no problem, but when the verification www.fapiao.com, it uses XXXX. This way to check. So the matching process actually looks like this:

  • Matching to the WWW.
  • Matching to the fapiao.
  • Match to thecom/dzfp-web/pdf/download? request=6e7JGm38jf.....You’ll find that because of greedy matching, the program will keep reading the string to match until it finds no dot, so it will go back character by character.

This is the first problem with this regular expression.

Another problem is that in the third part of the regular expression, we find that the URL in question has an underscore (_) and a percent sign (%), but the regular expression corresponding to the third part does not. This would result in a long string of characters being matched, then finding a mismatch, and then going back.

This is the second problem with this regular expression.

The solution

Now that you understand that the backtracking is what’s causing the problem, and you’re actually reducing the backtracking, you can see that if I underline the third part and add the percent sign, it works.

public static void main(String[] args) {
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~_%\\\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAf Y1Vw__%5EDadIfJgiEf";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!"); }}Copy the code

Run the above program and it will print match immediately!! .

But this is not enough. If there are other urls that contain confusing characters in the future, we may have to change them again. It can’t be realistic!

There are actually three modes in regular expressions: greedy mode, lazy mode, and exclusive mode.

In terms of quantity matching, we have +? * {min, Max} four twice, if only used alone, then they are greedy mode.

** What if you add one more after them? Symbol, then the greedy pattern becomes the lazy pattern, matching as little as possible. ** But lazy patterns can backtrack. TODO for example:

text="abbc"
regex="Ab {1, 3}? c"
Copy the code

The first operator a of the regular expression matches the first character a of the string. So the second operator of the regular expression b{1,3}? Matches the second character b of the string. Because of the rule of least match, the third operator c of the regular expression matches the third character b of the string, and it does not match. So go back and take the second regular expression operator b{1,3}? Matches the third character b of the string. Then the third operator c of the regular expression matches the fourth character c of the string, and the match succeeds. So the end.

If a + sign is added after them, the original greedy mode becomes exclusive mode, matching as much as possible, but not backtracking.

Therefore, if you want to solve the problem completely, you need to ensure that the functionality does not occur at the same time. I added a + sign after the second part of the regular expression above to verify the URL, so it looks like this:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://) (([A-Za-z0-9-~]+).) ([a-za-z0-9 -~\\/])+$. ([a-za-z0-9 -~\\/])+$Copy the code

After that, there is no problem running the original program.

Finally, I recommend a website that can check if your regular expression matches the corresponding string.

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, the URL in question in this article will tilt on the site and it will tell you: catastrophic backgracking.

When you click on “Regex Debugger” in the lower left corner, it will tell you how many steps you have gone through, and it will list them all and indicate where the backtracking occurred.

The regular expression in this article stops automatically after 110,000 attempts. This shows that there is something wrong with this regular expression that needs to be improved.

But when I test it with our modified regular expression, the following regular expression.

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).) ++([A-Za-z0-9-~\\\/])+$Copy the code

The tooltip completes the check in just 58 steps.

A character difference, the performance gap is tens of thousands of times.

Tree righteousness has words

It’s amazing how a small regular expression can wear down a CPU. This also gives us a warning to write programs at ordinary times, when encountering regular expressions, we should pay attention to the greedy pattern and backtracking problems, otherwise every expression we write is a thunder.

By looking up online materials, I found that students in LAZADA, Shenzhen Ali Center, also encountered this problem in 2017. They also didn’t find any problems in the test environment, but once they got online they had 100% problems with the CPU, and their problems were almost exactly the same as ours. Interested friends can click to read the original article to see their later summary of the article: a blood caused by regular expressions – Mingzhijian Zhiyuan – blog garden

Although I have finished this article, I have not explained the principle of NFA automata, especially the explanation of lazy mode and exclusive mode. Because THE NFA automata is really not so easy to understand, so we still need to learn and strengthen in this aspect. Welcome friends in the know to learn and exchange, promote each other.


Question and answer

How do I know regular expressions?

reading

Regular expressions

Regular extended exercise

C# regular expressions

Has been authorized by the author tencent cloud + community release, the original link: https://cloud.tencent.com/developer/article/1148472?fromSource=waitui

Welcome toTencent Cloud + communityOr follow the Cloud Community wechat public number (QcloudCommunity), the first time to obtain more massive technology practice dry goods oh ~