directory

  • Regular expression engine
  • Backtracking of NFA automata
  • The solution
  • The tree said something

The article was first publishedBlog Garden – Chen ShuyiClick to jump to the original textThe Trap hidden in regular Expressions

A few days ago, an online project monitoring information suddenly reported an exception. After logging on to the machine, I checked the usage of related resources and found that the CPU utilization rate was nearly 100%. Using Java’s own thread Dump tool, we exported the faulty stack information.

You can see that all of the stack points to a method called validateUrl, and there are more than 100 such error messages on the stack. By reviewing the code, we know that the main function of this method is to verify that the URL is valid.

It is strange how a regular expression can lead to high CPU utilization. To solve the recurrence problem, we extracted the key code and did a simple unit test.

public static void main(String[] args) { String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~\\\\/])+$"; String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAf Y1Vw__%5EDadIfJgiEf"; if (bugUrl.matches(badRegex)) { System.out.println("match!!" ); } else { System.out.println("no match!!" ); }}Copy the code

When we run the example above, we can see from the resource monitor that the CPU utilization of one process named Java jumped to 91.4%.

From here, we can almost conclude that this regular expression is responsible for the high CPU utilization!

Instead, we focused on that regular expression:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~\\/])+$Copy the code

This regular expression looks fine and can be broken down into three parts:

The first part matches HTTP and HTTPS protocols, and the second part matches WWW. Character, the third part matches many characters. I stared at this expression for a long time, but I didn’t see that there was anything wrong with it.

In fact, the key reason for the high CPU usage here is that the Java regular expression engine is implemented by NFA automata, which can backtrack when performing character matching. When backtracking occurs, it can take a long time, from minutes to hours, depending on the number and complexity of backtracking.

Now, if you don’t know what backtracking is, it’s a little confusing. It doesn’t matter. Let’s start with the principles of regular expressions.

Regular expression engine

Regular expression is a very convenient matching symbol, but to achieve such a complex and powerful matching syntax, it is necessary to have a set of algorithms to achieve this algorithm, and the thing to implement this algorithm is called regular expression engine. In short, there are two ways to implement a regular expression engine: DFA (Deterministic Final Automata) and NFA (Non-Deterministic Finite Automaton).

For these two kinds of automata, they have their own differences, and we will not go into their principles here. Simply put, DFA automata is linear in time complexity and more stable, but has limited functions. The time complexity of NFA is very variable, sometimes very good, sometimes not so good, depending on the regular expression you write. However, NFA is more powerful, so languages including Java,.NET, Perl, Python, Ruby, PHP and so on all use NFA to implement their regular expressions.

So how does the NFA automata match? Let’s use the following characters and expressions as examples.

text="Today is a nice day."
regex="day"Copy the code

It is important to remember that the NFA matches against regular expressions. That is, the NFA automatically reads the regular expression character by character and matches it with the target string. If the match is successful, the next character of the regular expression is changed. Otherwise, it continues to compare with the next character of the target string. Maybe you don’t understand, ok, let’s go through the above example step by step.

  • First, get the first card of the regular expression: d. So that compares to the character of the string, the first character of the string is T, doesn’t match, change to the next one. The second one is o, which also doesn’t match. Let’s switch to the next one. The third is d, and if it matches, the second character of the regular expression, a, is read.
  • The second card of the regular expression, A, is read. The comparison continues with the fourth character of the string, a, and matches again. Then read the third character of the regular expression: y.
  • The third card of the regular expression, y, is read. Then it goes on to the fifth character of the string, y, and matches again. If you try to read the next character of the regular expression and find it missing, the match ends.

The matching process above is the matching process of NFA automata, but the actual matching process will be much more complicated than this, but the principle remains the same.

The article was first publishedBlog Garden – Chen ShuyiClick to jump to the original textThe Trap hidden in regular Expressions

Backtracking of NFA automata

Now that you know how NFA does string matching, you can move on to the focus of this article: backtracking. To better explain backtracking, let’s also use the following example.

Text = "abbc" regex = "ab {1, 3} c"Copy the code

The purpose of the above example is simple: match strings that start with a, end with C, and have 1-3 B characters in between. The NFA parse process looks like this:

  • First, it reads the first card of the regular expression, a, and compares it with the first character of the string. The second character of the regular expression is read.
  • Read the regular expression second card b{1,3} and the second character b of the string, match. However, because b{1,3} represents 1-3 B strings and the greedy nature of NFA automata (that is, to match as many as possible), the matching characters of the next regular expression are not read at this time. Instead, b{1,3} is still used to compare with the third character b of the string, and the match is still found. We continue to compare b{1,3} with the fourth character of the string, c, and find a mismatch. This is where backtracking occurs.
  • How does backtracking work? After a backtrace occurs, the fourth character C of the string we have read is spit out and the pointer goes back to the position of the third string. After that, the program reads the next operator c of the regular expression and the next character C of the current pointer for comparison and finds a match. So we read the next operator, but we’re done here.

Let’s go back to the regular expression that validates the URL:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~\\/])+$Copy the code

The URL in question is:

http://www.fapiao.com/dzfp-web/pdf/download?request=6e7JGm38jfjghVrv4ILd-kEn64HcUX4qL4a4qJ4-CHLmqVnenXC692m74H5oxkjgdsYa zxcUmfcOH2fAfY1Vw__%5EDadIfJgiEfCopy the code

We split this regular expression into three parts:

  • Part ONE: Verification protocol.^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://).
  • Part two: Verifying domain names.(([A-Za-z0-9-~]+).) +.
  • Part three: calibration parameters.([A-Za-z0-9-~\\/])+$.

We can see that the regular expression verification protocol http:// part is fine, but when checking www.fapiao.com, it uses XXXX. This is the way to check. So the matching process actually looks like this:

  • Matching to the WWW.
  • Matching to the fapiao.
  • Match to thecom/dzfp-web/pdf/download? request=6e7JGm38jf.....“, you’ll find that because of greedy matching, the program will keep reading the following string for matching, and then it will find no dots, so it will go back character by character.

This is the first problem with this regular expression.

Another problem is that in the third part of the regular expression, we find that the URL in question has underscores (_) and percent signs (%), but not in the regular expression corresponding to the third part. This will result in a long string of characters being matched, then finding a mismatch and eventually backtracking.

This is the second problem with this regular expression.

The solution

Now that I understand that backtracking is the cause of the problem, and I’m actually reducing that backtracking, you can see that if I add an underscore and a percent sign in the third part, it works fine.

public static void main(String[] args) { String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).) +([A-Za-z0-9-~_%\\\\/])+$"; String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAf Y1Vw__%5EDadIfJgiEf"; if (bugUrl.matches(badRegex)) { System.out.println("match!!" ); } else { System.out.println("no match!!" ); }}Copy the code

Run the above program and it will print match in no time!! .

But this is not enough, and if there are other urls that contain random characters, we will have to change them again. Surely that’s not realistic!

In fact, there are three patterns in regular expressions: greedy, lazy, and exclusive.

In the match for quantity, we have +? * {min, Max} four kinds of twice, if used alone, then they are greedy mode.

What if we add one more after them? Symbol, then the original greedy mode will become lazy mode, as few matches as possible. But the lazy pattern can backtrack. Take this example:

Text = "abbc" regex = "ab {1, 3}? c"Copy the code

The first operator a of the regular expression matches the first character A of the string, and the match succeeds. So the second operator of the regular expression b{1,3}? Matches the second character b in the string, and the match succeeds. Because of the minimum matching principle, the third regular expression operator C matches the third character B of the string. So go back and take the second regular expression operator b{1,3}? Matches the third character b in the string, and the match succeeds. The third regular expression operator c matches the fourth character C of the string, and the match succeeds. So the end.

If an extra + sign is added after them, the greedy mode becomes exclusive mode, matching as much as possible, but not backtracking.

Therefore, if you want to solve the problem completely, you need to ensure that functionality is maintained without backtracking. I added a + sign to the second part of the URL validation regular expression above, which looks like this:

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/) (([A-Za-z0-9-~]+).) ++ -->>> (with A + sign) ([a-za-z0-9 -~_%\\ /])+$Copy the code

After that, there is no problem running the original program.

Finally, I recommend a website that can check to see if your regular expressions match the corresponding strings.

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, the URL in question in this article will tilt backgracking after being checked on this website.

When you click on the “Regex Debugger” in the lower left corner, it tells you how many steps have been checked and lists all of them, along with the location where the backtracking occurred.

The regular expression in this article stops automatically after 110,000 attempts. This shows that the regular expression does have a problem and needs to be improved.

But when I tested it with our modified regular expression, the one below.

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).) ++([A-Za-z0-9-~\\\/])+$Copy the code

The tooltip completes the check in just 58 steps.

A character difference, performance gap tens of thousands of times.

The tree said something

It’s amazing how a small regular expression can drag down a CPU. This is also a warning to programmers, when encountering regular expressions to pay attention to greedy patterns and backtracking problems, otherwise we write every expression is a thunder.

Through consulting online materials, I found that my classmates from Shenzhen Ali Center LAZADA also encountered this problem in 2017. They also found no problems in the test environment, but as soon as they went online they had 100% CPU problems, and they had almost exactly the same problems as ours. If you are interested, you can click to read the original article to see their summary: A bloody case caused by regular expressions – Clear ambition and Long Journey – Blog garden

Although this article is finished, the principle of NFA automata, especially the explanation of lazy mode and exclusive mode, is not explained deeply enough. Because NFA automata is really not that easy to understand, there is still a lot to learn in this area. Welcome knowledgeable friends to study and exchange, promote each other.

The article was first publishedBlog Garden – Chen ShuyiClick to jump to the original textThe Trap hidden in regular Expressions

Keywords: Regex Express, CPU Abnormal, 100%CPU, backtracking.