preface

Cloudflare, the well-known CDN vendor, suffered a global outage of nearly half an hour on July 2, 2019. Pages served through Cloudflare returned 502 errors. It was the company's first global outage in almost six years, and the official postmortem did not mince words: "We're ashamed it happened."

According to the official explanation, the outage was caused by a bad regular expression in a firewall rule. When the service processes ran the match, CPU usage spiked and they stopped responding for a long time, making the service unavailable.

You could call this a crime committed by a regular expression, and not the first. Stack Exchange, another well-known IT company, had a half-hour outage back in 2016: the same unreachable pages, the same overloaded CPUs, and the same culprit: catastrophic backtracking in a regex.

Developers use regular expressions every day, so none of us can assume we will escape this trap, or the fault and the blame that come with it. It is worth knowing exactly what backtracking is.

Backtracking

Backtracking, simply put, means that most regex engines will try every possible path to find a match. Take the simple regex `.*x` matched against the string "xy":

  1. Because `*` is greedy by default, `.*` first matches the whole of "xy"; `x` then has nothing left to match and fails.
  2. The engine backtracks: `.*` gives back the "y" it matched (now matching only "x"), and `x` tries to match "y"; it fails again.
  3. The engine backtracks once more: `.*` gives back the "x" and now matches the empty string (`*` allows zero repetitions). `x` then matches "x", and the engine has found a match.
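The steps above can be checked from JS. A quick sketch: after all the backtracking, the match the engine finally settles on is just "x", not "xy".

```javascript
// The engine backtracks twice and ends up matching only "x" at index 0.
const m = /.*x/.exec("xy");
console.log(m[0]);    // "x"
console.log(m.index); // 0
```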

Path diagram for `.*x`:

Execution trace (via Regexp::Debugger):

Quantifiers like `*`, `+`, `?`, `{1,2}`, and even the alternation operator `|`, can each have more than one legal way to match, so there can be more than one path to the final match. To guarantee that a match is found whenever one exists, a typical regex engine backtracks as much as necessary, which means it will try every possible path.

Regex engines fall roughly into two types: NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton). A DFA engine is driven by the input string and never backtracks, so the problems described below cannot occur, but it also cannot support features that depend on backtracking, such as backreferences.

Most regex engines, including the built-in engines of JS and Java, chose the NFA approach in order to support more features, so they have to live with backtracking.

Catastrophic Backtracking: backtracking into hell

When a regular expression offers many positions to backtrack to, the engine may keep backtracking in its attempt to find a match, leaving the program unresponsive for a long time and consuming huge amounts of CPU. This is catastrophic backtracking.

Take `(x+)+y` as an example. This deceptively simple regex handles strings like "xxy" and "xxx" without incident. But hand it a carefully constructed string of 40 x's, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", and the engine descends into backtracking hell: to prove that the nonexistent "y" cannot be matched, it tries every possible path, which takes an enormous amount of time (over 10 minutes with Node.js on my machine).

It helps to break down what the engine actually does, step by step:

  1. `(x+)` matches all 40 x's; `y` fails to match.
  2. The engine backtracks: `(x+)` gives back one x, and `(x+)+` starts a second iteration to match the leftover x. It succeeds, so `(x+)+` has now matched two rounds, (39 x's) and (1 x); `y` fails again.
  3. It backtracks again: the (1 x) round is abandoned, and `y` tries to match that last x; it fails.
  4. It backtracks again: the (39 x's) round gives back one x to become (38 x's), and `(x+)+` matches the remaining 2 x's in a second round, giving (38 x's) and (2 x's); `y` fails.
  5. It backtracks again: the (2 x's) round gives back one x to become (1 x); `y` tries the last x and fails…

Matching 40 x's, `(x+)+` can split them as (40), (39, 1), (38, 2), (38, 1, 1), (37, 3)… an exponential number of combinations, visible to the naked eye. To be certain no match exists, the engine tirelessly tries every one of these paths before it finally, rigorously, reports "no match". By then the time consumed is enough to hang the program outright, with the CPU pegged and the service down.
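A small sketch of the blow-up: each input string below contains no "y", so the engine must rule out every split, and adding two more x's should multiply the time taken by roughly four.

```javascript
// Catastrophic backtracking demo: (x+)+y against pure-x input.
// Kept to n <= 22 so the demo finishes in seconds, not minutes.
const evil = /(x+)+y/;

for (const n of [16, 18, 20, 22]) {
  const input = "x".repeat(n); // no "y": every split must be tried
  const t0 = Date.now();
  evil.test(input);
  console.log(`n=${n}: ${Date.now() - t0} ms`);
}
```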

Identifying regexes that can lead to backtracking hell

Nested Quantifier

Regexes with nested quantifiers, like `(x+)+`, are prone to catastrophic backtracking for the reasons described above, unless the tokens in the inner group are mutually exclusive with their neighbors; `(x+y+)+z`, for example, is safe.

`(.+,)+END`, which matches CSV-like strings such as "1,2,3,4,END", has the same problem: `.` overlaps with the separator ",", so a long field list without "END" (or a long string of ",,,,") sends the engine into the same exponential search. Replace the `.` with `[^,]`, which cannot match the separator: `([^,]+,)+END`.

Tip: most of the time we do not need a wildcard as broad as `.`.
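A sketch contrasting the two forms; the hostile input is a field list with no trailing END, and the timing difference between the two patterns is the point:

```javascript
// Unsafe: "." also matches the "," separator, so (.+,)+ has exponentially
// many ways to carve "1,1,1,…" into fields when END is missing.
const unsafe = /(.+,)+END/;
// Safe: [^,] cannot match the separator, so each split is unique.
const safe = /([^,]+,)+END/;

const hostile = "1,".repeat(20); // no END at the end

let t0 = Date.now();
unsafe.test(hostile);
console.log(`unsafe: ${Date.now() - t0} ms`);

t0 = Date.now();
safe.test(hostile);
console.log(`safe:   ${Date.now() - t0} ms`);
```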

Alternation

Besides quantifiers like `*`, `+`, `?`, and `{1,}`, the alternation operator `|` also produces backtracking paths. If the alternatives overlap, as in `(.|x)*y` where `.` overlaps with `x`, backtracking hell appears as soon as the input contains a long run of x's.

Concretely, matching "xxx" with `(.|x)*` can take the paths (...), (..x), (.x.), (.xx), (x..), (x.x), (xx.), (xxx): again an exponential number of possibilities, visible to the naked eye, and an "xx…x" only a little longer is enough to jam the program.
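A sketch of the overlapping alternation next to an equivalent single-token form (since `.` already matches "x", `.*y` accepts the same strings without the extra paths):

```javascript
const overlapping = /(.|x)*y/; // "." and "x" overlap: exponential paths
const single = /.*y/;          // one token per position: far fewer paths

const input = "x".repeat(20);  // no "y" anywhere, so both must fail

let t0 = Date.now();
overlapping.test(input);
console.log(`(.|x)*y : ${Date.now() - t0} ms`);

t0 = Date.now();
single.test(input);
console.log(`.*y     : ${Date.now() - t0} ms`);
```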

Quantifiers in Sequence

Take `.*.*=.*` as an example: when quantifiers in sequence match overlapping sets of characters (`.` overlaps with `.`), or overlap with the delimiter (`.` overlaps with `=`), backtracking hell also occurs. This is essentially the pattern behind the Cloudflare outage; see the official blog post for details.
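A sketch of this case. I anchor the pattern with `^` here so the failing search stays bounded to one start position; with no "=" in the input, the two overlapping `.*`s still give roughly n² combinations to try, so on engines without special optimizations the timings should grow noticeably faster than linearly:

```javascript
// Quantifiers in sequence: .* and .* overlap, and both overlap with "=".
const seq = /^.*.*=.*/;

for (const n of [2000, 4000, 8000]) {
  const input = "x".repeat(n); // no "=": the match must fail
  const t0 = Date.now();
  seq.test(input);
  console.log(`n=${n}: ${Date.now() - t0} ms`);
}
```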

Other

In the Stack Exchange outage, the simplified version of the offending regex was `\s*$`. It looks normal enough, just trimming trailing whitespace from a line, yet it took the site down when it met a post containing 20,000 consecutive spaces.

The reason is simple: when the line of 20,000 spaces does not actually end in a space, the match fails, and the engine retries starting from the 1st, 2nd, 3rd… space. That produces 20,000 + 19,999 + … + 1, roughly 200,000,000 match steps. Better than the exponential hell above, but still long enough to make the service unavailable.
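The quadratic case can be sketched the same way. I use `\s+$` here (the trailing-whitespace half of the Stack Exchange pattern); the spaces are followed by a non-space, so the match must fail after the engine has retried from every starting position:

```javascript
// Quadratic backtracking: \s+$ against n spaces followed by "x".
// Doubling n should roughly quadruple the time on a backtracking engine.
const trailing = /\s+$/;

for (const n of [2000, 4000, 8000]) {
  const line = " ".repeat(n) + "x"; // the spaces are NOT at end of line
  const t0 = Date.now();
  trailing.test(line);
  console.log(`n=${n}: ${Date.now() - t0} ms`);
}
```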

For a fix, see the Possessive Quantifier section below. The lesson here is that a seemingly normal regex can be deadly.

How to avoid backtracking hell

Replace the NFA regex engine with a DFA one

No NFA, no backtracking; no backtracking, no catastrophic backtracking, end of story. However, swapping out a language's built-in regex engine is not trivial, and using a DFA engine means giving up features such as backreferences.

Use Possessive Quantifiers or Atomic Grouping

Appending a `+` to a quantifier, as in `x*+`, puts `x*` into possessive (exclusive) mode; that extra `+` is the possessive quantifier. Once `x*+` has matched the whole of "xxx", it will not give characters back one by one when the engine wants to backtrack.

The benefit of a possessive quantifier is that it prunes backtracking paths and improves performance. Note, though, that it can also eliminate matches the non-possessive version would find: `.*x` can match "abcx", but `.*+x` can never match anything, because `.*+` swallows the final "x" and refuses to give it back.

Atomic grouping plays a similar role to possessive quantifiers. It is written `(?>x*)`, and enjoys wider support among regex engines.

However, JS supports neither.
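One workaround in JS, sketched here as an assumption-laden trick rather than a standard feature: ECMAScript lookaheads are atomic (the engine never backtracks into a lookahead once it has succeeded), so `(?=(p))\1` emulates an atomic `p`. The group captures the greedy match and the backreference re-consumes exactly that text:

```javascript
// Ordinary greedy .*: it can give the final "x" back, so the match succeeds.
const greedy = /^.*x$/;
// Emulated possessive .*+: the lookahead captures the greedy match and
// \1 consumes exactly that text; nothing is ever given back.
const atomic = /^(?=(.*))\1x$/;

console.log(greedy.test("abcx")); // true
console.log(atomic.test("abcx")); // false: (.*) kept "abcx", no "x" left
```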

Set the timeout value

Some regex engines (such as .NET's) support a timeout, which prevents a regex match from leaving the service unresponsive indefinitely. The timeout value is itself a tradeoff: too long and it raises the worst-case response time of the service; too short and the engine may not have time to find a legitimate match.

For a regex engine that does not support timeouts, you can run the regex in a dedicated thread and time out the thread instead. In JS, consider the Web Worker API in the browser, or Node's child_process / worker_threads modules.

Review regular expressions in the code

Armed with the problem patterns above, hardcoded regular expressions can be reviewed manually in code review and fixed as soon as a suspect is found. There are also tools that scan for them, such as the detect-unsafe-regex rule of eslint-plugin-security for JS projects.
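As a sketch, assuming eslint-plugin-security is installed as a dev dependency, enabling the rule in a classic `.eslintrc.js` looks like this:

```javascript
// .eslintrc.js (minimal sketch; assumes eslint-plugin-security is installed)
// security/detect-unsafe-regex flags regexes with exponential-time potential.
module.exports = {
  plugins: ["security"],
  rules: {
    "security/detect-unsafe-regex": "error",
  },
};
```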

Dynamically generated regexes are harder to review in advance; developers need to consider carefully whether a dynamically generated regex can fall into backtracking hell.

(In JS, hardcoded regexes also save one compilation step compared with regexes created dynamically via new RegExp().)

Avoid running user-submitted regular expressions

This one is obvious: running user-submitted regular expressions on a server makes a ReDoS (Regular expression Denial of Service) attack all too easy.

If business requirements leave no choice, at least set a timeout as described above.

Test the Worst Case

Most problematic regexes behave fine on normal input and only fall into backtracking hell on extreme input.

When writing a regex, the developer should test it against a worst case. If you don't find it, an attacker will find it for you.
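A minimal sketch of such a test; the regex under test and the 100 ms budget are both assumptions for illustration:

```javascript
// Hypothetical regex under test: a comma-separated list of numbers.
const candidate = /^\d+(,\d+)*$/;
// Worst case: a long run of the repeated token, ending in a mismatch.
const worstCase = "1".repeat(50000) + "!";

const start = Date.now();
const matched = candidate.test(worstCase); // should be false, and fast
const elapsed = Date.now() - start;

console.log(matched);
console.log(elapsed < 100 ? "fast enough" : "suspiciously slow");
```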

conclusion

Regular expressions are a complex and powerful tool. Used well, they solve pattern-matching problems with ease; used carelessly, they sink into backtracking hell. Only by understanding how a regex engine backtracks can we understand what makes backtracking catastrophic, and avoid the trap that tripped up both Cloudflare and Stack Exchange.

reference

Details of the Cloudflare outage on July 2, 2019: blog.cloudflare.com/details-of-…

Stack Exchange Outage Postmortem – July 20, 2016: stackstatus.net/post/147710…

Runaway Regular Expressions: Catastrophic Backtracking: www.regular-expressions.info/catastrophi…

Preventing Regular Expression Denial of Service (ReDoS): www.regular-expressions.info/redos.html

Regex diagramming tool: www.debuggex.com/

Regex visual debugging tool: metacpan.org/pod/Regexp:…