How Regular Expressions Work (I)

1.1 Introduction

This series of articles introduces how a regular expression engine works. Understanding these principles is key to writing effective regular expressions. It also helps you avoid many common mistakes and reduces the time you spend guessing how a regular expression will behave.

2.1 Literal Characters

The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. If the string is Jack is a boy, it matches the a after the J. The fact that this a sits in the middle of a word does not matter to the regex engine. If you want to control where the a may appear, for example only at the start or end of a word, you need to use word boundaries. We will discuss these in a later chapter.

In fact, this regex can also match the second a in the string, but only if you tell the regex engine to continue searching after the first match, typically through a function call.

Similarly, the regex cat matches cat in About cats and dogs. This regex consists of three literal characters. To the regex engine it means: find a c, immediately followed by an a, immediately followed by a t.

Note: By default, regex engines are case-sensitive. The regex cat does not match Cat unless you tell the engine to ignore case.
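The behavior described in this section can be tried out with a short sketch. It assumes Python's standard re module as the regex engine; the variable names are only illustrative.

```python
import re

text = "Jack is a boy"

# re.search reports only the first (leftmost) match of the literal "a".
first = re.search(r"a", text)
first_pos = first.start()  # 0-based index of the "a" right after "J"

# To reach later matches, ask the engine to keep searching, e.g. with finditer.
all_positions = [m.start() for m in re.finditer(r"a", text)]

# Three literal tokens: "c", then "a", then "t".
cat = re.search(r"cat", "About cats and dogs")

# Matching is case-sensitive unless a flag such as re.IGNORECASE is passed.
strict = re.search(r"cat", "About Cats and dogs")
relaxed = re.search(r"cat", "About Cats and dogs", re.IGNORECASE)
```

Here finditer plays the role of the "function call" that asks the engine for the second and later matches.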

2.2 Metacharacters

To handle more complex matches, some characters are reserved as special characters. Regular expressions have 12 such metacharacters:

\ ^ $ . | ? * + ( ) [ {

Most of these special characters cause an error when used on their own.

If you want to use any of these characters as a literal character, you must escape it with a backslash (\). For example, to match 1+1=2, the correct regex is /1\+1=2/, because the plus sign has a special meaning.

Note: /1+1=2/ is also a valid regex, but it does not match 1+1=2. It matches 111=2 in 123+111=234, because of the special meaning of the plus sign.

If you forget to escape a metacharacter in a position where its special use is not valid, as in /+1/, the regex itself is invalid and the program throws an exception.
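The three cases above (escaped plus, unescaped plus, and an invalid regex) can be checked with a minimal sketch. It assumes Python's re module, where an invalid pattern raises re.error; other engines report the error in their own way.

```python
import re

# Escaped plus: matches the literal text "1+1=2".
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus means "one or more of the preceding token".
unescaped_on_sum = re.search(r"1+1=2", "1+1=2")
unescaped_on_digits = re.search(r"1+1=2", "123+111=234")

# A quantifier with nothing to repeat is rejected when the regex is compiled.
try:
    re.compile(r"+1")
    compiled = True
except re.error:
    compiled = False
```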

2.2.1 Escaping `{`

Most regex engines treat { as a literal character unless it is part of a repetition operator such as a{1,3}. So in general you do not need to escape {, although escaping it does no harm. A few engines are stricter. For example, Java requires a literal { to be escaped, and Boost and std::regex require both { and } to be escaped.

2.2.2 Escaping `]`

] is a literal character when used outside of a character class. Inside a character class, different rules apply. We will discuss them in the section on character classes.

There are exceptions, of course. In std::regex and Ruby, ] needs to be escaped even outside a character class.

2.2.3 Other characters

With the exception of these metacharacters, no other characters need to be escaped with a backslash. In fact, you should not escape them, because a backslash combined with a literal character often forms a regex token with a special meaning. For example, \d matches any digit from 0 to 9.

2.3 Special characters and programming languages

Single quotes (') and double quotes (") are not special characters in regular expressions, which may surprise you if you are a programmer. As far as the regex engine is concerned, you do not need to escape them, for example when you type a regex into a text editor's advanced search feature.

When using regular expressions in source code, keep in mind that some characters also have a special meaning in the programming language itself, because string literals are processed by the compiler before they reach the regex engine. In C++ code, the regex 1\+1=2 must be written as "1\\+1=2". The C++ compiler turns each escaped backslash in the string literal into a single backslash and passes the result to the regex engine. As another example, the regex c:\\temp matches the string c:\temp, and in C++ source you write it as "c:\\\\temp". Put simply, each literal backslash you want to match takes four backslashes in the source code.
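The text describes C++, but the same two-stage processing applies to ordinary string literals in many languages. The sketch below uses Python instead (an assumption of this example, not something the text prescribes) and also shows raw string literals, which pass backslashes to the regex engine unchanged.

```python
import re

path = "c:\\temp"  # the 7-character text c:\temp

# Ordinary string literal: the language halves the backslashes first,
# so four in the source become the two that the regex engine needs.
plain = re.search("c:\\\\temp", path)

# Raw string literal: backslashes reach the regex engine unchanged.
raw = re.search(r"c:\\temp", path)

# The same idea for the regex 1\+1=2 in an ordinary string literal.
plus_plain = re.search("1\\+1=2", "1+1=2")
```

Raw strings are usually the more readable choice in Python because the source text then looks like the regex itself.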

The use of regex in programming languages is discussed in a later section.

3.1 Line Breaks

You can match non-printable characters in regular expressions with special escape sequences.

\t: Matches a tab
\r: Matches a carriage return
\n: Matches a line feed (newline)
\a: Matches the bell character
\e: Matches the escape character
\f: Matches a form feed

On Windows, \r\n is used to end a line, while UNIX uses \n alone.
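As a small illustration of the two line-ending conventions, the following sketch (assuming Python's re module; the sample strings are made up) uses \r?\n so that one regex handles both.

```python
import re

windows_text = "first\r\nsecond\r\n"
unix_text = "first\nsecond\n"

# \r?\n matches an optional carriage return followed by a line feed,
# so it covers both the Windows and the UNIX convention.
line_break = re.compile(r"\r?\n")

windows_lines = line_break.split(windows_text)
unix_lines = line_break.split(unix_text)

# \t in the regex matches a tab character in the text.
tab = re.search(r"\t", "name\tvalue")
```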

\R is a special escape that matches any line break, including Unicode line breaks. What sets it apart from \r and \n is that it treats a CRLF pair as an indivisible unit: when a match attempt of \R begins just before a CRLF pair, a single \R matches the whole pair, and \R does not backtrack to match only the CR. \R can still match a lone CR or a lone LF, but \R{2} or \R\R cannot match a single CRLF pair, because the first \R consumes the whole pair and leaves nothing for the second \R to match.

However, some engines do not follow this specification. For example, in Java 9 \R\R can match a single CRLF pair, and in Perl \R{2} can match a single CRLF pair.

\R only looks forward when matching a CRLF pair. The regex \r\R can therefore match a CRLF pair: \r matches the CR, and the remaining lone LF is a valid line break for \R. This behavior is consistent across all engines.

4.1 Engine Classification

Understanding the internals of the regular expression engine will help you write more efficient expressions and help you quickly debug exceptions in regular expressions.

In each of the following sections we introduce a new regex feature and then explain in detail how the engine processes it. With this understanding, you can write regular expressions quickly without relying on regex visualization tools. Understanding the mechanics of an engine can be difficult, but it helps you avoid some common mistakes.

With the basics covered, we’ll introduce a number of interesting application examples that you can quickly apply to your projects.

While there are many different implementations of regex engines, they generally fall into two categories: text-directed engines and regex-directed engines. Almost all modern regex engines are regex-directed, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in this kind of engine.

4.1.1 Regex-directed Engine

A regex-directed engine walks through the regex, trying to match the next token in the regex to the next character in the string. If the current token matches, the engine advances to the next token and the next character. If the match fails, the engine backtracks to an earlier position in the regex and in the string where it can try a different path. Backtracking is covered in detail in later chapters.

4.1.2 Text-directed Engine

A text-directed engine walks through the text to be matched. It tries all permutations of the regex before advancing to the next character in the string. A text-directed engine never backtracks, so its matching process is relatively simple. In most cases, the two kinds of engine find the same matches.

This tutorial focuses on regex-directed engines, so every engine mentioned is regex-directed by default, unless the two kinds of engine would match differently. That only happens when the regex uses alternation and two alternatives can match at the same position.
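The alternation case can be observed with a small sketch. It assumes Python's re module, which is a backtracking (regex-directed) engine; the pattern and subject are illustrative. The comment about text-directed engines reflects the usual longest-match convention of such engines and is stated here as an assumption, since it cannot be demonstrated with re itself.

```python
import re

# Both alternatives can match at position 0 of the subject.
first_wins = re.search(r"regex|regex not", "regex not").group()

# Swapping the alternatives changes the result in a regex-directed engine,
# because the alternatives are tried from left to right.
swapped = re.search(r"regex not|regex", "regex not").group()

# Assumption: a text-directed engine would typically report the longer
# match "regex not" for both orderings.
```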

4.2 The regular expression always matches the leftmost result

A very important property of regex engines is that they always return the leftmost match, even if a better match could be found later in the string. When a regex is applied to a string, the engine starts at the first character of the string and tries all permutations of the regex there. Only if every permutation fails does the engine move on to the second character and try all permutations again. As a result, the engine returns the leftmost match.

Now let’s take an example. We apply the regex cat to the string He captured a catfish for his cat. First, the engine tries the token c against the first character of the string, H. The match fails, and there are no other permutations to try because the regex consists only of literal characters. The engine then tries c against the second character, e, which also fails, as does the space that follows. At the fourth character, the token c matches c, so the engine continues with the token a and the fifth character, a, which also matches. But the third token, t, does not match the sixth character, p. At this point the engine knows that the regex cannot match starting at the fourth character, so it starts over with the first token c at the fifth character, a. That attempt fails, and so does every attempt until the 15th character, where c matches c, followed by a and t.

At this point the entire regex has matched, starting at the 15th character of the string, and the engine is eager to report the match. It does not keep searching for a better match later in the string, because the first match is considered good enough.

In this example, both kinds of engine return the same result. The way a regex engine works largely determines what it matches. Some of the matches in later examples may surprise you, but as long as you keep this search rule in mind, you can logically deduce the engine’s result.
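The walk-through above can be mimicked with a deliberately naive sketch. The function find_literal is a hypothetical helper written for this illustration; it handles only literal patterns and is not how a real engine is implemented. The result is compared with Python's re module.

```python
import re

def find_literal(pattern, text):
    """Return the 0-based index of the leftmost match of a literal pattern, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare token by token and give up on this start at the first mismatch.
        for offset, token in enumerate(pattern):
            if text[start + offset] != token:
                break
        else:
            return start  # every token matched: report immediately
    return -1

text = "He captured a catfish for his cat."
index = find_literal("cat", text)      # 0-based, so 14 means the 15th character
same = re.search(r"cat", text).start()
```

The early return is the "eager" part: once a start position succeeds, later positions are never examined.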

5.1 Character Classes (Character sets)

A character class (also called a character set) matches one character out of a set of characters. The syntax is simple: write the characters you want to match between square brackets. For example, [ae] matches either a or e. You can match gray or grey with gr[ae]y. Gray is the American English spelling and grey the British English one.

A character set matches only a single character. For example, gr[ae]y does not match graay or graey. The order of the characters inside a character set does not matter; the result is the same in any order.

You can use a hyphen (-) to specify a range. For example, [0-9] matches a single digit. You can use several ranges at once: [0-9a-fA-F] matches a single hexadecimal digit, regardless of case. You can also combine individual characters with ranges: [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. As before, the order of the characters and ranges has no effect on the result.

Character sets are one of the most commonly used features of regular expressions. You can use one to find a word even if it is misspelled, such as sep[ae]r[ae]te or li[cs]en[cs]e. You can use one to find a variable name, such as [A-Za-z_][A-Za-z_0-9]*, or a C-style hexadecimal number, such as 0[xX][A-Fa-f0-9]+.
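The character sets discussed in this section can be exercised with a short sketch, assuming Python's re module. The sample strings are invented for the illustration.

```python
import re

# gr[ae]y matches exactly one character from the set between "gr" and "y".
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# A single hexadecimal digit, upper or lower case.
hex_digit = re.compile(r"[0-9a-fA-F]")
is_hex = [bool(hex_digit.fullmatch(ch)) for ch in "7fGx"]

# A variable name: a letter or underscore, then letters, digits or underscores.
identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
names = identifier.findall("total_1 = x + _tmp")

# A C-style hexadecimal number.
c_hex = re.search(r"0[xX][A-Fa-f0-9]+", "mask = 0x1F;").group()
```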

5.2 Negated Character Classes

Placing a caret (^) directly after the opening square bracket negates the character set, so that it matches any character that is not in the set. Unlike the dot (.), a negated character set also matches invisible line break characters. If you do not want it to match line breaks, include them in the set. For example, [^0-9\r\n] matches any character that is neither a digit nor a line break.

A negated character set still has to match a character. q[^u] does not mean "a q that is not followed by a u". It means "a q followed by a character that is not a u". q[^u] does not match the q in Iraq. It does match the q and the space after it in Iraq is a country. If you want to match only the q and not the character after it, use negative lookahead: q(?!u). We will discuss lookahead later.
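The difference between q[^u] and q(?!u) is easy to verify with a small sketch, assuming Python's re module.

```python
import re

# q[^u] needs a character after the q, so it fails at the end of the string.
at_end = re.search(r"q[^u]", "Iraq")

# Here the negated set consumes the space that follows the q.
with_space = re.search(r"q[^u]", "Iraq is a country").group()

# Negative lookahead checks what follows without consuming it.
lookahead = re.search(r"q(?!u)", "Iraq").group()

# Any character that is neither a digit nor a line break.
non_digits = re.findall(r"[^0-9\r\n]", "a1\nb2")
```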

5.3 Metacharacters in a character set

In most regex engines, only four characters act as metacharacters inside a character set: ], \, ^, and -. If you want to use one of them as a literal inside a character set, you generally need to escape it. Other metacharacters do not need to be escaped. For example, you can use [+*] to match a plus sign or an asterisk. You can of course escape all metacharacters anyway. This does not cause an error, but it makes the regex harder to read.

5.3.1 Backslash escape

If you want to match a \ in a character set, you need to escape it with another backslash. For example, [\\x] matches either a \ or an x. The characters ], ^, and - do not need to be escaped as long as they are placed where they cannot be misread.

5.3.2 Escaping `^`

There is no need to escape ^ as long as it does not come directly after the opening [. For example, [x^] matches an x or a ^.

5.3.3 Escaping `]`

] does not need to be escaped as long as it comes directly after the opening [ or after the negating ^. For example, []x] matches either ] or x, and [^]x] matches any character except ] and x.

In JavaScript this rule does not hold. There, [] is an empty set that cannot match any character, and [^] matches any character.

5.3.4 Escaping `-`

In the following positions, - does not need to be escaped:

Directly after the opening [, e.g. [-x]
Directly before the closing ], e.g. [x-] or [^x-]
Directly after the negating ^, e.g. [^-x]

A hyphen in any other position where it cannot form a range may be treated as a literal or may cause an error. Regex engines are inconsistent about this.

If you are using an engine that supports Unicode, you can also use Unicode escapes in character sets, such as [\u20AC].

Many regex tokens that work outside character classes can also be used inside character classes. This includes character escapes, octal escapes, and hexadecimal escapes for non-printable characters. For flavors that support Unicode, it also includes Unicode character escapes and Unicode properties. [$\u20AC] matches a dollar or euro sign, assuming your regex flavor supports Unicode escapes.
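The placement rules of section 5.3 can be checked one by one with the sketch below. It assumes Python's re module, which follows the "most engines" behavior described here (unlike JavaScript); other engines may differ, especially for hyphens.

```python
import re

# Most metacharacters are literals inside a character set.
plus_star = re.findall(r"[+*]", "1+2*3")

# A backslash must be escaped inside the set.
backslash_or_x = re.findall(r"[\\x]", r"x\y")

# ^ is a literal when it is not the first character of the set.
caret = re.findall(r"[x^]", "x^y")

# ] is a literal directly after the opening [ or after the negating ^.
closing_bracket = re.findall(r"[]x]", "a]x")
not_bracket_or_x = re.findall(r"[^]x]", "a]x")

# - is a literal at the start or at the end of the set.
hyphen_first = re.findall(r"[-x]", "x-y")
hyphen_last = re.findall(r"[x-]", "x-y")

# A Unicode escape inside a set: dollar sign or euro sign.
currency = re.findall(r"[$\u20AC]", "5$ or 5\u20ac")
```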

5.4 Quantifiers on character sets

If you put a quantifier (for example ?, * or +) after a character set, it repeats the entire character set, not just the character that the set matched. For example, [0-9]+ can match both 837 and 222.

If you want to repeat the matched character rather than the character set, use a backreference. For example, ([0-9])\1+ matches 222 but not 837. Applied to 833337, it matches 3333. If that is not what you want, use lookaround.
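A brief sketch, assuming Python's re module, contrasts repeating the character set with repeating the matched character through a backreference.

```python
import re

# The quantifier repeats the set: each digit may differ from the previous one.
digits = re.compile(r"[0-9]+")
mixed = digits.fullmatch("837")
same = digits.fullmatch("222")

# The backreference \1 repeats whatever the group actually matched.
repeated = re.compile(r"([0-9])\1+")
r222 = repeated.search("222").group()
r837 = repeated.search("837")
r833337 = repeated.search("833337").group()
```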

5.5 How character sets work

In this section we use an example to illustrate how the engine processes a character set. We apply gr[ae]y to the string Is his hair grey or gray? and the match is grey. We have already studied how literal characters are matched; now let’s look at how a character set with multiple permutations is matched.

During matching, nothing happens for the first 12 characters because none of them matches g. At the 13th character, g finally matches the first token of the regex, g. The engine then tries the rest of the regex against the string, and r matches r. Next the engine tries [ae] against the following character, e. Because this token is a character set, the engine tries each member of the set against e. It first tries a, which fails. That does not yet rule out a match starting at the 13th character, because there is another permutation to try. The engine then tries the second member of the set, e, and this time the match succeeds. Finally the engine tries the last token, y, against the next character, which also succeeds.

At this point the entire regex has matched. You may have noticed that gray in the string could also match, but because the engine returns the leftmost match, it does not continue to the next possible result unless you tell it to search again through a function call.
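The trace above can be reproduced with a simplified sketch. The helper match_at and the token list are hypothetical constructs for this illustration: each token is modeled as the set of characters it accepts, which is enough for gr[ae]y but not for real regex features. The result is compared with Python's re module.

```python
import re

def match_at(tokens, text, start):
    """Check whether every token accepts the character at its position."""
    for offset, allowed in enumerate(tokens):
        pos = start + offset
        if pos >= len(text) or text[pos] not in allowed:
            return False
    return True

text = "Is his hair grey or gray?"

# gr[ae]y as four tokens; the third one accepts either "a" or "e".
tokens = [set("g"), set("r"), set("ae"), set("y")]

# Try each start position from the left and stop at the first success.
first_start = next(i for i in range(len(text)) if match_at(tokens, text, i))

first = re.search(r"gr[ae]y", text)
all_matches = [m.group() for m in re.finditer(r"gr[ae]y", text)]
```

As in section 2.1, finditer stands in for the function call that asks the engine for the next match, which is how gray is found as well.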