Regular expressions have long troubled many programmers, myself included. Most of the time, when we need a regular expression during development, we open Google or Baidu, search, and copy and paste the result. The next time the same problem comes up, the same scene repeats. Because regular expressions are such a versatile technique, I believe it pays to understand and master them properly. Hopefully this article will help you see how the symbols of regular expressions relate to one another and form a body of knowledge you can draw on next time, without a search engine.

What exactly is a regular expression

A regular expression is a tool for pattern matching, used to search and replace strings. It originated in mathematical work done in the 1950s and was later adopted in computing. As the name suggests, it is an expression that describes a rule. The underlying principle is simple: pattern matching is carried out with a state machine. Regexper.com is a great tool for visualizing your own regular expressions, for example:

/\d\w+/

If you are interested in how the algorithm is implemented, you can read Introduction to Algorithms.

Start with characters

To learn a body of knowledge systematically, we should start from its basic components. Regular expressions consist of characters and metacharacters. Characters are easy to understand: they are ordinary encoded characters, and in regular expressions they are usually digits and letters. Metacharacters, also known as special characters, carry special meaning. For example, ^ means “not” and | means “or”. Powerful patterns can be built from these metacharacters. Let’s start with these basic units and learn how to build regular expressions.

A single character

The simplest regular expression consists of plain digits and letters with no special meaning; each one matches itself, one to one. To find the character ‘a’ in the word ‘apple’, just use the regular expression /a/.

But to match a special character, we need our first metacharacter, \, the escape character. As the name suggests, it makes the character that follows lose its original meaning. Here’s an example:

Suppose I want to match the symbol *. Since * is itself a special character, I put the escape metacharacter \ in front of it so that it loses its special meaning:

/\*/
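
As a quick check of the two cases above, here is a minimal JavaScript sketch. The variable names literalA and literalStar are only illustrative and do not come from the article.

```javascript
// A plain letter matches itself: find "a" in "apple".
const literalA = /a/;
console.log(literalA.test('apple')); // true

// A bare * is a quantifier, so it must be escaped with \
// before it can match a literal asterisk.
const literalStar = /\*/;
console.log(literalStar.test('2 * 3')); // true
console.log(literalStar.test('2 + 3')); // false
```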

Conversely, putting the escape symbol in front of certain ordinary characters gives them a special meaning. We often need to match special characters such as spaces, tabs, carriage returns, and newlines, and these are matched with escape sequences. To make them easier to remember, I have organized them into the following table, together with a mnemonic for each:

Special character      Regular expression   Mnemonic
Newline                \n                   new line
Form feed              \f                   form feed
Carriage return        \r                   return
Whitespace character   \s                   space
Tab                    \t                   tab
Vertical tab           \v                   vertical tab
Backspace              [\b]                 backspace; the [] is needed to avoid a clash with \b
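
A few of the escapes in the table can be tried out with the short sketch below. The sample string and the names text and parts are made up for illustration.

```javascript
// A sample string containing a tab, a newline, and a space.
const text = 'name\tvalue\nnext line';

console.log(/\t/.test(text)); // true: the string contains a tab
console.log(/\n/.test(text)); // true: the string contains a newline
console.log(/\r/.test(text)); // false: there is no carriage return

// \s matches any single whitespace character, so splitting on \s+
// breaks the text at tabs, newlines, and spaces alike.
const parts = text.split(/\s+/);
console.log(parts); // ['name', 'value', 'next', 'line']

// Inside [], \b means the backspace character, not a word boundary.
console.log(/[\b]/.test('a\bc')); // true
```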

More characters

So far each character in the pattern maps one-to-one: it matches exactly one specific character. That is obviously not enough. By introducing sets, ranges, and wildcards, we can achieve one-to-many matching.

In regular expressions, a set is defined with the square brackets [ and ]. For example, /[123]/ matches any one of the characters 1, 2, or 3. What if I want to match every digit? Writing out 0 through 9 is tedious, so the metacharacter - can be used to express a range: /[0-9]/ matches any digit, and /[a-z]/ matches any lowercase letter.
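
The following sketch shows sets and ranges in JavaScript. The test strings and variable names are illustrative assumptions; the g flag used here (find all matches) is explained later in the article.

```javascript
// A set matches any ONE of the listed characters.
const oneToThree = /[123]/;
console.log(oneToThree.test('room 2')); // true
console.log(oneToThree.test('room 7')); // false

// A range inside a set: every digit.
const digits = 'a1b22c'.match(/[0-9]/g);
console.log(digits); // ['1', '2', '2']

// A range inside a set: every lowercase letter.
const lower = 'Hi Bob'.match(/[a-z]/g);
console.log(lower); // ['i', 'o', 'b']
```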

Even with sets and ranges, matching several kinds of characters still means listing them all, which is tedious. So regular expressions provide a set of shorthand forms that each match a whole class of characters:

Match range                                       Regular expression   Mnemonic
Any character except a newline                    .                    a period stands for everything except the sentence terminator
A single digit, [0-9]                             \d                   digit
Any character except a digit, [^0-9]              \D                   not digit
A single word character, [a-zA-Z0-9_]             \w                   word
Any non-word character, [^a-zA-Z0-9_]             \W                   not word
Whitespace (spaces, tabs, form feeds, newlines)   \s                   space
Any non-whitespace character                      \S                   not space
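
To see what each shorthand picks out, here is a small sketch run against one made-up sample string (the name sample and its contents are assumptions for illustration).

```javascript
const sample = 'id_42 ok!';

const digitChars = sample.match(/\d/g);     // each digit
const nonDigitRuns = sample.match(/\D+/g);  // runs of non-digits
const wordRuns = sample.match(/\w+/g);      // runs of word characters
const nonWordChars = sample.match(/\W/g);   // each non-word character
const nonSpaceRuns = sample.match(/\S+/g);  // runs of non-whitespace

console.log(digitChars);   // ['4', '2']
console.log(nonDigitRuns); // ['id_', ' ok!']
console.log(wordRuns);     // ['id_42', 'ok']
console.log(nonWordChars); // [' ', '!']
console.log(nonSpaceRuns); // ['id_42', 'ok!']

// The dot matches any character except a newline.
console.log(/a.c/.test('abc'));  // true
console.log(/a.c/.test('a\nc')); // false
```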

Loops and repetition

That covers one-to-one and one-to-many character matching. Next, let’s see how to match more than one character at a time. To do so, we simply repeat the rules we already have. Depending on the number of repetitions, a pattern can match zero times, once, many times, or a specific number of times.

0 | 1

The metacharacter ? matches the preceding character zero times or once. Suppose you want to match both color and colour: the pattern must match whether or not the character u is present. So your regular expression should look like this: /colou?r/.
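
A minimal sketch of the optional u in JavaScript; the variable name colorPattern is illustrative.

```javascript
// u? means "zero or one u".
const colorPattern = /colou?r/;

console.log(colorPattern.test('color'));   // true: zero u
console.log(colorPattern.test('colour'));  // true: one u
console.log(colorPattern.test('colouur')); // false: two u's are too many
```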

>= 0

The metacharacter * matches the preceding character zero or more times, with no upper limit. It is usually used to skip over optional parts of a string.

>= 1

The metacharacter + matches the preceding character one or more times.
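
The difference between * and + shows up only when the repeated character is absent. The sketch below compares the two on made-up inputs; zeroOrMore and oneOrMore are illustrative names.

```javascript
const zeroOrMore = /ab*c/; // a, then any number of b (including none), then c
const oneOrMore = /ab+c/;  // a, then at least one b, then c

console.log(zeroOrMore.test('ac'));    // true: zero b's are allowed
console.log(oneOrMore.test('ac'));     // false: at least one b is required
console.log(zeroOrMore.test('abbbc')); // true
console.log(oneOrMore.test('abbbc'));  // true
```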

A specific number of times

When we need a specific number of repetitions, the metacharacters { and } set the exact range of repeats. To match ‘a’ exactly 3 times, I use /a{3}/; to match ‘a’ at least twice, I use /a{2,}/.

Here’s the full syntax:

- {x}: exactly x times
- {min,max}: between min and max times
- {min,}: at least min times
- {0,max}: at most max times
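
Each form of the brace syntax can be checked with the sketch below. It wraps every pattern in ^ and $ (the string boundary anchors covered later in the article) so that the count applies to the whole input; the variable names are illustrative.

```javascript
const exactlyThree = /^a{3}$/;
const atLeastTwo = /^a{2,}$/;
const twoToFour = /^a{2,4}$/;
const atMostTwo = /^a{0,2}$/;

console.log(exactlyThree.test('aaa'));  // true
console.log(exactlyThree.test('aa'));   // false
console.log(atLeastTwo.test('aaaaa'));  // true
console.log(atLeastTwo.test('a'));      // false
console.log(twoToFour.test('aaa'));     // true
console.log(twoToFour.test('aaaaa'));   // false
console.log(atMostTwo.test(''));        // true: zero repetitions are allowed
console.log(atMostTwo.test('aaa'));     // false
```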

Because these metacharacters are abstract and easy to confuse, I’ve made up mnemonics by association so that I can recall them when I need them.

Match rule                   Metacharacter    Mnemonic
Zero or one time             ?                A question: is it there or not?
Zero or more times           *                Stars in the sky: the universe began from nothing, and in the end stars filled the sky
One or more times            +                Plus one: + 1
A specific number of times   {x}, {min,max}   Think of a number line, going from a point to a ray to a line segment; min and max are the left and right bounds of a closed interval

Position boundaries

That’s all we need for matching characters; next we need to match positions. When searching long text, we often need to restrict where a match may occur. For example, I might only want to match at the beginning or end of a word.

Word boundaries

Words are the basic units of sentences and articles. A common use case is finding a specific word in an article or sentence. For example:

The cat scattered his food all over the room.

I want to find the word cat, but the regular expression /cat/ matches both cat and the cat inside scattered. In this case we need the boundary metacharacter \b, where b is the first letter of boundary. In the regex engine, it matches the position between a character that can be part of a word (\w) and one that cannot (\W), or between a word character and the start or end of the string.

The above example is rewritten as /\bcat\b/ to match the word cat.
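
The effect of the word boundary on the sentence above can be seen in this short sketch. The variable names are illustrative; the g flag (find all matches) is explained later in the article.

```javascript
const sentence = 'The cat scattered his food all over the room.';

// Without boundaries, "cat" is also found inside "scattered".
const loose = sentence.match(/cat/g);
console.log(loose); // ['cat', 'cat']

// With \b on both sides, only the standalone word matches.
const strict = sentence.match(/\bcat\b/g);
console.log(strict); // ['cat']

// The single strict match starts at index 4, right after "The ".
const position = /\bcat\b/.exec(sentence).index;
console.log(position); // 4
```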

String boundaries

After matching words, let’s look at matching the boundaries of an entire string. The metacharacter ^ matches the beginning of the string, and the metacharacter $ matches the end. Note that in multi-line text, if we want ^ and $ to match at the start and end of every line rather than only of the whole text, we must use multiline mode. Try matching the sentence “I am scq000.” on each line:

I am scq000.
I am scq000.
I am scq000.

You can use a regular expression like /^I am scq000\.$/m, where m stands for multiline. Besides m, the flags i and g are also commonly used. The former ignores case, and the latter finds all matches rather than stopping at the first.
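
Here is a sketch of how the three flags change the result on the three-line text above. The variable names are illustrative.

```javascript
const lines = 'I am scq000.\nI am scq000.\nI am scq000.';

// Without m, ^ and $ anchor to the whole text, so one line cannot match.
const wholeText = /^I am scq000\.$/.test(lines);
console.log(wholeText); // false

// With m, ^ and $ match at every line; g collects all three lines.
const perLine = lines.match(/^I am scq000\.$/gm);
console.log(perLine.length); // 3

// With i, letter case is ignored.
const anyCase = /^i am SCQ000\.$/im.test(lines);
console.log(anyCase); // true
```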

Finally, to sum up:

Boundary or flag      Regular expression   Mnemonic
Word boundary         \b                   boundary
Non-word boundary     \B                   not boundary
Beginning of string   ^                    a small pointed head marks the start
End of string         $                    the terminator: the dollar sign $
Multiline mode        m flag               multiline
Ignore case           i flag               ignore case (case-insensitive)
Global mode           g flag               global

Subexpressions

Now that we’ve covered character matching, more advanced usage involves subexpressions. Regular expressions become far more powerful when expressions can be nested and can refer back to parts of themselves.

Going from simple to complex regular expressions usually relies on three ideas: grouping, backreferences, and logical processing. With these three rules you can build arbitrarily complex regular expressions.

Grouping

Any part of a regular expression enclosed in the metacharacters ( and ) forms a group. Each group is a subexpression, and subexpressions are the basis of advanced regular expressions. On its own, a simple (regex) matches the same text as the ungrouped pattern; it becomes powerful when combined with backreferences.
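
As a small illustration, the sketch below shows two things a group does in JavaScript: it lets a quantifier apply to a whole subexpression, and it captures the text it matched. The date-like sample string and the variable names are assumptions for illustration.

```javascript
// With a group, + repeats the whole "ab"; without it, only "b" repeats.
const groupedRepeat = /^(ab)+$/.test('ababab');
const plainRepeat = /^ab+$/.test('ababab');
console.log(groupedRepeat); // true
console.log(plainRepeat);   // false

// Each group captures the text it matched, numbered from 1.
const found = /(\d{4})-(\d{2})/.exec('released 2018-06');
console.log(found[0]); // '2018-06' (the whole match)
console.log(found[1]); // '2018'    (first group)
console.log(found[2]); // '06'      (second group)
```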

Backreferences

A backreference is when a later part of a pattern refers to a substring matched earlier. You can think of it as a variable. The syntax is \1, \2, and so on, where \1 refers to the text matched by the first subexpression, \2 to the second, and so on. Some tools use \0 for the entire match, but in a JavaScript pattern \0 matches the NUL character instead; in a JavaScript replacement string the entire match is written $&.

Suppose you wanted to match two consecutive identical words in the following text. What would you do?

Hello what what is the first thing, and I am am scq000.

Using a backreference, we can easily write a regular expression such as \b(\w+)\s\1.
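
Applied to the sample sentence, the idea looks like the sketch below. Note one assumption: it adds a trailing \b to the pattern so that a word followed by a longer word with the same prefix is not reported as a repeat. The variable names are illustrative.

```javascript
const text = 'Hello what what is the first thing, and I am am scq000.';

// (\w+) captures a word; \1 requires the same word again after one whitespace.
const repeatedWord = /\b(\w+)\s\1\b/g;

const repeats = text.match(repeatedWord);
console.log(repeats); // ['what what', 'am am']

// Replacing each repeat with its captured word removes the duplicates.
const cleaned = text.replace(repeatedWord, '$1');
console.log(cleaned); // 'Hello what is the first thing, and I am scq000.'
```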

Backreferences are also very common in replacement strings, where the syntax is slightly different: use $1, $2, and so on to refer to the captured groups. Here is a demonstration in JavaScript:

var str = 'abc abc 123';
// replace returns a new string; $1 is the text captured by (ab)
str.replace(/(ab)c/g, '$1g');
// returns 'abg abg 123'

If we don’t need a subexpression to be referenced, we can use a non-capturing group, (?:regex), to avoid wasting memory.

var str = 'scq000';
str.replace(/(scq00)(?:0)/, '$1,$2');
// returns 'scq00,$2'
// The non-capturing group creates no second reference, so $2 is left in the output as literal text

Sometimes we need to restrict a match by what surrounds it, without making the surrounding text part of the match. Lookahead and lookbehind serve this purpose.

Lookahead

A lookahead restricts what may follow a match. The syntax (?=regex) requires that regex match immediately after the preceding expression, without including that text in the result. For example, given happy and happily, if I want the happ that begins the adverb, I can use happ(?=ily). If I want to exclude the adverb, I can use the negative form happ(?!ily), which matches the happ prefix of the word happy.
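
The sketch below checks both forms on the two words from the example by looking at where each match starts. It uses String.prototype.matchAll, which assumes an ES2020-capable engine; the variable names are illustrative.

```javascript
const words = 'happy happily';

// Positive lookahead: "happ" only when "ily" follows.
const beforeIly = [...words.matchAll(/happ(?=ily)/g)].map((m) => m.index);
console.log(beforeIly); // [6]: the "happ" of "happily"

// Negative lookahead: "happ" only when "ily" does NOT follow.
const notBeforeIly = [...words.matchAll(/happ(?!ily)/g)].map((m) => m.index);
console.log(notBeforeIly); // [0]: the "happ" of "happy"

// The lookahead text is not part of the match, so it survives a replace.
const marked = words.replace(/happ(?=ily)/g, 'HAPP');
console.log(marked); // 'happy HAPPily'
```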

Lookbehind

After lookahead, let’s look at its mirror image: lookbehind. A lookbehind specifies a subexpression, and the match must start at a position immediately preceded by text that matches it. A simple example: both apple and people contain the suffix ple, so what if I only want the ple in apple? We can pin it down by requiring the prefix app:

/(?<=app)ple/

Here, (?<=regex) is the lookbehind syntax. The subexpression regex acts as a constraint: once it matches, the search continues with the text that follows it. The other form of this constraint is the negative lookbehind, (?<!regex), which requires that regex not match immediately before the current position.

Note that not every regular expression implementation supports lookbehind. JavaScript did not support it before ES2018, so where a lookbehind is needed in an older engine, one idea is to reverse the string, use a lookahead, and then reverse the result back. Here is a simple example:

// For example, I want to replace the ple in apple with ply
var str = 'apple people';
// split('') splits into single characters, so the whole string is reversed
str.split('').reverse().join('').replace(/elp(?=pa)/, 'ylp').split('').reverse().join('');
// returns 'apply people'

PS: Thanks to the readers in the comment section for pointing out that lookbehind is part of ES2018 and is now supported in Chrome. In real projects, however, you still need to consider older browsers to avoid bugs in production. See http://kangax.github.io/compat-table/es2016plus/#test-RegExp_Lookbehind_Assertions for details.

A final review of this section:

Reference or lookaround   Regular expression   Mnemonic
Reference                 \1, \2 and $1, $2    escape + number
Non-capturing group       (?:)                 wraps a subexpression (()), is not itself consumed (?), reference (:)
Lookahead                 (?=)                 wraps a subexpression (()), is not itself consumed (?), positive lookup (=)
Negative lookahead        (?!)                 wraps a subexpression (()), is not itself consumed (?), negative lookup (!)
Lookbehind                (?<=)                wraps a subexpression (()), is not itself consumed (?), backward (<, the opening points back), positive lookup (=)
Negative lookbehind       (?<!)                wraps a subexpression (()), is not itself consumed (?), backward (<, the opening points back), negative lookup (!)

Logical processing

Computer science is a science of logic. Let’s recall the three logical relationships used in programming languages: and, or, and not.

In regular expressions, “and” is the default: rules written one after another must all match, so we won’t discuss it further here.

The “not” relationship comes in two forms: one for character matching and one for subexpression matching. For characters, the metacharacter ^ is used. The important thing to remember is that ^ means “not” only when it appears inside [ and ], at the very start of the set. For subexpressions, “not” is expressed with the negative lookahead (?!regex) or the negative lookbehind (?<!regex).

The “or” relationship is usually used to choose between subexpressions. For example, to match either of the two cases a and b, I can use a subexpression like (a|b).

Logical relationship   Regular expression metacharacter
and                    none (implicit)
not                    [^regex] and !
or                     |
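
The “not” and “or” relationships can be checked with the following sketch. The patterns and inputs are made up for illustration.

```javascript
// "not" for characters: ^ at the start of a set negates it.
const nonDigit = /[^0-9]/;
console.log(nonDigit.test('123')); // false: every character is a digit
console.log(nonDigit.test('12a')); // true: "a" is not a digit

// "not" for subexpressions: a negative lookahead.
const aNotBeforeDigit = /a(?!\d)/;
console.log(aNotBeforeDigit.test('a1')); // false
console.log(aNotBeforeDigit.test('ab')); // true

// "or": | chooses between alternatives inside a group.
// Outside [], ^ is the start-of-string anchor, not a negation.
const catOrDog = /^(cat|dog)$/;
console.log(catOrDog.test('dog'));  // true
console.log(catOrDog.test('bird')); // false
```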

Conclusion

When it comes to regular expressions, their abstract symbols put many programmers off. Since such symbols are hard to memorize, I have tried to give them meaning through classification and association. We started with one-to-one single characters, then many-to-many substrings, and then built advanced regular expressions through grouping, backreferences, and logical processing.

To finish, here’s a common regular expression interview question: write a regular expression that adds thousands separators to a number, turning 12345 into 12,345. Try to work out the answer yourself instead of relying on a search engine :).

This article was first published on the author’s personal public account. Please credit the source when reprinting.