Regular expressions tutorial

Note: This article is a summary of what I shared at a company technical talk.

A regular expression (regex) is a pattern for matching strings, and it is useful whenever you work with text. Its most common use is find and replace.

The code we type every day is text too, so the find and replace tools in common code editors almost always support regex syntax.

To be clear, all the examples below are tested against the lyrics of We Will Rock You:

Buddy, you're a boy make a big noise
Playing in the streets gonna be a big man someday
You got mud on your face
You big disgrace
Kicking your can all over the place
Singing
We will, we will rock you
We will, we will rock you
Buddy you're a young man, hard man
Shouting in the street gonna take on the world someday
You got blood on your face
You big disgrace
Waving your banner all over the place
We will, we will rock you

The regex testing tool used is Regex 101.

Readers are advised to check each case by opening the website and pasting in the lyrics. It is also worth modifying each regex slightly to see whether the result still matches your understanding. You learn more by following along.

1. Exact match

A regex is a pattern used to describe strings. Its most mundane use is exact lookup. Say I want to find every “the” in the lyrics. I can write the regex as the.

On its own, this finds only the first “the”, not all of them. That is because a regex has two parts: the pattern and the modifiers (also called flags). A common modifier is g, short for global, which asks for every match in the text.

With g, we find every “the”. Next, let’s look for every “we”.

This time we also want the “We” in the text, where the W is capitalized. For that, add another common modifier, i, short for ignoreCase (or insensitive), which makes the match case-insensitive.
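The effect of the two modifiers can also be checked outside the testing tool. The following is a minimal JavaScript sketch, assuming a three-line excerpt of the lyrics; the variable names are illustrative and not part of the original talk.

```javascript
// Three lines from the lyrics above; the variable names are illustrative.
const flagLyrics = [
  "Playing in the streets gonna be a big man someday",
  "We will, we will rock you",
  "Shouting in the street gonna take on the world someday",
].join("\n");

// Without g, match() returns only the first hit (along with its index).
const firstThe = flagLyrics.match(/the/);

// With g, match() returns every hit.
const allThe = flagLyrics.match(/the/g);

// g alone misses the capitalized "We"; adding i ignores case.
const lowercaseWe = flagLyrics.match(/we/g);
const anyCaseWe = flagLyrics.match(/we/gi);

console.log(firstThe.index); // 11
console.log(allThe.length); // 3
console.log(lowercaseWe); // [ 'we' ]
console.log(anyCaseWe); // [ 'We', 'we' ]
```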

Whether it is the or we, these are exact matches, and if a regex could only look up exactly what you type, it would not be worth much. Its power lies in fuzzy matching.

2. Horizontal fuzzy matching

Let’s say we want to find all the consecutive “e”s in the song.

The regex has the form p{m,n}, which means that p occurs between m and n times in a row (m and n included). p can be a subpattern, not just a single character.

To test this, I changed some of the lyrics. The regex uses parentheses, which, as you might expect, have high precedence, so the quantifier applies to noise as a whole: noise repeated 1 to 3 times.

You may have a question at this point. {1,3} means 1 to 3 times, so why is there only one match? Why not three separate noise matches, or noisenoise followed by noise?

This is because quantifiers can be greedy or lazy. The quantifier {1,3} is greedy and matches as many repetitions as it can. A quantifier becomes lazy when you put a question mark after it.

Now it really is lazy: it stops as soon as it has found one. The question mark after the quantifier seems to ask it, “Can you stop being greedy?”
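To see the difference between greedy and lazy outside the tool, here is a small JavaScript sketch. The test string is an assumption, since the modified lyrics from the talk are not reproduced here.

```javascript
// Assumed test input: "noise" written three times in a row.
const repeatedNoise = "noisenoisenoise";

// Greedy: {1,3} takes as many repetitions as it can, so there is one long match.
const greedyMatches = repeatedNoise.match(/(noise){1,3}/g);

// Lazy: {1,3}? stops after the first repetition, so there are three short matches.
const lazyMatches = repeatedNoise.match(/(noise){1,3}?/g);

console.log(greedyMatches); // [ 'noisenoisenoise' ]
console.log(lazyMatches); // [ 'noise', 'noise', 'noise' ]
```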

Now that quantifiers are clear, let’s look at some shorthand forms.

* is equivalent to {0,}. Any number of times, including none.
+ is equivalent to {1,}. At least once.
? is equivalent to {0,1}. Once or not at all.
{m} is equivalent to {m,m}. Exactly m times.

So what does ? mean, then? It has two possible meanings: one marks lazy mode and the other is a quantifier.

The two are actually easy to tell apart: a ? right after a quantifier means a lazy match. For example, in the regex bo??y, the first question mark is the quantifier {0,1} and the second makes that quantifier lazy.
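The shorthand forms and the two roles of the question mark can be tried in a few lines of JavaScript. This is only a sketch; the sample strings are made up for illustration.

```javascript
// + and {1,} are interchangeable.
const plusForm = "street".match(/e+/g);
const braceForm = "street".match(/e{1,}/g);

// ? as a quantifier is greedy: it takes the optional "o" when one is there.
const greedyOptional = "boo".match(/bo?/)[0];

// A second ? makes that quantifier lazy: it takes the "o" only if it has to.
const lazyOptional = "boo".match(/bo??/)[0];

// In bo??y the "y" still has to match, so the lazy quantifier takes the "o" in "boy".
const boyMatches = "boy by booy".match(/bo??y/g);

console.log(plusForm, braceForm); // [ 'ee' ] [ 'ee' ]
console.log(greedyOptional); // bo
console.log(lazyOptional); // b
console.log(boyMatches); // [ 'boy', 'by' ]
```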

Quantifiers let a regex match fuzzily, so that a short pattern can match a long run of text. I call this horizontal fuzzy matching. There is also vertical fuzzy matching.

3. Vertical fuzzy matching

Let’s say the lyrics accidentally contain “ruck” instead of “rock” in several places. To find both, we can use a character set: r[ou]ck.

Here [ou], the pattern in square brackets, is a character set. It matches a single character that is either “o” or “u”. If we want to match any character from a to e, we can write [abcde]. A run of consecutive characters like this can be shortened to [a-e].

A character set is a set, and a set has a complement. A caret (^) at the start of the square brackets negates the set: [^a-e] matches any one character other than a, b, c, d, or e.
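A short JavaScript sketch of character sets, ranges, and negation follows. The misspelled lyric line is invented for the example.

```javascript
// Assumed input: one correct spelling and two typos.
const typoText = "We will rock you, we will ruck you, we will rack you";

// [ou] accepts exactly one character, either "o" or "u", so "rack" is skipped.
const rockOrRuck = typoText.match(/r[ou]ck/g);

// A range and its complement split the characters between them.
const inRange = "abcdefg".match(/[a-e]/g);
const outOfRange = "abcdefg".match(/[^a-e]/g);

console.log(rockOrRuck); // [ 'rock', 'ruck' ]
console.log(inRange); // [ 'a', 'b', 'c', 'd', 'e' ]
console.log(outOfRange); // [ 'f', 'g' ]
```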

Now that we know what character sets are, let’s take a look at the common shorthand forms.

\d is equivalent to [0-9]. A digit. How to remember: d is the first letter of digit.
\D is equivalent to [^0-9]. Anything but a digit.
\w is equivalent to [0-9a-zA-Z_]. Digits, uppercase and lowercase letters, and the underscore. How to remember: w is the first letter of word, so these are also called word characters.
\W is equivalent to [^0-9a-zA-Z_]. A non-word character.
\s is equivalent to [ \t\v\n\r\f]. Whitespace, including spaces, horizontal tabs, vertical tabs, line feeds, carriage returns, and form feeds. How to remember: s is the first letter of space.
\S is equivalent to [^ \t\v\n\r\f]. A non-whitespace character.
. is equivalent to [^\n\r\u2028\u2029]. The dot is a wildcard that matches almost any character.
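The shorthand classes above can be compared side by side on one tiny string. The sketch below uses a made-up five-character sample: a letter, a digit, a space, an underscore, and a line feed.

```javascript
// Assumed sample: letter, digit, space, underscore, line feed.
const classSample = "a1 _\n";

const digits = classSample.match(/\d/g); // only the digit
const nonDigits = classSample.match(/\D/g); // everything else
const wordChars = classSample.match(/\w/g); // letter, digit, underscore
const spaces = classSample.match(/\s/g); // space and line feed
const dots = classSample.match(/./g); // everything except the line feed

console.log(digits); // [ '1' ]
console.log(wordChars); // [ 'a', '1', '_' ]
console.log(spaces.length); // 2
console.log(dots.length); // 4
```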

Character sets are the other way a regex implements fuzzy matching. At a given position, the character to match can be any one of several, which I call vertical fuzzy matching.

Once you have mastered quantifiers and character sets, you can solve more than half of all regex problems. Here’s another example: find all the words that end in “ing”.

That uses a greedy quantifier. The result is different if you use a lazy quantifier.

The word “singing” splits into “sing” and “ing”. To match whole words reliably, you need to match positions.
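The exact regex from the talk is not reproduced in the text, so the sketch below assumes the pattern \w*ing and its lazy variant \w*?ing, which reproduce the split described above.

```javascript
// Two lines from the lyrics.
const ingLyrics = "Kicking your can all over the place\nSinging";

// Greedy: \w* grabs the whole word, then backtracks just enough to leave "ing".
const greedyIng = ingLyrics.match(/\w*ing/g);

// Lazy: \w*? stops at the first "ing" it can reach, so "Singing" is cut in two.
const lazyIng = ingLyrics.match(/\w*?ing/g);

console.log(greedyIng); // [ 'Kicking', 'Singing' ]
console.log(lazyIng); // [ 'Kicking', 'Sing', 'ing' ]
```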

4. Match a specific location

A search for “you”, for example, also matches the “you” in “your”.

This is where \b comes in. b is the first letter of boundary, as in word boundary. It matches a position that has \w on one side and \W on the other: a word character on one side and a non-word character on the other, hence the name. The start or end of the text counts as the non-word side.

If the concept of a “position” is still unclear, we can take a concrete look at what \b matches.

Note the pink dotted lines in the figure. Each one is just a position. Check that each is flanked by a word character on one side and a non-word character on the other.

Positions have negations too. For example, \B matches a non-word boundary. We can look at that as well.

Once you understand word boundaries, matching the word “you” exactly is simple: use \byou\b.
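One way to make positions visible without the figure is to replace each matched position with a marker character. The JavaScript sketch below does that with "|" and then compares you with \byou\b; the sample line is stitched together from the lyrics.

```javascript
// replace() with a zero-width pattern inserts the marker at every matched position.
const wordBoundaries = "rock you".replace(/\b/g, "|");
const nonBoundaries = "rock you".replace(/\B/g, "|");

// Assumed sample line combining two lyric fragments.
const boundaryLine = "You got mud on your face, we will rock you";

// Plain "you" also hits the first three letters of "your".
const looseYou = boundaryLine.match(/you/gi);

// \byou\b accepts "you" only as a whole word.
const wholeYou = boundaryLine.match(/\byou\b/gi);

console.log(wordBoundaries); // |rock| |you|
console.log(nonBoundaries); // r|o|c|k y|o|u
console.log(looseYou.length); // 3
console.log(wholeYou); // [ 'You', 'you' ]
```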

Besides word boundaries, you probably already know ^ and $. They match the beginning and the end of the entire text.

Remember that earlier we looked for “we”. What if we want to find every “we” at the beginning of a line? We can use multi-line mode.

There is an extra m in the modifiers, the first letter of multiline. In multi-line mode, ^ and $ match the beginning and end of each line, not just the beginning and end of the entire text.
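Here is a minimal JavaScript sketch of the m modifier, assuming a three-line excerpt of the lyrics in which the text itself does not start with “we”.

```javascript
// Three lines; only the second and third start with "We".
const multiLyrics = "Singing\nWe will, we will rock you\nWe will, we will rock you";

// Without m, ^ matches only at the very start of the text, where "Singing" sits.
const textStartWe = multiLyrics.match(/^we/gi);

// With m, ^ also matches right after each line break.
const lineStartWe = multiLyrics.match(/^we/gim);

console.log(textStartWe); // null
console.log(lineStartWe); // [ 'We', 'We' ]
```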

In addition to \b, \B, ^, and $, there are assertions that match positions. One is (?=p), which matches the position right before pattern p.

(?!p) is its negation. There are also lookbehind assertions, such as (?<=p), which matches the position right after pattern p. Put another way, p sits just before that position. It has a negated form too, (?<!p). Readers are invited to try these and see for themselves what matches.
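The four assertions can be tried with the same marker trick: replace the matched position, or the text next to it, and look at the result. This is a sketch with made-up sample strings. It assumes a JavaScript engine that supports lookbehind (added in ES2018), such as a current Node.js.

```javascript
// (?=p): the position right before "you".
const beforeYou = "rock you".replace(/(?=you)/g, "|");

// (?<=p): the position right after "rock".
const afterRock = "rock you".replace(/(?<=rock)/g, "|");

// (?!p): "you" only when it is not followed by "r".
const youNotYour = "you your".replace(/you(?!r)/g, "YOU");

// (?<!p): "you" only when it is not preceded by "rock ".
const notAfterRock = "rock you, thank you".replace(/(?<!rock )you/g, "YOU");

console.log(beforeYou); // rock |you
console.log(afterRock); // rock| you
console.log(youNotYour); // YOU your
console.log(notAfterRock); // rock you, thank YOU
```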

A few more words about positions. What if I want to find every position that is not the start of the text?

(?!^) is the opposite of ^. Writing several position matchers in a row is also fine. For example, ^^^^ matches the same position as ^.
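Both claims are easy to check in JavaScript. The sketch below marks every position matched by (?!^) in a short made-up string, then tests a pattern with four carets in a row.

```javascript
// (?!^) matches every position except the start, including the very end.
const notStart = "rock".replace(/(?!^)/g, "|");

// Positions have no width, so stacking the same one changes nothing.
const stackedCarets = /^^^^rock/.test("rock");
const singleCaret = /^rock/.test("rock");

console.log(notStart); // r|o|c|k|
console.log(stackedCarets, singleCaret); // true true
```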

Note that a position, unlike a character, does not take up space. If you insist on thinking of it as a character, it is an empty string with no actual width.

A regex matches either characters or positions. That completes the main content; the remaining sections fill in the gaps.

5. References

There are two e’s in street and two l’s in all. Now I want to find all of these doubled letters. How do I do that? Using .{2} directly will not work, because it is just shorthand for .., two arbitrary characters. Nothing requires the two characters to be the same.

This is where backreferences come in.

\1 is a backreference that stands for the text captured by the first pair of parentheses. Likewise, \2 stands for the text captured by the second pair.
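A JavaScript sketch of the doubled-letter search follows. The pattern (\w)\1 is an assumption about what the talk used, and the sample line is stitched together from the lyrics.

```javascript
// Assumed sample line containing "all" and "street".
const doubledLine = "Kicking your can all over the street";

// (\w) captures one word character; \1 demands the same character again.
const doubledLetters = doubledLine.match(/(\w)\1/g);

// .{2} has no such requirement: it takes any two characters.
const anyTwo = "all".match(/.{2}/g);

// \2 refers to the second group: here the pattern describes a mirrored pair.
const mirrored = "abba".match(/(\w)(\w)\2\1/)[0];

console.log(doubledLetters); // [ 'll', 'ee' ]
console.log(anyTwo); // [ 'al' ]
console.log(mirrored); // abba
```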

Note that the parentheses here are ordinary parentheses, not special-syntax parentheses like those in (?=p).

The text captured by parentheses can be backreferenced not only inside the regex. It can also be referenced from outside, through the host language’s API. For example, to remove duplicates:

To perform this substitution, the tool must be using the host language API internally. $1 is an external reference to the content captured by the first group.
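In JavaScript the host API in question is String.prototype.replace, where $1 in the replacement string stands for the first captured group. The sketch below collapses runs of a repeated letter; the input string and the exact pattern are assumptions, not the ones shown in the talk.

```javascript
// Assumed input with stuttered letters.
const stuttered = "bigg noiiise";

// (\w)\1+ matches a letter followed by one or more copies of itself;
// replacing the whole run with $1 keeps a single copy.
const deduped = stuttered.replace(/(\w)\1+/g, "$1");

console.log(deduped); // big noise
```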

Parentheses provide both grouping and capturing. Can you use parentheses for grouping only? Yes: use the non-capturing group (?:p) instead of (p).
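The difference shows up in the match result. In this JavaScript sketch, with a made-up input string, both patterns match the same text, but only the capturing version stores a group.

```javascript
// (noise) groups and captures; (?:noise) only groups.
const capturing = "noisenoise".match(/(noise){2}/);
const nonCapturing = "noisenoise".match(/(?:noise){2}/);

console.log(capturing[0], nonCapturing[0]); // noisenoise noisenoise
console.log(capturing.length); // 2 (whole match plus group 1)
console.log(capturing[1]); // noise
console.log(nonCapturing.length); // 1 (whole match only)
```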

6. Branching structure

Let’s say I want to find every “face” and every “place”. What should I do?

The pipe | expresses an “or” relationship, which is nothing new. Branches are tried from left to right, and once one succeeds, the rest are not tried. In that sense it short-circuits and is lazy. For instance, if you use you|your to match “your”, it matches only the first three letters. So you may get a different result if you list the branches in a different order.
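Branch order is easy to observe in JavaScript. The sketch below assumes the pattern face|place for the first question and uses a line stitched together from the lyrics.

```javascript
// Assumed sample line containing both words.
const branchLine = "You got mud on your face, kicking your can all over the place";
const faceOrPlace = branchLine.match(/face|place/g);

// The first branch that succeeds wins, so the order of branches matters.
const shortFirst = "your".match(/you|your/)[0];
const longFirst = "your".match(/your|you/)[0];

console.log(faceOrPlace); // [ 'face', 'place' ]
console.log(shortFirst); // you
console.log(longFirst); // your
```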

Conclusion

This article covers regex syntax through examples, including:

Quantifiers
Character sets
Branching structure
Modifiers
Positions
References
Common shorthand forms

If you type in and debug each of the examples in this article yourself and understand them, you can safely say that you have gotten started with regex.

If you want a more comprehensive and in-depth understanding of JS regex, please read my JS regex mini book.

Of course, mastering the syntax alone is not enough. It also takes a lot of practice.

There is a boiling water theory: if the water never reaches the boil, then no matter how many times you heat it, it is not drinkable. But once it has boiled, you can still drink it even after it has sat for a while.

That is why regex is so easy to forget when you are a beginner and don’t use it much.

A great place to practice is CodeWars, where you can spend focused time on regex problems and get comfortable within a few days.

Hope it helps.

That is the end of this article.
