• Overview
  • Match a single character
  • Match a set of characters
  • Use metacharacters
  • Repeat matching
  • Position matching
  • Use subexpressions
  • Backreferences
  • Lookaround
  • Embedded conditions
  • References

Overview

Regular expressions are used to find and replace text content.

Regular expressions are built into other languages or software products and are not themselves a language or software.


Match a single character

. matches any single character, but in most implementations it does not match a newline.

. is a metacharacter, meaning that it carries a special meaning rather than standing for the character itself. To match a literal ., escape it with a preceding \, that is, write \.

Regular expressions are generally case-sensitive, but some implementations are not.

Regular expression

nam.

Matching results

My name is Zheng.
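The behaviour described above can be checked with a short sketch. It assumes Python's built-in re module; the variable names are illustrative only and other engines may differ in detail.

```python
import re

text = "My name is Zheng."

# "." matches any single character, so "nam." matches "name".
dot_match = re.search(r"nam.", text).group()

# "\." matches only a literal period.
literal_dots = re.findall(r"\.", text)

# By default "." does not match a newline in Python.
crosses_newline = re.search(r"a.b", "a\nb")

print(dot_match)        # name
print(literal_dots)     # ['.']
print(crosses_newline)  # None
```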

Match a set of characters

[] defines a character set.

0-9 and a-z define character ranges. A range is ordered by ASCII code and is used inside [].

- is a metacharacter only inside []; outside [] it is an ordinary character.

^ inside [] is the negation operation.

Application

Match a string that begins with abc and ends with a non-digit character:

Regular expression

abc[^0-9]

Matching results

  1. abcd (matches)
  2. abc1 (no match)
  3. abc2 (no match)
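As a rough illustration of the negated character set, the sketch below runs the same pattern against the three sample strings. It assumes Python's re module; the list names are made up for the example.

```python
import re

# "abc" followed by one character that is NOT a digit.
pattern = re.compile(r"abc[^0-9]")

candidates = ["abcd", "abc1", "abc2"]
matched = [s for s in candidates if pattern.search(s)]

print(matched)  # ['abcd']
```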

Use metacharacters

Matching whitespace characters

Metacharacter  Description
[\b]           Backspace (deletes one character)
\f             Form feed
\n             Newline
\r             Carriage return
\t             Tab
\v             Vertical tab

\r\n is the line terminator on Windows; on Unix/Linux it is \n.

\r\n\r\n matches a blank line on Windows because it matches two consecutive line terminators, which is exactly the blank line between two records.

Matching specific character classes

1. Numeric metacharacters

Metacharacter  Description
\d             Digit character, equivalent to [0-9]
\D             Non-digit character, equivalent to [^0-9]

2. Alphanumeric metacharacters

Metacharacter  Description
\w             Uppercase and lowercase letters, underscore, and digits, equivalent to [a-zA-Z0-9_]
\W             The negation of \w

3. Whitespace metacharacters

Metacharacter  Description
\s             Any whitespace character, equivalent to [ \f\n\r\t\v]
\S             The negation of \s

\x introduces a hexadecimal character code and \0 an octal one. For example, \x0A is ASCII character 10, equivalent to \n.
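A small sketch of the class metacharacters follows. It assumes Python's re module, where \d, \w and \s also match non-ASCII characters unless re.ASCII is passed, so the flag is used here to stay close to the equivalences listed above. The sample string is invented.

```python
import re

sample = "id_7 =\t42\n"

# re.ASCII restricts \d, \w and \s to their ASCII definitions.
digits = re.findall(r"\d", sample, flags=re.ASCII)
word_runs = re.findall(r"\w+", sample, flags=re.ASCII)
whitespace = re.findall(r"\s", sample, flags=re.ASCII)

# \x0A is the hexadecimal escape for ASCII 10, i.e. the newline.
newline_hex = re.findall(r"\x0A", sample)

print(digits)       # ['7', '4', '2']
print(word_runs)    # ['id_7', '42']
print(whitespace)   # [' ', '\t', '\n']
print(newline_hex)  # ['\n']
```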

Repeat matching

  • + matches one or more occurrences
  • * matches zero or more occurrences
  • ? matches zero or one occurrence

Application

Match an email address.

Regular expression

[\w.]+@\w+\.\w+

[\w.] matches an alphanumeric character or a literal .; the + after it means one or more repetitions. Inside a character set [], . is not a metacharacter.

Matching results

abc.def@qq.com
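The email pattern can be tried out as follows. This is a minimal sketch assuming Python's re module; the surrounding sentence is invented, and the pattern is the tutorial's simplified one rather than a full address validator.

```python
import re

email_pattern = re.compile(r"[\w.]+@\w+\.\w+")

# Hypothetical input text containing the sample address.
message = "Contact: abc.def@qq.com, thanks"
found = email_pattern.search(message).group()

print(found)  # abc.def@qq.com
```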

  • {n} matches exactly n occurrences
  • {m,n} matches from m to n occurrences
  • {m,} matches at least m occurrences
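The three interval forms can be compared side by side. The sketch assumes Python's re module and uses re.fullmatch so that the whole string has to fit the repetition count; the sample strings are arbitrary.

```python
import re

samples = ["1", "12", "123", "1234"]

# {3}: exactly three digits.
exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
# {2,3}: two or three digits (no space after the comma).
two_to_three = [s for s in samples if re.fullmatch(r"\d{2,3}", s)]
# {2,}: two or more digits.
at_least_two = [s for s in samples if re.fullmatch(r"\d{2,}", s)]

print(exactly_three)  # ['123']
print(two_to_three)   # ['12', '123']
print(at_least_two)   # ['12', '123', '1234']
```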

* and + are greedy metacharacters that match as much content as possible. Appending ? turns a quantifier into its lazy version, for example *?, +?, and {m,n}?.

Regular expression

a.+c

Since + is greedy, .+ matches as much as possible, so the pattern matches the entire text abcabcabc rather than just the first abc. The lazy version matches only the first abc.

Matching results

abcabcabc
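The difference between the greedy and lazy forms can be seen directly. A minimal sketch, assuming Python's re module:

```python
import re

text = "abcabcabc"

# Greedy: ".+" takes as much as it can while still letting "c" match.
greedy = re.search(r"a.+c", text).group()

# Lazy: ".+?" takes as little as it can.
lazy = re.search(r"a.+?c", text).group()

print(greedy)  # abcabcabc
print(lazy)    # abc
```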

Position matching

Word boundaries

\b matches a word boundary, that is, the position between a \w character and a \W character. \B matches a position that is not a word boundary.

\b matches only positions, not characters, so \babc\b matches three characters.
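To illustrate that \b consumes no characters, the sketch below records where \babc\b matches in an invented sample string. It assumes Python's re module.

```python
import re

text = "abc abcd xabc abc."

# Only the stand-alone occurrences of "abc" qualify; "abcd" and "xabc" do not.
whole_words = [m.span() for m in re.finditer(r"\babc\b", text)]

# Each match is exactly three characters long: the boundaries add nothing.
lengths = [end - start for start, end in whole_words]

print(whole_words)  # [(0, 3), (14, 17)]
print(lengths)      # [3, 3]
```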

String boundary

^ matches the beginning of the entire string, and $ matches the end.

The ^ metacharacter means negation inside a character set and the start of the string outside one.

In multiline mode, a newline is treated as a string boundary.

Application

Match comment lines in code that begin with //.

Regular expression

^\s*\/\/.*$


Matching results

public void fun() {
    // comment 1
    int a = 1;
    int b = 2;
    // comment 2
    int c = a + b;
}
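The comment-line pattern relies on multiline mode so that ^ and $ apply to each line. A sketch assuming Python's re module, where that mode is enabled with re.MULTILINE and / needs no escaping:

```python
import re

source = """public void fun() {
    // comment 1
    int a = 1;
    int b = 2;
    // comment 2
    int c = a + b;
}"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
raw_lines = re.findall(r"^\s*//.*$", source, flags=re.MULTILINE)
comment_lines = [line.strip() for line in raw_lines]

print(comment_lines)  # ['// comment 1', '// comment 2']
```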

Use subexpressions

Define a subexpression using (). The content of a subexpression is treated as a single element, that is, it can be handled like a single character, so metacharacters such as * can be applied to it.

Subexpressions can be nested, but too much nesting makes a pattern hard to understand.

Regular expression

(ab){2,}

Matching results

ababab

| is the OR metacharacter. It treats everything on its left and everything on its right as two separate alternatives, and the pattern matches if either alternative matches.

Regular expression

(19|20)\d{2}

Matching results

  1. 1900 (matches)
  2. 2010 (matches)
  3. 1020 (no match)
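Grouping and alternation can be exercised together. The sketch assumes Python's re module; the last check is an extra illustration, not from the original text, of what happens when the parentheses around the alternatives are dropped.

```python
import re

# (ab){2,}: the group "ab" repeated at least twice.
repeated_ok = re.fullmatch(r"(ab){2,}", "ababab") is not None
single_ok = re.fullmatch(r"(ab){2,}", "ab") is not None

# (19|20)\d{2}: a four-digit year starting with 19 or 20.
years = ["1900", "2010", "1020"]
valid_years = [y for y in years if re.fullmatch(r"(19|20)\d{2}", y)]

# Without parentheses, | splits the WHOLE pattern into "19" or "20\d{2}".
ungrouped_accepts_19 = re.fullmatch(r"19|20\d{2}", "19") is not None

print(repeated_ok, single_ok)  # True False
print(valid_years)             # ['1900', '2010']
print(ungrouped_accepts_19)    # True
```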

Application

Match an IP address.

Each part of an IP address is a number from 0 to 255. When matching with a regular expression, the following cases are valid:

  • A one-digit number
  • A two-digit number that does not begin with 0
  • A three-digit number beginning with 1
  • A three-digit number beginning with 2 whose second digit is 0 to 4
  • A three-digit number beginning with 25 whose third digit is 0 to 5

Regular expression

((25[0-5]|(2[0-4]\d)|(1\d{2})|([1-9]\d)|(\d))\.){3}(25[0-5]|(2[0-4]\d)|(1\d{2})|([1-9]\d)|(\d))

Matching results

  1. 192.168.0.1 (matches)
  2. 00.00.00.00 (no match)
  3. 555.555.555.555 (no match)
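The five cases listed above can be assembled into one pattern from a reusable piece. This is a sketch under a few assumptions: it uses Python's re module, the name octet is hypothetical, the inner parentheses of the original pattern are dropped for brevity, and re.fullmatch is used so that the whole string has to be an address.

```python
import re

# One part of the address: 250-255, 200-249, 100-199, 10-99, or 0-9.
octet = r"(25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)"

# Three "octet." groups followed by a final octet.
ip_pattern = re.compile("(" + octet + r"\.){3}" + octet, flags=re.ASCII)

samples = ["192.168.0.1", "00.00.00.00", "555.555.555.555"]
valid_ips = [s for s in samples if ip_pattern.fullmatch(s)]

print(valid_ips)  # ['192.168.0.1']
```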

Backreferences

A backreference uses \n to refer to a subexpression, where n is the ordinal number of the subexpression, starting at 1. It must match exactly the text that the subexpression matched. For example, if the subexpression matched abc, the backreference must also match abc.

Application

Match a valid heading element in HTML.

Regular expression

\1 backreferences what the subexpression (h[1-6]) matched, that is, it must match what the subexpression matched.

<(h[1-6])>\w*?<\/\1>

Matching results

  1. <h1>x</h1> (matches)
  2. <h2>x</h2> (matches)
  3. <h3>x</h1> (no match)
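The heading example can be reproduced as below. A minimal sketch assuming Python's re module, where / does not need escaping:

```python
import re

# \1 must repeat whatever group 1 (h1 ... h6) matched in the opening tag.
heading = re.compile(r"<(h[1-6])>\w*?</\1>")

samples = ["<h1>x</h1>", "<h2>x</h2>", "<h3>x</h1>"]
well_formed = [s for s in samples if heading.fullmatch(s)]

print(well_formed)  # ['<h1>x</h1>', '<h2>x</h2>']
```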

Replacement

A replacement requires two regular expressions: one to search and one to describe the replacement.

Application

Change the format of a phone number.

Text

313-555-1234

Search regular expression

(\d{3})(-)(\d{3})(-)(\d{4})

Replacement regular expression

In the replacement, the match of the first subexpression is wrapped in () and followed by a space, and the matches of the third and fifth subexpressions are joined with -.

($1) $3-$5

Result

(313) 555-1234
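The same replacement can be sketched in Python. Note the assumption: Python's re.sub writes group references in the replacement as \1, \3, \5 rather than the $1, $3, $5 notation used above.

```python
import re

phone = "313-555-1234"

# Groups 1, 3 and 5 hold the digit runs; groups 2 and 4 hold the hyphens.
formatted = re.sub(r"(\d{3})(-)(\d{3})(-)(\d{4})", r"(\1) \3-\5", phone)

print(formatted)  # (313) 555-1234
```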

Case conversion

Metacharacter  Description
\l             Converts the next character to lowercase
\u             Converts the next character to uppercase
\L             Converts all characters between \L and \E to lowercase
\U             Converts all characters between \U and \E to uppercase
\E             Ends \L or \U

Application

Convert the second and third characters of the text to uppercase.

Text

abcd

Search

(\w)(\w{2})(\w)

Replacement

$1\U$2\E$3

Result

aBCd
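Python's re module does not support \U and \E in replacement strings, so the sketch below swaps in a different technique: a replacement function that uppercases the second group itself. The function name upper_middle is hypothetical.

```python
import re

def upper_middle(match):
    """Rebuild the match with group 2 converted to uppercase."""
    return match.group(1) + match.group(2).upper() + match.group(3)

# The callable plays the role of the $1\U$2\E$3 replacement.
converted = re.sub(r"(\w)(\w{2})(\w)", upper_middle, "abcd")

print(converted)  # aBCd
```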

Lookaround

Lookaround specifies what must appear immediately before or after the matched content, without including that surrounding text in the match. Lookahead is written with ?= and specifies what must follow the match; the required text is defined after the ?=. In other words, lookahead fixes a piece of text and then looks in front of it for the content to match. Lookbehind is written with ?<= (note: JavaScript engines older than ES2018 do not support lookbehind, and Java's support for it is incomplete).

Application

Find the part of an email address before the @ character.

Regular expression

\w+(?=@)

Result

abc@qq.com (the match is abc)

To negate a lookahead or lookbehind, replace = with !. For example, (?=) becomes (?!). The negated form matches content whose surroundings do not meet the given condition.
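Lookahead, lookbehind and the negated form can be compared on the same address. This is a sketch assuming Python's re module, which supports all three; the lookbehind and negative-lookahead patterns are illustrative additions, not taken from the original text.

```python
import re

address = "abc@qq.com"

# Lookahead: word characters that are followed by "@" (the "@" is not consumed).
local_part = re.search(r"\w+(?=@)", address).group()

# Lookbehind: characters that are preceded by "@".
domain = re.search(r"(?<=@)[\w.]+", address).group()

# Negative lookahead: whole words that are NOT followed by "@".
not_before_at = re.findall(r"\b\w+\b(?!@)", address)

print(local_part)     # abc
print(domain)         # qq.com
print(not_before_at)  # ['qq', 'com']
```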

Embedded conditions

Backreference condition

The condition is whether a given subexpression matched. If it did, the expression that follows the condition must also match.

Regular expression

The subexpression (\() matches an opening parenthesis, and the ? after it makes it match zero or one time. ?(1) is the condition: if subexpression 1 matched, \) must also match, that is, the closing parenthesis is required.

(\()?abc(?(1)\))

Result

  1. (abc) (matches in full)
  2. abc (matches)
  3. (abc (only abc matches; the unclosed parenthesis is left out)
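Python's re module accepts this conditional syntax, so the three samples can be checked directly. In this sketch, search shows what an unanchored match picks up, and fullmatch shows which strings are accepted as a whole.

```python
import re

# If group 1 (the opening parenthesis) matched, a closing one is required.
pattern = re.compile(r"(\()?abc(?(1)\))")

samples = ["(abc)", "abc", "(abc"]

# Unanchored search: the text each sample yields.
found = {s: pattern.search(s).group() for s in samples}

# Whole-string check: the unbalanced sample is rejected.
accepted = [s for s in samples if pattern.fullmatch(s)]

print(found)     # {'(abc)': '(abc)', 'abc': 'abc', '(abc': 'abc'}
print(accepted)  # ['(abc)', 'abc']
```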

Lookaround condition

The condition is whether the specified lookaround matches. If it does, the pattern that follows must also match. Note that the text examined by the lookaround is not included in the match.

Regular expression

?(?=-) is a lookahead condition: only when \d{5} is followed by - does matching continue with -\d{4}.

\d{5}(?(?=-)-\d{4})

Result

  1. 11111 (matches)
  2. 22222- (no match)
  3. 33333-4444 (matches)
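As far as I know, Python's re module only accepts a group number or name as a condition, not a lookahead, so the sketch below swaps in an equivalent alternation: either -\d{4} follows, or the next character is not a hyphen. The helper first_match is hypothetical.

```python
import re

# Equivalent rewrite of \d{5}(?(?=-)-\d{4}) without a lookahead condition.
zip_pattern = re.compile(r"\d{5}(?:-\d{4}|(?!-))")

def first_match(pattern, text):
    """Return the first matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None

samples = ["11111", "22222-", "33333-4444"]
results = {s: first_match(zip_pattern, s) for s in samples}

print(results)  # {'11111': '11111', '22222-': None, '33333-4444': '33333-4444'}
```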

References

  • Ben Forta. Sams Teach Yourself Regular Expressions in 10 Minutes (Chinese edition) [M]. Posts & Telecom Press, 2007.