Regular expression

1. Introduction

I use regular expressions regularly, so I took some time to organize the key points, for my own reference and for anyone else who finds them useful.

1.1 Introduction

A regular expression is a single string that describes or matches a series of strings that conform to a syntactic rule. It describes a pattern of string matching, which can be used to check whether a string contains a certain substring, replace the matched substring, or take out the substring that meets a certain condition from a string. It consists of some ordinary characters and some metacharacters. Ordinary characters include uppercase and lowercase letters and numbers, while metacharacters have special meanings.

For example: /ab?(c|de*)+|fg/

1. Because of the parentheses, (c|de*) is treated as a whole. Inside the parentheses there is a branch structure, so c and de* are each a unit; within de*, d and e* are each a unit

2. The whole expression breaks down into:

a, b?, (...)+, f, and g, where c and de* are inside the parentheses

3. The overall meaning is: ab?(...)+ or fg

3.1 Left: a must appear once; b is optional (zero or one); then there must be at least one c or de*, where e may be absent or repeated any number of times

3.2 Right: fg must appear
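
One way to check this reading is to run the pattern against a few strings. The sketch below is only illustrative: the sample strings are made up, and the pattern is wrapped in ^(?:...)$ (an addition, not part of the original expression) so that each sample has to match in full.

```javascript
// The original pattern, wrapped in ^(?:...)$ so partial matches do not count.
// The wrapper and the sample strings are assumptions made for this check.
const pattern = /^(?:ab?(c|de*)+|fg)$/;

const samples = ["ac", "abc", "adeee", "abcdc", "fg", "a", "abb", "f"];

// Pair each sample with whether it matches the whole pattern.
const results = samples.map((s) => [s, pattern.test(s)]);

for (const [s, ok] of results) {
  console.log(s, ok);
}
```

The first five samples follow one of the two branches and match; "a" lacks the required group, "abb" has a second b, and "f" is only half of the right branch.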


1.2 Online Tools

1. Regulex

2. Regexper

3. Online verification tool

4. RegExr

Offline tool: RegexBuddy

2. Related concepts

2.1 Ordinary Characters

Ordinary characters include all printable and non-printable characters that are not explicitly specified as metacharacters. This includes all upper and lower case letters, all numbers, all punctuation marks, and some other symbols.

Printable:

Such as 123456 abcdefghijklm

Non-printable (special characters that control how text is laid out when printed):

Special character    Regular expression    Mnemonic
Newline              \n                    new line
Form feed            \f                    form feed
Carriage return      \r                    return
Whitespace           \s                    space
Tab                  \t                    tab
Vertical tab         \v                    vertical tab
Backspace            [\b]                  backspace (written inside [] to distinguish it from the word boundary \b)

var regex = /ab/g;
var string = "abc 12abv acbdd";
console.log( string.match(regex) );
// Every occurrence of the consecutive characters "ab" is matched
// [ 'ab', 'ab' ]

2.2 Metacharacters

Metacharacters have predefined meanings that make common patterns easier to write, such as \d in place of [0-9]. They are the characters that carry special meaning in a regular expression.

Common metacharacters are as follows:

Character    Description
^            Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$            Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*            Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+            Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?            Matches the preceding subexpression zero times or once. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
\b           Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B           Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\d           Matches a digit. Equivalent to [0-9].
\D           Matches a non-digit character. Equivalent to [^0-9].
.            Matches any single character other than line terminators. Roughly equivalent to [^\n\r] (JavaScript also excludes \u2028 and \u2029).
\s           Matches any whitespace character, including spaces, tabs, form feeds, and so on. Roughly equivalent to [ \f\n\r\t\v].
\S           Matches any non-whitespace character. Roughly equivalent to [^ \f\n\r\t\v].
\w           Matches letters, digits, and underscores. Equivalent to [A-Za-z0-9_].
\W           Matches any character that is not a letter, digit, or underscore. Equivalent to [^A-Za-z0-9_].
(?:pattern)  Matches pattern but does not capture the result; it is a non-capturing match and nothing is stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more concise expression than 'industry|industries'.
...
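
A few rows of the table can be tried directly. The following is a minimal sketch; the sample sentence and the variable names are made up for illustration, and \w* is added around 'er' only so that the output shows which whole word matched.

```javascript
// Sample text is an assumption chosen to contain digits and the words "never" and "verb".
const text = "Order 66: never say verb 7 times";

const digits = text.match(/\d+/g);           // runs of digits
const words = text.match(/\w+/g);            // runs of letters, digits, underscores
const endsWithEr = text.match(/\w*er\b/g);   // "er" followed by a word boundary
const erInside = text.match(/\w*er\B\w*/g);  // "er" followed by another word character

console.log(digits);
console.log(words);
console.log(endsWithEr);
console.log(erInside);
```

Here \b succeeds after "Order" and "never" because a space follows the r, while \B succeeds only in "verb", where the r is followed by b.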

2.3 Locators (anchors)

A locator enables you to pin a regular expression to the beginning or end of a line. They also enable you to create regular expressions that appear within a word, at the beginning of a word, or at the end of a word.

A locator describes the boundary of a string or word. ^ and $ indicate the beginning and end of a string, \b indicates the boundary before or after a word, and \B indicates a non-word boundary.

Character    Description
^            Matches the beginning of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$            Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b           Word boundary, i.e. the position between \w and \W, or between \w and the start or end of the string
\B           Non-word boundary, i.e. any position that is not a \b, such as between two \w characters or between two \W characters

var regex = /^an/g;
var string = "an This is an\nan tzone good an";
console.log(string.match(regex));
// Result: although the string contains a newline, without the m modifier the whole text is treated as one string, so only the first "an" matches
// [ 'an' ]

var regex = /^an/gm;
var string = "an This is an\n an tzone good an";
console.log(string.match(regex));
// Result: the second line starts with a space, so its "an" is not at the start of the line and does not match
// [ 'an' ]

var regex = /^an/gm;
var string = "an This is an\nan tzone good an";
console.log(string.match(regex));
// Result:
// [ 'an', 'an' ]

var regex = /ab\b/g; // "ab" must appear at the end of a word
var string = "wwabc ab2 imab 55ab";
console.log(string.match(regex));
// Result: the "ab" in "imab" and in "55ab"
// [ 'ab', 'ab' ]

var regex = /\bab/g; // "ab" must appear at the start of a word
var string = "wwabc ab2 imab 55ab";
console.log(string.match(regex));
// Result: only the "ab" in "ab2"
// [ 'ab' ]

var result = "[JS] Lesson_01.mp4".replace(/\b/g, '#');
console.log(result);
// Result:
// [#JS#] #Lesson_01#.#mp4#

2.4 Qualifiers (quantifiers)

Qualifiers specify how many times a given component of a regular expression must occur for a match to succeed. There are six: *, +, ?, {n}, {n,}, and {n,m}.

Character    Description
*            Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+            Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?            Matches the preceding subexpression zero times or once. For example, 'do(es)?' matches "do", "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n}          n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but does match the two o's in "food".
{n,}         n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}        m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no spaces between the comma and the numbers.

Note: You cannot use a qualifier with a locator. Expressions such as ^* are not allowed because there cannot be more than one position immediately before or after a newline or word boundary.
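
The quantifiers can be compared side by side on the 'zo' example. This is a small sketch, not taken from the original: the sample list is made up, and each pattern is anchored with ^ and $ so that the quantifier has to account for the whole string.

```javascript
// Illustrative samples; ^ and $ are added so partial matches do not count.
const samples = ["z", "zo", "zoo", "zooo"];

const patterns = {
  "zo*": /^zo*$/,
  "zo+": /^zo+$/,
  "zo?": /^zo?$/,
  "zo{2}": /^zo{2}$/,
  "zo{2,}": /^zo{2,}$/,
  "zo{1,2}": /^zo{1,2}$/,
};

// For each quantifier, collect the samples it accepts.
const accepted = {};
for (const [label, regex] of Object.entries(patterns)) {
  accepted[label] = samples.filter((s) => regex.test(s));
}

console.log(accepted);
```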

2.4.1 Matching Rules

The default matching mode is greedy: a quantifier matches as many characters as it can.

To switch a quantifier to non-greedy (lazy) mode, add a ? after it.
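
The difference between the two modes is easiest to see on the same input. The snippet below is a minimal sketch with made-up sample strings.

```javascript
// Sample input is an assumption chosen to contain several ">" characters.
const html = "<b>bold</b> and <i>italic</i>";

const greedy = html.match(/<.+>/)[0];  // .+ runs to the last ">" it can reach
const lazy = html.match(/<.+?>/)[0];   // .+? stops at the first ">" that works

// The same applies to counted quantifiers.
const digitsGreedy = "12345".match(/\d{2,4}/g);  // takes four digits at once
const digitsLazy = "12345".match(/\d{2,4}?/g);   // takes two digits at a time

console.log(greedy);
console.log(lazy);
console.log(digitsGreedy);
console.log(digitsLazy);
```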

2.5 Range operator (character classes)

Regular expressions provide the square-bracket metacharacter [] to define a set or range of allowed characters

1. List specific characters, such as [162]; a match can be any one of them. For example:

var regex = /[162]/g;
var string = "123, 768, 8862";
console.log( string.match(regex) );
// Result: every digit that appears inside the brackets is matched
// [ '1', '2', '6', '6', '2' ]

The characters inside [] are alternatives: the relationship between them is "or"

2. Use - to add a range, such as [1-5A-C]

var regex = /[1-5A-C]/g;
var string = "13 27 MMAK OCBAZ";
console.log( string.match(regex) );
// Result:
// [ '1', '3', '2', 'A', 'C', 'B', 'A' ]

3. Most special symbols of regular expressions lose their special meaning inside brackets, except for ^ and - (the backslash still works as an escape character)

^: placed at the start of the brackets, it inverts the set

var regex = /[^6]/ig;
var string = "23456, 567";
console.log( string.match(regex) );
// Result: every character except 6, including the comma and the space
// [ '2', '3', '4', '5', ',', ' ', '5', '7' ]

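
Rule 3 above can be checked with a short experiment. This is an illustrative sketch; the sample strings are made up.

```javascript
// Sample input is an assumption chosen to contain several regex special characters.
const expression = "a+b=c? 3.5*2 (approx.)";

// Inside [], these characters are literal: . + * ? ( )
const literals = expression.match(/[.+*?()]/g);

// Outside [], the dot needs a backslash to be literal.
const escapedDots = expression.match(/\./g);

// A hyphen placed first (or last) in the class is literal, not a range.
const separators = "snake_case-and-kebab".match(/[-_]/g);

console.log(literals);
console.log(escapedDots);
console.log(separators);
```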

2.6 Modifiers (flags)

Additional matching strategies can be specified based on modifiers

Modifier    Meaning                      Description
i           ignore (case-insensitive)    Makes the match case-insensitive: there is no difference between A and a.
g           global                       Finds all matches instead of stopping at the first one.
m           multiline                    Makes the boundary characters ^ and $ match the beginning and end of each line, not only the beginning and end of the entire string.
s           single line                  By default the dot . matches any character except the newline \n; with the s modifier, . also matches \n.

Examples are as follows:

// Without modifiers, matching stops as soon as the first match is found
var regex = /ab/;
var string = "ABC ffab2 ccb wab2 ";
console.log( string.match(regex) );
// Result: index 6 means the match starts at zero-based position 6; the input string is returned as well
// [ 'ab', index: 6, input: 'ABC ffab2 ccb wab2 ', groups: undefined ]

g:

var regex = /ab/g;
var string = "ABC ffab2 ccb wab2 ";
console.log( string.match(regex) );
// Result:
// [ 'ab', 'ab' ]

i:

var regex = /ab/i;
var string = "ABC ffab2 ccb wab2 ";
console.log( string.match(regex) );
// Result:
// [ 'AB', index: 0, input: 'ABC ffab2 ccb wab2 ', groups: undefined ]

m: for ^ and $, each line is treated as its own string instead of the whole text being one string

var regex = /an$/;
var string = "This is an\n antzone good";
console.log(string.match(regex));
// Result:
// null

The code above cannot match "an": although a newline follows "an", multi-line matching is not enabled, so that position is not treated as the end of a line.

var regex = /an$/m;
var string = "This is an\n antzone good";
console.log(string.match(regex));
// Result:
// [ 'an', index: 8, input: 'This is an\n antzone good', groups: undefined ]


s: only changes how the dot behaves (it also matches \n). The entire text is still treated as one string with a single beginning and end, so $ matches only at the very end

var regex = /an$/s;
var string = "This is an\n antzone good an";
console.log(string.match(regex));
// Result: an$ matches only the "an" at the end of the whole string
// [ 'an', index: 25, input: 'This is an\n antzone good an', groups: undefined ]
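
Because the example above contains no dot, it does not show what the s modifier actually changes. The following sketch, with a made-up two-line string, contrasts s with m.

```javascript
// Sample text is an assumption: two lines that both end in "line".
const text = "first line\nsecond line";

// Without s, "." cannot cross the newline; with s, it can.
const withoutS = /first.*second/.test(text);
const withS = /first.*second/s.test(text);

// The s modifier leaves $ alone; only m makes $ match at each line end.
const plain = text.match(/line$/g);
const dotAll = text.match(/line$/gs);
const multiline = text.match(/line$/gm);

console.log(withoutS, withS);
console.log(plain, dotAll, multiline);
```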

2.7 Alternation and groups

2.7.1 Alternation (selecting branches)

The form is p1|p2, where p1 and p2 are sub-patterns separated by | (pipe); the relationship is "or", meaning either of them may match.

|: splits the expression into its entire left and right parts, not just the nearest words. For example, in /^I love JavaScript|Regular Expression$/ the two alternatives are "^I love JavaScript" and "Regular Expression$", not "JavaScript" and "Regular"

var regex = /good|nice/g;
var string = "good idea, nice try.";
console.log( string.match(regex) );
// Result:
// [ 'good', 'nice' ]

var regex = /good|goodbye/g;
var string = "goodbye";
console.log( string.match(regex) );
// Result: once the first branch matches "good", the later branch is not tried
// [ 'good' ]

2.7.2 Grouping (In brackets)

Everything enclosed between the ( and ) metacharacters forms a group, and each group is a subexpression

Capture group: (expression)

1. When applying a quantifier, the expression in parentheses is quantified as a whole

2. When reading the match result, the content matched by the expression in parentheses can be obtained separately

3. Each pair of parentheses is assigned a number. Number 0 refers to the text matched by the entire regular expression

var regex = /(ab)+d/g; // the parentheses make "ab" a single unit for the + quantifier
var string = "dab ksabd fsababd";
console.log(string.match(regex));
// Result:
// [ 'abd', 'ababd' ]

Backreference:

1. Each pair of () is automatically assigned a number, and the content captured by that () can be retrieved by its number

2. A backreference lets you refer to the string captured by a group


var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2022-02-12";
console.log( string.match(regex) );
// Result: four entries are returned, namely 2022-02-12, 2022, 02, and 12
// [ '2022-02-12', '2022', '02', '12', index: 0, input: '2022-02-12', groups: undefined ]


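The captured values can also be read through the RegExp constructor's static properties $1 through $9, which hold the text captured by each pair of parentheses in the most recent match. These are legacy properties, so reading the groups from the match array is the safer choice in new code.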

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2022-02-12";
string.match(regex);
console.log(RegExp.$1); // "2022"
console.log(RegExp.$2); // "02"
console.log(RegExp.$3); // "12"
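
The same groups can be reached without the static properties. The sketch below shows two alternatives under the same date pattern: destructuring the match array, and referring to the groups as $1 to $3 in a replacement string. The variable names and the output format are chosen only for illustration.

```javascript
const regex = /(\d{4})-(\d{2})-(\d{2})/;
const date = "2022-02-12";

// Option 1: read the groups from the match array (index 0 is the whole match).
const [whole, year, month, day] = date.match(regex);

// Option 2: refer to the groups as $1..$3 inside a replacement string.
const reordered = date.replace(regex, "$2/$3/$1");

console.log(whole, year, month, day);
console.log(reordered);
```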

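The group structure can be inspected with one of the online visualization tools listed in section 1.2.

If the situation is more complicated, for example when several delimiters are allowed: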

var regex = /\d{4}(-|\/|\.)\d{2}(-|\/|\.)\d{2}/;
var string1 = "2022-02-12";
var string2 = "2022/02/12";
var string3 = "2022.02.12";
var string4 = "2022-02/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // true

Here / and . need to be escaped. The pattern matches the required formats, but it also matches data such as "2022-02/12". What if the delimiters must be consistent? That is where the backreference comes in

var regex = /\d{4}(-|\/|\.)\d{2}\1\d{2}/;
var string1 = "2022-02-12";
var string2 = "2022/02/12";
var string3 = "2022.02.12";
var string4 = "2022-02/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // false

Note that \1 here refers to whatever the first group (-|\/|\.) matched. For example, if the group matched -, then \1 must also match -, so the second delimiter has to be - as well; otherwise the match fails

If parentheses are nested, groups are numbered by the position of their opening parenthesis, counting from the left, such as /^((\d)(\d(\d)))\1\2\3\4$/
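
The numbering of nested groups can be confirmed by printing what each group captured. This is a small sketch; the input string is chosen so that the pattern above matches it.

```javascript
const nested = /^((\d)(\d(\d)))\1\2\3\4$/;

// "123" is captured by group 1, then \1 \2 \3 \4 repeat "123", "1", "23", "3".
const match = "1231231233".match(nested);

// Drop index 0 (the whole match) to keep only the numbered groups.
const groups = match.slice(1);

console.log(groups);
```

Group 1 opens first and spans all three digits; groups 2 and 3 open next, and group 4, the innermost, opens last.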

Non-capturing group: (?:expression)

A non-capturing group avoids the side effect of () in cases where parentheses are needed for grouping but the content matched by the subexpression does not need to be saved

var regex = /(?:ab)+/g;
var string = "ababa abbb ababab";
console.log( string.match(regex) );
// Result:
// [ 'abab', 'ab', 'ababab' ]
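
With the g modifier the two kinds of group produce the same output, so the difference is easier to see without it. The sketch below uses a made-up input and compares the match arrays.

```javascript
const text = "ababd";

// Array.from keeps only the indexed entries (whole match plus captured groups).
const capturing = Array.from(text.match(/(ab)+d/));
const nonCapturing = Array.from(text.match(/(?:ab)+d/));

console.log(capturing);     // whole match plus the last text captured by (ab)
console.log(nonCapturing);  // whole match only
```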

2.8 Operator priority

Operators of the same priority are evaluated from left to right; operators of different priorities are evaluated from higher to lower. The following table lists the regular expression operators in order of precedence, from highest to lowest:

Operator                                  Description
\                                         Escape character
(), (?:), (?=), []                        Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}                 Qualifiers
^, $, \anymetacharacter, any character    Anchors and sequences (i.e. position and order)
|                                         Alternation, the "or" operation. Characters bind more tightly than the alternation operator, so "m|food" matches "m" or "food". To match "mood" or "food", use parentheses to create a subexpression: "(m|f)ood".
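
The last two rows of the table can be verified with a few tests. This is an illustrative sketch: the word list is made up, and ^...$ anchors are added so that each word must match in full.

```javascript
const words = ["m", "food", "mood"];

// Alternation has the lowest priority: the branches are "m" and "food".
const alternation = /^(?:m|food)$/;
// Parentheses limit the alternation to the first letter.
const grouped = /^(?:m|f)ood$/;

const alternationHits = words.filter((w) => alternation.test(w));
const groupedHits = words.filter((w) => grouped.test(w));

// Qualifiers bind tighter than sequences: in ab+, the + applies to b only.
const plusOnB = "abab".match(/ab+/)[0];
const plusOnGroup = "abab".match(/(?:ab)+/)[0];

console.log(alternationHits, groupedHits);
console.log(plusOnB, plusOnGroup);
```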

2.9 Principle of backtracking

The basic idea of backtracking, also known as trial and error, is this: starting from some initial state of the problem, search all the states reachable from it. When a path reaches a dead end and can go no further, step back one or more steps and continue searching from another possible state, until every path has been tried. This method of repeatedly moving forward and stepping back to find a solution is called backtracking.

It is essentially a depth-first search. The process of stepping back to an earlier state is what gives backtracking its name. As the description above shows, backtracking happens when the road is blocked: when an attempted match fails, the next step is usually to backtrack.

For example, when a pattern such as /ab{1,3}c/ is matched against the target string "abbc", backtracking occurs.

After matching a and two b's, the quantifier {1,3} greedily tries for a third b, but the fourth character is c, so that attempt fails. The engine steps back to the state after the second b and moves on to the next part of the pattern, c, which matches the fourth character. The whole pattern has now matched, and the result is returned

Backtracking is time-consuming. To improve matching efficiency, a regular expression can be optimized by using specific character classes instead of wildcards to eliminate backtracking, using non-capturing groups to reduce memory consumption, extracting the common part of branches, reducing the number of branches, and narrowing their scope
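
As an illustration of the first measure, the sketch below extracts quoted text in three ways. The input is made up, and no timing is measured; the point is only that the negated character class cannot run past a closing quote, so it never has to step back from one.

```javascript
// Sample input is an assumption: a sentence with two quoted parts.
const text = 'say "hello" then "bye"';

// Greedy wildcard: runs to the end of the string, then backtracks to the last quote.
const greedyDot = text.match(/".*"/g);

// Lazy wildcard: correct result, but tests for a closing quote after every character.
const lazyDot = text.match(/".*?"/g);

// Negated character class: consumes only non-quote characters.
const negatedClass = text.match(/"[^"]*"/g);

console.log(greedyDot);
console.log(lazyDot);
console.log(negatedClass);
```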

Recognize the limits of regular expressions and do not attempt tasks they cannot handle. At the same time, do not go to the other extreme and use them for everything. Simple problems that can be solved with string APIs should not be turned into regex problems

Suppose we want to extract a substring:

var string = "JavaScript";
console.log( string.match(/.{4}(.+)/)[1] );
// Result:
// Script

This can be done directly with an existing string API:

var string = "JavaScript";
console.log( string.substring(4) );
// Result:
// Script


3. Usage

3.1 JavaScript

In JavaScript, a RegExp object is a regular expression object with predefined properties and methods.

There are six methods for regex operations: four on String instances and two on RegExp instances

String#search

String#split

String#match

String#replace

RegExp#test

RegExp#exec

// test
var regex = /ab/;
var string = "ABC ffab2 ccb wab2 ";
console.log( regex.test(string) ); // true

// exec: with the g modifier, exec continues from where the last match ended
var regex = /ab/g;
var string = "ABC ffab2 ccb wab2 ";
console.log( regex.exec(string) );
// [ 'ab', index: 6, input: 'ABC ffab2 ccb wab2 ', groups: undefined ]
console.log( regex.lastIndex );
// 8
console.log( regex.exec(string) );
// [ 'ab', index: 15, input: 'ABC ffab2 ccb wab2 ', groups: undefined ]
console.log( regex.lastIndex );
// 17

// match
var regex = /(ab)+d/g;
var string = "dab ksabd fsababd";
console.log(string.match(regex)); // [ 'abd', 'ababd' ]
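
The remaining three String methods from the list are not shown above. The sketch below gives one minimal call for each; the inputs and the replacement text "XY" are made up for illustration.

```javascript
const string = "ABC ffab2 ccb wab2 ";

// search: index of the first match, or -1 if there is none.
const firstIndex = string.search(/ab/);

// split: cut the string wherever the pattern matches.
const parts = "2022-02.12".split(/[-.]/);

// replace: with the g modifier, every match is replaced.
const replaced = string.replace(/ab/g, "XY");

console.log(firstIndex);
console.log(parts);
console.log(replaced);
```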

3.2 Java


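    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.junit.Test; // assumes JUnit 4; with JUnit 5 use org.junit.jupiter.api.Test

    public class RegexTest { // class name is illustrative

        @Test
        public void test1() {
            Pattern pattern = Pattern.compile("([a-z]+)([0-9]+)");
            Matcher matcher = pattern.matcher("aksh22**fdsk21576*23*kl");
            // find() advances to the next match: "aksh22", then "fdsk21576"
            while (matcher.find()) {
                System.out.println("matcher = " + matcher.group());   // whole match
                System.out.println("matcher1 = " + matcher.group(1)); // letters
                System.out.println("matcher2 = " + matcher.group(2)); // digits
            }
            System.out.println(matcher.groupCount()); // number of capture groups: 2
        }
    }
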
