As a front-end leader, you must be familiar with JavaScript regular expressions, and can even use regular expressions to write some amazing code. I just don’t know if you’re wondering like I am: How does a regular expression work?

We write down the regular expression (a + | b) c, and then use it to match string aacde, abcde, this is how a process?

Some time ago, I tried to find and learn relevant information, and then know the following content:

  • Currently, there are two types of regular expression engines: NFA and DFA
  • JavaScript uses the NFA engine

So what is NFA? How is it different from DFA? How does NFA implement regular expression matching?

Next, I will try to introduce it in my own way, hoping to help you who are interested in it.

NFA

NFA refers to Nondeterministic Finite Automaton, non-deterministic Finite state Automaton.

It’s a little deep. I didn’t know it when I first saw it. Let’s take it slow.

Let’s start with the finite state machine (Automation). Here’s an example:

There are some elements of a state machine, which are described as follows:

  • Starting state: The circle represents the state, and the state pointed to by an “arrow without a starting point” is the starting state, in this case S1
  • Final state: also known as the accepted state, represented by a double circle, also known as S1 in this case
  • Input: Symbol/signal input to a state machine in a state. Different inputs cause different state changes in the state machine
  • Transition: The process of changing a state to a particular state based on a particular input

Therefore, the working process of a finite state machine is a process of automatic state transition from the initial state according to different inputs.

The function of the state machine above is to detect whether a binary number contains an even number of zeros. As you can see from the graph, there are only 1 and 0 inputs. Starting with S1, only input 0 will convert to S2, and S2 only input 0 will convert to S1. So, once the binary number is entered, if it satisfies the final state, which is S1, then the binary number entered contains an even number of zeros.

Still a little confusing, what does this have to do with regular expressions?

A regular expression can be thought of as a description of a collection of strings. For example (a + | b) the corresponding c string collection is:

ac
bc
aac
aaac
aaaac
...
Copy the code

Finite state machines can also be used to describe collections of strings, also described by regular expressions, represented by finite state machines, which can look like this:

The NFA status diagram here is a simple one drawn on my own page, but I believe you can understand it. You can try it for yourself here, which supports only simple regular expressions.

Furthermore, finite state machines can be “executed”, and given such a state machine, they can be used to examine the input string. If the final match, which means the input strings and regular expressions (a + | b) c match.

Therefore, regular expressions in programming languages are generally implemented through finite state machines. The process of a regular expression matching a string can be decomposed into:

  • Regular expressions are transformed into equivalent finite state machines
  • Finite state machine input string execution

At this point, I think you have a pretty good idea of what finite state machines can do in regular expressions, but the implementation is not clear.

Here’s the difference between NFA and DFA. DFA is Deterministic Finite Automaton, determining Finite state machine. DFA can be considered as a special NFA, whose biggest characteristic is certainty. It is deterministic in that, in one state, if you input a symbol, it must be transferred to a certain state, and there is no other possibility.

For example, for regular expressions ab | ac, corresponding to the NFA can be like this:

And you can see that in state 1, if you type in a, there are actually two possibilities, if you have b, you can get a match, and if you have C, you can get a match. So the state machine, in its execution, might try all the possibilities. After a failed attempt to match one possible path, the attempt to go back to the previous state and try another path is called backtracking.

However, DFA eliminates this uncertainty, so it is expected that performance should be better than NFA because no backtracking is required.

NFA can be converted to equivalent DFA, that is, in theory, regular expressions can be implemented using DFA to achieve better performance than NFA. However, the process of converting AN NFA into a DFA consumes more resources, and even the resulting DFA takes up a large amount of storage space (which may grow exponentially, according to some sources). Moreover, COMPARED with NFA, DFA is more complex and costly to implement some features of regular expressions. So many current programming languages have regular expression engines in NFA mode.

We can test whether the engine used by the current programming language is NFA with the following regular expression:

nfa|nfa not
Copy the code

Use the above regular expression to test the string NFA not. The NFA engine returns a successful match when it detects that nFA is satisfied, while the DFA tries to continue the search, which results in the “longest match.”

From regular expressions to NFA

Knowing the application of NFA in regular expressions, the next step is to introduce how to convert regular expressions into corresponding NFA. This section will be a bit boring, but it will be useful for understanding regular expressions and NFA in depth.

Thompson algorithm

Thompson algorithm is used to convert regular expressions to NFA. It is not the most efficient algorithm, but it is practical and easy to understand.

Two basic transformations are used in Thompson’s algorithm:

A normal transition is a transition from one state to another by entering the character A. Epsilon transitions are transitions from one state to another without input.

Operations in regular expressions can be implemented by combining the above two transformations:

  • Combination operationRS:

  • Replace operation R | S:

  • Repeated operation R* :

R and S in the figure above are NFA with start and end states.

Regular expressions ab | c, for example, including two operations:

  • abcombination
  • abThe results, andcreplace

In this way, we treat the regular expression as a series of inputs and operations, which can be decomposed and combined to obtain the final NFA.

First, we need to convert regular expressions to make it easier to record inputs and operations.

Regular expressions -> postfix expressions

Postfix expressions are expressions that easily record input and operations and contain the priority of the operator. They are also called Reverse Polish Notation (RPN).

To facilitate recording operations, we also create an operator “for combined operations in regular expressions. (This article deals only with the simplest form of regular expression, which is “. Not a special symbol to match any character).

Regular expression ab | c corresponding postfix expression for ab. | c.

Thus, postfix expressions can be evaluated by scanning them one by one and identifying the operators within them. For regular expressions, the final NFA is further constructed through the process of “evaluation” after changing them into postfix expressions.

The scheduling field algorithm is used to create postfix expressions.

For the regular expression processing scenario here, the algorithm is roughly described as follows:

Code in: regex2post () | nfa. Js# L14 – luobotang/nfa

- If ch is an ordinary character, append to output - if ch is an operator, as long as the operator at the top of the OPS stack is not lower than CH, And then we stack and append to output, and then we push OPS - if ch is "(", we push OPS - if ch is") ", as long as the top of the OPS stack is not "(", Stack and append to output - Append the ops operators to output - return outputCopy the code

In the actual process, the original regular expression does not contain the combination operator, so you need to determine the proper insertion position.

Operators have the following precedence (from highest to lowest) :

  • *? +
  • .
  • |
  • (

Postfix expression -> NFA

Creating NFA based on postfix expressions is a process in which simple NFA is continuously combined to obtain complex NFA.

The data structure used to represent State is:

// State
{
  id: String.type: String.// 'n' - normal, 'e' - epsilon, 'end'
  symbol: String.// Input character corresponding to normal state
  out: State, // The next state allowed
  out1: State // The next state allowed
}
Copy the code

Each state can correspond to at most two out state, like a | b | c expressions that can be broken down into (a | b) | c, each operator “|” deals only with two expressions (child).

During the construction of the final NFA, fragments of the NFA are created each time:

// Fragment
{
  start: State,
  out: State
}
Copy the code

No matter how complex the NFA fragment is internally, it has only one entry (the start state) and one exit (the end state).

In this part of code: post2nfa () | nfa. Js# L90 – luobotang/nfa

The processing process is roughly as follows:

- If CH is the operator, unstack the desired number of NFA fragments from the stack, build new NFA fragments and push the stack - if CH is a normal character, Create a new state and build an NFA fragment containing only this state onto the stack - returns the NFA fragment at the top of the stack, the final resultCopy the code

Take the handling of combinatorial operations as an example:

const e2 = stack.pop()
const e1 = stack.pop()
e1.out.out = e2.start
stack.push(new Fragment(e1.start, e2.out))
Copy the code

Two NFA fragments are pushed out of the stack, and a new NFA fragment is constructed and pushed back by connecting them end to end.

Other processing procedures are not detailed, interested can look at the code.

The implementation of the NFA

The NFA is executed by comparing the current character with the current state of the string. If a match is made, the match continues to the next state and the next character. Otherwise, the match fails.

However, due to the uncertainty of NFA, there may be multiple matching states at the same time.

I’m going to be blunt here, so I’m going to compare all the current states, and then the next state that still satisfies the condition will go on to the next round of comparison. Tracing one path at a time and backtracking after a failed match is certainly possible, but it’s much more complicated.

The code is in: simulator. Js-orphang/nFA

conclusion

In summary, regular expressions can be executed by constructing equivalent NFA and then executing NFA to match the input string. Regular expressions in real JavaScript have more features, and their regular expression engine is more complex.

Hopefully, my introduction will give you a better understanding of regular expressions. Of course, the level is limited, improper place is inevitable, welcome correction.

Finally, thanks for reading!

The resources

  • Writing own regular expression parser – Amer Gerzic
  • Regular Expression Matching Can Be Simple And Fast – Russ Cox
  • Automata theory – wikipedia
  • Nondeterministic finite automaton – wikipedia
  • file-infix-to-postfix-regexp-js – gist.github.com
  • Shunting-yard algorithm – wikipedia.com