In this paper, the original address blog.csdn.net/xindoo/arti…

In the previous blog building a regular expression execution engine from 0 to 1 (part 1), we built a working regular expression engine. The source code can be found at github.com/xindoo/rege… , but the above paper only used NFA. The time complexity of NFA engine is O(n), but the worst time complexity of matching a string of length m is O(MN) because a lot of recursion and backtracking are involved. By contrast, the time complexity of DFA engine is O(n^2), but there is no backtracking during matching, so the complexity of matching is only O(m), and the performance gap is quite large.

The DFA and the NFA

We have mentioned NFA and DFA many times. What are they? What’s the difference? First, NFA and DFA are both finite state machines and directed graphs, which are used to describe states and the relationship between states. The full name of NFA is Nondeterministic Finite automaton, and the full name of DFA is Deterministic Finite automaton.

The difference between them mainly lies in certainty and non-certainty. What is certainty? Deterministic means that faced with the same input, there will not be more than one viable path to execute the next node. It’s a little convoluted, but you’ll understand it when you look at the picture.

The figure shows an NFA and a DFA respectively. The figure above is called NFA because it has nodes withuncertainty, such as node 0, which can go to node 0, 1, and 2 after typing “A”. Also, the figure above has the $\epsilon$edge, which can jump to the next node without input, which also introduces uncertainty. In contrast, in the DFA below, each node has at most one edge for a particular input.

To sum up, the difference between NFA and DFA is that NFA must have an ε edge or a node that corresponds to multiple states for the same input.

There is equivalence between DFA and NFA, that is, any NFA can be transformed into an equivalent DFA. Due to the non-deterministic nature of NFA, there may be multiple alternative paths in the face of an input. Therefore, if one path fails, it is necessary to go back to the choice point to take another path. However, DFA is different. In each state, there are no multiple paths for each input, so there is no need for recursion and backtracking, and one path can be taken to black. The piece complexity of the DFA is only O(n), but the match complexity of the NFA is O(n^2) because of recursion and backtracking. This is the main reason why we want to convert the NFA in the engine to DFA.

NFA to DFA

algorithm

The algorithm of transforming NFA into DFA is called subset construction method, and its specific process is as follows.

  • Step 1: The initial node of NFA and all ε reachable nodes of the initial node constitute the initial node of DFA, and then perform Step 2 for the initial DFA node.
  • Step 2: For the current DFA node, find all reachable NFA nodes of all NFA nodes for input symbol X, and the DFA nodes formed by these nodes communicate as the DFA nodes reachable for input X of the current DFA node.
  • Step 3: If a new node is found in Step 2, repeat Step 2 for the new node.
  • Step 4: Repeat Step 2 and Step 3 until no new DFA node is found.
  • Step 5: Mark all DFA nodes that contain NFA termination nodes as DFA termination nodes.

The language description is a little bit harder to understand, so let’s go straight to the example. We have to take on a site of regular expressions, a | c (b) *, for example, I was in source github.com/xindoo/rege… Joined the NFA output code, a (b | c) * of the NFA output as follows.

from to input
 0-> 1  a
 1-> 8  Epsilon
 8-> 9  Epsilon
 8-> 6  Epsilon
 6-> 2  Epsilon
 6-> 4  Epsilon
 2-> 3  b
 4-> 5  c
 3-> 7  Epsilon
 5-> 7  Epsilon
 7-> 9  Epsilon
 7-> 6  EpsilonCopy the code

The drawing is as follows:Based on the above figure, we perform step 1 to get node 0 as the starting node of DFA.

Then perform step 1 for node 0 of DFA to find all NFA nodes (1#2#4#6#8#9) reachable by A in NFA to constitute node 1 in NFA, as shown in the figure below.Starting from DFA1, I found all the NFA nodes reachable by A (2#3#4#6#7#9) and all the nodes reachable by B (2#4#5#6#7#9), respectively constituting DFA2 and DFA3 in DFA, as shown in the figure below.

Then we perform step 3 on dfa2 and dfa3 respectively. No new node can be found, but several new edges can be found. Add as follows, and we are done pairinga(b|c)*The transformation from NFA to DFA.It can be seen that DFA graph nodes are significantly less than NFA, but NFA is easier to see its corresponding regular expression. I’ve also written code for DFA to generate regular expressions, as described in this articleBlog.csdn.net/xindoo/arti…

Code implementation

The code is essentially a statement of the above process, for more details see github.com/xindoo/rege… .

private static DFAGraph convertNfa2Dfa(NFAGraph nfaGraph) { DFAGraph dfaGraph = new DFAGraph(); Set<State> startStates = new HashSet<>(); Startstates.addall (getNextEStates(nfaGraph. Start, new HashSet<>()))); if (startStates.size() == 0) { startStates.add(nfaGraph.start); } dfaGraph.start = dfaGraph.getOrBuild(startStates); Queue<DFAState> queue = new LinkedList<>(); Set<State> finishedStates = new HashSet<>(); DFA queue.add(dfagraph.start) if BFS traverses from the found start node and builds DFA queue. while (! Queue. IsEmpty ()) {// Step 2 DFAState curState = queue. Poll (); for (State nfaState : curState.nfaStates) { Set<State> nextStates = new HashSet<>(); Set<String> finishedEdges = new HashSet<>(); finishedEdges.add(Constant.EPSILON); for (String edge : nfaState.next.keySet()) { if (finishedEdges.contains(edge)) { continue; } finishedEdges.add(edge); Set<State> efinishedState = new HashSet<>(); for (State state : curState.nfaStates) { Set<State> edgeStates = state.next.getOrDefault(edge, Collections.emptySet()); nextStates.addAll(edgeStates); For (State eState: edgeStates) {// Add E reachably. if (efinishedState.contains(eState)) {continue; } nextStates.addAll(getNextEStates(eState, efinishedState)); efinishedState.add(eState); DFAState nextDFAstate = dfagraph.getorBuild (nextStates); DFAState nextDFAstate = dfagraph.getorBuild (nextStates); if (! finishedStates.contains(nextDFAstate)) { queue.add(nextDFAstate); } curState.addNext(edge, nextDFAstate); } } finishedStates.add(curState); } return dfaGraph; }Copy the code
public class DFAState extends State { public Set<State> nfaStates = new HashSet<>(); // Save the id of the corresponding NFAState. A DFAState can be a set of multiple nFAStates, so concatenate String private String allStateIds; public DFAState() { this.stateType = 2; } public DFAState(String allStateIds, Set<State> states) { this.allStateIds = allStateIds; this.nfaStates.addAll(states); For (State State: states) {if (state.isendState ()) {this.statetype = 1; } } } public String getAllStateIds() { return allStateIds; }}Copy the code

In addition, I encapsulated some NFA node list to DFA node transformation and search in DFAGraph, as follows.

public class DFAGraph { private Map<String, DFAState> nfaStates2dfaState = new HashMap<>(); public DFAState start = new DFAState(); Public DFAState getOrBuild(Set<State> states) {String allStateIds = ""; int[] ids = states.stream() .mapToInt(state -> state.getId()) .sorted() .toArray(); for (int id : ids) { allStateIds += "#"; allStateIds += id; } if (! nfaStates2dfaState.containsKey(allStateIds)) { DFAState dfaState = new DFAState(allStateIds, states); nfaStates2dfaState.put(allStateIds, dfaState); } return nfaStates2dfaState.get(allStateIds); }}Copy the code

DFA engine matching process

Dfa engine matching can also fully reuse the MATCHING process of NFA, so for previous NFA matching code, you can cancel the backtracking for DFA pattern (no problem, but there will be a performance impact).

private boolean isMatch(String text, int pos, State curState) { if (pos == text.length()) { if (curState.isEndState()) { return true; } for (State nextState : curState.next.getOrDefault(Constant.EPSILON, Collections.emptySet())) { if (isMatch(text, pos, nextState)) { return true; } } return false; } for (Map.Entry<String, Set<State>> entry : curState.next.entrySet()) { String edge = entry.getKey(); If (constant.epsilon.equals (edge)) {for (State nextState: entry.getValue()) { if (isMatch(text, pos, nextState)) { return true; } } } else { MatchStrategy matchStrategy = MatchStrategyManager.getStrategy(edge); if (! matchStrategy.isMatch(text.charAt(pos), edge)) { continue; } for (State nextState: Entry.getvalue ()) {return (nextState instanceof DFAState) {return (nextState instanceof DFAState) {return (nextState instanceof DFAState) isMatch(text, pos + 1, nextState); } if (isMatch(text, pos + 1, nextState)) { return true; } } } } return false; }Copy the code

Because DFA matches do not need to be backtracked, they can be changed completely to non-recursive code.

private boolean isDfaMatch(String text, int pos, State startState) { State curState = startState; while (pos < text.length()) { boolean canContinue = false; for (Map.Entry<String, Set<State>> entry : curState.next.entrySet()) { String edge = entry.getKey(); MatchStrategy matchStrategy = MatchStrategyManager.getStrategy(edge); if (matchStrategy.isMatch(text.charAt(pos), edge)) { curState = entry.getValue().stream().findFirst().orElse(null); pos++; canContinue = true; break; } } if (! canContinue) { return false; } } return curState.isEndState(); }Copy the code

Comparison of DFA and NFA engine performance

I used JMH to do a simple non-strict performance test, just look at it, the results are as follows:

Benchmark Mode Cnt Score Error Units RegexTest. DfaNonRecursion THRPT 2 144462.917 ops/s RegexTest. DfaRecursion THRPT 2 Ops /s regextest. nFA THRPT 2 55320.181 OPS /sCopy the code

The matching performance of DFA is much higher than that of NFA, but the recursive version is also faster than the non-recursive version. The detailed test code has been uploaded to Github github.com/xindoo/rege… , welcome to check.

The resources

  • Nfa to dfa

    This article is published by OpenWrite!