Preface

Project address: Regex in Python

Recently I built a regular expression engine from scratch in Python.

Goals

All of the basic regex syntax is implemented:

```python
st = 'AS342abcdefg234aaaaabccccczczxczcasdzxc'
pattern = '([A-Z]+[0-9]*abcdefg)([0-9]*)(\*? |a+)(zx|bc*)([a-z]+|[0-9]*)(asd|fgh)(zxc)'

regex = Regex(st, pattern)
result = regex.match()
log(result)
```

More examples can be found on GitHub.

Prerequisites

In fact, a regular expression engine can be regarded as a small compiler, so it can be written along the same lines as a C compiler, only much less complex:

  1. First comes lexical analysis
  2. Then parsing (top-down here)
  3. Semantic analysis (regular expressions are so simple that we can skip building an AST and go straight to code generation)
  4. Code generation, which in this case means NFA generation
  5. Conversion from NFA to DFA; this is where regular languages and state-machine theory come in
  6. DFA minimization

NFA and DFA

A finite state machine can be regarded as a directed graph: the machine has a number of state nodes, and each node jumps to a next node according to the input character. The difference between an NFA (non-deterministic finite automaton) and a DFA (deterministic finite automaton) is that in a DFA the next state is always uniquely determined.

A finite automaton reads the input string starting from its initial state. At each step it uses the transition function move to decide the next state from the current state and the current input character. In an NFA, however, the next state is not uniquely determined, so only a set of possible next states is known; which state the machine is actually in depends on the subsequent input. If the automaton is in an accepting state when the whole string has been read, the NFA accepts the input string.

All NFA can finally be converted to the corresponding DFA
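To make move and ε edges concrete, here is a minimal, self-contained NFA simulator. It is only a sketch: the dictionary representation and the names nfa, eps_closure, and accepts are assumptions for illustration, not the engine's actual code.

```python
# A minimal NFA simulator (illustrative only; the engine's real
# representation uses Nfa node objects, not dictionaries).
EPS = None  # marker for epsilon edges

# NFA for the regex a(b|c): {state: [(edge_label, next_state), ...]}
nfa = {
    0: [('a', 1)],
    1: [(EPS, 2), (EPS, 4)],
    2: [('b', 3)],
    4: [('c', 3)],
    3: [],
}
ACCEPT = {3}

def eps_closure(states):
    """All states reachable from `states` using only epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        for label, nxt in nfa[stack.pop()]:
            if label is EPS and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def move(states, ch):
    """States reachable from `states` by consuming the character `ch`."""
    return {nxt for s in states for label, nxt in nfa[s] if label == ch}

def accepts(text):
    current = eps_closure({0})
    for ch in text:
        current = eps_closure(move(current, ch))
    return bool(current & ACCEPT)

print(accepts('ab'), accepts('ac'), accepts('ad'))  # True True False
```

The subset construction behind the NFA-to-DFA conversion reuses exactly these two operations, eps_closure and move, applied to sets of states until no new set appears; each such set becomes one DFA state.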

NFA: construction O(n), matching O(nm)

DFA: construction O(2^n), minimization O(k·n′·log n′) (where n′ = O(2^n)), matching O(m)

Here n = regex length, m = string length, k = alphabet size, and n′ = size of the DFA before minimization.

The set of all strings accepted by an NFA is the language accepted by that NFA, and this language is a regular language.

Example

For the regular expression [0-9]*[a-z]+, the corresponding NFA is obtained by joining node 3 of the first NFA below to node 4 of the second.
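The two fragments appear as diagrams in the original post; roughly, they look like this (the state numbers are illustrative, matching the idea that the first fragment ends at node 3 and the second starts at node 4):

```
[0-9]* :  0 --ε--> 1 --[0-9]--> 2 --ε--> 3    (plus ε edges 0→3 and 2→1 for the closure)
[a-z]+ :  4 --[a-z]--> 5                      (plus an ε edge back from 5 to 4 for repetition)

joined :  0 ... 3 --ε--> 4 ... 5              (node 3 connected to node 4)
```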

Lexical analysis

That is all we need to know about NFA and DFA for now; the relevant algorithms will be covered later. First, let's finish the lexical analysis.

The lexical analysis here is much simpler than the parsing done in my earlier C compiler series; there are only a few cases to handle:

  1. Ordinary characters
  2. Characters that carry semantics (metacharacters)
  3. Escape characters

token

The tokens correspond to the regex syntax:

```python
Tokens = {
    '.': Token.ANY,
    '^': Token.AT_BOL,
    '$': Token.AT_EOL,
    ']': Token.CCL_END,
    '[': Token.CCL_START,
    '}': Token.CLOSE_CURLY,
    ')': Token.CLOSE_PAREN,
    '*': Token.CLOSURE,
    '-': Token.DASH,
    '{': Token.OPEN_CURLY,
    '(': Token.OPEN_PAREN,
    '?': Token.OPTIONAL,
    '|': Token.OR,
    '+': Token.PLUS_CLOSE,
}
```

advance

advance is the main function of the lexer; it returns the token type of the current input character.

```python
def advance(self):
    pos = self.pos
    pattern = self.pattern
    if pos > len(pattern) - 1:
        self.current_token = Token.EOS
        return Token.EOS

    text = self.lexeme = pattern[pos]
    if text == '\\':
        self.isescape = not self.isescape
        self.pos = self.pos + 1
        self.current_token = self.handle_escape()
    else:
        self.current_token = self.handle_semantic_l(text)

    return self.current_token
```

advance's main logic is to read the current character and decide whether it is an escape character or something else.

handle_escape handles escaped characters, which essentially become ordinary character tokens. Its main job is to record the escaped character in lexeme so it can be used later when building the automaton.

handle_semantic_l is only two lines: it looks the character up in the table that holds all the semantic characters, and if the character is not found there, returns the ordinary character type.
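The two helpers are not shown in the post; here is a plausible sketch consistent with the description above (the real code on GitHub may differ in detail):

```python
def handle_escape(self):
    # The backslash was already skipped by advance; record the escaped
    # character as the lexeme and treat it as an ordinary character.
    self.lexeme = self.pattern[self.pos]
    self.pos += 1
    return Token.L

def handle_semantic_l(self, text):
    # Look the character up in the Tokens table; anything not found
    # there is an ordinary character (Token.L).
    self.pos += 1
    return Tokens.get(text, Token.L)
```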

I won't show all of the code here; it is all on GitHub.

Constructing the NFA

The main files for constructing the NFA are in the nfa package: nfa.py defines the NFA node, and construction.py implements the NFA construction.

NFA node definition

The NFA node definition is also very simple. In fact, the complete regular expression engine is only about 900 lines, so every part is quite simple when broken down.

  • edge and input_set together describe the node's outgoing edges; edge can take four kinds of values:

    • edge = EPSILON = -1: the node's outgoing edges (next_1, next_2) are ε edges
    • edge = CCL = -2: the edge matches a character class; input_set holds the corresponding character set
    • edge = EMPTY = -3: the node has no outgoing edge
    • edge = c: the edge matches the single input character c
  • status_num: every node has a unique identifier

  • visited: used for debugging when traversing the NFA

```python
class Nfa(object):
    STATUS_NUM = 0

    def __init__(self):
        # Default to an epsilon edge; next_1/next_2 are the
        # (at most two) outgoing edges of this node.
        self.edge = EPSILON
        self.next_1 = None
        self.next_2 = None
        self.visited = False
        self.input_set = set()
        self.set_status_num()

    def set_status_num(self):
        # Give every node a unique, monotonically increasing id.
        self.status_num = Nfa.STATUS_NUM
        Nfa.STATUS_NUM = Nfa.STATUS_NUM + 1

    def set_input_set(self):
        # Fill input_set with the whole ASCII alphabet (used by '.').
        self.input_set = set()
        for i in range(ASCII_COUNT):
            self.input_set.add(chr(i))
```
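For example, the fragment for the single character x can be wired up by hand like this (purely illustrative; the constructors in the next section do exactly this):

```python
a, b = Nfa(), Nfa()   # status_num is assigned in creation order
a.edge = 'x'          # this edge consumes the character 'x'
a.next_1 = b          # b serves as the accepting end of the fragment
```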

Construction of simple nodes

Nodes are constructed in nfa.construction, where the lexer is used as a global variable shared by all functions to simplify the code.

The BNF grammar for regular expressions is given below; with it we can use top-down analysis, recursing down from the topmost group:

```
group       ::= ("(" expr ")")*
expr        ::= factor_conn ("|" factor_conn)*
factor_conn ::= factor | factor factor*
factor      ::= (term | term ("*" | "+" | "?"))*
term        ::= char | "[" char "-" char "]" | "."
```
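As a quick illustration, here is roughly how the pattern ab|c decomposes under this grammar (a hand-drawn sketch, not output from the engine):

```
ab|c
└── expr
    ├── factor_conn                  # "ab"
    │   ├── factor → term → char 'a'
    │   └── factor → term → char 'b'
    ├── "|"
    └── factor_conn                  # "c"
        └── factor → term → char 'c'
```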

(See also my earlier series: Write a compiler from zero (C).)

The main entry point

To simplify the code, the lexer is made a global variable shared by all functions.

The main logic is simple: initialize the lexer, then pass in an NFA pair to start the recursive creation of nodes.

```python
def pattern(pattern_string):
    global lexer
    lexer = Lexer(pattern_string)
    lexer.advance()
    nfa_pair = NfaPair()
    group(nfa_pair)

    return nfa_pair.start_node
```
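Used directly, it would look something like this (illustrative; NfaPair simply bundles a start node and an end node):

```python
start_node = pattern('[0-9]*[a-z]+')  # start node of the resulting NFA
```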

term

Although parsing is top-down, the construction is easier to understand bottom-up. term is the lowest-level construction: it builds the nodes for a single character, a character class, or the "." metacharacter.

```
term ::= char | "[" char "-" char "]" | "."
```

The main logic of term is to decide, based on the character currently read, which kind of node to build:

```python
def term(pair_out):
    if lexer.match(Token.L):
        nfa_single_char(pair_out)
    elif lexer.match(Token.ANY):
        nfa_dot_char(pair_out)
    elif lexer.match(Token.CCL_START):
        nfa_set_nega_char(pair_out)
```

The constructors for all three node kinds are very simple. The diagrams that follow were originally sketched with Markdown's Mermaid.

  • nfa_single_char

The NFA for a single character creates two nodes and uses the currently matched character as the edge.
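In diagram form (a rough textual rendering of the original figure):

```
(start) --c--> (end)
```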

```python
def nfa_single_char(pair_out):
    if not lexer.match(Token.L):
        return False

    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = lexer.lexeme
    lexer.advance()
    return True
```
  • nfa_dot_char

The only difference from the single-character NFA above is that its edge is set to CCL and its input_set is filled in.

```python
# Matches any single character
def nfa_dot_char(pair_out):
    if not lexer.match(Token.ANY):
        return False

    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = CCL
    start.set_input_set()

    lexer.advance()
    return True
```
  • nfa_set_nega_char

Logically, this function differs from the one above only in how input_set is filled: the class contents are parsed by dodash, and the set is inverted if the class begins with ^.

```python
def nfa_set_nega_char(pair_out):
    if not lexer.match(Token.CCL_START):
        return False

    negation = False
    lexer.advance()
    if lexer.match(Token.AT_BOL):
        negation = True

    start = pair_out.start_node = Nfa()
    start.next_1 = pair_out.end_node = Nfa()
    start.edge = CCL
    dodash(start.input_set)

    if negation:
        char_set_inversion(start.input_set)

    lexer.advance()
    return True
```
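dodash and char_set_inversion are not shown in this excerpt; here is a plausible sketch of what they do (the actual code is on GitHub and may differ):

```python
def dodash(input_set):
    # Fill input_set with the characters of a class like [a-z0-9_],
    # expanding dashes into ranges, until ']' is reached.
    first = ''
    while not lexer.match(Token.CCL_END):
        if not lexer.match(Token.DASH):
            first = lexer.lexeme
            input_set.add(first)
        else:
            # Expand a range such as a-z.
            lexer.advance()
            for c in range(ord(first), ord(lexer.lexeme) + 1):
                input_set.add(chr(c))
        lexer.advance()

def char_set_inversion(input_set):
    # Replace input_set with its complement over the ASCII alphabet.
    for i in range(ASCII_COUNT):
        c = chr(i)
        if c not in input_set:
            input_set.add(c)
        else:
            input_set.remove(c)
```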

Summary

This article is already over 300 lines long, so I will split the write-up into three parts. The next article covers constructing more complex NFAs and matching input strings with the constructed NFA. The final one covers converting the NFA to a DFA and, at last, matching input with the DFA.