In the previous article, "Home-Made Calculators (with Yacc and Lex) — Home-Made Programming Language 1", I introduced a homemade calculator built with yacc and lex.


This article describes how to build the same calculator without yacc and lex: instead, we write our own lexer and parser, implemented in C.


The code in this article appears mostly as screenshots, for illustration; the line numbers referenced below refer to those screenshots. But don't worry: I've uploaded the source code here.

1. Self-made lexical analyzer

Note: this calculator uses the newline as a delimiter to split the input into individual calculations. Input that spans multiple lines cannot be parsed.

Accordingly, the lexer provides the following two functions:

void set_line(char *line);
void get_token(Token *token);

get_token() takes a pointer to a Token structure; the function splits off the next token, loads it into the structure, and returns. The Token structure shared by the two functions is defined in the lexer header file, shown next.

The lexer header file is as follows:

lexicalanalyzer.h
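The header itself is shown only as a screenshot in the original. Based on the token kinds and the two functions referenced throughout this article, it likely looks something like the following sketch; the `str` field, `MAX_TOKEN_SIZE`, and the include guard are assumptions, not the article's actual code:

```c
/* lexicalanalyzer.h -- sketch reconstructed from what the article references;
   names other than the token kinds and the two functions are assumptions. */
#ifndef LEXICALANALYZER_H_INCLUDED
#define LEXICALANALYZER_H_INCLUDED

typedef enum {
    BAD_TOKEN,
    NUMBER_TOKEN,
    ADD_OPERATOR_TOKEN,
    SUB_OPERATOR_TOKEN,
    MUL_OPERATOR_TOKEN,
    DIV_OPERATOR_TOKEN,
    END_OF_LINE_TOKEN
} TokenKind;

#define MAX_TOKEN_SIZE (100)

typedef struct {
    TokenKind kind;                /* which kind of token this is */
    double    value;               /* numeric value when kind == NUMBER_TOKEN */
    char      str[MAX_TOKEN_SIZE]; /* the raw characters of the token */
} Token;

/* remember the line to be tokenized */
void set_line(char *line);
/* split off the next token and store it into *token */
void get_token(Token *token);

#endif
```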


The code of lexer is as follows:

lexicalanalyzer.c





The lexer works by having the caller pass in a line and then call get_token() repeatedly; each call returns the next split-off token. Because the lexer must remember the line passed in by set_line() and how far into that line it has parsed, the static variables st_line and st_line_pos are used (lines 7 and 8).

The set_line() function simply sets the values of st_line and st_line_pos.

get_token() does the actual splitting of tokens and is the heart of the lexer.

The while statement, which starts at line 16, scans st_line character by character.

The +, -, *, and / operators are each only one character long, so they are returned as tokens as soon as they are scanned.

The numeric part is a little more complicated, because a numeric value consists of multiple characters. When scanning character by character with the while statement, the character currently being scanned is very likely only part of a numeric value, so the characters that look like part of a number must be stored temporarily. To track where we are within a number, the variable status of the enumerated type LexerStatus is used (line 12).

The LexerStatus enumeration is defined in lexicalanalyzer.h.

The initial state is INITIAL_STATUS. When a digit 0-9 is encountered, it is added to the integer part and the state becomes IN_INT_PART_STATUS (line 59). Once a decimal point is encountered, the state changes from IN_INT_PART_STATUS to DOT_STATUS (line 65). From DOT_STATUS, encountering a digit switches to the fractional state IN_FRAC_PART_STATUS (line 61). In IN_INT_PART_STATUS or IN_FRAC_PART_STATUS, once there are no more digits or decimal points, the number ends; its value is accepted and returned.

Handled this way, the lexer completely rejects inputs like .5 or 2..3. Starting at line 23, all whitespace characters except newlines are skipped.
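Since the original get_token() appears only as a screenshot, the number-scanning state machine described above can be sketched as a self-contained function. The helper scan_number() and its signature are illustrative assumptions, not the article's actual code; only the LexerStatus states follow the article:

```c
#include <ctype.h>
#include <stdlib.h>

typedef enum {
    INITIAL_STATUS,
    IN_INT_PART_STATUS,
    DOT_STATUS,
    IN_FRAC_PART_STATUS
} LexerStatus;

/* Scan one number starting at line[*pos] using the state machine described
   above. Returns 1 and stores the value on success; rejects ".5" and "2..3". */
static int scan_number(const char *line, int *pos, double *value)
{
    LexerStatus status = INITIAL_STATUS;
    char buf[100];
    int len = 0;

    for (;;) {
        char c = line[*pos];
        if (isdigit((unsigned char)c)) {
            if (status == INITIAL_STATUS) {
                status = IN_INT_PART_STATUS;  /* entering the integer part */
            } else if (status == DOT_STATUS) {
                status = IN_FRAC_PART_STATUS; /* entering the fraction part */
            }
        } else if (c == '.' && status == IN_INT_PART_STATUS) {
            status = DOT_STATUS;              /* saw the decimal point */
        } else {
            break;                            /* the number ends here */
        }
        buf[len++] = c;
        (*pos)++;
    }
    /* ".5" never leaves INITIAL_STATUS; "2." and "2..3" stop in DOT_STATUS */
    if (status != IN_INT_PART_STATUS && status != IN_FRAC_PART_STATUS) {
        return 0;
    }
    buf[len] = '\0';
    *value = strtod(buf, NULL);
    return 1;
}
```

For "2.5" the states progress INITIAL → IN_INT_PART → DOT → IN_FRAC_PART and the value 2.5 is accepted; for ".5" or "2..3" the function stops in a non-accepting state and reports failure.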

Since this is a lexical analyzer for a calculator, only the four arithmetic operators and numeric values are handled. If you need to extend it to support a programming language, keep the following points in mind.


1. Numbers and identifiers (variable names and so on) can be recognized by tracking a current state, just as in the example above. The same could be done for operators; for instance, the increment operator could get a state such as IN_INCREMENT_OPERATOR, but then the program becomes verbose. It is better to keep an array of operator strings, such as:

static char *st_operator_str[] = {"++", "--", "+", "-", /* omitted */};

The token currently being read can be matched, front to back, against the elements of this array to determine its kind. Note that the matcher must read one character beyond the current candidate: for an input like i + 2, the 2 must also be read in to rule out the possibility of i++. When matching, it is easiest to place the longer operators earlier in the array, as in the example above. As for reserved words such as if and while, it is simplest to recognize them as identifiers first and then look them up in a table of reserved words.
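The operator-table idea from point 1 can be sketched like this; the function name match_operator() and the table contents are illustrative assumptions. Placing the longer operators first makes the prefix scan find "++" before "+":

```c
#include <string.h>

/* longer operators first, so prefix matching finds "++" before "+" */
static char *st_operator_str[] = {"++", "--", "+", "-", NULL};

/* Return the index of the first (and thus longest) operator that is a
   prefix of src, or -1 if none matches. */
static int match_operator(const char *src)
{
    int i;
    for (i = 0; st_operator_str[i] != NULL; i++) {
        size_t len = strlen(st_operator_str[i]);
        if (strncmp(src, st_operator_str[i], len) == 0) {
            return i;
        }
    }
    return -1;
}
```

Given "++i" this matches "++" (not "+"), which is exactly why the one-character-lookahead described above is needed: the lexer must see the second character before deciding.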


2. This calculator processes input line by line: st_line stores a whole line of input. In today's programming languages, however, newlines are generally treated the same as other whitespace characters, so input should not be processed in end-of-line units; it is better to read character by character from the file (with the getc() function) as parsing proceeds. The places where the while statement reads character by character would then be replaced with calls to getc().

2. Make your own parser

Most programmers, even without experience writing a programming language, can guess how a lexer works; a parser is harder to picture. The intuition might be: for a calculator, first split the input at the lowest-priority operators + and -, then process * and /. This thinking is basically correct. In practice, however, the space used to hold the split strings may be needed for other purposes, and parentheses are difficult to handle this way.

The Yacc version of the calculator uses the following syntax rules:

expression /* rules for expressions */
    : term                        /* a term */
    | expression ADD term         /* or expression + term */
    | expression SUB term         /* or expression - term */
    ;
term /* rules for terms */
    : primary_expression             /* a primary expression */
    | term MUL primary_expression    /* or term * primary expression */
    | term DIV primary_expression    /* or term / primary expression */
    ;
primary_expression /* rules for primary expressions */
    : DOUBLE_LITERAL                 /* a real-number literal */
    ;

These grammar rules can be represented by the following syntax diagrams:

A syntax diagram is easy to read: the diagram for term, for example, says that we first enter a primary_expression; from there we can either finish, or consume a * or / and enter another primary_expression, repeating as often as needed. In the syntax diagrams of this series, nonterminals are drawn as rectangles and terminal symbols as ellipses.

As the syntax diagrams suggest, we read in tokens and parse them by recursive descent analysis; this is the parser we will write.

For example, a term is parsed by the function parse_term():

As the syntax diagram begins with primary_expression, parse_primary_expression() is called on line 41. In recursive descent analysis, each nonterminal corresponds to one handler function, and wherever a nonterminal appears in the syntax diagram, that function is called. The for statement starting at line 43 forms an infinite loop that keeps going as long as * (MUL_OPERATOR_TOKEN) or / (DIV_OPERATOR_TOKEN) comes in (any other token breaks out of the loop at line 49). Line 52 calls parse_primary_expression() a second time, corresponding to the primary_expression to the right of * and / in the syntax diagram. For example, given the input 1 * 2 + 3, parse_primary_expression() on line 42 reads 1, my_get_token() on line 53 reads *, and parse_primary_expression() on line 52 reads 2. Multiplication (line 54) or division (line 56) is then performed according to the operator's type. At this point 1 * 2 has been computed, and my_get_token() on line 43 reads in the + token. Since + cannot continue a term, we break out of the loop; but the + has already been consumed, so it must be returned with unget_token() on line 48.

The parser calls my_get_token() rather than using the get_token() of lexicalanalyzer.c directly. my_get_token() maintains a ring buffer that holds a single token (the static variable st_look_ahead_token on line 6 of parser.c below is the entire buffer), which lets unget_token() return the last token read. The + pushed back here is read again through my_get_token(), on line 68 of parse_expression().
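Because the original parse_term() appears only as a screenshot, here is a self-contained miniature with the same structure. The lexer is replaced by a hardcoded token array for 1 * 2 + 3 so the sketch can run on its own; everything except the shape of the parse functions is a stand-in for illustration:

```c
#include <stdio.h>

typedef enum {
    NUMBER_TOKEN, ADD_OPERATOR_TOKEN, SUB_OPERATOR_TOKEN,
    MUL_OPERATOR_TOKEN, DIV_OPERATOR_TOKEN, END_OF_LINE_TOKEN
} TokenKind;

typedef struct { TokenKind kind; double value; } Token;

/* stand-in for the lexer: the token stream for "1 * 2 + 3" */
static Token st_tokens[] = {
    {NUMBER_TOKEN, 1}, {MUL_OPERATOR_TOKEN, 0}, {NUMBER_TOKEN, 2},
    {ADD_OPERATOR_TOKEN, 0}, {NUMBER_TOKEN, 3}, {END_OF_LINE_TOKEN, 0}
};
static int st_pos = 0;

static void my_get_token(Token *token) { *token = st_tokens[st_pos++]; }
static void unget_token(Token *token)  { (void)token; st_pos--; } /* one-token push-back */

static double parse_primary_expression(void)
{
    Token token;
    my_get_token(&token);
    return token.value;  /* the real version also checks token.kind */
}

static double parse_term(void)
{
    double v1, v2;
    Token token;

    v1 = parse_primary_expression();
    for (;;) {
        my_get_token(&token);
        if (token.kind != MUL_OPERATOR_TOKEN
            && token.kind != DIV_OPERATOR_TOKEN) {
            unget_token(&token);  /* e.g. '+' is pushed back here */
            break;
        }
        v2 = parse_primary_expression();
        if (token.kind == MUL_OPERATOR_TOKEN) {
            v1 *= v2;
        } else {
            v1 /= v2;
        }
    }
    return v1;
}

static double parse_expression(void)
{
    double v1, v2;
    Token token;

    v1 = parse_term();
    for (;;) {
        my_get_token(&token);
        if (token.kind != ADD_OPERATOR_TOKEN
            && token.kind != SUB_OPERATOR_TOKEN) {
            unget_token(&token);
            break;
        }
        v2 = parse_term();
        if (token.kind == ADD_OPERATOR_TOKEN) {
            v1 += v2;
        } else {
            v1 -= v2;
        }
    }
    return v1;
}
```

On this token stream, parse_expression() returns 5.0: parse_term() computes 1 * 2, pushes the + back, and parse_expression() then reads the + again and adds 3.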

The complete code is as follows:

As the syntax diagrams show, whenever a nonterminal is encountered, the function for that nonterminal is called recursively, which is why this kind of parser is called a recursive descent parser.


At this point, the syntax parser is complete.


parser.h:





parser.c:





Processing of pre-read tokens

The recursive descent parser used in this series pre-reads one token and returns it via unget_token() when the pre-read token turns out not to be needed. An alternative is to always keep one token read ahead. In that style, parse_term() of parser.c can be rewritten as follows:

/* the token variable already holds the next token */
v1 = parse_primary_expression();
for (;;) {
    /* no need to read a token here */
    if (token.kind != MUL_OPERATOR_TOKEN
        && token.kind != DIV_OPERATOR_TOKEN) {
        /* and no push-back is needed */
        break;
    }
    kind = token.kind;
    my_get_token(&token);
    /* parse function for primary_expression */
    v2 = parse_primary_expression();
    if (kind == MUL_OPERATOR_TOKEN) {
        v1 *= v2;
    } else if (kind == DIV_OPERATOR_TOKEN) {
        v1 /= v2;
    }
}
return v1;

The two implementations are essentially the same.

3. Some theoretical knowledge: LL(1) and LALR(1)

The parser above pre-reads one token and, following the flow of the syntax diagrams, reads in all the tokens. A parser of this type is called an LL(1) parser, and a grammar that an LL(1) parser can parse is called an LL(1) grammar.

Pascal's syntax, for example, is LL(1).

In an LL(1) grammar, each nonterminal corresponds to a function inside the parser, and the parser must be able to decide, from the first token it reads, which rule applies to the current nonterminal. In Pascal, for example, GOTO statements may only use numeric labels. The reason for this restriction is that if labels could be identifiers starting with a letter, as in C, the parser could not tell from the first token whether it was reading an assignment statement or a labeled statement, because both begin with the same kind of identifier. An LL(1) parser is therefore relatively simple, but the range of grammars it can express is comparatively narrow.

LL(1) grammars also differ somewhat from the BNF rules shown earlier. Recall that the BNF rules look like this:

expression /* rules for expressions */
    : term
    | expression ADD term /* or expression + term */

When implementing recursive descent analysis from such a rule, parse_expression() would call parse_expression() as its very first step, producing an endless loop in which no token is ever read. Rules of this form are called left recursive, and left-recursive grammar rules cannot be transcribed directly into a recursive descent parser.

The parser that yacc generates is called an LALR(1) parser, and a grammar such a parser can parse is called an LALR(1) grammar. The LALR(1) parser is one kind of LR parser.

The first L of LL(1) means that tokens are read from the left end of the source code. The second L stands for Leftmost derivation: as tokens are read in, the parse tree is derived starting from its left end. An LR parser also reads tokens from the left (the same as an LL(1) parser), but when it reduces, it does so from the right; this is called Rightmost derivation, the R in LR.

A recursive descent parser builds the parse tree in top-down order, which is where the "descent" in its name comes from. An LR parser, by contrast, builds the tree bottom-up, and is therefore also called a bottom-up parser.

In addition, the (1) in LL(1) and LALR(1) is the number of tokens of lookahead the parser needs.

The LA at the beginning of LALR(1) is short for Look Ahead: by reading one token ahead, the parser states implied by the grammar rules can be determined and the parse table generated. The calculator in this article uses an LL(1) grammar, which is relatively simple and well suited to a handwritten parser. LR grammars such as LALR(1) are better suited to automatic generation with tools like yacc.

Although Pascal uses an LL(1) grammar, it has both assignment statements and procedure calls (function calls, in C terms). As noted above, both begin with the same kind of identifier, which an LL(1) parser seemingly cannot distinguish. Rather than forcing the distinction up front, Pascal introduces a nonterminal that represents "an assignment statement or a procedure call," and separates the two only after the next token has been read.

In C, even yacc's LALR(1) parser runs into trouble when typedef introduces new type names. For example, in Hoge *hoge_p = NULL; it is hard to tell whether * is a multiplication operator or a pointer declarator just by looking at the identifier Hoge. C resolves this with a trick: once an identifier has been declared as a type name, the parser tells the lexer to return it as a type-name token rather than an ordinary identifier.

4. Expand your calculator

4.1 Make your calculator support parentheses

First, the changes to the yacc/lex version of the calculator introduced in the previous article.

Start by adding two tokens to mycalc.l:

"+" return ADD; "-" return SUB; "*" return MUL; "/" return DIV; "(" return LP; // new ")" return RP; // add "\n" return Cr;

Then change the syntax rule for mycalc.y:

primary_expression
    : DOUBLE_LITERAL
    | LP expression RP { $$ = $2; }
    ;

Also modify the token part of mycalc.y:

%token ADD SUB MUL DIV CR LP RP

That is, an expression wrapped in parentheses is itself a primary_expression.

With the above modifications, the yacc/lex calculator supports parentheses, but that is not enough for our hand-written calculator. Its syntax diagram for primary_expression should be changed to the following:



In other words, an expression wrapped in parentheses is treated as a whole primary_expression. parser.c, changed along these lines, looks like this:

static double parse_primary_expression()
{
    Token token;
    double value;

    my_get_token(&token);
    if (token.kind == NUMBER_TOKEN) {
        return token.value;
    } else if (token.kind == LEFT_PAREN_TOKEN) {
        value = parse_expression();
        my_get_token(&token);
        if (token.kind != RIGHT_PAREN_TOKEN) {
            fprintf(stderr, "missing ')' error.\n");
            exit(1);
        }
        return value;
    } else {
        unget_token(&token);
        return 0.0;
    }
}

If you want this calculator to run, you need to modify the code in lexicalanalyzer.c:

    } if (current_char == '/') {
        token->kind = DIV_OPERATOR_TOKEN;
        return;
    } else if (current_char == '(') {
        token->kind = LEFT_PAREN_TOKEN;
        return;
    } else if (current_char == ')') {
        token->kind = RIGHT_PAREN_TOKEN;
        return;
    }
    if (isdigit(current_char)) {
        /* omitted */

In addition, the token-kind enumeration needs the new members LEFT_PAREN_TOKEN and RIGHT_PAREN_TOKEN.

Once this is done, we can compile and run our own calculator, which now supports parentheses.

4.2 Make your calculator support negative numbers

As with parentheses, first modify the yacc/lex calculator. Because the regular expressions that define numeric values are [1-9][0-9]* and [0-9]+\.[0-9]*, negative numbers are not covered at all. To support them, the first idea might be to have the lexer treat an input like -5 as DOUBLE_LITERAL, but then 3-5 would be tokenized as 3 followed by -5. So if negative numbers are not to be handled as single tokens, the parser must be modified instead. The first thing that comes to mind is:

primary_expression
    : DOUBLE_LITERAL
    | LP expression RP { $$ = $2; }     /* added in 4.1 for parentheses */
    | SUB DOUBLE_LITERAL { $$ = -$2; }
    ;

This does attach a minus sign to a real-number literal, but it cannot attach one to a parenthesized expression. It needs to be modified again:

primary_expression
    : DOUBLE_LITERAL
    | LP expression RP { $$ = $2; }
    | SUB primary_expression { $$ = -$2; }
    ;

In recursive descent analysis, the minus sign can be allowed with a syntax diagram like the following:

The corresponding code modification (parser.c) is as follows:

static double parse_primary_expression()
{
    Token token;
    double value = 0.0;
    int minus_flag = 0;

    my_get_token(&token);
    if (token.kind == SUB_OPERATOR_TOKEN) {
        minus_flag = 1;
    } else {
        unget_token(&token);
    }
    my_get_token(&token);
    if (token.kind == NUMBER_TOKEN) {
        value = token.value;
    } else if (token.kind == LEFT_PAREN_TOKEN) {
        value = parse_expression();
        my_get_token(&token);
        if (token.kind != RIGHT_PAREN_TOKEN) {
            fprintf(stderr, "missing ')' error.\n");
            exit(1);
        }
    } else {
        unget_token(&token);
    }
    if (minus_flag) {
        value = -value;
    }
    return value;
}

This completes support for negative numbers and allows you to compile and run tests.

The code in this article has been uploaded to GitHub: The calculator and Extended calculator.