A compiler is a program that translates from one language to another.

Babel, for example, is a compiler that translates ES6 versions of JS into ES5 versions of JS. From this point of view, the translation software that translates English into Chinese also belongs to the compiler.

In general, the CPU cannot execute the program directly, because the CPU can only recognize the machine instructions. So to execute a program, you first have to translate a program written in a high-level language into assembly code (Java has an extra step of translating the high-level language into bytecode) and then translate assembly code into machine instructions that the CPU can recognize and execute.

Because assembly language and machine language one – to – one correspondence, and assembly language is more readable. Therefore, the teaching materials of computer principles usually use assembly language instead of machine language when explaining machine instructions.

The compiler for the four operations I will write in this article needs to translate the four operation expressions, such as 1 + 1, into machine instructions and execute them. See the example for the specific process:

// CPU cannot recognize 10 + 5 // translate into assembly language push 10 push 5 add // translate into machine instruction, assembly code and machine instruction one to one corresponding // machine instruction consists of 1 and 0, the following instruction is not real instruction, Demonstration only 0011101001010101 1101010011100101 0010100111100001

The four operation compiler, although the function is very simple, can only compile four operation expressions. But the front-end part of the compiler principle is almost all involved: lexical analysis, syntax analysis. There is also the code generation of the back-end part of the compilation principle. Whether it’s a simple, complex compiler, the compilation steps are similar, it’s just more difficult to implement with a complex compiler.

One might ask, what are the benefits of learning the principles of compilation?

I think understanding the inner workings of the compilation process will make you a better senior programmer. In addition, here is a more specific answer from Zhihu netizen – as you wish:

  1. It is easier to understand what is equivalent and what is different in a language
  2. The differences between different languages can be compared more objectively
  3. And less likely to be fooled by the propagandists of a particular language
  4. Learning a new language is also more efficient
  5. Converting from language A to language B is a universal requirement, and learning the principles of compilation will make it easier to handle such a requirement

Okay, now let’s look at how to write a four-rule operation compiler.

Lexical analysis

A program is simply a series of characters stored in a text file. The purpose of lexical analysis is to break the series of characters into tokens (also known as terminators) according to certain rules, ignoring Spaces and comments.

Example:

// token 10 + 5 + 6 after lexical analysis

terminator

Terminals are the basic elements of the language and cannot be decomposed.

Terminals in the four operations include symbols and integer constants (unary operators and floating-point operations are not currently supported).

  1. symbol:+ - * / ()
  2. Integer constants: 12, 1000, 111…

Lexical analysis code implementation

function lexicalAnalysis(expression) { const symbol = ['(', ')', '+', '-', '*', '/'] const re = /\d/ const tokens = [] const chars = expression.trim().split('') let token = '' chars.forEach(c => { if (re.test(c)) { token += c } else if (c == ' ' && token) { tokens.push(token) token = '' } else if (symbol.includes(c)) {  if (token) { tokens.push(token) token = '' } tokens.push(c) } }) if (token) { tokens.push(token) } return tokens } console.log(lexicalAnalysis('100 + 23 + 34 * 10 / 2')) // ["100", "+", "23", "+", "34", "*", "10", "/", "2"]

Syntax rules for four operations (Syntax rules are hierarchical)

  1. x*, denotes zero or more occurrences of x
  2. x | y, indicating that either x or y will appear
  3. ( )The parentheses are used to group words in a language

The following rule, viewed from left to right, means that the expression on the left can be subdivided further into the expression on the right until it can no longer be subdivided.

  • expression: addExpression
  • addExpression: mulExpression (op mulExpression)*
  • mulExpression: term (op term)*
  • term: ‘(‘ expression ‘)’ | integerConstant
  • op: + - * /

AddExpression corresponds to + – expressions, and MulExpression corresponds to * / expressions.

If you don’t understand the above rules, put them down and keep reading. Let’s see how to do the parsing in code.

Syntax analysis

The process of analyzing the input text according to the grammatical rules and determining its grammatical structure is called grammatical analysis.

The output of general parsing is an abstract syntax tree (AST) or a parse tree. But since the four operations are relatively simple, the solution here is to do code generation and error reporting in real time, so you don’t need to keep the entire program structure in memory.

Let’s see how to analyze a four-word expression 1 + 2 * 3.

The first match is Expression, which is decomposed to AddExpression because there is only one possible way down Expression at present, namely AddExpression. And so on, The next order is MulExpression, term, 1 (integerConstant), + (op), mulExpression, term, 2 (integerConstant), * (op), mulExpression, term, 3 (integerCons) Tant).

As shown in the figure below:

You might be wondering why an expression is so complicated, where expression has addExpression, and addExpression has mulExpression. Mulexpr has a higher level of operation than addExpr.

1 + 2 * 3 compileExpression | compileAddExpr | | compileMultExpr | | | compileTerm | | | | _ matches IntegerConstant push 1 | | | _ | | matches' + '| | compileMultExpr | | | compileTerm | | | | _ matches integerConstant push 2 | | | matches' * '| | | compileTerm | | | | _ matches IntegerConstant push | 3 | | _ compileOp (' * ') * | | _ compileOp (' + ') + | _

There are many algorithms that can be used to build a parse tree, but I’ll cover just two.

Recursive descent analysis

Recursive descent analysis, also known as top down analysis. The token stream is recursively analyzed step by step according to the syntax rules, and if a non-terminal is encountered, the analysis continues until a terminal is encountered.

LL (0) analysis

Recursive descent analysis is a simple and efficient algorithm. LL(0) adds an additional step on this basis. When the first token is not enough to determine the element type, “looking ahead” for the next character may resolve this uncertainty.

The above is a brief introduction to the two algorithms, the specific implementation please see the code implementation below.

Expression code generation

The four operand expressions we usually use are infix expressions, but infix expressions are not easy to evaluate on computers. So during the code generation phase, you convert an infix expression to a postfix expression.

Postfix expression

A postfix expression, also known as an inverse Polish expression, contains no parentheses, the operators are placed after the two operators, and all computations are performed strictly from left to right in the order in which the operators appear (regardless of precedence rules for the operators).

Example:

The infix expression: 5 + 5 is converted to the postfix expression: 5 5 +, and the code is generated from the postfix expression.

// Convert 5 + 5 to 5 + regenerate code push 5 push 5 add

Code implementation

The theoretical knowledge of compilation principle is like a book, often let people look in the clouds, but really do it, you will find that it is actually quite simple.

If the above theoretical knowledge is not too understand, it does not matter, first look at the code implementation, and then combined with the theoretical knowledge to see.

Note: We need to introduce the lexing code we just introduced here.

Function assemblyWriter () {this.output = "} assemblyWriter. Prototype = {writePush(digit) {this.output +=  `push ${digit}\r\n` }, writeOP(op) { this.output += op + '\r\n' }, // OutputStr () {return this.output}} // function Parser(tokens, tokens) If (((())) {if (())) {if (())) {if (())) {if (())) {if (())) {if (())) {if (())) {if (())) {if (())) 'sub', } this.opMap2 = { '/': 'div', '*': 'mul' } this.init() } Parser.prototype = { init() { this.compileExpression() }, compileExpression() { this.compileAddExpr() }, compileAddExpr() { this.compileMultExpr() while (true) { this.getNextToken() if (this.opMap1[this.token]) { let op = Enclosing opMap1 [this token] this.com pileMultExpr (). This writer. WriteOP (op)} else {/ / there is no matching the corresponding operator Here there is no matching + - / / will be token This. I -- break}}, compileMultExpr() { this.compileTerm() while (true) { this.getNextToken() if (this.opMap2[this.token]) { let op = This.opmap2 [this.token] this.pileterm () this.writer.writeOp (op)} else {this.opmap2 [this.token] this.pileterm () this.writer.writeOp (op)} else {this.opmap2 [this.token] this.pileterm () this.writer.writeOp (op)} else This. I -- break}}, compileTerm() { this.getNextToken() if (this.token == '(') { this.compileExpression() this.getNextToken() if (this.token ! = ')') {throw 'missing close parenthesis: } else if (/^\d+$/.test(this.token)) {this.writer.writePush(this.token)} else {throw ': Tokens (' +this. tokens + ')'}, getNextToken() {this.tokens = this.tokens[++this. I]}, getInstructions() { return this.writer.outputStr() } } const tokens = lexicalAnalysis('100+10*10') const writer = new AssemblyWriter() const parser = new Parser(tokens, Parser.getInstructions () console.log(instructions) // output the generated assembly code /* push 100 push 10 push 10  mul add */

Simulation execution

Now let’s simulate the CPU executing machine instructions. Since assembly code and machine instructions correspond one to one, we can create an emulator that executes assembly code directly. Before creating the simulator, let’s explain the operation of the instructions.

The stack

In memory, the characteristics of the stack is only on the same side of the insert and delete operations, that is, only push and pop operations.

push

The purpose of the push directive is to push an operand onto the stack.

pop

The POP instruction is used to pop an operand off the stack.

add

What the add instruction does is perform two pop operations, pop two operands a and b, then perform a + b, and push the result onto the stack.

sub

The function of the sub instruction is to perform two POP operations, pop two operands a and b, then perform a-b, and push the result onto the stack.

mul

The purpose of the MUL instruction is to perform two POP operations, pop two operands a and b, then perform a * b, and push the result onto the stack.

div

The function of the sub instruction is to perform two POP operations, pop two operands a and b, then perform a/b, and push the result onto the stack.

All the instructions of the four operations have been explained, don’t you think it is very simple?

Code implementation

Note: Code for lexing and parsing needs to be introduced

function CpuEmulator(instructions) { this.ins = instructions.split('\r\n') this.memory = [] this.re = /^(push)\s\w+/ this.execute() } CpuEmulator.prototype = { execute() { this.ins.forEach(i => { switch (i) { case 'add': this.add() break case 'sub': this.sub() break case 'mul': this.mul() break case 'div': this.div() break default: if (this.re.test(i)) { this.push(i.split(' ')[1]) } } }) }, add() { const b = this.pop() const a = this.pop() this.memory.push(a + b) }, sub() { const b = this.pop() const a = this.pop() this.memory.push(a - b) }, mul() { const b = this.pop() const a = this.pop() this.memory.push(a * b) }, Div () {const b = this.pop() const a = this.pop(); Push (math.floor (a/b))}, push(x) {this.memory.push(ParseInt (x))}, pop() { return this.memory.pop() }, getResult() { return this.memory[0] } } const tokens = lexicalAnalysis('(100+ 10)* 10-100/ 10 +8* (4+2)') const writer =  new AssemblyWriter() const parser = new Parser(tokens, writer) const instructions = parser.getInstructions() const emulator = new CpuEmulator(instructions) console.log(emulator.getResult()) // 1138

A simple four-arithmetic compiler has been implemented. Let’s write another test function and run it to see if it works as we expect:

function assert(expression, result) {
    const tokens = lexicalAnalysis(expression)
    const writer = new AssemblyWriter()
    const parser = new Parser(tokens, writer)
    const instructions = parser.getInstructions()
    const emulator = new CpuEmulator(instructions)
    return emulator.getResult() == result
}

console.log(assert('1 + 2 + 3', 6)) // true
console.log(assert('1 + 2 * 3', 7)) // true
console.log(assert('10 / 2 * 3', 15)) // true
console.log(assert('(10 + 10) / 2', 10)) // true

The tests were all correct. In addition to the complete source code, I suggest not to understand the students to see more than two times.

Take it to the next level

For industrial-grade compilers, the four arithmetic compilers are toys within toys. But you can’t eat too much at once, so it’s best to take a step-by-step approach to learning the principles of compilation. Here’s a more advanced compiler that compiles a Jack language (Java-like language) with a syntax like this:

class Generate { field String str; static String str1; constructor Generate new(String s) { let str = s; return this; } method String getString() { return str; } } class Main { function void main() { var Generate str; let str = Generate.new("this is a test"); do Output.printString(str.getString()); return; }}

This is a test. This is a test.

Do you want to implement such a compiler?

This compiler comes from a book called “Elements of Computer Systems,” and it starts in chapter 6, Chapter 11 covers the assembly compiler (which translates assembly language into machine language), the VM compiler (which translates VM language like bytecode into assembly language), and the Jack language compiler (which translates the high-level Jack language into VM language). Each chapter has detailed knowledge points to explain and experiment, as long as you follow the experiment step by step, you can finally achieve such a compiler.

If the compiler is finished, where does the machine language end up executing?

This book has it in mind for you. It has five chapters from chapter 1 to chapter 5. Teach you to start from the logic gate, gradually build the arithmetic logic unit ALU, CPU, memory, and finally build a modern computer. And then let the program that you compile with the compiler run on this computer.

In addition, Chapter 12 of the book teaches you how to write various library functions for the operating system, such as the Math library (which contains various mathematical operations), Keyboard library (how the Keyboard is displayed on the screen when pressed), memory management, and so on.

Want to see what happens after the book’s 12 chapters of experiments? Here I provide a few pieces of this analog computer running program DEMO GIF, for your reference.

The upper right corner of these pictures is the screen of the “computer”, the rest of the “computer” is the stack area and the instruction area.

I’ve done all the experiments for this book (3 hours a day, 2 months), and the answers are on my GitHub for you to check out if you’re interested.

The resources

  • Elements of Computer Systems

My blog will be synchronized to OSCHINA community, this is my OSCHINA ID: Tan Guangzhi, invite everyone to come together: https://www.oschina.net/shari…