Preface

Books on compiler principles vary widely: most are heavy on theory and light on practice, while the practical ones offer too little theory to build a real methodology. Textbooks also tend to implement only a subset of a language, stripping out many basic features and skipping newer ones entirely. I recently implemented the full pipeline from source language to intermediate code to object code running on a register-based virtual machine, with support for object orientation, functions, closures, and other features. This post mainly records ideas from that engineering practice and tries to leave out the low-level details.

Overview

A digression before we start: at the learning stage, it pays to implement a compiler for a language you know well and whose feature set is reasonably rich. The main purpose is to understand the theory and practice of compiling and running programs, stringing together the knowledge of a whole domain, so it is best to exclude irrelevant information along the way and focus on the main problem. Since we want to implement a compiler or interpreter for a language, we first need to know what kind of language it is and roughly which basic features it must support, so let's start with programs and languages.

Programs and Languages

A program is a set of instructions that we feed to a computer to solve a computable problem. But binary instructions that the machine can read directly are nearly unreadable to a human, let alone writable. So people came up with semantic text to represent each instruction: for example, 0x10 0x1 0x2 0x3 is written instead as add R1, R2, R3, so that the meaning of the instruction — add the values of R1 and R2 and put the result into R3 — can be seen at a glance. Text in this form is assembly language. But the machine does not recognize these texts, so a program is needed to convert them into machine instructions; this conversion program is called an assembler. What is the assembler itself written in? Either assembly or a high-level language, provided you already have a program that can convert that into binary machine code. So it is conceivable that the earliest assembler was written directly in machine code.

Likewise, assembly corresponds to machine code almost one to one, which is still too tedious: you just want to write a simple program, yet you must attend to too many hardware details. So people tried to design more advanced languages that remove those irrelevant details and make the same amount of text more expressive. Of course, such a language eventually needs to be converted to a low-level one such as assembly, and the program that converts a high-level language into a low-level language is called a compiler.

Back to modern computers: our programs actually run on an operating system, so what runs is usually not a pure binary program. A pure binary program contains nothing but instructions and data, while programs that run on an operating system follow an executable file format defined by that system; at load time, the program loader parses the fixed-format file, allocates memory, and loads the program's instructions and data. A pure binary program really depends only on the instruction set the hardware supports. Since our programs do not run on bare metal, we generally use the assembler, linker, and program loader provided with the operating system — with assembly instructions coming in two different styles, Intel and AT&T. After the compiler emits assembly instructions, the assembler produces object files, and finally the linker produces an executable in the format the operating system supports.

Of course, programming language design is a field of its own; only by studying what languages are made of can one hope to design a good language. A compiler is just a tool that realizes the meaning of a language.

Compilers and interpreters

In fact there is no strict boundary between compilers and interpreters; the distinction is usually drawn by whether the final object code consists of native instructions for the target operating system. Interpreters alone come in several varieties: they can interpret the syntax tree directly, interpret generated intermediate code, or hand generated instructions to a virtual machine for execution. The first two usually work directly on in-memory data structures, while the last can write the instructions out to a file, which on the next run is loaded and parsed according to a defined format and executed without recompiling. So a better criterion is whether the tool performs thorough semantic analysis and produces a new executable artifact; whether the thing that ultimately executes it is the CPU or a virtual machine program does not matter much. People love to say that such-and-such language "is compiled" or "is interpreted", but that is not accurate: the language itself has nothing to do with how it runs — that depends on how the compiler is implemented, and even C/C++ can be executed by an interpreter.

Let's look at the complete flow of a compiler, divided into the following steps:

  • First read the source text file to obtain a character stream.
  • The character stream is fed into a lexical analyzer, which converts one character or a group of characters into tokens (words) according to the lexical rules.
  • The resulting token stream is handed to the parser, which builds an abstract syntax tree according to the grammar rules. There are good tools that generate lexers and parsers automatically, but writing them by hand is not much trouble either.
  • Semantic analysis is performed by traversing the abstract syntax tree. It generally includes reference resolution, type checking, type inference, and semantic validation, depending on the design goals of the language. Subsequent phases rely on the symbol information recorded during this analysis.
  • Based on the results of semantic analysis, generate intermediate representation (IR) code. IR comes in tree and linear forms, the latter being more common. The purpose of generating IR is to enable machine-independent optimization later and to make object code generation simpler.
  • Generate object code. Usually one intermediate instruction corresponds to one or more object instructions. Cross-platform languages today generally emit an instruction set of their own design that runs directly on a virtual machine. In a stack VM all instructions operate on the stack, so code generation need not track operand locations; a register-based design requires register allocation. Stack instructions take up less storage space, while register machines are closer to modern computers and can run faster.
  • To produce an executable for the target operating system directly, use that system's assembler and linker. To run on a virtual machine, the object code is interpreted and executed directly according to the runtime information.
  • The whole compiler pipeline is conventionally split into two stages: front end and back end. The front end covers lexical analysis, syntax analysis, semantic analysis, and intermediate code generation; the back end covers optimization, object code generation, and so on.

The normal process is much the same everywhere, with optional extras such as optimization passes during compilation, intermediate code designed at several levels depending on how close it stays to the source language, or handing the result to an assembler after compilation and letting the linker produce an executable for the target platform. Of course, many compilers do not follow this process strictly: some never explicitly build an abstract syntax tree or intermediate code and do everything while parsing. Let's walk through the process.

Lexical analysis

A lexer, as the name suggests, turns the input, character by character, into tokens. A token generally stores the word type, the text content, and the line and column numbers in the source code, so we can naturally define the following structure:

class Token {
    int tokenType;
    String text;
    int line, column;
}

To perform lexical analysis we first need to define a set of lexical rules, such as: which keywords and basic operators are supported? What are the naming rules for identifiers? How are integers, decimals, strings, and so on written? The common rules are similar across languages, but how do we actually recognize words? Books on compiler principles usually present lexical analysis together with finite automata, but the two are not inseparable. Let's start with a brief description of finite automata.

Finite automata

A finite automaton, as a mathematical model, represents a system with a finite number of internal states. The system has an initial state and changes its current state according to new input until it reaches a terminating state. A finite automaton therefore consists of the following parts:

  • An initial state
  • A set of termination states
  • A finite set of states
  • A list of characters allowed to be entered
  • A transition table of “one state + input character -> another state”

So how do we use this model? Take recognizing variable names and number literals as an example. Assume a variable name starts with an uppercase or lowercase letter, followed by any letters or digits. Then the following state transition diagram can be drawn:

Double circles indicate terminating states; state 1 is the initial state. On a digit, state 1 moves to state 3, which keeps consuming digits until the input character is no longer a digit; on a letter, state 1 moves to state 2, which then accepts letters or digits until the condition fails and the automaton terminates. Following the state machine model, such a problem can be coded quickly. First, determine the state to transfer to from the initial state plus the input character:

int state = 1;
if (isLetter(ch)) {
    state = 2
    append(ch)
} else if (isDigit(ch)) {
    state = 3
    append(ch)
}

Then continue processing input characters based on the current state:

while (has next ch) {
    switch (state) {
        case 2:
            if (isLetter(ch) || isDigit(ch)) {
                append(ch)
            } else {
                appendToken()
                initState()
            }
            break;
        case 3:
            if (isDigit(ch)) {
                append(ch)
            } else {
                appendToken()
                initState()
            }
            break;
    }
}

From another perspective, the whole state-machine process resembles the logic we would normally write anyway. Even without invoking state-machine theory, the problem can still be solved quickly; see the following pseudocode:

while (has next ch) {
    if (isLetter(ch)) {
        processId()
    } else if (isDigit(ch)) {
        processNum()
    }
}
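A runnable version of this direct approach might look like the sketch below (Java used for illustration; `SimpleScanner` and `scan` are made-up names, and a real lexer would also track token types, lines, and columns):

```java
import java.util.ArrayList;
import java.util.List;

// A runnable sketch of the direct scanning loop above: it splits the input
// into identifier-like words and numbers without an explicit state table.
// Illustrative only -- not the full lexer of any particular language.
class SimpleScanner {
    static List<String> scan(String src) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (Character.isLetter(ch)) {
                // processId(): consume letters and digits
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                words.add(src.substring(start, i));
            } else if (Character.isDigit(ch)) {
                // processNum(): consume digits
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                words.add(src.substring(start, i));
            } else {
                i++; // skip whitespace and anything unrecognized
            }
        }
        return words;
    }
}
```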

So it should be clear that the finite automaton model is just one tool for solving string matching, and we are not obliged to use it. First, when there are many word rules to match, the number of states also grows greatly, adding a lot of unnecessary state bookkeeping. Second, when coding purely in the state-machine style, readability suffers badly once there are many states and no transition diagram at hand. Of course, lexical analysis can also be done with regular expression matching — and regular expression engines are themselves implemented on top of finite automata theory.

Working from ordinary intuition often gets twice the result with half the effort. Take keywords and identifiers: the keyword set is generally a subset of the identifier set, so a better approach is to build a hash table of keywords, have the lexer uniformly recognize such words as identifiers, and then look each one up in the keyword table to decide its final token type — no need to maintain a pile of extra intermediate states.
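That idea can be sketched as follows — a hypothetical keyword table where `classify`, the keyword set, and the numeric token types are all illustrative assumptions, not part of any particular lexer:

```java
import java.util.Set;

// Minimal sketch of the keyword-table trick: the lexer first recognizes any
// letter-led word as an identifier, then consults this table to decide the
// final token type. Token types are plain ints, as in the Token class above.
class KeywordTable {
    static final int IDENTIFIER = 0;
    static final int KEYWORD = 1;

    // The keyword set of the (hypothetical) language being compiled.
    static final Set<String> KEYWORDS = Set.of("if", "else", "while", "class", "return");

    static int classify(String word) {
        return KEYWORDS.contains(word) ? KEYWORD : IDENTIFIER;
    }
}
```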

Syntax analysis

The parser's job is to convert the token stream produced by the lexer into a data structure organized according to the grammar rules. In general, a tree is a good choice, and the generated tree is called an AST (Abstract Syntax Tree). Again, the first thing to do is design the grammar rules of the language. Like natural languages, programming languages must have some sort of fixed structure: if a statement is to be expressed, it must be expressed in a conventional form. Grammars are usually written in EBNF (Extended Backus-Naur Form), though any notation works as long as it is precise; there is no need to be tied to a specific standard. Grammar rules consist of non-terminals and terminals: the former stand for an abstract class of things and can be expanded further, while the latter denote concrete things. Take Chinese as an example: a simple sentence consists of a subject, predicate, and object, where the subject and object can be a personal pronoun or a noun, and the predicate is a verb. That gives the following productions:

sentence -> subject predicate object
subject  -> personal_pronoun | noun
predicate -> go | eat | run | play | ...
object   -> personal_pronoun | noun
personal_pronoun -> you | I | he | ...
noun     -> the specific name of a thing

The left side of each rule is its name, and the right side is the production, i.e. the rule of composition. Parsing the sentence "Xiao Ming eats", for example, finally yields a syntax tree in which terminals and non-terminals are color-coded (figure omitted). Notice that when represented as a syntax tree, the terminals at the leaf nodes cannot be expanded further. The same goes for programming languages: the ultimate goal of parsing is to obtain such a syntax tree. The kind of grammar the parsing phase relies on is called a context-free grammar, which means this stage can only check the grammatical structure of a sentence for errors; it cannot catch logical errors. An expression such as "one string minus another string" is syntactically correct, but semantically it is not supported by ordinary languages. Now let's look at programming languages.

Take the while loop statement as an example:

stmt -> expr | '{' stmt '}' | if_stmt | while_stmt
while_stmt -> WHILE '(' expr ')' stmt
  • Note: uppercase names and characters in single quotes denote terminals.

A stmt is an expression, a block of statements wrapped in curly braces, an if statement, or a while statement; a while statement consists of the terminal WHILE, an opening parenthesis, an expression, a closing parenthesis, and a statement. Since grammar rules are fixed structures, can we use regular expressions to match them directly? The answer is no: a closer look shows that regular grammars cannot handle recursive definitions. Given this recursive syntax structure, it is natural to analyze it recursively, as shown in the pseudocode:

void parseStmt() {
    if (curToken is '{') {
        parseStmt()
    } else if (curToken is IF) {
        parseIfStmt()
    } else if (curToken is WHILE) {
        parseWhileStmt()
    } else {
        parseExpr()
    }
}

void parseWhileStmt() {
    match(WHILE)
    match('(')
    parseExpr()
    match(')')
    parseStmt()
}

The overall logic is quite clear: this is the recursive descent algorithm for syntax analysis. Functions for matching against the token stream can generally be implemented in advance. There are several parsing algorithms, but for a hand-written parser recursive descent is the most common, which also fits the principle of using what is sufficient. One of the main problems with using recursive descent for parsing is handling expression precedence.
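Those pre-implemented matching helpers might look like the following sketch, with tokens simplified to plain strings (`TokenStream`, `match`, and `lookahead` are illustrative names, not from any real parser):

```java
// Hypothetical token-stream helpers that a hand-written recursive-descent
// parser typically prepares in advance; tokens are simplified to strings.
class TokenStream {
    private final String[] tokens;
    private int pos = 0;

    TokenStream(String... tokens) { this.tokens = tokens; }

    // Look at the current token without consuming it.
    String peek() { return pos < tokens.length ? tokens[pos] : null; }

    // True if the current token is the expected one.
    boolean lookahead(String expected) { return expected.equals(peek()); }

    // Consume the current token if it matches, otherwise fail loudly.
    String match(String expected) {
        if (!lookahead(expected)) {
            throw new RuntimeException("expected " + expected + ", got " + peek());
        }
        return tokens[pos++];
    }
}
```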

Expression precedence

Take the expression 1+2*3 as an example; the resulting syntax tree (figure omitted) is one from which a post-order traversal computes the correct result. Handling precedence really means evaluating the highest-precedence subexpression first, which in the syntax tree means that operators with higher precedence sit lower in the tree. That leads to productions like these:

expr -> add
add -> add + mul | mul
mul -> mul * pri | pri
pri -> Id | Literal

But this kind of production leads to the left recursion problem; the matching logic would look something like:

void parseAdd() {
    parseAdd()
    match('+')
    parseMul()
}

parseAdd here recurses on itself indefinitely, and expanding add likewise never terminates:

add + mul + mul + mul + ...

As you can see, the problem is that parseAdd() recurses first, when parseMul() should be called first. The standard textbook solution is to rewrite the left recursion as right recursion, like this:

add  -> mul add'
add' -> '+' mul add' | ε

Write it in pseudocode as follows

void parseAdd() {
    parseMul()
    if (next is '+') {
        parseAdd_()
    }
}
void parseAdd_() {
    match('+')
    parseMul()
    if (next is '+') {
        parseAdd_()
    }
}

Parsing with right recursion introduces an associativity problem: arithmetic is normally left-associative, but right recursion makes it right-associative. Take the expression 1+2+3 as an example (figure omitted): one tree groups from left to right, the other from right to left, which at first seems not to matter. But for 1-2-3, grouping left to right gives (1-2)-3 = -4, while grouping right to left gives 1-(2-3) = 2, so the problem has to be solved. The production can be rewritten as follows:

add -> mul ('+' mul)*

This describes the same rule as the right-recursive form, but logically it lets us replace the recursion with a loop:

void parseAdd() {
    c1 = parseMul()
    while (next is '+') {
        match('+')
        c2 = parseMul()
        c3 = new Expr('+', c1, c2)
        c1 = c3
    }
}
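For a concrete check of the associativity this loop produces, here is a runnable specialization that folds integer literals immediately instead of building `Expr` nodes (an illustrative sketch; `AddParser` is a made-up name, and parseMul is simplified to reading a literal):

```java
// Left-associative folding is equivalent, for checking associativity, to
// building the tree: the loop means 1-2-3 is grouped as (1-2)-3.
class AddParser {
    private final String[] toks;
    private int pos = 0;

    AddParser(String... toks) { this.toks = toks; }

    int parseAdd() {
        int left = Integer.parseInt(toks[pos++]);   // parseMul() simplified to a literal
        while (pos < toks.length && (toks[pos].equals("+") || toks[pos].equals("-"))) {
            String op = toks[pos++];
            int right = Integer.parseInt(toks[pos++]);
            // c1 = new Expr(op, c1, c2): the new node becomes the left operand
            left = op.equals("+") ? left + right : left - right;
        }
        return left;
    }
}
```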

This solves the associativity problem by making each newly created node the left child of the next matched binary expression. Most operators associate from left to right, but some still associate from right to left, such as assignment expressions:

x = y = z = 1

One solution is to reorganize the nodes of the binary tree after parsing a complete expression. Take the expression x = y = 1 as an example (figure omitted): parsing naturally produces the left-leaning tree, but semantically we want the right-leaning one. So we detach the right child of the left subtree and make it the left child of the current node, then make the current node the right child of the original left subtree. The observed rule is that the left child of an assignment operator cannot itself be an assignment operator, so for a chain of assignment expressions we keep rotating at the current node in this way until the condition no longer holds. The following pseudocode:

while (node.left != null && node.left is assignExpr) {
    oldLeft = node.left
    leftRight = oldLeft.right
    oldLeft.right = node
    node.left = leftRight
    node = oldLeft
}

With these problems handled, the rest of parsing is mostly routine work. Next, semantic analysis.

Semantic analysis

Semantic analysis is an important step in compilation: it analyzes what the program actually means. Since the program ultimately has to perform some kind of computation, good semantic analysis makes object code generation easier. Syntax analysis is context-free, whereas semantic analysis is context-sensitive. It includes type inference, type resolution, type checking, reference resolution, semantic validation, closure analysis, and so on. Here are some of the common parts of this phase.

The type system

The purpose of a type system is to group values that share certain properties, so that the behavior of a program can be defined and predicted precisely; this both keeps programs safe and improves runtime efficiency. So how does a typical type system work? The first thing to determine is which basic types the language provides — generally integers, floating-point numbers, characters, and booleans; the basic type of a literal can be determined directly from its text. Beyond that, mainstream languages support defining new composite types, such as strings, arrays, structs and pointers in C, classes in object-oriented languages, and function types in languages that treat functions as first-class citizens. Defining a new type also means providing a corresponding set of rules that define the basic operations the type supports.

Type inference

Once the type system is established, the semantic analysis process must determine the types of all variables, constants, literals, and expressions. How depends on the specific language. Languages such as C/C++ and Java require variable types to be declared explicitly:

int a = 1

The type of variable a can be determined from the keyword int in the left child of the declaration's syntax tree. Some languages do not require an explicit declaration of variable types:

var a = 1

Here variable a can also be determined to be an integer, based on the integer literal 1 in the rightmost child of the variable declaration's syntax tree. That covers basic types; for composite types there is more to derive. For an array access expression, for example, we need to know how many dimensions the access removes and what type remains afterwards:

int[] a;    // a[0] removes 1 dimension and is of type int
int[][] a1; // a1[0] is of type int[]

For a function call, the type of the expression is determined by looking at the function's return type. For a binary expression, the type of the parent node is derived from the types of its left and right children — a process that usually involves type conversion as well.
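A hedged sketch of that bottom-up derivation: the `Type` values and the widening rule (string over float over int) are assumptions for illustration, not the rules of any specific language:

```java
// Bottom-up type inference on a binary-expression tree.
class TypeInfer {
    enum Type { INT, FLOAT, STRING }

    static class Node {
        Type type;          // known for literal leaves, inferred for operators
        Node left, right;
        Node(Type t) { this.type = t; }                          // literal leaf
        Node(Node l, Node r) { this.left = l; this.right = r; }  // binary op
    }

    // Post-order walk: infer the children first, then derive the parent type.
    static Type infer(Node n) {
        if (n.left == null) return n.type;
        Type l = infer(n.left), r = infer(n.right);
        if (l == Type.STRING || r == Type.STRING) return n.type = Type.STRING;
        if (l == Type.FLOAT || r == Type.FLOAT) return n.type = Type.FLOAT;
        return n.type = Type.INT;
    }
}
```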

Type conversion

The type system also specifies conversion rules between values of different data types. An operation involving a floating-point value yields a floating-point result; an operation combining any type with a string yields a string. For example:

1 + 2.0   // This expression is of type floating point
1 + "abc" // The expression type is string

In general, directly assigning a value of a different type also requires an explicit conversion; inside expressions, implicit type conversions are performed instead — that is, the semantic analysis phase inserts dedicated type-conversion nodes into the syntax tree.

Type checking

A brief word here on the difference between strong and weak typing: the question is whether a variable's type can change after its declaration — a language where it can is weakly typed. Common scripting languages impose no strict type restrictions; variable types can change arbitrarily at run time, which effectively gives up strict safety checks at compile time, as shown in the following code:

var a = 1, b = 2
var c = input()
if (c == '1') {
    a = "sass"
} else if (c == '2') {
    a = 5
} else {
    ...
}
println(a - b)

For the expression a - b, the expression is certainly invalid if a is a string, but the type of a cannot be determined at compile time. So weakly typed languages perform a lot of extra type checking and conversion at run time.

Strongly typed languages, by contrast, perform extensive type checking at compile time, usually including:

  • Whether the types on the left and right sides of an assignment match: for basic types, whether an implicit conversion applies; for class types, whether an inheritance relationship exists; for function types, whether the parameter types and the return type match.
  • For a function call, whether the type of each actual argument matches the formal parameter in the corresponding position.
  • Whether the type of a returned value matches the function definition.

Reference resolution

The purpose of reference resolution is to find the correct definition of every variable and function referenced in the program, and to detect duplicate definitions of a name within the same scope. The whole process has to look up declared variables, functions, classes, scopes, and related information, so an intermediate structure is needed to store it all. This structure is called the symbol table; its design is introduced separately later. Reference resolution of variables basically follows these steps:

  • For newly declared variables, functions, classes, and so on, add them to the symbol table of the current scope.
  • For a reference to a variable, try to find it in the current scope; if that fails, search the parent scope, and so on until the definition is found.
  • A function call follows the same process: first search the current scope for the function definition, then the parent scopes. In a language that supports function types, also check in turn whether some variable of function type bears the name used in the call.

Note that if you want a "use before declaration" effect like the following to work:

void test() {
    println(a)
    T t = T()
    println(t.a)
}
int a = 1
class T {
    int a = 1
}

Here the definitions of variable a and class T appear after their uses, yet processing the function test must still find the correct references. A better approach is multi-pass scanning: first add the declared classes, variables, functions, and so on to the symbol table in a separate pass, and then, during reference resolution, simply look up each definition in the completed table.
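The multi-pass idea can be sketched minimally as follows, with plain strings standing in for real declaration and reference AST nodes (all names here are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Pass 1 records every declared name; pass 2 resolves references against the
// completed table, so references that textually precede their declarations
// still resolve. Real compilers walk the AST instead of string arrays.
class TwoPassResolver {
    static Set<String> unresolved(String[] declaredNames, String[] referencedNames) {
        // Pass 1: collect all declarations first.
        Set<String> table = new HashSet<>();
        for (String name : declaredNames) table.add(name);
        // Pass 2: resolve each reference against the full table.
        Set<String> missing = new HashSet<>();
        for (String ref : referencedNames) {
            if (!table.contains(ref)) missing.add(ref);
        }
        return missing;
    }
}
```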

The symbol table

To the computer, the contents of a source file are all just symbols, so determining what these symbols mean and organizing them into some structure is the compiler's job. The symbol table mainly stores each symbol's name, type, scope, and related information, and it runs through most stages of the compilation process, so its importance is obvious. As an intermediate structure it has no single prescribed implementation; its purpose is simply to make the information represented by symbols easy to retrieve during compilation. Common attributes include the name, visibility, whether the symbol is static, and its scope, so a basic symbol can be designed like this:

class Symbol {
    string name
    int visibility
    boolean isStatic
    Scope enclosingScope
    AST ast
}

In addition, the AST node corresponding to the symbol is stored. Since the AST node is associated with its tokens, the main purpose is that when a semantic error is found, it can be reported with the exact line and column where it occurred. Scope represents a scope. A scope is really a place where symbols are stored, so each scope must have its own symbol table. A scope can therefore be represented like this:

class Scope extends Symbol {
    Symbol[] symbols = Symbol[1024]
    int idx = 0

    void addSymbol(Symbol symbol) {
        symbols[idx] = symbol
        idx += 1
    }

    Symbol findSymbolByName(string name) {
        Symbol result = null
        for (int i = 0; i < idx; i += 1) {
            if (symbols[i].name == name) {
                result = symbols[i]
                break
            }
        }
        if (result == null && enclosingScope != null) {
            result = enclosingScope.findSymbolByName(name)
        }
        return result
    }
    ...
}

Compilers generally design different symbol tables for different scopes, such as a global symbol table and local symbol tables; the real reason is that symbols in different scopes have different lifetimes, which matters for later code generation. This simple chained structure already handles it well: when a scope has no parent, it must be the global scope. Other symbols, such as variables, classes, and functions, can be extended from the base structure:

class Variable extends Symbol {
    Type type
}

class Class extends Scope {
    Class parentClass
}
class Function extends Scope {
    Variable[] formalParams
    Type returnType
}
...

With these structures, the basic information about each symbol can be expressed clearly. The semantic analysis described above produces an annotated abstract syntax tree, which later work builds on. There is no need to be constrained by the concepts: symbol information can be attached directly to the original AST nodes. But since an engineering effort proceeds step by step, it helps to keep earlier steps from depending on later implementations, which keeps the overall logic clear — in practice, a plain hash table mapping each AST node to its type and symbol information works fine.

As mentioned earlier, the parsing phase builds the syntax tree. The beauty of syntax trees is that semantic analysis can be done with simple tree traversals. Type inference for binary expressions, for example, is a post-order traversal of the tree: when the traversal reaches a parent node, its left and right children must already have been processed, so the parent's type can be marked directly from the type information of its children, and so on until the whole syntax tree has been processed.

Now let's look at intermediate code generation.

Intermediate code generation

After semantic analysis, object code could be generated directly, but compilers usually go through intermediate code, both for platform-independent optimization and to simplify object code generation. Keep in mind that even after semantic analysis, the program still contains high-level concepts — control-flow statements, function calls, multidimensional arrays, compound expressions, and so on — and translating them straight into object code takes a lot of work. Those concepts are largely eliminated when intermediate code is generated: IR sits between the source language and the object code, which greatly simplifies generating the final code. Moreover, if different languages are converted into the same intermediate representation, multiple programming languages can share one compiler back end and eventually compile to object code for the same platforms. LLVM and the Ark compiler do exactly this: they define an IR, and different compiler front ends only need to generate IR in the specified format.

The most common form today is a linear IR, three-address code, expressed in quadruple form:

z = x op y

Here x and y are operands, op is the operator, and z is the location where the result is stored. A quadruple need not always be written with four symbols; four is just the maximum. A structure like this can be designed:

class TAC {
    Symbol result
    string op
    Symbol arg1
    Symbol arg2
}

Because intermediate code decomposes compound expressions, it inevitably generates many temporary variables to store intermediate results. For example:

// source program
x = 1 * 2 * 3

// ir
v1 = 1 * 2
v2 = v1 * 3
// x = v2
v0 = v2

Since temporary variables have to be generated anyway, it is convenient to rename all the original variables to new names as well, using a hash table for the mapping. This makes memory allocation easier during subsequent object code generation. Next, we briefly introduce how several high-level grammatical structures are translated into the intermediate representation.
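The temporary generation and renaming described above can be sketched as a small generator. The `TACGen` class and its method names are hypothetical, chosen for illustration:

```python
# Minimal sketch: lower a compound expression to three-address code,
# renaming every source variable to a fresh vN slot via a hash table.
class TACGen:
    def __init__(self):
        self.counter = 0
        self.names = {}   # source name -> renamed slot
        self.code = []

    def fresh(self):
        name = f"v{self.counter}"
        self.counter += 1
        return name

    def slot(self, var):
        # Map each source variable to a stable renamed slot.
        if var not in self.names:
            self.names[var] = self.fresh()
        return self.names[var]

    def expr(self, node):
        # node is either a constant or a tuple (op, left, right)
        if not isinstance(node, tuple):
            return str(node)
        op, left, right = node
        l, r = self.expr(left), self.expr(right)
        tmp = self.fresh()  # temporary for the intermediate result
        self.code.append(f"{tmp} = {l} {op} {r}")
        return tmp

    def assign(self, var, node):
        res = self.expr(node)
        self.code.append(f"{self.slot(var)} = {res}")

g = TACGen()
g.assign("x", ("*", ("*", 1, 2), 3))   # x = 1 * 2 * 3
```

After running, `g.code` holds the three quadruples shown in the example (the exact numbering of the temporaries depends on traversal order).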

Looping statements

Take the while loop

int i = 0
while (i < 100) {
    i += 1
}

The intermediate representation is as follows

v0 = 0
L0:
v1 = v0 < 100
jumpz v1 L1
v2 = v0 + 1
v0 = v2
goto L0
L1:
...

As you can see, expressing the semantics of the source program requires inserting auxiliary labels. Here jumpz means "jump to L1 if v1 is false". Typically, when control-flow statements are involved, label placeholders are inserted and later replaced with actual memory locations during object code generation. One problem to note is that the source program is translated top to bottom: when the jump target lies after the current position, the label to jump to is not yet known, so labels are usually backfilled for loops, branches, and similar statements. See the pseudocode below.

void translateWhileStmt(AST ast) {
    // Generate the start label
    TAC start = genLabel()
    // Translate the conditional expression of the while loop
    Symbol res = translateExpr(ast.condition())
    // Generate the jumpz instruction with the label temporarily empty
    TAC jumpTAC = genJumpZTAC(res, null)
    // Translate the statement block
    translateBlockStmt(ast.body())
    // Generate a goto back to the beginning of the loop
    genGoto(start)
    // Generate the end label and backfill it into the jumpz
    TAC end = genLabel()
    jumpTAC.setArg1(end)
}

Branch statements

Take the if statement as an example

if (flag) {
    i = 1
} else {
    i = 2
}

This translates into the following intermediate representation:

jumpz v0 L0
v1 = 1
goto L1
L0:
v1 = 2
goto L1
L1:
...

If statements backfill labels in the same way, and at the end of each branch block they generate an instruction that jumps directly past the entire branch statement.

Break, continue

The break and continue statements terminate and skip the current loop iteration, respectively. As you can imagine, they must also be translated into goto statements; the key question is how to determine where to jump. The loop and branch statements above are easy to backfill because their start and end points are well defined. But break and continue can occur anywhere in a loop, and with multiple levels of nesting they should only exit or skip the innermost loop, so a stack is used to store labels. For break, a separate stack stores the end label of the most recently entered loop statement, and the top label is popped when the loop block is exited; when a break is encountered, the current top-of-stack label is used directly. Similarly for continue, a separate stack stores the start label of the most recently entered loop, and the top label is fetched when the keyword is encountered.
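The label-stack scheme above can be sketched as follows. The `LoopTranslator` class and its method names are hypothetical:

```python
# Minimal sketch of translating break/continue with two label stacks.
class LoopTranslator:
    def __init__(self):
        self.code = []
        self.break_labels = []     # end label of each enclosing loop
        self.continue_labels = []  # start label of each enclosing loop
        self.counter = 0

    def fresh_label(self):
        self.counter += 1
        return f"L{self.counter - 1}"

    def enter_loop(self):
        # Push the loop's start and end labels on entry.
        start, end = self.fresh_label(), self.fresh_label()
        self.continue_labels.append(start)
        self.break_labels.append(end)
        return start, end

    def exit_loop(self):
        # Pop both stacks when the loop block ends.
        self.continue_labels.pop()
        self.break_labels.pop()

    def translate_break(self):
        # Jump to the innermost loop's end label.
        self.code.append(f"goto {self.break_labels[-1]}")

    def translate_continue(self):
        # Jump back to the innermost loop's start label.
        self.code.append(f"goto {self.continue_labels[-1]}")

t = LoopTranslator()
t.enter_loop()           # outer loop: start L0, end L1
t.enter_loop()           # inner loop: start L2, end L3
t.translate_break()      # break inside the inner loop -> goto L3
t.exit_loop()            # inner loop block ends
t.translate_continue()   # continue now targets the outer loop -> goto L0
```

Because the stacks track only the innermost loop, nested breaks and continues resolve to the right labels automatically.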

For other statements, refer directly to existing designs or invent your own; the ultimate goal is to generate code that matches the program's logic, and performance is not the focus here. It is also important to be aware that there are many special cases to handle during code generation. For example, each referenced variable belongs to a scope: it may be local, a member of a class, or global, and if closures are supported it may be a free variable. Variables in different scopes are stored in different memory locations. This can be handled during object code generation, but it is better to do it in the intermediate code, which generally generates different instructions for variables of different scopes.

The runtime

The goal of a compiler is to generate instructions that run on the target platform, so understanding how programs run is important for generating instructions correctly. When we talk about the runtime, the main things to understand are the program's runtime environment and its execution process.

Runtime environment

The program ultimately runs on the target platform, which could be a real machine or a virtual machine. On a real machine, the program mainly interacts with the CPU and memory. The first thing to understand is the basic composition of computer hardware. Modern computers essentially follow the von Neumann architecture, logically comprising five parts: input devices, output devices, storage, the arithmetic unit, and the control unit. On an actual machine, the CPU performs the computation and control, while memory and disks serve as input, output, or storage devices; because different devices differ greatly in read/write speed, the CPU usually deals directly with memory. The CPU has internal registers and caches, but their capacity is limited, so memory is the main storage location for program instructions and data.

Programs fall into three categories based on the platform they run on.

  • Pure binary programs loaded directly into memory for the machine to execute; the loading is usually done by the BIOS.
  • Ordinary applications running on an operating system, loaded by a program loader (itself a program); the core work of loading is allocating memory and pointing the CPU's program counter at the program's entry address.
  • Programs running on a virtual machine, where the VM performs the same loading process and simulates hardware concepts such as the PC register and stack pointer register.

A program that runs directly on the machine divides and allocates memory by itself. For a program running on an operating system, the OS divides up the runtime memory; operating systems usually use virtual memory, establishing the mapping between virtual and physical memory through paging and page tables. In both cases the actual runtime work is done by the hardware or the OS, and the compiler only needs to generate programs that conform to the platform's conventions. A program running on a virtual machine, however, requires the VM's implementer to handle program loading, memory-region division, dynamic memory allocation, and so on.

For the latter case, you can simulate real hardware exactly, but that is only one option; you are not limited to it. This is why scripting languages generally have rich features: the high customizability and flexibility of their runtime platforms makes advanced language features convenient to implement, and many features can be deferred to runtime. Conversely, languages that compile directly to the target platform's actual assembly instruction set typically require much more work at compile time because of hardware constraints.

Memory areas are typically divided in a similar way across platforms, generally into a code area, stack, static data area, and heap.

  • Code area: holds a program's instructions and operands. This area is usually readable but not writable.
  • Static data area: usually stores constants, global variables, etc.
  • Stack: usually holds short-lived variables, such as local or temporary variables, which the compiler can allocate. In general, new stack space is opened when a function is called and reclaimed when it finishes.
  • Heap: usually holds variables with longer lifetimes and memory allocated dynamically at runtime, such as objects and arrays. Memory allocated on the heap is reclaimed manually or automatically by a garbage collector to prevent leaks.

Execution process

When the program starts running, the contents of the program file are loaded from disk into memory, and the CPU executes instructions starting from the entry address. Normally execution proceeds sequentially from the start address until a jump instruction is encountered, at which point the target position must be located. A direct jump needs nothing else; a function call instruction jumps to the function's start address and, after execution, returns to the instruction following the original call. Function calls also involve passing parameters and retrieving return values, so this information must be stored somewhere, usually on the stack.

Generally, a function call is called an activation, and each activation corresponds to an activation record, which holds the information above. Since this memory is usually managed as a stack, the activation record is also called a stack frame. Note that a stack frame is just a logical concept meant to describe the current execution environment, not a physical thing. The current frame often needs to save a link to the previous stack frame (for programs running directly on a real machine, this is usually the base address register) and any registers that must be preserved across the call; when the function returns, the previous frame, and thus the previous position, can be restored, and in some cases a variable of an enclosing frame can be found by following the call chain.

  • Note: A variable is the runtime representation of a specific memory address. Referencing a variable reads the value at that address; assigning to a variable writes a value to that address.

As with memory areas, if the virtual machine implements its own runtime, stack frames can be modeled exactly on the real machine's concepts, or they can be explicitly represented with new data structures that are stored and linked together.

Runtime concepts are hard to grasp in the abstract, but they become much clearer once you implement them.

The following describes some common features of language design.

object-oriented

Programming in an object-oriented style is closer to the logic of real life, so support for object-oriented features is now a must for mainstream languages. The concepts commonly included are classes, objects, inheritance, and polymorphism. Here's a quick look at what a compiler must do to support these features. First, classes: similar to C structs, they are user-defined types, but classes can include both properties and methods (C can simulate this with structs holding function pointers). This concept mainly plugs into the language's type system, making it convenient to check which fields and methods can be accessed on an object of a given type.

Classes and objects

In object-oriented languages, a class is a template for objects, and an object is an instance of a class. So how is an object represented at runtime? First, each object should have its own separate copy of the class's fields; class methods, on the other hand, are the same for every instance, so only one copy is kept. So the basic information of an object should at least include the class it belongs to and a container holding its fields. You can define it this way:

class Object {
    Type type
    Slot[] slots
}

For a language that implements its own runtime, there is usually a metadata area in memory where information about classes, fields, and methods is stored; the corresponding meta-information can be looked up while the program runs to determine the correct operation. Data values, whether integer, floating point, or Boolean, are all stored as binary numbers; the key is how the runtime interprets the data for different types. An integer is fetched from memory directly, while a floating-point value must be decoded according to the agreed encoding rules, and assignment works in reverse. Alternatively, the data type can be baked into the instructions, with separate instructions operating on each type; this adds many redundant instructions, but removes the need to query meta-information at runtime, improving execution efficiency somewhat.
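The Object layout above can be sketched concretely: each object records its class and keeps one numbered slot per field. The `Type`/`Obj` names and the slot-numbering helper are illustrative assumptions:

```python
# Minimal sketch of slot-based objects: the class numbers its fields once,
# and every instance stores its field values in a slot array.
class Type:
    def __init__(self, name, fields):
        self.name = name
        # Number the fields once, at "compile time".
        self.slot_of = {f: i for i, f in enumerate(fields)}

class Obj:
    def __init__(self, type_):
        self.type = type_
        self.slots = [None] * len(type_.slot_of)

    def get(self, field):
        # Field access becomes an index into the slot array.
        return self.slots[self.type.slot_of[field]]

    def set(self, field, value):
        self.slots[self.type.slot_of[field]] = value

point = Type("Point", ["x", "y"])
p = Obj(point)
p.set("x", 3)
```

In real object code the dictionary lookup disappears: the slot index is resolved at compile time and baked into the instruction.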

For a class's fields, the relative order of the fields does not change once the class is defined, so fields can be numbered at compile time, and the translated object code can access the corresponding memory cell directly by number. The memory cell can be fixed at 4 bytes, with larger values occupying multiple cells; a self-implemented runtime is indeed more flexible here. For C++, which compiles directly into the hardware platform's assembly instructions, structs are handled in a similar way: byte alignment is considered for different data types (or, again, the cell size can simply be fixed), and field access is translated into an offset from the object's base address.

In general, a class's members include ordinary members, which belong to objects, and static members, which belong to the class itself. At runtime, a static member can be treated directly as a global variable, uniquely identified by concatenating the member's signature with the class's signature.

Finally, objects are generally allocated dynamically on the heap, and reclaiming dynamically allocated memory is a problem in itself. Modern compilers therefore often consider allocating an object on the stack when it is only used locally, so that its memory is reclaimed automatically.

Inheritance and polymorphism

Inheritance exists for reuse of data and logic and to enable polymorphism, but inheritance is not necessary for reuse; composition is recommended in many cases because inheritance increases coupling. Here's how the compiler handles inheritance: simply record the parent class in the class's information. During reference resolution, if a member cannot be found in the current class, the parent class is searched; type checking likewise judges legality according to the inheritance relationship.

For class fields, if the current class has a parent, numbering starts after the parent's last field, so the memory layout of a subclass instance is: the object header holds the actual class type of the object, followed by the parent class's fields and then the subclass's own fields.
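This numbering scheme can be sketched as follows; the `ClassInfo` name and its helpers are hypothetical:

```python
# Minimal sketch of field slot numbering with inheritance: a subclass's
# fields are numbered after the parent's last field.
class ClassInfo:
    def __init__(self, name, fields, parent=None):
        self.name, self.parent = name, parent
        base = parent.slot_count if parent else 0
        self.slot_of = {f: base + i for i, f in enumerate(fields)}
        self.slot_count = base + len(fields)

    def lookup(self, field):
        # Search the current class first, then walk up the parent chain.
        cls = self
        while cls is not None:
            if field in cls.slot_of:
                return cls.slot_of[field]
            cls = cls.parent
        raise KeyError(field)

base = ClassInfo("A", ["a", "b"])
sub = ClassInfo("B", ["c"], parent=base)
```

A `B` instance therefore has three slots: `a` at 0 and `b` at 1 (inherited), and `c` at 2, matching the layout described above.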

At runtime, for class methods, a subclass can override a method with the same name in the parent class, and at compile time it may be impossible to determine the object's actual type. For example:

void test(T t) {
    t.f()
}

class T {
    void f()
}

class T1 extends T {
    void f() {
        println("t1...")
    }
}

class T2 extends T {
    void f() {
        println("t2...")
    }
}

When the test() method is compiled, the resolved method f() is the one declared on the parent class T, which has no concrete implementation. Since we cannot know which subclass the actual argument is, the method's location can only be found at runtime by looking up the method signature on the object's actual type. This process is generally known as dynamic binding. Here, classes and methods have dedicated data structures holding the meta-information used to implement it at runtime. That approach is flexible, but how does C++ bind methods dynamically? In fact, a C++ object's header contains a pointer to a table holding the entry addresses of the actual methods of the object's class; this table is called the virtual function table. The table's size is determined at compile time, so when object code is generated, knowing the object's address gives the table's address, and a call to an instance method is translated directly into a call through the function address stored at a fixed offset from the table's start.
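The signature-based runtime lookup described here can be sketched as a walk up the class chain, mirroring the T/T1/T2 example. The `Cls`/`Instance`/`dispatch` names are hypothetical:

```python
# Minimal sketch of dynamic binding: resolve a method by signature on the
# object's actual class, walking up the parent chain if needed.
class Cls:
    def __init__(self, name, methods, parent=None):
        self.name, self.methods, self.parent = name, methods, parent

class Instance:
    def __init__(self, cls):
        self.cls = cls  # object header: the actual class of this object

def dispatch(obj, signature):
    """Search the object's actual class first, then its ancestors."""
    cls = obj.cls
    while cls is not None:
        if signature in cls.methods:
            return cls.methods[signature](obj)
        cls = cls.parent
    raise AttributeError(signature)

T = Cls("T", {})
T1 = Cls("T1", {"f()": lambda self: "t1..."}, parent=T)
T2 = Cls("T2", {"f()": lambda self: "t2..."}, parent=T)
```

With this, `dispatch(Instance(T1), "f()")` finds T1's override even though the call site only knew the static type T.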

this and super

The this keyword accesses fields or methods of the current instance's class, and the super keyword accesses fields or methods of the parent class. As mentioned earlier, at runtime there is no separate superclass object: all superclass fields are laid out inside the subclass instance, so both this and super actually denote the current object instance; super is just syntactic sugar that lets a subclass access superclass members. How does a class method get the current object instance? Remember, methods are shared by all instances of a class, so while a method's logic is the same for every call, the object this refers to differs. The natural solution is to pass the instance object as an argument to the method it is called on, usually as the first argument. Static methods, which belong to the class, can be translated just like ordinary functions.
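The "pass the instance as the first argument" lowering can be sketched as follows. The `counter_increment` function and the dictionary-as-object shape are hypothetical illustrations:

```python
# Minimal sketch: an instance method lowered to a plain function whose
# first parameter is the receiver ("this").
def counter_increment(this, step):
    # Lowered body of a hypothetical `Counter.increment(step)` method:
    # `this.value += step` becomes explicit reads/writes on the receiver.
    this["value"] = this["value"] + step
    return this["value"]

# An "object" reduced to just its field container for this sketch.
counter = {"value": 10}

# The call site `counter.increment(5)` is translated to:
result = counter_increment(counter, 5)
```

The method body is compiled once; only the receiver argument changes from call to call, which is exactly why methods can be shared by all instances.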

Initialization of fields

Another question to consider is when class fields are initialized. Consider:

class A {
    int a = 1
}

Class field initializers look just like normal variable declarations, assigning a value at the point of declaration when run. At compile time, the field only needs to be resolved and numbered in preparation for instance creation at runtime. But the field's initial value can be a complex expression that produces many intermediate temporaries, and while ordinary fields must be assigned on every instance, static fields are assigned only once. One approach is to create a hidden initializer method declaration node while building the AST during syntax analysis. The method's body assigns the initial values to the members, so any space occupied by temporaries is reclaimed when the method returns; the class simply executes this hidden method first whenever it creates an instance. The compile-time transformation effectively produces code like this:

class A {
    int a

    A() {
        _init_()
        ...
    }

    _init_() {
        this.a = 1
    }
}

The _init_ method is called in the class's constructor. Static fields can be handled similarly: assign them on first reference, with the class using a flag bit to determine whether it has already been initialized.
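The flag-guarded one-time static initialization can be sketched like this; the `StaticStorage` class and its members are hypothetical:

```python
# Minimal sketch of lazy static initialization guarded by a flag bit:
# the hidden static initializer runs at most once, on first reference.
class StaticStorage:
    def __init__(self):
        self.initialized = False   # the flag bit
        self.values = {}
        self.init_runs = 0         # only here to observe the behavior

    def _static_init_(self):
        # Hidden initializer: runs the static field initializers once.
        self.values["counter"] = 100
        self.init_runs += 1

    def get_static(self, name):
        # Every static reference first checks the flag.
        if not self.initialized:
            self.initialized = True
            self._static_init_()
        return self.values[name]

s = StaticStorage()
```

No matter how many times `s.get_static("counter")` is called, the initializer body executes exactly once.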

Functional programming

Like object orientation, functional programming is a programming paradigm, and many languages such as Python and JavaScript actually support multi-paradigm programming. Which paradigm is better is a matter of opinion.

The programs we usually write are imperative: imperative programming is mainly concerned with the steps to solve a problem, while functional programming is concerned with mappings over data. Functions in functional programming are the same as functions in mathematics. Programs also have "functions" because early computer pioneers came from mathematics and borrowed the concept, but a function in a program is more accurately called a subroutine, as we'll see below.

For the function f(x)=ax+b, suppose x is the input and the result of ax+b is the output. A mathematical function has two characteristics: first, the same input always produces the same output; second, the input value does not change during the computation, and the input x can be either an ordinary value or a function. These characteristics can be carried over to programming languages. Functions in programming languages can reference external data, such as globals or class member variables, so if a previous computation changed that external environment, the function's output may change too. Based on mathematical functions, then, functional programming in programming languages mainly emphasizes the following points:

  • 1. Pure functions: a function does not reference external data and does not change the external environment, so it has no side effects.
  • 2. Immutability: input data does not change during the function's computation.
  • 3. Functions are first-class citizens: like ordinary values, they can be passed as arguments, assigned, and returned from functions.

The beauty of functional programming is that, because functions neither reference nor change the external environment, their results are practically predictable, and because there are no shared variables, there are no concurrency problems. Here's a simple example:

var list = [1, 2, 3, 4]
// function f(x) { return x * x }
var f = x => x * x
list.map(f)
// Output: [1, 4, 9, 16]

Note that variable f is assigned a lambda expression, which is essentially syntactic sugar for an ordinary function definition. Compare this example against the points above: it satisfies several of the requirements of functional programming.

The first two points are matters of programming discipline; the third is what the compiler must implement. Let's look at what the compiler has to do for a language to support this feature. First, functions should be added to the language's type system as a new type. A function type consists of the parameter types and the return type, plus the declared function itself. Variables of function type can be defined and assigned this way:

void test(int a, float b) {
    println("Function test: a=" + a + ", b=" + b)
}
function void(int, float) a = test
// `test()` calls the function; `test` alone references the function itself

Here `function void(int, float)` is the type declaration for a function taking an int and a float and returning void. For each newly declared function, a function type is determined; here the function test with the matching signature is assigned to variable a. Then, during semantic analysis, references are resolved by function name, and type checking is performed on assignments to function-typed variables and on function parameters and return values, i.e., checking compatibility between the parameter and return types.

Once test has been assigned to variable a, it can be called directly:

a(1, 1.0)
// prints: Function test: a=1, b=1.0

As mentioned earlier, a function is a subroutine, which once translated is just a collection of instructions. So how does assigning a function to a variable look at runtime? When a function is assigned, a data-less object is generated for it directly, with a reference to the original function stored in the object header; at runtime the function's actual location is found through that header. Function variables can also be assigned to each other:

function void(int, float) a1 = a
a1(1, 1.0)
// prints: Function test: a=1, b=1.0

Assigning one function variable to another just passes the reference to the function object; a new function object is created only when a function itself is assigned directly.

closure

Closures usually come along with functions as first-class citizens. In contrast to functional programming's pure-function ideal, closures solve the problem of an inner environment referencing variables of an outer environment. A closure is really a function plus the environment holding the external variables it references. Look at a simple example:

function int() outer() {
    int i = 0
    int inner() {
        int a = i + 1
        return a
    }
    return inner
}
function int() a = outer()

The inner function above references the local variable i of the outer function, and outer ultimately returns inner. Local variables of a function are usually allocated on the stack, have short lifetimes, and are reclaimed when the function finishes executing. External variables referenced inside a function are called free variables. In the example, the free variable i would normally have been reclaimed once outer finished executing, but inner was assigned to the function variable a as outer's return value, so it can still be called through a. So when the function that a refers to executes, where does the free variable i come from?

Closures solve this problem: when a function that references free variables is returned or assigned to a function variable, the current execution environment must be packaged up, that is, the current stack values of the free variables are bundled together and handed to the receiver, so the receiver can still access them when it executes the function. In effect, the free variables that lived in stack memory are copied into longer-lived heap memory.

What the compiler must do is determine at compile time which variables are free: which variables referenced within the function's scope are neither in the current scope nor global nor class members. For each free variable, its correct position on the stack must be found at runtime before it can be copied and packaged. The first step is determining when the closure is created. If you think about it, the instructions that create the closure actually sit in the instruction stream of the closure function's parent scope. This means the closure is created while executing the parent scope's instructions, so at that moment the free variable is simply a local variable of the currently executing function. Knowing the local variable's position, the final memory address is obtained as an offset from the stack base address in the current stack frame.

Therefore, closure analysis must record, for each free variable, its local-variable position in the enclosing scope. Also, the example above references the immediately enclosing layer, but an inner function can reference a local variable of any enclosing layer; it can happen that the immediate outer layer does not reference a variable while a deeper inner function does. The solution is to add a copy of the referenced free variable at every intermediate layer. Since the closure-creating instructions belong to the immediate outer layer, that layer's closure environment must already exist when the closure is created, so the free variable can always be taken from the immediate outer layer. Finally, everything is packaged into an object with the function in its header and the free variables as its contents.
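The packaging step can be sketched as follows, mirroring the outer/inner example. The `Closure` class, the environment dictionary, and the lowered function names are illustrative assumptions:

```python
# Minimal sketch of closure conversion: a closure is an object pairing the
# function ("header") with copies of its free variables ("contents").
class Closure:
    def __init__(self, fn, captured):
        self.fn = fn              # reference to the function
        self.captured = captured  # copied free-variable values

    def call(self, *args):
        # The lowered function receives its environment as an extra argument.
        return self.fn(self.captured, *args)

def outer():
    i = 41  # local variable, conceptually stack-allocated

    # Lowered body of `inner`: the free variable i is read from the
    # captured environment instead of the (soon-to-be-gone) stack slot.
    def inner_lowered(env):
        return env["i"] + 1

    # Returning inner triggers closure creation: copy i off the stack.
    return Closure(inner_lowered, {"i": i})

a = outer()  # outer's frame is gone, but the closure keeps a copy of i
```

Calling `a.call()` still succeeds after outer has returned, because the value of i now lives in the heap-allocated closure object.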

For a self-implemented runtime this can be more flexible: free-variable lookup can even be deferred to runtime. Go, which compiles directly into the target platform's assembly instructions, also supports closures. At compile time it packages the closure function together with the free variables it references into a struct, allocates memory for the struct on the heap, and copies the values out of the currently running stack memory into the newly allocated memory (locating them as offsets of the local variables from the stack base register EBP). At a closure call site, the struct pointer is obtained directly as address + offset; while the function that references the free variables executes, the struct pointer is kept in a register, and the referenced variables are accessed directly through the address stored there.

Object code generation

Object code is the set of instructions supported by the platform on which the language will ultimately run, so generating it properly requires understanding the target platform's architecture. Computer architecture is a field of study in its own right, but not much of it is required to generate usable object code.

For object code there are two broad options. One is to generate actual target-platform assembly instructions, such as AT&T-style x86 assembly on Linux or Intel-style x86 assembly on Windows. Both styles of assembly are ultimately turned into the same set of machine instructions by an assembler and linker, but the program's execution depends on runtime libraries: the common facilities programs use today, such as networking and disk I/O, are actually provided by runtime libraries on top of the operating system. The resulting executable file formats also differ, and even if Linux parsed and loaded the Windows format (which is what Wine does), the program would not run unless the runtime library functionality it relies on were mapped onto the current platform. The other option is to run on a virtual machine, usually either a register-based or a stack-based VM. Typical register VMs include the Lua virtual machine and Android's Dalvik, while stack VMs include the Java virtual machine and the Python virtual machine. Here we take instruction generation (such instructions are usually called bytecode) for a register VM as the example.

The first problem is designing a set of bytecode that correctly expresses all the operations of the source code (or intermediate code). Just as generating intermediate code from a high-level language requires thinking ahead to object code generation, generating object code from intermediate code requires considering how data is organized at runtime and how the hardware cooperates, so that the necessary instructions can be designed. For object code, the same function can be achieved in multiple ways, so the question becomes: how do you implement a given function with as few instructions or memory accesses as possible? Instruction selection is thus part of optimization, but at the beginning of a self-built system you don't need to worry about it; just get the system running first. The object code can therefore be designed in the spirit of Reduced Instruction Set Computing (RISC), providing basic instructions from which complex functions are composed.

Design a set of instructions

In a register-based VM, memory data is first loaded into a register, computation happens on registers, and the final result is stored back from a register. Of course, the registers here are simulated in memory, and include special-purpose registers and some general-purpose registers. By function, instructions can generally be classified into these categories:

  • Arithmetic operations: addition, subtraction, multiplication, division, bit operations, comparison operations, etc.
  • Memory operations: loading a value from memory into a register and storing a register's value into memory.
  • Register copy operations: copying a value from one register to another, including conversion instructions between data types.
  • Control-flow operations: conditional and unconditional jump instructions.

These are just the basic classes of instructions. To support advanced features such as object orientation, there are usually also object-related instructions, such as creating instance objects, calling instance methods, and getting fields; and array-related ones, such as creating a new array, getting the element at a given index, assigning an element, and getting an array's length. Of course, you don't strictly need such instructions: in some languages that implement operator overloading, code is translated like this:

1 + 2  =>  1.add(2)

That is, the compiler converts all basic operations into function calls, which by default go directly to built-in functions.
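
A minimal sketch of that translation, with made-up AST node names (this is an illustration, not any real compiler's representation): the compiler rewrites a binary `+` node into a method call on the left operand.

```python
# Rewrite a '+' node as a call to the built-in add method of its left
# operand; literals are emitted as-is.
class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

def desugar(node):
    if isinstance(node, Add):
        return "%s.add(%s)" % (desugar(node.left), desugar(node.right))
    return str(node)             # a literal

assert desugar(Add(1, 2)) == "1.add(2)"
```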

If the runtime should rely as little as possible on meta information such as classes and fields, different instructions can be designed for the same operation on different types, as follows:

iadd R1, R2, R3  // add the integers in registers R1 and R2, result into R3
fadd R1, R2, R3  // add the floats in registers R1 and R2, result into R3

When generating object code, the compiler then emits different instructions directly based on type information. There are several other considerations in instruction design. First, how much space does each instruction take up? Each instruction consists of an opcode and operands. The opcode expresses what the instruction does, e.g. addition. Stored in 1 byte, it can distinguish 256 different instructions, which is more than enough here and leaves room for extension. Next the operands, which usually include registers, immediates, offsets, etc. To know how many bytes the operands occupy, you usually look at the addressing mode, i.e. how the instruction accesses and stores its data; all instructions can generally be classified into a handful of addressing modes. For the concrete operands: a register is stored in 1 byte, immediates are limited to small numbers for now and also take 1 byte, and offsets take 2 bytes.
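
With the sizes just described (opcode 1 byte, register 1 byte, immediate 1 byte, offset 2 bytes), the length of an instruction follows directly from its addressing mode. The mode names below are hypothetical:

```python
# Operand byte sizes per addressing mode (mode names are made up; the
# sizes follow the assumptions stated in the text).
OPERAND_SIZES = {
    "REG_REG_REG": [1, 1, 1],    # e.g. iadd r1, r2, r3
    "REG_IMM":     [1, 1],       # register plus small immediate
    "REG_OFFSET":  [1, 2],       # register plus 16-bit offset
    "OFFSET":      [2],          # e.g. an unconditional jump target
}

def instruction_length(mode):
    """One opcode byte plus the operand bytes of the addressing mode."""
    return 1 + sum(OPERAND_SIZES[mode])

assert instruction_length("REG_REG_REG") == 4
assert instruction_length("OFFSET") == 3
```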

The first two need little explanation, so let's look at offsets, which can mean different things in different contexts. Borrowing from real computers, a CPU generally has some special-purpose registers: a BP register holding the base address of the current stack frame, an SP register holding the current top of the stack, and a PC register holding the address of the next instruction to run. Access to a local variable is usually translated into current frame base address + offset; for global variables or constants that do not live on the stack, access is likewise some start address + offset. Small numbers can be encoded directly into the instruction as immediates, while large numbers are stored in the constant pool; the instruction stores the constant-pool offset, and the value is loaded from the pool into a register at runtime. A further advantage of this is that every constant referenced in the code is stored in the constant pool exactly once, which reduces the size of the final program file. An offset can also serve as the target address of a jump within the code segment.
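
The constant-pool idea can be sketched as follows; the class and method names are made up for illustration. The key property is deduplication: interning the same constant twice returns the same offset.

```python
# Large constants are stored once in the pool, and instructions refer to
# them by their pool offset.
class ConstantPool:
    def __init__(self):
        self.constants = []
        self.index = {}              # value -> offset, for deduplication

    def add(self, value):
        """Intern a constant and return its offset in the pool."""
        if value in self.index:      # each constant is stored only once
            return self.index[value]
        offset = len(self.constants)
        self.constants.append(value)
        self.index[value] = offset
        return offset

pool = ConstantPool()
assert pool.add(100000) == 0
assert pool.add(100000) == 0         # duplicate: same offset, pool unchanged
assert pool.add(3.14) == 1
```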

The number of bytes per instruction can therefore be determined from the opcode and the addressing mode. Machines usually execute code in a fetch-decode-execute cycle; when decoding object code, the opcode and its addressing mode tell you exactly how many operand bytes to read and what each byte means. So an opcode can be designed like this:

class Opcode {
    // operation code
    byte opcode
    // addressing mode
    int addressingMode

    // table mapping each opcode byte to its Opcode object
    static Map<Byte, Opcode> opcodeMap

    static Opcode get(byte opcode) {
        return opcodeMap.get(opcode)
    }
}

This is just an in-memory structure; when the program is compiled and actually written out to an external file, only the 1-byte opcode is emitted. An instruction can be represented like this:

class Instruction {
    // opcode object
    Opcode opcode
    // operands
    Operand[] operands

    // decode one instruction from the byte stream
    static Instruction decode(ByteReader reader) {
        byte code = reader.readByte()
        Opcode opcode = Opcode.get(code)
        Operand[] operands
        switch (opcode.addressingMode) {
            case MODE1:
                operands = new Operand[2]
                operands[0] = new Operand(reader.readByte())
                operands[1] = new Operand(reader.readShort())
                break
            case MODE2:
            ...
        }
        return new Instruction(opcode, operands)
    }

    // encode this instruction into bytes
    byte[] encode() {
        int len = getInsLength()
        byte[] bytes = new byte[len]
        bytes[0] = opcode.opcode
        switch (opcode.addressingMode) {
            case MODE1:
                bytes[1] = operands[0].getVal()
                int val = operands[1].getVal()
                bytes[2] = (val >> 8) & 0xff
                bytes[3] = val & 0xff
                break
            case MODE2:
                ...
        }
        return bytes
    }
}

The above gives pseudo-code for the instruction codec. When decoding, the opcode object is looked up from the 1-byte opcode that was read, and the operands are then fetched according to its addressing mode. When writing the output file, all instruction objects are encoded to byte arrays and concatenated. Note that encoding and decoding must use the same byte order. Designing it this way keeps both code generation and decoding in the runtime virtual machine simple.
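
The same scheme can be sketched as runnable Python. The single addressing mode here (a 1-byte register plus a 2-byte offset) and the opcode value 0x10 are assumptions for illustration; `">BBH"` packs big-endian, matching the `(val >> 8) & 0xff` split in the pseudo-code above.

```python
import struct

MODE_REG_OFF = "reg_off16"
OPCODES = {0x10: MODE_REG_OFF}       # opcode byte -> addressing mode

def encode(opcode, reg, offset):
    # 1 opcode byte, 1 register byte, 2 offset bytes, big-endian
    return struct.pack(">BBH", opcode, reg, offset)

def decode(data):
    opcode = data[0]
    if OPCODES[opcode] == MODE_REG_OFF:
        _, reg, offset = struct.unpack(">BBH", data[:4])
        return opcode, reg, offset

assert decode(encode(0x10, 3, 0x1234)) == (0x10, 3, 0x1234)  # round trip
```

The round-trip assertion is the useful invariant to test in a real codec: encode followed by decode must reproduce every field.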

Calling convention

The calling convention describes how arguments are passed when a function call occurs, how the call returns, and whether the stack space allocated for the call is cleaned up by the caller or the callee. These issues must be settled when generating object code.

For parameter passing, modern compilers usually prefer registers and fall back to the stack when there are many parameters. Register allocation is not considered here, so the stack is used uniformly. Parameters are usually pushed from right to left. On a real machine the stack grows from high addresses to low addresses, so when the arguments are pushed the caller is still in its own stack frame (higher addresses), and by the time the function body runs, execution is in a new stack frame (lower addresses); instructions inside the function access the parameters relative to the current frame's base address + offset.

In addition to parameter passing, there are several problems when a function call occurs.

  • 1. How much space should be allocated to local variables of a function?

This can be determined at compile time; at the start of the function call the stack-top pointer is simply moved down by that amount.

  • 2. How does the function return and how does the caller receive the return value?

When the call happens, the address of the next instruction is saved into the current stack frame; when the function executes the return instruction, that saved address is loaded into the PC register. The return value is stored in a fixed register.

  • 3. Who reclaims the stack space used by the parameters after the call returns?

When generating object code, an instruction that reclaims the space occupied by the parameters can be inserted directly after the function call instruction.
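
Putting the caller's side of the convention together, code generation for a call can be sketched like this; the tuple-based instruction form is an illustration, not a real encoder:

```python
# Caller-side code generation: push the arguments right to left, call,
# then reclaim the argument slots (caller cleans up).
def emit_call(func_label, args):
    code = [("push", a) for a in reversed(args)]
    code.append(("call", func_label))
    code.append(("inc", "sp", len(args)))    # reclaim one slot per argument
    return code

assert emit_call("add", [1, 2]) == [
    ("push", 2), ("push", 1), ("call", "add"), ("inc", "sp", 2)]
```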

A simple example of the whole process is as follows

int add(int a, int b) {
   int c = 3, d = 4
   return a + b 
}
add(1, 2)

The compiler will eventually convert to code in the following form

# function add:
// function prologue
push bp
move sp,bp
dec sp,2          // reserve space for the two local variables

// assign the current function's local variables
store 3,bp,-1
store 4,bp,-2
// load the parameters from the run stack into registers
load bp,3,r1
load bp,4,r2
iadd r1,r2,r3

// function epilogue
move bp,sp
pop bp
move r3,rv
ret

// the call site
push 2
push 1
call add
// reclaim the argument stack space
inc sp,2

bp and sp are the stack-base and stack-top pointer registers respectively; r1, r2 and r3 are general-purpose registers, and rv is a register dedicated to holding the function's return value. The push instruction pushes a value onto the stack top and moves the stack-top pointer down, while pop pops the top value into a register and moves the pointer up. The call instruction calls the specified function (subroutine), saving the address of the next instruction into the current stack memory; the ret instruction pops that saved address into the PC register. A function begins by pushing the current frame's base address and making the stack top the base of the new frame; when the call ends, the stack top is moved back to the current frame base and the previous frame's base address is popped from memory. The memory layout is shown in the following figure. There, the first argument is at the current bp+3 and the second at bp+4; these two argument slots are the stack space that must be reclaimed after execution.
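
To check that the convention hangs together, here is a toy interpreter for these instructions, enough to run the add(1, 2) example. Everything here is a simplifying assumption: a 32-slot downward-growing stack, `sp` pointing at the top element, instructions as tuples. Note that with `bp` pointing at the saved base pointer, the arguments in this particular sketch land at bp+2 and bp+3; the exact offsets depend on what bp points at.

```python
def run(program, entry):
    mem = [0] * 32                       # the simulated run-time stack
    reg = {"bp": 32, "sp": 32, "pc": entry, "rv": 0,
           "r1": 0, "r2": 0, "r3": 0}

    def push(v):
        reg["sp"] -= 1
        mem[reg["sp"]] = v

    while True:
        op, *args = program[reg["pc"]]
        reg["pc"] += 1                   # pc now holds the next address
        if op == "push":                 # push a register's value or a literal
            push(reg[args[0]] if args[0] in reg else args[0])
        elif op == "pop":
            reg[args[0]] = mem[reg["sp"]]; reg["sp"] += 1
        elif op == "move":               # move src,dst
            reg[args[1]] = reg[args[0]]
        elif op == "dec":                # reserve local-variable space
            reg["sp"] -= args[0]
        elif op == "inc":                # caller reclaims argument space
            reg["sp"] += args[0]
        elif op == "store":              # store literal at base reg + offset
            mem[reg[args[1]] + args[2]] = args[0]
        elif op == "load":               # load base reg + offset into dst reg
            reg[args[2]] = mem[reg[args[0]] + args[1]]
        elif op == "iadd":
            reg[args[2]] = reg[args[0]] + reg[args[1]]
        elif op == "call":               # save return address, then jump
            push(reg["pc"]); reg["pc"] = args[0]
        elif op == "ret":
            reg["pc"] = mem[reg["sp"]]; reg["sp"] += 1
        elif op == "halt":
            return reg["rv"]

ADD = 5                                  # index of the function body below
program = [
    ("push", 2), ("push", 1),            # arguments, right to left
    ("call", ADD),
    ("inc", 2),                          # reclaim the two argument slots
    ("halt",),
    ("push", "bp"),                      # add: prologue
    ("move", "sp", "bp"),
    ("dec", 2),                          # space for locals c and d
    ("store", 3, "bp", -1),
    ("store", 4, "bp", -2),
    ("load", "bp", 2, "r1"),             # first argument
    ("load", "bp", 3, "r2"),             # second argument
    ("iadd", "r1", "r2", "r3"),
    ("move", "bp", "sp"),                # epilogue
    ("pop", "bp"),
    ("move", "r3", "rv"),
    ("ret",),
]

assert run(program, 0) == 3              # 1 + 2
```

Tracing this by hand is a good exercise: after the prologue, the saved bp, return address and both arguments all sit above bp, and the epilogue restores sp and bp exactly.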

Note that the logic at the start and end of a function is fixed, so when generating object code a prologue and an epilogue must be inserted for every function, namely the following two parts:

// prologue
push bp
move sp,bp        // bp = sp
dec sp,N          // N = the space taken by the function's local variables

// epilogue
move bp,sp        // sp = bp
pop bp

You don't actually have to care about the size of the function's local variables here, because when the function ends the stack top is simply reset to the current bp. Still, it is better to clear that space explicitly; otherwise the next function call may reuse it and read an unexpected value through an uninitialized variable.

At the end

This blog has covered the compiler's front end, back end, runtime, and the implementation of some language features, though compilation involves a lot of miscellaneous knowledge and the back-end part in particular is not very well organized. The front end follows a relatively fixed routine, while back ends and runtimes differ greatly between implementations. Careful readers will notice that the whole article is full of words like "can", "generally" and "usually": once you leave textbook theory, engineering implementations are very flexible, so don't cling rigidly to theory; practice more. That's all for this blog.