Home-made programming languages and virtual machines are a seemingly esoteric subject, yet a popular topic on the Internet today: many technologists find them tantalizing but struggle to grasp their essence.

“Homemade Programming Language” explains a wealth of fundamentals step by step, from the shallow to the deep, covering the basics of common compilation principles. More importantly, the author brings his own understanding and perspective to the material, and I believe readers will benefit from this book.


This article touches on some basics of compilation principles, which I am afraid will be difficult for readers who have not studied them, so I introduce the necessary background in passing. Of course, not knowing compilation theory does not stop you from successfully writing a scripting language.

Because theory must be rigorous and abstract, it often describes simple things in complex terms. In practice, you will find that implementing a compiler is much easier than understanding the theory behind it, and that obscure concepts become simple once you learn compilation by doing. After all, as the proverb goes, what is learned on paper always feels shallow; to truly understand something, you must practice it yourself. Today we will look at some of the questions that tend to confuse people about homemade programming languages.

Similarities and differences between compiled and scripted programs

The most obvious difference between the two is whose “dish” they are, that is, who consumes them as input. What they have in common is that the final generated instructions contain both opcodes and operands.

The instructions generated for a compiled program are machine code: opcodes and operands in binary form, known as a binary stream. Here “binary,” in contrast to a text file, means the data is not stored as text strings (such as ASCII or Unicode).

Binary streams can be classified as unformatted or formatted. An unformatted stream is a raw binary stream whose entry point is simply the beginning of the file; a formatted stream is organized according to a protocol (that is, a format), such as ELF executable files under Linux. Either way, the stream is the direct input of the hardware CPU: the CPU “sees” the instructions of the compiled program and executes them itself. The machine code is the CPU’s dish.

A compiled program is itself a process at runtime. It is invoked directly by the operating system: the OS loads it into memory, points the CS:IP registers (on an IA32 CPU) at its entry point, and lets it run directly on the CPU. This is what it means for the CPU to “see” it. In any case, the scheduler can see this process in the ready queue.

Scripting languages, also known as interpreted languages, include JavaScript, Python, Perl, PHP, shell scripts, and so on. A script is itself a text file that serves as input to an application, the script interpreter. Because it is just text, the code in a script looks like strings to the interpreter.

That is, the code in a script is never actually executed by the CPU; the CPU’s CS:IP registers never point to it. Only the script interpreter is visible to the CPU, which never knows the script’s code exists, yet the script is still “run” indirectly by the hardware CPU.

It is like parents giving their child living expenses, and the child using part of the money to raise a dog. The parents care only about the child’s growth and never know the dog exists, yet the dog grows up indirectly on their money. Script code appears to execute according to the developer’s logic, but in essence the script interpreter continuously analyzes the script and acts dynamically based on its keywords and syntax.

There are two main types of interpreters: those that execute while interpreting, and those that analyze the entire file before executing. With the first type, if there is a syntax error in the script, the correct part before the error still executes normally until the error is encountered. With the second type, the point of parsing the entire file first is to build an abstract syntax tree, or an equivalent traversal, to generate instructions, which are then run to drive the program’s execution, as in compiled programs.

The instructions generated from a script are textual opcodes and operands, that is, the data exists as text strings. The opcode is called the OpCode. OpCodes are usually custom-defined, so the corresponding operands must comply with each OpCode’s rules. To improve efficiency, a single OpCode is often equivalent in function to a combination of hundreds or even thousands of machine instructions.

If a virtual machine is not built for efficiency, it is probably a cross-platform emulator. The opcodes such a virtual machine processes are the machine code of another architecture. For example, when a MIPS program is emulated on x86, the opcodes received by the virtual machine running on x86 are MIPS machine code.

With the exception of cross-platform emulation, the general purpose of a virtual machine is to improve execution efficiency, so OpCodes are rarely defined as actual machine code; otherwise it would be faster to simply generate machine instructions and hand them to the hardware CPU. These customized instructions are the input of the virtual machine, that is, the virtual machine’s dish.

Virtual machines fall into two main categories. One is the emulated CPU, which uses software to simulate the behavior of a hardware CPU; this kind is usually built for language interpreters, such as the Python virtual machine. The other virtualizes a complete set of computer hardware, simulating registers with arrays, hard disks with files, and so on. This kind of virtual machine, such as VMware, is typically used to run operating systems, because only an operating system operates hardware directly.

Scripts are streams of text characters (that is, strings) stored on disk as text files. The interpreter reads such a file into memory, then parses and executes it line by line.

During execution, the interpreter may first translate the script into opcodes and then hand those to a virtual machine to execute one by one. At this point the virtual machine plays the role of the CPU, and the opcodes are the virtual machine’s input.
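As a sketch of what such a virtual machine looks like, here is a toy stack machine in Python. The opcodes are invented for illustration and correspond to no real interpreter.

```python
# A minimal sketch of a stack-based virtual machine.  The opcode names
# and numbering are invented for illustration.
PUSH, ADD, PRINT = 0, 1, 2   # numeric opcodes, cheap for the host CPU to compare

def run(code):
    stack = []
    pc = 0                        # "program counter" into the opcode list
    while pc < len(code):
        op = code[pc]
        if op == PUSH:            # PUSH carries one operand: the value
            pc += 1
            stack.append(code[pc])
        elif op == ADD:           # pop two operands, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == PRINT:
            print(stack.pop())
        pc += 1

# opcode stream for "print(2 + 3)"
run([PUSH, 2, PUSH, 3, ADD, PRINT])   # prints 5
```

Here the virtual machine is the “CPU” and the list of numbers is its input, exactly the relationship described above.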

Of course, a script can also be parsed directly without going through a virtual machine. Since source code is parsed in the order of the program’s logical execution, that is, the order in which the syntax tree is generated, it can be executed during parsing: on parsing 2+3, for example, the interpreter can directly output 5.
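A toy sketch of this parse-and-execute-at-once approach, with the grammar limited to integers, + and - (an invented example, not taken from any real interpreter):

```python
# A recursive-descent parser that evaluates while it parses:
# no syntax tree, no virtual machine, no generated instructions.
def evaluate(src):
    pos = 0
    def number():
        nonlocal pos
        start = pos
        while pos < len(src) and src[pos].isdigit():
            pos += 1
        return int(src[start:pos])
    result = number()
    while pos < len(src):
        op = src[pos]
        pos += 1
        if op == '+':
            result += number()   # act the moment the operator is parsed
        elif op == '-':
            result -= number()
    return result

print(evaluate("2+3"))      # prints 5
print(evaluate("1+10-2"))   # prints 9
```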

However, this convenience is limited, and complex features are not easy to implement this way, because computation requires additional data structures. A function call, for example, always needs a runtime stack to hold parameters and local variables, plus the stack overhead while the function runs. Therefore, for complex functionality, in most cases a dedicated virtual machine is still written to do the job.

By the way, let us guess how an interpreted language works. When we execute a PHP script, we simply start an interpreter written in C. The interpreter is a process no different from an ordinary process, except that its input is the PHP script. To the PHP interpreter, this script is just a rather long string, not instruction code at all.

It is just that the interpreter understands the syntax and follows the syntax rules. As an example, suppose the following is PHP code in a file named a.php.
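The original code listing did not survive; here is a minimal hypothetical stand-in for a.php that matches the description (it only needs to contain an echo statement):

```php
<?php
// hypothetical contents of a.php: a single echo statement
echo "hello world";
```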

When the PHP interpreter analyzes the text file a.php and finds the echo keyword, it calls an output function provided by C, something like printf(echo’s parameter). PHP interpreters are to PHP scripts what browsers are to JavaScript.

But this is entirely my guess; I do not know how the PHP interpreter actually works inside. The above is only meant to clarify the idea, so please take it critically.

At this point you might wonder: why doesn’t the CPU simply support strings as instructions, so that it could execute scripting languages directly? We will discuss this later.

Scripting language classification

Scripting languages can be broadly divided into the following four categories.

(1) Command-based language system

In this language system, each line of code is a command plus its parameters, which is the form of early assembly language. A program written in such a language is a series of steps for solving a problem, and running the program is the problem-solving process, just like cooking with the steps written down in advance (in your head or in a recipe). For example, consider the following cooking script.
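The original cooking-script listing was lost; a plausible reconstruction in the spirit described, with one invented command per line followed by its parameters, might look like this:

```
wash    vegetables
cut     vegetables
add     oil
stir    vegetables
stir    vegetables
add     salt
```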


In the preceding steps, the first column is the command, followed by the command’s parameters. Because there are no loops in such command languages, to do the same thing several times in a row you must repeat the command, filling in one stir after another. An interpreter parses the file line by line and executes the handler for each command. Here is an example interpreter.
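The book’s original listing did not survive; a minimal interpreter in this spirit can be sketched in Python (the command names and handlers are invented to match the cooking example above):

```python
# A command-based interpreter: each script line is a command followed by
# its parameters; the interpreter looks up and runs the command's handler.
HANDLERS = {
    "wash": lambda arg: f"washing {arg}",
    "cut":  lambda arg: f"cutting {arg}",
    "add":  lambda arg: f"adding {arg}",
    "stir": lambda arg: f"stirring {arg}",
}

def interpret(script):
    log = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        command, arg = line.split(maxsplit=1)   # first column is the command
        log.append(HANDLERS[command](arg))      # run that command's handler
    return log

for step in interpret("wash vegetables\nadd oil\nstir vegetables"):
    print(step)
```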


(2) Rule-based language system

The execution of such languages is driven by conditional rules: an action is triggered when its rule is satisfied. The language structure is predicate logic → action, as shown in Figure 1-1.


Figure 1-1

Such languages are therefore often called logic languages. They are often used in natural language processing and artificial intelligence; the typical representative is Prolog.

(3) Process-oriented language system

Procedural language systems are the ones we are most familiar with, such as Batch, shell scripts, Perl, Lua, and so on. In contrast to command-based languages, they can encapsulate a series of commands into a block of code that can be called over and over again. This code block borrows the concept of a function from mathematics, an x yielding a y, that is, for each input there is an output, so the block is called a function.

(4) Object-oriented language system

Modern scripting languages are basically object-oriented, and people use them a great deal, Python for example. Many readers mistakenly assume that as long as a language contains the keyword class it is object-oriented, which is not rigorous. It is possible to define a class with the keyword class in Perl, but its internal implementation is not fully object-oriented; Perl is essentially a procedural language. The world’s first fully object-oriented language was Smalltalk, in which everything is implemented as an object; it has a pure object-oriented pedigree.

Why does the CPU use numbers as instructions

At the end of the previous section, “Similarities and differences between compiled and scripted programs,” we asked why the CPU does not directly support strings as instructions. I suspect some readers mistakenly assume that the CPU executes assembly code directly. This is not true: assembly code is a symbolic representation of machine code, almost one-to-one, but assembly code is definitely not machine language.

You see, if assembly code were machine instructions, the input the CPU sees would be strings. Consider, for example, the following assembly code for calculating 1+10-2.
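The assembly listing referred to here did not survive; below is a plausible x86-style reconstruction (the register choice is mine). To anything reading this file as text, mov is merely a three-character string:

```asm
; computes 1+10-2; to an assembler this whole file is text
mov eax, 1       ; eax = 1
add eax, 10      ; eax = 11
sub eax, 2       ; eax = 9
```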

Assembly language is actually the input to the assembler. To the assembler, an assembly code file is also text, so the mov instruction is just a string. It would be very inefficient for the CPU to read an assembly file directly and analyze the strings line by line to determine the instructions.

After all, there are too many characters to compare, and more comparisons mean lower efficiency; encoding instructions as numbers makes comparison much easier. Above all, the CPU is better at handling numbers: its DNA is digital circuitry, and its computation is built on numeric processing. This is essentially why binary data is faster and more compact than textual ASCII.

Why are scripting languages slower than compiled languages

Scripting languages execute in two ways. One is to interpret and execute directly, generating no instructions. This process is mostly string comparison, and string comparison has time complexity O(n): the interpreter needs up to n character comparisons just to determine what the opcode is, and then more to obtain its operands. How can that not be slow? Compiled languages are compiled into machine code, binary numbers, which run directly on the CPU, and the CPU is good at numbers: a single numeric comparison determines an opcode.
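The contrast can be sketched as follows (the function names, opcode names, and numbering are invented for illustration):

```python
# Dispatching on a string opcode compares characters one by one,
# O(n) in the opcode length; dispatching on an integer opcode is a
# single machine comparison.
def dispatch_by_string(op, a, b):
    if op == "add":          # character-by-character comparison
        return a + b
    elif op == "subtract":
        return a - b
    elif op == "multiply":
        return a * b

ADD, SUB, MUL = 0, 1, 2
def dispatch_by_number(op, a, b):
    if op == ADD:            # single integer comparison
        return a + b
    elif op == SUB:
        return a - b
    elif op == MUL:
        return a * b

print(dispatch_by_string("multiply", 6, 7))   # prints 42
print(dispatch_by_number(MUL, 6, 7))          # prints 42
```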

The other kind of scripting language is compiled first, generating opcodes that are then handed to a virtual machine for execution. This adds an opcode-generation step and so “appears” slower, but that is not really the main cost.

You see, a program’s “execution” speed is relative. A compiled language is already binary by the time it runs, while most scripting languages are still text when run and must first go through a compilation step.

That step is full of string processing: to the compiler, the entire script source is one long string on which all kinds of comparisons must be performed, and such a lengthy step is bound to be slow. To reduce recompilation, some scripting systems cache the compilation result as a file after the first compilation. Python, for example, stores the result of compiling a .py file as a .pyc file.

But does this make a scripting language that no longer needs recompiling the equal of a compiled program? Disk IO is among the slowest parts of the system, and doesn’t the interpreter still need time to read the cached file? Wait, a reader might say, compiled programs are also read from disk when loaded by the operating system. Is that any different?

Of course it is different. Remember that a script is executed by first loading the interpreter, which is itself a file on the hard disk, just a binary executable. Loading the interpreter requires reading the disk, and then the interpreter must read the script file from the disk and compile it.

You see, a compiled program performs one IO to start, while a script performs two, one extra slow IO operation, so scripting languages are bound to be slower.

Why do people use scripting languages when they are slow

“Language” here refers to a language’s compiler or interpreter, hereinafter simply called the language.

A slow language does not drag down the whole system; the weak point that limits overall speed is generally not the language itself. At present, the bottleneck of most systems is the IO part, and a language is orders of magnitude faster than IO. Making the language 10 times faster does not make the entire system 10 times faster, and a slower language still may not affect the whole system; it depends on where the bottleneck is.

It is like an overloaded boat at the zoo: people do not blame a passenger for being too fat; they know it is the elephants on board that account for most of the weight. The weight of a person is simply not of the same order as the weight of an elephant.

Moreover, even if the language is sped up, the process still blocks because IO cannot keep up (for a scripting language, it is the script interpreter that blocks); a faster language merely spends even longer blocked waiting for IO.

Why does it block? The blocking usually occurs because a subsequent instruction needs data to be read from an IO device: later steps of the program depend on that data, and without it the program cannot meaningfully run. For example, a Web server first reads data from the hard disk and then sends it out through the network card; only after the data has been obtained from the disk can the part of the Web server process that drives the network card run on the CPU.

Since the language interpreter runs on the CPU, which is much faster than any IO device, it has nothing to do while waiting for the device to respond. To make the best use of precious CPU resources, the operating system puts the process (whether a binary executable or a script interpreter) into the blocked queue, letting other runnable processes that do not need to block use the CPU (a blocked process does not get to run on the CPU; that is, it is removed from the operating system scheduler’s ready queue).

However slow a language (a scripting-language interpreter) is, it is still faster than the IO devices, so it still ends up blocked by the slower IO. In other words, it is the slowest part of the system that slows the whole system down, and no matter how slow a scripting language is, the IO devices are always slower than the language. Scripting languages therefore cannot be blamed for “hurting system performance.”

On the other hand, people like scripting languages because of their high development efficiency, which is the very reason scripting languages were invented. Many tasks that take many steps in C can be done in a single line in a scripting language, which of course makes them more popular with developers.

What is intermediate code

Many compilers compile the source language into intermediate code and then into object code, but an intermediate language is not required. Intermediate code, called IR for short, is a language between the source program and machine language; it takes forms such as n-tuples (for example triples and quadruples), reverse Polish notation, and trees.
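The statement a = b + c * d, for example, could be written in each of these forms roughly as follows (the exact notation varies between compilers; this one is illustrative, with t1 and t2 as invented temporaries):

```
quadruples:       (*, c, d, t1)    (+, b, t1, t2)    (=, t2, _, a)
reverse Polish:   a b c d * + =
tree:             (= a (+ b (* c d)))
```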

Object code is the code that runs on the target machine and is directly tied to the target machine’s architecture. So why not generate object code directly?

(1) Cross-platform

Since intermediate code is not object code, it can serve as a common language across platforms, separating the compiler’s front end from its back end. This improves development efficiency in multi-platform, multi-language environments: once a source has been compiled to intermediate code on one platform, the remaining work from intermediate code to object code can be carried on by the compiler of each target platform.

(2) Easy to optimize

Intermediate code is closer to the source code, making optimization more direct. Moreover, intermediate code can be optimized on one platform and then sent to other platforms to be compiled into target machine code, improving optimization efficiency.

What are the front end and back end of a compiler

The front and back ends of the compiler are divided by intermediate code, as shown in Figure 1-2.


Figure 1-2

The front end is mainly responsible for reading the source code, preprocessing it, turning words into a token stream through lexical analysis, then performing syntax analysis and semantic analysis, and converting the source code into intermediate code.

The back end is responsible for converting optimized intermediate code into object code.

Lexical analysis, syntax analysis, semantic analysis, and code generation are not executed sequentially

Many textbooks divide the compilation phase into several separate parts:

(1) Lexical analysis;

(2) Syntax analysis;

(3) Semantic analysis;

(4) Generate intermediate code;

(5) Optimize the intermediate code;

(6) Generate object code.

This gives the illusion that these steps are sequential, that the six steps from source code to object code must be executed one after another. This is not the case, at least not in an efficient compiler.

These are only logical, functional steps. Take the first four: they are executed in parallel, interleaved along the main line of syntax analysis; that is, all four begin and end with parsing.

Each step’s function is carried out by an actual module: the one responsible for lexical analysis is called the lexical analyzer (lexer), the one responsible for generating code is called the code generator, and the one responsible for syntax analysis is called the parser.

The compiler we speak of is thus made up of the lexer, the parser, and the code generator (plus an optimization module if object code is optimized).

The entry point of compilation is syntax analysis, so compilation starts with a call to the parser, which calls the lexer and the code generator as two subroutines. In other words, the lexer and the code generator are only ever called by the parser; without the parser, they have no chance to “show their faces.”

So compilation is driven by the parser, interspersed with calls to the lexer and the code generator.

Although syntax analysis and semantic analysis are two functions, they can be combined into one, because the semantics are known once the syntax has been analyzed. After all, syntax is the set of rules for expressing semantics, and those rules are defined by the compiler’s authors, so the compiler can understand the semantics by analyzing its own rules.

For example, when reading an English sentence, especially a complex long one, first find the predicate verb; with the predicate verb as the dividing line, split the sentence into subject and predicate parts; find the subject in the former and the object in the latter, and so on. Once the grammar has been analyzed, the meaning of the sentence becomes clear.

In other words, syntax analysis and semantic analysis happen at the same time, so it is not surprising to combine them. You see, syntax analysis and semantic analysis really are parallel.

For parsing efficiency, the lexical analyzer is usually called by the parser as a subroutine: each time the parser needs a token, it calls the lexer for the next word. You see, syntax analysis and lexical analysis really are parallel.

Finally, code generation. The prevailing way of generating code is called syntax-directed translation. What does that mean? Object code or intermediate code is generated “at the same time” as the syntax is parsed; generation is driven by parsing. As soon as the parser understands a piece of the source’s semantics, it calls the code generator to emit object code or intermediate code, so this too is parallel with the parser.

Note that rather than parsing the entire source and then generating all the object or intermediate code at once, the compiler parses a portion of the source and immediately generates the object or intermediate code for that portion; this is both more efficient and easier to implement.

For example, if a source file has 10 lines of code, the parser keeps calling the lexer to obtain one token at a time; once it has read, say, the first three lines and determined their semantics, it immediately generates object or intermediate code with the same meaning as those three lines.

The parser then calls the lexer to read the source from line 4 onward, repeating the parse-then-generate process. In short, the main line is syntax analysis: the source code is divided according to the grammar into many small parts, and object or intermediate code is generated for each part in turn.
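The process described above can be sketched as a toy syntax-directed translator: a parser for integer +/- expressions that emits an (invented) opcode the moment each fragment of syntax is understood, rather than building a tree first.

```python
# Syntax-directed code generation: the parser emits an instruction as
# soon as it has understood a fragment.  Opcode names are invented.
def compile_expr(src):
    code = []
    pos = 0
    def number():
        nonlocal pos
        start = pos
        while pos < len(src) and src[pos].isdigit():
            pos += 1
        code.append(("PUSH", int(src[start:pos])))   # emit as soon as parsed
    number()
    while pos < len(src):
        op = src[pos]
        pos += 1
        number()
        code.append(("ADD",) if op == '+' else ("SUB",))
    return code

print(compile_expr("1+10-2"))
# prints [('PUSH', 1), ('PUSH', 10), ('ADD',), ('PUSH', 2), ('SUB',)]
```

Note that the instructions for 1+10 are already emitted before the parser has even seen the -2, which is exactly the piece-by-piece generation described above.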

In summary, to make compilation more efficient, lexical analysis, syntax analysis, semantic analysis, and code generation are executed in parallel, with syntax analysis at the center; both lexical analysis and code generation are subroutines called by the parser.

What is a symbol table

The symbol table is discussed here because the term sounds “intimidating”: since it is invisible, many beginners imagine it to be something mysterious. A symbol table is a table for storing symbols, that’s all.

Symbols in the source code have to be stored somewhere so they can be found when referenced, so the purpose of a symbol table is to keep track of the symbols in a file. Symbols include strings, method names, variable names, variable values, and so on. Another important reason for putting symbols in a table is to make instruction generation easier by giving instructions a uniform format.

The compiler uses a symbol’s index in the symbol table as an instruction operand; without the index, instructions would be messy. If a function name or a string literal were used directly as an operand, for example, the instruction would be verbose. “Table” here does not specifically mean a two-dimensional table; it is a general concept covering any data structure that supports adding, deleting, modifying, and searching, so a symbol table can be implemented with any such structure: a linked list, a hash table, an array, and so on.
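A minimal sketch, assuming a simple list-plus-dictionary implementation (the class name, method name, and opcodes are invented for illustration):

```python
# A symbol table: each symbol is stored once, and instructions refer to
# it by its index, keeping operands small and uniform.
class SymbolTable:
    def __init__(self):
        self.symbols = []     # index -> symbol
        self.index = {}       # symbol -> index

    def intern(self, name):
        """Return the symbol's index, adding it on first sight."""
        if name not in self.index:
            self.index[name] = len(self.symbols)
            self.symbols.append(name)
        return self.index[name]

table = SymbolTable()
i = table.intern("count")               # first symbol gets index 0
j = table.intern("message")             # second symbol gets index 1
print(table.intern("count"))            # same symbol, same index: prints 0
code = [("LOAD", i), ("LOAD", j)]       # operands are uniform small integers
print(code)                             # prints [('LOAD', 0), ('LOAD', 1)]
```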


Homemade Programming Languages

Zheng Gang

Starting from an introduction to scripting languages and virtual machines, this book explains the implementation of lexical analysis, some underlying data structures, the symbol table and class-structure symbol tables, constant storage, local variables, module variables, method storage, virtual machine principles, runtime stack implementation, the implementation of compilation, syntax-directed parsing and top-down operator-precedence rule construction, debugging, viewing the instruction stream, viewing the runtime stack, adding more methods to classes, implementing garbage collection, and adding command-line support.


Operating System True Restore

Zheng Gang

University and graduate students take courses on operating systems, and this group has strong academic ability, but the textbooks are so abstract and obscure that many students come to fear the course, and only those who already understand it can even ask questions. Operating system theory books cannot make readers understand what an operating system is; one cannot learn an operating system by imagining it. Readers need to see something concrete.

Most technical people are curious about operating systems. They want a book that tells them what an operating system is, without too much extraneous administrative material, without an overwhelming amount of code, that presents a prototype of a modern operating system, and that lets them get to the bottom of it quickly without spending a great deal of time.
