A casual talk on programming language design and implementation

Author: Chai Jie

Programming language education

Programming languages are complex, but their design follows a few basic rules. Although syntax and semantics differ from language to language, the cores are similar (for example, an expression-based syntax and abstractions over basic data types). A language builds on such a core by adding rules (for example, a type system) and language features (such as support for multithreading or asynchronous programming), according to its design goals. Different combinations of core, rules, and features give rise to the variety of programming languages we see.

However, most programming courses in China focus on the grammar of one specific language and do not explain how language features relate to programming languages as a whole. Students generally lack an “overall view” of programming languages and the ability to transfer what they learn from one language to another. What they see is the “sugared” grammar and a complex practical environment, so programming languages seem complicated and boring to them. This readily explains why, for example:

  • Students are often clueless about the errors reported by compilers and interpreters.
  • They cannot judge the design of a programming language on technical grounds.
  • They cannot decide independently which programming language, or which language features, to use for a development task.
  • They do not understand why a framework or library is used the way it is, or the principles behind it. They can only program by following examples and patterns, and when they hit a problem they cannot find the cause, so they have to search Google or ask others.
  • They find it difficult to write concise, efficient, and maintainable code, because they cannot abstract and model problems concisely.

An ideal programming language course should fuse language features with programming language theory. Theory needs practice to show how, and practice needs theory to explain why; the two reinforce each other, so that students know both what works and why it works. Learning to program without grasping the essence of programming languages is like learning to sketch without knowing the human skeleton, studying traditional Chinese medicine without Yin and Yang, or learning music without music theory. Knowledge acquired this way stays fragmented and shallow: it can neither be studied in depth nor retained for long [1].

Programming language literacy is the backbone of everyone’s programming knowledge tree (this highly structured body of knowledge is difficult to discover on your own through experience alone), and every programming task is an exercise in the abstraction and expressive power of programming languages. With this foundation, over time you will understand new programming languages and software frameworks more easily, become better at modeling and abstracting, and write natural, simple, efficient, and maintainable code for building complex software systems.

On May 18, 2021, Professor Feng Xinyu gave a lecture at Zhejiang University titled “On the Design and Implementation of Programming Languages” [2]. The lecture introduced some key factors in the design and implementation of programming languages and the relationships between them, which makes up for some of the shortcomings of current programming language education. Below are my lecture notes, which I hope will help you learn and understand programming languages, especially Rust.

(Pictures below are taken from lecture slides ~)

Programming language design and implementation: what to focus on

What to focus on

A programming language is a human-machine interface, and an instruction set is a software/hardware interface. A compiler translates a programming language into an instruction sequence suitable for a specific instruction set.

In software development, the central question is what the programmer must say explicitly and what can be left unsaid. Programming language processing is accordingly divided into two phases, static and dynamic. The static phase consists of parsing and type checking, which ensure that the program is well-formed. The dynamic phase is the execution of a well-formed program. If well-formed programs are well-behaved when executed, the language is said to be safe [3]. Abstract semantics and type systems are at the heart of programming languages.
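The two phases can be sketched on a toy expression language. Everything below (`Expr`, `Type`, `Value`, `type_check`, `eval`) is invented for illustration and is not taken from the lecture or from [3]; it is only a minimal sketch of a static check followed by execution.

```rust
// A toy expression language (hypothetical, for illustration only).
#[derive(Debug)]
enum Expr {
    Num(i64),
    Bool(bool),
    Add(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Type {
    Int,
    Bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Value {
    Int(i64),
    Bool(bool),
}

// Static phase: decide whether the program is well-formed without running it.
fn type_check(e: &Expr) -> Result<Type, String> {
    match e {
        Expr::Num(_) => Ok(Type::Int),
        Expr::Bool(_) => Ok(Type::Bool),
        Expr::Add(l, r) => match (type_check(l)?, type_check(r)?) {
            (Type::Int, Type::Int) => Ok(Type::Int),
            _ => Err("`+` expects two integers".to_string()),
        },
        Expr::If(c, t, f) => {
            if type_check(c)? != Type::Bool {
                return Err("`if` condition must be a boolean".to_string());
            }
            let (then_ty, else_ty) = (type_check(t)?, type_check(f)?);
            if then_ty == else_ty {
                Ok(then_ty)
            } else {
                Err("`if` branches must have the same type".to_string())
            }
        }
    }
}

// Dynamic phase: execute the program. `None` means evaluation got stuck,
// which is what the static phase is meant to rule out.
fn eval(e: &Expr) -> Option<Value> {
    match e {
        Expr::Num(n) => Some(Value::Int(*n)),
        Expr::Bool(b) => Some(Value::Bool(*b)),
        Expr::Add(l, r) => match (eval(l)?, eval(r)?) {
            (Value::Int(a), Value::Int(b)) => Some(Value::Int(a.wrapping_add(b))),
            _ => None,
        },
        Expr::If(c, t, f) => match eval(c)? {
            Value::Bool(true) => eval(t),
            Value::Bool(false) => eval(f),
            _ => None,
        },
    }
}

fn main() {
    // if true then 1 + 2 else 0
    let good = Expr::If(
        Box::new(Expr::Bool(true)),
        Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
        Box::new(Expr::Num(0)),
    );
    // 1 + true
    let bad = Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Bool(true)));
    println!("{:?} {:?}", type_check(&good), eval(&good));
    println!("{:?} {:?}", type_check(&bad), eval(&bad));
}
```

In this sketch the ill-typed expression is rejected by `type_check` before it runs, and it is also the one on which `eval` gets stuck, which is the sense in which a static phase can make a language safe.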

Designing and implementing a programming language is largely a matter of deciding where the boundary lies between what must be written in the code and what is handled implicitly, and there is no single right answer. Take freeing heap space as an example: in C, it must be stated explicitly in the code; in Java, it does not appear in the code at all, because the garbage collector reclaims heap space automatically at run time.
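Rust draws this boundary in yet another place, which the following minimal sketch tries to show: heap memory is released when its owner goes out of scope, without a `free` call and without a garbage collector. The `Buffer` type, its logging, and `release_order` are hypothetical and exist only to make the release points visible.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical type that records when its destructor runs.
struct Buffer {
    name: &'static str,
    data: Vec<u8>, // heap allocation owned by this value
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Buffer {
    fn drop(&mut self) {
        let line = format!("freed {} ({} bytes)", self.name, self.data.len());
        self.log.borrow_mut().push(line);
    }
}

// Returns the order in which events happened.
fn release_order() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _a = Buffer { name: "a", data: vec![0; 16], log: Rc::clone(&log) };
        log.borrow_mut().push("inner scope ends".to_string());
    } // `_a` is dropped here: nothing is written in the code, yet the point is known statically

    let b = Buffer { name: "b", data: vec![0; 32], log: Rc::clone(&log) };
    drop(b); // an explicit early release is still possible
    log.borrow_mut().push("end of function".to_string());

    let events = log.borrow().clone();
    events
}

fn main() {
    for line in release_order() {
        println!("{}", line);
    }
}
```

The release of `_a` is "left unsaid" as in Java, but it is decided at compile time rather than by a collector at run time.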

Development time concerns software developers, who want a programming language to express computing tasks simply, efficiently, and correctly. Run time concerns the computer, which should complete the computing task with as little time and as few hardware resources as possible. Programming languages face a wide variety of application scenarios, operating environments, and developer communities. It is hard to suit every taste, so there is no single universal language.

In the design and implementation of a programming language, ease of use, security, and performance are hard to reconcile: improving one often worsens another. The designer therefore has to balance and compromise according to the application scenario.

JavaScript, Python, and Lua are dynamic scripting languages that are easy to use but comparatively weak in performance and maintainability; they suit small code bases and throwaway projects. Java, Go, Dart, C#, Swift, and Kotlin strike a balance among security, ease of use, and performance, and suit medium-sized and large application development. C and C++ give up security in pursuit of performance, which creates many traps and makes debugging and maintenance difficult. Rust is a very popular systems programming language. It does well on both security and performance, but it has a steep learning curve and demands a certain level of skill from its users.

Ease of use

Untyped code is easier to write, but code with type annotations is more readable and maintainable because the annotations carry additional information. Take fun f(g: int -> int, x: int): int as an example. From the type annotations alone, we know that the first argument g of f is a function from integers to integers, the second argument x is an integer, and f returns an integer.
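For readers who know Rust, the same signature might be spelled as follows. The text gives only the signature of `f`; the body (applying `g` to `x`) and the helper `double` are assumptions made for this sketch.

```rust
// Rust spelling of `fun f(g: int -> int, x: int): int`.
// The body is an assumption: the signature alone does not say what f does.
fn f(g: impl Fn(i32) -> i32, x: i32) -> i32 {
    g(x)
}

// Hypothetical helper to pass as `g`.
fn double(n: i32) -> i32 {
    n * 2
}

fn main() {
    println!("{}", f(double, 21)); // a named function as `g`
    println!("{}", f(|n| n + 1, 41)); // a closure as `g`
}
```

Even without reading the body, a caller can see from the signature that passing a string or a two-argument function as `g` will be rejected at compile time.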

Different groups of developers make different demands on usability. Application developers want a small set of language features that is easy to learn and easy to maintain, without exotic syntax that is hard to master and rarely needed in application development. Developers of libraries, frameworks, and DSLs want a language that is easy to reuse and extend, with features such as generics, operator overloading, and macros.

The example on the lecture slide defines a custom operator +*-, which might be useful for DSL users. But the average developer dislikes such custom operators because they severely reduce the readability of the program.
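Languages differ in how far they let library authors go here. Rust, for instance, does not allow new operator symbols such as `+*-` to be defined; it only lets existing operators be overloaded through traits in `std::ops`. The following is a minimal sketch of that more restricted design, using a hypothetical `Vec2` type.

```rust
use std::ops::Add;

// Hypothetical 2D vector type, used only to illustrate operator overloading.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Vec2 {
    x: i32,
    y: i32,
}

// Implementing `Add` gives `+` a meaning for `Vec2` values.
impl Add for Vec2 {
    type Output = Vec2;

    fn add(self, rhs: Vec2) -> Vec2 {
        Vec2 { x: self.x + rhs.x, y: self.y + rhs.y }
    }
}

fn main() {
    let sum = Vec2 { x: 1, y: 2 } + Vec2 { x: 3, y: 4 };
    println!("{:?}", sum);
}
```

Restricting overloading to a fixed set of familiar operators is one way to give library authors some extensibility while limiting the readability cost for everyone else.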

Ease of use vs. security

There is a trade-off between ease of use and security: improving security means placing more constraints on developers. These constraints take two main forms. The first is to forbid error-prone language features at the syntax level. For example, Dijkstra argued that the goto statement is harmful, and many modern programming languages omit or restrict it. The second is to rule out error-prone code through the type system. While these constraints improve security, they also make the language harder to use and code harder to write. In short, security and ease of use pull in opposite directions, and a trade-off is required.

Type systems can be divided into weak and strong according to whether implicit type conversions are freely allowed. JavaScript is a typical weakly typed language that allows free type conversions, which leads to all kinds of pitfalls.

C and C++ are also weakly typed and have many pitfalls: dangling pointers, double frees, out-of-bounds indexing, buffer overflows, and so on. Rust is strongly typed and ensures security through an enhanced static type system combined with dynamic checks, but it has a steep learning curve. Newly designed programming languages in recent years rarely use weak typing, because it makes security hard to guarantee.
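Two of these points can be illustrated with a small Rust sketch: conversions between integer types have to be written out, and an out-of-bounds access can be turned into an ordinary value by a run-time check. The function names `widen_and_add` and `element_at` are made up for this example.

```rust
// Strong typing: mixing i32 and i64 requires an explicit conversion.
fn widen_and_add(a: i32, b: i64) -> i64 {
    // Writing `a + b` here would be a compile-time type error.
    i64::from(a) + b
}

// Dynamic check: `get` tests the index at run time and returns `None`
// instead of reading past the end of the slice.
fn element_at(items: &[i32], index: usize) -> Option<i32> {
    items.get(index).copied()
}

fn main() {
    let items = [10, 20, 30];
    println!("{}", widen_and_add(1, 2));
    println!("{:?}", element_at(&items, 1));
    println!("{:?}", element_at(&items, 10));
}
```

The extra `i64::from` and the `Option` in the return type are small examples of the added writing effort that buys the additional security.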

Type systems can also be divided into static and dynamic according to when the constraint rules are checked. If all or nearly all type checking is done at compile time, the type system is called static; if all or nearly all of it is done at run time, it is called dynamic. Sometimes the execution flow of a program can only be determined at run time, so type checking at run time is more precise, but it costs run-time performance. A static type checker is conservative: it would rather reject a correct program than let an error through, so it sometimes produces false positives.
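Even a statically typed language can defer some type checks to run time. As a minimal sketch, Rust's standard `std::any::Any` trait lets a function inspect the concrete type of a value while the program is running; the `describe` function is hypothetical.

```rust
use std::any::Any;

// The concrete type behind `value` is unknown at compile time,
// so each `downcast_ref` call is a type check performed at run time.
fn describe(value: &dyn Any) -> String {
    if let Some(n) = value.downcast_ref::<i32>() {
        format!("integer {}", n)
    } else if let Some(s) = value.downcast_ref::<String>() {
        format!("string {:?}", s)
    } else {
        "unknown type".to_string()
    }
}

fn main() {
    println!("{}", describe(&7_i32));
    println!("{}", describe(&String::from("hi")));
    println!("{}", describe(&1.5_f64));
}
```

Each call pays for the checks while the program runs, and an unexpected type only shows up at that point, which is the cost of moving the check out of the compiler.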

Dynamic and static typing each have their pros and cons. Can we get the best of both? Gradual typing, which aims to combine the advantages of both, is currently a popular research direction in academia.

Ease of use vs. performance

There is also a trade-off between ease of use and performance. For example, dynamic dispatch is easier to use than static dispatch, but it performs worse.
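Rust exposes both forms of dispatch, which makes the difference easy to see side by side. The trait `Shape` and the types `Square` and `Rect` below are hypothetical; the sketch only contrasts a generic function (resolved at compile time) with a trait object (resolved at run time).

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);
struct Rect(f64, f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

impl Shape for Rect {
    fn area(&self) -> f64 {
        self.0 * self.1
    }
}

// Static dispatch: a separate copy is compiled for each concrete `T`,
// so the call target is known at compile time.
fn area_static<T: Shape>(shape: &T) -> f64 {
    shape.area()
}

// Dynamic dispatch: one compiled function; the call target is looked up
// through the trait object's method table at run time.
fn area_dynamic(shape: &dyn Shape) -> f64 {
    shape.area()
}

// Dynamic dispatch makes a mixed collection of shapes straightforward.
fn total_area(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    println!("{}", area_static(&Square(2.0)));
    println!("{}", area_dynamic(&Rect(2.0, 3.0)));
    let shapes: Vec<Box<dyn Shape>> = vec![Box::new(Square(2.0)), Box::new(Rect(2.0, 3.0))];
    println!("{}", total_area(&shapes));
}
```

The heterogeneous `Vec<Box<dyn Shape>>` is the convenience that dynamic dispatch buys; the price is an indirect call per element and fewer opportunities for inlining.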

The more low-level and specific a language's features are, the more control developers have over details, which can improve performance. Abstraction hides details and so improves ease of use, but often at the cost of performance. Intel recently launched oneAPI, which claims to abstract away hardware details (CPUs, GPUs, AI accelerators, FPGAs) for programmers without sacrificing performance. Whether it delivers in practice remains to be proven.

Security vs. performance

Complex static type checking improves security, but it also raises learning and development costs. Some security mechanisms (dynamic type checking, garbage collection, and so on) are therefore carried out at run time, at the expense of performance. Such dynamic security mechanisms are typical of non-systems programming languages such as Java and Go.

Conclusion

Because of shortcomings in programming language education in China, students generally lack an “overall view” of programming languages and the ability to transfer what they learn from one language to another. The programming language is the basic tool of software development: it directly affects development efficiency and the development experience, and it also affects the run-time qualities of software, such as performance and reliability. There are three key factors to consider when designing and implementing a programming language: ease of use, security, and performance. It is difficult to strike a balance among the three.

References

[1] On efficient programming language education in China

[2] Feng Xinyu. On the Design and Implementation of Programming Languages. Lecture at Zhejiang University, May 18, 2021.

[3] Robert Harper, Practical Foundations for Programming Languages, Second Edition, Cambridge University Press, 2016.
