Over the past several years a number of new programming languages have appeared, such as Mozilla’s Rust, Apple’s Swift, and JetBrains’ Kotlin, while established languages such as Java keep iterating. These languages offer developers a range of options for speed, safety, convenience, portability, and functionality.

Why have programming languages evolved so rapidly in recent years? I think one of the most important reasons is that we now have better tools for building languages, and compilers in particular. Chief among them is LLVM (originally short for Low Level Virtual Machine), an open source project started by Swift creator Chris Lattner as a research project at the University of Illinois.

LLVM not only simplifies the creation of new languages but also improves the development of existing ones. It provides tooling that automates many of the most thankless parts of language creation: building a compiler, porting the output to multiple platforms and architectures, and writing the code for common language metaphors such as exception handling. Its liberal license means it can be freely reused as a software component or deployed as a service.

If we list the languages that use LLVM, many familiar names appear. Apple’s Swift uses LLVM as its compiler framework, and Rust uses LLVM as a core component of its toolchain. Many compilers also have an LLVM edition; Clang, the C/C++ compiler, is itself an LLVM-based project. And Kotlin, nominally a JVM language, has a variant called Kotlin Native that uses LLVM to compile to machine-native code.

An introduction to LLVM

LLVM is essentially a software library for programmatically creating machine-native code. Developers call its API to generate instructions in a format called IR (intermediate representation). LLVM can then compile the IR into a standalone binary, or JIT-compile (just-in-time compile) it to run in the context of another program, such as an interpreter or a runtime for the language.

The LLVM API gives developers primitives for the constructs and patterns common to programming languages. For example, almost every language has the concept of a function and of a global variable, so LLVM includes functions and global variables as standard elements of its IR. Developers can use LLVM’s implementation directly and focus on what is unique to their own language, instead of spending time and effort reinventing those particular wheels.
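
To make that concrete, here is a minimal sketch using the llvmlite Python bindings (discussed later in this article) that asks LLVM to build IR for one global variable and one function. The names counter and add are purely illustrative assumptions:

```python
# Build LLVM IR for a global variable and a function with llvmlite.
# Install with: pip install llvmlite
from llvmlite import ir

i32 = ir.IntType(32)
module = ir.Module(name="demo")

# The IR counterpart of `int counter = 0;`
counter = ir.GlobalVariable(module, i32, name="counter")
counter.initializer = ir.Constant(i32, 0)

# The IR counterpart of `int add(int a, int b) { return a + b; }`
fn_type = ir.FunctionType(i32, (i32, i32))
fn = ir.Function(module, fn_type, name="add")
builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
a, b = fn.args
builder.ret(builder.add(a, b, name="sum"))

# Print the textual LLVM IR the library generated for us.
print(module)
```

Note that nothing here is specific to any machine; the same few lines describe the function and the global for every target LLVM supports.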

Figure 1: An example of LLVM IR. A simple program written in C is shown on the right, and the LLVM IR produced from it by the Clang compiler is shown on the left.

LLVM: Designed for portability

A useful way to think about LLVM is by analogy with C. C is often described as a portable, high-level assembly language, because it provides constructs that map closely to system hardware and has been ported to almost every system architecture. But being a portable assembly language was never a design goal of C; it is simply a by-product of how the language works.

In contrast, LLVM IR was designed from the outset to be a portable assembly language. One way it achieves portability is by offering primitives that are independent of any particular machine architecture. For example, integer types can use however many bits are needed, such as 128-bit integers, rather than being limited by the word size of the host machine. Nor do developers have to worry about crafting output to match a particular processor’s instruction set; LLVM takes care of all of that.
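
As a small, hedged illustration with llvmlite, the sketch below declares a 128-bit addition just as it would an ordinary 32-bit one; lowering the i128 arithmetic onto whatever instructions the target hardware actually has is LLVM’s job, not ours. The names wide_ints and add_wide are assumptions for the example:

```python
# IR integer widths are not tied to the machine's word size.
from llvmlite import ir

i128 = ir.IntType(128)                      # a 128-bit integer type
module = ir.Module(name="wide_ints")
fn = ir.Function(module, ir.FunctionType(i128, (i128, i128)), name="add_wide")
builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
builder.ret(builder.add(*fn.args))          # add works on i128 like any other width
print(module)                               # portable IR; the back end maps it to real instructions
```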

Readers who want to see LLVM IR in action can visit the ELLCC project website and try a live demo that converts C code to LLVM IR right in the browser (links at the end of this article).

Using LLVM in programming languages

LLVM is most commonly used as an AOT (ahead-of-time) compiler for a language, but it supports several other uses as well.
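
As a hedged sketch of the AOT path, the following uses the llvmlite bindings to compile an IR module (such as the one built earlier) into a native object file that an ordinary linker can consume; the file name demo.o is just an illustrative assumption:

```python
# Ahead-of-time compilation: turn IR into a native object file.
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

target_machine = llvm.Target.from_default_triple().create_target_machine()
compiled = llvm.parse_assembly(str(module))   # `module` from the earlier sketch
compiled.verify()

with open("demo.o", "wb") as f:
    f.write(target_machine.emit_object(compiled))   # native object code for this machine
```

Linking the resulting object file against a small driver program produces an ordinary executable, with no LLVM needed at run time.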

JIT compilation with LLVM

In some cases code needs to be generated on the fly at run time rather than compiled ahead of time. The Julia language, for example, JIT-compiles its code because it needs to run fast while still letting the user work through a REPL (read-eval-print loop) or interactive prompt. Mono, the open source implementation of .NET, also offers the option of generating native code through an LLVM back end.
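
Here is a rough sketch of what JIT compilation looks like from the host program’s side, using llvmlite’s MCJIT wrapper to compile the add function from the earlier sketch in memory and call it immediately (assuming the module variable from that sketch):

```python
# JIT-compile IR in memory and call it through ctypes, with no files on disk.
import ctypes
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

backing_module = llvm.parse_assembly(str(module))   # `module` from the earlier sketch
target_machine = llvm.Target.from_default_triple().create_target_machine()
engine = llvm.create_mcjit_compiler(backing_module, target_machine)
engine.finalize_object()                            # generate machine code now

addr = engine.get_function_address("add")
add = ctypes.CFUNCTYPE(ctypes.c_int32, ctypes.c_int32, ctypes.c_int32)(addr)
print(add(2, 3))   # -> 5, executed as freshly generated native code
```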

Numba, a high-performance numerical package for Python, JIT-compiles Python functions to machine code. Numba can also compile code ahead of time, but Python, like Julia, is an interpreted language built around rapid development, so JIT compilation complements Python’s interactive workflow better than AOT compilation does.

There are also more unorthodox experiments in using LLVM as a JIT. For example, one approach compiles PostgreSQL queries with LLVM and reports up to a fivefold increase in performance.

Numba uses LLVM to JIT-compile numerical code and speed up its execution; for example, a JIT-accelerated sum2d function runs 139 times faster than the equivalent plain Python code.
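
For a sense of what that looks like in practice, here is a minimal sketch of such a sum2d function; the body is my own illustrative assumption, and only the function’s name and the reported speedup come from the article:

```python
# Numba's @njit decorator JIT-compiles this function through LLVM.
import numpy as np
from numba import njit

@njit
def sum2d(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total

data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
print(sum2d(data))   # the first call triggers compilation; later calls run at native speed
```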

Using LLVM for automatic code optimization

LLVM does more than compile IR into native machine code. Developers can also programmatically direct it to optimize the code with a high degree of granularity, all the way through the linking process. These optimizations can be quite aggressive and include inlining functions, eliminating dead code (including unused type declarations and function arguments), and unrolling loops.

Again, the power of LLVM is that you don’t have to implement any of this yourself. LLVM handles it, and developers can switch individual optimizations off as needed. For example, if you are willing to trade some performance for smaller binaries, your compiler front end can tell LLVM to disable loop unrolling, as in the sketch below.
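
A hedged sketch of what that control looks like through llvmlite’s wrapper around LLVM’s legacy pass manager (newer llvmlite releases also expose the new pass manager); module is an IR module like the ones built earlier:

```python
# Optimize an IR module, but keep binaries small by disabling loop unrolling.
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 2                  # roughly -O2
pmb.inlining_threshold = 225       # allow function inlining
pmb.disable_unroll_loops = True    # trade a little speed for smaller code

pm = llvm.create_module_pass_manager()
pmb.populate(pm)

optimized = llvm.parse_assembly(str(module))
pm.run(optimized)                  # inlining, dead code elimination, and so on
print(optimized)                   # the optimized IR
```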

Domain-specific languages (DSLs) with LLVM

LLVM is most often used to build compilers for general-purpose languages, but it can also be used to produce highly vertical or domain-specific languages (DSLs). In some ways this is where LLVM shines, because building a DSL on top of it spares you much of the drudgery of creating a language while still delivering good performance.

For example, the Emscripten project takes LLVM IR and converts it to JavaScript, which in theory lets any language with an LLVM back end export code that runs in a browser. Emscripten’s long-term plan is to use an LLVM-based back end that produces WebAssembly, but the project is already a good example of LLVM’s flexibility.

Another way to use LLVM is to add domain-specific extensions to an existing language. For example, Nvidia used LLVM to build the Nvidia CUDA Compiler, which lets languages gain native support for CUDA that is compiled as part of the generated native code, rather than invoked through a library shipped alongside it.
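
A loosely related, Python-flavored illustration (my own assumption, not an example from the original article): Numba’s CUDA support compiles decorated Python functions into GPU kernels through Nvidia’s LLVM-based NVVM compiler, which gives a feel for what such a domain-specific extension looks like to its users. The sketch below requires a CUDA-capable GPU to run, and the kernel name double_elements is illustrative:

```python
# A Python function compiled to a GPU kernel via Numba and Nvidia's LLVM-based NVVM.
import numpy as np
from numba import cuda

@cuda.jit
def double_elements(arr):
    i = cuda.grid(1)              # absolute index of this GPU thread
    if i < arr.size:
        arr[i] *= 2.0

data = np.arange(16, dtype=np.float64)
device_data = cuda.to_device(data)        # copy the array to the GPU
double_elements[1, 32](device_data)       # launch 1 block of 32 threads
print(device_data.copy_to_host())         # every element doubled
```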

Working with LLVM in various languages

The typical way to work with LLVM is to write code in a language you are comfortable with, provided, of course, that the language has support for the LLVM libraries.

Two widely used choices are C and C++. Many LLVM developers default to one or the other for several good reasons:

  • LLVM is written in C++.
  • LLVM’s API is available in C and C++ incarnations.
  • Much language development tends to happen in C/C++ anyway.

Still, the choice is not limited to those two languages. Many languages can natively call C libraries, so in theory LLVM development can be done in any of them. It helps, though, if the language has a library that wraps the LLVM API cleanly. Fortunately, many languages and runtimes have such libraries, including C#/.NET/Mono, Rust, Haskell, OCaml, Node.js, Go, and Python.

One caveat is that some language bindings to LLVM are less complete than others. With Python, for example, there are many choices, but each varies in its completeness and usefulness:

  • The LLVM project itself ships a set of bindings to LLVM’s C API, but they are no longer actively maintained.
  • llvmpy stopped being maintained in 2015, which is bad news for any software project, and worse for one tracking LLVM, given how much changes from one LLVM release to the next.
  • llvmlite was developed by the Numba team and has emerged as the strongest contender for working with LLVM in Python. However, it follows the needs of Numba and therefore provides only a subset of LLVM’s functionality.
  • llvmcpy aims to provide up-to-date, automatically regenerated Python bindings for LLVM’s C library, accessible in a Pythonic way. It is still in the early stages of development, but can already do elementary work with the LLVM API.

If you are interested in learning how to build a language with the LLVM libraries, you can read the tutorial written by LLVM’s creators. Available in both C++ and OCaml, it takes the reader step by step through creating a simple language called “Kaleidoscope.” The tutorial has since been ported to other languages:

  • Haskell: a direct port of the original tutorial.
  • Python: one port follows the tutorial closely, while another is a more ambitious rewrite with an interactive command line. Both use llvmlite as their bindings to LLVM.
  • Rust and Swift: it seemed inevitable that the tutorial would be ported to two of the languages that LLVM itself helped bring into existence.

The tutorial has also been translated into other human languages; for example, there are Chinese versions based on the original C++ and on Python.

What LLVM doesn’t do

We’ve covered a lot of what LLVM provides, but it is also worth knowing what it does not currently do.

For example, LLVM does not parse a language’s syntax. Plenty of tools already do that job, such as lex/yacc, Flex/Bison, and ANTLR. Parsing is meant to be decoupled from compilation anyway, so it is not surprising that LLVM doesn’t try to address it.

LLVM also does not directly address the larger software culture around a given language. Installing the compiler’s binaries, managing packages in an installation, upgrading the toolchain, and so on are all left to the developer.

Last but not least, there are common parts of languages for which LLVM still provides no primitives. Many languages have some form of garbage-collected memory management, either as the primary way to manage memory or as an adjunct to strategies such as RAII (which C++ and Rust use). Rather than imposing a particular garbage collection mechanism, LLVM provides tools to implement your own, such as allowing code to be marked with metadata that makes garbage collectors easier to write.

That said, it is still possible that LLVM will eventually add native mechanisms for implementing garbage collection. LLVM is developing rapidly, with a major release roughly every six months, and given how many languages now build their development process around LLVM, its pace is only likely to accelerate.

This article is a translation.

www.infoworld.com/article/324…

Related links:

  • ellcc.org/
  • Ellcc.org/demo/index….
  • Llvm.org/docs/Garbag…

Thanks to Guo Lei for proofreading this article.