I. A brief introduction to GCC

1. GCC is a traditional compiler built on the classic three-phase design (the same model also applies to interpreters and JIT compilers). Its work is divided into three parts:

(1) Front end: parses the source code, checks it for syntax errors, and translates it into an abstract syntax tree (AST)

(2) Optimizer: lowers the AST into intermediate code and then optimizes that intermediate code

(3) Back end: converts the optimized intermediate code into code for the target machine; its typical tasks include instruction selection, register allocation, and instruction scheduling
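The three phases above can be sketched on a toy scale. The following is a minimal Python illustration, not how GCC is implemented: it borrows Python's own `ast` module as the "front end", and the three-address IR, the `t0`/`t1` temporaries, and the `MUL`/`RET` mnemonics are all made up for this example.

```python
import ast

def front_end(source):
    """Front end: parse source text into an AST (syntax errors surface here)."""
    return ast.parse(source, mode="eval").body

def lower(node, ir):
    """Optimizer, step 1: lower the AST into flat three-address IR tuples."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
        lhs = lower(node.left, ir)
        rhs = lower(node.right, ir)
        dest = f"t{len(ir)}"  # fresh temporary name
        op = "add" if isinstance(node.op, ast.Add) else "mul"
        ir.append((dest, op, lhs, rhs))
        return dest
    raise NotImplementedError(type(node).__name__)

def optimize(ir, result):
    """Optimizer, step 2: fold operations whose operands are both integers."""
    known, out = {}, []
    for dest, op, a, b in ir:
        a, b = known.get(a, a), known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = a + b if op == "add" else a * b
        else:
            out.append((dest, op, a, b))
    return out, known.get(result, result)

def back_end(ir, result):
    """Back end: emit text for a hypothetical target (instruction selection only)."""
    lines = [f"{op.upper()} {dest}, {a}, {b}" for dest, op, a, b in ir]
    lines.append(f"RET {result}")
    return lines

def compile_expr(source):
    ir = []
    result = lower(front_end(source), ir)
    ir, result = optimize(ir, result)
    return back_end(ir, result)

print(compile_expr("x * (2 + 3)"))  # ['MUL t1, x, 5', 'RET t1']
```

Each function only talks to its neighbor through the AST or the IR, which is the property the rest of this article relies on.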

2. Architectural advantages

(1) Because all phases share a common intermediate code, supporting a new language only requires a new front end, and supporting a new target machine only requires a new back end

(2) Front-end and back-end work call for different skills, so developers only need to focus on their own part of the stack, which makes it easier for more people to contribute
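The first advantage can be made concrete with a small sketch. Everything here is hypothetical: the two "languages" (infix and postfix arithmetic), the two "targets" (a stack machine and a register machine), and the tuple-shaped IR exist only for illustration. The point is that two front ends plus two back ends give four language/target combinations without writing four compilers.

```python
# Shared IR: a single (op, left, right) tuple.
OPS = {"+": "add", "*": "mul"}

def infix_front_end(src):
    """Parse 'a + b' into the shared IR."""
    a, op, b = src.split()
    return (OPS[op], int(a), int(b))

def postfix_front_end(src):
    """Parse 'a b +' into the same shared IR."""
    a, b, op = src.split()
    return (OPS[op], int(a), int(b))

def stack_back_end(ir):
    """Emit code for a made-up stack machine."""
    op, a, b = ir
    return [f"push {a}", f"push {b}", op]

def register_back_end(ir):
    """Emit code for a made-up register machine."""
    op, a, b = ir
    return [f"load r0, {a}", f"load r1, {b}", f"{op} r0, r0, r1"]

FRONT_ENDS = {"infix": infix_front_end, "postfix": postfix_front_end}
BACK_ENDS = {"stack": stack_back_end, "register": register_back_end}

def compile_with(language, target, src):
    """Any front end can be paired with any back end through the IR."""
    return BACK_ENDS[target](FRONT_ENDS[language](src))

print(compile_with("infix", "stack", "1 + 2"))
print(compile_with("postfix", "register", "3 4 *"))
```

Adding a third language would mean adding one entry to `FRONT_ENDS`; none of the back ends would need to change.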

3. Architectural defects

(1) GCC's three phases must be used together; it is difficult to reuse only one part of it

(2) There are several reasons why pieces of GCC cannot be reused as libraries, including rampant use of global variables, weakly enforced invariants, poorly designed data structures, a sprawling code base, and the use of macros that prevent the code base from being compiled to support more than one front-end/target pair at a time. The hardest problems to fix, however, are the architectural problems inherent in its early design and the era it came from. Specifically, GCC suffers from layering problems and leaky abstractions: the back end walks the front end's abstract syntax tree (AST) to generate debug info, the front ends generate back-end data structures, and the entire compiler depends on global data structures set up by the command-line interface.
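To see why reliance on global data structures gets in the way of library reuse, compare two styles in a small sketch. This is not GCC code; `OPT_LEVEL`, `CompilerContext`, and both `compile_*` functions are hypothetical names used only to contrast the designs.

```python
from dataclasses import dataclass

# Style 1: a process-wide global, set once (for example from command-line flags).
OPT_LEVEL = 0

def compile_with_globals(source):
    # Every caller in the process sees the same setting, so two different
    # configurations cannot safely coexist in one process.
    return f"{source} [O{OPT_LEVEL}]"

# Style 2: configuration travels with an explicit context object.
@dataclass(frozen=True)
class CompilerContext:
    opt_level: int = 0
    target: str = "generic"

def compile_with_context(source, ctx):
    # The function depends only on its arguments, so it can be embedded
    # in another program and called with many configurations.
    return f"{source} [O{ctx.opt_level}, {ctx.target}]"

debug = CompilerContext(opt_level=0, target="host")
release = CompilerContext(opt_level=3, target="device")
outputs = [compile_with_context("f.c", debug),
           compile_with_context("f.c", release)]
print(outputs)
```

With the second style, an IDE or a JIT could hold several configurations at once; with the first, it would have to mutate shared state between calls.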

II. Introduction to LLVM

1. LLVM was proposed to solve the problem of compiler code reuse

2. LLVM is a three-phase compiler delivered as a collection of optimizer and back-end libraries (an SDK), which expose a set of interfaces for developers to call

(1) Front end: a variety of front ends are supported (Clang, llvm-gcc, GHC); each must follow LLVM's rules and output LLVM IR as its intermediate code

(2) Optimizer: improves the LLVM IR through a series of analysis and optimization passes, then hands it to the code generator to produce target machine code

(3) Back end: converts the optimized intermediate code into target machine code

3. Architectural advantages

(1) LLVM IR is both well specified and the only interface to the optimizer. This means that all you need to do to write a front end for LLVM is generate LLVM IR. Because LLVM IR has a first-class textual format, it is both feasible and reasonable for a front end to emit LLVM IR as text
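As a rough illustration of item (1), a "front end" can be as small as a function that prints LLVM IR text. The helper below is a hypothetical sketch (the name `emit_binary_function` and its restriction to three opcodes are choices made for this example); it produces the textual IR for a function that takes two `i32` values and returns the result of one integer operation.

```python
def emit_binary_function(name, opcode):
    """Return LLVM IR text for `i32 name(i32 a, i32 b)` that applies one opcode."""
    if opcode not in {"add", "sub", "mul"}:
        raise ValueError(f"unsupported opcode: {opcode}")
    return "\n".join([
        f"define i32 @{name}(i32 %a, i32 %b) {{",
        "entry:",
        f"  %result = {opcode} i32 %a, %b",
        "  ret i32 %result",
        "}",
    ])

print(emit_binary_function("add", "add"))
```

The printed text could be saved to a `.ll` file and passed on to LLVM tools such as `opt` or `llc`; the front end itself never has to link against LLVM.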

(2) After the design of LLVM IR itself, the next most important aspect of LLVM is that it is built as a set of independent libraries rather than as a monolithic command-line compiler like GCC. The optimizer is a good example: it takes LLVM IR as input and transforms it step by step into LLVM IR that should execute more efficiently. Like the optimizers in other compilers, it is organized as a pipeline of distinct optimization passes that run over the input. Common examples are inlining (substituting the body of a function at its call sites), expression reassociation, and loop-invariant code motion. Which passes run depends on the optimization level: for example, at -O0 no optimization passes run, while at -O3 a series of 67 passes runs (as of LLVM 2.8).

(3) Because the optimizer is library-based, we can flexibly choose which optimization passes to combine and customize the order in which they run
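The idea of picking passes and choosing their order can be sketched with a toy pass pipeline. This is not LLVM's pass manager API; the tuple IR `(dest, op, a, b)`, the `const` and `ret` pseudo-operations, and both pass functions are assumptions made for this example.

```python
def constant_fold(ir):
    """Pass: replace operations on two integer operands with a constant."""
    known, out = {}, []
    for dest, op, a, b in ir:
        a, b = known.get(a, a), known.get(b, b)
        if op in ("add", "mul") and isinstance(a, int) and isinstance(b, int):
            value = a + b if op == "add" else a * b
            known[dest] = value
            out.append((dest, "const", value, None))
        else:
            out.append((dest, op, a, b))
    return out

def eliminate_dead_code(ir):
    """Pass: drop instructions whose result is never used later."""
    used, kept = set(), []
    for dest, op, a, b in reversed(ir):
        if dest is None or dest in used:  # dest None marks 'ret', always kept
            kept.append((dest, op, a, b))
            used.update(x for x in (a, b) if isinstance(x, str))
    kept.reverse()
    return kept

def run_pipeline(ir, passes):
    """Run the chosen passes in the chosen order; each maps IR to IR."""
    for optimization_pass in passes:
        ir = optimization_pass(ir)
    return ir

program = [
    ("t0", "add", 2, 3),
    ("t1", "mul", "x", "t0"),
    (None, "ret", "t1", None),
]

print(run_pipeline(program, [constant_fold, eliminate_dead_code]))
print(run_pipeline(program, [eliminate_dead_code, constant_fold]))
```

The first ordering leaves two instructions, because folding makes `t0` dead before dead-code elimination runs. The second ordering leaves three, since `t0` is still in use when elimination runs. Because every pass consumes and produces the same IR, a client is free to pick either order or to leave a pass out entirely.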

(4) The code generator is designed to support multiple targets. It splits code generation into several independent passes: instruction selection, register allocation, scheduling, code layout optimization, and assembly emission
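A heavily simplified sketch of three of these stages follows (scheduling and code layout are left out). The target is made up: the `ADD`/`MUL` opcodes, the registers `r0` to `r2`, and the virtual registers `v0`, `v1`, ... are hypothetical, operands are limited to integers and virtual registers, and the allocator is a naive one that reuses a register right after its value's last use and gives up where a real allocator would spill.

```python
def select_instructions(ir):
    """Instruction selection: map each IR op to an opcode of the made-up target."""
    opcodes = {"add": "ADD", "mul": "MUL"}
    return [(opcodes[op], dest, a, b) for dest, op, a, b in ir]

def allocate_registers(instrs, physical=("r0", "r1", "r2")):
    """Register allocation: assign physical registers to virtual ones."""
    last_use = {}
    for i, (_, _, a, b) in enumerate(instrs):
        for operand in (a, b):
            if isinstance(operand, str):
                last_use[operand] = i
    free, assigned, out = list(physical), {}, []
    for i, (opcode, dest, a, b) in enumerate(instrs):
        sources = [assigned.get(x, x) for x in (a, b)]
        for x in (a, b):
            # A value that dies at this instruction frees its register.
            if x in assigned and last_use[x] == i:
                free.insert(0, assigned.pop(x))
        if not free:
            raise RuntimeError("out of registers; a real allocator would spill")
        assigned[dest] = free.pop(0)
        out.append((opcode, assigned[dest], *sources))
    return out

def emit_assembly(instrs):
    """Assembly emission: format instructions as text."""
    return [f"{opcode} {dest}, {a}, {b}" for opcode, dest, a, b in instrs]

def generate_code(ir, physical=("r0", "r1", "r2")):
    return emit_assembly(allocate_registers(select_instructions(ir), physical))

ir = [("v0", "add", 1, 2), ("v1", "mul", "v0", 3), ("v2", "add", "v1", "v1")]
print(generate_code(ir))  # ['ADD r0, 1, 2', 'MUL r0, r0, 3', 'ADD r0, r0, r0']
```

Because each stage is a separate function over a shared instruction list, a different target could swap in its own opcode table or register set while keeping the other stages.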

(5) The modular design also helps testing: the BugPoint tool can automatically reduce a failing input to a small test case
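The general idea of test-case reduction can be sketched in a few lines. This is not BugPoint's actual algorithm, only a simple greedy reducer; `reduce_test_case` and the `still_fails` callback are hypothetical names, and the "crash" is simulated by checking for a marker line.

```python
def reduce_test_case(lines, still_fails):
    """Greedily remove lines as long as the failure still reproduces."""
    if not still_fails(lines):
        raise ValueError("the original input does not reproduce the failure")
    current = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate  # keep the smaller failing input
                changed = True
                break
    return current

failing_input = ["a = 1", "b = 2", "crash()", "c = 3"]
reduced = reduce_test_case(failing_input, lambda lines: "crash()" in lines)
print(reduced)  # ['crash()']
```

In practice the `still_fails` check would run the compiler (or a single pass) on the candidate input and report whether the crash or miscompilation still occurs.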

III. Introduction to Clang

1. Clang is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages

2. Underneath, it uses LLVM as its back end

3. The Clang project includes the Clang front end, the Clang static analyzer, and more. Its purpose is to produce the abstract syntax tree for the code and compile the code into LLVM bitcode

4. The LLVM back end then compiles the bitcode into platform-specific machine code

IV. The relationship between LLVM, Clang, and GCC

1. The abstract syntax tree generated by the Clang front end takes about 20% of the memory that GCC's does

2. Clang + LLVM = GCC in terms of role: the Clang front end combined with the LLVM optimizer and back end covers what GCC does as a complete compiler

3. LLVM provides flexible code reuse: it packages the compiler's back-end algorithms into independent modules with external interfaces, whereas GCC, as a traditional compiler, is tightly coupled and hard to reuse at a fine granularity

V. Reference materials

1. Relationship between LLVM, Clang, and GCC

www.jianshu.com/p/f2fc44f0f…

www.zhihu.com/question/20…

www.jianshu.com/p/ed1735229…

2. The LLVM architecture

www.aosabook.org/en/llvm.htm…

zhuanlan.zhihu.com/p/100241322