Abstract: This paper analyzes the special requirements that AI frameworks place on IR, the solutions found in the industry, and some of the thinking behind MindSpore.

This article is shared from the Huawei Cloud community column "MindSpore Technology | AI Framework Layer IR Analysis"; original author: feet girl full month.

IR (Intermediate Representation) is the translation intermediary between source code and target code in the process of program compilation, and its design is critical to a compiler. A good IR must consider the completeness of the translation from source code to target code, as well as the ease of use and performance of compilation optimization. What is the essence of an AI framework? The essence of an AI framework is to translate a user's model representation into executable code for efficient execution (training and inference). The path from the user's model representation (such as a deep neural network) to the final executable code is the behavior of a compiler, and this compiler likewise has an IR, whose design is critical to the completeness, flexibility, ease of use, and performance of the AI framework.

This paper analyzes the special requirements that AI frameworks place on IR, the solutions found in the industry, and some of the thinking behind MindSpore. First, let us review the general classification of compiler IRs and their characteristics.

Introduction to IR in the industry

1. By organizational structure [1], IR can be divided into linear IR, graphical IR, and hybrid IR:

Linear IR:

Similar to the pseudocode of some abstract machine; the corresponding algorithms iterate over a simple linear sequence of operations.

Hybrid IR:

Combines elements of graphical IR and linear IR. A common hybrid IR uses a low-level linear IR to represent branch-free blocks of code and a graphical IR to represent the control flow among those blocks.

Graphical IR:

The knowledge/information of the compilation process is stored in a graph, and the corresponding algorithms are described in terms of operations on graph objects (nodes, edges, lists, and trees).

An example of a linear IR is stack-machine code, a single-address code that assumes the operands live on a stack. Most operations take their operands from the stack and push their results back onto it. For example, the stack-machine code for the expression b - a*3 is as follows (with subtract computing the topmost value minus the second):

push a
push 3
multiply
push b
subtract
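To make the semantics concrete, here is a tiny Python sketch of such a stack machine (the opcode set and the `run` helper are illustrative, not taken from any real VM):

```python
def run(program, env):
    """Interpret single-address stack code: operands live on a stack,
    each operation pops its inputs and pushes its result."""
    stack = []
    for op, *args in program:
        if op == "push":
            v = args[0]
            # a string operand names a variable looked up in env
            stack.append(env[v] if isinstance(v, str) else v)
        elif op == "multiply":
            top, second = stack.pop(), stack.pop()
            stack.append(second * top)
        elif op == "subtract":
            top, second = stack.pop(), stack.pop()
            stack.append(top - second)   # topmost minus second
    return stack.pop()

# b - a*3: push a*3 first, then b, so subtract yields b - a*3
prog = [("push", "a"), ("push", 3), ("multiply",),
        ("push", "b"), ("subtract",)]
```

With a = 2 and b = 10, `run(prog, {"a": 2, "b": 10})` evaluates to 4.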

LLVM IR is a typical hybrid IR, which contains two levels (CFG + BB):

The top layer is the Control Flow Graph (CFG), which represents the control flow between Basic Blocks (BB). Each node of the CFG is a basic block, and there is an edge B1 -> B2 between basic blocks B1 and B2 if control may flow from the last instruction of B1 to the first instruction of B2.

The bottom layer is the basic block, in which each instruction is in SSA (Static Single Assignment) form, and these instructions form a linear list.

Sea of Nodes IR (by Cliff Click) is a typical graphical IR [2]. This IR simplifies the two-layer structure of BBs holding SSA instructions inside a CFG: BBs are removed, leaving only a one-layer structure containing instructions. By introducing special REGION, IF, and PROJECTION instructions, it relaxes the total order of instructions inside a BB into explicit data and control dependencies, and adopts the same representation and processing for both control and data dependencies, thus simplifying the analysis and transformation of the IR. Here is a simple example of this IR:

In this example, the boxes are nodes of the graph, representing SSA instructions, and the arrows are edges of the graph; solid arrows indicate control dependencies, hollow arrows data dependencies. As the example shows, use-def dependencies are explicitly contained in this IR and require no additional computation.

Based on the explicit use-def information in the IR, two kinds of optimization can be carried out conveniently: parse-time optimization (pessimistic) and global optimization (optimistic).

At parse time, because full program information is not yet available, only local optimizations can be done, such as peephole optimizations (e.g., constant folding, identity-function elimination). By designing an appropriate class and inheritance hierarchy, peephole optimization can be implemented with a simple algorithm:
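As an illustration, a parse-time peephole pass can live in the node constructors themselves; the class names below are hypothetical, not Cliff Click's actual code:

```python
class Node:
    def __init__(self, *inputs):
        self.inputs = list(inputs)   # explicit use-def edges

class ConstNode(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

class AddNode(Node):
    pass

def make_add(lhs, rhs):
    """Parse-time peephole: fold when both operands are constants,
    and apply the identity x + 0 -> x, before any node is built."""
    if isinstance(lhs, ConstNode) and isinstance(rhs, ConstNode):
        return ConstNode(lhs.value + rhs.value)      # constant folding
    if isinstance(rhs, ConstNode) and rhs.value == 0:
        return lhs                                   # identity function
    return AddNode(lhs, rhs)
```

Because each factory sees only its own operands, this works with purely local information, which is exactly what is available during parsing.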

Global optimizations, such as Sparse Conditional Constant Propagation (SCCP), can also be implemented easily: first compute def-use chains from the explicit use-def information in the graph, and then SCCP can be realized straightforwardly. Sea of Nodes IR thus offers a very important idea: express dependence information explicitly in the IR graph. This idea is continued in FIRM IR.
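A much-simplified sketch of the idea in Python (plain sparse constant propagation driven by def-use chains; real SCCP additionally tracks which branches are reachable):

```python
TOP, BOTTOM = "TOP", "BOTTOM"   # lattice: TOP (unknown) / constant / BOTTOM (varies)

def sparse_const_prop(nodes):
    """nodes maps a name to ('const', v), ('param',) or ('add', a, b).
    Constants flow along def-use chains built from the explicit use-def edges."""
    # build def-use chains from the use-def inputs
    uses = {n: [] for n in nodes}
    for name, node in nodes.items():
        if node[0] == "add":
            for operand in node[1:]:
                uses[operand].append(name)
    value = {n: (node[1] if node[0] == "const" else
                 BOTTOM if node[0] == "param" else TOP)
             for n, node in nodes.items()}
    work = [n for n, node in nodes.items() if node[0] == "add"]
    while work:
        n = work.pop()
        _, a, b = nodes[n]
        va, vb = value[a], value[b]
        if TOP in (va, vb):
            new = TOP
        elif BOTTOM in (va, vb):
            new = BOTTOM
        else:
            new = va + vb
        if new != value[n]:
            value[n] = new
            work.extend(uses[n])   # only re-examine the affected uses
    return value
```

The sparseness is the point: when a value changes, only its recorded uses are revisited, which is exactly what the explicit def-use chains buy you.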

2. Analyzing IR from the perspective of common programming languages, IR forms fall into two camps: the IR of imperative-language compilers takes SSA as its basic form, which will not be described further here; the focus below is on the IR of functional languages, where CPS or ANF is the basic form. In continuation-passing style (CPS), every function f takes an extra argument: a continuation. A continuation is itself a function; when f finishes computing its return value, instead of returning it, f calls the continuation with that value as the argument. So formally a CPS function never returns; when it would return, it passes all of its results to its continuation and lets the continuation carry on. For example:

def foo(x):
    return x + 1

In CPS form, with k as the continuation:

def foo(x, k):
    k(x + 1)

Intuitively, a CPS function does not "return" but "continues". The merit of CPS is that it makes the following things explicit: the procedure return (a call to a continuation), intermediate values (each bound to an explicit name), the order of evaluation, and tail calls (a procedure call that passes along the same continuation). For example, here is a piece of Python code that computes the product of all primes no greater than n.

def prodprimes(n):
    if n == 1:
        return 1
    if isprime(n):
        return n * prodprimes(n - 1)
    return prodprimes(n - 1)

When expressed in CPS form:

def prodprimes(n, c):
    def k(b):
        if b == True:
            m = n - 1
            def j(p):
                a = n * p
                c(a)
            prodprimes(m, j)
        else:
            def h(q):
                c(q)
            i = n - 1
            prodprimes(i, h)
    if n == 1:
        c(1)
    else:
        isprime(n, k)

As the code above shows, "procedure return" is replaced by calls to continuations c, j, k, h, and so on, and the intermediate values a, b, m, i are all given variable names. The CPS form is ideal for compiler analysis and transformation; take tail-recursion elimination: if f ends with a call to g, then the continuation of g need not be a continuation freshly created inside f, but can be the continuation that was passed to f. In the original code of the example above, the statement "return prodprimes(n - 1)" is a tail call. In the CPS form it is clear that the body of h(q) simply calls c(q), so h is equivalent to c, and we can rewrite:

def h(q):
    c(q)                     i = n - 1
i = n - 1           ->       prodprimes(i, c)
prodprimes(i, h)

While CPS is very consistent and powerful, one big problem is that it is hard to read. Hence the A-normal form (ANF): ANF can be converted directly from direct-style [4] source code, without first going through CPS form.

The ANF form divides expressions into two categories: atomic expressions and compound expressions.

An atomic expression represents a constant value, a variable, a primitive, or an anonymous function. A compound expression is composed of multiple atomic expressions and can be regarded as a call to an anonymous function or a primitive: the first input of the compound is the function being called, and the remaining inputs are the arguments of the call. A compound expression is either let-bound to a variable or appears only in the last (tail) position. As you can see, ANF expresses the intermediate values, the control flow, and the order of evaluation explicitly through let-binding. Its grammar is defined as follows [5]:

<aexp> ::= NUMBER | STRING | VAR | BOOLEAN | PRIMOP
        |  (lambda (VAR ...) <exp>)

<cexp> ::= (<aexp> <aexp> ...)
        |  (if <aexp> <exp> <exp>)

<exp>  ::= (let ([VAR <cexp>]) <exp>)
        |  <cexp>
        |  <aexp>

For example, the prodprimes function above, expressed in this grammar, becomes:

(define prodprimes
  (lambda (n)
    (let (a (= n 1))
      (if a 1 (let (b (isprime n))
                (if b (let (m (- n 1))
                        (let (p (prodprimes m))
                          (* n p)))
                      (let (s (- n 1))
                        (prodprimes s))))))))

This ANF expression, if translated into Python, would look something like:

def prodprimes(n):
    r = n == 1
    if r:
        return 1
    b = isprime(n)
    if b:
        m = n - 1
        p = prodprimes(m)
        return n * p
    s = n - 1
    return prodprimes(s)

From this code, you can also see that the ANF form is easier to understand than the CPS form.

The role of layer IR in AI frameworks

Now all mainstream AI frameworks have a layer IR. A good layer IR is conducive to the compilation, optimization, and execution of AI models, and is the basis for efficient training and inference in an AI framework. From the training perspective, there are currently three execution modes in industry AI frameworks: Eager execution mode, Graph execution mode, and Staging (mixed) execution mode; the high-performance modes (Graph and Staging) are both based on the layer IR. Eager mode generally exploits the features of the host language (now mainly Python) for interpreted execution, using operator overloading and Tape-based techniques.

Graph execution mode mainly obtains the graph structure of the AI model first, then compiles, optimizes, and executes it; here compilation, optimization, and execution are all based on the graph IR. There are currently three ways to obtain the graph structure of an AI model. The first is for programmers to construct the graph through APIs (TF1.x and the like), which is cumbersome. The second is Tracing JIT (a trend brought by JAX, now supported by TF2.0, PyTorch, and others): run the user's model script once to trace the forward execution sequence, and build the graph from that sequence. Its advantages are that it matches Eager mode well and is simple to implement; its disadvantages are that the transformation of control flow is troublesome, an execution sequence that depends on operator results is hard to capture, and side effects are hard to handle. For this reason, TF's AutoGraph also combines AST analysis to solve the control-flow transformation problem. The third is AST JIT (PyTorch's TorchScript), which builds the graph from Python's AST. Its advantage is that the transformation can be comprehensive, including control flow; its disadvantage is a complex implementation, since supporting the many Python dynamic features is a large workload. Staging mode is similar to Eager mode: it accelerates compiled execution of partial graphs through Python decorators (using Tracing JIT or AST JIT), and likewise relies on the graph IR.
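A toy sketch of the Tracing JIT idea, using Python operator overloading to record the forward execution sequence (the `TracerBox` class and `trace_fn` helper are illustrative names, not any framework's API):

```python
class TracerBox:
    """Wraps a value; arithmetic on it is recorded onto a shared trace."""
    def __init__(self, value, trace):
        self.value, self.trace = value, trace

    def _record(self, op, other, result):
        self.trace.append((op, self.value, other, result))
        return TracerBox(result, self.trace)

    def __mul__(self, other):
        return self._record("mul", other, self.value * other)

    def __add__(self, other):
        return self._record("add", other, self.value + other)

def trace_fn(fn, example_input):
    """Run fn once on a boxed example input; return the recorded ops."""
    trace = []
    out = fn(TracerBox(example_input, trace))
    return trace, out.value
```

Tracing `lambda x: x * 2 + 1` with input 3 records a `mul` then an `add`. Note the limitation described above: any Python-level `if`/`while` inside `fn` is executed rather than recorded, so the trace is only valid for the control-flow path the example input happened to take.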

From the inference perspective, the AI framework performs extensive compilation and optimization when producing the final inference model, such as quantization and pruning, which are generally carried out on the layer IR; meanwhile, the final inference model format is also expressed directly or indirectly through the layer IR. Compared with a general-purpose compiler IR, the layer IR of an AI framework faces some special requirements and challenges:

Tensor representation: AI models mainly process tensor data, which is quite different from ordinary applications, but adding a tensor data type is not difficult for a compiler IR.

Automatic differentiation: differentiability is the biggest difference between AI model development and general application development, and modern AI frameworks all provide automatic differentiation. The challenges lie in simplicity of implementation, performance, and the ability to extend to higher-order differentiation in the future.
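A minimal tape-based reverse-mode sketch, the kind of technique Eager-mode frameworks use (all names here are illustrative):

```python
class Var:
    """Scalar with a gradient; every op records a backward rule on a tape."""
    def __init__(self, value, tape):
        self.value, self.grad, self.tape = value, 0.0, tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        def backward():                      # d(a+b) = da + db
            self.grad += out.grad
            other.grad += out.grad
        self.tape.append(backward)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        def backward():                      # product rule
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        self.tape.append(backward)
        return out

def grad(fn, x_value):
    """Run fn forward, then replay the tape in reverse to get dfn/dx."""
    tape = []
    x = Var(x_value, tape)
    out = fn(x)
    out.grad = 1.0
    for backward in reversed(tape):
        backward()
    return x.grad
```

For f(x) = x² + x³, `grad(f, 2.0)` gives 2·2 + 3·4 = 16.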

JIT capabilities: whether Graph mode or Staging mode, from the algorithm engineer's point of view both count as JIT, because the compilation step is not explicitly exposed. Compilation performance is a major challenge for JIT.

Implicit parallelism: from the developer's perspective, there are two kinds of parallelism. One is explicit parallelism, where the developer explicitly tells the system where to parallelize, for example by starting threads or adding parallel annotations. The other is implicit parallelism, where the compiler analyzes dependencies and parallelizes automatically. Generally speaking, traditional CFG+BB compilers, whose program analysis is based on a total order, are convenient for explicit parallelism, while functional compilers are in theory well suited to data-dependence analysis and implicit parallel optimization. Interestingly, in deep-learning scenarios kernel execution accounts for most of the overhead, so asynchronous concurrency at run time can already improve overall performance significantly, making the relative gain from implicit parallelism smaller; still, it can be useful for achieving extreme performance.

Loop optimization: AI computation involves many tensor operations, which for the compiler means loop optimization (tensor -> vector -> scalar), but this challenge lies mainly in the operator-layer IR. Of course, the layer IR, as a kind of compiler IR, should also have the generality a compiler IR needs: a type system, control-flow and data-flow analysis, side-effect elimination, and other basic capabilities.

The industry has several schools of layer IR design

Computational graph IR: a DAG-centered implementation, used by many early frameworks. The design of a computational graph IR is relatively natural: the graph consists mainly of edges and nodes, with edges corresponding to tensors and in effect representing data dependencies; automatic differentiation and optimization are then performed on the DAG. The advantage of this scheme is that it is simple and intuitive with a small optimization-time overhead; the disadvantage is that, formally, a computational graph IR is not really a compiler IR: its support for a type system, for complex logic such as recursion, for side-effect handling, and for control-flow and data-flow analysis is incomplete.

CFG + BB: build the layer IR on top of a traditional compiler IR, as in TorchScript, Julia, and so on. How is automatic differentiation realized? Take Julia Zygote as an example [6]: for ordinary code inside a BB (non-phi, non-branch), AD code can be generated in reverse order with the help of the chain rule:

After the above expression is converted into SSA form, with the J (pullback) calls inserted and AD applied, the pseudo-SSA code shown in the figure below is obtained:

Node %6 in the figure above is called the "alpha node"; it corresponds to node %6 in the primal (in B3 of the figure above) and is the reverse function of the "/" operation.
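For the straight-line part, the reverse-order generation can be sketched as follows (a toy over an SSA-like instruction list, not Zygote's actual IR):

```python
def reverse_ad(instrs, inputs, seed_var):
    """instrs: list of (dest, op, lhs, rhs) in SSA order, op in {'mul', 'add'}.
    Returns the adjoint of every variable w.r.t. seed_var by walking the
    instruction list in reverse order and applying the chain rule."""
    env = dict(inputs)
    for dest, op, a, b in instrs:               # forward pass
        env[dest] = env[a] * env[b] if op == "mul" else env[a] + env[b]
    adj = {v: 0.0 for v in env}
    adj[seed_var] = 1.0
    for dest, op, a, b in reversed(instrs):     # reverse pass: chain rule
        if op == "add":
            adj[a] += adj[dest]
            adj[b] += adj[dest]
        else:  # mul
            adj[a] += env[b] * adj[dest]
            adj[b] += env[a] * adj[dest]
    return adj
```

For y = x*x; z = y + x, the adjoint of x at x = 3 is 2·3 + 1 = 7.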

For control flow between basic blocks in the CFG, the control flow must be analyzed in reverse, and appropriate dummy phi nodes inserted into the primal CFG to record and replay the control flow. For example, consider this code computing power:

In the corresponding primal CFG, the phi node %1 is inserted as the dummy phi node to record the control flow. This %1 is then used for control in the AD CFG (%1 records the control flow on a stack, and the AD CFG pops the stack to replay the control flow).

After subsequent code optimization, the AD code of power looks similar to the following pseudocode:

It can be seen that the automatic differentiation of CFG+BB is ultimately realized through iteration, and the SSA form with scopes must solve the problem of passing values across boundaries, which still brings some processing trouble to automatic differentiation.

How is graph optimization done? The IR is converted into use-def and def-use form for optimization.

How is parallel optimization done? Since CFG+BB is a total-order representation, it must be converted into use-def form and analyzed together with side-effect information.

The advantages of the CFG+BB scheme are complete functionality, a mature approach, and high reuse. However, for automatic differentiation, graph optimization, and parallel optimization, the CFG+BB form always requires some conversion work, and is not as intuitive or efficient.

Functional IR

Use a functional IR as the layer IR, as typified by Relay, Myia, and others. How is automatic differentiation realized? For code without control flow, the AD method is the same as for the BB-internal code above. For control flow, functional IR takes a different approach: iteration is converted to recursion, and branches are selected through a switch function. For example, for the same pow() function above:

def pow(x, n):
    return header_pow(n, 1, x)

def header_pow(phi_n, phi_r, x):
    def body_pow():
        phi_n_1 = phi_n - 1
        phi_r_1 = phi_r * x
        return header_pow(phi_n_1, phi_r_1, x)
    def after_pow():
        return phi_r
    f = switch(phi_n > 0, body_pow, after_pow)
    return f()

Taking pow(5, 3) as an example, the recursive call sequence is as follows:

pow(5, 3) -> header_pow(3, 1, 5) -> body_pow() -> header_pow(2, 5, 5) -> body_pow() -> header_pow(1, 5*5, 5) -> body_pow() -> header_pow(0, 5*5*5, 5) -> after_pow()

It can be seen that the calls and returns of the recursion here correspond to the push and pop operations on the control-flow phi node mentioned above for CFG+BB.
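The functional form above becomes runnable Python once `switch` is given a concrete meaning, here an ordinary helper that returns one of the two branch functions as a value (`fpow` renames the sketch's `pow` to avoid shadowing the builtin):

```python
def switch(cond, true_branch, false_branch):
    """Control flow as data flow: select a function value, don't call it."""
    return true_branch if cond else false_branch

def fpow(x, n):
    def header_pow(phi_n, phi_r):
        def body_pow():
            return header_pow(phi_n - 1, phi_r * x)   # loop -> recursion
        def after_pow():
            return phi_r
        f = switch(phi_n > 0, body_pow, after_pow)
        return f()          # the selected function is only called here
    return header_pow(n, 1)
```

`fpow(5, 3)` follows exactly the call sequence traced above and returns 125.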

Since the AD process is a transformation of functions, the graph after AD is also a structure of recursive calls, so there is no need for the push/pop operations on control-flow phi nodes as in CFG+BB; the recursive call process naturally replaces them.

Taking the derivative with respect to x:

def x_grad_pow(x, n):
    phi_n = n
    phi_r = 1
    return x_bprop_header_pow(phi_n, phi_r, x, 1)

def x_bprop_header_pow(phi_n, phi_r, x, sens):
    def env_x_bprop_body_pow():
        %3 = x_bprop_header_pow(phi_n - 1, phi_r * x, x, 1)
        %4 = phi_r_bprop_header_pow(phi_n - 1, phi_r * x, x, 1)
        %5 = %4 * phi_r
        return %3 + %5
    def env_x_bprop_after_pow():
        return 0
    f = switch(phi_n > 0, env_x_bprop_body_pow, env_x_bprop_after_pow)
    r = switch(phi_n > 0, f(), 0)
    return r

def phi_r_bprop_header_pow(phi_n, phi_r, x, sens):
    def env_phi_r_bprop_body_pow():
        %3 = phi_r_bprop_header_pow(phi_n - 1, phi_r * x, x, 1)
        %4 = %3 * x
        return %4
    def env_phi_r_bprop_after_pow():
        return 1
    if phi_n > 0:
        %5 = env_phi_r_bprop_body_pow()
    else:
        %5 = env_phi_r_bprop_after_pow()
    return %5

The advantage of functional IR is that it is friendly to automatic differentiation and better suited to parallel analysis; the challenges are the elimination of side effects in functional IR and its runtime performance (recursion is not friendly to execution).

MindSpore's design thinking

MindSpore's layer IR is called MindIR. The technical route chosen for MindIR is a Functional Graph IR (with reference to Sea of Nodes, Thorin, Myia, etc.), which has the following characteristics:

Functional, for a more natural automatic differentiation implementation and more convenient implicit parallel analysis: functions are first-class citizens, and higher-order functions are supported; control flow is also realized through special functions, so differentiation can be implemented in a unified way. It is implemented free of side effects, which, compared with imperative languages, simplifies analysis and enables more optimizations. It natively supports closures: on the one hand, closures in the user's source code can be expressed conveniently; on the other hand, this naturally supports the automatic differentiation algorithm, in which the reverse function needs to access the intermediate results of the original function, so the reverse function accesses those intermediate results and returns them as a closure. It uses partial-order analysis based on data dependence, which facilitates out-of-order or parallel execution.

Graph-based, better suited to fast JIT optimization: it adopts a one-layer representation similar to Sea of Nodes IR, with control flow and data flow integrated, which suits JIT optimization better.

ANF form: like Thorin, MindIR uses a graph IR and eliminates scopes; but instead of Thorin IR's CPS form, MindIR uses the ANF form, which has similar expressive power while being more intuitive and easier to inspect. MindIR aims to realize automatic differentiation and implicit parallel analysis more conveniently through the functional style, and to support more efficient JIT optimization by fusing control flow and data flow in the graph-based representation [7]. The MindIR grammar is inherited from ANF and defined as follows:

<ANode> ::= <ValueNode> | <ParameterNode>
<ParameterNode> ::= Parameter
<ValueNode> ::= Scalar | Named | Tensor | Type | Shape
             |  Primitive | MetaFuncGraph | FuncGraph
<CNode> ::= (<AnfNode> ...)
<AnfNode> ::= <CNode> | <ANode>

An ANode in MindIR corresponds to an atomic expression of ANF; it has two subclasses, ValueNode and ParameterNode. A ValueNode represents a constant node and can carry a constant value (scalar, symbol, tensor, type, shape, etc.); it can also be a Primitive function, a MetaFuncGraph, or a FuncGraph, because in functional programming a function definition is itself a value. A ParameterNode represents a formal parameter of a function. A CNode in MindIR corresponds to a compound expression of ANF and represents a function call. When MindSpore performs automatic differentiation, it computes the gradient contributions of ParameterNodes and CNodes and returns the final gradients of the ParameterNodes, without computing gradients for ValueNodes.
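A toy Python rendering of this node taxonomy (the classes mirror the names in the text; the `evaluate` interpreter is an illustrative addition, not MindSpore's API):

```python
class ANode:
    """Atomic expression (ANF): base class for value and parameter nodes."""

class ValueNode(ANode):
    """Constant node: scalar, primitive function, FuncGraph, ..."""
    def __init__(self, value):
        self.value = value

class ParameterNode(ANode):
    """Formal parameter of a function graph."""
    def __init__(self, name):
        self.name = name

class CNode:
    """Compound expression: the first input is the callee, the rest are args."""
    def __init__(self, callee, *args):
        self.callee, self.args = callee, args

def evaluate(node, env):
    """Walk the graph: parameters come from env, CNodes apply their callee."""
    if isinstance(node, ValueNode):
        return node.value
    if isinstance(node, ParameterNode):
        return env[node.name]
    fn = evaluate(node.callee, env)
    return fn(*(evaluate(a, env) for a in node.args))

# func(x, y) = x / y as a tiny function graph; the callee is a ValueNode
# because in functional programming a function is itself a constant value
graph = CNode(ValueNode(lambda a, b: a / b),
              ParameterNode("x"), ParameterNode("y"))
```

`evaluate(graph, {"x": 8, "y": 2})` yields 4.0.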

Let us take a program as an example to understand MindIR:

def func(x, y):
    return x / y

def test_f(x, y):
    a = x - 1
    b = a + y
    c = b * func(a, b)
    return c

The corresponding ANF of this Python code is expressed as:

lambda (x, y)
    let a = x - 1 in
    let b = a + y in
    let func = lambda (x, y)
        let ret = x / y in
        ret end in
    let %1 = func(a, b) in
    let c = b * %1 in
    c end

The corresponding MindIR is:https://w.url.cn/s/Ansh1KW

In MindIR, a function graph represents the definition of a common function. A function graph generally consists of a directed acyclic graph composed of ParameterNodes, ValueNodes, and CNodes, which clearly expresses the computation process from parameters to return value. The two functions test_f and func in the Python code are converted into two function graphs; the parameters x and y are converted into ParameterNodes of the function graph, and each expression into a CNode. The first input of a CNode links to the function being called, such as add, func, and return in the figure. Note that these nodes are all ValueNodes, since they are understood as constant function values. The other inputs of a CNode link to the arguments of the call; argument values can come from ParameterNodes, ValueNodes, and other CNodes.

In ANF, each expression is bound to a variable with a let expression, and dependence on the expression's output is indicated by a reference to that variable; in MindIR, each expression is bound to a node, and dependence is indicated by directed edges between nodes.

Functional semantics

An important feature of MindIR compared with the traditional computational graph is that it can express not only data dependence between operators but also rich functional semantics.

Higher-order functions

In MindIR, a function is defined by a subgraph, but it can itself be a value, passed as the input or output of other higher-order functions. For example, in the simple example below, function f is passed as an argument to function g, so g is a higher-order function that takes a function as input; the actual call site of f is inside g.

def hof(x):
    def f(x):
        return x + 3
    def g(function, x):
        return function(x) * function(x)
    res = g(f, x)
    return res

The corresponding MindIR is:https://w.url.cn/s/A8vb8X3

In real network training scripts, GradOperation and the Partial and HyperMap functions commonly used in optimizers are typical higher-order functions. Higher-order semantics greatly enhance the flexibility and simplicity of MindSpore expressions.

Control flow

In MindIR, control flow is expressed as selection calls of higher-order functions. This form converts control flow into a data flow of higher-order functions, making the automatic differentiation algorithm more powerful: it supports not only automatic differentiation of data flow, but also of conditional jumps, loops, and recursion. Here is a simple Fibonacci example:

def fibonacci(n):
    if n < 1:
        return 0
    elif n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

The corresponding MindIR is:https://w.url.cn/s/AUiE9Mc

Here fibonacci is the top-level function graph; ✓fibonacci is the True branch of the first if, ✗fibonacci is the False branch of the first if, ✓✗fibonacci is the True branch of the elif, and ✗✗fibonacci is the False branch of the elif.

The key to understanding this is that in MindIR, conditional jumps and recursion are expressed as higher-order control flow. For example, ✓fibonacci and ✗fibonacci are passed in as arguments of the switch operator, and switch selects which function to return according to the condition argument. switch performs a binary selection over the input functions as ordinary values and does not call them; the actual function call happens on the CNode immediately following the switch.

Free variables and closures

Free variables are variables that a code block references from the surrounding scoped environment rather than from its local variables.

A closure is a programming-language feature referring to the combination of a code block and the scoped environment it captures.

In MindIR, code blocks appear as function graphs, the scoped environment can be understood as the context in which the function is called, and free variables are captured by value copy rather than by reference.

A typical closure would look like this:

def func_outer(a, b):
    def func_inner(c):
        return a + b + c
    return func_inner

def ms_closure():
    closure = func_outer(1, 2)
    out1 = closure(1)
    out2 = closure(2)
    return out1, out2

The corresponding MindIR is:https://w.url.cn/s/AsUMXTS

In this example, a and b are free variables, because the variables a and b used in func_inner are parameters defined in the parent graph func_outer. The variable closure is a closure, combining the function func_inner with its context func_outer(1, 2). Hence out1 is 4, because it equals 1 + 2 + 1, and out2 is 5, because it equals 1 + 2 + 2.


[1] Engineering a Compiler, Second Edition, Chapter 5. Intermediate Representation

[2] “Combining Analyses, Combining Optimizations”

[3] Compiling with Continuations, Chapter 1

[4] Compiling Functional Programming Languages Part V: Functional Intermediate Representations

[5] matt.might.net/articles

[6] Don't Unroll Adjoint: Differentiating SSA-Form Programs

[7] mindspore.cn/doc/note/z
