Analysis of graph-layer IR in AI frameworks

Abstract: This article analyzes the special requirements that AI frameworks place on IR, the solutions the industry has adopted, and MindSpore's thinking on the subject.

This article is shared from the HUAWEI CLOUD community " MindSpore Technical Column | Analysis of Layer IR in AI Framework ", the original author: Full of Energetic Girl Moon.

IR (Intermediate Representation) is the intermediate form a program takes between source code and target code during compilation. IR design is critical to a compiler: a good IR must account for the completeness of compilation from source code to target code, the ease of use of compiler optimizations, and performance. What, then, is the essential role of an AI framework? It is to translate a user's model expression into executable code and then execute that code efficiently (for training and inference). Going from the user's model expression (such as a deep neural network) to the final executable code is the job of a compiler. This compiler also has an IR, and its design plays a vital role in the completeness, flexibility, ease of use, and performance of the AI framework.

This article analyzes the special requirements that AI frameworks place on IR, the solutions the industry has adopted, and some of MindSpore's thinking. First, let us review how general compiler IRs are classified and what characterizes each class.

Introduction of industry IR

1. By organizational structure [1], IRs can be divided into linear IR, graphical IR, and hybrid IR:

Linear IR:

Resembles the pseudocode of an abstract machine; the corresponding algorithms iterate over simple linear sequences of operations.

Hybrid IR:

Combines elements of graphical IR and linear IR. A common hybrid IR uses a low-level linear IR to represent loop-free blocks of code and a graph to represent the control flow between those blocks.

Graphical IR:

Stores the knowledge and information of the compilation process in a graph; the corresponding algorithms are described in terms of operations on the objects in the graph (nodes, edges, lists, and trees).

An example of linear IR is stack-machine code, a form of one-address code that assumes the operands are on a stack. Most operations take their operands from the stack and push their results back onto it. For example, the stack-machine code for the expression b - a * 3 is:

push 3
push a
multiply
push b
subtract
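
To make the stack discipline concrete, here is a minimal sketch of an interpreter for this kind of code. The function name `run_stack_code`, the tuple encoding of instructions, and the convention that `subtract` computes "most recently pushed minus the value below it" are all assumptions made for illustration, reading the expression as b - a * 3.

```python
def run_stack_code(program, env):
    """Execute a list of (opcode, operand) pairs on an operand stack."""
    stack = []
    for op, arg in program:
        if op == "push":
            # Variable names are looked up in env; literals are pushed as-is.
            stack.append(env[arg] if isinstance(arg, str) else arg)
        elif op == "multiply":
            top, below = stack.pop(), stack.pop()
            stack.append(top * below)
        elif op == "subtract":
            # Assumed convention: the most recently pushed value is the left operand.
            top, below = stack.pop(), stack.pop()
            stack.append(top - below)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


program = [("push", 3), ("push", "a"), ("multiply", None),
           ("push", "b"), ("subtract", None)]
result = run_stack_code(program, {"a": 2, "b": 10})
print(result)  # 10 - 2 * 3
```

Every instruction names at most one operand; the others are implicit on the stack, which is what makes the code compact and strictly linear.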

LLVM IR is a typical hybrid IR, which contains two levels (CFG+BB):

The top level is the control flow graph (CFG), which represents the control flow between basic blocks (BBs). Each node of the CFG is a basic block, and there is an edge b1->b2 between basic blocks b1 and b2 if control may flow from the last instruction of b1 to the first instruction of b2.

The bottom level is the basic block. Within a basic block, each instruction is in SSA (Static Single Assignment) form, and the instructions form a linear list.
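
The two-level shape can be sketched with a small data structure. This is only an illustration of the CFG+BB idea: the class `BasicBlock`, the helper `cfg_edges`, and the instruction strings are hypothetical placeholders and are not LLVM syntax or LLVM's API.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    name: str
    instrs: list = field(default_factory=list)      # bottom level: linear instruction list
    successors: list = field(default_factory=list)  # top level: blocks control may flow to


def cfg_edges(blocks):
    """Return the CFG edges (b1, b2) implied by each block's successor list."""
    return [(b.name, s) for b in blocks for s in b.successors]


entry = BasicBlock("entry", ["c = n > 0", "branch c, then, done"], ["then", "done"])
then = BasicBlock("then", ["r = n * 2", "jump done"], ["done"])
done = BasicBlock("done", ["return"], [])
edges = cfg_edges([entry, then, done])
print(edges)
```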

Sea of Nodes IR (by Cliff Click) is a typical graphical IR [2]. It simplifies the two-level structure of the CFG with BB+SSA instructions: basic blocks are removed, leaving a single-level structure that contains only instructions. By introducing special REGION, IF, and PROJECTION instructions, it relaxes the total order of the instructions within a basic block into explicit data dependences and control dependences, and it represents and processes both kinds of dependence in the same way, which simplifies IR analysis and transformation. The following is a simple IR example:

In this example, the boxes are the nodes of the graph, representing SSA instructions, and the arrows are the edges of the graph; the solid arrows represent control dependence; the open arrows represent data dependence. From this example, we can see that the use-def dependency is explicitly included in this IR, and no additional calculations are required.

Based on the explicit use-def information in this IR, two kinds of optimization are easy to implement: parse-time optimization (pessimistic) and global optimization (optimistic).

At parse time, the complete program is not yet available, so only local optimizations are possible, such as peephole optimizations (for example, constant folding and identity functions). With a suitably designed class and inheritance hierarchy, a simple algorithm suffices to implement peephole optimization.
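
As a rough sketch of how a class hierarchy can carry parse-time peephole optimization, each node class below overrides a `peephole` method that the parser calls right after constructing the node. The class names and the `make` helper are assumptions for illustration, not the design from [2].

```python
class Node:
    def peephole(self):
        return self  # default: no local simplification available


class Const(Node):
    def __init__(self, value):
        self.value = value


class Param(Node):
    def __init__(self, name):
        self.name = name


class Add(Node):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def peephole(self):
        if isinstance(self.lhs, Const) and isinstance(self.rhs, Const):
            return Const(self.lhs.value + self.rhs.value)  # constant folding
        if isinstance(self.rhs, Const) and self.rhs.value == 0:
            return self.lhs                                # identity: x + 0 -> x
        if isinstance(self.lhs, Const) and self.lhs.value == 0:
            return self.rhs                                # identity: 0 + x -> x
        return self


def make(node):
    """Hypothetical parser hook: simplify a node as soon as it is created."""
    return node.peephole()


x = Param("x")
folded = make(Add(Const(2), Const(3)))
same_x = make(Add(x, Const(0)))
```

Because the rule only inspects a node and its direct inputs, it needs no whole-program information, which is why it can run while parsing.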

Global optimizations such as Sparse Conditional Constant Propagation (SCCP) are also simple to implement: first compute the def-use chains from the explicit use-def edges in the graph, and SCCP then follows easily. Sea of Nodes IR contributes a very important idea: express dependence information explicitly in a graphical IR. FIRM IR carries this idea forward.
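
The first step mentioned above, deriving def-use chains from use-def edges, amounts to inverting the edges. A minimal sketch, assuming the graph is given as a dictionary from each node to the list of nodes it uses (the node names are made up):

```python
from collections import defaultdict


def build_def_use(use_def):
    """Invert use-def edges (node -> inputs) into def-use edges (node -> users)."""
    def_use = defaultdict(list)
    for user, inputs in use_def.items():
        for definition in inputs:
            def_use[definition].append(user)
    return dict(def_use)


# t1 uses a and b; t2 uses t1 and a.
use_def = {"a": [], "b": [], "t1": ["a", "b"], "t2": ["t1", "a"]}
def_use = build_def_use(use_def)
print(def_use)
```

A propagation pass such as SCCP can then push a changed value from a definition to exactly the nodes listed as its users.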

2. Looking at IR from the perspective of commonly used programming languages, IR forms fall into two camps: the compiler IRs of imperative programming languages and the compiler IRs of functional programming languages. Compiler IRs for imperative languages use SSA as their basic form, which is not repeated here. The following focuses on the IRs of functional programming languages, whose basic form is either CPS or ANF.

1. Continuation-passing style (CPS) is a form in which a function f always takes, in addition to its own parameters, one extra parameter called the continuation. The continuation is itself a function. When f has computed its result, it does not return; instead it calls the continuation with the result as the argument. Formally, a function in CPS never returns: when it wants to return, it passes its results to the continuation and lets the continuation carry on. For example:

def foo(x):
    return x+1

Converted to CPS form, k is a continuation:

def foo(x,k):
    k(x+1)

Intuitively, the function does not "return" but "continues". The advantage of CPS is that it makes the following explicit: procedure return (a call to a continuation), intermediate values (each has an explicit name), evaluation order, and tail calls (a procedure call that passes on the same continuation). Consider the following Python code, which computes the product of all primes less than or equal to n.

def prodprimes(n):
    if n == 1:
        return 1
    if isprime(n):
        return n * prodprimes(n - 1)
    return prodprimes(n - 1)

When expressed in CPS form:

def prodprimes(n, c):
    def k(b):
        if b == True:
            m = n - 1
            def j(p):
                a = n * p
                c(a)
            prodprimes(m, j)
        else:
            def h(q):
                c(q)
            i = n - 1
            prodprimes(i, h)
    if n == 1:
        c(1)
    else:
        isprime(n, k)

As the code above shows, every procedure return is replaced by a call to a continuation such as c, j, k, or h, and the intermediate values a, b, m, and i all have names. CPS is well suited to compiler analysis and transformation. Take tail-call elimination: if function f calls function g as its last action, the continuation passed to g need not be a new continuation created inside f; it can simply be the continuation that was passed to f. In the original code above, the statement "return prodprimes(n - 1)" is a tail call. In the CPS form it is clear that h(q) does nothing but call c(q), so h is equivalent to c and the following transformation can be performed [3]:


def h(q):                         i = n - 1
    c(q)            ->           prodprimes(i, c)
i = n - 1
prodprimes(i, h)
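
For readers who want to run the CPS version, here is a sketch with the h-to-c simplification already applied. Two things are assumptions rather than part of the original listing: the CPS helper `isprime(n, k)`, which the article uses but does not define, and the `return` in front of each continuation call, which is there only so that the final value travels back to the Python caller.

```python
def isprime(n, k):
    """CPS primality test (assumed helper): passes True or False to continuation k."""
    if n < 2:
        return k(False)
    d = 2
    while d * d <= n:
        if n % d == 0:
            return k(False)
        d += 1
    return k(True)


def prodprimes(n, c):
    def k(b):
        if b:
            def j(p):
                return c(n * p)
            return prodprimes(n - 1, j)
        # Tail call: the continuation c is passed on directly instead of a wrapper h.
        return prodprimes(n - 1, c)

    if n == 1:
        return c(1)
    return isprime(n, k)


# The identity function serves as the initial continuation.
result = prodprimes(10, lambda v: v)
print(result)  # product of 2, 3, 5 and 7
```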

Although CPS is uniform and powerful, one big problem is that it is hard to read. This motivated A-Normal Form (ANF).

2. ANF is obtained by converting direct-style source code directly [4], without going through CPS.

ANF divides expressions into two categories: atomic expressions and compound expressions.

An atomic expression is a constant, a variable, a primitive, or an anonymous function. A compound expression is composed of multiple atomic expressions and can be viewed as a call to an anonymous function or a primitive: the first input of the combination is the function to call, and the remaining inputs are the arguments. A compound expression is either let-bound to a variable or appears only in the last (tail) position. Through let-bindings, ANF makes intermediate values, control flow, and evaluation order explicit. Its grammar is defined as follows [5]:

<aexp> ::= NUMBER | STRING | VAR | BOOLEAN | PRIMOP
          |  (lambda (VAR …) <exp>)
<cexp> ::= (<aexp> <aexp> …)
          |  (if <aexp> <exp> <exp>)
<exp> ::= (let ([VAR <cexp>]) <exp>) | <cexp> | <aexp>

For example, the prodprimes function above, if expressed in the above grammar, should be:

(define prodprimes
  (lambda (n)
    (let ([a (= n 1)])
      (if a
          1
          (let ([b (isprime n)])
            (if b
                (let ([m (- n 1)])
                  (let ([p (prodprimes m)])
                    (* n p)))
                (let ([s (- n 1)])
                  (prodprimes s)))))))))

This ANF form expression, if translated into python, should be similar to:

def prodprimes(n):
    r = n == 1
    if r:
        return 1
    b = isprime(n)
    if b:
        m = n - 1
        p = prodprimes(m)
        return n * p
    s = n - 1
    return prodprimes(s)

This code also shows that ANF is simpler and easier to understand than CPS.
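
The core step of ANF conversion, naming every nested call so that all operands become atomic, can be sketched in a few lines. The tuple encoding `(function, arg, ...)`, the temporary names `t0, t1, ...`, and the function `to_anf` are assumptions for illustration; the sketch handles only nested calls, not `if` or `lambda`, and it also binds the final call to a name, which is still valid ANF.

```python
import itertools


def to_anf(expr):
    """Flatten a nested call expression into let-bindings over atomic operands.

    expr is an atom (str variable or int constant) or a tuple (function, arg, ...).
    Returns (bindings, result_atom), where bindings is a list of (name, call).
    """
    counter = itertools.count()
    bindings = []

    def atomize(e):
        if not isinstance(e, tuple):
            return e                          # already atomic
        fn, *args = e
        atoms = [atomize(a) for a in args]    # operands are named left to right
        name = f"t{next(counter)}"
        bindings.append((name, (fn, *atoms)))
        return name

    result = atomize(expr)
    return bindings, result


# n * prodprimes(n - 1)
bindings, result = to_anf(("*", "n", ("prodprimes", ("-", "n", 1))))
for name, call in bindings:
    print(f"let {name} = {call}")
```

The order of the bindings is the evaluation order, which is exactly the information ANF makes explicit.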

The role of graph-layer IR in AI frameworks

Mainstream AI frameworks today all have a graph-layer IR. A good graph-layer IR benefits the compilation, optimization, and execution of AI models and is the foundation for efficient training and inference. From the training perspective, AI frameworks in the industry currently use three execution modes: Eager execution mode, Graph execution mode, and Staging (hybrid) execution mode. The high-performance modes (Graph and Staging) are both based on graph-layer IR.

Eager execution mode generally relies on the features of the host language (nowadays mostly Python) for interpreted execution, using techniques such as operator overloading and tapes.

Graph execution mode first obtains the graph structure of the AI model and then compiles, optimizes, and executes it; the compilation, optimization, and execution are all based on graph IR. There are currently three ways to obtain the graph structure of an AI model:

The first is for the programmer to build the graph with APIs (TF 1.x and others).

The second is Tracing JIT (a trend started by JAX and now supported by TF 2.0, PyTorch, and others). It runs the user's model script in a simulated fashion to obtain the forward execution sequence and then builds the graph from that sequence. The advantages are that it matches Eager mode more easily and is simple to implement. The disadvantages are that control flow is troublesome to convert, that it is hard to handle cases where the execution sequence depends on the results of operators, and that side effects are not easy to handle. This is why TF's AutoGraph also needs AST analysis to solve control-flow conversion.

The third is AST JIT (PyTorch's TorchScript), which builds the graph from Python's AST. The advantage is more comprehensive conversion, including control flow. The disadvantages are that it is complex to implement and that supporting Python's many dynamic features takes a lot of work.

Staging execution mode is similar to Eager mode but uses Python decorators to compile and accelerate selected subgraphs (via Tracing JIT or AST JIT); it also uses graph IR.
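
The tracing approach and its control-flow limitation can be illustrated with a toy tracer built on operator overloading. This is a hand-written sketch, not the mechanism of any particular framework; `Tracer`, `trace`, and `model` are hypothetical names.

```python
class Tracer:
    """Stands in for a tensor and records every operation applied to it."""

    def __init__(self, name, tape):
        self.name, self.tape = name, tape

    def _record(self, op, other):
        other_name = other.name if isinstance(other, Tracer) else repr(other)
        out = Tracer(f"%{len(self.tape)}", self.tape)
        self.tape.append((out.name, op, self.name, other_name))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def trace(fn, *arg_names):
    """Run fn once on tracer inputs and return the recorded op sequence."""
    tape = []
    out = fn(*(Tracer(n, tape) for n in arg_names))
    return tape, out.name


def model(x, y, use_bias=True):
    h = x * y
    if use_bias:      # ordinary Python branch: only the path actually taken is recorded
        h = h + 1
    return h


tape, output = trace(model, "x", "y")
for entry in tape:
    print(entry)
```

The recorded sequence contains no trace of the `if`: the branch was resolved while the script ran, which is the control-flow difficulty described above.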

From the inference perspective, when an AI framework generates the final inference model it generally performs many compilation optimizations on the graph-layer IR, such as quantization and pruning. The final inference model format also uses the graph-layer IR directly or indirectly.

Requirements and challenges of graph-layer IR in AI frameworks

Compared with other general-purpose IRs, the graph-layer IR of an AI framework has some special requirements and challenges:

Tensor expression: AI models mainly process tensor data, which is quite different from ordinary applications. However, adding a tensor data type is not difficult for a compiler IR.

Automatic differentiation: Differentiability is the biggest difference between AI model development and general application development. Modern AI frameworks all provide automatic differentiation. The challenges lie in simplicity of implementation, performance, and future extensibility to higher-order differentiation.

JIT capability: Whether in graph mode or staging mode, compilation is effectively JIT from the algorithm engineer's point of view, because the compilation step is not exposed. For JIT, compilation performance is a major challenge.

Implicit parallelism: From the developer's point of view there are two ways to parallelize. One is explicit parallelism, where the developer tells the system exactly where to parallelize, for example by explicitly starting multiple threads or adding parallel modifiers. The other is implicit parallelism, where the compiler analyzes dependences and parallelizes automatically. Generally speaking, a traditional CFG+BB compiler analyzes programs in total order, which makes explicit parallelism convenient, while a functional compiler can in theory analyze data dependences easily, which makes implicit parallel optimization convenient. Interestingly, in deep learning scenarios kernel execution accounts for most of the overhead, and asynchronous concurrency at runtime can already improve overall performance significantly, so implicit parallelism matters relatively less. It is still useful when extreme performance is the goal.
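
As a small illustration of what dependence-driven implicit parallelism looks like, the sketch below groups the nodes of a dependence graph into "waves" whose members do not depend on one another and could therefore run concurrently. The function `parallel_waves` and the example graph are assumptions; a real compiler would also weigh cost models and side effects.

```python
def parallel_waves(deps):
    """Group nodes into waves; nodes within one wave have no mutual data dependence."""
    remaining = {node: set(inputs) for node, inputs in deps.items()}
    waves = []
    while remaining:
        ready = sorted(n for n, inputs in remaining.items() if not inputs)
        if not ready:
            raise ValueError("dependency cycle")
        waves.append(ready)
        for n in ready:
            del remaining[n]
        for inputs in remaining.values():
            inputs.difference_update(ready)
    return waves


# Two branches h1 and h2 share the input x and are combined in out.
deps = {"x": [], "w1": [], "w2": [], "h1": ["x", "w1"],
        "h2": ["x", "w2"], "out": ["h1", "h2"]}
waves = parallel_waves(deps)
print(waves)
```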

Loop optimization: AI computation involves a large number of tensor operations, which for the compiler means loop optimization (tensor -> scalar -> vectorization). This challenge, however, lies mainly in the operator-layer IR.

Of course, the graph-layer IR is also a compiler IR and should be general-purpose, including basic capabilities such as a type system, control-flow and data-flow analysis, and side-effect elimination.

Some schools of thought in the industry on graph-layer IR

Computation graph IR: A DAG-centric approach that many early frameworks adopted. The design of a computation graph IR is fairly natural. The graph consists mainly of edges and nodes: nodes generally represent operators, variables, constants, and so on, while edges correspond to tensors and in effect express data dependences. Automatic differentiation and optimization are then performed on this DAG. The advantages of this approach are that it is simple and intuitive and that optimization has low performance overhead. The disadvantage is that a computation graph IR is not a truly formal compiler IR: its support for type systems, complex logic (such as recursion), side-effect handling, and control-flow and data-flow analysis is incomplete.
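
A minimal sketch of automatic differentiation on such a DAG: nodes are visited in topological order for the forward pass and in reverse topological order for the backward pass. The `Node` class, the two supported operators, and the scalar values are simplifying assumptions; real frameworks operate on tensors with many more operators.

```python
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value, self.grad = op, list(inputs), value, 0.0


def topo_order(output):
    """Return the nodes reachable from output, inputs before users."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for inp in node.inputs:
            visit(inp)
        order.append(node)

    visit(output)
    return order


def forward_backward(output):
    order = topo_order(output)
    for node in order:                        # forward pass
        if node.op == "add":
            node.value = node.inputs[0].value + node.inputs[1].value
        elif node.op == "mul":
            node.value = node.inputs[0].value * node.inputs[1].value
    output.grad = 1.0
    for node in reversed(order):              # backward pass
        if node.op == "add":
            for inp in node.inputs:
                inp.grad += node.grad
        elif node.op == "mul":
            a, b = node.inputs
            a.grad += node.grad * b.value
            b.grad += node.grad * a.value
    return output.value


x = Node("var", value=3.0)
y = Node("var", value=4.0)
out = Node("add", [Node("mul", [x, y]), x])   # out = x * y + x
value = forward_backward(out)
print(value, x.grad, y.grad)
```

Note that nothing here can express a loop or a recursive call, which is the limitation of the pure DAG form mentioned above.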

CFG+BB: The graph-layer IR is built on a traditional compiler IR, as in TorchScript and Julia. How is automatic differentiation achieved? Take Julia's Zygote as an example [6]: for ordinary code inside a basic block (no phi nodes, no branches), the AD code can be generated in reverse order with the help of the chain rule.

After expressing the above expression as SSA, inserting J and calculating AD, the pseudo SSA code as shown in the following figure can be obtained:

The node %6 in the figure above is called an "alpha node". It corresponds to node %6 in the primal code, that is, B3 in the upper row, the backward function of the "/" operation.

For control flow in the CFG, the control flow must be analyzed in reverse, and appropriate dummy phi nodes must be inserted into the primal CFG to record and replay the control flow. For example, consider this code for computing a power:

In the corresponding primal CFG, a phi node %1 is inserted as a dummy phi node to record the control flow. The AD CFG then uses %1 to steer control: %1 pushes the control flow onto a stack in the primal CFG, and the AD CFG pops the stack to replay it.

Through subsequent code optimization, AD's Power code is similar to the following pseudo code:

As can be seen, automatic differentiation on CFG+BB is ultimately implemented through iteration. SSA form with scopes has to solve the problem of passing values across scope boundaries, so automatic differentiation still brings some processing difficulties.
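
The record-and-replay idea can be sketched in plain Python. This is a simplified illustration and not Zygote's generated code: the primal loop pushes onto a stack the intermediate value each iteration will need, so the length of the stack also records how many times the loop body ran, and the backward loop pops the stack to replay the iterations in reverse. The function names are hypothetical.

```python
def pow_with_tape(x, n):
    """Primal loop for x**n; records one stack entry per executed iteration."""
    tape = []
    r = 1.0
    while n > 0:
        tape.append(r)   # value of r before this iteration, needed by the backward pass
        r = r * x
        n -= 1
    return r, tape


def grad_pow(x, n):
    """d(x**n)/dx obtained by replaying the recorded iterations in reverse."""
    _, tape = pow_with_tape(x, n)
    r_bar, x_bar = 1.0, 0.0
    while tape:
        r_prev = tape.pop()       # iterations are replayed last to first
        x_bar += r_bar * r_prev   # adjoint of `r = r * x` with respect to x
        r_bar = r_bar * x         # adjoint of `r = r * x` with respect to r
    return x_bar


print(grad_pow(5.0, 3))  # analytically 3 * 5**2
```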

How is optimization done? The CFG+BB form is converted into use-def and def-use chains for optimization.

How is parallel optimization done? Since CFG+BB is a total-order representation, it must be converted to use-def form and combined with side-effect information for analysis.

The advantages of the CFG+BB approach are complete functionality, mature solutions, and high reusability. However, the CFG+BB form requires some conversion work for automatic differentiation, graph optimization, and parallel optimization, so it is not as intuitive or efficient.

Functional IR

Functional IR is used as the graph-layer IR, as in Relay and Myia. How is automatic differentiation achieved? For code without control flow, AD is computed the same way as inside a basic block, as described above. For control flow, functional IR takes a different approach: it converts iteration into recursion and selects branches with a switch function. For example, the same pow() function as above:

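def switch(cond, true_fn, false_fn):
    # Selects one of the two functions as a value; it does not call either.
    return true_fn if cond else false_fn

def pow(x, n):
    return header_pow(n, 1, x)

def header_pow(phi_n, phi_r, x):
    def body_pow():
        phi_n_1 = phi_n - 1
        phi_r_1 = phi_r * x
        return header_pow(phi_n_1, phi_r_1, x)
    def after_pow():
        return phi_r
    # Loop condition: keep iterating via body_pow, otherwise return the result.
    f = switch(phi_n > 0, body_pow, after_pow)
    return f()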

Taking pow(5,3) as an example, the recursive calling process is as follows:

pow(5, 3) -> header_pow(3, 1, 5) -> body_pow() -> header_pow(2, 5, 5) -> body_pow() -> header_pow(1, 5*5, 5) -> body_pow() -> header_pow(0, 5*5*5, 5) -> after_pow() (which returns 5*5*5)

As can be seen, the calls and returns of the recursion correspond respectively to the push and pop operations of the control-flow phi node in the CFG+BB scheme above.

Since AD is a process of transforming functions, the graph after AD is also a structure of recursive calls. There is therefore no need for a control-flow phi node with push and pop operations as in CFG+BB; the recursive calls naturally take the place of pushing and popping.

Derivative with respect to x

def x_grad_pow(x, n):
    phi_n = n
    phi_r = 1
    return x_bprop_header_pow(phi_n, phi_r, x, 1)

def x_bprop_header_pow(phi_n, phi_r, x, sens):
    def env_x_bprop_body_pow():
        # Gradient flowing through the x argument of the recursive call.
        v3 = x_bprop_header_pow(phi_n - 1, phi_r * x, x, 1)
        # Gradient flowing through the phi_r argument (phi_r * x depends on x).
        v4 = phi_r_bprop_header_pow(phi_n - 1, phi_r * x, x, 1)
        v5 = v4 * phi_r
        return v3 + v5
    def env_x_bprop_after_pow():
        return 0

    f = switch(phi_n > 0, env_x_bprop_body_pow, env_x_bprop_after_pow)
    r = f()
    return r

def phi_r_bprop_header_pow(phi_n, phi_r, x, sens):
    def env_phi_r_bprop_body_pow():
        v3 = phi_r_bprop_header_pow(phi_n - 1, phi_r * x, x, 1)
        v4 = v3 * x
        return v4

    def env_phi_r_bprop_after_pow():
        return 1

    if phi_n > 0:
        v5 = env_phi_r_bprop_body_pow()
    else:
        v5 = env_phi_r_bprop_after_pow()
    return v5

The advantages of functional IR are that it is friendly to automatic differentiation and well suited to parallel analysis. The challenges lie in eliminating side effects and in execution performance (recursion, for instance, is not friendly to execution).

MindSpore design thinking

The graph-layer IR of MindSpore is called MindIR. The technical route MindIR takes is a functional graph IR (drawing on Sea of Nodes, Thorin, Myia, and others), with the following characteristics:

Functional: enables a more natural implementation of automatic differentiation and more convenient implicit parallel analysis.

Functions are first-class citizens and higher-order functions are supported. Control flow is also implemented with special functions, so differentiation can be implemented in a uniform way.

Functions are implemented without side effects. Compared with imperative languages, this simplifies analysis and enables more optimizations.

Closures are supported natively. On the one hand, closures in the user's source code are easy to represent. On the other hand, this naturally supports the requirement in the automatic differentiation algorithm that the backward function access intermediate results of the original function: the backward function accesses the intermediate results and is returned as a closure.

Partial-order analysis based on data dependence is used, which facilitates out-of-order or parallel execution.

Graph based: better suited to fast JIT optimization. It uses a single-level representation similar to Sea of Nodes IR in which control flow and data flow are unified, which suits JIT optimization.

ANF form: Like Thorin, MindIR uses a graph IR and eliminates scopes. Unlike Thorin IR, it does not use CPS but ANF, which has similar expressive power and is more intuitive and easier to inspect.

In short, MindIR aims to make automatic differentiation and implicit parallel analysis more convenient through its functional approach, and to support more efficient JIT optimization through its graph-based approach, which unifies control flow and data flow.

1. Detailed explanation of MindIR [7]

The MindIR grammar is inherited from ANF and is defined as follows:

<ANode> ::= <ValueNode> | <ParameterNode>
<ParameterNode> ::= Parameter
<ValueNode> ::= Scalar | Named | Tensor | Type | Shape
               | Primitive | MetaFuncGraph | FuncGraph
<CNode> ::= (<AnfNode> …)
<AnfNode> ::= <CNode> | <ANode>

ANode in MindIR corresponds to the atomic expression of ANF. ANode has two subclasses, ValueNode and ParameterNode. A ValueNode is a constant node: it can carry a constant value (scalar, symbol, tensor, type, dimension, and so on), or it can be a primitive function (Primitive), a meta function (MetaFuncGraph), or an ordinary function (FuncGraph), because in functional programming a function definition is itself a value. A ParameterNode is a parameter node that represents a formal parameter of a function.

CNode in MindIR corresponds to the compound expression of ANF and represents a function call.

During automatic differentiation in MindSpore, the gradient contributions of ParameterNode and CNode are computed and the gradient of the final ParameterNode is returned; the gradient of ValueNode is not computed.

Let’s take a program as an example to compare and understand MindIR

def func(x, y):
    return x / y

@ms_function
def test_f(x, y):
    a = x - 1
    b = a + y
    c = b * func(a, b)
    return c

The ANF corresponding to this Python code is expressed as:

lambda (x, y)
    let a = x - 1 in
    let b = a + y in
    let func = lambda (x, y)
        let ret = x / y in
        ret end in
    let %1 = func(a, b) in
    let c = b * %1 in
    c end

The corresponding MindIR is: https://w.url.cn/s/Ansh1KW

In MindIR, a function graph (FuncGraph) represents the definition of a common function. The function graph is generally composed of ParameterNode, ValueNode and CNode as a directed acyclic graph, which can clearly express the calculation process from parameters to return values. It can be seen that the two functions test_f and func in the python code are converted into two function graphs, the parameters x and y are converted into ParameterNode of the function graph, and each expression is converted into a CNode. The first input of CNode is linked to the called function, such as add, func, and return in the figure. It is worth noting that these nodes are ValueNodes, because they are understood as constant function values. The other input of CNode links the parameters of this call, and the parameter values can come from ParameterNode, ValueNode and other CNodes.

In ANF, each expression is bound to a variable by a let expression, and a dependence on the output of the expression is expressed by referring to that variable. In MindIR, each expression is bound to a node, and dependences are expressed by directed edges between nodes.
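
The node structure described above can be mimicked with a few plain Python classes. This is a toy model for illustration only and is not MindSpore's API: the class definitions, the `PRIMS` table, and the `evaluate` function are hypothetical, the explicit return node is omitted, and primitives are identified by plain strings.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class ParameterNode:
    name: str


@dataclass(eq=False)
class ValueNode:
    value: object          # a constant, a primitive name, or a FuncGraph


@dataclass(eq=False)
class CNode:
    inputs: list           # inputs[0] is the callee, the rest are the arguments


@dataclass(eq=False)
class FuncGraph:
    name: str
    params: list
    output: object = None


PRIMS = {"sub": lambda a, b: a - b, "add": lambda a, b: a + b,
         "mul": lambda a, b: a * b, "div": lambda a, b: a / b}


def evaluate(graph, args):
    """Evaluate a function graph by following edges from its output node."""
    env = {id(p): a for p, a in zip(graph.params, args)}

    def ev(node):
        if isinstance(node, ParameterNode):
            return env[id(node)]
        if isinstance(node, ValueNode):
            return node.value
        callee, *operands = [ev(i) for i in node.inputs]
        if isinstance(callee, FuncGraph):
            return evaluate(callee, operands)   # call into another function graph
        return PRIMS[callee](*operands)

    return ev(graph.output)


# func(x, y) = x / y
fx, fy = ParameterNode("x"), ParameterNode("y")
func = FuncGraph("func", [fx, fy])
func.output = CNode([ValueNode("div"), fx, fy])

# test_f(x, y): a = x - 1; b = a + y; c = b * func(a, b)
x, y = ParameterNode("x"), ParameterNode("y")
a = CNode([ValueNode("sub"), x, ValueNode(1)])
b = CNode([ValueNode("add"), a, y])
call = CNode([ValueNode(func), a, b])
test_f = FuncGraph("test_f", [x, y])
test_f.output = CNode([ValueNode("mul"), b, call])

result = evaluate(test_f, [3, 2])
print(result)
```

The nodes `a` and `b` are each referenced by more than one CNode, so the structure is a graph rather than a tree, and the callee `func` enters the graph as an ordinary ValueNode.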

Functional semantics

An important feature of MindIR compared to traditional computational graphs is that it can not only express the data dependence between operators, but also express rich functional semantics.

Higher order function

In MindIR, the definition of a function is defined by a subgraph, but it can itself be a passed value as the input or output of other higher-order functions. For example, in the following simple example, function f is passed into function g as a parameter, so function g is a higher-order function that receives function input, and the real call point of function f is inside function g

@ms_function
def hof(x):
    def f(x):
        return x + 3
    def g(function, x):
        return function(x) * function(x)
    res = g(f, x)
    return res

The corresponding MindIR is: https://w.url.cn/s/A8vb8X3

In real network training scripts, the automatic differentiation functional GradOperation and the Partial and HyperMap functions commonly used in optimizers are all typical higher-order functions. Higher-order semantics greatly improves the flexibility and conciseness of MindSpore's expression.

Control flow

The control flow in MindIR is expressed in the form of high-order function selection calls. This form converts the control flow into a higher-order function data flow, which makes the automatic differentiation algorithm more powerful. It can not only support automatic differentiation of data flow, but also automatic differentiation of control flow such as conditional jump, loop and recursion. Here is a simple Fibonacci use case to demonstrate

@ms_function
def fibonacci(n):
    if n < 1:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

The corresponding MindIR is: https://w.url.cn/s/AUiE9Mc

Here, fibonacci is the top-level function graph. At the top level, switch selects between two function graphs to call: ✓fibonacci is the True branch of the first if, and ✗fibonacci is the False branch of the first if. ✓✗fibonacci, called in ✗fibonacci, is the True branch of the elif, and ✗✗fibonacci is the False branch of the elif.

The key point is that in MindIR, conditional jumps and recursion are expressed as higher-order control flow. For example, ✓fibonacci and ✗fibonacci are passed in as arguments to the switch operator, and switch selects one of the functions as its return value according to the condition argument. switch is therefore a binary selection operation that treats its input functions as ordinary values and does not call them. The actual function call happens at the CNode immediately following switch.
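
The same structure can be written out in plain Python to show the separation between selecting a branch and calling it. This is a sketch of the idea only; the helper `switch` and the nested function names are hypothetical and merely mirror the ✓/✗ sub-graphs described above.

```python
def switch(cond, true_fn, false_fn):
    """Select one of two functions as a value; neither is called here."""
    return true_fn if cond else false_fn


def fibonacci(n):
    def true_branch():                 # corresponds to the True branch of `if n < 1`
        return 0

    def false_branch():                # corresponds to the False branch of `if n < 1`
        def elif_true():               # True branch of `elif n == 1`
            return 1

        def elif_false():              # False branch of `elif n == 1`
            return fibonacci(n - 1) + fibonacci(n - 2)

        return switch(n == 1, elif_true, elif_false)()

    # switch only picks a function; the trailing () is the actual call.
    return switch(n < 1, true_branch, false_branch)()


result = fibonacci(10)
print(result)
```

Because the branches are passed as functions rather than evaluated eagerly, the recursive branch runs only when it is selected, so the recursion terminates.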

Free variables and closures

A free variable is a reference inside a code block to a variable in the enclosing scope environment rather than to a local variable.

A closure is a programming language feature: the combination of a code block and its scope environment.

In MindIR, the code block is presented as a function graph, and the scope environment can be understood as the context environment when the function is called. Free variables are captured by value copy rather than by reference.

A typical closure use case is as follows:

@ms_function
def func_outer(a, b):
    def func_inner(c):
        return a + b + c
    return func_inner

@ms_function
def ms_closure():
    closure = func_outer(1, 2)
    out1 = closure(1)
    out2 = closure(2)
    return out1, out2

The corresponding MindIR is: https://w.url.cn/s/AsUMXTS

In the example, a and b are free variables, because the variables a and b in func_inner are the parameters defined in the referenced parent graph func_outer. The variable closure is a closure, which is a combination of the function func_inner and its context func_outer(1, 2). Therefore, the result of out1 is 4 because it is equivalent to 1+2+1, and the result of out2 is 5 because it is equivalent to 1+2+2
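
The same closure behaves identically in plain Python without the @ms_function decorator, which makes it easy to inspect what is captured. The sketch below uses only standard Python introspection (`__closure__` and `__code__.co_freevars`); note that Python stores captured variables in cells that are shared by reference, whereas the text above describes MindIR as capturing free variables by value copy.

```python
def func_outer(a, b):
    def func_inner(c):
        return a + b + c      # a and b are free variables of func_inner
    return func_inner


closure = func_outer(1, 2)

# Map each free variable name to the value stored in its closure cell.
captured = dict(zip(closure.__code__.co_freevars,
                    (cell.cell_contents for cell in closure.__closure__)))
out1, out2 = closure(1), closure(2)
print(captured, out1, out2)
```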

References

[1] "Engineering a Compiler", Second Edition, Chapter 5: Intermediate Representations

[2] "Combining Analyses, Combining Optimizations"

[3] "Compiling with Continuations", Chapter 1

[4] "Functional programming languages, Part V: functional intermediate representations"

[5] matt.might.net/articles

[6] "Don't Unroll Adjoint: Differentiating SSA-Form Programs"

[7] mindspore.cn/doc/note/z
