Front End - Technical Deep Dive | NeCodeGen: a source-to-source translation tool based on clang - NetEase Yunxin Tech Station

Introduction: We live in a diverse world with a wide variety of operating systems, programming languages, and technology stacks. This diversity poses a challenge for software providers: how can they quickly cover all these systems and technology stacks to meet the needs of users with different backgrounds? Drawing on how it is applied at NetEase Yunxin, this article introduces a clang-based source-to-source translation tool in detail.

Author | Kaikai, Senior C++ Development Engineer at NetEase Yunxin

01 Preface

We live in a diverse world with a wide variety of operating systems, programming languages, and technology stacks.

Such a variety of technology stacks poses a challenge for software providers: how can they quickly cover these systems and technology stacks to meet the needs of users with different backgrounds?

Taking NetEase Yunxin IM as an example, its R&D process is roughly as follows:

(https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/5421c9de9f5d4dd5830b2257cf54a295~tplv-k3u1fbpfcp-zoom-1.image)

As the business has grown, NetEase Yunxin IM has accumulated more and more APIs (hundreds of them). To adapt to other platforms, engineers need to invest a lot of time in writing language bindings. This work is tedious, time-consuming, and highly repetitive. In the maintenance phase, every change to the C++ interface has to be synchronized to each language binding, and the slightest omission causes problems. To improve productivity and R&D efficiency, and to free engineers from repetitive and heavy "manual work" so that they can focus on developing important features, the NetEase Yunxin front-end team developed NeCodeGen, a clang-based source-to-source translation tool. This article introduces NeCodeGen in the hope of offering solutions and ideas to engineers who face the same problem.

02 Why reinvent the wheel?

The NetEase Yunxin team has many flexible customization requirements for language bindings:

At the implementation level: we need to be able to customize the naming style, the implementation details of methods, the encapsulation of business logic, and so on;
From the perspective of interface usability and friendliness: as a software provider, we need to ensure that the API is easy to use and follows the best practices of each language.

After investigating the popular tools of this kind, we found that their support for customizing the generated code is insufficient: it is difficult for users to control the generated code, so these tools cannot meet the needs above. For this reason, the Yunxin team developed NeCodeGen based on its own needs. It gives users complete control over the generated code through code templates, which makes it a general and flexible tool.

In today's open source world there are many excellent tools that automatically generate language bindings, such as the powerful SWIG and dart ffigen. The main goal of NeCodeGen is to meet flexible customization needs, so it complements the existing toolset. Within the Yunxin team it is often combined with other code generation tools to improve R&D efficiency. The following is one such application scenario at Yunxin:

Since dart ffigen only supports C interfaces, we first use NeCodeGen to develop an application that generates a C API and the corresponding C implementation, and then use dart ffigen to generate a dart binding from the C API. Because the dart binding generated by dart ffigen makes heavy use of dart ffi, it does not meet our requirements for ease of use and friendliness (it is referred to here as the low-level dart binding), so it needs to be wrapped further. Yunxin uses NeCodeGen again to generate a friendlier, easier-to-use high-level dart binding, which relies on the low-level dart binding for its implementation.

03 Introduction to NeCodeGen

NeCodeGen is a code generation framework released as a Python package, on top of which engineers can develop their own applications. Its purpose is to reduce development costs for users with the same needs and to provide best engineering practices for this kind of problem. It has the following features:

Flexible to use: the built-in template engine jinja lets engineers describe code templates flexibly in the jinja template language;
Supports generating programs in multiple target languages from C++ at the same time, which makes it convenient to manage several target language programs together; this is similar to SWIG;
Provides best engineering practices;
Makes full use of Python's syntactic sugar.

In terms of implementation, NeCodeGen uses Python 3 as the development language, Libclang as the compiler front end, and jinja as the template engine. It borrows ideas from:

Flask, a very popular web framework in Python;
clang's LibASTMatchers and LibTooling;
SWIG.

The individual parts of NeCodeGen are described in more detail below.

04 Introduction to clang

clang is the C-family language compiler front end of the LLVM project. The languages it supports include C, C++, and Objective-C/C++. clang adopts a library-based architecture, which means that its functional modules are implemented as independent libraries that engineers can use directly; in addition, clang's AST fully reflects the information in the source code. These features help engineers build tools on top of clang; a typical example is clang-format. NetEase Yunxin engineers chose clang as the compiler front end of NeCodeGen.

To do a good job, one must first sharpen one's tools: learning the clang AST

Let's do some preparatory work first and learn the clang AST, which is a prerequisite for using it to implement a source-to-source translation tool. Readers who have already mastered the clang AST can skip this part. The clang AST is complex, and this complexity stems fundamentally from the complexity of the C++ language. This section uses Libclang's Python binding to guide the reader through the clang AST in a practical, exploratory way.

Readers first need to install Libclang's Python binding with the following command:

pip install libclang

For ease of demonstration, the C++ code is not saved to a file but passed to Libclang as a string for compilation. The complete program is as follows:

import clang.cindex

code = """
#include <string>
/// test function
int fooFunc(){
    return 1;
}
/// test class
class FooClass{
    int m1 = 0;
    std::string m2 = "hello";
    int fooMethod(){
        return 1;
    }
};
int main(){
    fooFunc();
    FooClass foo2;
}"""  # C++ source code
index = clang.cindex.Index.create()  # create the index object

# parse test.cpp; its contents are supplied in memory through unsaved_files
translation_unit = index.parse(path='test.cpp', unsaved_files=[('test.cpp', code)], args=['-std=c++11'])

The index.parse function parses the C++ code; its args parameter holds the compiler arguments, and unsaved_files supplies the in-memory contents of test.cpp.

Translation unit

The return value type of the index.parse function is clang.cindex.TranslationUnit (translation unit), we can use Python's type function to verify:

 type(translation_unit) 
 Out[6]: clang.cindex.TranslationUnit

Viewing the included headers

for i in translation_unit.get_includes():    
    print(i.include.name)

All header files included in the translation unit can be listed by calling get_includes(). Readers who actually run this will find that far more than <string> is listed: <string> itself includes other header files, which in turn include still others, and the compiler has to include them one by one.

get_children

The cursor property of clang.cindex.TranslationUnit represents its AST, let's verify its type:

 type(translation_unit.cursor) 
 Out[9]: clang.cindex.Cursor

As you can see from the output, its type is clang.cindex.Cursor ; its member method get_children() can return its immediate children:

for child in translation_unit.cursor.get_children():
    print(f'{child.location}, {child.kind}, {child.spelling}')

An excerpt of the output is as follows:

......
<SourceLocation file 'D:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\string', line 24, column 1>, CursorKind.NAMESPACE, std
<SourceLocation file 'test.cpp', line 4, column 5>, CursorKind.FUNCTION_DECL, fooFunc
<SourceLocation file 'test.cpp', line 8, column 7>, CursorKind.CLASS_DECL, FooClass

"......" means that part of the output is omitted. Look closely at the entries located in test.cpp: they match the source code exactly, which confirms the earlier statement that the clang AST fully reflects the information in the source code.

DECL is short for "declaration".

walk_preorder

The walk_preorder method of clang.cindex.Cursor performs a preorder traversal of the AST:

children = list(translation_unit.cursor.get_children())
foo_class_node = children[-2]  # select the subtree of class FooClass
for child in foo_class_node.walk_preorder():  # preorder traversal
    print(f'{child.location}, {child.kind}, {child.spelling}')

The code above performs a preorder traversal of the AST of class FooClass. The output is as follows:

<SourceLocation file 'test.cpp', line 8, column 7>, CursorKind.CLASS_DECL, FooClass
<SourceLocation file 'test.cpp', line 9, column 9>, CursorKind.FIELD_DECL, m1
<SourceLocation file 'test.cpp', line 9, column 14>, CursorKind.INTEGER_LITERAL, 
<SourceLocation file 'test.cpp', line 10, column 17>, CursorKind.FIELD_DECL, m2
<SourceLocation file 'test.cpp', line 10, column 5>, CursorKind.NAMESPACE_REF, std
<SourceLocation file 'test.cpp', line 10, column 10>, CursorKind.TYPE_REF, std::string
<SourceLocation file 'test.cpp', line 11, column 9>, CursorKind.CXX_METHOD, fooMethod
<SourceLocation file 'test.cpp', line 11, column 20>, CursorKind.COMPOUND_STMT, 
<SourceLocation file 'test.cpp', line 12, column 9>, CursorKind.RETURN_STMT, 
<SourceLocation file 'test.cpp', line 12, column 16>, CursorKind.INTEGER_LITERAL,

Readers are asked to compare the above output with the source code by themselves.

AST node: clang.cindex.Cursor

The most important members of clang.cindex.Cursor are:

kind, whose type is clang.cindex.CursorKind;
type, whose type is clang.cindex.Type and through which type information can be obtained;
spelling, which is the name of the node.
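To make these three members concrete, here is a minimal sketch that collects the name and type of every field in a class. It assumes the libclang package installed above; the Point class and the list_fields helper are hypothetical names used only for illustration.

```python
import clang.cindex
from clang.cindex import CursorKind

# Hypothetical C++ snippet used only for this illustration
code = """
class Point {
public:
    int x = 0;
    double y = 0.0;
};
"""

index = clang.cindex.Index.create()
tu = index.parse(path='point.cpp', unsaved_files=[('point.cpp', code)], args=['-std=c++11'])


def list_fields(root):
    """Return (spelling, type spelling) pairs for every field declaration below root."""
    return [
        (node.spelling, node.type.spelling)
        for node in root.walk_preorder()
        if node.kind == CursorKind.FIELD_DECL  # kind tells us what the node is
    ]


fields = list_fields(tu.cursor)
print(fields)
```

Here kind is used to filter the nodes, spelling gives the field name, and type.spelling gives the type as written in the source.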

05 Introduction to the jinja template engine

Since jinja is used in the examples that follow, it is briefly introduced first. There is no need to be intimidated: jinja is very easy to learn. Templates are not a new concept, and readers familiar with template metaprogramming already know the idea. jinja's template language is also very close to Python, so it introduces few new concepts, and most jinja concepts can be understood by analogy with concepts we already know.

Here is a simple jinja template and a program that renders it:

from typing import List

from jinja2 import Environment, BaseLoader

jinja_env = Environment(loader=BaseLoader())
view_template = jinja_env.from_string(
    'I am {{m.name}}, I am familiar with {%- for itor in m.languages %} {{itor}}, {%- endfor %}')  # jinja template


class ProgrammerModel:
    """
    model
    """

    def __init__(self):
        self.name = ''  # name
        self.languages: List[str] = []  # languages the programmer has mastered


def controller():
    xiao_ming = ProgrammerModel()
    xiao_ming.name = 'Xiao Ming'
    xiao_ming.languages.append('Python')
    xiao_ming.languages.append('Cpp')
    xiao_ming.languages.append('C')
    print(view_template.render(m=xiao_ming))


if __name__ == '__main__':
    controller()

The program above defines view_template, a simple template for a software engineer's self-introduction, and then renders it to obtain the complete text. Running the program produces the following output:

I am Xiao Ming, I am familiar with Python, Cpp, C,

A jinja template variable is actually a "template parameter"

Comparing view_template carefully with the final output shows that the parts enclosed in {{ }} are replaced. Such a part is a jinja template variable, that is, a "template parameter"; its syntax is {{template variable}}.

The MVC design pattern

The program above actually uses the MVC design pattern: ProgrammerModel is the model, view_template is the view, and the controller() function is the controller.

The programs that follow continue to use this design pattern, and NeCodeGen strongly recommends that engineers build their applications with it. A dedicated section later in this article introduces the MVC design pattern.

jinja rendering is a lot like "replacement"

view_template.render(m=xiao_ming) renders the template. This process can be understood simply as "replacement": the template parameter m is replaced by the variable xiao_ming. By analogy with function parameters, m is the formal parameter and the variable xiao_ming is the actual argument.
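The "replacement" view of rendering can be illustrated by rendering one template with two different arguments. This is a minimal sketch; the Programmer class and the greeting template are hypothetical and not part of the article's program.

```python
from jinja2 import Environment, BaseLoader

env = Environment(loader=BaseLoader())
# m is the template parameter; each render() call supplies a different argument for it
greeting = env.from_string('Hello, {{ m.name }}! You know {{ m.languages | length }} languages.')


class Programmer:
    """Hypothetical model used only for this illustration."""

    def __init__(self, name, languages):
        self.name = name
        self.languages = languages


first = greeting.render(m=Programmer('Xiao Ming', ['Python', 'C']))
second = greeting.render(m=Programmer('Xiao Hong', ['Go']))
print(first)
print(second)
```

The template is written once, and only the argument bound to m changes between the two calls.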

06 Abstraction and code templates

When code duplication appears in a program, we first think of programming techniques such as generic programming, metaprogramming, and annotations, which help engineers simplify code. However, different programming languages offer different abstraction mechanisms, and for some programming tasks none of these techniques help. As a result, engineers inevitably end up repeating code that follows the same pattern, and this problem is particularly prominent when implementing language bindings.

For this type of problem, the solution given by NeCodeGen is:

For repetitive code, engineers abstract its common pattern (the code template) and then describe the code template in a template language. In NeCodeGen, the template language is jinja;
NeCodeGen compiles the source file and generates the AST. The engineer extracts the necessary data from the AST, transforms it (see section 07 below), and then passes the transformed data as the actual arguments for the template parameters of the code template in order to render it, thereby obtaining the target code.

The following simple example describes this solution more concretely. In this example, the engineer needs to define in TypeScript the equivalent of a struct defined in C++. The concrete example is the C++ struct NIM_AuthInfo and its TypeScript counterpart:

(Table: C++ struct NIM_AuthInfo | equivalent TypeScript definition)

Now we need to think about how to automate this task. Clearly, clang gives us the AST of struct NIM_AuthInfo; we also need to consider the following questions:

Q1: What is the correspondence between C++ types and TypeScript types?

A: std::string -> string, int -> number

Q2: How should a C++ struct be named in TypeScript?

A: For simplicity, we keep the names in TypeScript and C++ consistent.

Q3: What TypeScript syntax describes something like a C++ struct?

A: A TypeScript interface. We can write a general code template in jinja to describe it.

The concrete implementation follows. Following the idea introduced in the MVC part above, we first model the struct as data:

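from typing import List


class StructFieldModel:
    """Model of a single struct field."""

    def __init__(self):
        self.src_name = ''  # name in the source language
        self.des_name = ''  # name in the target language
        self.to_type_name = ''  # type name in the target language


class StructModel:
    """Model of a struct."""

    def __init__(self):
        self.src_name = ''  # name in the source language
        self.des_name = ''  # name in the target language
        self.fields: List[StructFieldModel] = []  # fields of the struct

Then we write the TypeScript code template, which is based on StructModel: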

export interface {{m.des_name}} {
{% for itor in m.fields %}{{itor.des_name}} : {{itor.to_type_name}} ,
{% endfor %}
}
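To see what such a template produces, the sketch below renders a similar template with a hand-filled model. It is only an illustration: the field names appKey and expireTime are hypothetical (the actual fields of NIM_AuthInfo are not given here), and the trim_blocks and lstrip_blocks options are an assumed choice to keep the output tidy.

```python
from dataclasses import dataclass, field
from typing import List

from jinja2 import Environment, BaseLoader


@dataclass
class StructFieldModel:
    des_name: str = ''      # name in the target language
    to_type_name: str = ''  # type name in the target language


@dataclass
class StructModel:
    des_name: str = ''
    fields: List[StructFieldModel] = field(default_factory=list)


# trim_blocks / lstrip_blocks remove the line breaks that block tags would otherwise leave behind
env = Environment(loader=BaseLoader(), trim_blocks=True, lstrip_blocks=True)
view_template = env.from_string(
    'export interface {{ m.des_name }} {\n'
    '{% for itor in m.fields %}\n'
    '  {{ itor.des_name }}: {{ itor.to_type_name }},\n'
    '{% endfor %}\n'
    '}\n'
)

# Hypothetical fields, filled in by hand instead of being extracted from an AST
model = StructModel('NIM_AuthInfo', [
    StructFieldModel('appKey', 'string'),
    StructFieldModel('expireTime', 'number'),
])
ts_code = view_template.render(m=model)
print(ts_code)
```

Filling the model by hand like this is also a convenient way to test a code template before the AST extraction exists.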

The next step is to extract the key data from the AST of the C++ struct and apply the necessary transformations:

def controller(struct_node: clang.cindex.Cursor, model: StructModel) -> str:
    # view_template and map_type are not shown in this excerpt
    model.src_name = model.des_name = struct_node.spelling  # extract the struct name
    for field_node in struct_node.get_children():
        if field_node.kind != clang.cindex.CursorKind.FIELD_DECL:
            continue  # skip children that are not fields
        field_model = StructFieldModel()
        field_model.src_name = field_model.des_name = field_node.spelling  # extract the field name
        field_model.to_type_name = map_type(field_node.type.spelling)  # apply the type mapping
        model.fields.append(field_model)
    return view_template.render(m=model)  # render the template to obtain the TypeScript code
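The controller calls map_type, which is not shown in this excerpt. One possible shape for it is sketched below; the CPP_TO_TS table and the handling of const are assumptions made for illustration, not the code of the full program. Note that TypeScript represents both integers and floating-point values with number.

```python
# Assumed mapping table; extend it to cover the types used in your own headers
CPP_TO_TS = {
    'std::string': 'string',
    'int': 'number',
    'double': 'number',
    'bool': 'boolean',
}


def map_type(cpp_type: str) -> str:
    """Map a C++ type spelling to a TypeScript type name."""
    key = cpp_type.replace('const ', '').strip()  # ignore a const qualifier
    try:
        return CPP_TO_TS[key]
    except KeyError:
        # Failing loudly is safer than silently emitting a wrong type
        raise ValueError(f'no TypeScript mapping for C++ type {cpp_type!r}') from None


print(map_type('int'))
print(map_type('const std::string'))
```

Raising an error for an unknown type makes gaps in the table visible as soon as the tool runs, rather than in the generated code.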

Complete program

The full program can be obtained at the following link:

https://github.com/dengking/clang-based-src2src-demo-code

07 Translation from source language to target language

When translating a program written in the source language into a program in the target language, the following three transformations are mainly involved:

Type conversion (type mapping)

This is the conversion from a type in the source language to a type in the target language. NeCodeGen enumerates and predefines the built-in types of the C++ language and the types of the C++ standard library, and uses a hash map to establish the mapping for them. For user-defined types NeCodeGen cannot provide a predefined mapping, so the engineer has to define it.
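The split between predefined and user-defined mappings could be organized as follows. This is a sketch under assumptions: TypeMapper, register, and lookup are hypothetical names and do not describe NeCodeGen's actual API.

```python
from collections import ChainMap

# Predefined mappings for built-in and standard library types (illustrative subset)
BUILTIN_TYPES = {'int': 'number', 'bool': 'boolean', 'std::string': 'string'}


class TypeMapper:
    """Looks up user-defined mappings first and falls back to the predefined ones."""

    def __init__(self):
        self._user_types = {}
        self._table = ChainMap(self._user_types, BUILTIN_TYPES)

    def register(self, src_type: str, dst_type: str) -> None:
        """Register a mapping for a user-defined type."""
        self._user_types[src_type] = dst_type

    def lookup(self, src_type: str) -> str:
        if src_type not in self._table:
            raise KeyError(f'no mapping registered for {src_type!r}')
        return self._table[src_type]


mapper = TypeMapper()
mapper.register('NIM_AuthInfo', 'NIM_AuthInfo')  # user-defined struct keeps its name
print(mapper.lookup('std::string'))
print(mapper.lookup('NIM_AuthInfo'))
```

Because user entries sit in front of the predefined table, an application can also override a predefined mapping without modifying the shared dictionary.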

Name conversion (name mapping)

Different languages have different naming conventions, so engineers need to consider how names are converted. If the source program follows a uniform naming convention, regular expressions make the conversion easy and ensure that the generated program strictly follows the naming convention set by the user. This also reflects an advantage of automated code generation tools: a program adheres to naming conventions more strictly than engineers do.
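As an illustration of regex-based name conversion, the sketch below converts between camelCase and snake_case. The two helper functions are hypothetical and assume that the source names follow one convention consistently.

```python
import re


def camel_to_snake(name: str) -> str:
    """fooMethod -> foo_method"""
    # Insert an underscore before each uppercase letter that follows a lowercase letter or digit
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()


def snake_to_camel(name: str) -> str:
    """foo_method -> fooMethod"""
    # Drop each underscore and uppercase the character that follows it
    return re.sub(r'_([a-z0-9])', lambda m: m.group(1).upper(), name)


print(camel_to_snake('fooMethod'))
print(snake_to_camel('foo_method'))
```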

Syntax conversion (syntax mapping)

In NetEase Yunxin's NeCodeGen, syntax conversion is done mainly through code templates. Engineers write code templates according to the syntax of the target language and then render them into programs that conform to that syntax.

08 Design patterns in NeCodeGen

By now, readers have a basic understanding of Yunxin NeCodeGen. This section introduces the design patterns that Yunxin NeCodeGen recommends, and for which its implementation provides basic support. These design patterns were distilled from engineering practice and help engineers develop applications that are easier to maintain. Because the C++ language is complex, processing its AST is also complex, so appropriate design patterns are particularly important for a project.

Matcher

When writing a source-to-source translation tool, a common pattern is to match the nodes of interest and then process the matched nodes accordingly, for example by converting names or types. The Matcher pattern was created for this typical requirement: the framework traverses the AST and executes the match functions registered by the user; once a match succeeds, the callback corresponding to that match function is executed. This pattern was distilled by the clang community for developing clang tools, and clang provides the support library LibASTMatchers for it. For details, readers can refer to the clang documentation on LibASTMatchers.

Yunxin NeCodeGen draws on this pattern and adapts it to the characteristics of the Python language and its own needs, using Python's decorator syntactic sugar. The general form is as follows:

@frontend_action.connect(match_func)
def callback():    
    pass

The code above tells frontend_action to connect the match function match_func with the callback function callback. While traversing the AST, frontend_action passes each node as the input argument to all match functions registered with it, in turn. If a match function returns True, the match succeeded and the framework executes the callback to process the matched node; otherwise the node is skipped.

In practice, applying this pattern leads to a clearer structure and a higher degree of code reuse.

Currently, clang does not officially provide a Python binding for LibASTMatchers. For the convenience of users, Yunxin NeCodeGen provides match functions for matching common kinds of nodes.
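The mechanism behind such a decorator can be sketched in a few lines. This is not NeCodeGen's implementation: the FrontendAction class, its run method, the stand-in Node class, and the choice to pass the matched node to the callback are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Stand-in for an AST node; a real tool would use clang.cindex.Cursor."""
    kind: str
    spelling: str
    children: List['Node'] = field(default_factory=list)

    def walk_preorder(self):
        yield self
        for child in self.children:
            yield from child.walk_preorder()


class FrontendAction:
    def __init__(self):
        self._handlers: List[Tuple[Callable, Callable]] = []

    def connect(self, match_func):
        """Decorator factory: register callback to run for nodes accepted by match_func."""
        def decorator(callback):
            self._handlers.append((match_func, callback))
            return callback
        return decorator

    def run(self, root):
        # Traverse the tree and offer every node to every registered match function
        for node in root.walk_preorder():
            for match_func, callback in self._handlers:
                if match_func(node):
                    callback(node)


frontend_action = FrontendAction()
matched = []


def is_struct(node):
    return node.kind == 'STRUCT_DECL'


@frontend_action.connect(is_struct)
def on_struct(node):
    matched.append(node.spelling)


tree = Node('TRANSLATION_UNIT', 'test.cpp', [
    Node('STRUCT_DECL', 'NIM_AuthInfo', [Node('FIELD_DECL', 'token')]),
    Node('FUNCTION_DECL', 'main'),
])
frontend_action.run(tree)
print(matched)
```

Keeping the match function separate from the callback means the same matcher can be reused by several callbacks, for example one per target language.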

MVC

Readers are probably familiar with the MVC pattern, which is often used in front-end development. It was briefly introduced in the jinja template engine section earlier in this article. MVC in Yunxin NeCodeGen can be summarized as follows: the model holds the data extracted from the AST, the view is the code template, and the controller consists of the extraction and conversion functions that initialize the model and render the template.

In actual use, NeCodeGen recommends a top-down approach: define the model and decide on its members, write the code templates based on the model, then write the extraction and conversion functions that obtain the data to initialize the model, and finally use the model to render the template.

From a practical point of view, MVC makes the code structure clearer and more maintainable. For projects that need to generate programs in multiple target languages from one source language, MVC keeps the model consistent across the target languages, which promotes code reuse to a certain extent.
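The reuse across target languages can be illustrated by rendering one model through two views. This is a sketch only: the Point struct, the field attributes ts_type and c_type, and both templates are hypothetical.

```python
from types import SimpleNamespace

from jinja2 import Environment, BaseLoader

env = Environment(loader=BaseLoader())

# Two views over the same model: one per target language
ts_view = env.from_string(
    'export interface {{ m.name }} { {% for f in m.fields %}{{ f.name }}: {{ f.ts_type }}; {% endfor %}}')
c_view = env.from_string(
    'struct {{ m.name }} { {% for f in m.fields %}{{ f.c_type }} {{ f.name }}; {% endfor %}};')

# One model, built once by the controller (filled in by hand here)
model = SimpleNamespace(
    name='Point',
    fields=[
        SimpleNamespace(name='x', ts_type='number', c_type='int'),
        SimpleNamespace(name='y', ts_type='number', c_type='int'),
    ],
)

ts_code = ts_view.render(m=model)
c_code = c_view.render(m=model)
print(ts_code)
print(c_code)
```

Adding another target language then only requires a new view, while the extraction code that builds the model stays unchanged.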

Summary

The Matcher pattern is the pattern used by the NeCodeGen framework itself, while the MVC pattern is recommended for application developers. The callback of the Matcher pattern corresponds to the controller of the MVC pattern; that is, the engineer implements the controller's function in the callback.

09 How NeCodeGen runs

From the preceding sections we have a general picture of how NeCodeGen runs: it parses the source files with Libclang to obtain the AST, traverses the AST and runs the registered match functions on each node, executes the callback of every successful match to extract and transform data into a model, and finally renders the code template with that model to produce the target code.

10 Application value

The main purpose of code generation tools is to increase productivity, and their effect is most obvious in large projects. In NetEase Yunxin's engineering practice, engineers combine a variety of code generation tools, make full use of their power, and integrate them into CI/CD, which greatly improves R&D efficiency. The productivity gain also shows in refactoring and maintenance: after the source language program is modified, the updated target language program is obtained simply by running the tool, which avoids errors caused by inconsistencies between the two. Refactoring becomes simple as well: refactoring the target language program is reduced to modifying the code template, and re-running the tool completes the whole refactoring. Code generation tools also help with compliance with code specifications: by encoding naming conventions, code specifications, and so on in the tool, the generated programs are guaranteed to comply with them 100%.

Besides generating language bindings, NeCodeGen can also be applied in other areas, such as implementing a meta-object system similar to Qt's, and it can serve as a stub code generator.

References

https://en.wikipedia.org/wiki/Comparison_of_code_generation_tools

Author introduction

Kaikai is a senior C++ development engineer at NetEase Yunxin, responsible for Yunxin's basic technology R&D, with rich R&D experience and familiarity with programming language theory.
