Different programming languages suit different domains: Python, for example, is well suited to data analysis, while C++ fits low-level systems development. When both need a basic component with identical functionality, implementing that component separately in each language not only multiplies development and maintenance costs but also makes it hard to guarantee consistent behavior across languages. This article summarizes our practice of cross-language calling on Linux: develop a component once in C++, then invoke it from other languages through cross-language calling techniques.
1 Background introduction
Query Understanding (QU) is the core module of Meituan Search. Its main responsibility is to understand user queries and generate basic signals such as query intent, components, and rewrites, so it is critical to the basic search experience. The service's online main program is developed in C++ and loads a large amount of vocabulary data, prediction models, and so on. The offline pipelines that produce these data and models rely on many text-parsing capabilities that must stay consistent with the online service, such as text normalization and word segmentation, in order to guarantee consistency at the effect level.
These offline pipelines, however, are usually implemented in Python or Java. If online and offline code are developed separately in different languages, it is difficult to keep strategies and effects unified, and because these capabilities iterate continuously, keeping multiple language versions in sync imposes a heavy cost on daily iteration. We therefore tried to solve the problem with cross-language calls to a dynamic link library: develop a C++-based .so, wrap it into component libraries for each language through thin per-language binding layers, and plug those into the corresponding production pipelines. The advantages are obvious: the main business logic is developed only once, the wrapper layers need very little code, and when the main logic is upgraded the other languages barely change at all; they simply pick up the latest dynamic library and release a new version. At the same time, as a lower-level language, C++ is computationally more efficient and makes better use of hardware resources in many scenarios, which also brings us some performance gains.
This article reviews the problems we encountered and the practical experience we gained when applying this technical solution in production, in the hope of providing some reference or help.
2 Solution overview
To let business teams use the library out of the box, and taking into account the habits of C++, Python, and Java users, we designed the following collaboration structure:
3 Implementation Details
Python and Java can call C interfaces but not C++ interfaces, so an interface implemented in C++ must first be exposed in C form. To avoid modifying the original C++ code, a layer of C is wrapped around the C++ interface; this kind of code is usually called "glue code". The specific scheme is shown in the following figure:
The rest of this chapter is organized as follows:
- The [Function Code] section walks through the coding work in each language, using a string-printing example.
- The [Packaging and Publishing] section describes how to package the generated dynamic library as a resource file together with the Python and Java code and publish it to a repository, reducing the access cost for users.
- The [Business Usage] section shows out-of-the-box usage examples.
- The [Ease of Use Optimization] section, drawing on problems met in actual use, covers Python version compatibility and the handling of dynamic library dependencies.
3.1 Function code
3.1.1 C++ code
As an example, we implement a function that prints a string. To simulate a real industrial scenario, the following code is compiled into both the dynamic library libstr_print_cpp.so and the static library libstr_print_cpp.a.
str_print.h
#pragma once
#include <string>
class StrPrint {
public:
void print(const std::string& text);
};
str_print.cpp
#include <iostream>
#include "str_print.h"
void StrPrint::print(const std::string& text) {
std::cout << text << std::endl;
}
3.1.2 c_wrapper code
As mentioned above, the C++ library must be wrapped so that it exposes an interface in C form.
c_wrapper.cpp
#include "str_print.h"
extern "C" {
void str_print(const char* text) {
StrPrint cpp_ins;
std::string str = text;
cpp_ins.print(str);
}
}
3.1.3 Generate dynamic library
To support cross-language calls from Python and Java, we need to generate a dynamic library from the wrapped interface. There are three ways to do this.
- Method 1: source-code dependency. Compile c_wrapper together with the C++ code to produce libstr_print.so. The business side then depends on a single .so, so the usage cost is low, but the source code must be available; for some off-the-shelf dynamic libraries this may not be applicable.
g++ -o libstr_print.so str_print.cpp c_wrapper.cpp -fPIC -shared
- Method 2: dynamic linking. The libstr_print.so generated this way must ship together with its dependency libstr_print_cpp.so. The business side has to depend on two .so files at once, so the usage cost is higher, but the source code of the original dynamic library is not required.
g++ -o libstr_print.so c_wrapper.cpp -fPIC -shared -L. -lstr_print_cpp
- Method 3: static linking. The generated libstr_print.so does not need to carry libstr_print_cpp.so when published. The business side depends on a single .so and needs no source code, but a static library must be provided.
g++ c_wrapper.cpp libstr_print_cpp.a -fPIC -shared -o libstr_print.so
Each of the three methods has its own applicable scenarios, advantages, and disadvantages. In our business scenario, because both the tool library and the wrapper are developed in house and the source code is available, we chose the first method, which is the easiest for business teams to depend on.
3.1.4 Python access code
ctypes, which ships with the Python standard library, can load C dynamic libraries. It is used as follows:
str_print.py
# -*- coding: utf-8 -*-
import ctypes
# Load the C library
lib = ctypes.cdll.LoadLibrary("./libstr_print.so")
# Map the interface's parameter types
lib.str_print.argtypes = [ctypes.c_char_p]
lib.str_print.restype = None
# Call the interface (note: on Python 3 the argument must be bytes)
lib.str_print(b'Hello World')
LoadLibrary returns an instance representing the dynamic library, through which functions in the library can be called directly from Python. argtypes and restype describe a function's signature: the former is a list or tuple of ctypes types specifying the parameter types of the function in the dynamic library; the latter is the function's return type (it defaults to c_int and may be left unspecified, but any non-c_int type must be set explicitly). For the parameter type mappings involved here, and for passing advanced types such as structs and pointers, refer to the documents in the appendix.
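As a quick illustration of argtypes/restype, independent of our own .so, the sketch below loads the system math library instead and declares the C signature of sqrt; without the restype declaration, ctypes would interpret the return value as c_int and produce garbage (a minimal sketch, assuming a Linux-like system where libm is available):

```python
import ctypes
import ctypes.util

# Locate and load the system math library (falls back to the usual glibc name)
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double  # without this, the default c_int mangles the result

print(libm.sqrt(9.0))  # prints 3.0
```

The same two attributes are all that our libstr_print.so example above needs, because str_print takes a single char* and returns nothing.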
3.1.5 Java Access Code
Java can call a C lib in two ways: JNI and JNA. In terms of ease of use, JNA is the more recommended of the two.
3.1.5.1 JNI access
Java has supported the JNI interface protocol since version 1.1 for calling C/C++ dynamic libraries from Java. With JNI, the c_wrapper module above is no longer applicable: the JNI protocol itself defines the adaptation-layer interface, and the implementation must follow that definition. The JNI access steps are as follows:
In Java code, add the native keyword to the method that needs to be called across languages to declare that this is a native method.
import java.lang.String;
public class JniDemo {
public native void print(String text);
}
Use the javah command to generate the corresponding C header file from the native methods in the code. This header plays a role similar to the c_wrapper mentioned above.
javah JniDemo
The resulting header file is as follows (some comments and macros are simplified here to save space):
#include <jni.h>
#ifdef __cplusplus
extern "C" {
#endif
JNIEXPORT void JNICALL Java_JniDemo_print
(JNIEnv *, jobject, jstring);
#ifdef __cplusplus
}
#endif
jni.h is provided with the JDK and defines the facilities necessary for calls between Java and C.
JNIEXPORT and JNICALL are two macros defined by JNI: JNIEXPORT marks a method in the dynamic library as callable from external program code, and JNICALL defines the calling convention for pushing and popping arguments.
Java_JniDemo_print is the automatically generated function name; its format is fixed as Java_{className}_{methodName}. JNI follows this convention to register the mapping between Java methods and C functions.
Of the three parameters, the first two are fixed: JNIEnv encapsulates utility methods from jni.h, and jobject points to the calling Java object (here a JniDemo instance), through which the C stack can access copies of the class's member variables. jstring points to the incoming parameter text and is the mapping of Java's String type. Type mapping is covered in detail later.
Write the implementation of the Java_JniDemo_print method.
JniDemo.cpp
#include <string>
#include "JniDemo.h"
#include "str_print.h"
JNIEXPORT void JNICALL Java_JniDemo_print (JNIEnv *env, jobject obj, jstring text)
{
    const char* str = env->GetStringUTFChars(text, NULL);
    std::string tmp = str;
    StrPrint ins;
    ins.print(tmp);
    env->ReleaseStringUTFChars(text, str);  // release the copy made by GetStringUTFChars
}
Compile and generate dynamic library.
g++ -o libJniDemo.so JniDemo.cpp str_print.cpp -fPIC -shared -I<$JAVA_HOME>/include/ -I<$JAVA_HOME>/include/linux
Compile and run.
java -Djava.library.path=<path_to_libJniDemo.so> JniDemo
The JNI mechanism implements a cross-language calling protocol through a layer of C/C++ bridging. It is used extensively by graphics-related Java programs on Android: on the one hand, Java can call a large number of low-level operating system libraries, greatly reducing the driver-development workload on the JDK side; on the other hand, hardware performance can be exploited more fully. As the description above shows, however, the implementation cost of JNI itself is fairly high; in particular, writing the bridging C/C++ code becomes expensive when complex parameter types have to be passed. To streamline this process, Sun led the JNA (Java Native Access) open-source project.
3.1.5.2 JNA Access
JNA is a programming framework built on top of JNI. It provides a dynamic forwarder for C functions and converts Java types to C types automatically, so Java developers only need to describe the functions and structures of the target native library in a Java interface; no native/JNI code has to be written, which greatly reduces the difficulty of calling native shared libraries from Java.
JNA is used as follows:
Introduce the JNA library in the Java project.
<dependency>
<groupId>com.sun.jna</groupId>
<artifactId>jna</artifactId>
<version>5.4.0</version>
</dependency>
Declare the Java interface class corresponding to the dynamic library.
public interface CLibrary extends Library {
void str_print(String text); // same name as the function in the dynamic library; parameter types are expressed with Java types and mapped at call time (see the Principle section for details)
}
Load the dynamic link library and implement the interface method.
JnaDemo.java
package com.jna.demo;
import com.sun.jna.Library;
import com.sun.jna.Native;
public class JnaDemo {
private CLibrary cLibrary;
public interface CLibrary extends Library {
void str_print(String text);
}
public JnaDemo() {
cLibrary = Native.load("str_print", CLibrary.class);
}
public void str_print(String text)
{
cLibrary.str_print(text);
}
}
Comparing the two: unlike JNI, JNA needs no native keyword, no generated JNI C code, and no explicit parameter type conversion, which greatly improves the efficiency of calling dynamic libraries.
3.2 Package release
To achieve out-of-the-box usage, we package the dynamic library together with the code of the corresponding language and prepare the required dependency environment automatically; users only need to install the library and import it into a project before calling it directly. Note that we did not deploy the .so onto target machines; instead we publish it together with the interface code to the code repository. The tool may be used by teams with different business backgrounds (including non-C++ teams), so we can neither guarantee that every team runs a unified, standardized environment nor update the .so on all machines in a unified way.
3.2.1 Python package release
Python can package the tool library with setuptools and publish it to the PyPI public repository. The steps are as follows:
Create the directory structure.
.
├── MANIFEST.in           # declares the static resources to include
├── setup.py              # packaging and publishing configuration
└── strprint              # source directory of the tool library
    ├── __init__.py       # package entry point
    └── libstr_print.so   # the dependent c_wrapper dynamic library
Write __init__.py to encapsulate the above code into a method.
# -*- coding: utf-8 -*-
import ctypes
import os
import sys
dirname, _ = os.path.split(os.path.abspath(__file__))
lib = ctypes.cdll.LoadLibrary(dirname + "/libstr_print.so")
lib.str_print.argtypes = [ctypes.c_char_p]
lib.str_print.restype = None
def str_print(text):
lib.str_print(text)
Write setup.py.
from setuptools import setup, find_packages
setup(
name="strprint",
version="1.0.0",
packages=find_packages(),
include_package_data=True,
description='str print',
author='xxx',
package_data={
'strprint': ['*.so']
},
)
Write MANIFEST.in.
include strprint/libstr_print.so
Package release.
python setup.py sdist upload
3.2.2 Java Interface
For the Java interface, package it into a JAR package and publish it to the Maven repository.
Write the wrapper interface code JnaDemo.java.
package com.jna.demo;
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
public class JnaDemo {
private CLibrary cLibrary;
public interface CLibrary extends Library {
Pointer create();
void str_print(String text);
}
public static JnaDemo create() {
JnaDemo jnademo = new JnaDemo();
jnademo.cLibrary = Native.load("str_print", CLibrary.class);
return jnademo;
}
public void print(String text)
{
cLibrary.str_print(text);
}
}
Create the resources directory and put the dependent dynamic libraries in it.
Use the packaging plugin to bundle the dependent libraries into the JAR.
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
3.3 Business Use
3.3.1 Python usage
Install the strprint package.
pip install strprint==1.0.0
Example of use:
# -*- coding: utf-8 -*-
import sys
from strprint import *
str_print('Hello py')
3.3.2 Java usage
pom introduces the JAR package.
<dependency>
<groupId>com.jna.demo</groupId>
<artifactId>jnademo</artifactId>
<version>1.0</version>
</dependency>
Example of use:
JnaDemo jnademo = JnaDemo.create();
jnademo.print("hello jna");
3.4 Usability optimization
3.4.1 Python version compatibility
The split between Python 2 and Python 3 has long been a pain point for Python developers. Since our tools serve different business teams, we have no way to force everyone onto a unified Python version, but with a little extra handling the tool library can be made compatible with both. Two issues need attention for Python version compatibility:
- Syntax compatibility
- Data encoding
For a wrapper this thin there are essentially no syntax compatibility issues, so our work focuses on data encoding. Python 3's str type uses Unicode, while the char* we need on the C side is UTF-8 encoded; incoming strings must therefore be UTF-8 encoded, and strings returned from C must be UTF-8 decoded back into Unicode. We modified the earlier example as follows:
# -*- coding: utf-8 -*-
import ctypes
import os
import sys
dirname, _ = os.path.split(os.path.abspath(__file__))
lib = ctypes.cdll.LoadLibrary(dirname + "/libstr_print.so")
lib.str_print.argtypes = [ctypes.c_char_p]
lib.str_print.restype = None
def is_python3():
return sys.version_info[0] == 3
def encode_str(input):
if is_python3() and type(input) is str:
return bytes(input, encoding='utf8')
return input
def decode_str(input):
if is_python3() and type(input) is bytes:
return input.decode('utf8')
return input
def str_print(text):
lib.str_print(encode_str(text))
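The two helpers are pure Python and can be exercised on their own. The sketch below restates them so it runs standalone, and round-trips a non-ASCII string the way the wrapper would when handing it to the C side:

```python
import sys

def is_python3():
    return sys.version_info[0] == 3

def encode_str(s):
    # str -> UTF-8 bytes on Python 3; pass through unchanged otherwise
    if is_python3() and type(s) is str:
        return bytes(s, encoding='utf8')
    return s

def decode_str(b):
    # UTF-8 bytes -> str on Python 3; pass through unchanged otherwise
    if is_python3() and type(b) is bytes:
        return b.decode('utf8')
    return b

raw = '美团'             # non-ASCII text
wire = encode_str(raw)   # what the wrapper hands to the C side
assert wire == b'\xe7\xbe\x8e\xe5\x9b\xa2'   # its UTF-8 byte form
assert decode_str(wire) == raw               # and back again
```

On Python 2 both helpers are no-ops, which is exactly why the same wrapper code serves both interpreter versions.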
3.4.2 Dependency Management
In many cases, the dynamic library we call depends on other dynamic libraries. For example, when the gcc/g++ version we build against differs from the one in the runtime environment, we often hit errors like glibc_X.XX not found and must supply matching versions of libraries such as libstdc++.so.6.
To keep the out-of-the-box goal, when the dependencies are not complicated we package them into the release as well and ship them with the toolkit. These indirect dependencies need no explicit load in the wrapper code: in both Python and Java, loading a dynamic library ultimately calls the system function dlopen, which automatically loads the target library's indirect dependencies. All we have to do is place those dependencies somewhere dlopen can find them.
The order in which dlopen looks for dependencies is as follows:
- Search the directory specified by DT_RPATH in the ELF (Executable and Linkable Format) file of the dlopen caller. ELF is the file format of a .so; DT_RPATH is written into the dynamic library file itself, so it cannot be modified by conventional means.
- Search the directories listed in the LD_LIBRARY_PATH environment variable, the most common way to specify dynamic library paths.
- Search the directory specified by DT_RUNPATH in the caller's ELF, likewise a path embedded in the .so file.
- Look in /etc/ld.so.cache, the cache built from /etc/ld.so.conf; modifying it requires root privileges, so it is rarely changed in production.
- Search /lib, the system directory that generally holds system dynamic libraries.
- Search /usr/lib, for dynamic libraries installed by root; since root privileges are required, this is also rarely used in production.
Given this search order, the best dependency-management approach is to make the LD_LIBRARY_PATH variable include the path of the dynamic library resources in our toolkit. For Java programs, the library location can also be given via the java.library.path run-time option: the Java program concatenates java.library.path with the library file name and passes the result to dlopen as an absolute path, and this lookup happens before the order listed above.
One more detail matters in Java. Our toolkit is released as a JAR, which is essentially a compressed archive. In Java code we can call Native.load() to load a .so located in the project's resources directory, and after packaging those resource files sit in the JAR's root directory — but dlopen cannot load from inside a JAR. The best fix is the approach from the [3.1.3 Generate dynamic library] section: link the dependencies into a single .so, which then works out of the box with no environment configuration. For system libraries such as libstdc++.so.6 that cannot be linked into our .so, a more general approach is to copy the .so files out of the JAR into a local directory when the service initializes, and make LD_LIBRARY_PATH include that directory.
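The same extract-then-load idea applies on the Python side when a package bundles its .so: copy the library out to a real filesystem directory, then hand that ordinary path to the loader. A minimal sketch — the function name and file names here are hypothetical, not part of the toolkit described above:

```python
import ctypes
import os
import shutil
import tempfile

def extract_bundled_lib(bundle_dir, lib_name, cache_dir=None):
    """Copy a bundled .so to a real directory and return its new path.

    bundle_dir/lib_name is assumed to be the library shipped inside the
    package; dlopen needs an ordinary file path, so we copy it out first.
    """
    cache_dir = cache_dir or tempfile.mkdtemp(prefix="native_libs_")
    src = os.path.join(bundle_dir, lib_name)
    dst = os.path.join(cache_dir, lib_name)
    shutil.copy2(src, dst)
    return dst

# Usage (hypothetical paths):
#   path = extract_bundled_lib(os.path.dirname(__file__), "libstr_print.so")
#   lib = ctypes.CDLL(path)  # dlopen now sees an ordinary file, and can also
#                            # resolve siblings placed in the same directory
```

Pointing LD_LIBRARY_PATH at the cache directory before the process starts covers indirect dependencies the same way the Java JAR-extraction approach does.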
4 Principle introduction
4.1 Why do you need a c_wrapper
The implementation section mentioned that Python/Java cannot call C++ interfaces directly, and that interfaces provided in C++ must first be wrapped in C form. The fundamental reason is that before a function in a dynamic library can be used, its address in memory must be found from its name. This lookup is done by the system function dlsym, strictly by the name passed in.
In C, a function's signature is simply its name in the code. C++, however, must support function overloading, so several functions may share a name; to keep signatures unique, C++'s name-mangling mechanism generates a distinct signature for each same-named function with a different implementation. The generated signature is a string like __Z4funcPN4printE, which a caller cannot predict and pass to dlsym. (Note: executable programs and most dynamic libraries on Linux organize binary data in ELF format, in which every non-static function is uniquely identified by a "symbol"; symbols distinguish functions during linking and are mapped to concrete instruction addresses at execution time. This "symbol" is what we usually call the function signature.)
To solve this, extern "C" tells the compiler to emit the function with a C-style signature. So when the dependent dynamic library is a C++ library, a c_wrapper module is needed as a bridge; when it is a library compiled from C, the module is unnecessary and the library can be called directly.
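The by-name lookup can be observed from Python: ctypes resolves attributes on a loaded library via dlsym, so a plain C symbol such as strlen is found by its exact name, while a mangled name would have to be spelled out character for character. A small sketch, assuming Linux, where CDLL(None) opens the symbols already loaded into the process (libc included):

```python
import ctypes

# dlopen(NULL): symbols of the main program and the libraries it loaded
libc = ctypes.CDLL(None)

# dlsym finds the unmangled C symbol "strlen" by its exact name
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t
assert libc.strlen(b"hello") == 5

# A symbol that was never exported under that name (for example, a guessed
# mangled C++ name) simply is not found
try:
    libc.__Z4funcPN4printE
except AttributeError:
    print("symbol not found, as expected")
```

This is exactly why the c_wrapper's extern "C" matters: it fixes the exported symbol to the plain, predictable C name.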
4.2 How to implement parameter passing in cross-language calls
The standard procedure for C/C++ function calls is as follows:
- Allocate a stack frame for the called function in the stack space of the memory, which is used to store the formal parameters, local variables and return address of the called function.
- Copies the value of the actual parameter to the corresponding formal parameter variable (can be a pointer, reference, value copy).
- The flow of control is transferred to the start of the called function and executed.
- The control flow returns to the function call site, and the return value is given to the caller, and the stack frame is released.
As this process shows, a function call involves allocating and releasing memory and copying actual parameters to formal parameters. Python and Java programs run on virtual machines that follow the same process internally — but what does the calling process look like when they invoke a dynamic library implemented in a native language?
Since the Python and Java calling processes are essentially the same, we take Java as the example and will not repeat the description for Python.
4.2.1 Memory Management
In the Java world, memory is managed by the JVM, whose memory comprises the stack area, heap area, and method area; more detailed accounts also mention a native heap and native stack. Rather than viewing this from inside the JVM, it is simpler and more intuitive to look at it from the operating system level. Taking Linux as the example: the JVM is nominally a virtual machine, but it is essentially a process running on the operating system, so the process's memory is divided as shown on the left of the figure below. The JVM's memory management is essentially a re-division of the process heap, with the Java-world stack "virtualized" on top of it. As shown on the right, the native stack area is the JVM process's stack area; part of the process heap is taken over by the JVM for its own management, and the remainder can be allocated to native methods.
4.2.2 Calling the procedure
As mentioned above, before a native method is called, the dynamic library containing it must be loaded into memory; this is done with Linux's dlopen. The JVM places the library's code in the Native Code area and, in its Bytecode area, keeps a map from each native method name to its address in Native Code.
The calling steps of a native method are roughly divided into four steps:
- Get the address of the native method from JVM Bytecode.
- Prepare the required parameters for the method.
- Switch to the native stack and execute the native method.
- After the native method is popped, switch back to the JVM method, and the JVM copies the result to the JVM's stack or heap.
It can be seen from the above steps that the invocation of the native method also involves the copying of parameters, and the copy is established between the JVM stack and the native stack.
For primitive data types, parameters are pushed onto the stack by value copy along with the native method address. For complex data types, a protocol is needed to map Java objects into bytes that C/C++ can recognize, because memory layout in the JVM differs substantially from C and cannot simply be copied. The main differences are:
- Type widths differ; for example, char is 16 bits in Java but 8 bits in C.
- The byte order (Big Endian or Little Endian) of the JVM and the operating system may not be consistent.
- A JVM object carries meta information, whereas a C struct is just a flat arrangement of primitive types. Similarly, Java has no pointers, which likewise require wrapping and mapping.
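These layout differences can be made concrete from the Python side, where ctypes plays the same mapping role as JNI's type system. The sketch below (pure ctypes, independent of any JVM; layout assumes a typical platform where c_int is 4-byte aligned) shows explicit field widths, alignment padding, and byte order:

```python
import ctypes

# A flat C-style struct: fields laid out in declaration order, with padding
class Pair(ctypes.Structure):
    _fields_ = [("tag", ctypes.c_char),   # 1 byte, like C char (Java char is 2 bytes)
                ("value", ctypes.c_int)]  # 4 bytes

# Natural alignment pads the char up to the int boundary: 1 + 3 + 4 = 8 bytes
assert ctypes.sizeof(Pair) == 8

# Byte order must also match: the same uint32 has different raw bytes
class LE(ctypes.LittleEndianStructure):
    _fields_ = [("v", ctypes.c_uint32)]

class BE(ctypes.BigEndianStructure):
    _fields_ = [("v", ctypes.c_uint32)]

assert bytes(LE(v=1)) == b"\x01\x00\x00\x00"   # little-endian layout
assert bytes(BE(v=1)) == b"\x00\x00\x00\x01"   # big-endian layout
```

JNI has to resolve exactly these questions (width, padding, endianness, plus object metadata) before a Java object's bytes can cross into C.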
The above figure shows the process of parameter passing during the native method invocation. The mapping copy in JNI is implemented by the glue code in the C/C++ linking part, and the type mapping is defined in jni.h.
The mapping between Java primitive types and C primitive types (passed by value: the value of the Java object in JVM memory is copied to the formal-parameter slot of the stack frame):
typedef unsigned char jboolean;
typedef unsigned short jchar;
typedef short jshort;
typedef float jfloat;
typedef double jdouble;
typedef jint jsize;
The mapping between Java complex types and C complex types (passed by pointer: fields are first mapped one-to-one onto basic types, then the address of the assembled new object is copied to the formal-parameter slot of the stack frame):
typedef _jobject *jobject;
typedef _jclass *jclass;
typedef _jthrowable *jthrowable;
typedef _jstring *jstring;
typedef _jarray *jarray;
Note: in Java, all non-primitive types derive from Object; an array of objects is itself an object; each object's type is a class, and a class is itself an object.
class _jobject {};
class _jclass : public _jobject {};
class _jthrowable : public _jobject {};
class _jarray : public _jobject {};
class _jcharArray : public _jarray {};
class _jobjectArray : public _jarray {};
jni.h provides tools for memory copying and reading. For example, GetStringUTFChars from the earlier example copies a JVM string's text into the native heap in UTF-8 encoding and hands a char* pointer to the native method.
Throughout the calling process, the memory copies made on the Java side are objects cleaned up by the JVM's GC. Objects in the native heap that were allocated by the JNI framework, such as the parameters in the JNI example above, are released by the framework; objects newly allocated in C/C++ must be released manually by the C/C++ user code. In short, the native heap behaves like that of an ordinary C/C++ process: there is no GC, and the governing principle is "whoever allocates, frees".
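The "whoever allocates, frees" rule is easy to demonstrate from Python with ctypes, which faces the same native heap. libc's strdup mallocs a copy that the caller must free (a Linux sketch; CDLL(None) reaches the already-loaded libc):

```python
import ctypes

libc = ctypes.CDLL(None)

# char *strdup(const char *) -- allocates a copy on the native heap
libc.strdup.argtypes = [ctypes.c_char_p]
libc.strdup.restype = ctypes.c_void_p   # keep the raw pointer so we can free it later
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None

ptr = libc.strdup(b"native heap")
assert ptr                                                  # malloc succeeded
assert ctypes.cast(ptr, ctypes.c_char_p).value == b"native heap"

# No GC here: the allocation is ours to release
libc.free(ptr)
```

Note the restype choice: had we declared it c_char_p, ctypes would convert the result to bytes and discard the pointer, leaking the allocation with no way to free it.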
4.3 Extended reading (JNA direct mapping)
Compared with JNI, JNA reuses the same basic function-call framework, but the memory-mapping part is handled automatically by JNA's utility classes, which do most of the type-mapping and memory-copying work. This avoids writing large amounts of glue code and is far friendlier to use, at the cost of some performance.
To make up for this, JNA also offers a "direct mapping" (DirectMapping) calling style. Direct mapping, however, places strict limits on parameters: only primitive types, arrays of primitives, and Native reference types can be passed; varargs are not supported; and the method return type must be a primitive.
Directly mapped Java code needs the native keyword, consistent with how JNI is written. A DirectMapping example:
import com.sun.jna.*;
public class JnaDemo {
public static native double cos(DoubleByReference x);
static {
Native.register(Platform.C_LIBRARY_NAME);
}
public static void main(String[] args) {
System.out.println(cos(new DoubleByReference(1.0)));
}
}
DoubleByReference is the native reference type for double-precision floating-point numbers. Its JNA source definition is as follows (only the relevant code is excerpted):
//DoubleByReference
public class DoubleByReference extends ByReference {
public DoubleByReference(double value) {
super(8);
setValue(value);
}
}
// ByReference
public abstract class ByReference extends PointerType {
protected ByReference(int dataSize) {
setPointer(new Memory(dataSize));
}
}
The Memory type is roughly Java's analogue of shared_ptr: it encapsulates the details of allocating, referencing, and releasing memory. Data of this type actually lives on the native heap; Java code holds only a reference to it. When constructing a Memory object, JNA calls malloc to allocate the memory and records a pointer to it.
When a ByReference object is released, free is called to release the memory: the finalize method of the ByReference base class in the JNA source runs during GC and releases the corresponding allocation. So in JNA, memory allocated inside the dynamic library is managed by the dynamic library's code, while memory allocated by the JNA framework is released explicitly by JNA code — though the trigger is the JVM's GC collecting the JNA object. This is consistent with the earlier point that the native heap has no GC and follows the "whoever allocates, frees" principle.
@Override
protected void finalize() {
    dispose();
}

/** Free the native memory and set peer to zero */
protected synchronized void dispose() {
    if (peer == 0) {
        // someone called dispose before, the finalizer will call dispose again
        return;
    }
    try {
        free(peer);
    } finally {
        peer = 0;
        // no null check here, tracking is only null for SharedMemory
        // SharedMemory is overriding the dispose method
        reference.unlink();
    }
}
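The same allocate-outside-the-managed-heap pattern appears in Python's ctypes, covered earlier in this article. Below is a minimal sketch (assuming Linux with glibc's libm): a c_double instance owns an 8-byte native buffer, much like JNA's DoubleByReference, and a pointer to it is passed into C via byref.

```python
import ctypes
import ctypes.util

# Load the C math library; on Linux this resolves to libm.so.6.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# double modf(double x, double *iptr): splits x into fractional and
# integral parts, writing the integral part through the pointer.
libm.modf.restype = ctypes.c_double
libm.modf.argtypes = [ctypes.c_double, ctypes.POINTER(ctypes.c_double)]

# Like DoubleByReference, c_double owns an 8-byte buffer on the native
# heap; the Python object holds only a reference to it.
int_part = ctypes.c_double(0.0)
frac_part = libm.modf(3.25, ctypes.byref(int_part))

print(frac_part, int_part.value)  # 0.25 3.0
```

When int_part is reclaimed by the interpreter, ctypes frees the native buffer, the same trigger-by-GC, explicit-free arrangement as JNA's finalize/dispose pair.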
4.4 Performance Analysis
Improving computing efficiency is one of the main motivations for native calls. However, as the analysis above shows, a cross-language native call still involves a fair amount of bridging work, and that work has its own computational cost. Not every native call therefore improves performance. To judge when it does, we need to understand where the performance differences between languages come from, and how much computation a cross-language call itself costs.
The performance differences between languages are mainly reflected in three aspects:
- Python and Java are both interpreted (bytecode-executed) languages: at runtime, scripts or bytecode must be translated into binary machine instructions before the CPU can execute them, whereas C/C++ is compiled directly to machine instructions ahead of time. Runtime optimizations such as JIT narrow this gap, but only to a degree.
- Many operations in higher-level languages are ultimately implemented by lower layers of the operating system via cross-language calls, which is clearly less efficient than calling them directly.
- Python and Java simplify memory management by introducing garbage collection, and running GC consumes a certain amount of system resources. This difference usually shows up as runtime latency spikes: it has little effect on average running time, but a larger impact at individual moments.
The overhead of cross-language calls mainly includes three parts:
- For cross-language calls implemented through dynamic proxies, as in JNA, the call path includes stack switching, proxy routing, and similar work.
- Addressing and constructing the native method stack, that is, mapping a native method in Java to a function address in the dynamic library and setting up the call site.
- Memory mapping: especially when large amounts of data are copied from the JVM heap to the native heap, this is the dominant cost of a cross-language call.
We ran a simple performance comparison with the following experiment: using five approaches, C, Java, JNI, JNA, and JNA direct mapping, we performed 1 million to 10 million cosine calculations each and compared the elapsed time. On a 6-core 16 GB machine, we obtained the following results:
According to the experimental data, running efficiency ranks as C > Java > JNI > JNA DirectMapping > JNA. C is more efficient than Java, but the two are very close. JNI and JNA DirectMapping perform essentially the same, though both are much slower than the native-language implementations. JNA in normal mode is the slowest, 5 to 6 times slower than JNI.
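The shape of this gap is easy to reproduce in Python as well. The sketch below (assuming Linux with glibc's libm; absolute numbers depend entirely on the machine) times the built-in math.cos, a thin specialized C wrapper, against the same cos reached through the generic ctypes FFI path, making the per-call bridging overhead visible:

```python
import ctypes
import ctypes.util
import math
import time

# Load libm through the dynamic loader (dlopen under the hood).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

N = 1_000_000

start = time.perf_counter()
for _ in range(N):
    math.cos(1.0)              # built-in: thin, specialized C wrapper
builtin_s = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    libm.cos(1.0)              # ctypes: generic FFI marshalling per call
ctypes_s = time.perf_counter() - start

print(f"math.cos : {builtin_s:.3f} s")
print(f"ctypes cos: {ctypes_s:.3f} s")
```

On a typical machine the ctypes path is several times slower than the built-in, mirroring the JNA-versus-JNI gap: for a computation this cheap, the marshalling around the call, not the cosine itself, dominates.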
To sum up, cross-language native calls do not always improve computing performance; the complexity of the computation must be weighed against the cost of the call itself. The scenarios we have found so far that suit cross-language calls are:
- Offline data processing: offline tasks may involve development in several languages and are not latency-sensitive; the key requirement is that results stay consistent across languages, and cross-language calls save the cost of developing a version per language.
- Converting cross-language RPC calls into cross-language native calls: for computations that take microseconds or less, obtaining the result via RPC adds at least milliseconds of network transfer time, far exceeding the computational overhead. When the dependencies are simple, converting such calls into native calls greatly reduces per-request processing time.
- For some complex model computations, calling C++ across languages from Python/Java can improve computational efficiency.
5 Application cases
As discussed above, native calls can bring benefits in both performance and development cost. We have applied these techniques to offline computing tasks and real-time service invocation, with satisfying results.
5.1 Applications in Offline Tasks
The search business runs a large number of offline computing tasks such as vocabulary mining, data processing, and index building. These tasks rely heavily on the text processing and recognition capabilities of query understanding, such as word segmentation and name recognition. Because of the difference in development languages, re-implementing these capabilities locally would be unacceptably costly, so offline computation previously called the online service via RPC. That approach brings the following problems:
- Offline computing tasks are usually large in scale, with dense bursts of requests during execution; they occupy online resources, affect online user requests, and pose a stability risk.
- A single RPC takes at least milliseconds, while the actual computation is often very short, so most of the time is wasted on network communication, which seriously hurts task throughput.
- Due to network jitter and similar issues, the success rate of RPC calls cannot reach 100%, which affects task results.
- Offline tasks must pull in RPC client code; in lightweight jobs such as Python scripts, immature basic components often make this integration costly.
Converting RPC calls into cross-language native calls solved all of the above problems, with obvious benefits:
- The online service is no longer called: traffic is isolated, and online stability is unaffected.
- For tasks issuing more than 10 million calls, at least 10 hours of network overhead are saved.
- Request failures caused by network jitter are eliminated.
- Thanks to the work in the preceding chapters, out-of-the-box native-call tools are available, which greatly reduces integration cost.
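The 10-hour figure is easy to sanity-check with back-of-envelope arithmetic. The per-call network overhead below is an assumed value, consistent with the "at least milliseconds" RPC latency noted above:

```python
calls = 10_000_000      # "more than 10 million" offline calls
overhead_ms = 3.6       # assumed network overhead per RPC call

# Total network time eliminated by switching to native calls.
saved_hours = calls * overhead_ms / 1000 / 3600
print(f"{saved_hours:.1f} hours saved")  # 10.0 hours saved
```

Even at a conservative 1–2 ms per call, the savings remain on the order of several hours for a run of this size.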
5.2 Application in Online Services
As a basic service platform within Meituan, query understanding provides text analysis capabilities such as word segmentation, query error correction, query rewriting, landmark recognition, region recognition, intent recognition, entity recognition, and entity linking. It is a fairly large CPU-intensive service, handling text analysis for many business scenarios across the company. Some of these consumers need only individual signals, or even only the basic function components inside the query understanding service. Most business services are developed in Java and cannot link the C++ dynamic library directly, so they previously obtained results through RPC calls. With the work above, non-C++ caller services can convert those RPC calls into cross-language native calls, which significantly improves the caller's performance and success rate, and also effectively reduces resource overhead on the server side.
6 Summary
The development of technologies such as microservices has made services easier to create, publish, and access. In actual industrial production, however, not every scenario suits computing through an RPC service. In compute-intensive, latency-sensitive business scenarios in particular, once performance becomes the bottleneck, the network overhead of remote calls becomes a pain the business can no longer bear. This article has summarized cross-language native invocation techniques and shared some practical experience, in the hope of helping you solve similar problems.
Of course, this work still has many shortcomings. For example, driven by the needs of our production environment, the work is basically confined to Linux; if it were released as an open library for free use, compatibility with DLLs on Windows, dylibs on macOS, and so on might need to be considered. There may be other deficiencies in this article as well; you are welcome to leave comments to correct and discuss them.
The source code of the examples in this article is available on GitHub.
7 References
- JNI memory related documentation
- JNI type mapping
- JNA open source address
- Linux dlopen
- Linux dlclose
- Linux dlsym
- CPython source code
- Introduction to ctypes in CPython
- CTypes Struct implementation
- Python project distribution and packaging
- C and C++ function signatures
- JNI, JNA and JNR performance comparison
8 About the authors
Lin Yang, Zhu Chao, and Shi Han are all from Meituan Platform/Search and NLP Department/Search Technology Department.
| This article is produced by Meituan technical team, and the copyright belongs to Meituan. Welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, please indicate "The content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial activities, please send an email to tech@meituan.com to apply for authorization.