Abstract: This article describes how to add a new hardware backend to MindSpore.

This article is shared from the Huawei Cloud community post "How to add a new hardware backend to MindSpore? Build a test environment quickly!", original author: HWCloudAI.

MindSpore is a new-generation open-source AI computing framework developed by Huawei. As the full-scenario deep learning framework that best matches the computing power of the Ascend AI processor, it provides data scientists and algorithm engineers with a design-friendly and efficient development experience, and promotes the prosperity of the AI software and hardware application ecosystem.

MindSpore supports heterogeneous computing power. In addition to Huawei's self-developed Ascend NPU based on the Da Vinci architecture, it also supports CPU operators (e.g. MKLDNN) and GPU operators (e.g. CUDA kernels). (Note: MindSpore runs the entire network on one hardware platform; it does not support running different partitions of the same network on different hardware platforms, which differs from TensorFlow's graph-partition heterogeneous execution mode.)

The current AI chip industry is "extraordinarily lively." Manufacturers at home and abroad, old and new, are all launching their own AI acceleration chips. By now it should be clear to everyone that for hardware to succeed, it needs the support of a software stack and an ecosystem. MindSpore not only serves Huawei's own AI software and hardware stack, but also aims to occupy its own place in the broader AI ecosystem.

MindSpore is still in the stage of promotion and development. This article introduces how to add a new hardware backend to MindSpore and, along the way, gives a basic tour of the MindSpore source-code directory structure. The goal is to give interested developers some useful information and references, so that everyone can use MindSpore as a framework for testing and integrating AI chips and quickly build a test environment for a full network model.

This article is aimed at the source code of MindSpore r1.1 version: https://gitee.com/mindspore/mindspore/tree/r1.1/
For how to compile and install MindSpore from the source code and the requirements for the relevant software version, please refer to: https://www.mindspore.cn/install/

Test case

This article will focus on a simple single-layer network, nn.Dense: https://www.mindspore.cn/doc/api_python/zh-CN/r1.1/mindspore/nn/mindspore.nn.Dense.html#mindspore.nn.Dense
to demonstrate how to make this layer run on a new hardware backend.

Note: This article targets the basic static-graph execution mode: https://www.mindspore.cn/doc/programming_guide/zh-CN/r1.1/context.html

import mindspore
import numpy as np
import mindspore.nn as nn
from mindspore import context, Tensor

context.set_context(device_target="CPU", mode=context.GRAPH_MODE)

# Dense layer: input dimension 32, output dimension 16
net = nn.Dense(32, 16, weight_init='ones', bias_init=1.2)#, activation='relu')

# Input: a 48 x 32 matrix of ones
input_data = Tensor(np.ones([48, 32]).astype(np.float32), mindspore.float32)
output = net(input_data)

print(output.asnumpy())

Note: I commented out the ReLU activation here, so this Dense layer is equivalent to a small network with only two operators (MatMul + BiasAdd). The result of this use case is a 48 × 16 two-dimensional matrix in which every element has the value 33.2.

This article takes a top-to-bottom approach and introduces the components that need to be modified for MindSpore to support a new hardware backend. Here we call the new hardware to be supported XPU. The effect we want after modifying the MindSpore code is to change the device_target in the above use case to XPU and have the Dense layer run on the XPU accelerator, e.g.

context.set_context(device_target="XPU", mode=context.GRAPH_MODE)

Note: This article does not show the implementation details of specific classes and functions. For concrete implementations, please refer to the existing supported hardware backends in the corresponding directories, such as CPU, GPU, and Ascend.

Add new device target parameter option
First, you need to add the new backend to valid_targets in the front-end ME Python layer:

https://gitee.com/mindspore/mindspore/blob/r1.1/mindspore/context.py

def set_device_target(self, target):
        valid_targets = ["CPU", "GPU", "Ascend", "Davinci", "XPU"]  # add the new backend to this list
        if not target in valid_targets:
            raise ValueError(f"Target device name {target} is invalid! It must be one of {valid_targets}")
        if target == "Davinci":
            target = "Ascend"
        self.set_param(ms_ctx_param.device_target, target)
        if self.enable_debug_runtime and target == "CPU":
            self.set_backend_policy("vm") 

Then you need to add a new target in the ms context component of C++: https://gitee.com/mindspore/mindspore/blob/r1.1/mindspore/core/utils/ms_context.h

const int kGraphMode = 0;
const int kPynativeMode = 1;
const char kCPUDevice[] = "CPU";
const char kGPUDevice[] = "GPU";
const char kXPUDevice[] = "XPU";  // add the new hardware target
const char kAscendDevice[] = "Ascend";
const char kDavinciInferenceDevice[] = "AscendInference";
const char kDavinciDevice[] = "Davinci";
const char KNpuLog[] = "_npu_log";
const unsigned int MAX_CALL_DEPTH_DEFAULT = 1000;

// add the new hardware to the following set
const std::set<std::string> kTargetSet = {kCPUDevice, kGPUDevice, kXPUDevice, kAscendDevice, kDavinciDevice};

Add a new runtime device

The runtime device directory, https://gitee.com/mindspore/mindspore/tree/r1.1/mindspore/ccsrc/runtime/device, holds the components related to each specific backend's hardware features, such as the device-side address space, device-side memory management (allocation and recycling), and the kernel runtime components, as well as some hardware-related communication components, such as the MPI components that support distributed communication. We first add a folder called xpu under this directory (remember to modify the CMakeLists.txt to add the new folder):

The following introduces the three new basic components to be created for the xpu accelerator:

· xpu_device_address: mainly represents the memory address information on the device side of the accelerator and provides the API for memory transfer between the host side and the device side. On an NVIDIA GPU, for example, it could be a wrapper around cudaMemcpyAsync. xpu_device_address.h:

#include <string>
#include <vector>
#include "runtime/device/device_address.h"
#include "utils/shape_utils.h"

namespace mindspore {
namespace device {
namespace xpu {
class XPUDeviceAddress : public DeviceAddress {
 public:
  XPUDeviceAddress(void *ptr, size_t size) : DeviceAddress(ptr, size) {}

  XPUDeviceAddress(void *ptr, size_t size, const string &format, TypeId type_id)
      : DeviceAddress(ptr, size, format, type_id) {}

  ~XPUDeviceAddress() override = default;

  bool SyncDeviceToHost(const ShapeVector &shape, size_t size, TypeId type, void *host_ptr) const override;
  bool SyncHostToDevice(const ShapeVector &shape, size_t size, TypeId type, const void *host_ptr) const override;
  DeviceAddressType DeviceType() const override { return DeviceAddressType::kXPU; }
};
}  // namespace xpu
}  // namespace device
}  // namespace mindspore
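
To make the role of this class concrete, here is a minimal sketch of what the corresponding xpu_device_address.cc could look like. The xpuMemcpy call and the XPU_COPY_* direction flags are hypothetical placeholders for whatever memory-copy API the real device toolchain provides; they are not part of MindSpore.

// xpu_device_address.cc -- illustrative sketch only.
// xpuMemcpy / XPU_COPY_* are hypothetical placeholders for the real device API.
#include "runtime/device/xpu/xpu_device_address.h"
#include "utils/log_adapter.h"

namespace mindspore {
namespace device {
namespace xpu {
bool XPUDeviceAddress::SyncDeviceToHost(const ShapeVector &, size_t size, TypeId, void *host_ptr) const {
  if (ptr_ == nullptr || host_ptr == nullptr) {
    MS_LOG(ERROR) << "Invalid pointer for device-to-host copy.";
    return false;
  }
  // Copy 'size' bytes from device memory (ptr_) back into the host buffer.
  return xpuMemcpy(host_ptr, ptr_, size, XPU_COPY_DEVICE_TO_HOST) == 0;
}

bool XPUDeviceAddress::SyncHostToDevice(const ShapeVector &, size_t size, TypeId, const void *host_ptr) const {
  if (ptr_ == nullptr || host_ptr == nullptr) {
    MS_LOG(ERROR) << "Invalid pointer for host-to-device copy.";
    return false;
  }
  // Copy 'size' bytes from the host buffer into device memory (ptr_).
  return xpuMemcpy(ptr_, host_ptr, size, XPU_COPY_HOST_TO_DEVICE) == 0;
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore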

· xpu_resource_manager: mainly responsible for the management, allocation, and scheduling of memory and other resources on the device side. xpu_resource_manager.h:

#include <vector>
#include <map>
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/device_address.h"
#include "runtime/device/xpu/xpu_simple_mem_plan.h"
namespace mindspore {
namespace device {
namespace xpu {
class XPUResourceManager {
 public:
  XPUResourceManager() = default;
  ~XPUResourceManager();

  void AssignMemory(const session::KernelGraph *graph);
  void IncreaseAddressRefCount(const session::KernelGraph *graph);
  void DecreaseAddressRefCount(const AnfNodePtr &kernel);
  void *MemMalloc(size_t mem_size);
  void MemFree(void *ptr);

 private:
  void MemFree();
  XPUSimpleMemPlan mem_plan_;

  size_t mem_size_{0};
  uint8_t *mem_ptr_{nullptr};
  bool dynamic_malloc_{false};
  std::map<void *, size_t> dynamic_mem_;
};
}  // namespace xpu
}  // namespace device
}  // namespace mindspore
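
As a reference point, below is a minimal sketch of the dynamic-allocation part of xpu_resource_manager.cc. The xpuMalloc and xpuFree calls are hypothetical placeholders for the real XPU allocator API.

// Fragment of xpu_resource_manager.cc -- illustrative sketch only.
// xpuMalloc / xpuFree are hypothetical placeholders for the real XPU allocator API.
void *XPUResourceManager::MemMalloc(size_t mem_size) {
  void *ptr = nullptr;
  if (xpuMalloc(&ptr, mem_size) != 0) {
    MS_LOG(EXCEPTION) << "Failed to allocate " << mem_size << " bytes on the XPU device.";
  }
  dynamic_mem_[ptr] = mem_size;  // remember the block so it can be released later
  return ptr;
}

void XPUResourceManager::MemFree(void *ptr) {
  auto iter = dynamic_mem_.find(ptr);
  if (iter != dynamic_mem_.end()) {
    (void)xpuFree(ptr);
    dynamic_mem_.erase(iter);
  }
}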

· xpu_kernel_runtime: the execution control module for the hardware operators, mainly responsible for starting the hardware runtime (Init()), executing the network on the hardware (Run(..)), and cleaning up device resources after execution (ReleaseDeviceRes()). xpu_kernel_runtime.h:

#include <memory>
#include <vector>
#include <string>
#include <map>
#include <set>
#include "runtime/device/kernel_runtime.h"
#include "runtime/device/kernel_runtime_manager.h"
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/xpu/xpu_resource_manager.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/any.h"
namespace mindspore {
namespace device {
namespace xpu {
class XPUKernelRuntime : public KernelRuntime {
 public:
  XPUKernelRuntime() = default;
  ~XPUKernelRuntime() override = default;

  bool Init() override;
  void ReleaseDeviceRes() override;
  bool Run(session::KernelGraph *graph, bool is_task_sink) override;
  void AssignKernelAddress(session::KernelGraph *kernel_graph);
  void CreateOutputTensors(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
                           VectorRef *outputs);
  void BindInputOutput(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
                       VectorRef *outputs);

 protected:
  bool SyncStream() override { return true; };
  DeviceAddressPtr CreateDeviceAddress(void *device_ptr, size_t device_size, const string &format,
                                       TypeId type_id) override;

 private:
  XPUResourceManager resource_manager_;
  std::set<DeviceAddressPtr> bound_addresses_;
  std::map<AnfNodePtr, tensor::TensorPtr> input_param_tensor_map_;
};

MS_REG_KERNEL_RUNTIME(kXPUDevice, XPUKernelRuntime);

}  // namespace xpu
}  // namespace device
}  // namespace mindspore
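
The heart of the kernel runtime is the Run(..) loop that walks the compiled graph and launches each kernel in topological order. A minimal sketch is shown below; GetKernelAddresses(..) is a hypothetical helper standing in for the address-gathering logic (a real implementation reads the device addresses assigned in AssignKernelAddress(..) from each node's kernel info), and error handling and profiling are omitted.

// Fragment of xpu_kernel_runtime.cc -- illustrative sketch of the graph execution loop.
bool XPUKernelRuntime::Run(session::KernelGraph *graph, bool /*is_task_sink*/) {
  MS_EXCEPTION_IF_NULL(graph);
  auto kernels = graph->execution_order();  // topologically sorted op nodes of the graph
  for (const auto &kernel : kernels) {
    auto kernel_mod = AnfAlgo::GetKernelMod(kernel);  // the XPU kernel built during graph compilation
    MS_EXCEPTION_IF_NULL(kernel_mod);
    std::vector<kernel::AddressPtr> inputs;
    std::vector<kernel::AddressPtr> workspaces;
    std::vector<kernel::AddressPtr> outputs;
    // Hypothetical helper: collect the device addresses previously planned for this node.
    GetKernelAddresses(kernel, &inputs, &workspaces, &outputs);
    if (!kernel_mod->Launch(inputs, workspaces, outputs, nullptr)) {
      MS_LOG(ERROR) << "Launch kernel failed: " << kernel->fullname_with_scope();
      return false;
    }
  }
  return true;
}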

Add new target session

A MindSpore Session provides the environment for op kernel execution and tensor evaluation. It is the core module that controls the data-flow graph representing the neural network, with three main steps: graph compilation (kernel generation), graph optimization, and graph execution. MindSpore has its own Session component for each backend hardware platform. The relevant code is in the backend/session directory: https://gitee.com/mindspore/mindspore/tree/r1.1/mindspore/ccsrc/backend/session
We create a new session class for xpu: xpu_session.h

#include <string>
#include <memory>
#include <map>
#include <vector>
#include "backend/session/session_basic.h"
#include "backend/session/kernel_graph.h"
#include "runtime/device/xpu/xpu_kernel_runtime.h" // use the new xpu kernel runtime
#include "backend/session/session_factory.h"
namespace mindspore {
namespace session {
class XPUSession : public SessionBasic {
 public:
  XPUSession() = default;
  ~XPUSession() override = default;
  void Init(uint32_t device_id) override { InitExecutor(kXPUDevice, device_id); }

  GraphId CompileGraphImpl(const AnfNodePtrList &lst, const AnfNodePtrList &outputs) override;
  void RunGraphImpl(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &inputs, VectorRef *outputs) override;
  void Optimize(const std::shared_ptr<KernelGraph> &kernel_graph);

 protected:
  void UnifyMindIR(const KernelGraphPtr &graph) override { return; }
  void CreateOutputTensors(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &input_tensors, VectorRef *,
                           std::map<tensor::TensorPtr, session::KernelWithIndex> *tensor_to_node) override;

 private:
  void SetKernelInfo(const KernelGraph *kernel_graph);
  void BuildKernel(const KernelGraph *kernel_graph);
  device::xpu::XPUKernelRuntime *runtime_ = dynamic_cast<device::xpu::XPUKernelRuntime *>(
    device::KernelRuntimeManager::Instance().GetKernelRuntime(kXPUDevice, 0));
};
MS_REG_SESSION(kXPUDevice, XPUSession);
}  // namespace session
}  // namespace mindspore

In the graph compilation step (CompileGraphImpl(..)), the main purpose is to generate the kernel corresponding to each op node in the neural network data-flow graph (BuildKernel(..)) and to save each node's kernel information in the graph (SetKernelInfo(..)), so that it can be used later in the graph execution step (RunGraphImpl(..)).
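
A minimal sketch of that compile step could look like the fragment below. It is modeled on the existing CPU session and relies only on members already declared above (ConstructKernelGraph(..) is a helper provided by SessionBasic); a real implementation adds error handling, summary nodes, and more.

// Fragment of xpu_session.cc -- illustrative sketch of the graph compilation skeleton.
GraphId XPUSession::CompileGraphImpl(const AnfNodePtrList &lst, const AnfNodePtrList &outputs) {
  // Build a KernelGraph from the ANF node list (helper provided by SessionBasic).
  auto graph = ConstructKernelGraph(lst, outputs);
  MS_EXCEPTION_IF_NULL(graph);
  SetKernelInfo(graph.get());                   // pick a kernel (dtype/format) for every node
  Optimize(graph);                              // backend-specific graph optimization passes
  BuildKernel(graph.get());                     // instantiate the XPU kernel object for every node
  runtime_->AssignKernelAddress(graph.get());   // plan device memory for inputs/outputs/workspaces
  return graph->graph_id();
}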

Add kernel for new hardware

Each hardware backend supported by MindSpore implements its op kernels under the backend/kernel_compiler directory: https://gitee.com/mindspore/mindspore/tree/r1.1/mindspore/ccsrc/backend/kernel_compiler

Here we can see that, across the several hardware backends, each folder holds a different type of kernel, among which:

  • cpu: There are operators that call MKLDNN (oneDNN), as well as operators written in pure C++.
  • gpu: There are operators that call cudnn/cublas, operators written in cuda, and NCCL-related operators that support distributed training.
  • Ascend: The operator kernel folders related to the Huawei DaVinci AI chip are: tbe, aicpu, akg, hccl, etc.

Let's introduce the components needed to add kernel support for our new hardware backend. First, create a folder called xpu in the directory above (remember to modify the CMakeLists.txt to add the new folder). In the new folder, we start by creating the base class for xpu kernels:

xpu_kernel.h:

#include <string>
#include <vector>
#include <memory>
#include <numeric>
#include <functional>
#include "backend/kernel_compiler/kernel.h"
#include "ir/anf.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/ms_utils.h"

using mindspore::kernel::Address;
using mindspore::kernel::AddressPtr;
namespace mindspore {
namespace kernel {

class XPUKernel : public kernel::KernelMod {
 public:
  XPUKernel() = default;
  ~XPUKernel() override = default;

  void Init(const CNodePtr &kernel_node);
  virtual void InitKernel(const CNodePtr &kernel_node) = 0;
  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs, void * stream_ptr) override {
    return Launch(inputs, workspace, outputs);
  };

  virtual bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
                      const std::vector<AddressPtr> &outputs) = 0;
  const std::vector<size_t> &GetInputSizeList() const override { return input_size_list_; }
  const std::vector<size_t> &GetOutputSizeList() const override { return output_size_list_; }
  const std::vector<size_t> &GetWorkspaceSizeList() const override { return workspace_size_list_; }

  void SetOpName(const std::string &op_name) { op_name_ = op_name; }
  const std::string GetOpName() const { return op_name_; }

 protected:
  virtual void InitInputOutputSize(const CNodePtr &kernel_node);
  std::vector<size_t> input_size_list_ = {};
  std::vector<size_t> output_size_list_ = {};
  std::vector<size_t> workspace_size_list_ = {};

  std::string bin_path_ = {};
  std::string tilingName_ = {};

};
}  // namespace kernel
}  // namespace mindspore

Popular frameworks generally name each operator kernel after the operator itself (its opcode), e.g. the MKLDNN (oneDNN) cpu kernels in MindSpore. The advantage of this scheme is that the repository layout is very clear and the specific attributes of each operator are easy to express; the disadvantage is that there may be some duplicated code logic. Since the use case targeted in this article is very simple and only two operators need to be supported, MatMul and BiasAdd, we will instead name our kernel class according to its number of input and output tensors.

Since both MatMul and BiasAdd are operators with 2 inputs and 1 output, we define our kernel class name: two_in_one_out_xpu_kernel.h

#include "backend/kernel_compiler/xpu/xpu_kernel.h" // xpu kernel base class
#include "backend/kernel_compiler/xpu/xpu_kernel_factory.h"

#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <dirent.h>
#include <algorithm>

#include <fstream>
#include <iostream>

namespace mindspore {
namespace kernel {

class TwoInOneOutXPUKernel : public XPUKernel {
 public:
  TwoInOneOutXPUKernel() = default;
  ~TwoInOneOutXPUKernel() override = default;

  void InitKernel(const CNodePtr &kernel_node) override;

  bool Launch(const std::vector<AddressPtr> &inputs,
              const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs) override;

 private:
  bool NeedsFormatTransformation();

  char trans_a_{TRANSPOSE_NO};
  char trans_b_{TRANSPOSE_NO};
  int32_t dim_m_{0};
  int32_t dim_n_{0};
  int32_t dim_k_{0};

  std::vector<size_t> inputA_shape_;
  std::vector<size_t> inputB_shape_;
  std::vector<size_t> output_shape_;

  size_t input_a_size_ = 0;
  size_t input_b_size_ = 0;
  size_t output_size_ = 0;

  void *inputA_data_ = nullptr;
  void *inputB_data_ = nullptr;
  void *output_data_ = nullptr;
};

MS_REG_XPU_KERNEL(
  TwoInOneOutXPU,
  mindspore::device::xpu::KernelAttr().AddInputAttr(kNumberTypeFloat32).AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
  TwoInOneOutXPUKernel);
}  // namespace kernel
}  // namespace mindspore

Here we have used "backend/kernel_compiler/xpu/xpu_kernel_factory.h". We will not go into the details of creating the kernel factory class; for details please refer to cpu_kernel_factory.h: https://gitee.com/mindspore/mindspore/blob/r1.1/mindspore/ccsrc/backend/kernel_compiler/cpu/cpu_kernel_factory.h
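
For readers who want a rough picture of what such a factory does, here is a deliberately simplified sketch. The real cpu_kernel_factory additionally records the KernelAttr (dtype/format signature) of every registration so that SetKernelInfo(..) can match graph nodes against registered kernels; that part is omitted here, and the KernelAttr argument of the macro is simply ignored.

// xpu_kernel_factory.h -- deliberately simplified sketch modeled on cpu_kernel_factory.h.
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include "backend/kernel_compiler/xpu/xpu_kernel.h"

namespace mindspore {
namespace device {
namespace xpu {
using XPUKernelCreator = std::function<std::shared_ptr<kernel::XPUKernel>()>;

class XPUKernelFactory {
 public:
  static XPUKernelFactory &GetInstance() {
    static XPUKernelFactory instance;
    return instance;
  }
  // Map "op name" -> creator lambda.
  void Register(const std::string &op_name, XPUKernelCreator &&creator) {
    (void)creators_.emplace(op_name, std::move(creator));
  }
  std::shared_ptr<kernel::XPUKernel> Create(const std::string &op_name) {
    auto iter = creators_.find(op_name);
    return iter == creators_.end() ? nullptr : (iter->second)();
  }

 private:
  std::map<std::string, XPUKernelCreator> creators_;
};

// One static registrar object per kernel class, so registration happens
// automatically when the library is loaded.
class XPUKernelRegistrar {
 public:
  XPUKernelRegistrar(const std::string &op_name, XPUKernelCreator &&creator) {
    XPUKernelFactory::GetInstance().Register(op_name, std::move(creator));
  }
};
}  // namespace xpu
}  // namespace device
}  // namespace mindspore

// Simplified registration macro (the ATTR argument is ignored in this sketch).
#define MS_REG_XPU_KERNEL(OPNAME, ATTR, OPCLASS)                                       \
  static const mindspore::device::xpu::XPUKernelRegistrar g_##OPNAME##_xpu_kernel_reg( \
    #OPNAME, []() { return std::make_shared<OPCLASS>(); });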

The two most basic functions of each kernel are InitKernel(..) and Launch(..), which are responsible for the initialization and the execution of the kernel respectively. Note that for typical static-graph execution such as a CNN, InitKernel(..) runs only once, when the kernel is created (during the compile-graph phase of the session described above), while Launch(..) is called during every graph execution. For example, to run CNN inference on 64 images with a network batch size of 32, the whole graph needs to be executed twice, which means that for each kernel InitKernel(..) is called once and Launch(..) is called twice. (A combined sketch that uses the APIs listed below is given after the list.)

We will not go into the specific implementation of the MatMul and BiasAdd kernels, but only introduce some basic APIs that operator kernels in MindSpore need:

· Get the input and output shape information in TwoInOneOutXPUKernel:

inputA_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
inputB_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
output_shape_ = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);

· Get operator attribute information, e.g. the transpose flags of MatMul:

bool trans_a = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_A);
bool trans_b = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_B);

· Get the input and output memory pointers in Launch(..):

auto input_a = reinterpret_cast<float *>(inputs[0]->addr);
auto input_b = reinterpret_cast<float *>(inputs[1]->addr);
auto output = reinterpret_cast<float *>(outputs[0]->addr);
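
Putting these APIs together, a minimal sketch of our kernel's MatMul path could look like the fragment below. The xpuLaunchMatMul and xpuLaunchBiasAdd calls are hypothetical placeholders for the real device launch API, and TRANSPOSE_YES is assumed to be defined alongside TRANSPOSE_NO (as in the CPU kernels). Everything that depends only on the compiled graph goes into InitKernel(..); everything tied to the current execution goes into Launch(..).

// Fragment of two_in_one_out_xpu_kernel.cc -- illustrative sketch only.
void TwoInOneOutXPUKernel::InitKernel(const CNodePtr &kernel_node) {
  // One-time work: shapes and attributes are fixed by the compiled graph.
  SetOpName(AnfAlgo::GetCNodeName(kernel_node));
  inputA_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
  inputB_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
  output_shape_ = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);
  if (GetOpName() == "MatMul") {
    bool trans_a = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_A);
    bool trans_b = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_B);
    trans_a_ = trans_a ? TRANSPOSE_YES : TRANSPOSE_NO;
    trans_b_ = trans_b ? TRANSPOSE_YES : TRANSPOSE_NO;
    dim_m_ = static_cast<int32_t>(output_shape_[0]);
    dim_n_ = static_cast<int32_t>(output_shape_[1]);
    dim_k_ = static_cast<int32_t>(trans_a ? inputA_shape_[0] : inputA_shape_[1]);
  }
}

bool TwoInOneOutXPUKernel::Launch(const std::vector<AddressPtr> &inputs,
                                  const std::vector<AddressPtr> & /*workspace*/,
                                  const std::vector<AddressPtr> &outputs) {
  // Per-execution work: fetch the device pointers assigned for this run and launch the op.
  auto input_a = reinterpret_cast<float *>(inputs[0]->addr);
  auto input_b = reinterpret_cast<float *>(inputs[1]->addr);
  auto output = reinterpret_cast<float *>(outputs[0]->addr);
  if (GetOpName() == "MatMul") {
    // Hypothetical device call: C[m, n] = A x B with the recorded transpose flags.
    return xpuLaunchMatMul(input_a, input_b, output, dim_m_, dim_n_, dim_k_, trans_a_, trans_b_) == 0;
  }
  // Otherwise this node is BiasAdd: broadcast-add input_b (the bias) onto input_a.
  return xpuLaunchBiasAdd(input_a, input_b, output, output_shape_.data(), output_shape_.size()) == 0;
}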

Other matters needing attention

Like other mainstream frameworks, MindSpore has some standards and conventions of its own. Here are a few of the "pits" I have stepped on, shared for your reference:

· The default Tensor format in MindSpore is NCHW. If the format supported by the hardware backend you are adding is different, pay attention to adding format conversion. The conversion can be done before and after every kernel call (inefficient), or a graph-optimization pass can insert format-conversion nodes efficiently with the whole network in view; a simple host-side version of the former is sketched below.
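
As an illustration of the simple "convert around each kernel call" option, a host-side NCHW to NHWC copy of a float tensor could look like this (a production backend would rather insert transpose nodes through a graph pass):

// Host-side layout conversion sketch: NCHW -> NHWC for a float tensor.
#include <cstddef>

void NchwToNhwc(const float *src, float *dst, size_t n, size_t c, size_t h, size_t w) {
  for (size_t ni = 0; ni < n; ++ni) {
    for (size_t ci = 0; ci < c; ++ci) {
      for (size_t hi = 0; hi < h; ++hi) {
        for (size_t wi = 0; wi < w; ++wi) {
          // source index in NCHW layout, destination index in NHWC layout
          dst[((ni * h + hi) * w + wi) * c + ci] = src[((ni * c + ci) * h + hi) * w + wi];
        }
      }
    }
  }
}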

· Precision conversion: if your hardware platform only supports certain precisions, such as fp16, while the network is fp32, pay attention to precision conversion. Similar to the format conversion above, it can be done on the host side or on the device side (if the hardware supports it).

· For each kernel, the code logic needs to distinguish which data never changes and which data changes and must be reinitialized before every execution, so that the different pieces of logic can be correctly and reasonably assigned to InitKernel(..) or Launch(..).

· For some Python front-end layer APIs, MindSpore has its own property conventions. For example, in the Dense layer (https://gitee.com/mindspore/mindspore/blob/r1.1/mindspore/nn/layer/basic.py) the second input matrix of MatMul is transposed:

self.matmul = P.MatMul(transpose_b=True)
self.batch_matmul = P.BatchMatMul(transpose_b=True)
self.activation = get_activation(activation) if isinstance(activation, str) else activation
if activation is not None and not isinstance(self.activation, (Cell, Primitive)):
    raise TypeError("The activation must be str or Cell or Primitive,"" but got {}.".format(activation))
self.activation_flag = self.activation is not None

· For debugging, you can set the following environment variables to help output more information:

export GLOG_v=1
export SLOG_PRINT_TO_STDOUT=1

· When modifying the CMake files, you can add the newly added files under if (ENABLE_CPU) to get started with testing. The CPU is effectively a baseline platform for MindSpore, which means that whether you build the GPU target or the Huawei D/Ascend target, the CPU-related files will always be built.

Summary

This article, based on the author's understanding of MindSpore, is a technical walkthrough of how to modify the MindSpore source code to add a new hardware backend. The success of an open-source software framework is inseparable from the support of the community and the participation of the various vendors. I hope this article can serve as an inspiration for more hardware vendors and developers to join the MindSpore ecosystem, and everyone is welcome to discuss it together. Finally, I wish you all a happy new year, and may MindSpore become better and stronger in 2021!
