TKE qGPU manages cluster GPU card resources through CRD

Author

Liu Xu, a senior engineer at Tencent Cloud, focuses on cloud-native container technology. He has many years of experience managing large-scale Kubernetes clusters and is currently responsible for the research and development of Tencent Cloud GPU containers.

Background

TKE currently provides a qGPU-based shared GPU scheduling and isolation solution with strong isolation of computing power and video memory. However, some users have reported a lack of observability into GPU resources. For example, they cannot obtain the remaining resources of a single GPU device, which makes GPU resources harder to operate and manage. To address this, we want to provide a solution that lets users intuitively count and query GPU resource usage in a Kubernetes cluster.

Goals

Building on the current TKE shared GPU scheduling scheme, we enhance the observability of GPU devices in the following ways:

Support obtaining the resource allocation information of a single GPU device.
Support obtaining the health status of a single GPU device.
Support obtaining the information of every GPU device on a node.
Support obtaining the association between GPU devices and Pods/Containers.

Our solution

We use a GPU CRD to record the information of each physical GPU, and we update the used physical GPU resources throughout the qGPU life cycle. This solves the lack of visibility in shared GPU scenarios.

Custom GPU CRD: each GPU device corresponds to one GPU object, through which the hardware information, health status, and resource allocation of the device can be obtained.
Elastic GPU Device Plugin: creates a GPU object from the hardware information of each GPU device and periodically updates the health status of the device.
Elastic GPU Scheduler: schedules Pods based on GPU resource usage and writes the scheduling results back to the GPU objects.

TKE GPU CRD Design

apiVersion: elasticgpu.io/v1alpha1
kind: GPU
metadata:
  labels:
    elasticgpu.io/node: 10.0.0.2
  name: 192.168.2.5-00
spec:
  index: 0
  memory: 34089730048
  model: Tesla V100-SXM2-32GB
  nodeName: 10.0.0.2
  path: /dev/nvidia0
  uuid: GPU-cf0f5fe7-0e15-4915-be3c-a6d976d65ad4
status:
  state: Healthy
  allocatable:
    tke.cloud.tencent.com/qgpu-core: "50"
    tke.cloud.tencent.com/qgpu-memory: "23"
  allocated:
    0dc3c905-2955-4346-b74e-7e65e29368d2:
      containers:
      - container: test
        resource:
          tke.cloud.tencent.com/qgpu-core: "50"
          tke.cloud.tencent.com/qgpu-memory: "8"
      namespace: default
      pod: test
  capacity:
    tke.cloud.tencent.com/qgpu-core: "100"
    tke.cloud.tencent.com/qgpu-memory: "31"

Each physical GPU card corresponds to one GPU object. The spec of the object gives the hardware information of the card, such as its model and video memory, while status gives the health status and resource allocation of the device.
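As a worked example of reading the status fields, the sketch below recomputes the remaining resources of the GPU object shown above. It assumes that allocatable equals capacity minus the sum of all container allocations. The example values are consistent with this (100 - 50 = 50 for qgpu-core, 31 - 8 = 23 for qgpu-memory), but the article does not state the rule explicitly. The helper name remaining is hypothetical, and the object is reduced to a plain dict rather than fetched from a cluster.

```python
# Illustrative only: the status of the GPU object above as a plain dict.
CORE = "tke.cloud.tencent.com/qgpu-core"
MEMORY = "tke.cloud.tencent.com/qgpu-memory"

gpu_status = {
    "state": "Healthy",
    "capacity": {CORE: "100", MEMORY: "31"},
    "allocated": {
        "0dc3c905-2955-4346-b74e-7e65e29368d2": {
            "namespace": "default",
            "pod": "test",
            "containers": [
                {"container": "test", "resource": {CORE: "50", MEMORY: "8"}},
            ],
        },
    },
}


def remaining(status):
    """Hypothetical helper: capacity minus the sum of all container allocations."""
    left = {name: int(qty) for name, qty in status["capacity"].items()}
    for pod in status.get("allocated", {}).values():
        for container in pod["containers"]:
            for name, qty in container["resource"].items():
                left[name] = left.get(name, 0) - int(qty)
    return left


print(remaining(gpu_status))
```

For the object above this yields 50 for qgpu-core and 23 for qgpu-memory, matching the status.allocatable values in the manifest.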

TKE GPU scheduling process

Kubernetes provides the Scheduler Extender mechanism to extend the scheduler for complex scheduling scenarios. After running its built-in filtering (predicate) and scoring (priority) strategies, the extended scheduler calls the extender over HTTP to filter and score again, and finally selects a suitable Node for the Pod.

In the TKE Elastic GPU Scheduler (formerly the TKE qGPU Scheduler), we build on the GPU CRD design. During scheduling, we first filter out abnormal GPU devices according to status.state, then select the GPU devices that satisfy the request according to status.allocatable. When scheduling finally completes, we update status.allocatable and status.allocated.
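The filter-then-update flow described above can be sketched as follows. This is a minimal illustration on plain dicts, not the actual Elastic GPU Scheduler code: the function names fits, filter_gpus, and bind are hypothetical, the state value "Unhealthy" is an assumption (the article only shows "Healthy"), and a real scheduler would write the result back through the Kubernetes API instead of returning a copy.

```python
import copy

CORE = "tke.cloud.tencent.com/qgpu-core"
MEMORY = "tke.cloud.tencent.com/qgpu-memory"


def fits(gpu, request):
    """True if the GPU is healthy and has enough allocatable resources."""
    status = gpu["status"]
    if status["state"] != "Healthy":
        return False
    return all(
        int(status["allocatable"].get(name, "0")) >= qty
        for name, qty in request.items()
    )


def filter_gpus(gpus, request):
    """Keep only the GPUs that can serve the request."""
    return [gpu for gpu in gpus if fits(gpu, request)]


def bind(gpu, pod_uid, namespace, pod, container, request):
    """Return a copy of the GPU with allocatable reduced and allocated extended."""
    updated = copy.deepcopy(gpu)
    status = updated["status"]
    for name, qty in request.items():
        status["allocatable"][name] = str(int(status["allocatable"][name]) - qty)
    entry = status.setdefault("allocated", {}).setdefault(
        pod_uid, {"namespace": namespace, "pod": pod, "containers": []}
    )
    entry["containers"].append(
        {"container": container, "resource": {n: str(q) for n, q in request.items()}}
    )
    return updated


# Two made-up GPU objects: one healthy with free resources, one unhealthy.
gpus = [
    {"metadata": {"name": "gpu-a"},
     "status": {"state": "Healthy",
                "allocatable": {CORE: "50", MEMORY: "23"}, "allocated": {}}},
    {"metadata": {"name": "gpu-b"},
     "status": {"state": "Unhealthy",
                "allocatable": {CORE: "100", MEMORY: "31"}, "allocated": {}}},
]
request = {CORE: 30, MEMORY: 10}
candidates = filter_gpus(gpus, request)
chosen = bind(candidates[0], "pod-uid-1", "default", "demo", "main", request)
print(chosen["metadata"]["name"], chosen["status"]["allocatable"])
```

Here gpu-b is rejected because of its state even though it has more free resources, and gpu-a ends up with 20 qgpu-core and 13 qgpu-memory still allocatable.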

TKE GPU allocation process

Kubernetes provides the Device Plugin mechanism to support hardware devices such as GPUs and FPGAs. Device vendors only need to implement a Device Plugin against the interface, without modifying the Kubernetes source code. A Device Plugin generally runs on nodes as a DaemonSet.

When the TKE Elastic GPU Device Plugin (formerly the TKE qGPU Device Plugin) starts, it creates a GPU object for each GPU device on the node based on the hardware information of the device. It also periodically checks the health of each GPU device and synchronizes the result to status.state of the GPU object.
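To make the two responsibilities concrete, here is a rough sketch of building a GPU object from hardware information and then syncing its health. Everything beyond the field names in the manifest above is an assumption: the helpers new_gpu_object and sync_health are hypothetical, the object name is passed in because the article does not describe the naming rule, the device path is assumed to follow /dev/nvidia plus the index, and "Unhealthy" is an assumed state value. A real plugin would read the hardware details from the driver and write the object through the Kubernetes API.

```python
def new_gpu_object(name, node_name, index, uuid, model, memory_bytes):
    """Build a GPU object (as a dict) from the hardware information of one device."""
    return {
        "apiVersion": "elasticgpu.io/v1alpha1",
        "kind": "GPU",
        "metadata": {"name": name, "labels": {"elasticgpu.io/node": node_name}},
        "spec": {
            "index": index,
            "memory": memory_bytes,
            "model": model,
            "nodeName": node_name,
            "path": f"/dev/nvidia{index}",  # assumption: path derived from index
            "uuid": uuid,
        },
        "status": {"state": "Healthy"},
    }


def sync_health(gpu, healthy):
    """Write the result of a periodic health check into status.state."""
    gpu["status"]["state"] = "Healthy" if healthy else "Unhealthy"
    return gpu


gpu = new_gpu_object(
    name="example-gpu-0",  # hypothetical name
    node_name="10.0.0.2",
    index=0,
    uuid="GPU-cf0f5fe7-0e15-4915-be3c-a6d976d65ad4",
    model="Tesla V100-SXM2-32GB",
    memory_bytes=34089730048,
)
sync_health(gpu, healthy=False)
print(gpu["spec"]["path"], gpu["status"]["state"])
```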

Summary

To address the lack of observability of GPU resources in TKE clusters, we introduced the GPU CRD, which lets users intuitively count and query GPU resource usage in the cluster. The solution is already integrated with qGPU and can be viewed in the TKE console. To enable it, select Use CRD when installing the qGPU plugin.

TKE qGPU is now fully launched. For details, see: https://cloud.tencent.com/document/product/457/61448

