As major companies around the world adopt Kubernetes at scale, we see Kubernetes entering a new stage of development. On the one hand, Kubernetes is being adopted by workloads at the edge, delivering value beyond the data center. On the other hand, Kubernetes is driving machine learning (ML) and high-quality, high-speed data analysis capabilities.

The best-known cases of applying Kubernetes to machine learning stem largely from a feature introduced in Kubernetes 1.10, when graphics processing units (GPUs) became a schedulable resource; this feature is now in beta. Taken individually, these are both exciting developments in Kubernetes. What is even more exciting is that you can use Kubernetes to adopt GPUs in the data center and at the edge. In the data center, GPUs are a way to train ML models. Those trained models can then be migrated to edge Kubernetes clusters to serve as inference tools for machine learning, providing data analysis as close as possible to where the data is collected.

Early on, Kubernetes provided a pool of CPU and RAM resources for distributed applications. If we can have CPU and RAM pools, why not a GPU pool? Doing so is no problem at all, but not all servers have GPUs. So, how do we get our Kubernetes nodes equipped with GPUs?
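To make the pool analogy concrete, here is a minimal sketch of what requesting GPU resources looks like once a cluster exposes them. The pod name and image tag are placeholders of our own choosing; nvidia.com/gpu is the extended resource name registered by NVIDIA's device plugin, which the GPU Operator we install below deploys for us:

    # gpu-pool-demo.yaml: a minimal pod that requests one GPU alongside CPU
    # and memory; the scheduler will only place it on a node that actually
    # advertises nvidia.com/gpu capacity
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pool-demo            # placeholder name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:11.0-base # any CUDA-enabled image works; the tag is an assumption
        command: ["nvidia-smi"]
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: 1        # GPUs are requested in whole units

Apply it with kubectl apply -f gpu-pool-demo.yaml. Note that GPUs can only be requested in whole units under limits; there is no fractional GPU scheduling out of the box.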

In this article, I will explain a simple way to start using GPUs in a Kubernetes cluster. In a future article, we will also take GPUs to the edge and show you how to complete that step. To keep things truly simple, I will use the Rancher UI to walk through enabling GPUs. The Rancher UI is just a client of the Rancher RESTful APIs, so in GitOps, DevOps, and other automation solutions you can use other API clients such as Golang, Python, and Terraform. However, we will not discuss those in depth in this article.

Essentially, the steps are very simple:

  • Build the infrastructure for the Kubernetes cluster
  • Install Kubernetes
  • Install the gpu-operator chart via Helm

We will use Rancher and the available GPU resources to get up and running.

Rancher is a multi-cluster management solution and the glue that holds the steps above together. You can find a pure NVIDIA solution that simplifies GPU management on NVIDIA's blog, along with some important information about how gpu-operator differs from building the GPU driver stack without an operator:

https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/

Prerequisites

The following is the bill of materials (BOM) required to get GPUs up and running in Rancher:

  1. Rancher
  2. GPU Operator (https://nvidia.github.io/gpu-operator/)
  3. Infrastructure: we will use GPU nodes on AWS

The official documentation has a dedicated chapter on how to install Rancher in a high-availability configuration, so here we assume that you already have Rancher installed:

https://docs.rancher.cn/docs/rancher2/installation/k8s-install/_index/
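For reference, the documented high-availability install boils down to a few Helm commands along the lines of the sketch below. The hostname is a placeholder, and the official docs also walk you through installing cert-manager first:

    helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
    kubectl create namespace cattle-system
    helm install rancher rancher-latest/rancher \
      --namespace cattle-system \
      --set hostname=rancher.example.com   # replace with your own hostname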

Process Steps

Install a Kubernetes Cluster with GPUs

After Rancher is installed, we will first build and configure a Kubernetes cluster (you can use any cluster with NVIDIA GPUs).

Using the Global context, we choose Add Cluster, and in the "host from cloud service provider" section we select Amazon EC2.

We do this through a node driver: a set of pre-configured infrastructure templates, some of which have GPU resources.

Note that there are three node pools: one for the masters, one for standard worker nodes, and one for workers with GPUs. The GPU template is based on the p3.2xlarge instance type, using the Ubuntu 18.04 Amazon Machine Image (AMI) ami-0ac80df6eff0e70b5. Of course, these choices will vary according to the needs of each infrastructure provider and enterprise. We also leave the Kubernetes options in the "Add Cluster" form at their default values.
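If you want to sanity-check these infrastructure choices before building the cluster, the AWS CLI can confirm the AMI and the GPU attached to the instance type. The region below is an assumption, and the AMI ID is region-specific:

    aws ec2 describe-images --image-ids ami-0ac80df6eff0e70b5 \
      --region us-east-1 --query 'Images[0].Name'
    aws ec2 describe-instance-types --instance-types p3.2xlarge \
      --query 'InstanceTypes[0].GpuInfo.Gpus'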

Set Up the GPU Operator

Now we will set up a catalog in Rancher using the GPU Operator repository (https://nvidia.github.io/gpu-operator). (There are other solutions for exposing GPUs, including the Linux for Tegra [L4T] Linux distribution or device plugins.) At the time of writing, the GPU Operator has been tested and validated with the NVIDIA Tesla driver, version 440.
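For contrast, the device-plugin path sketched below shows roughly what the GPU Operator automates on every GPU node. The package names and the version tag are assumptions for Ubuntu 18.04; check NVIDIA's documentation for current values:

    # On every GPU node: install the driver and container runtime integration
    # (NVIDIA's apt repositories must be configured first, and the container
    # runtime's default runtime set to "nvidia")
    sudo apt-get install -y nvidia-driver-440 nvidia-container-toolkit

    # Then deploy the device plugin DaemonSet from the k8s-device-plugin repo
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.0/nvidia-device-plugin.yml

The operator replaces all of this per-node work with a single Helm chart.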

Using the Rancher Global context menu, we select the cluster we want to install into:

Then use the Tools menu to view the catalog list.

Click the Add Catalog button, give it a name, and add the URL: https://nvidia.github.io/gpu-operator

We choose Helm v3 and the cluster scope, then click Create to add the catalog to Rancher. When automating, we can make this step part of cluster construction. Depending on corporate policy, we could add this catalog to every cluster, even ones that do not yet have GPU nodes or node pools. This step gives us access to the GPU Operator chart, which we will install next.
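Outside the Rancher UI, adding this catalog is equivalent to adding an ordinary Helm repository (the repository alias below is our own choice):

    helm repo add nvidia https://nvidia.github.io/gpu-operator
    helm repo update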

Now we use the Rancher context menu in the upper left corner to enter the cluster's "System" project, which is where we will add the GPU Operator functionality.

In the System project, select Apps:

Then click the Launch button at the top right.

We can search for "nvidia" or scroll down to the catalog we just created.

Click the gpu-operator app, and then click Launch at the bottom of the page.

In this case, all the default values should be fine. As before, we can fold this step into our automation through the Rancher APIs.
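The CLI equivalent of launching the app with defaults would look roughly like this, assuming the repository alias from earlier. The release and namespace names are our own choices, and --create-namespace requires Helm 3.2 or later:

    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace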

Utilize the GPU

Now that the GPUs are accessible, we can deploy a GPU-capable workload. We can also verify that the installation succeeded by viewing the Cluster -> Nodes page in Rancher, where we see that the GPU Operator has installed Node Feature Discovery (NFD) and labeled our nodes with GPU labels.
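From kubectl, a quick way to verify the same thing is shown below. The exact label keys depend on the NFD version and configuration; 10de is NVIDIA's PCI vendor ID:

    # Look for GPU-related labels that NFD added to the nodes
    kubectl get nodes --show-labels | tr ',' '\n' | grep 10de

    # Confirm the node now advertises GPU capacity
    kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

At this point, the sample pod from the beginning of this article should schedule onto a GPU node and print nvidia-smi output.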

Summary

The reason Kubernetes can run GPU workloads this simply comes down to three important pieces:

  1. NVIDIA's GPU Operator
  2. Node Feature Discovery (NFD) from the Kubernetes SIG of the same name
  3. Rancher's cluster deployment and catalog app integration

You are welcome to try it out by following this tutorial, and please stay tuned: in a follow-up tutorial, we will take GPUs to the edge.

