As companies around the world adopt Kubernetes more widely, we see it moving into a new stage of development. On one hand, Kubernetes is being adopted for edge workloads, delivering value beyond the data center. On the other hand, Kubernetes is driving machine learning (ML) and high-quality, high-speed data analytics.

Much of what makes Kubernetes useful for machine learning traces back to Kubernetes 1.10, when graphics processing units (GPUs) became a schedulable resource (that support is now in beta). Individually, each of these is an exciting development in Kubernetes. Even more exciting, you can use Kubernetes to adopt GPUs both in the data center and at the edge. In the data center, GPUs are a way to train ML models. Those trained models can then be migrated to edge Kubernetes clusters as inference engines for machine learning, providing data analysis as close to where the data is collected as possible.

In the early days, Kubernetes provided a pool of CPU and RAM resources for distributed applications. If we can pool CPU and RAM, why not pool GPUs? That works fine, of course, but not all servers have GPUs. So how do we equip our Kubernetes nodes with GPUs?

In this article, I’ll explain a simple way to start using GPUs in a Kubernetes cluster. In future articles, we’ll also push GPUs to the edge and show you how to do that. To keep things as simple as possible, I’ll use the Rancher UI to walk through the GPU-enablement process. The Rancher UI is just a client of the Rancher RESTful APIs. You can use API clients in other tools such as Golang, Python, and Terraform in GitOps, DevOps, and other automation solutions. However, we won’t go into that in this article.

In essence, the steps are simple:

  • Build the infrastructure for the Kubernetes cluster
  • Install Kubernetes
  • Install the NVIDIA GPU Operator from its Helm chart (see the sketch after this list)
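If you would rather skip the UI and drive that last step directly with Helm, the install reduces to a few commands. A minimal sketch, assuming Helm v3 and a kubeconfig already pointed at the cluster, and using the same chart repository URL that appears later in this article:

```
# Minimal sketch of the Helm-only route (assumes kubectl and Helm v3 are
# configured; the repo URL matches the catalog URL used later in this article)
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install the operator into its own namespace; the chart defaults are usually fine
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
```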

Get up and running with Rancher and available GPU resources

Rancher is a multi-cluster management solution and is the glue behind these steps. You can find a pure NVIDIA solution for simplifying GPU management, along with some important information on how the GPU Operator differs from building a GPU driver stack without the operator, in the NVIDIA blog:

(developer.nvidia.com/blog/nvidia…).

Preparation

Here is the bill of materials (BOM) required to get GPUs up and running in Rancher:

  1. Rancher
  2. GPU Operator (nvidia.github.io/gpu-operato…)
  3. Infrastructure: we will be using GPU nodes on AWS

The official documentation has a section on installing Rancher with high availability, so we’ll assume you’ve already installed Rancher:

docs.rancher.cn/docs/ranche…
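For reference, the documented high-availability install is itself a short Helm exercise. A minimal sketch, assuming cert-manager is already installed and with rancher.example.com as a placeholder for your own hostname:

```
# Sketch of the documented Helm-based HA install (cert-manager assumed present;
# the hostname is a placeholder for your own DNS name)
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
kubectl create namespace cattle-system
helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com
```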

Process steps

Install a Kubernetes cluster with GPUs

After Rancher is installed, we’ll first build and configure a Kubernetes cluster (you can use any cluster with an NVIDIA GPU).

Using the Global context, we select Add Cluster.

In the “Hosts from Cloud Service Providers” section, select Amazon EC2.

We do this via node drivers, a set of pre-configured infrastructure templates, some of which have GPU resources.

Notice that there are three node pools: one for the masters, one for standard worker nodes, and one for workers with GPUs. The GPU template is based on the p3.2xlarge machine type and uses the Ubuntu 18.04 Amazon Machine Image, or AMI (ami-0ac80df6eff0e70b5). Of course, these choices vary according to the needs of each infrastructure provider and enterprise. Also, we leave the Kubernetes options in the “Add Cluster” form at their default values.
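For automation, the same node template can be created through the Rancher API instead of the UI. The sketch below is hypothetical in its details: the endpoint and field names follow Rancher’s v3 node-driver conventions, and the token, server URL, and network IDs are placeholders, so verify everything against your server’s /v3 schema:

```
# Hypothetical sketch: creating an EC2 node template via the Rancher v3 API.
# RANCHER_URL, RANCHER_TOKEN, and the VPC/subnet values are placeholders;
# verify the endpoint and field names against your server's /v3 schema.
curl -s -u "${RANCHER_TOKEN}" \
  -H 'Content-Type: application/json' \
  "${RANCHER_URL}/v3/nodetemplates" \
  -d '{
        "name": "gpu-worker-template",
        "amazonec2Config": {
          "region": "us-east-1",
          "instanceType": "p3.2xlarge",
          "ami": "ami-0ac80df6eff0e70b5",
          "vpcId": "vpc-xxxxxxxx",
          "subnetId": "subnet-xxxxxxxx"
        }
      }'
```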

Set up the GPU Operator

For now, we’ll use the GPU Operator repository (nvidia.github.io/gpu-operato…) to set up a catalog in Rancher. (There are other solutions for exposing GPUs, including using the Linux for Tegra [L4T] Linux distribution or the device plugin.) At the time of this writing, the GPU Operator has been tested and verified with the NVIDIA Tesla Driver 440.
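For completeness, the standalone device-plugin route mentioned above is a single manifest, though it assumes you have already installed the GPU driver and the NVIDIA container runtime on each node yourself. A sketch, with the version tag as a placeholder to check against the NVIDIA/k8s-device-plugin releases:

```
# Alternative sketch: the standalone NVIDIA device plugin (assumes the GPU
# driver and nvidia-container-runtime are already present on each GPU node).
# The version tag is a placeholder; check the project's releases page.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.0/nvidia-device-plugin.yml
```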

Using the Rancher Global context menu, we select the cluster to install to:

Then use the Tools menu to view the catalog list.

Click the Add Catalog button, give it a name, then add the URL: nvidia.github.io/gpu-operato…

We choose Helm v3 and the cluster scope, then click Create to add the catalog to Rancher. When using automation, we can make this step part of a cluster build. Depending on enterprise policy, we can add this catalog to every cluster, even those that do not yet have a GPU node or node pool. This step gives us access to the GPU Operator chart, which we will install next.
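As a sketch of what that automation could look like, the catalog step maps to one API call. The endpoint and fields below follow Rancher’s legacy v3 conventions and are hypothetical in their details; the cluster ID is a placeholder:

```
# Hypothetical sketch: adding the chart repo as a cluster-scoped catalog
# through Rancher's legacy v3 API. c-xxxxx is a placeholder cluster ID;
# verify the endpoint and fields against your server's /v3 schema.
curl -s -u "${RANCHER_TOKEN}" \
  -H 'Content-Type: application/json' \
  "${RANCHER_URL}/v3/clustercatalogs" \
  -d '{
        "name": "nvidia",
        "clusterId": "c-xxxxx",
        "url": "https://nvidia.github.io/gpu-operator",
        "branch": "master",
        "helmVersion": "helm_v3"
      }'
```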

Now we use the Rancher context menu in the upper left to access the cluster’s “System” project, where we will add the GPU Operator functionality.

In the System project, select Apps:

Then click the Launch button at the top right.

We can search for “nvidia” or scroll down to the catalog we just created.

Click on the GPU-Operator app, then click Launch at the bottom of the page.

In this case, all the defaults should be fine. Again, we can add this step to automation through the Rancher APIs.
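A hypothetical sketch of that automation, with the project ID and chart version as placeholders and the externalId format as the part to double-check against your server’s /v3 schema:

```
# Hypothetical sketch: launching the gpu-operator chart as a Rancher app.
# c-xxxxx:p-xxxxx and <VERSION> are placeholders; the externalId format for
# cluster-scoped catalogs should be verified in your server's /v3 schema.
curl -s -u "${RANCHER_TOKEN}" \
  -H 'Content-Type: application/json' \
  "${RANCHER_URL}/v3/projects/c-xxxxx:p-xxxxx/app" \
  -d '{
        "name": "gpu-operator",
        "targetNamespace": "gpu-operator",
        "externalId": "catalog://?catalog=c-xxxxx/nvidia&type=clusterCatalog&template=gpu-operator&version=<VERSION>"
      }'
```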

Using the GPU

Now that the GPU is accessible, we can deploy a GPU-capable workload. We can also verify that the installation succeeded by viewing the Cluster -> Nodes page in Rancher, where we see that the GPU Operator has installed Node Feature Discovery (NFD) and labeled our nodes for GPU use.
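Beyond the Rancher UI, we can confirm the same thing with kubectl. A minimal sketch; the CUDA image tag is a placeholder, and any tag that ships nvidia-smi will do:

```
# Check that NFD labeled the node and that the NVIDIA resource is allocatable
kubectl get nodes --show-labels | grep -i nvidia
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Throwaway pod that requests one GPU and prints the nvidia-smi output
# (the CUDA image tag is a placeholder; any tag that includes nvidia-smi works)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod completes, the GPU details should appear in its logs
kubectl logs gpu-smoke-test
```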

Summary

Three important things make working with GPUs in Kubernetes this simple:

  1. NVIDIA’s GPU Operator
  2. Node Feature Discovery (NFD) from Kubernetes SIG of the same name.
  3. Rancher’s cluster deployment and Catalog app integration

You’re welcome to try out this tutorial, and stay tuned as we take GPUs to the edge in future tutorials.