This article shares how to use GPUs in different environments such as bare metal, Docker and Kubernetes.
Read the original article: GPU Environment Setup Guide: How to Use GPUs in Bare Metal, Docker, K8s, etc.
1. General
We take the most common case, NVIDIA GPUs on Linux, as the example; the process is in principle the same for GPU devices from other manufacturers.
TL;DR:
- For bare-metal environments, you only need to install the corresponding GPU Driver and CUDA Toolkit.
- For Docker environments, you additionally need to install nvidia-container-toolkit and configure docker to use the nvidia runtime.
- For k8s environments, you also need to install the corresponding device-plugin so that the kubelet can detect the GPU devices on the node and k8s can manage the GPUs.
Note: In k8s you would normally just install gpu-operator; in this article we install each component manually in order to understand what each one does.
PS: the next post shares how to use gpu-operator to complete the installation quickly.
2. Bare-metal environment
The following components need to be installed in order to use the GPU on bare metal:
- GPU Driver
- CUDA Toolkit
The relationship between the two is shown in this diagram from the NVIDIA website:
The GPU Driver includes the GPU driver and the CUDA driver, and the CUDA Toolkit includes the CUDA Runtime.
The GPU is a PCIe device; once installed in the machine it can be seen with the lspci command. First confirm that the machine actually has a GPU:
root@test:~# lspci|grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
86:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
As you can see, this machine has two Tesla T4 GPUs.
Installing the driver
First, go to the NVIDIA Driver Download page and download the driver that matches your card.
The download is a .run installer, e.g. NVIDIA-Linux-x86_64-550.54.xx.run.
Then simply run the file with sh:
sh NVIDIA-Linux-x86_64-550.54.xx.run
This opens an interactive installer interface; just select yes / ok all the way through.
Run the following command to check if the installation was successful
nvidia-smi
If the graphics card information appears then the installation was successful, like this:
root@test:~# nvidia-smi
Wed Jul 10 05:41:52 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 51C P0 29W / 70W | 12233MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:86:00.0 Off | 0 |
| N/A 49C P0 30W / 70W | 6017MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
|
+---------------------------------------------------------------------------------------+
At this point, we have installed the GPU driver and the system recognizes the GPU properly.
The CUDA version shown here indicates the maximum CUDA version supported by the current driver.
Installing the CUDA Toolkit
Deep learning programs generally rely on a CUDA environment, so the CUDA Toolkit also needs to be installed.
Likewise, go to the NVIDIA CUDA Toolkit download page, pick your operating system and installation method, and download the corresponding installer.
Like the driver, it is a .run file:
# Download the installation file
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
# Start the installation
sudo sh cuda_12.2.0_535.54.03_linux.run
Note: if you already installed the driver in the previous step, deselect the driver here and install only the CUDA Toolkit components.
The output after the installation is complete is as follows:
root@iZbp15lv2der847tlwkkd3Z:~# sudo sh cuda_12.2.0_535.54.03_linux.run
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-12.2/
Please make sure that
- PATH includes /usr/local/cuda-12.2/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
Following the prompts, configure PATH and LD_LIBRARY_PATH:
# Add CUDA 12.2 to PATH
export PATH=/usr/local/cuda-12.2/bin:$PATH
# Add CUDA 12.2 lib64 to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
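These exports only affect the current shell. To make them persist across logins, you can append them to your shell profile; a minimal sketch, assuming bash and the default install path /usr/local/cuda-12.2:
# Persist the CUDA paths for new shells (bash assumed)
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc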
Execute the following command to check the version and confirm that the installation was successful
root@iZbp15lv2der847tlwkkd3Z:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
Test
We use a simple PyTorch program to check that the GPU and CUDA are working.
The whole call chain looks roughly like this:
Use the following code to test it; the content of check_cuda_pytorch.py is as follows:
import torch

def check_cuda_with_pytorch():
    """Check that the PyTorch CUDA environment is working properly."""
    try:
        print("Checking PyTorch CUDA environment:")
        if torch.cuda.is_available():
            print(f"CUDA device is available, current CUDA version: {torch.version.cuda}")
            print(f"PyTorch version: {torch.__version__}")
            print(f"Detected {torch.cuda.device_count()} CUDA device(s).")
            for i in range(torch.cuda.device_count()):
                print(f"Device {i}: {torch.cuda.get_device_name(i)}")
                print(f"Device {i} total memory: {torch.cuda.get_device_properties(i).total_memory / (1024 ** 3):.2f} GB")
                print(f"Device {i} memory currently allocated: {torch.cuda.memory_allocated(i) / (1024 ** 3):.2f} GB")
                print(f"Device {i} memory currently reserved: {torch.cuda.memory_reserved(i) / (1024 ** 3):.2f} GB")
        else:
            print("CUDA device not available.")
    except Exception as e:
        print(f"Error checking PyTorch CUDA environment: {e}")

if __name__ == "__main__":
    check_cuda_with_pytorch()
Install torch first.
pip install torch
Run it.
python3 check_cuda_pytorch.py
The normal output should look like this:
Checking PyTorch CUDA environment:
CUDA device is available, current CUDA version: 12.1
PyTorch version: 2.3.0+cu121
Detected 1 CUDA device(s).
Device 0: Tesla T4
Device 0 total memory: 14.75 GB
Device 0 memory currently allocated: 0.00 GB
Device 0 memory currently reserved: 0.00 GB
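Beyond querying device properties, you can also run a small computation on the GPU to confirm the whole chain works end to end; a one-liner sketch, assuming torch is installed as above:
# Allocate a tensor on the GPU, multiply it with itself and print the sum of the result
python3 -c "import torch; x = torch.rand(1024, 1024, device='cuda'); print((x @ x).sum().item())"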
3. Docker environment
In the previous section we installed the GPU Driver, CUDA Toolkit and other tools on the bare-metal host, so GPUs can be used on the host itself.
What do we need to do if we want to use GPUs inside Docker containers as well?
The general steps are as follows:
- 1) Install the nvidia-container-toolkit component
- 2) Configure docker to use the nvidia runtime
- 3) Add the --gpus parameter when starting a container
Install nvidia-container-toolkit
The main purpose of the NVIDIA Container Toolkit is to mount NVIDIA GPU devices into containers.
It is compatible with any container runtime in the ecosystem: docker, containerd, cri-o, etc.
NVIDIA official installation documentation: nvidia-container-toolkit-install-guide
For Ubuntu systems, the installation command is as follows:
# 1. Configure the production repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Optionally, configure the repository to use experimental packages
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Update the packages list from the repository
sudo apt-get update
# 3. Install the NVIDIA Container Toolkit packages
sudo apt-get install -y nvidia-container-toolkit
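To confirm the packages actually landed, you can query dpkg and the nvidia-ctk CLI; a quick check, assuming Ubuntu/Debian as above:
# Verify the toolkit package and CLI are installed
dpkg -l | grep nvidia-container-toolkit
nvidia-ctk --version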
Configure the container runtime
Docker, containerd, CRI-O, Podman and other container runtimes are supported.
See the official documentation for details: container-toolkit#install-guide
Here's an example of a Docker configuration:
Older versions required manually editing /etc/docker/daemon.json to add a configuration specifying the nvidia runtime:
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
Newer versions of the toolkit ship with the nvidia-ctk tool; run the following command for one-step configuration:
sudo nvidia-ctk runtime configure --runtime=docker
Then restart Docker and you're done!
sudo systemctl restart docker
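To verify the configuration took effect, check the runtimes Docker now knows about; a quick sanity check:
# The output should list an nvidia runtime next to runc
docker info | grep -i runtime
# Or inspect the generated configuration directly
cat /etc/docker/daemon.json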
Test
After installing nvidia-container-toolkit, the whole call chain looks like this:
The call chain changes from containerd --> runC into containerd --> nvidia-container-runtime --> runC.
nvidia-container-runtime sits in the middle and intercepts the container spec, which lets it inject the GPU-related configuration, so the spec passed on to runC already contains the GPU information.
A CUDA call in a Docker environment looks roughly like this:
As you can see from the figure, the CUDA Toolkit is packaged into the container image, so there is no need to install the CUDA Toolkit on the host; just use an image that ships with the CUDA Toolkit.
Finally, we start a Docker container to test it. Add the --gpus parameter to the command to specify which GPUs to assign to the container.
Possible values of the --gpus parameter:
- --gpus all: assigns all GPUs to this container
- --gpus "device=<id>[,<id>...]": for multi-GPU machines, specifies by ID which GPUs to assign to the container, e.g. --gpus "device=0" assigns only GPU 0 to the container. The GPU IDs can be viewed with the nvidia-smi command.
We'll test this directly with a CUDA image: start the container and run the nvidia-smi command.
docker run --rm --gpus all nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi
Normally it should print the GPU information from inside the container.
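If you only want to hand a specific card to the container, use the device filter described above; for example, to make only GPU 0 visible inside the container:
# Only GPU 0 (as numbered by nvidia-smi on the host) is visible in the container
docker run --rm --gpus device=0 nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi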
4. k8s environment
Going a step further, using GPUs in a k8s environment requires the following components to be deployed in the cluster:
- gpu-device-plugin: used to manage GPUs. The device-plugin runs as a DaemonSet on every node of the cluster and detects the GPU devices on the node, so that k8s can manage them.
- gpu-exporter: used to monitor GPUs
The relationship of the components is shown in the figure below:
- The figure on the left shows the manual installation scenario: you only need to install the device-plugin and the monitoring component in the cluster.
- The figure on the right shows installation with gpu-operator, which we ignore for now.
The general workflow is as follows:
- The kubelet component on each node maintains the status of the GPU devices on that node (which are in use and which are free) and reports it regularly, so the scheduler knows how many GPU cards are available on each node.
- When the scheduler schedules a pod that requests GPUs, it picks one of the nodes that still has enough free GPUs.
- Once the pod is assigned to the node, the kubelet component allocates GPU device IDs for the pod and passes these IDs as parameters to the NVIDIA Device Plugin.
- The NVIDIA Device Plugin writes the GPU device IDs assigned to the pod's container into the container's environment variable NVIDIA_VISIBLE_DEVICES and returns the information to the kubelet.
- kubelet starts the container.
- NVIDIA Container Toolkit detects the presence of the environment variable NVIDIA_VISIBLE_DEVICES in the container's spec, and then mounts the GPU device into the container based on the value of the environment variable.
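To see this mechanism with your own eyes, you can inspect the environment of a running GPU pod (for example the gpu-pod created later in this article, as long as it is still running and its image contains the env binary); a sketch, with the pod name as a placeholder:
# Shows the device IDs the device-plugin injected for this container
kubectl exec <gpu-pod-name> -- env | grep NVIDIA_VISIBLE_DEVICES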
In the Docker environment we manually specify the GPUs assigned to a container with the --gpus parameter when starting it; in the k8s environment this is handled by the device-plugin.
Install device-plugin
The device-plugin is usually provided by the GPU manufacturer, e.g. NVIDIA's k8s-device-plugin.
Installing it is as simple as applying the corresponding yaml to the cluster:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
Like this:
root@test:~# kubectl get po -l app=nvidia-device-plugin-daemonset
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-7nkjw 1/1 Running 0 10m
After device-plugin starts, it senses GPU devices on the node and reports them to the kubelet, which eventually submits them to kube-apiserver.
So we can see the GPU in the Node allocatable resources like this:
root@test:~# k describe node test|grep Capacity -A7
Capacity:
cpu: 48
ephemeral-storage: 460364840Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 98260824Ki
nvidia.com/gpu: 2
pods: 110
As you can see, in addition to the usual cpu and memory, there is also nvidia.com/gpu. This is the GPU resource, and a count of 2 means we have two GPUs.
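To get a quick overview across the whole cluster, you can also pull just the GPU count from every node (the dots in the resource name have to be escaped in the expression); a sketch:
# One line per node with its allocatable nvidia.com/gpu count
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'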
Installing GPU Monitoring
In addition, if you need to monitor cluster GPU resource usage, you can also install DCGM exporter, which exposes GPU monitoring metrics for Prometheus.
helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install \
--generate-name \
gpu-helm-charts/dcgm-exporter
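dcgm-exporter listens on port 9400 inside the cluster by default; to curl it from your local machine you can port-forward its Service first. The Service name depends on the release name generated by --generate-name, so look it up and adjust accordingly; a sketch:
# Find the generated Service name, then forward it to local port 8080
kubectl get svc | grep dcgm-exporter
kubectl port-forward svc/<dcgm-exporter-service> 8080:9400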
View metrics
curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...
Test
Creating a pod that uses GPU resources in k8s is simple: just like cpu, memory and other regular resources, you request them under resources.
For example, the following yaml requests 1 GPU for the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
This way kube-scheduler will take GPU resources into account when scheduling and place this Pod on a node that has free GPUs.
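Once the Pod is created, you can check which node it landed on (and, if it stays Pending, describe it to see whether the cluster reports insufficient nvidia.com/gpu):
# Shows the node the Pod was scheduled onto
kubectl get pod gpu-pod -o wide
# Shows scheduling events if something went wrong
kubectl describe pod gpu-pod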
After startup, check the logs, which should normally print a message that the test passed.
kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
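When you are done testing, the Pod can simply be removed:
# Clean up the test Pod
kubectl delete pod gpu-pod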
At this point, the GPU can also be used in a k8s environment.
5. Summary
This article shares how to use GPUs in bare metal, Docker environments, and k8s environments.
- For bare-metal environments, you only need to install the corresponding GPU Driver and CUDA Toolkit.
- For Docker environments, you additionally need to install nvidia-container-toolkit and configure docker to use the nvidia runtime.
- For k8s environments, you also need to install the corresponding device-plugin so that the kubelet can detect the GPU devices on the node and k8s can manage the GPUs.
Nowadays GPUs are mostly used in k8s environments, and to simplify the installation steps NVIDIA also provides gpu-operator to streamline installation and deployment. A later post will share how to use gpu-operator for a quick install.