This article shares how to use GPUs in different environments such as bare metal, Docker and Kubernetes.
Read the original article: GPU Environment Setup Guide: How to Use GPUs in Bare Metal, Docker, K8s, etc.
1. General
We take the most common case, NVIDIA GPUs on Linux, as the example; the process is in principle the same for GPU devices from other manufacturers.
TL;DR:
- For bare-metal environments, you only need to install the corresponding GPU Driver and CUDA Toolkit.
- For Docker environments, you additionally need to install nvidia-container-toolkit and configure docker to use the nvidia runtime.
- For k8s environments, you also need to install the corresponding device-plugin so that the kubelet can detect the GPU devices on the node and k8s can manage the GPUs.
Note: In k8s you would normally just install gpu-operator; in this article we install each component manually in order to understand what each one does.
PS: the next post shares how to use gpu-operator to complete the installation quickly.
2. Bare-metal environment
The following components need to be installed in order to use the GPU on bare metal:
- GPU Driver
- CUDA Toolkit
The relationship between the two is shown in this diagram from the NVIDIA website:
The GPU Driver includes the GPU driver and the CUDA driver, and the CUDA Toolkit includes the CUDA Runtime.
The GPU is a PCIe device; once installed in the machine it can be seen with the lspci command. First confirm that the machine actually has a GPU:
root@test:~# lspci|grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
86:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
As you can see, this machine has two Tesla T4 GPUs.
Installing the driver
First, go to the NVIDIA Driver Download page and download the driver that matches your card.
The download is a .run installer, e.g. NVIDIA-Linux-x86_64-550.54.xx.run.
Then simply run the file with sh:
sh NVIDIA-Linux-x86_64-550.54.xx.run
This opens an interactive installer interface; just select yes / ok all the way through.
Run the following command to check if the installation was successful
nvidia-smi
If the graphics card information appears then the installation was successful, like this:
root@test:~# nvidia-smi
Wed Jul 10 05:41:52 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 51C P0 29W / 70W | 12233MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:86:00.0 Off | 0 |
| N/A 49C P0 30W / 70W | 6017MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
|
+---------------------------------------------------------------------------------------+
At this point, we have installed the GPU driver and the system recognizes the GPU properly.
The CUDA version shown here indicates the maximum CUDA version supported by the current driver.
Installing the CUDA Toolkit
Deep learning programs generally rely on a CUDA environment, so the CUDA Toolkit also needs to be installed.
Likewise, go to the NVIDIA CUDA Toolkit download page, pick your operating system and installation method, and download the corresponding installer.
Like the driver, it is a .run file:
# Download the installation file
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
# Start the installation
sudo sh cuda_12.2.0_535.54.03_linux.run
Note: if you already installed the driver in the previous step, deselect the driver here and install only the CUDA Toolkit components.
The output after the installation is complete is as follows:
root@iZbp15lv2der847tlwkkd3Z:~# sudo sh cuda_12.2.0_535.54.03_linux.run
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-12.2/
Please make sure that
- PATH includes /usr/local/cuda-12.2/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
Following the prompts, configure PATH and LD_LIBRARY_PATH:
# Add CUDA 12.2 to PATH
export PATH=/usr/local/cuda-12.2/bin:$PATH
# Add CUDA 12.2 lib64 to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
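These exports only affect the current shell. To make them persist across logins, you can append them to your shell profile; a minimal sketch, assuming bash and the default install path /usr/local/cuda-12.2:
# Persist the CUDA paths for new shells (bash assumed)
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc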
Execute the following command to check the version and confirm that the installation was successful
root@iZbp15lv2der847tlwkkd3Z:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
Test
We use a simple PyTorch program to check that the GPU and CUDA are working.
The whole call chain looks roughly like this:
Use the following code to test it; the content of check_cuda_pytorch.py is as follows:
import torch

def check_cuda_with_pytorch():
    """Check that the PyTorch CUDA environment is working properly."""
    try:
        print("Checking PyTorch CUDA environment:")
        if torch.cuda.is_available():
            print(f"CUDA device is available, current CUDA version: {torch.version.cuda}")
            print(f"PyTorch version: {torch.__version__}")
            print(f"Detected {torch.cuda.device_count()} CUDA device(s).")
            for i in range(torch.cuda.device_count()):
                print(f"Device {i}: {torch.cuda.get_device_name(i)}")
                print(f"Device {i} total memory: {torch.cuda.get_device_properties(i).total_memory / (1024 ** 3):.2f} GB")
                print(f"Device {i} memory currently allocated: {torch.cuda.memory_allocated(i) / (1024 ** 3):.2f} GB")
                print(f"Device {i} memory currently reserved: {torch.cuda.memory_reserved(i) / (1024 ** 3):.2f} GB")
        else:
            print("CUDA device not available.")
    except Exception as e:
        print(f"Error checking PyTorch CUDA environment: {e}")

if __name__ == "__main__":
    check_cuda_with_pytorch()
Install torch first.
pip install torch
Run it.
python3 check_cuda_pytorch.py
The normal output should look like this:
Checking PyTorch CUDA environment:
CUDA device is available, current CUDA version: 12.1
PyTorch version: 2.3.0+cu121
Detected 1 CUDA device(s).
Device 0: Tesla T4
Device 0 total memory: 14.75 GB
Device 0 memory currently allocated: 0.00 GB
Device 0 memory currently reserved: 0.00 GB
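Beyond querying device properties, you can also run a small computation on the GPU to confirm the whole chain works end to end; a one-liner sketch, assuming torch is installed as above:
# Allocate a tensor on the GPU, multiply it with itself and print the sum of the result
python3 -c "import torch; x = torch.rand(1024, 1024, device='cuda'); print((x @ x).sum().item())"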
3. Docker environment
In the previous section we installed the GPU Driver, CUDA Toolkit and other tools on the bare-metal host, so GPUs can be used on the host itself.
What do we need to do if we want to use GPUs inside Docker containers as well?
The general steps are as follows:
- 1) Install the nvidia-container-toolkit component
- 2) Configure docker to use the nvidia runtime
- 3) Add the --gpus parameter when starting a container
Install nvidia-container-toolkit
The main purpose of the NVIDIA Container Toolkit is to mount NVIDIA GPU devices into containers.
It is compatible with any container runtime in the ecosystem: docker, containerd, cri-o, etc.
NVIDIA official installation documentation: nvidia-container-toolkit-install-guide
For Ubuntu systems, the installation command is as follows:
# 1. Configure the production repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Optionally, configure the repository to use experimental packages
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Update the packages list from the repository
sudo apt-get update
# 3. Install the NVIDIA Container Toolkit packages
sudo apt-get install -y nvidia-container-toolkit
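To confirm the packages actually landed, you can query dpkg and the nvidia-ctk CLI; a quick check, assuming Ubuntu/Debian as above:
# Verify the toolkit package and CLI are installed
dpkg -l | grep nvidia-container-toolkit
nvidia-ctk --version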
Configure the container runtime
Docker, containerd, CRI-O, Podman and other container runtimes are supported.
See the official documentation for details: container-toolkit#install-guide
Here's an example of a Docker configuration:
Older versions required manually editing /etc/docker/daemon.json to add a configuration specifying the nvidia runtime:
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
Newer versions of the toolkit ship with the nvidia-ctk tool; run the following command for one-step configuration:
sudo nvidia-ctk runtime configure --runtime=docker
Then restart Docker and you're done!
sudo systemctl restart docker
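To verify the configuration took effect, check the runtimes Docker now knows about; a quick sanity check:
# The output should list an nvidia runtime next to runc
docker info | grep -i runtime
# Or inspect the generated configuration directly
cat /etc/docker/daemon.json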
Test
After installing nvidia-container-toolkit, the whole call chain looks like this:
The call chain changes from containerd --> runC into containerd --> nvidia-container-runtime --> runC.
nvidia-container-runtime sits in the middle and intercepts the container spec, which lets it inject the GPU-related configuration, so the spec passed on to runC already contains the GPU information.
A CUDA call in a Docker environment looks roughly like this:
As you can see from the figure, the CUDA Toolkit is packaged into the container image, so there is no need to install the CUDA Toolkit on the host; just use an image that ships with the CUDA Toolkit.
Finally, we start a Docker container to test it. Add the --gpus parameter to the command to specify which GPUs to assign to the container.
Possible values of the --gpus parameter:
- --gpus all: assigns all GPUs to this container
- --gpus "device=<id>[,<id>...]": for multi-GPU machines, specifies by ID which GPUs to assign to the container, e.g. --gpus "device=0" assigns only GPU 0 to the container. The GPU IDs can be viewed with the nvidia-smi command.
We'll test this directly with a CUDA image: start the container and run the nvidia-smi command.
docker run --rm --gpus all nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi
Normally it should print the GPU information from inside the container.
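If you only want to hand a specific card to the container, use the device filter described above; for example, to make only GPU 0 visible inside the container:
# Only GPU 0 (as numbered by nvidia-smi on the host) is visible in the container
docker run --rm --gpus device=0 nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi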
4. k8s environment
Going a step further, using GPUs in a k8s environment requires the following components to be deployed in the cluster:
- gpu-device-plugin: used to manage GPUs. The device-plugin runs as a DaemonSet on every node of the cluster and detects the GPU devices on the node, so that k8s can manage them.
- gpu-exporter: used to monitor GPUs
The relationship of the components is shown in the figure below:
- The figure on the left shows the manual installation scenario: you only need to install the device-plugin and the monitoring component in the cluster.
- The figure on the right shows installation with gpu-operator, which we ignore for now.
The general workflow is as follows:
- The kubelet component on each node maintains the status of the GPU devices on that node (which are in use and which are free) and reports it regularly, so the scheduler knows how many GPU cards are available on each node.
- When the scheduler schedules a pod that requests GPUs, it picks one of the nodes that still has enough free GPUs.
- Once the pod is assigned to the node, the kubelet component allocates GPU device IDs for the pod and passes these IDs as parameters to the NVIDIA Device Plugin.
- The NVIDIA Device Plugin writes the GPU device IDs assigned to the pod's container into the container's environment variable NVIDIA_VISIBLE_DEVICES and returns the information to the kubelet.
- kubelet starts the container.
- NVIDIA Container Toolkit detects the presence of the environment variable NVIDIA_VISIBLE_DEVICES in the container's spec, and then mounts the GPU device into the container based on the value of the environment variable.
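To see this mechanism with your own eyes, you can inspect the environment of a running GPU pod (for example the gpu-pod created later in this article, as long as it is still running and its image contains the env binary); a sketch, with the pod name as a placeholder:
# Shows the device IDs the device-plugin injected for this container
kubectl exec <gpu-pod-name> -- env | grep NVIDIA_VISIBLE_DEVICES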
In the Docker environment we manually specify the GPUs assigned to a container with the --gpus parameter when starting it; in the k8s environment this is handled by the device-plugin.
Install device-plugin
The device-plugin is usually provided by the GPU manufacturer, e.g. NVIDIA's k8s-device-plugin.
Installing it is as simple as applying the corresponding yaml to the cluster:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
Like this:
root@test:~# kubectl get po -l app=nvidia-device-plugin-daemonset
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-7nkjw 1/1 Running 0 10m
After device-plugin starts, it senses GPU devices on the node and reports them to the kubelet, which eventually submits them to kube-apiserver.
So we can see the GPU in the Node allocatable resources like this:
root@test:~# k describe node test|grep Capacity -A7
Capacity:
cpu: 48
ephemeral-storage: 460364840Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 98260824Ki
nvidia.com/gpu: 2
pods: 110
As you can see, in addition to the usual cpu and memory, there is also nvidia.com/gpu. This is the GPU resource, and a count of 2 means we have two GPUs.
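To get a quick overview across the whole cluster, you can also pull just the GPU count from every node (the dots in the resource name have to be escaped in the expression); a sketch:
# One line per node with its allocatable nvidia.com/gpu count
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'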
Installing GPU Monitoring
In addition, if you need to monitor cluster GPU resource usage, you can also install DCGM exporter, which exposes GPU monitoring metrics for Prometheus.
helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install \
--generate-name \
gpu-helm-charts/dcgm-exporter
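dcgm-exporter listens on port 9400 inside the cluster by default; to curl it from your local machine you can port-forward its Service first. The Service name depends on the release name generated by --generate-name, so look it up and adjust accordingly; a sketch:
# Find the generated Service name, then forward it to local port 8080
kubectl get svc | grep dcgm-exporter
kubectl port-forward svc/<dcgm-exporter-service> 8080:9400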
View metrics
curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...
Test
Creating a pod that uses GPU resources in k8s is simple: just like cpu, memory and other regular resources, you request them under resources.
For example, the following yaml requests 1 GPU for the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
This way kube-scheduler will take GPU resources into account when scheduling and place this Pod on a node that has free GPUs.
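Once the Pod is created, you can check which node it landed on (and, if it stays Pending, describe it to see whether the cluster reports insufficient nvidia.com/gpu):
# Shows the node the Pod was scheduled onto
kubectl get pod gpu-pod -o wide
# Shows scheduling events if something went wrong
kubectl describe pod gpu-pod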
After startup, check the logs, which should normally print a message that the test passed.
kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
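When you are done testing, the Pod can simply be removed:
# Clean up the test Pod
kubectl delete pod gpu-pod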
At this point, the GPU can also be used in a k8s environment.
5. Summary
This article shares how to use GPUs in bare metal, Docker environments, and k8s environments.
- For bare-metal environments, you only need to install the corresponding GPU Driver and CUDA Toolkit.
- For Docker environments, you additionally need to install nvidia-container-toolkit and configure docker to use the nvidia runtime.
- For k8s environments, you also need to install the corresponding device-plugin so that the kubelet can detect the GPU devices on the node and k8s can manage the GPUs.
Nowadays GPUs are mostly used in k8s environments, and to simplify the installation steps NVIDIA also provides gpu-operator to streamline installation and deployment. A later post will share how to use gpu-operator for a quick install.