With the rapid development of AI and large models, sharing GPU resources in the cloud has become a necessity: it reduces hardware cost, improves resource utilization, and meets the massively parallel computing demands of model training and inference.
Kubernetes' built-in resource scheduling can only allocate GPUs by the number of whole cards, but for deep learning and similar workloads the resource that actually runs out first is usually GPU memory, so scheduling by card count wastes a lot of resources.
There are currently two GPU resource sharing schemes. One decomposes a physical GPU into multiple virtual GPUs (vGPUs) and schedules by vGPU count; the other schedules based on the GPU's memory.
In this article, we walk through installing the Kubernetes components that enable scheduling based on GPU memory.
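To make the difference concrete, here is a rough sketch of the two request styles as they appear in a container spec; nvidia.com/gpu is the stock device-plugin resource, while the aliyun.com/gpu-mem resource used for memory-based scheduling is provided by the gpushare plugin installed later in this article:
# scheduling by whole cards: the pod exclusively occupies one GPU
resources:
  limits:
    nvidia.com/gpu: 1

# scheduling by GPU memory: the pod only reserves 3 GiB on a shared GPU
resources:
  limits:
    aliyun.com/gpu-mem: 3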
System Information
- System: CentOS Stream 8
- Kernel: 4.18.0-490.el8.x86_64
- Driver: NVIDIA-Linux-x86_64-470.182.03
- Docker: 20.10.24
- Kubernetes version: 1.24.0
1. Driver installation
Please visit the NVIDIA website and download/install the driver yourself: https://www.nvidia.com/Download/?lang=en-us
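For reference, a minimal sketch of installing the downloaded .run driver package listed in the system information above, assuming gcc and the matching kernel headers are available:
sudo dnf install -y gcc kernel-devel-$(uname -r) kernel-headers-$(uname -r)   # build prerequisites, if missing
sudo sh NVIDIA-Linux-x86_64-470.182.03.run
nvidia-smi   # should list the GPU once the driver is loaded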
2. docker installation
Please install docker or another container runtime yourself. If you use a different container runtime, refer to the NVIDIA documentation for the third step of the configuration: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/#installation-guide
Note: Docker, containerd and podman are officially supported, but this document has only been verified with docker, so be aware of possible differences when running another container runtime.
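If Docker is not installed yet, one common approach on CentOS Stream 8 is the upstream docker-ce repository; a rough sketch follows (package versions may differ from the 20.10.24 used here):
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable --now docker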
3. NVIDIA Container Toolkit Installation
- Set up the repository and GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
- Start Installation
sudo dnf clean expire-cache --refresh
sudo dnf install -y nvidia-container-toolkit
- Modify the docker configuration file to add the container runtime implementation
sudo nvidia-ctk runtime configure --runtime=docker
- Modify /etc/docker/daemon.json to set nvidia as the default container runtime (required)
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
- Restart docker and verify that the configuration takes effect
sudo systemctl restart docker
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
If output similar to the following is returned, the configuration was successful:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
4. Installation of the K8S GPU scheduler
- First, apply the following YAML to deploy the scheduler extender
# rbac yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpushare-schd-extender
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - update
      - patch
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - bindings
      - pods/binding
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpushare-schd-extender
subjects:
  - kind: ServiceAccount
    name: gpushare-schd-extender
    namespace: kube-system
# deployment yaml
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: gpushare
      component: gpushare-schd-extender
  template:
    metadata:
      labels:
        app: gpushare
        component: gpushare-schd-extender
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      hostNetwork: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
          key: node-role.kubernetes.io/master
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
        - effect: NoSchedule
          operator: Exists
          key: node.cloudprovider.kubernetes.io/uninitialized
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      serviceAccount: gpushare-schd-extender
      containers:
        - name: gpushare-schd-extender
          image: registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-schd-extender:1.11-d170d8a
          env:
            - name: LOG_LEVEL
              value: debug
            - name: PORT
              value: "12345"
# service yaml
---
apiVersion: v1
kind: Service
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
  labels:
    app: gpushare
    component: gpushare-schd-extender
spec:
  type: NodePort
  ports:
    - port: 12345
      name: http
      targetPort: 12345
      nodePort: 32766
  selector:
    # select the gpushare-schd-extender pods
    app: gpushare
    component: gpushare-schd-extender
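Before moving on, it is worth confirming that the extender pod and its NodePort Service are actually up; the labels used below come from the YAML above:
kubectl get pods -n kube-system -l app=gpushare,component=gpushare-schd-extender -o wide
kubectl get svc gpushare-schd-extender -n kube-system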
- Add a scheduling policy configuration file to the /etc/kubernetes directory (scheduler-policy-config.yaml in the examples below)
# scheduler-policy-config.yaml
---
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  # Calling the extender through the Service address did not work for some reason; the NodePort address must be used
  - urlPrefix: "http://{nodeIP}:{nodePort}/gpushare-scheduler"
    filterVerb: filter
    bindVerb: bind
    enableHTTPS: false
    nodeCacheCapable: true
    managedResources:
      - name: aliyun.com/gpu-mem
        ignoredByScheduler: false
    ignorable: false
Note: be sure to replace {nodeIP}:{nodePort} in the urlPrefix above with the IP of one of your nodes and the NodePort of the gpushare-schd-extender Service (32766 in the Service definition above), otherwise the scheduler will not be able to reach the extender!
The NodePort can be queried with the following command:
kubectl get service gpushare-schd-extender -n kube-system -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}'
- Modify the kubernetes scheduler configuration /etc/kubernetes/manifests/kube-scheduler.yaml
1. In the command: section, add
- --config=/etc/kubernetes/scheduler-policy-config.yaml
2. Add the pod mounts
In volumeMounts:, add
- mountPath: /etc/kubernetes/scheduler-policy-config.yaml
  name: scheduler-policy-config
  readOnly: true
In volumes:, add
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.yaml
    type: FileOrCreate
  name: scheduler-policy-config
Note: be careful not to make any mistakes here, otherwise you may run into inexplicable errors.
An example of the modified manifest is sketched below:
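Only the added lines and their immediate surroundings are shown; the existing flags, image, and other fields of your kube-scheduler.yaml stay untouched, and the file name follows the scheduler-policy-config.yaml naming assumed above:
spec:
  containers:
  - name: kube-scheduler
    command:
    - kube-scheduler
    # ... existing flags stay as they are ...
    - --config=/etc/kubernetes/scheduler-policy-config.yaml    # added
    volumeMounts:
    # ... existing mounts ...
    - mountPath: /etc/kubernetes/scheduler-policy-config.yaml  # added
      name: scheduler-policy-config
      readOnly: true
  volumes:
  # ... existing volumes ...
  - hostPath:                                                  # added
      path: /etc/kubernetes/scheduler-policy-config.yaml
      type: FileOrCreate
    name: scheduler-policy-config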
- Configure RBAC and install the device plugin
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
5. Adding labels to GPU nodes
kubectl label node <target_node> gpushare=true
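Once the device plugin pod is running on the labeled node, the node should start reporting the shared GPU memory resource in its allocatable list (assuming the default aliyun.com/gpu-mem resource name used in the scheduler policy above):
kubectl describe node <target_node> | grep gpu-mem
# or query the allocatable value directly
kubectl get node <target_node> -o jsonpath='{.status.allocatable.aliyun\.com/gpu-mem}'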
6. install the kubectl gpu plugin
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
7. Validation
- Querying GPU resource usage using kubectl
# kubectl inspect gpushare
NAME                   IPADDRESS     GPU0(Allocated/Total)  GPU Memory(GiB)
-uf61h64dz1tmlob9hmtb  192.168.0.71  6/15                   6/15
-uf61h64dz1tmlob9hmtc  192.168.0.70  3/15                   3/15
------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/30 (30%)
- Create a workload that requests GPU memory and check how it is scheduled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      tolerations:
        - effect: NoSchedule
          key: cloudClusterNo
          operator: Exists
      containers:
        - name: binpack-1
          image: cheyang/gpu-player:v2
          resources:
            limits:
              # unit: GiB
              aliyun.com/gpu-mem: 3
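After applying this Deployment, the pod should land on a node with at least 3 GiB of free GPU memory, and the allocation reported by the inspect plugin should grow accordingly; roughly (binpack-1.yaml is just an assumed file name for the manifest above):
kubectl apply -f binpack-1.yaml
kubectl get pods -l app=binpack-1 -o wide
kubectl inspect gpushare   # the Allocated column of the chosen node should now include the 3 GiB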
8. Troubleshooting
If something was not installed successfully during the process, you can check the device plugin pod logs:
kubectl get po -n kube-system -o=wide | grep gpushare-device
kubectl logs -n kube-system <pod_name>
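The scheduler extender itself also logs every filter/bind request (LOG_LEVEL is set to debug in its Deployment above), and the kube-scheduler static pod will complain if the --config file or the extender URL is wrong; both are worth checking when pods stay Pending:
kubectl logs -n kube-system deployment/gpushare-schd-extender
kubectl logs -n kube-system -l component=kube-scheduler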
Reference addresses:
NVIDIA official container-toolkit installation documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/#docker
AliCloud GPU plugin installation: https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/