In the era of AI and machine learning, efficient resource management is paramount. As Red Hat OpenShift administrators, we face the challenge of deploying intensive AI workloads on a platform where GPUs represent a significant cost. Traditional methods like pre-slicing with NVIDIA’s Multi-Instance GPU (MIG) can lead to resource wastage, especially when the static slices do not align with dynamic workload demands.
In this article, we will explore how dynamic GPU slicing—enabled by the dynamic accelerator slicer operator—can revolutionize GPU resource management in OpenShift by dynamically adjusting allocation based on workload needs.
The challenge of GPU utilization
OpenShift supports the NVIDIA GPU Operator, but conventional approaches to GPU sharing come with several limitations:
- Rigid slicing: Predefined MIG slices may not match the actual resource requirements of your pods, leading to underutilization.
- Static allocation: Allocating GPU slices at node start-up forces administrators to reconfigure the nodes if workloads change, often causing disruptions.
- Lack of dynamic provisioning: Without the ability to allocate GPU resources on demand, clusters end up either over-provisioning or underutilizing GPUs, both of which can increase operational costs.
These issues highlight the need for a more flexible approach that enables fractional GPU sharing, ensuring each workload gets only what it needs.
Introducing the dynamic accelerator slicer
The dynamic accelerator slicer operator is currently in developer preview (DP) and is designed to dynamically allocate and manage GPU slices in OpenShift. Its core objectives include:
- Dynamic allocation: Provisioning MIG slices based on the precise pod resource requests and limits.
- Intelligent scheduling: Leveraging Kubernetes scheduling gates to hold pods until the necessary GPU slices are available.
- Seamless integration: Working alongside the NVIDIA GPU Operator, it manages GPU slices without requiring changes to the pod specifications.
- Automated lifecycle management: Tracking allocations and automatically releasing GPU slices once workloads complete.
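To make this concrete, here is a minimal sketch of what a GPU-consuming pod can look like: it simply requests a MIG profile as an extended resource and leaves slicing and placement to the operator. The container image and the exact resource name are assumptions (resource naming can differ between versions); the 3g.20gb profile matches the slice created later in this article.
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # assumption: any CUDA-capable image works here
    command: ["nvidia-smi", "-L"]                 # prints the MIG device the pod was given
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1                 # assumption: resource name may differ in your setup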
How dynamic GPU slicing works
Dynamic GPU slicing leverages the scheduling primitives within OpenShift to optimize resource usage in this three-step process:
Dynamic allocation and placement
When a pod requests GPU resources, it is held in a pre-scheduled state via Kubernetes scheduling gates. The operator dynamically allocates the required GPU slice only when the workload is ready to run, thereby avoiding the inefficiencies associated with pre-slicing.
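Scheduling gates are a standard Kubernetes feature: a pod whose spec.schedulingGates list is non-empty stays in the SchedulingGated state until every gate is removed, at which point the scheduler places it as usual. A minimal sketch of a gated pod is shown below; the gate name is a placeholder, since the real gate is injected and later removed by the operator.
apiVersion: v1
kind: Pod
metadata:
  name: gated-example
spec:
  schedulingGates:
  - name: example.com/wait-for-gpu-slice               # placeholder; the operator manages its own gate
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal # any image; shown only for completeness
    command: ["sleep", "infinity"]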
Integration with the NVIDIA GPU Operator
Rather than replacing the NVIDIA GPU Operator, the dynamic accelerator slicer introduces an external controller that manages slice allocation. This integration keeps GPU management robust and lets the slicing mechanism fit seamlessly into the existing ecosystem.
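If you want to sanity-check that integration on your cluster, you can confirm the GPU Operator is healthy and see which MIG strategy its ClusterPolicy uses. The namespace and ClusterPolicy name below are the usual defaults on OpenShift, but yours may differ:
oc get pods -n nvidia-gpu-operator
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'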
Automated slice lifecycle management
From allocation to deallocation, the operator handles the entire lifecycle of GPU slices. This automation not only streamlines resource management but also ensures that resources are promptly returned to the pool once the pod completes its execution.
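In practice, this means the allocations recorded on the instaslice objects appear when a pod is admitted and disappear when it finishes. You can observe this directly with the same commands used later in this article (the node name is a placeholder):
oc get instaslice
oc get instaslice <node-name> -o json | jq .status.podAllocationResults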
Additionally, the solution extends GPU memory quota management via Kubernetes Quota and Kueue and introduces dynamic policy-driven slicing with efficient packing to meet diverse workload demands.
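As a minimal sketch of the quota side, a namespace-scoped ResourceQuota can cap how many slices of a given profile a team may request, and Kueue can enforce similar limits at the queue level. The profile-based resource name below is an assumption used purely for illustration; the operator may expose GPU memory rather than profile counts for quota purposes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mig-quota
  namespace: team-a                            # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/mig-3g.20gb: "2"       # assumption: at most two 3g.20gb slices in this namespace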
Deploying the dev preview in OpenShift
Deploying the dynamic accelerator slicer is straightforward. Before you begin, ensure you have the necessary dependencies installed:
- oc CLI: Used to interact with your OpenShift cluster.
- Operator SDK: Required to run the operator bundle.
Note: Make sure the KUBECONFIG environment variable is set to point to your OpenShift cluster's configuration file.
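For example (the path is a placeholder):
export KUBECONFIG=/path/to/kubeconfig
oc whoami --show-server   # confirms the CLI is talking to the intended cluster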
Follow these steps to deploy the operator:
Create a new project:
oc new-project instaslice-system
Run the operator bundle:
operator-sdk run bundle quay.io/ibm/instaslice-bundle:0.0.1 -n instaslice-system
These commands deploy the operator into your OpenShift cluster, allowing it to dynamically manage GPU slices alongside your existing NVIDIA GPU setup.
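Before moving on, you can confirm that the operator pods are running (pod names and counts vary by version):
oc get pods -n instaslice-system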
Deploying an inference large language model (LLM)
Multiple inference instances can be deployed on a single GPU, with isolation provided by NVIDIA MIG technology. Here we have a Red Hat OpenShift Container Platform worker node with 4 NVIDIA A100 GPUs:
[root@nvidia-driver-daemonset-417 drivers]# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-42db43a1-526e-626d-ff2d-456bf26d9df0)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-143a66c4-bd69-7559-8898-26f9886a2a56)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-7d9ad99b-3d06-cf12-bab8-e1f31672e01f)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7cf6fcb8-d173-0224-618c-4813e24a3383)
These GPUs can be sliced to serve multiple inference LLM models. Download the sample vLLM model.
After downloading, make sure to edit the file and set the value of HF_TOKEN in huggingface-secret to the token obtained from your Hugging Face account. Then apply that file using oc apply -f <filename>.
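For reference, the relevant parts of that manifest look roughly like the sketch below; the container image, arguments, and exact MIG resource name are assumptions, so treat the downloaded file as the authoritative version. The sample also defines a Service named vllm, which is used for port-forwarding later in this article.
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-secret
stringData:
  HF_TOKEN: <your Hugging Face token>           # set this before applying
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest          # assumption: upstream vLLM serving image
        args: ["--model", "facebook/opt-125m"]  # the model queried later in this article
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/mig-3g.20gb: 1           # assumption: matches the 3g.20gb slice seen below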
Please note that this particular sample model is rather large and may take a while to download, depending on your network speed. Make sure the vLLM deployment has come up successfully:
$ oc get deployments vllm
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
vllm   1/1     1            1           35s
You can verify that instaslice has provisioned the slice requested by the deployment by inspecting the instaslice object as well as by probing the NVIDIA GPU Operator.
First, find the UID of the pod running on the host. This is one way to get it:
$ oc get pods vllm-7dbb49b8f8-znd4s -o json | jq .metadata.uid
"98c08795-2da3-4598-81b7-538c4e37093b"
This UID lets us track the allocation status in the instaslice object:
$ oc get instaslice host-192-168-11-144 -o json | jq .status.podAllocationResults
{
  "98c08795-2da3-4598-81b7-538c4e37093b": {
    "allocationStatus": {
      "allocationStatusController": "ungated",
      "allocationStatusDaemonset": "created"
    },
    "configMapResourceIdentifier": "a02ae459-0618-4804-83f9-e5ba36af756f",
    "gpuUUID": "GPU-143a66c4-bd69-7559-8898-26f9886a2a56",
    "migPlacement": {
      "size": 4,
      "start": 0
    },
    "nodename": "host-192-168-11-144"
  }
}
As the previous output shows, the slice has been created dynamically. You can also verify this with nvidia-smi from the NVIDIA driver daemonset pod, as before:
# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-42db43a1-526e-626d-ff2d-456bf26d9df0)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-143a66c4-bd69-7559-8898-26f9886a2a56)
  MIG 3g.20gb     Device  0: (UUID: MIG-b065e8ea-01ff-598f-83fd-3eea1f2869b9)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-7d9ad99b-3d06-cf12-bab8-e1f31672e01f)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7cf6fcb8-d173-0224-618c-4813e24a3383)
Now in another terminal, make sure to set the KUBECONFIG for the OpenShift cluster and execute the following command to forward the port:
oc port-forward svc/vllm 8000:8000 -n instaslice-system
Come back to the earlier terminal to query the model:
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
{"id":"cmpl-474b5727068745baa4718925bcfa8f91","object":"text_completion","created":163554,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live. I","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
If you delete this workload, instaslice automatically clears the corresponding slices on the respective GPUs.
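For example, after deleting the deployment, re-checking the node should show the MIG device gone (allow a few moments for the operator to reconcile):
oc delete deployment vllm
# Then, from the NVIDIA driver daemonset pod on the node:
nvidia-smi -L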
The benefits of dynamic GPU slicing
Adopting dynamic GPU slicing offers multiple benefits:
- Increased utilization: Workloads can utilize GPU slices efficiently, reducing the need to reserve entire GPUs when only a fraction is necessary.
- Cost reduction: By paying only for the resources actively used, organizations can significantly reduce GPU-related expenses.
- Seamless scaling: Dynamic allocation allows clusters to adapt in real time to changing workloads, ensuring optimal performance and resource distribution.
Looking ahead: Integration with Kubernetes DRA
The Kubernetes Dynamic Resource Allocation (DRA) framework aims to provide even finer control over GPU scheduling. While DRA is still evolving, the dynamic accelerator slicer offers a practical way to achieve fine-grained GPU sharing today, even in its developer preview form, and it paves the way for future enhancements so that OpenShift users can continue to push the boundaries of resource efficiency.
Efficient GPU management is essential for maximizing the performance of AI workloads in OpenShift. The dynamic accelerator slicer operator delivers dynamic GPU slicing that optimizes resource usage, reduces operational costs, and adapts seamlessly to the fluctuating demands of modern AI applications.
By integrating dynamically with the NVIDIA GPU Operator and leveraging OpenShift's scheduling capabilities, this approach transforms GPU allocation from a static, often inefficient process into a dynamic, on-demand service. Embracing dynamic GPU slicing is a critical step towards unlocking the full potential of AI workloads in the OpenShift environment.