In retrospect, 2024 was the year of generative AI (GenAI) adoption, with a focus on secure, scalable, and easy-to-deploy solutions. Red Hat OpenShift AI is a flexible, scalable artificial intelligence and machine learning (AI/ML) platform, a best-in-class MLOps platform that enables enterprises to build, train, deploy, and monitor AI models and AI-enabled applications at scale across hybrid cloud environments.
NVIDIA NIM is a set of easy-to-use microservices for accelerating the deployment of foundation models on any cloud or data center, offering optimized inference for deploying AI models at scale. This article demonstrates how to set up NVIDIA NIM on Red Hat OpenShift AI and discusses the benefits.
The benefits of NVIDIA NIM
The availability of NVIDIA NIM on OpenShift AI accelerates the delivery of GenAI applications for faster time to value and delivers the following benefits:
- Ease of deployment and scale: You can deploy and scale NVIDIA NIM with a consistent platform across hybrid cloud environments.
- Maximum AI/ML performance: You can run NIM's optimized inference microservices on a scalable and secure platform.
- Quick on-demand access: Self-service access to NVIDIA NIM streamlines the delivery of AI-enabled apps.
Figure 1 depicts the combined stack of NVIDIA and Red Hat OpenShift AI on Red Hat OpenShift.

NVIDIA NIM is a containerized inference microservice that includes industry-standard APIs, domain-specific code, optimized inference engines, and enterprise runtime (Figure 2).

The core benefits of NVIDIA NIM and Red Hat OpenShift AI are as follows.
Benefits of NVIDIA NIM:
- Deploy anywhere.
- Develop with industry-standard APIs.
- Leverage domain-specific models.
- Run on optimized inference engines.
- Support for enterprise-grade AI.
Benefits of Red Hat OpenShift AI:
- Bring AI-enabled apps to production faster.
- Flexibility across the hybrid cloud.
- Less time managing AI infrastructure.
- Tested and supported AI/ML tooling.
- Leverage our best practices.
Combined with Red Hat’s hybrid and multi-cloud “deploy anywhere” strategy, NVIDIA NIM offers a substantial advantage to enterprise customers.
How to set up and deploy NVIDIA NIM
The following steps describe how to set up NVIDIA NIM on Red Hat OpenShift AI.
Log in to NVIDIA NGC and select Setup to create your key. Note that you need an enterprise license; if you register as an NVIDIA developer and request an enterprise license, you will get a 90-day trial.
Select Generate API Key as shown in Figure 3.

Click the Generate API Key button once again on the next screen (Figure 4). Then copy and save this key for the next step.

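If you also work from a terminal, you can keep the key in an environment variable and verify it against the NGC registry. This is an optional, illustrative snippet (it assumes podman is installed locally); the dashboard flow below only needs the key pasted in:
# Keep the NGC API key available for CLI work.
export NGC_API_KEY='<paste-your-key-here>'

# Verify the key: NGC accepts it as a registry password with the
# literal username $oauthtoken.
echo "$NGC_API_KEY" | podman login nvcr.io --username '$oauthtoken' --password-stdin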
Navigate to OpenShift AI -> Applications -> Explore and enable NVIDIA NIM (Figure 5). You will be asked for the key you generated in the previous step. After you enter the key, the process takes a couple of minutes to complete.

Once enabled, you should see NVIDIA NIM in the Applications -> Enabled section (Figure 6).

To use NVIDIA NIM, you need to create a Data Science Project called “NVIDIA-NIM-PROJECT” and go to the Models tab (Figure 7).

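A Data Science Project is a regular OpenShift project that the dashboard recognizes by a label. If you prefer to create it from the CLI, a sketch like the following should work (the label shown is what current OpenShift AI releases look for; note that the underlying namespace name must be lowercase):
# Create the project and label it so the OpenShift AI dashboard lists it.
oc new-project nvidia-nim-project
oc label namespace nvidia-nim-project opendatahub.io/dashboard=true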
In the Models tab, you should see the NIM serving platform. If you do not see the NIM tile, as in Figure 8, you need to enable it manually in the OpenShift AI dashboard.

Go to OpenShift -> API Explorer and search for "OdhDashboardConfig" in the redhat-ods-applications project (Figure 9).

Select OdhDashboardConfig, then go to the Instances tab and select odh-dashboard-config (Figure 10).

Under the YAML tab, in the spec section, you will see disableNIMModelServing: true. Change this to disableNIMModelServing: false to make the NIM tile visible in the Models tab on the OpenShift AI dashboard, as shown in Figure 11, and save it. (You can also make this change from the CLI, as shown below.)

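Equivalently, you can inspect and flip the flag with the oc CLI. This is a sketch that assumes the flag sits under spec.dashboardConfig, where OpenShift AI keeps its dashboard feature toggles; verify the path in your cluster's YAML first:
# Inspect the dashboard configuration.
oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications -o yaml

# Enable the NIM serving platform tile (assumes the flag lives under
# spec.dashboardConfig; check the YAML from the command above first).
oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \
  --type merge -p '{"spec":{"dashboardConfig":{"disableNIMModelServing":false}}}'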
For the NVIDIA NIM model serving platform to appear in the Models tab of your project, log back in to the Red Hat OpenShift AI dashboard and refresh the page after two or three minutes (Figure 12).

Now you can deploy the models from NVIDIA NIM by selecting Deploy from the NVIDIA NIM tile (Figure 13).

This starts deploying the model using the NVIDIA NIM runtime and inference service. Make sure that you have sufficient resources to run the model on your Red Hat OpenShift cluster.
The pod will be scheduled. Once all four containers in the pod are ready in Red Hat OpenShift, the status will be reflected in Red Hat OpenShift AI, as shown in Figure 14.

There are four containers in the pod (Figure 15).

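You can also watch the rollout from the command line. The namespace below matches the one in the YAML that follows; the pod name is a placeholder:
# Watch the NIM pod until all four containers are ready.
oc get pods -n nvidai-nim-project -w

# List the names of the containers in the pod.
oc get pod <nim-pod-name> -n nvidai-nim-project -o jsonpath='{.spec.containers[*].name}'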
The following shows the ServingRuntime created in this project:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: NVIDIA NIM
    opendatahub.io/template-name: nvidia-nim-runtime
    openshift.io/display-name: phi-3-mini-4k-instruct
  resourceVersion: '1385642'
  name: phi-3-mini-4k-instruct
  namespace: nvidai-nim-project
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /mnt/models/cache
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              key: NGC_API_KEY
              name: nvidia-nim-secrets
      image: 'nvcr.io/nim/microsoft/phi-3-mini-4k-instruct:1'
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
        - mountPath: /mnt/models/cache
          name: nim-pvc-1732708712714pkx2u69rm9
  imagePullSecrets:
    - name: ngc-secret
  multiModel: false
  protocolVersions:
    - grpc-v2
    - v2
  supportedModelFormats:
    - autoSelect: true
      name: phi-3-mini-4k-instruct
      priority: 1
      version: '1'
  volumes:
    - name: nim-pvc-1732708712714pkx2u69rm9
      persistentVolumeClaim:
        claimName: nim-pvc-1732708712714pkx2u69rm9
    - emptyDir:
        medium: Memory
        sizeLimit: 2Gi
      name: shm
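Deploying from the dashboard also creates a KServe InferenceService that references this runtime. The exact resource is generated for you, so treat the following as a minimal sketch of its shape (field values such as the GPU count are assumptions and will vary with your deployment):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: phi-3-mini-4k-instruct
  namespace: nvidai-nim-project
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: phi-3-mini-4k-instruct
      runtime: phi-3-mini-4k-instruct
      resources:
        limits:
          nvidia.com/gpu: '1'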
The following is the status once the inference service is successfully started:
status:
  address:
    url: 'http://phi-3-mini-4k-instruct.nvidai-nim-project.svc.cluster.local'
  components:
    predictor:
      address:
        url: 'http://phi-3-mini-4k-instruct-predictor.nvidai-nim-project.svc.cluster.local'
      latestCreatedRevision: phi-3-mini-4k-instruct-predictor-00001
      latestReadyRevision: phi-3-mini-4k-instruct-predictor-00001
      latestRolledoutRevision: phi-3-mini-4k-instruct-predictor-00001
      traffic:
        - latestRevision: true
          percent: 100
          revisionName: phi-3-mini-4k-instruct-predictor-00001
      url: 'http://phi-3-mini-4k-instruct-predictor.nvidai-nim-project.svc.cluster.local'
  conditions:
    - lastTransitionTime: '2024-11-27T12:04:38Z'
      status: 'True'
      type: IngressReady
    - lastTransitionTime: '2024-11-27T12:04:37Z'
      severity: Info
      status: 'True'
      type: LatestDeploymentReady
    - lastTransitionTime: '2024-11-27T12:04:37Z'
      severity: Info
      status: 'True'
      type: PredictorConfigurationReady
    - lastTransitionTime: '2024-11-27T12:04:38Z'
      status: 'True'
      type: PredictorReady
    - lastTransitionTime: '2024-11-27T12:04:38Z'
      severity: Info
      status: 'True'
      type: PredictorRouteReady
    - lastTransitionTime: '2024-11-27T12:04:38Z'
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-27T12:04:38Z'
      severity: Info
      status: 'True'
      type: RoutesReady
  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
      transitionStatus: UpToDate
  observedGeneration: 1
  url: 'http://phi-3-mini-4k-instruct.nvidai-nim-project.svc.cluster.local'
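To confirm readiness and exercise the endpoint, you can query it directly. The URL is the cluster-local address from the status above, so run the curl from inside the cluster (for example, from a workbench pod); the /v1/models path is part of NIM's OpenAI-compatible API:
# Check that the inference service reports READY.
oc get inferenceservice phi-3-mini-4k-instruct -n nvidai-nim-project

# From inside the cluster: list the models served by the NIM endpoint.
curl -s http://phi-3-mini-4k-instruct.nvidai-nim-project.svc.cluster.local/v1/models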
Once the pod is running, the model status will turn green.
Note that you need sufficient resources (i.e., GPU and related GPU memory) to deploy the model in question. NVIDIA calls this a profile: the profile for each NVIDIA NIM supported model describes its resource requirements and can be found here.
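To see which profiles a given NIM image supports and which are compatible with your GPUs, NIM containers ship a list-model-profiles utility that you can run in the deployed pod (the pod name below is a placeholder):
# List supported and hardware-compatible profiles for this NIM image.
oc exec -it <nim-pod-name> -n nvidai-nim-project -- list-model-profiles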
The following is the Python code to access a NVIDIA NIM model (here, NVIDIA's hosted endpoint; an API key is required when running outside NGC):
from openai import OpenAI

# NIM exposes an OpenAI-compatible API, so the standard OpenAI client works.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC"
)

# Stream a chat completion from the model.
completion = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "What are benefits of NVIDIA NIM on Red Hat OpenShift AI?"}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True
)

# Print streamed tokens as they arrive.
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
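The same client can also target the model deployed on your cluster instead of NVIDIA's hosted endpoint. This is a sketch that assumes you are running inside the cluster (for example, from a workbench) and that no API key is enforced on the internal endpoint; the base URL is the cluster-local address from the InferenceService status, plus the /v1 path:
from openai import OpenAI

# Cluster-local NIM endpoint from the InferenceService status above;
# reachable only from inside the cluster (e.g., a workbench pod).
client = OpenAI(
    base_url="http://phi-3-mini-4k-instruct.nvidai-nim-project.svc.cluster.local/v1",
    api_key="none",  # assumption: no key is enforced on the internal endpoint
)

completion = client.chat.completions.create(
    # Assumption: the served model name matches the NIM image's model.
    model="microsoft/phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "Summarize NVIDIA NIM in one sentence."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)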
I connected to the llama-3.1-70b-instruct model deployed using NVIDIA NIM with this code and got a good response to the question I asked. The following is a copy of the response from the llama-3.1-70b-instruct model hosted on NVIDIA NIM.
NVIDIA NIM (NVIDIA AI Enterprise Infrastructure Monitoring) on Red Hat OpenShift AI provides several benefits for AI and data science workloads. Here are some of the key advantages:
- Simplified AI infrastructure management: NIM provides a unified view of the entire AI infrastructure, including NVIDIA GPUs, Red Hat OpenShift clusters, and AI workloads. This simplifies management, monitoring, and troubleshooting of AI environments.
- Real-time monitoring and alerting: NIM offers real-time monitoring of GPU utilization, temperature, and other key metrics. This enables administrators to quickly identify and respond to issues, reducing downtime and improving overall system efficiency.
- AI workload optimization: NIM provides insights into AI workload performance, allowing administrators to optimize resource allocation, prioritize workloads, and ensure that critical tasks are completed efficiently.
- GPU resource utilization: NIM helps optimize GPU resource utilization, ensuring that these valuable resources are allocated effectively and minimizing waste.
- Integration with Red Hat OpenShift: NIM integrates seamlessly with Red Hat OpenShift, providing a single pane of glass for managing both the AI infrastructure and the containerized workloads running on it.
- Scalability and flexibility: NIM supports large-scale AI deployments, making it an ideal solution for enterprises with complex AI environments.
- Security and compliance: NIM provides robust security features, including role-based access control, encryption, and auditing to ensure that AI environments meet stringent security and compliance requirements.
- Improved collaboration: NIM enables data scientists, developers, and IT administrators to collaborate more effectively, sharing insights and best practices to drive AI innovation.
- Reduced costs: By optimizing AI infrastructure and workload performance, NIM helps reduce costs associated with GPU resources, energy consumption, and infrastructure maintenance.
- Faster time-to-insight: NIM accelerates the AI development lifecycle by providing real-time insights, automating workflows, and streamlining the deployment of AI models.
By combining NVIDIA NIM with Red Hat OpenShift AI, organizations can create a powerful AI platform that accelerates innovation, improves efficiency, and drives business success.
Efficient model deployment and responses
Red Hat OpenShift AI with NVIDIA NIM offers tremendous opportunities for enterprises to deploy models anywhere, with a focus on secure, scalable, and easy-to-deploy solutions. Choosing the right hardware profile for your model is important for getting efficient responses from it.
Get started with Red Hat OpenShift AI learning paths and our free Developer Sandbox.