This repository contains an umbrella Helm chart that installs the NIM embedding model and one of two possible LLMs.

RHEcosystemAppEng/exploit-iq-models


Exploit IQ Self-Hosted Models

Overview

This repository contains an umbrella Helm chart for deploying AI models for the Exploit IQ platform. The chart deploys:

  • Embedding model: NVIDIA NIM embedding model (nv-embedqa-e5-v5) for creating vector embeddings
  • LLM: One of the following large language models:
    • Llama 3.1 70B Instruct (4-bit quantization) with vLLM
    • NVIDIA NIM Llama 3.1 8B Instruct (16-bit precision)

Note: Only one LLM can be deployed at a time. The chart enforces this constraint to prevent resource conflicts.

Prerequisites

Before installing this chart, ensure you have:

  • OpenShift cluster with GPU support
  • GPU nodes with NVIDIA drivers installed
  • NGC API key from NVIDIA (generated from your account on the NGC portal)
  • Helm 3.x installed on your local machine
  • oc configured to access your cluster
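Before installing, you can confirm that the cluster actually exposes GPU resources; the nvidia.com/gpu resource name below is the one the chart's resource limits use:

```shell
# List each node with the number of allocatable NVIDIA GPUs
# (nodes without the GPU resource show "<none>")
oc get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```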

Installation

Note: Run all commands from the repository root directory.

Step 1: Create the namespace

Create a dedicated namespace for the models:

oc new-project exploit-iq-models

Step 2: Prepare your NGC API key

Export your NGC API key as an environment variable:

export NGC_API_KEY=<your-ngc-api-key>

Replace <your-ngc-api-key> with your actual NGC API key.

Step 3: Create a custom values file

Create a custom values file with your NGC API key:

sed -E 's/ \&ngc-api-key changeme/ \&ngc-api-key '$NGC_API_KEY'/' \
  exploit-iq-models/values.yaml > \
  exploit-iq-models/custom-values.yaml
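To sanity-check the substitution locally, you can run the same sed expression over a sample line. The ngc_api_key key name below is illustrative; the &ngc-api-key changeme anchor is the one from values.yaml:

```shell
# Demonstrate the substitution on a sample line (key name is illustrative)
export NGC_API_KEY=nvapi-example-key
echo 'ngc_api_key: &ngc-api-key changeme' \
  | sed -E 's/ \&ngc-api-key changeme/ \&ngc-api-key '$NGC_API_KEY'/'
# → ngc_api_key: &ngc-api-key nvapi-example-key
```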

Step 4: Configure tolerations (if needed)

If your cluster has GPU nodes with taints, you must configure tolerations in your custom values file. Edit custom-values.yaml and uncomment the toleration sections for each component as shown in the file comments.

Example for nodes with nvidia.com/gpu taint:

llama3_1_70b_instruct_4bit:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"

nim-embed:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"

Step 5: Deploy the chart

Important: You cannot deploy both LLMs simultaneously. Choose one of the following deployment options:

Option A: Deploy with Llama 3.1 70B (default)

helm upgrade --install exploit-iq-models \
  exploit-iq-models/ \
  -f exploit-iq-models/custom-values.yaml

Option B: Deploy with NIM Llama 3.1 8B

helm upgrade --install exploit-iq-models \
  exploit-iq-models/ \
  -f exploit-iq-models/custom-values.yaml \
  --set llama3_1_70b_instruct_4bit.enabled=false \
  --set nim_llm.enabled=true

Attempting to deploy both LLMs results in an error:

Error: INSTALLATION FAILED: execution error at (exploit-iq-models/templates/configmap.yaml:6:3):
Only one of models should be deployed!, either llama3_1_70b_instruct_4bit or nim_llm 8b, but not both!
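After either option completes, you can confirm which overrides Helm recorded for the release:

```shell
helm get values exploit-iq-models   # shows only the values overridden at install time
helm status exploit-iq-models       # release status and last deployment time
```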

Verification

Wait for pods to be ready

Wait for the LLM pod to be ready (this can take several minutes as the model downloads):

oc wait --for=condition=ready pod -l component=llama3.1-70b-instruct --timeout=1000s

Retrieve the route URL

Get the route URL for your deployment:

ROUTE_URL=$(oc get route llama3-1-70b-instruct-4bit -o jsonpath='{.spec.host}')
echo "Model endpoint: http://$ROUTE_URL"

Test the model

Send a test request to the model:

curl -X POST -H "Content-Type: application/json" \
  http://$ROUTE_URL/v1/chat/completions \
  -d @exploit-iq-models/files/70b-4bit-input-example.json | jq .

Expected response: JSON output with the model's completion.
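The example file is not reproduced here, but vLLM's /v1/chat/completions endpoint follows the OpenAI chat schema, so an equivalent minimal request body can be built by hand. The model id below is an assumption; list the served models with curl http://$ROUTE_URL/v1/models to get the exact value:

```shell
# Minimal OpenAI-style chat payload; the "model" value is illustrative
PAYLOAD='{"model": "llama3.1-70b-instruct-4bit", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'
echo "$PAYLOAD"
# Send it the same way as the example file:
#   curl -X POST -H "Content-Type: application/json" \
#     http://$ROUTE_URL/v1/chat/completions -d "$PAYLOAD" | jq .
```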

Configuration

Configuring tolerations

By default, all tolerations are empty arrays ([]), allowing the chart to work on clusters without GPU node taints. If your cluster uses taints to dedicate GPU nodes for specific workloads, configure tolerations in your values file.

See the comments in exploit-iq-models/values.yaml for detailed examples.

Security Context Constraints (SCC)

The chart automatically configures SCC permissions based on your deployment configuration. OpenShift then selects the appropriate SCC for the pod:

  • Single-GPU deployment (hostIPC: false): Pod uses anyuid SCC
  • Multi-GPU deployment (hostIPC: true): Pod uses hostaccess SCC

For multi-GPU deployments, the hostaccess SCC is required because it allows hostIPC, which vLLM needs for inter-process communication across GPUs. When you set hostIPC: true, the chart automatically grants permission to use the hostaccess SCC.
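To confirm which SCC OpenShift actually applied, you can read the pod's openshift.io/scc annotation, using the same label selector as in the Verification section:

```shell
# Prints e.g. "anyuid" (single-GPU) or "hostaccess" (multi-GPU)
oc get pod -l component=llama3.1-70b-instruct \
  -o jsonpath='{.items[0].metadata.annotations.openshift\.io/scc}'
```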

Configuring vLLM arguments

By default, vLLM runs with minimal configuration (only the --model argument). Configure additional vLLM arguments in your custom values file based on your GPU capabilities and performance requirements:

llama3_1_70b_instruct_4bit:
  hostIPC: true  # Required for multi-GPU tensor parallelism
  vllm:
    args:
      - "--tensor-parallel-size=2"      # Multi-GPU parallelism
      - "--max-model-len=8192"           # Context length
      - "--max-num-seqs=32"              # Parallel sequences
      - "--gpu-memory-utilization=0.9"  # GPU memory fraction
  resources:
    limits:
      nvidia.com/gpu: "2"  # Must match tensor-parallel-size
      memory: 35Gi         # Increase for multi-GPU
      cpu: 4000m
    requests:
      nvidia.com/gpu: "2"
      memory: 25Gi
      cpu: 2000m

Important: When using --tensor-parallel-size > 1, you must:

  1. Set hostIPC: true to enable inter-process communication between GPUs
  2. Adjust the GPU resource limits to match the number of GPUs (e.g., nvidia.com/gpu: "2")

The chart automatically grants the hostaccess SCC to allow hostIPC, so no manual SCC configuration is needed.
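As a rough sizing check (illustrative numbers: 70B parameters at 4-bit quantization hold about 35 GB of weights), tensor parallelism splits the weight memory evenly across GPUs, with the KV cache on top:

```shell
# Back-of-envelope per-GPU weight memory (illustrative numbers only)
PARAMS_B=70   # parameters, in billions
BITS=4        # quantization width
TP=2          # --tensor-parallel-size
WEIGHTS_GB=$(( PARAMS_B * BITS / 8 ))   # total weight memory
PER_GPU_GB=$(( WEIGHTS_GB / TP ))       # per GPU, before KV cache
echo "weights=${WEIGHTS_GB}GB per_gpu=${PER_GPU_GB}GB"
# → weights=35GB per_gpu=17GB
```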

Uninstallation

To remove the deployed models:

helm uninstall exploit-iq-models
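helm uninstall removes the chart-managed resources but not the namespace itself (nor any PersistentVolumeClaims that may remain); if you created a dedicated project in Step 1, delete it as well:

```shell
oc delete project exploit-iq-models
```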
