This repository contains an umbrella Helm chart for deploying AI models for the Exploit IQ platform. The chart deploys:
- Embedding model: NVIDIA NIM embedding model (nv-embedqa-e5-v5) for creating vector embeddings
- LLM: one of the following large language models:
  - Llama 3.1 70B Instruct (4-bit quantization) with vLLM
  - NVIDIA NIM Llama 3.1 8B Instruct (16-bit quantization)
Note: Only one LLM can be deployed at a time. The chart enforces this constraint to prevent resource conflicts.
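Inside the chart, a mutual-exclusion guard like this is typically implemented with Helm's `fail` template function; a sketch of what the check might look like (the actual template in the chart may differ):

```yaml
{{- /* Abort rendering when both LLMs are enabled at once. */ -}}
{{- if and .Values.llama3_1_70b_instruct_4bit.enabled .Values.nim_llm.enabled }}
{{- fail "Only one of models should be deployed!, either llama3_1_70b_instruct_4bit or nim_llm 8b, but not both!" }}
{{- end }}
```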
Before installing this chart, ensure you have:
- OpenShift cluster with GPU support
- GPU nodes with NVIDIA drivers installed
- NGC API key from NVIDIA
- Helm 3.x installed on your local machine
- The `oc` CLI configured to access your cluster
Note: Run all commands from the repository root directory.
Create a dedicated namespace for the models:
```shell
oc new-project exploit-iq-models
```

Export your NGC API key as an environment variable:

```shell
export NGC_API_KEY=<your-ngc-api-key>
```

Replace `<your-ngc-api-key>` with your actual NGC API key.
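Before continuing, you can fail fast if the key was never exported. This helper is hypothetical (not part of the chart) and only checks that `NGC_API_KEY` is set and is not the literal placeholder:

```shell
# Hypothetical sanity check: abort early if NGC_API_KEY is unset or still a placeholder.
ngc_key_ok() {
  if [ -z "${NGC_API_KEY:-}" ] || [ "${NGC_API_KEY}" = "<your-ngc-api-key>" ]; then
    echo "NGC_API_KEY is not set to a real key" >&2
    return 1
  fi
  echo "NGC_API_KEY looks set"
}

# Usage: ngc_key_ok || exit 1
```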
Create a custom values file with your NGC API key:
```shell
sed -E 's/ \&ngc-api-key changeme/ \&ngc-api-key '$NGC_API_KEY'/' \
  exploit-iq-models/values.yaml > \
  exploit-iq-models/custom-values.yaml
```

If your cluster has GPU nodes with taints, you must configure tolerations in your custom values file. Edit `custom-values.yaml` and uncomment the toleration sections for each component as shown in the file comments.
Example for nodes with the `nvidia.com/gpu` taint:

```yaml
llama3_1_70b_instruct_4bit:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"

nim-embed:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
```

Important: You cannot deploy both LLMs simultaneously. Choose one of the following deployment options.
Option A: Deploy with Llama 3.1 70B (default)

```shell
helm upgrade --install exploit-iq-models \
  exploit-iq-models/ \
  -f exploit-iq-models/custom-values.yaml
```

Option B: Deploy with NIM Llama 3.1 8B

```shell
helm upgrade --install exploit-iq-models \
  exploit-iq-models/ \
  -f exploit-iq-models/custom-values.yaml \
  --set llama3_1_70b_instruct_4bit.enabled=false \
  --set nim_llm.enabled=true
```

Attempting to deploy both LLMs results in an error:
```
Error: INSTALLATION FAILED: execution error at (exploit-iq-models/templates/configmap.yaml:6:3):
Only one of models should be deployed!, either llama3_1_70b_instruct_4bit or nim_llm 8b, but not both!
```

Wait for the LLM pod to be ready (this can take several minutes while the model downloads):

```shell
oc wait --for=condition=ready pod -l component=llama3.1-70b-instruct --timeout=1000s
```

Get the route URL for your deployment:
```shell
ROUTE_URL=$(oc get route llama3-1-70b-instruct-4bit -o jsonpath='{.spec.host}')
echo "Model endpoint: http://$ROUTE_URL"
```

Send a test request to the model:

```shell
curl -X POST -H "Content-Type: application/json" \
  http://$ROUTE_URL/v1/chat/completions \
  -d @exploit-iq-models/files/70b-4bit-input-example.json | jq .
```

Expected response: JSON output with the model's completion.
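The `/v1/chat/completions` path indicates an OpenAI-compatible API (the interface vLLM's server exposes), so a successful response should have roughly this shape (fields abbreviated, values illustrative):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "...",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "..." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0 }
}
```

The generated text is in `.choices[0].message.content`.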
By default, all tolerations are empty arrays ([]), allowing the chart to work on clusters without GPU node taints. If your cluster uses taints to dedicate GPU nodes for specific workloads, configure tolerations in your values file.
See the comments in exploit-iq-models/values.yaml for detailed examples.
The chart automatically configures SCC permissions based on your deployment configuration. OpenShift then selects the appropriate SCC for the pod:
- Single-GPU deployment (`hostIPC: false`): the pod uses the `anyuid` SCC
- Multi-GPU deployment (`hostIPC: true`): the pod uses the `hostaccess` SCC
For multi-GPU deployments, the hostaccess SCC is required because it allows hostIPC, which vLLM needs for inter-process communication across GPUs. When you set hostIPC: true, the chart automatically grants permission to use the hostaccess SCC.
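Under the hood, SCC access in OpenShift is granted through RBAC. The chart's automatic grant is roughly equivalent to a RoleBinding like the following sketch (the resource name and service account are assumptions; the chart's actual template may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: llama-hostaccess-scc   # hypothetical name
  namespace: exploit-iq-models
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:hostaccess  # built-in role granting "use" on the hostaccess SCC
subjects:
  - kind: ServiceAccount
    name: default              # assumption: the service account the LLM pod runs as
    namespace: exploit-iq-models
```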
By default, vLLM runs with minimal configuration (only the --model argument). Configure additional vLLM arguments in your custom values file based on your GPU capabilities and performance requirements:
```yaml
llama3_1_70b_instruct_4bit:
  hostIPC: true  # Required for multi-GPU tensor parallelism
  vllm:
    args:
      - "--tensor-parallel-size=2"      # Multi-GPU parallelism
      - "--max-model-len=8192"          # Context length
      - "--max-num-seqs=32"             # Parallel sequences
      - "--gpu-memory-utilization=0.9"  # GPU memory fraction
  resources:
    limits:
      nvidia.com/gpu: "2"  # Must match tensor-parallel-size
      memory: 35Gi         # Increase for multi-GPU
      cpu: 4000m
    requests:
      nvidia.com/gpu: "2"
      memory: 25Gi
      cpu: 2000m
```

Important: When using `--tensor-parallel-size` greater than 1, you must:
- Set `hostIPC: true` to enable inter-process communication between GPUs
- Adjust GPU resource limits to match the number of GPUs (e.g., `nvidia.com/gpu: "2"`)

The chart automatically adds the `hostaccess` SCC to allow `hostIPC`; no manual SCC configuration is needed.
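A mismatch between `--tensor-parallel-size` and the GPU limit is an easy mistake to make. The following helper is hypothetical (not provided by the chart) and uses only grep/sed, assuming the values layout shown above:

```shell
# Hypothetical sanity check: verify --tensor-parallel-size matches the
# nvidia.com/gpu limit in a values file before installing the chart.
check_parallelism() {
  local file="$1" tp gpus
  # First --tensor-parallel-size=N occurrence in the file
  tp=$(grep -o 'tensor-parallel-size=[0-9]*' "$file" | head -n1 | cut -d= -f2)
  # First quoted number on the line after "limits:" (the GPU count)
  gpus=$(grep -A1 'limits:' "$file" | grep -o '"[0-9]*"' | head -n1 | tr -d '"')
  if [ "${tp:-1}" != "${gpus:-1}" ]; then
    echo "MISMATCH: tensor-parallel-size=${tp:-unset} but GPU limit=${gpus:-unset}"
    return 1
  fi
  echo "OK: tensor-parallel-size and GPU limit agree (${gpus:-1})"
}

# Usage: check_parallelism exploit-iq-models/custom-values.yaml
```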
To remove the deployed models:
```shell
helm uninstall exploit-iq-models
```