This repository contains an umbrella Helm chart for deploying AI models for the Exploit IQ platform. The chart deploys:
- Embedding model: NVIDIA NIM embedding model (nv-embedqa-e5-v5) for creating vector embeddings
- LLM: one of the following large language models:
  - Llama 3.1 70B Instruct (4-bit quantization) with vLLM
  - NVIDIA NIM Llama 3.1 8B Instruct (16-bit quantization)
Note: Only one LLM can be deployed at a time. The chart enforces this constraint to prevent resource conflicts.
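Inside the chart, a mutual-exclusion guard like this is typically implemented with Helm's `fail` template function; a sketch of what the check might look like (the actual template in the chart may differ):

```yaml
{{- /* Abort rendering when both LLMs are enabled at once. */ -}}
{{- if and .Values.llama3_1_70b_instruct_4bit.enabled .Values.nim_llm.enabled }}
{{- fail "Only one of models should be deployed!, either llama3_1_70b_instruct_4bit or nim_llm 8b, but not both!" }}
{{- end }}
```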
Before installing this chart, ensure you have:
- OpenShift cluster with GPU support
- GPU nodes with NVIDIA drivers installed
- NGC API key from NVIDIA
- Helm 3.x installed on your local machine
- The `oc` CLI configured to access your cluster
Note: Run all commands from the repository root directory.
Create a dedicated namespace for the models:
```shell
oc new-project exploit-iq-models
```

Export your NGC API key as an environment variable:

```shell
export NGC_API_KEY=<your-ngc-api-key>
```

Replace `<your-ngc-api-key>` with your actual NGC API key.
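Before continuing, you can fail fast if the key was never exported. This helper is hypothetical (not part of the chart) and only checks that `NGC_API_KEY` is set and is not the literal placeholder:

```shell
# Hypothetical sanity check: abort early if NGC_API_KEY is unset or still a placeholder.
ngc_key_ok() {
  if [ -z "${NGC_API_KEY:-}" ] || [ "${NGC_API_KEY}" = "<your-ngc-api-key>" ]; then
    echo "NGC_API_KEY is not set to a real key" >&2
    return 1
  fi
  echo "NGC_API_KEY looks set"
}

# Usage: ngc_key_ok || exit 1
```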
Create a custom values file with your NGC API key:
```shell
sed -E 's/ \&ngc-api-key changeme/ \&ngc-api-key '$NGC_API_KEY'/' \
  exploit-iq-models/values.yaml > \
  exploit-iq-models/custom-values.yaml
```

If your cluster has GPU nodes with taints, you must configure tolerations in your custom values file. Edit `custom-values.yaml` and uncomment the toleration sections for each component as shown in the file comments.
Example for nodes with the `nvidia.com/gpu` taint:

```yaml
llama3_1_70b_instruct_4bit:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"

nim-embed:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
```

Important: You cannot deploy both LLMs simultaneously. Choose one of the following deployment options.
Option A: Deploy with Llama 3.1 70B (default)

```shell
helm upgrade --install exploit-iq-models \
  exploit-iq-models/ \
  -f exploit-iq-models/custom-values.yaml
```

Option B: Deploy with NIM Llama 3.1 8B

```shell
helm upgrade --install exploit-iq-models \
  exploit-iq-models/ \
  -f exploit-iq-models/custom-values.yaml \
  --set llama3_1_70b_instruct_4bit.enabled=false \
  --set nim_llm.enabled=true
```

Attempting to deploy both LLMs results in an error:
```
Error: INSTALLATION FAILED: execution error at (exploit-iq-models/templates/configmap.yaml:6:3):
Only one of models should be deployed!, either llama3_1_70b_instruct_4bit or nim_llm 8b, but not both!
```

Wait for the LLM pod to be ready (this can take several minutes while the model downloads):

```shell
oc wait --for=condition=ready pod -l component=llama3.1-70b-instruct --timeout=1000s
```

Get the route URL for your deployment:
```shell
ROUTE_URL=$(oc get route llama3-1-70b-instruct-4bit -o jsonpath='{.spec.host}')
echo "Model endpoint: http://$ROUTE_URL"
```

Send a test request to the model:

```shell
curl -X POST -H "Content-Type: application/json" \
  http://$ROUTE_URL/v1/chat/completions \
  -d @exploit-iq-models/files/70b-4bit-input-example.json | jq .
```

Expected response: JSON output with the model's completion.
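The `/v1/chat/completions` path indicates an OpenAI-compatible API (the interface vLLM's server exposes), so a successful response should have roughly this shape (fields abbreviated, values illustrative):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "...",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "..." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0 }
}
```

The generated text is in `.choices[0].message.content`.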
By default, all tolerations are empty arrays ([]), allowing the chart to work on clusters without GPU node taints. If your cluster uses taints to dedicate GPU nodes for specific workloads, configure tolerations in your values file.
See the comments in exploit-iq-models/values.yaml for detailed examples.
The chart automatically configures SCC permissions based on your deployment configuration. OpenShift then selects the appropriate SCC for the pod:
- Single-GPU deployment (`hostIPC: false`): the pod uses the `anyuid` SCC
- Multi-GPU deployment (`hostIPC: true`): the pod uses the `hostaccess` SCC
For multi-GPU deployments, the hostaccess SCC is required because it allows hostIPC, which vLLM needs for inter-process communication across GPUs. When you set hostIPC: true, the chart automatically grants permission to use the hostaccess SCC.
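Under the hood, SCC access in OpenShift is granted through RBAC. The chart's automatic grant is roughly equivalent to a RoleBinding like the following sketch (the resource name and service account are assumptions; the chart's actual template may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: llama-hostaccess-scc   # hypothetical name
  namespace: exploit-iq-models
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:hostaccess  # built-in role granting "use" on the hostaccess SCC
subjects:
  - kind: ServiceAccount
    name: default              # assumption: the service account the LLM pod runs as
    namespace: exploit-iq-models
```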
By default, vLLM runs with minimal configuration (only the --model argument). Configure additional vLLM arguments in your custom values file based on your GPU capabilities and performance requirements:
```yaml
llama3_1_70b_instruct_4bit:
  hostIPC: true  # Required for multi-GPU tensor parallelism
  vllm:
    args:
      - "--tensor-parallel-size=2"      # Multi-GPU parallelism
      - "--max-model-len=8192"          # Context length
      - "--max-num-seqs=32"             # Parallel sequences
      - "--gpu-memory-utilization=0.9"  # GPU memory fraction
  resources:
    limits:
      nvidia.com/gpu: "2"  # Must match tensor-parallel-size
      memory: 35Gi         # Increase for multi-GPU
      cpu: 4000m
    requests:
      nvidia.com/gpu: "2"
      memory: 25Gi
      cpu: 2000m
```

Important: When using `--tensor-parallel-size` greater than 1, you must:
- Set `hostIPC: true` to enable inter-process communication between GPUs
- Adjust GPU resource limits to match the number of GPUs (e.g., `nvidia.com/gpu: "2"`)

The chart automatically adds the `hostaccess` SCC to allow `hostIPC`; no manual SCC configuration is needed.
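A mismatch between `--tensor-parallel-size` and the GPU limit is an easy mistake to make. The following helper is hypothetical (not provided by the chart) and uses only grep/sed, assuming the values layout shown above:

```shell
# Hypothetical sanity check: verify --tensor-parallel-size matches the
# nvidia.com/gpu limit in a values file before installing the chart.
check_parallelism() {
  local file="$1" tp gpus
  # First --tensor-parallel-size=N occurrence in the file
  tp=$(grep -o 'tensor-parallel-size=[0-9]*' "$file" | head -n1 | cut -d= -f2)
  # First quoted number on the line after "limits:" (the GPU count)
  gpus=$(grep -A1 'limits:' "$file" | grep -o '"[0-9]*"' | head -n1 | tr -d '"')
  if [ "${tp:-1}" != "${gpus:-1}" ]; then
    echo "MISMATCH: tensor-parallel-size=${tp:-unset} but GPU limit=${gpus:-unset}"
    return 1
  fi
  echo "OK: tensor-parallel-size and GPU limit agree (${gpus:-1})"
}

# Usage: check_parallelism exploit-iq-models/custom-values.yaml
```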
To remove the deployed models:
```shell
helm uninstall exploit-iq-models
```