From "Shadow AI" to Governed, Tuned, and Automated Inference
The Problem: Deploying LLMs is not like deploying microservices. Default configurations lead to Out-Of-Memory (OOM) crashes, wasted GPU spend, and unpredictable latency.
The Solution: A structured engineering approach to Selection, Sizing, Tuning, and Automation using vLLM and Red Hat OpenShift AI.
This repository contains the complete "Enterprise Serving" learning path. It guides Platform Engineers from the basics of GPU architecture to a production-ready, GitOps-based deployment of the IBM Granite-3.3-2B model.
If you are an experienced engineer and simply want to see the vLLM serving automation in action (skipping the theory), follow the steps below to deploy a tuned Granite model immediately. You will need:
- Cluster: Red Hat OpenShift AI 3.0 installed.
- Hardware: At least 1 Node with an NVIDIA GPU (T4, A10G, or L4).
- CLI: oc logged in with cluster-admin privileges.
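A quick sanity check before running anything (standard oc subcommands, nothing repository-specific):
# Confirm the session is logged in and has cluster-scoped rights
oc whoami
oc auth can-i create namespaces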
- (Optional) If you do not have an S3 bucket or Data Connection, run this script to deploy MinIO and download the model automatically:
chmod +x deploy/fast-track.sh
./deploy/fast-track.sh
- Deploy the Model (GitOps): Run the automated deployment script. This creates the ServingRuntime (Engine) and InferenceService (Workload) with specific tuning parameters (max-model-len=8192) to prevent crashes.
chmod +x deploy/deploy-serve.sh
./deploy/deploy-serve.sh
- Verify: Once the script reports ✅ SUCCESS, test the API:
# Get the URL
export URL=$(oc get inferenceservice granite-4-micro -n rhoai-model-vllm-lab -o jsonpath='{.status.url}')
# Test Inference
curl -k $URL/v1/completions \
-H 'Content-Type: application/json' \
-d '{ "model": "granite-4-micro", "prompt": "Define MLOps.", "max_tokens": 50 }'
📚 The Full Course (Antora)
This repository is structured as a self-paced course. To view the full learning experience, including GPU sizing math, architecture diagrams, and vLLM deep-dives, build the documentation site.
Using Docker (Recommended)
docker run -u $(id -u) -v $PWD:/antora:Z --rm -t antora/antora antora-playbook.yml
Using Local NPM
npm install
npx antora antora-playbook.yml
📖 Course Modules
Module 1: Strategy & Selection
The Enterprise Reality: Moving beyond leaderboard hype.
Validated Patterns: Using the Red Hat AI Validated Model Repository to de-risk deployment.
Model Selection: Why we chose Granite-3.3-2B (Apache 2.0, Transparent, Efficient).
Module 2: Hardware Architecture & Sizing
GPU Generations: When to use Ampere (A10G) vs. Hopper (H100).
The Math: How to calculate VRAM requirements using the formula:
Total VRAM = (Model Weights * 1.2) + KV Cache (see the worked example below).
The Trap: Why a model fits when idle but crashes under load.
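As a rough illustration of that formula (assuming FP16 weights at 2 bytes per parameter; exact numbers vary by checkpoint and precision), a 2B-parameter model works out to roughly:
# Back-of-the-envelope sizing for a ~2B-parameter model in FP16 (illustrative only)
awk 'BEGIN {
  weights_gb    = 2e9 * 2 / 1e9       # ~4 GB of FP16 weights
  with_overhead = weights_gb * 1.2    # +20% runtime overhead, per the formula above
  printf "Weights + overhead: ~%.1f GB\n", with_overhead
}'
Everything left on the card after that (roughly 11 GB on a 16 GB T4) is the budget for the KV cache, which grows with concurrent requests and context length; that budget is exactly what the vLLM flags in Module 3 control.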
Module 3: The Engine (vLLM)
Concepts: Understanding PagedAttention and efficient memory management.
Tuning Guide:
--max-model-len: The "Safety Valve" for context windows.
--gpu-memory-utilization: Optimizing for throughput.
--tensor-parallel-size: Sharding large models across GPUs.
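Outside of OpenShift, the same three flags can be exercised against a standalone vLLM install; a minimal sketch (model ID and values are illustrative, not the lab's exact configuration):
# Illustrative only: start vLLM's OpenAI-compatible server with the flags above
vllm serve ibm-granite/granite-3.3-2b-instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1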
Module 4: Automated Deployment
Infrastructure-as-Code: Abandoning "Click-Ops" for reproducible scripts.
The Lab: Executing deploy-serve.sh to deploy the tuned stack.
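For orientation before opening the script, here is a minimal sketch of the kind of manifest deploy-serve.sh applies; the runtime name and storage URI are placeholders, and the script itself remains the source of truth:
# Sketch only: a KServe InferenceService wiring the vLLM runtime to the tuning args
oc apply -n rhoai-model-vllm-lab -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-4-micro
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-servingruntime     # placeholder: must match the ServingRuntime the script creates
      storageUri: s3://models/granite  # placeholder: the bucket fast-track.sh populates
      args:
        - --max-model-len=8192         # the "safety valve" from Module 3
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF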
📂 Repository Structure
/
├── deploy/                          # Automation Scripts
│   ├── fast-track.sh                # Lab Setup (MinIO + Model Download)
│   └── deploy-serve.sh              # The Deployment Logic (vLLM + KServe)
│
├── docs/                            # Course Content (AsciiDoc)
│   └── modules/ROOT/pages/
│       ├── index.adoc               # Introduction
│       ├── hardware-sizing.adoc
│       ├── vllm-tuning.adoc
│       └── automated-deployment.adoc
│
└── antora-playbook.yml              # Documentation Build Config
🛠 Troubleshooting
OOMKilled (Exit Code 137):
Cause: The model + KV Cache exceeded GPU VRAM.
Fix: Edit deploy/deploy-serve.sh and lower CONTEXT_LIMIT (e.g., from 8192 to 4096).
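If CONTEXT_LIMIT is a plain VAR=VALUE assignment in the script (an assumption about its layout), the change is a one-liner:
# Assumes CONTEXT_LIMIT is set as a simple shell assignment in deploy-serve.sh
sed -i 's/^CONTEXT_LIMIT=.*/CONTEXT_LIMIT=4096/' deploy/deploy-serve.sh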
Pod Stuck in Pending:
Cause: No GPU nodes available or quotas exceeded.
Fix: Check oc describe pod for scheduling errors.
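Two quick checks that usually isolate the cause (the label selector is the same one used in the timeout section below):
# Show the scheduler's events and confirm nodes actually advertise GPUs
oc describe pod -l serving.kserve.io/inferenceservice=granite-4-micro -n rhoai-model-vllm-lab | tail -n 20
oc describe nodes | grep -i 'nvidia.com/gpu'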
Timeout Waiting for Model:
Cause: Downloading the model/image took longer than the script's loop.
Fix: Check logs: oc logs -f -l serving.kserve.io/inferenceservice=granite-4-micro -c kserve-container.
🔗 Next Steps
Once you have mastered single-model serving, you are ready for the advanced modules (Coming Soon):
Quantization Lab: Compressing Granite to INT8 using InstructLab.
Distributed Inference: Using llm-d for intelligent routing across multiple replicas.