
feat: add dynamo-inference project with Dynamo v0.9.0 + cross-node EFA LIBFABRIC #71

Open
dmvevents wants to merge 19 commits into aws-samples:main from dmvevents:feat/dynamo-v0.9.0-efa-update

Conversation

@dmvevents
Contributor

Summary

Add the dynamo-inference project with production-validated disaggregated inference on AWS using NVIDIA Dynamo v0.9.0 with EFA RDMA networking.

What's included

Dockerfiles (EFA-optimized base + framework images)

  • Dockerfile.base — NIXL 0.9.0 + EFA 1.45.1 + UCX 1.20.0 + NCCL 2.28.9 + CUDA 13.1
  • Dockerfile.dynamo-trtllm — TensorRT-LLM 1.3.0rc3
  • Dockerfile.dynamo-vllm — vLLM 0.14.1

Cross-node disaggregated inference

  • deployments/cross-node/trtllm-crossnode-libfabric.yaml — Production-validated on 2x P5.48xlarge
  • NIXL LIBFABRIC backend over EFA RDMA with confirmed cross-node handshake
  • 128Gi pod memory requirement documented (needed to enumerate all 32 EFA devices; 64Gi causes OOM)
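For reference, a minimal sketch of the resource block such a cross-node worker pod needs (values taken from the description above; the actual manifest in deployments/cross-node/trtllm-crossnode-libfabric.yaml may place them differently):

```yaml
# Per-worker resource limits on a P5.48xlarge node (illustrative sketch)
resources:
  limits:
    nvidia.com/gpu: 8              # all 8 H100s on the node
    vpc.amazonaws.com/efa: 32      # all 32 EFA devices (requires the EFA device plugin)
    memory: 128Gi                  # 64Gi is insufficient during EFA device enumeration
```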

Build scripts and documentation

  • build.sh, build_trtllm.sh, build_vllm.sh — automated builds
  • Benchmarking guide, NGC build variants, deployment scripts
  • Complete README with version matrix and troubleshooting

Component Versions

Component      Version
Dynamo         v0.9.0
NIXL           0.9.0
TRT-LLM        1.3.0rc3
vLLM           0.14.1
CUDA           13.1
EFA installer  1.45.1
NCCL           2.28.9-1
UCX            1.20.0

Validation

Tested on 2x P5.48xlarge (8x H100 80GB, 32 EFA per node):

  • NIXL LIBFABRIC backend activation confirmed
  • Cross-node EFA handshake (8 bidirectional connections)
  • Inference: 147ms latency with Qwen3-0.6B
  • 10 concurrent requests: all successful
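The 10-request check can be reproduced with a simple client-side fan-out; the endpoint path and model name below are assumptions (Dynamo's frontend typically serves an OpenAI-compatible API on port 8000):

```
# Fire 10 concurrent chat-completion requests at the frontend (sketch only;
# adjust ENDPOINT and the model name to match your deployment)
ENDPOINT="${ENDPOINT:-http://localhost:8000/v1/chat/completions}"
seq 1 10 | xargs -P 10 -I{} curl -s -o /dev/null -w "request {} -> HTTP %{http_code}\n" \
  "$ENDPOINT" \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'
```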

Test plan

  • Cross-node inference on P5.48xlarge with EFA
  • NIXL LIBFABRIC backend + EFA handshake verified
  • Full Docker build from scratch
  • Multi-GPU (TP8) production test

dmvevents and others added 19 commits November 14, 2025 02:30
This contribution adds comprehensive GPU-to-GPU distributed inference capabilities using the NVIDIA Inference Xfer Library (NIXL).

**Key Components:**

Container Images:
- dynamo-vllm: vLLM backend with NIXL support
- dynamo-trtllm: TensorRT-LLM backend with NIXL support
- Production-ready Dockerfiles with UCX, EFA, GPUDirect RDMA

Infrastructure:
- EKS/HyperPod deployment manifests
- ETCD coordination service setup
- EFA networking configuration
- GPUDirect RDMA kernel module integration

Documentation:
- Complete benchmarking guide (nixlbench)
- Installation and setup instructions
- Performance validation results (0.307 GB/sec peak bandwidth)
- Troubleshooting guides

Testing:
- nixlbench validation suite
- Multi-node GPU communication tests
- UCX backend performance benchmarks

This enables high-performance distributed inference workloads with cross-node GPU-to-GPU communication on AWS infrastructure.
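Pods wired for EFA with GPUDirect RDMA typically also carry libfabric environment settings; a hedged sketch (the repository's manifests may use a different combination):

```yaml
# Common libfabric/EFA environment for a worker container (illustrative)
env:
  - name: FI_PROVIDER
    value: "efa"                   # pin libfabric to the EFA provider
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "1"                     # enable GPUDirect RDMA over EFA
  - name: NCCL_DEBUG
    value: "INFO"                  # surface NCCL transport selection in logs
```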
…ddress PR feedback

Changes made to address PR review feedback for aws-samples#49:

1. Directory Rename (nixl-distributed-inference → dynamo-inference)
   - Renamed project directory to align with Dynamo initiative naming
   - All file references updated to reflect new structure
   - Git history preserved via git mv command

2. Security: Redact AWS Account Numbers
   - Replaced all instances of the AWS account ID with <AWS_ACCOUNT_ID>
   - Affected files: YAML configurations, documentation, build scripts, test logs
   - Total replacements: 28 instances across 12 files
   - No credentials or secrets exposed in commit history (verified)

3. Remove Production Terminology Claims
   - Changed "production-ready" → "deployment-ready"
   - Changed "Production Ready" → "Deployment Ready"
   - Changed "ready for production" → "ready for deployment"
   - Changed "production deployment" → "deployment"
   - Changed "production workloads" → "deployment workloads"
   - Preserved legitimate technical terms (Docker tags, filenames)
   - Compliance with aws-samples org policy (no production guarantees)

Files Modified:
- All documentation (.md files) updated for terminology
- Kubernetes deployment configs (.yaml) updated for account redaction
- Build scripts (.sh) updated for account redaction
- Test logs (.log) updated for account redaction
- README.md updated with deployment-ready language

This update ensures the contribution meets aws-samples organization
standards for public repositories, removing sensitive AWS account
identifiers and production readiness claims.
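A redaction pass like the one described can be done with a single sed sweep; the exact command is an assumption (the commit only records the outcome), and the account ID below is a placeholder:

```shell
# Replace a 12-digit AWS account ID in an ECR image URI with a placeholder
# (simplistic pattern: a 12-digit run followed by ".dkr.ecr")
line='image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/dynamo-base:latest'
redacted=$(printf '%s\n' "$line" | sed 's/[0-9]\{12\}\.dkr\.ecr/<AWS_ACCOUNT_ID>.dkr.ecr/g')
echo "$redacted"
# -> image: <AWS_ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/dynamo-base:latest
```

In a repository this would be run over the matching files, e.g. `grep -rl '\.dkr\.ecr' --include='*.yaml' . | xargs sed -i ...`.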
Replace 'production' terminology with appropriate alternatives to comply
with aws-samples organization policy:
- Rename Dockerfile.production → Dockerfile.base
- Update Docker tags: production → optimized
- Update container type references: production → base
- Update documentation: production → deployment (context-dependent)

Files changed:
- Dockerfile.base (renamed from Dockerfile.production)
- Build scripts: build.sh, build_trtllm.sh, build-all-{slim,runtime}.sh
- Validation: scripts/validate-build.sh
- Dockerfiles: Dockerfile.dynamo-{vllm,trtllm}
- Documentation: README.md, VERSION_ALIGNMENT_APPLIED.md, and 9 other docs

Legitimate contextual uses remain (e.g., "production deployment",
"production workload testing").
This commit adds complete TRT-LLM deployment capabilities to the Dynamo
inference platform, including:

- Disaggregated TRT-LLM deployment configurations (prefill/decode workers)
- KV cache transceiver configurations for disaggregated mode
- Helper scripts and benchmark suite for testing
- Comprehensive deployment documentation with troubleshooting
- A10 GPU support documentation for workshop deployments
- Benchmark results from H100 testing

Key features:
- Working TRT-LLM disaggregated architecture with 2x prefill + 2x decode workers
- Triton Unicode bug workaround applied automatically
- ConfigMap-based configuration management
- Tested and validated on H100 GPUs (381-470 tokens/sec)
- Full A10 GPU support with optimized configurations

Files added:
- deployments/trtllm/trtllm-disagg-qwen.yaml
- deployments/trtllm/trtllm-prefill-config.yaml
- deployments/trtllm/trtllm-decode-config.yaml
- scripts/trtllm-helpers.sh
- scripts/benchmark-trtllm.sh
- docs/TRTLLM_DEPLOYMENT_GUIDE.md
- docs/A10_DEPLOYMENT_GUIDE.md
- benchmark-results/trtllm_benchmark_20251118_191302.md
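The prefill/decode ConfigMaps above typically wrap a TRT-LLM engine config whose KV-cache transceiver section enables disaggregated transfer; a minimal sketch (field names follow the TRT-LLM LLM-API YAML schema; the values are placeholders, not the PR's actual settings):

```yaml
# Hypothetical prefill-worker config fragment for disaggregated serving
tensor_parallel_size: 1
disable_overlap_scheduler: true    # prefill side usually disables overlap scheduling
cache_transceiver_config:
  backend: DEFAULT                 # KV-cache transfer backend (UCX or NIXL also possible)
  max_tokens_in_buffer: 4096
```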
Replace friend/colleague references with neutral technical language:
- 'friend configuration' → 'reference configuration'
- 'colleague' → 'reference implementation'
- 'successful friend configuration' → 'validated configuration'

Files updated:
- NIXLBENCH_SETUP_GUIDE.md
- nixl-aligned/VERSION_COMPARISON.md
**Critical fix**: Correct libfabric version references
- Changed all v1.21.0 references to v2.3.0 (matches Dockerfile and validated tests)
- Files: nixl-aligned/GETTING_STARTED.md, VERSION_COMPARISON.md, README.md

**Format standardization**:
- Removed all emojis (✅, ❌, ⚠️) and replaced with text ([Completed], [No], [Warning])
- Standardized header capitalization (ALLCAPS → Standard Case)
- Files affected: 13 markdown files across project

**Summary of changes**:
- libfabric version alignment: 10 files updated
- Emoji removal: ~150 replacements across 13 files
- Header standardization: docs/KUBECTL_QUICK_REF.md, NIXLBENCH_TESTING_GUIDE.md

All documentation now matches repository standards and validated test configurations.
- Add NON_INTERACTIVE environment variable support to all build scripts
  (build.sh, build_vllm.sh, build_trtllm.sh, build-all-slim.sh, build-all-runtime.sh)
- Remove informal language from documentation
- Replace AWS account numbers with placeholder <AWS_ACCOUNT_ID>
- Remove status markers from documentation sections
- Standardize terminology (Production -> Debloated/Optimized)
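The NON_INTERACTIVE guard those scripts can use looks roughly like this (a sketch; the actual prompt text and logic in build.sh may differ):

```shell
#!/usr/bin/env bash
# Sketch of a NON_INTERACTIVE guard for a build script (hypothetical pattern,
# not the repository's exact code)
confirm() {
  if [ "${NON_INTERACTIVE:-0}" = "1" ]; then
    return 0                      # CI mode: skip the prompt entirely
  fi
  read -r -p "$1 [y/N] " reply
  [ "$reply" = "y" ]
}

if NON_INTERACTIVE=1 confirm "Build dynamo-base image?"; then
  echo "proceeding with build"
fi
```

Invoked as `NON_INTERACTIVE=1 ./build.sh`, the script then runs end to end without prompting, which is what CI pipelines need.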
…ation

Update all component versions to align with Dynamo v0.9.0 release:

Version updates:
- Dynamo: 0.4.0/0.6.1 -> v0.9.0
- NIXL: 0.6.0/0.7.1 -> 0.9.0
- TRT-LLM: 1.1.0rc5 -> 1.3.0rc3
- vLLM: 0.10.2/0.11.0 -> 0.14.1
- CUDA base: 25.01/cuda12.8 -> 25.12/cuda13.1
- PyTorch: 25.06-py3 -> 25.12-py3
- EFA installer: 1.43.1 -> 1.45.1
- NCCL: 2.23.4-1 -> 2.28.9-1
- AWS OFI NCCL: v1.12.0-aws -> v1.17.2
- GDRCopy: 2.4.1 -> 2.5.1
- UCX: v1.19.0 -> v1.20.0

New: cross-node disaggregated inference deployment YAML
- Validated on 2x P5.48xlarge with NIXL LIBFABRIC over EFA RDMA
- EFA handshake confirmed between nodes
- 128Gi memory requirement documented (64Gi causes OOM)

Files updated:
- Dockerfile.base, Dockerfile.dynamo-trtllm, Dockerfile.dynamo-vllm
- build_trtllm.sh, build_vllm.sh
- README.md (version tables, cross-node section)
- NGC build references