
@luomingmeng

What type of PR is this?

Features

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from fafce64 to 198035f Compare November 24, 2025 06:31
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 9 times, most recently from 9c9800a to c84f8f7 Compare December 1, 2025 08:56
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch 2 times, most recently from 5046660 to 6d4961a Compare December 9, 2025 21:53
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 2 times, most recently from 4dfcbf5 to 8176c42 Compare December 17, 2025 02:12
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from 541bb02 to d5d22b1 Compare December 25, 2025 06:34
luomingmeng and others added 15 commits December 25, 2025 15:01
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations
…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
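The embedding pattern behind this commit can be sketched as follows. This is only an illustration: `resourcePluginServer`, `unimplementedResourcePluginServer`, `staticPolicy`, and the string-based method signatures are hypothetical stand-ins, not the real `pluginapi` types, which are generated by gRPC and take request and response messages.

```go
package main

import "errors"

// errUnimplemented stands in for the gRPC "Unimplemented" status (assumption).
var errUnimplemented = errors.New("method not implemented")

// resourcePluginServer is a hypothetical, simplified server interface.
type resourcePluginServer interface {
	Allocate(req string) (string, error)
	GetAssociatedDeviceTopologyHints(req string) (string, error)
}

// unimplementedResourcePluginServer supplies a default for every method,
// mirroring what generated Unimplemented*Server types do.
type unimplementedResourcePluginServer struct{}

func (unimplementedResourcePluginServer) Allocate(string) (string, error) {
	return "", errUnimplemented
}

func (unimplementedResourcePluginServer) GetAssociatedDeviceTopologyHints(string) (string, error) {
	return "", errUnimplemented
}

// staticPolicy embeds the defaults and overrides only what it supports, so a
// method added to the interface later does not break compilation.
type staticPolicy struct {
	unimplementedResourcePluginServer
}

func (staticPolicy) Allocate(req string) (string, error) {
	return "allocated:" + req, nil
}

// Compile-time check that staticPolicy still satisfies the interface.
var _ resourcePluginServer = staticPolicy{}
```

Calling `Allocate` on a `staticPolicy` uses the override, while `GetAssociatedDeviceTopologyHints` falls through to the embedded default and returns `errUnimplemented`.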
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
… allocated memory

Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
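A minimal sketch of the preference described above, assuming the allocated memory per NUMA node is kept in a map. The names `numaAllocated` and `preferredNUMAOrder`, the tie-break rule, and the memory unit are illustrative assumptions, not taken from the PR.

```go
package main

import "sort"

// numaAllocated maps a NUMA node ID to the GPU memory already allocated on it.
// The unit is an assumption; float64 matches the memory type used in this PR.
type numaAllocated map[int]float64

// preferredNUMAOrder returns the candidate NUMA node IDs sorted so that nodes
// with the most allocated GPU memory come first. Ties are broken by the lower
// node ID (an assumption made here to keep the order deterministic).
func preferredNUMAOrder(allocated numaAllocated, candidates []int) []int {
	ordered := append([]int(nil), candidates...) // copy; leave the input untouched
	sort.SliceStable(ordered, func(i, j int) bool {
		ai, aj := allocated[ordered[i]], allocated[ordered[j]]
		if ai != aj {
			return ai > aj
		}
		return ordered[i] < ordered[j]
	})
	return ordered
}
```

With 8, 24, and 8 units allocated on nodes 0, 1, and 2, the function orders the candidates as 1, 0, 2.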
chore: add unit tests

…lugins

feat: introduce rdma state and allow states to share within gpu sub-plugins

…ompany resource allocation

feat: implement rdma custom device plugin and implement logic for accompany resource allocation
luomingmeng and others added 8 commits December 25, 2025 15:01
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of an error when the NUMA topology is not ready, and log the error
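One way to picture this behavior, assuming the "not ready" condition is signalled by a sentinel error. `errTopologyNotReady`, `refreshState`, and the loader callback are hypothetical names used only for this sketch.

```go
package main

import (
	"errors"
	"log"
)

// errTopologyNotReady is a hypothetical sentinel for "NUMA topology not ready".
var errTopologyNotReady = errors.New("numa topology is not ready")

// refreshState loads the NUMA node list through the supplied callback. A
// not-ready topology is logged and treated as a no-op instead of a failure;
// any other error is still returned to the caller.
func refreshState(loadNUMANodes func() ([]int, error)) error {
	nodes, err := loadNUMANodes()
	if err != nil {
		if errors.Is(err, errTopologyNotReady) {
			log.Printf("skipping state refresh: %v", err)
			return nil
		}
		return err
	}
	log.Printf("refreshing state for %d NUMA nodes", len(nodes))
	return nil
}
```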

fix: handle error gracefully

implement ShareGPUManager to determine GPU device sharing eligibility based on pod indicators
add periodic sync and caching for efficient decision making
integrate with base plugin and update device state handling
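The three points above (eligibility from pod indicators, periodic sync, cached decisions) might be sketched like this. It is not the `ShareGPUManager` from the PR: `podIndicator`, `shareCache`, and the rule that a device is shareable only when every pod on it allows sharing are assumptions made for illustration.

```go
package main

import (
	"sync"
	"time"
)

// podIndicator is a hypothetical per-pod signal tied to the device it uses.
type podIndicator struct {
	DeviceID   string
	AllowShare bool
}

// shareCache keeps the latest sharing decision per device.
type shareCache struct {
	mu        sync.RWMutex
	shareable map[string]bool
}

func newShareCache() *shareCache {
	return &shareCache{shareable: map[string]bool{}}
}

// sync rebuilds the cache from scratch. Assumed rule: a device is shareable
// only if every pod currently on it allows sharing.
func (c *shareCache) sync(indicators []podIndicator) {
	next := make(map[string]bool)
	for _, p := range indicators {
		if prev, seen := next[p.DeviceID]; seen {
			next[p.DeviceID] = prev && p.AllowShare
		} else {
			next[p.DeviceID] = p.AllowShare
		}
	}
	c.mu.Lock()
	c.shareable = next
	c.mu.Unlock()
}

// canShare answers from the cache; devices not seen in the last sync are not shareable.
func (c *shareCache) canShare(deviceID string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.shareable[deviceID]
}

// run re-syncs on every tick until stop is closed.
func (c *shareCache) run(interval time.Duration, list func() []podIndicator, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.sync(list())
		case <-stop:
			return
		}
	}
}
```

Keeping lookups on a read lock means allocation paths never wait on the pod listing itself, only on the brief map swap at the end of each sync.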
Introduce a new eviction plugin for GPU resources to handle eviction based on GPU topology and allocations. The plugin reuses the generic ZoneResourcesPlugin with zoneType=GPU to preserve behavior while adding specific GPU resource handling capabilities. Includes unit tests to verify functionality.
…checks

Ensure healthz state is updated when errors occur during threshold checks in both resources and zone resources plugins.
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from d5d22b1 to b2a1de2 Compare December 25, 2025 07:01
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 6 times, most recently from e2365c2 to 21b097a Compare January 2, 2026 06:33
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 9 times, most recently from eb87c70 to 33b37df Compare January 7, 2026 07:10
Add CheckReclaimed condition to skip reclaimed containers when evaluating device share status
Add test case to verify reclaimed containers are ignored
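As a rough illustration of the reclaimed-container check, the sketch below skips reclaimed containers while deciding a device's share status. The `containerInfo` type, its fields, and the "any exclusive container blocks sharing" rule are hypothetical; only the skip itself comes from the commit message.

```go
package main

// containerInfo is a hypothetical summary of a container bound to a device.
type containerInfo struct {
	Name           string
	Reclaimed      bool // true for reclaimed containers
	WantsExclusive bool // assumed signal that the container blocks sharing
}

// deviceShareable evaluates share status from non-reclaimed containers only.
// Reclaimed containers are skipped, so they never block sharing on their own.
func deviceShareable(containers []containerInfo) bool {
	for _, c := range containers {
		if c.Reclaimed {
			continue
		}
		if c.WantsExclusive {
			return false
		}
	}
	return true
}
```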
Move aggregation of allocatable and capacity quantities after health check to ensure accurate totals for unhealthy or non-shared devices
Add constant thresholdMetToleranceDurationForGPU to set a fixed 15-second tolerance duration for GPU resource eviction, replacing the dynamic configuration value.
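A small sketch of how such a fixed tolerance could be applied. The constant name and the 15-second value come from the commit message; its `time.Duration` type and the `toleranceExceeded` helper are assumptions.

```go
package main

import "time"

// thresholdMetToleranceDurationForGPU is the fixed tolerance named in the
// commit message. Declaring it as a time.Duration is an assumption.
const thresholdMetToleranceDurationForGPU = 15 * time.Second

// toleranceExceeded reports whether a threshold has been continuously met for
// at least the fixed tolerance duration.
func toleranceExceeded(metFor time.Duration) bool {
	return metFor >= thresholdMetToleranceDurationForGPU
}
```

Hard-coding the value gives up per-deployment tuning of the tolerance in exchange for GPU eviction behavior that no longer depends on the dynamic configuration.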
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from 83e65e3 to b13e4d2 Compare January 14, 2026 12:08
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch from b3d9e26 to b1649fb Compare January 20, 2026 07:58

2 participants