support enable share gpu for reclaimed #12

luomingmeng · 2025-11-24T06:26:58Z

What type of PR is this?

Features

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management

… or numa zone node

- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations

The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.

Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations

…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes.

Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.

… allocated memory

Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.

chore: add unit tests

…lugins

feat: introduce rdma state and allow states to share within gpu sub-plugins

…ompany resource allocation

feat: implement rdma custom device plugin and implement logic for accompany resource allocation

Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.

Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.

Return nil instead of an error when the NUMA topology is not ready, and log the error.

fix: handle error gracefully

Implement ShareGPUManager to determine GPU device sharing eligibility based on pod indicators. Add periodic sync and caching for efficient decision making. Integrate with the base plugin and update device state handling.

Introduce a new eviction plugin for GPU resources to handle eviction based on GPU topology and allocations. The plugin reuses the generic ZoneResourcesPlugin with zoneType=GPU to preserve behavior while adding specific GPU resource handling capabilities. Includes unit tests to verify functionality.

…checks

Ensure healthz state is updated when errors occur during threshold checks in both resources and zone resources plugins.

Add CheckReclaimed condition to skip reclaimed containers when evaluating device share status. Add test case to verify reclaimed containers are ignored.

Move aggregation of allocatable and capacity quantities after health check to ensure accurate totals for unhealthy or non-shared devices

Add constant thresholdMetToleranceDurationForGPU to set a fixed 15-second tolerance duration for GPU resource eviction, replacing the dynamic configuration value.

luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from fafce64 to 198035f Compare November 24, 2025 06:31

JustinChengLZ force-pushed the dev/support-gpu-plugins branch 9 times, most recently from 9c9800a to c84f8f7 Compare December 1, 2025 08:56

luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch 2 times, most recently from 5046660 to 6d4961a Compare December 9, 2025 21:53

JustinChengLZ force-pushed the dev/support-gpu-plugins branch 2 times, most recently from 4dfcbf5 to 8176c42 Compare December 17, 2025 02:12

luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from 541bb02 to d5d22b1 Compare December 25, 2025 06:34

luomingmeng and others added 15 commits December 25, 2025 15:01

refactor(topology): skip add zone node which is not a child of socket…

bd12c5b

… or numa zone node

refactor(cpu): remove unused preferredHintIndexes variable

c76e134

The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.

refactor(qrm-plugins): embed UnimplementedResourcePluginServer in pol…

7ef74f5

…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes.

feat(gpu): add associated device topology hints support

4470d02

Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.

fix typo and add logs

fe70af9

refactor(gpu): remove redundant non-numa-affinity gpu allocation logic

11f153e

feat(gpu): optimize GPU allocation by preferring NUMA nodes with most…

ea554e6

… allocated memory Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
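The ordering described in this commit can be illustrated with a minimal Go sketch. It is not taken from the PR: `preferNUMANodes`, its map input, and the tie-break on the lower node ID are all hypothetical, assuming only that hint calculation ranks NUMA nodes by the GPU memory already allocated on them.

```go
package main

import (
	"fmt"
	"sort"
)

// preferNUMANodes is a hypothetical helper (not from this PR). It returns
// NUMA node IDs ordered so that nodes with more allocated GPU memory come
// first. Ties are broken by the lower node ID, which is an assumption made
// here to keep the result deterministic.
func preferNUMANodes(allocated map[int]float64) []int {
	nodes := make([]int, 0, len(allocated))
	for id := range allocated {
		nodes = append(nodes, id)
	}
	sort.Slice(nodes, func(i, j int) bool {
		a, b := allocated[nodes[i]], allocated[nodes[j]]
		if a != b {
			return a > b // more allocated memory sorts first
		}
		return nodes[i] < nodes[j]
	})
	return nodes
}

func main() {
	// Node 1 already holds the most allocated GPU memory, so it is preferred.
	fmt.Println(preferNUMANodes(map[int]float64{0: 8, 1: 24, 2: 8}))
}
```

The memory values are float64 to match the uint64-to-float64 change listed in the PR description; the units are left unspecified.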

feat: refactor code into resource plugins and custom device plugins

7e1e3b3

chore: add unit tests

e3e69ee

chore: add unit tests

feat: introduce rdma state and allow states to share within gpu sub-p…

2ebc709

…lugins

feat: introduce rdma state and allow states to share within gpu sub-plugins

feat: refactor state to only be in one file

f77d28e

feat: implement rdma custom device plugin and implement logic for acc…

318f98b

…ompany resource allocation

feat: implement rdma custom device plugin and implement logic for accompany resource allocation

luomingmeng and others added 8 commits December 25, 2025 15:01

fix(gpumemory): handle unhealthy devices and correct capacity values

11efc35

Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
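A minimal sketch of the accounting this commit describes, under stated assumptions: the `gpuDevice` type and the `summarize` function are hypothetical and do not appear in the PR. The only behavior taken from the commit message is that an unhealthy device contributes zero allocatable memory while still counting toward capacity.

```go
package main

import "fmt"

// gpuDevice is a hypothetical, simplified device record (not from this PR).
type gpuDevice struct {
	ID        string
	MemoryGiB float64
	Healthy   bool
}

// summarize tracks capacity and allocatable memory separately. Every device
// counts toward capacity; only healthy devices count toward allocatable, so
// an unhealthy device contributes zero allocatable memory.
func summarize(devices []gpuDevice) (allocatable, capacity float64) {
	for _, d := range devices {
		capacity += d.MemoryGiB
		if d.Healthy {
			allocatable += d.MemoryGiB
		}
	}
	return allocatable, capacity
}

func main() {
	devices := []gpuDevice{
		{ID: "gpu-0", MemoryGiB: 80, Healthy: true},
		{ID: "gpu-1", MemoryGiB: 80, Healthy: false},
	}
	allocatable, capacity := summarize(devices)
	fmt.Printf("allocatable=%v capacity=%v\n", allocatable, capacity)
}
```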

refactor(qrm): remove unused state file directory fields

4f8457d

Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.

fix(gpumemory): handle numa topology not ready case gracefully

880bba4

Return nil instead of an error when the NUMA topology is not ready, and log the error.

fix: handle error gracefully

chore: add context to interface methods

9ecafcc

feat(gpu): add ShareGPUManager for device sharing eligibility

3a7254f

Implement ShareGPUManager to determine GPU device sharing eligibility based on pod indicators. Add periodic sync and caching for efficient decision making. Integrate with the base plugin and update device state handling.

build: update katalyst-api dependency version

09b42dc

fix(resources-eviction): add error healthz state update in threshold …

b2a1de2

…checks

Ensure healthz state is updated when errors occur during threshold checks in both resources and zone resources plugins.

luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from d5d22b1 to b2a1de2 Compare December 25, 2025 07:01

JustinChengLZ force-pushed the dev/support-gpu-plugins branch 6 times, most recently from e2365c2 to 21b097a Compare January 2, 2026 06:33

JustinChengLZ force-pushed the dev/support-gpu-plugins branch 9 times, most recently from eb87c70 to 33b37df Compare January 7, 2026 07:10

luomingmeng added 4 commits January 12, 2026 20:14

fix(gpu): ignore reclaimed containers in device share evaluation

4295c6a

Add CheckReclaimed condition to skip reclaimed containers when evaluating device share status. Add test case to verify reclaimed containers are ignored.
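One way to picture the skip is the sketch below. It rests on several assumptions: the `containerInfo` type, the `deviceShareEnabled` function, and the rule that a device is shareable only when every considered container opts in are all hypothetical. Only the idea of a CheckReclaimed-style condition that ignores reclaimed containers comes from the commit message.

```go
package main

import "fmt"

// containerInfo is a hypothetical, simplified view of a container that
// holds a GPU device (not a type from this PR).
type containerInfo struct {
	Name      string
	Reclaimed bool // container runs on reclaimed resources
	ShareGPU  bool // container's pod indicates that GPU sharing is allowed
}

// deviceShareEnabled is a hypothetical evaluation rule: the device is
// treated as shareable only if every considered container allows sharing.
// When checkReclaimed is true, reclaimed containers are skipped and have no
// influence on the result. A device with no considered containers is
// treated as shareable, which is an assumption of this sketch.
func deviceShareEnabled(containers []containerInfo, checkReclaimed bool) bool {
	for _, c := range containers {
		if checkReclaimed && c.Reclaimed {
			continue // ignore reclaimed containers
		}
		if !c.ShareGPU {
			return false
		}
	}
	return true
}

func main() {
	containers := []containerInfo{
		{Name: "online", ShareGPU: true},
		{Name: "batch", Reclaimed: true, ShareGPU: false},
	}
	fmt.Println(deviceShareEnabled(containers, true))  // reclaimed container ignored
	fmt.Println(deviceShareEnabled(containers, false)) // reclaimed container counted
}
```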

feat(eviction): add gpu memory threshold to reclaimed resources eviction

1ac77a8

fix(gpumemory): correct gpu memory allocation calculation

1f727f4

Move aggregation of allocatable and capacity quantities after health check to ensure accurate totals for unhealthy or non-shared devices

feat(evictionmanager): add GPU-specific threshold tolerance duration

b13e4d2

Add constant thresholdMetToleranceDurationForGPU to set a fixed 15-second tolerance duration for GPU resource eviction, replacing the dynamic configuration value.

luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from 83e65e3 to b13e4d2 Compare January 14, 2026 12:08

JustinChengLZ force-pushed the dev/support-gpu-plugins branch from b3d9e26 to b1649fb Compare January 20, 2026 07:58
