forked from kubewharf/katalyst-core
support enable share gpu for reclaimed #12
Open
luomingmeng wants to merge 52 commits into JustinChengLZ:dev/support-gpu-plugins from luomingmeng:dev/support-enable-share-gpu-for-reclaimed
fafce64 to 198035f
9c9800a to c84f8f7
5046660 to 6d4961a
4dfcbf5 to 8176c42
541bb02 to d5d22b1
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
… or numa zone node
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations
…icy structs
Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes.
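The embedding described above follows a common gRPC pattern, sketched here with local stand-in types only. `resourcePluginServer`, `unimplementedResourcePluginServer`, and `staticPolicy` are hypothetical simplifications with made-up string signatures, not the real `pluginapi` definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// resourcePluginServer stands in for a generated gRPC server interface.
// The method signatures are simplified assumptions for illustration.
type resourcePluginServer interface {
	Allocate(req string) (string, error)
	GetAssociatedDeviceTopologyHints(req string) (string, error)
}

var errUnimplemented = errors.New("method not implemented")

// unimplementedResourcePluginServer mimics a generated Unimplemented*Server
// type: it provides a failing default for every interface method.
type unimplementedResourcePluginServer struct{}

func (unimplementedResourcePluginServer) Allocate(string) (string, error) {
	return "", errUnimplemented
}

func (unimplementedResourcePluginServer) GetAssociatedDeviceTopologyHints(string) (string, error) {
	return "", errUnimplemented
}

// staticPolicy embeds the defaults and overrides only what it supports.
// If the interface later gains a method, the embedded type (once regenerated)
// supplies it, so staticPolicy keeps compiling.
type staticPolicy struct {
	unimplementedResourcePluginServer
}

func (staticPolicy) Allocate(req string) (string, error) {
	return "allocated:" + req, nil
}

// Compile-time check that staticPolicy satisfies the interface.
var _ resourcePluginServer = staticPolicy{}

func main() {
	var s resourcePluginServer = staticPolicy{}
	out, _ := s.Allocate("gpu")
	_, err := s.GetAssociatedDeviceTopologyHints("gpu")
	fmt.Println(out, err)
}
```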
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
… allocated memory
Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with the most allocated memory. This helps balance GPU memory usage across NUMA nodes.
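As a rough illustration of the ordering described above, the sketch below ranks NUMA nodes by allocated GPU memory, highest first. The function name `preferredNUMANodes`, the map shape, and the tie-break on node ID are assumptions made for this example, not the PR's actual hint code.

```go
package main

import (
	"fmt"
	"sort"
)

// preferredNUMANodes orders NUMA node IDs so that nodes with the most
// allocated GPU memory come first; ties go to the lower node ID.
// The input shape (NUMA ID -> allocated memory) is an assumption.
func preferredNUMANodes(allocated map[int]float64) []int {
	nodes := make([]int, 0, len(allocated))
	for id := range allocated {
		nodes = append(nodes, id)
	}
	sort.Slice(nodes, func(i, j int) bool {
		a, b := allocated[nodes[i]], allocated[nodes[j]]
		if a != b {
			return a > b
		}
		return nodes[i] < nodes[j]
	})
	return nodes
}

func main() {
	allocated := map[int]float64{0: 8, 1: 24, 2: 8}
	fmt.Println(preferredNUMANodes(allocated)) // prints [1 0 2]
}
```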
chore: add unit tests
feat: introduce rdma state and allow states to share within gpu sub-plugins
feat: implement rdma custom device plugin and implement logic for accompany resource allocation
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
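A minimal sketch of the accounting described above, assuming a hypothetical `gpuDevice` record and `summarize` helper: capacity counts every device, while allocatable counts only healthy ones.

```go
package main

import "fmt"

// gpuDevice is a hypothetical stand-in for the plugin's device record.
type gpuDevice struct {
	ID       string
	MemoryGB float64
	Healthy  bool
}

// summarize returns total allocatable and total capacity memory.
// Capacity includes every device; an unhealthy device contributes
// zero to allocatable.
func summarize(devices []gpuDevice) (allocatable, capacity float64) {
	for _, d := range devices {
		capacity += d.MemoryGB
		if d.Healthy {
			allocatable += d.MemoryGB
		}
	}
	return allocatable, capacity
}

func main() {
	devices := []gpuDevice{
		{ID: "gpu-0", MemoryGB: 16, Healthy: true},
		{ID: "gpu-1", MemoryGB: 16, Healthy: false},
	}
	allocatable, capacity := summarize(devices)
	fmt.Println(allocatable, capacity) // prints 16 32
}
```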
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of an error when the NUMA topology is not ready, and log the error.
fix: handle error gracefully
- implement ShareGPUManager to determine GPU device sharing eligibility based on pod indicators
- add periodic sync and caching for efficient decision making
- integrate with base plugin and update device state handling
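The periodic sync and caching idea can be sketched as follows. This is not the PR's ShareGPUManager: `shareGPUCache`, `shareDecider`, and the per-device boolean decision are hypothetical names and simplifications chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shareDecider reports whether a device may be shared (hypothetical hook;
// the real decision is said to be based on pod indicators).
type shareDecider func(deviceID string) bool

// shareGPUCache caches sharing decisions so lookups never recompute them.
type shareGPUCache struct {
	mu      sync.RWMutex
	decide  shareDecider
	devices []string
	shared  map[string]bool
}

func newShareGPUCache(devices []string, decide shareDecider) *shareGPUCache {
	return &shareGPUCache{decide: decide, devices: devices, shared: map[string]bool{}}
}

// Sync recomputes the decision for every known device and swaps the cache.
func (c *shareGPUCache) Sync() {
	next := make(map[string]bool, len(c.devices))
	for _, id := range c.devices {
		next[id] = c.decide(id)
	}
	c.mu.Lock()
	c.shared = next
	c.mu.Unlock()
}

// Run calls Sync once per interval until stop is closed.
func (c *shareGPUCache) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.Sync()
		case <-stop:
			return
		}
	}
}

// Shareable answers from the cache only; unknown devices are not shareable.
func (c *shareGPUCache) Shareable(id string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.shared[id]
}

func main() {
	c := newShareGPUCache([]string{"gpu-0", "gpu-1"}, func(id string) bool {
		return id == "gpu-0"
	})
	c.Sync()
	fmt.Println(c.Shareable("gpu-0"), c.Shareable("gpu-1")) // prints true false
}
```

In a long-running process, `Run` would typically be started in its own goroutine with a stop channel that is closed on shutdown.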
Introduce a new eviction plugin for GPU resources to handle eviction based on GPU topology and allocations. The plugin reuses the generic ZoneResourcesPlugin with zoneType=GPU to preserve behavior while adding specific GPU resource handling capabilities. Includes unit tests to verify functionality.
…checks
Ensure healthz state is updated when errors occur during threshold checks in both resources and zone resources plugins.
d5d22b1 to b2a1de2
e2365c2 to 21b097a
eb87c70 to 33b37df
Add CheckReclaimed condition to skip reclaimed containers when evaluating device share status. Add a test case to verify that reclaimed containers are ignored.
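One way to picture the skip logic, under loudly stated assumptions: the `containerAlloc` type, the `AllowsSharing` field, and the rule "a device is shareable when every considered container allows sharing" are all hypothetical. Only the idea of skipping reclaimed containers comes from the commit message.

```go
package main

import "fmt"

// containerAlloc is a hypothetical record of a container using a device.
type containerAlloc struct {
	Name          string
	Reclaimed     bool
	AllowsSharing bool
}

// deviceShareable evaluates share status over the containers on a device.
// When checkReclaimed is true, reclaimed containers are skipped and do not
// influence the result. The "all must allow sharing" rule is an assumption.
func deviceShareable(containers []containerAlloc, checkReclaimed bool) bool {
	for _, c := range containers {
		if checkReclaimed && c.Reclaimed {
			continue
		}
		if !c.AllowsSharing {
			return false
		}
	}
	return true
}

func main() {
	containers := []containerAlloc{
		{Name: "reclaimed-job", Reclaimed: true, AllowsSharing: false},
		{Name: "online-svc", Reclaimed: false, AllowsSharing: true},
	}
	fmt.Println(deviceShareable(containers, true))  // prints true
	fmt.Println(deviceShareable(containers, false)) // prints false
}
```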
Move aggregation of allocatable and capacity quantities after health check to ensure accurate totals for unhealthy or non-shared devices
Add constant thresholdMetToleranceDurationForGPU to set a fixed 15-second tolerance duration for GPU resource eviction, replacing the dynamic configuration value.
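A fixed tolerance like the one described above might look like the sketch below. The constant name and the 15-second value come from the commit message; typing it as `time.Duration` and the helper `toleranceExceeded` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// thresholdMetToleranceDurationForGPU is the fixed tolerance named in the
// commit message. Its time.Duration type here is an assumption.
const thresholdMetToleranceDurationForGPU = 15 * time.Second

// toleranceExceeded is a hypothetical helper: it reports whether a threshold
// has been met continuously for at least the fixed tolerance.
func toleranceExceeded(metFor time.Duration) bool {
	return metFor >= thresholdMetToleranceDurationForGPU
}

func main() {
	fmt.Println(toleranceExceeded(10*time.Second), toleranceExceeded(20*time.Second)) // prints false true
}
```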
83e65e3 to b13e4d2
b3d9e26 to b1649fb
What type of PR is this?
Features
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer: