
[cuda, rocm, ci, tests] fix: resolve CodeRabbit review findings for controller safety and test reliability #63

Merged
Wangmerlyn merged 3 commits into main from fix/coderabbit-full-repo on Feb 15, 2026

Conversation

Wangmerlyn (Owner) commented Feb 15, 2026

Summary by CodeRabbit

  • New Features

    • Added allocation-status reporting and configurable allocation retry behavior for GPU memory management.
    • Introduced a ReLU-based background workload option for GPU keep-alive behavior.
  • Bug Fixes

    • Improved VRAM handling consistency across multi-GPU controllers and more robust allocation/error recovery.
  • Documentation

    • Clarified logging/messages and added docstrings for size parsing and benchmark reporting.
  • Tests

    • Updated and added CUDA-focused tests and adjusted test fixtures for safer torch imports.
  • Chores

    • CI/workflow updates (Python/runtime and pre-commit workflow adjustments).

coderabbitai bot commented Feb 15, 2026

📝 Walkthrough

Updates CI workflows (Python 3.13, pre-commit, pip install ordering), switches CUDA controller background workload from matmul to in-place ReLU with relu_iterations, adds allocation-retry and status tracking to ROCm controller, tightens VRAM parsing and logging, and adjusts tests for new APIs and safer GPU checks.
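
To make the workload change concrete, here is a minimal sketch of an in-place-ReLU keep-alive loop of the kind described above; the function name, defaults, and logging are illustrative assumptions, not the controller's actual code:

import threading
import time

import torch


def keep_gpu_busy(num_elements, relu_iterations=5000, interval=1.0,
                  stop_evt=None, device="cuda:0"):
    """Hold num_elements float32 values on the device and burn cycles with in-place ReLU."""
    stop_evt = stop_evt or threading.Event()
    # A single up-front allocation keeps the VRAM reserved for the life of the loop.
    buf = torch.rand(num_elements, device=device, dtype=torch.float32, requires_grad=False)
    with torch.no_grad():
        while not stop_evt.is_set():
            tic = time.perf_counter()
            for _ in range(relu_iterations):
                torch.relu_(buf)  # elementwise, in-place: no extra VRAM is allocated
            torch.cuda.synchronize(device)
            toc = time.perf_counter()
            # Guard the divisor so relu_iterations == 0 cannot raise ZeroDivisionError.
            print(f"ReLU batch done: {(toc - tic) * 1000 / max(1, relu_iterations):.3f} ms/iter")
            time.sleep(interval)

Running keep_gpu_busy in a daemon thread and setting stop_evt from a release() call mirrors the keep/release lifecycle the tests below exercise.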

Changes

Cohort / File(s) Summary
GitHub Actions Workflows
.github/workflows/docs.yml, .github/workflows/pre-commit.yaml, .github/workflows/python-app.yml
Bumped Python runtime to 3.13, removed --hook-stage manual from pre-commit invocation, and moved editable install to the end with conditional requirements.txt install.
Global controller wiring & base docs
src/keep_gpu/global_gpu_controller/global_gpu_controller.py, src/keep_gpu/single_gpu_controller/base_gpu_controller.py
Pass the parsed/normalized vram_to_keep value through to per-GPU controllers; docstrings updated to clarify that "keep" means keeping VRAM occupied.
CUDA GPU controller
src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py
Replaced matmul workload with elementwise in-place ReLU workload; added relu_iterations (default 5000) and optional matmul_iterations alias; validate/derive _num_elements from vram_to_keep; updated loop, timing, and logging to use ReLU path.
ROCm GPU controller
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py
Added max_allocation_retries, captured allocation failures in self._failure_exc, bounded allocation retry loop, graceful exit on repeated failures, and new public allocation_status() to inspect stored failure.
Utilities & logging
src/keep_gpu/utilities/humanized_input.py, src/keep_gpu/utilities/logger.py
Added parse_size docstring and refined unit normalization rules; split colored/plain log formats; ensure file handler creates parent directories.
Benchmark instrumentation
src/keep_gpu/benchmark.py
Use dynamic iterations (args.matmul_iterations) in time-per-iteration logging instead of hard-coded divisor.
Tests & test config
tests/conftest.py, tests/cuda_controller/*, tests/global_controller/global_keep_test.py
Made torch import lazy/robust in fixtures; updated tests to use the relu_iterations parameter, added CUDA preflight VRAM checks, improved thread-lifecycle assertions (joins/timeouts), added a new context-manager test, and removed script main-guards (see the sketch just below this table).
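
As a rough illustration of the lazy-import and preflight-check ideas in the tests row above (the fixture and helper names here are assumptions, not the repository's conftest):

import pytest


@pytest.fixture
def cuda_device():
    """Skip CUDA tests cleanly when torch or a GPU is unavailable."""
    torch = pytest.importorskip("torch")  # lazy import: collection does not require torch
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    return torch.device("cuda:0")


def require_free_vram(min_bytes):
    """Preflight check: skip a large-allocation test if the GPU lacks free VRAM."""
    import torch

    free_bytes, _total = torch.cuda.mem_get_info()  # (free, total) for the current device
    if free_bytes < min_bytes:
        pytest.skip(f"needs {min_bytes} bytes free, only {free_bytes} available")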

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 I nudge the threads to hum and keep,
ReLU hops in loops, not matmul deep,
Retries count softly when allocations fail,
Logs glow bright and tests set sail,
A little rabbit cheers — memory snug as a heap.

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 40.74%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Merge Conflict Detection ✅ Passed: no merge conflicts detected when merging into main.
  • Title Check ✅ Passed: the title accurately reflects the main changes across the PR: improvements to CUDA/ROCm controller safety, CI workflow updates, and test reliability fixes based on CodeRabbit review findings.



gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @Wangmerlyn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several findings from a code analysis tool, focusing on enhancing the stability, clarity, and maintainability of the GPU memory management system. The changes primarily involve refining the logic for keeping GPU VRAM occupied, improving error handling, standardizing logging practices, and making the test suite more robust and accurate.

Highlights

  • Improved GPU Memory Management: Refined how VRAM is kept busy across CUDA and ROCm controllers, including clearer documentation, robust allocation retry mechanisms, and explicit error capturing for ROCm devices.
  • Enhanced Logging and Utility Functions: Standardized logging formats, improved directory creation for log files, and made the parse_size utility more flexible in handling memory unit inputs.
  • Test Suite Refinements: Updated various test cases to be more reliable, including better thread synchronization, VRAM availability checks for large allocations, and more precise assertions for memory usage.
  • Parameter Consistency and Validation: Ensured vram_to_keep is correctly passed and validated in GPU controllers, and adjusted default interval values for CUDA workloads.


Changelog
  • src/keep_gpu/benchmark.py
    • Updated logging format for matmul benchmark time span.
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
    • Corrected vram_to_keep parameter to use the instance's attribute during controller initialization.
  • src/keep_gpu/single_gpu_controller/base_gpu_controller.py
    • Clarified docstrings for keep and _keep methods to indicate VRAM is kept busy/occupied.
  • src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py
    • Adjusted the default interval for workload batches from 0.5 to 1.0 seconds.
    • Added validation for vram_to_keep to ensure it's a positive integer before tensor allocation.
    • Updated the docstring for _run_mat_batch to specify 'in-place ReLU ops' instead of 'dummy matmuls'.
  • src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py
    • Introduced max_allocation_retries and _failure_exc for more robust tensor allocation error handling.
    • Refactored the release method to ensure proper thread termination and logging.
    • Implemented a retry mechanism for tensor allocation within _keep_loop.
    • Added an allocation_status method to retrieve any captured allocation failure exceptions.
  • src/keep_gpu/utilities/humanized_input.py
    • Improved the parse_size function to handle various casing for memory units consistently (a rough sketch of this kind of unit normalization appears after this changelog).
  • src/keep_gpu/utilities/logger.py
    • Separated colored and plain format strings for console logging to ensure correct application.
    • Modified _build_file_handler to create parent directories if they do not exist.
  • tests/conftest.py
    • Removed top-level torch import and added conditional import within rocm_available fixture for better isolation.
  • tests/cuda_controller/context_manager.py
    • Updated test parameters, added assertions for GPU memory reservation, and ensured thread cleanup after context exit.
  • tests/cuda_controller/test_2_32pow_elements.py
    • Added a check for sufficient free VRAM before running the large allocation test and removed the if __name__ == '__main__': block.
  • tests/cuda_controller/test_keep_and_release.py
    • Added explicit _thread.join(timeout=2) calls after release to ensure worker threads are properly terminated in tests.
  • tests/global_controller/global_keep_test.py
    • Removed the if __name__ == '__main__': block.
Ignored Files
  • Ignored by pattern: .github/workflows/** (3)
    • .github/workflows/docs.yml
    • .github/workflows/pre-commit.yaml
    • .github/workflows/python-app.yml
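
For reference, a minimal sketch of the case-insensitive size parsing mentioned for parse_size above; the accepted suffixes, regex, and return type are assumptions about the general technique, not the utility's actual behavior:

import re

_UNIT_FACTORS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size_sketch(text):
    """Parse strings like '512MB', '2 GiB', or '4g' into a byte count (illustrative only)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-zA-Z]*)\s*", text)
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = float(match.group(1)), match.group(2).lower()
    unit = unit.replace("ib", "b") or "b"  # normalize 'GiB' -> 'gb'; bare numbers mean bytes
    if unit not in _UNIT_FACTORS:
        unit = unit[0] + "b" if unit[0] in "kmgt" else unit  # accept single-letter suffixes like 'g'
    if unit not in _UNIT_FACTORS:
        raise ValueError(f"unknown unit in {text!r}")
    return int(value * _UNIT_FACTORS[unit])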
Activity
  • Wangmerlyn created this pull request to address findings from CodeRabbit, focusing on code quality and robustness.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot left a comment


Code Review

This pull request introduces a series of fixes and improvements across the codebase, addressing findings likely from a static analysis tool. The changes include correcting docstrings, improving error handling with retries in rocm_gpu_controller, enhancing logger robustness, and making tests more reliable and meaningful. The use of deferred formatting for logging and adding parents=True to mkdir are also good improvements.

I've added a couple of review comments with suggestions for further improvements:

  • In CudaGPUController, there's a naming inconsistency where a method and parameters refer to matmul but the operation is ReLU.
  • In one of the tests, there's a redundant thread join that should be removed to avoid confusion.

@@ -166,7 +169,7 @@ def _keep_loop(self) -> None:
# ------------------------------------------------------------------
@torch.no_grad()
def _run_mat_batch(self, matrix: torch.Tensor) -> None:

Severity: medium

The method name _run_mat_batch is misleading as the implementation uses torch.relu_ instead of a matrix multiplication. The updated docstring correctly identifies the operation as ReLU, which makes the method name even more confusing.

To improve clarity and maintainability, consider renaming this method to something like _run_relu_batch.

Additionally, the parameter matmul_iterations and the debug log message "mat ops batch done" are also inconsistent with the ReLU operation and should be updated accordingly throughout the class.


assert ctrl._thread is not None
ctrl.release()
ctrl._thread.join(timeout=2)

Severity: medium

This call to _thread.join(timeout=2) is redundant. The ctrl.release() method, called on the previous line, already blocks and waits for the thread to terminate by calling _thread.join() internally. Adding another join here is unnecessary and can be removed to simplify the test. This redundant pattern is repeated multiple times in this test file.

If there's a concern about the thread hanging, the timeout logic should be part of the release() method itself.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9491f5fc99


Comment on lines 130 to 131
if num_elements <= 0:
raise ValueError("vram_to_keep must be positive")


P2: Catch invalid VRAM sizes before the worker thread starts

The new num_elements <= 0 guard raises ValueError inside _keep_loop, but this loop only catches RuntimeError, so keep() can report success and then the background thread immediately dies when vram_to_keep resolves to 0 (for example --vram 0 from CLI/API). This leaves the controller inactive without a synchronous failure path; validation should happen before starting the thread or ValueError should be handled and surfaced.
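
For illustration, one way to get the synchronous failure path this comment asks for is to validate before spawning the worker. This is a hedged sketch with assumed attribute and method names, not the controller's code:

import threading


class KeepSketch:
    """Illustrative only: fail fast on bad sizes before any worker thread exists."""

    def __init__(self, vram_to_keep):
        self.vram_to_keep = vram_to_keep
        self._stop_evt = threading.Event()
        self._thread = None

    def keep(self):
        # Validate in the caller's thread so a bad --vram value raises synchronously.
        num_elements = int(self.vram_to_keep)
        if num_elements <= 0:
            raise ValueError("vram_to_keep must be positive")
        self._thread = threading.Thread(target=self._keep_loop, daemon=True)
        self._thread.start()

    def _keep_loop(self):
        while not self._stop_evt.is_set():
            # Allocation and the ReLU workload would go here; with sizing validated
            # up front, only transient RuntimeErrors need handling in this loop.
            self._stop_evt.wait(1.0)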


tensor = None
while not self._stop_evt.is_set():
attempts = 0
while not self._stop_evt.is_set() and attempts < self.max_allocation_retries:


P2: Keep retrying ROCm allocation instead of exiting early

The startup allocation loop now stops after max_allocation_retries attempts and returns, which means the keeper thread permanently exits if VRAM is temporarily unavailable at startup; if memory frees up moments later, this controller never resumes unless the caller manually restarts it. This is a functional regression from the previous indefinite retry behavior and can cause keep-alive to silently stop in transiently busy environments.
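
If indefinite retry is the preferred behavior, one hedged alternative is to retry with a capped backoff while still honoring the stop event. A sketch under those assumptions, not the controller's actual loop:

import torch


def allocate_with_backoff(num_elements, device, stop_evt, base_delay=1.0, max_delay=60.0):
    """Retry the allocation indefinitely, backing off between attempts, until stopped."""
    delay = base_delay
    while not stop_evt.is_set():
        try:
            return torch.rand(num_elements, device=device, dtype=torch.float32)
        except RuntimeError as exc:  # e.g. transient OOM while another job winds down
            print(f"allocation failed ({exc}); retrying in {delay:.0f}s")
            stop_evt.wait(delay)  # wakes early if release() sets the stop event
            delay = min(delay * 2, max_delay)
    return None  # stop requested before an allocation succeeded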


coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (2)

170-186: ⚠️ Potential issue | 🟡 Minor

torch.relu_ on [0, 1) data is a computational no-op; potential ZeroDivisionError on line 185.

torch.rand produces values in [0, 1), so relu_(matrix) (which clamps negatives to 0) leaves every element unchanged. The GPU still processes kernel launches, but there's essentially zero ALU work — a scheduler could optimize this away. If the goal is to generate real utilization, consider torch.neg_(matrix) alternated with torch.relu_ or retain the original matmul approach.

Separately, if self.matmul_iterations is ever 0, line 185 divides by zero. The ROCm controller guards this with max(1, self.iterations).

Suggested guard for division
-            (toc - tic) * 1000 / self.matmul_iterations,
+            (toc - tic) * 1000 / max(1, self.matmul_iterations),
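
A sketch of the neg_/relu_ alternation the comment floats (the function name and the synchronize call are illustrative); note that after the first pair the values settle at zero, so this mainly forces memory traffic rather than sustained ALU work:

import torch


@torch.no_grad()
def run_relu_batch_sketch(buf, iterations):
    """Alternate in-place neg_ and relu_ so every pass reads and writes each element."""
    for _ in range(iterations):
        buf.neg_()        # flip signs so the following relu_ is not a pure no-op
        torch.relu_(buf)  # clamps the now-negative values back toward zero
    if buf.is_cuda:
        torch.cuda.synchronize(buf.device)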

127-144: ⚠️ Potential issue | 🟠 Major

Allocation retry loop differs from ROCm controller in three important ways.

  1. ValueError escapes the loop – Line 139 catches only RuntimeError, but the validation on lines 130-131 raises ValueError. If vram_to_keep resolves to a non-positive integer, the ValueError kills the background thread silently. The ROCm controller catches (RuntimeError, ValueError).

  2. Unbounded retries – This loop retries forever (while not self._stop_evt.is_set()) on allocation RuntimeError. The ROCm controller bounds retries with max_allocation_retries. A persistent OOM condition here means an infinite retry loop.

  3. raise in a daemon thread is silently lost – Line 144 raises RuntimeError inside the background thread. Since it's a daemon thread, the exception is swallowed and the caller has no way to detect the failure. The ROCm controller stores it in self._failure_exc and exposes allocation_status().

Consider aligning with the ROCm controller's pattern for consistency.

Suggested diff
+        attempts = 0
-        while not self._stop_evt.is_set():
+        while not self._stop_evt.is_set() and attempts < self.max_allocation_retries:
             try:
                 num_elements = int(self.vram_to_keep)
                 if num_elements <= 0:
                     raise ValueError("vram_to_keep must be positive")
                 matrix = torch.rand(
                     num_elements,
                     device=self.device,
                     dtype=torch.float32,
                     requires_grad=False,
                 )
                 break
-            except RuntimeError as e:
-                logger.error("rank %s: failed to allocate matrix: %s", self.rank, e)
+            except (RuntimeError, ValueError) as e:
+                attempts += 1
+                logger.error("rank %s: failed to allocate matrix (attempt %d/%d): %s", self.rank, attempts, self.max_allocation_retries, e)
                 time.sleep(self.interval)
         if matrix is None:
-            logger.error("rank %s: failed to allocate matrix, exiting loop", self.rank)
-            raise RuntimeError("Failed to allocate matrix for GPU keeping")
+            self._failure_exc = RuntimeError(
+                f"rank {self.rank}: failed to allocate matrix after {attempts} attempts"
+            )
+            logger.error("%s", self._failure_exc)
+            return
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)

68-80: ⚠️ Potential issue | 🟡 Minor

release() logs "stopped & cache cleared" even when the thread was never running.

Line 80's logger.info("rank %s: keep thread stopped & cache cleared") executes unconditionally after the if/else. When the thread isn't alive (else branch), the user sees both the "not running" warning and the "stopped & cache cleared" info — which is contradictory.

Suggested fix
     def release(self) -> None:
         if self._thread and self._thread.is_alive():
             self._stop_evt.set()
             self._thread.join()
             torch.cuda.empty_cache()
+            logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
         else:
             logger.warning("rank %s: keep thread not running", self.rank)
         if self._rocm_smi:
             try:
                 self._rocm_smi.rsmi_shut_down()
             except Exception as exc:  # pragma: no cover - best effort
                 logger.debug("rsmi_shut_down failed: %s", exc)
-        logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
🤖 Fix all issues with AI agents
In `@tests/cuda_controller/context_manager.py`:
- Line 12: The test file tests/cuda_controller/context_manager.py containing the
test function test_cuda_controller_context_manager will not be collected by
pytest because its filename doesn't match the default discovery patterns; fix by
either renaming the file to a pytest-recognized name like
test_context_manager.py (so the existing function
test_cuda_controller_context_manager is discovered) or add a python_files
pattern to test configuration (pyproject.toml or conftest.py) to include
context_manager.py.
🧹 Nitpick comments (2)
tests/cuda_controller/test_keep_and_release.py (1)

22-25: Redundant join() after release() – masks real bugs in production code.

CudaGPUController.release() already calls self._thread.join() (unbounded) internally. The extra ctrl._thread.join(timeout=2) on line 24 re-joins an already-joined thread, which is a no-op. More importantly, if release() ever fails to join (a real bug), this test-side join would paper over it instead of exposing the defect.

Similarly, the assert ctrl._thread is not None on line 22 is trivially true given line 20 already asserted ctrl._thread and ctrl._thread.is_alive().

Consider removing the redundant join and None check here and at lines 30-32, keeping only the post-release() liveness assertion which is the meaningful check.

Suggested diff
     assert ctrl._thread and ctrl._thread.is_alive()

-    assert ctrl._thread is not None
     ctrl.release()
-    ctrl._thread.join(timeout=2)
     assert not (ctrl._thread and ctrl._thread.is_alive())
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)

148-150: allocation_status() exposes thread-internal failure — consider documenting thread-safety semantics.

Since _failure_exc is written by the background thread and read from the main thread, a brief note that the reference assignment is atomic in CPython (GIL) but not guaranteed in other Python implementations would help future maintainers. This is not a bug today but a minor documentation opportunity.
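
By way of illustration, the semantics could also be made explicit with a small lock around the shared reference; the attribute names below are assumptions, not the ROCm controller's code:

import threading
from typing import Optional


class AllocationStatusSketch:
    """Let the main thread inspect the worker thread's last allocation failure."""

    def __init__(self):
        self._lock = threading.Lock()  # explicit guard; portable beyond CPython's GIL
        self._failure_exc: Optional[BaseException] = None

    def _record_failure(self, exc):
        with self._lock:  # called from the background keep thread
            self._failure_exc = exc

    def allocation_status(self):
        """Return the stored allocation failure, or None if none has occurred."""
        with self._lock:  # called from the caller's thread
            return self._failure_exc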

@@ -10,15 +11,20 @@
)
def test_cuda_controller_context_manager():

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e

echo "=== Checking pytest configuration for python_files setting ==="
for config_file in pyproject.toml setup.cfg pytest.ini tox.ini; do
  if [ -f "$config_file" ]; then
    echo "--- Found $config_file ---"
    rg -n "python_files" "$config_file" 2>/dev/null || echo "No python_files setting in $config_file"
  fi
done

echo -e "\n=== Checking conftest.py for custom collection ==="
find . -name "conftest.py" -type f | while read conftest; do
  echo "--- Checking $conftest ---"
  rg -n "collect_ignore\|pytest_collect_file\|python_files" "$conftest" 2>/dev/null || echo "No custom collection config in $conftest"
done

echo -e "\n=== Checking if context_manager.py file exists and its git status ==="
if [ -f "tests/cuda_controller/context_manager.py" ]; then
  echo "File exists at tests/cuda_controller/context_manager.py"
  git log --oneline --all -- "tests/cuda_controller/context_manager.py" 2>/dev/null | head -5 || echo "File not in git history yet (new file)"
else
  echo "File does not exist at tests/cuda_controller/context_manager.py"
fi

echo -e "\n=== Listing test files in tests/cuda_controller directory ==="
ls -la tests/cuda_controller/ | grep -E "\.py$" || echo "No Python files found"

Repository: Wangmerlyn/KeepGPU

Length of output: 958


Test file context_manager.py will not be discovered by pytest.

Pytest uses default collection patterns test_*.py or *_test.py. There is no python_files configuration in pyproject.toml or custom collection logic in conftest.py. Rename the file to test_context_manager.py or add a python_files setting to enable discovery of non-standard test filenames.

🤖 Prompt for AI Agents
In `@tests/cuda_controller/context_manager.py` at line 12, The test file
tests/cuda_controller/context_manager.py containing the test function
test_cuda_controller_context_manager will not be collected by pytest because its
filename doesn't match the default discovery patterns; fix by either renaming
the file to a pytest-recognized name like test_context_manager.py (so the
existing function test_cuda_controller_context_manager is discovered) or add a
python_files pattern to test configuration (pyproject.toml or conftest.py) to
include context_manager.py.

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)

68-80: ⚠️ Potential issue | 🟡 Minor

"stopped & cache cleared" logged even when the thread was not running.

When release() takes the else branch (line 73-74), execution falls through to line 80, which logs the misleading message. Move the success log inside the if block or make it conditional.

Proposed fix
     def release(self) -> None:
         if self._thread and self._thread.is_alive():
             self._stop_evt.set()
             self._thread.join()
             torch.cuda.empty_cache()
+            logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
         else:
             logger.warning("rank %s: keep thread not running", self.rank)
         if self._rocm_smi:
             try:
                 self._rocm_smi.rsmi_shut_down()
             except Exception as exc:  # pragma: no cover - best effort
                 logger.debug("rsmi_shut_down failed: %s", exc)
-        logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
tests/cuda_controller/test_throttle.py (1)

37-38: ⚠️ Potential issue | 🟡 Minor

Stale comment: still references "matmul" after the ReLU migration.

Line 37 says "no matmul batches should run" but the workload is now ReLU-based.

Proposed fix
-    # When utilization is always high, no matmul batches should run
+    # When utilization is always high, no ReLU batches should run
🤖 Fix all issues with AI agents
In `@src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py`:
- Around line 196-200: The debug log divides by self.relu_iterations and can
raise ZeroDivisionError when relu_iterations is 0; update the calculation used
in logger.debug (the expression using (toc - tic) * 1000 / self.relu_iterations)
to guard against zero—e.g., use max(1, self.relu_iterations) (similar to the
ROCm controller's pattern) or ensure relu_iterations is validated in __init__ to
be >0; apply the same guard for matmul_iterations where applicable so the
division never uses zero.

In `@src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py`:
- Around line 105-137: The ValueError raised for invalid vram_to_keep should not
be treated as transient: in the allocation loop inside rocm_gpu_controller (the
block that creates tensor with torch.rand using self.vram_to_keep and the
surrounding retry logic referencing self.max_allocation_retries, attempts, and
self._failure_exc), stop catching ValueError so it propagates immediately;
change the except clause to only catch RuntimeError (allocation failures), keep
the existing logging and retry/failure behavior for RuntimeError, and ensure
configuration errors (ValueError) are not swallowed or retried.
🧹 Nitpick comments (1)
src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (1)

134-158: Allocation-failure path raises in a daemon thread on clean shutdown.

When release() is called while the allocation-retry loop (lines 144-151) is still spinning, _stop_evt gets set, the while exits with matrix is None, and line 158 raises RuntimeError inside the daemon thread. The exception is silently swallowed (daemon thread), but it generates a noisy traceback on stderr.

Consider returning early (or logging) when _stop_evt is set, similar to the ROCm controller's graceful-exit pattern.

Proposed fix
         if matrix is None:
-            logger.error("rank %s: failed to allocate matrix, exiting loop", self.rank)
-            raise RuntimeError("Failed to allocate matrix for GPU keeping")
+            if self._stop_evt.is_set():
+                logger.info("rank %s: allocation cancelled (stop requested)", self.rank)
+                return
+            logger.error("rank %s: failed to allocate matrix, exiting loop", self.rank)
+            raise RuntimeError("Failed to allocate matrix for GPU keeping")

Wangmerlyn changed the title from "fix(multi): address full-repo coderabbit findings" to "[cuda, rocm, ci, tests] fix: resolve CodeRabbit review findings for controller safety and test reliability" on Feb 15, 2026
Wangmerlyn merged commit 182ecf0 into main on Feb 15, 2026
5 checks passed
