[cuda, rocm, ci, tests] fix: resolve CodeRabbit review findings for controller safety and test reliability #63
Conversation
📝 Walkthrough
Updates CI workflows (Python 3.13, pre-commit, pip install ordering), switches the CUDA controller background workload from matmul to in-place ReLU with relu_iterations, adds allocation-retry and status tracking to the ROCm controller, tightens VRAM parsing and logging, and adjusts tests for new APIs and safer GPU checks.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Summary of Changes
Hello @Wangmerlyn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses several findings from a code analysis tool, focusing on enhancing the stability, clarity, and maintainability of the GPU memory management system. The changes primarily involve refining the logic for keeping GPU VRAM occupied, improving error handling, standardizing logging practices, and making the test suite more robust and accurate.
Code Review
This pull request introduces a series of fixes and improvements across the codebase, addressing findings likely from a static analysis tool. The changes include correcting docstrings, improving error handling with retries in rocm_gpu_controller, enhancing logger robustness, and making tests more reliable and meaningful. The use of deferred formatting for logging and adding parents=True to mkdir are also good improvements.
I've added a couple of review comments with suggestions for further improvements:
- In `CudaGPUController`, there's a naming inconsistency where a method and parameters refer to `matmul` but the operation is ReLU.
- In one of the tests, there's a redundant thread join that should be removed to avoid confusion.
```diff
@@ -166,7 +169,7 @@ def _keep_loop(self) -> None:
     # ------------------------------------------------------------------
     @torch.no_grad()
     def _run_mat_batch(self, matrix: torch.Tensor) -> None:
```
The method name _run_mat_batch is misleading as the implementation uses torch.relu_ instead of a matrix multiplication. The updated docstring correctly identifies the operation as ReLU, which makes the method name even more confusing.
To improve clarity and maintainability, consider renaming this method to something like _run_relu_batch.
Additionally, the parameter matmul_iterations and the debug log message "mat ops batch done" are also inconsistent with the ReLU operation and should be updated accordingly throughout the class.
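A minimal sketch of what the rename could look like (the method name `_run_relu_batch`, attribute `relu_iterations`, and the module-level `time`, `torch`, and `logger` are assumed to match the surrounding class; this is illustrative, not the project's current code):

```python
@torch.no_grad()
def _run_relu_batch(self, matrix: torch.Tensor) -> None:
    """Run a batch of in-place ReLU ops to keep the GPU busy."""
    tic = time.time()
    for _ in range(self.relu_iterations):
        torch.relu_(matrix)  # in-place ReLU on the keep-alive tensor
    torch.cuda.synchronize(self.device)
    toc = time.time()
    logger.debug(
        "rank %s: relu ops batch done, %.2f ms/iter",
        self.rank,
        (toc - tic) * 1000 / max(1, self.relu_iterations),
    )
```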
```python
assert ctrl._thread is not None
ctrl.release()
ctrl._thread.join(timeout=2)
```
This call to _thread.join(timeout=2) is redundant. The ctrl.release() method, called on the previous line, already blocks and waits for the thread to terminate by calling _thread.join() internally. Adding another join here is unnecessary and can be removed to simplify the test. This redundant pattern is repeated multiple times in this test file.
If there's a concern about the thread hanging, the timeout logic should be part of the release() method itself.
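If a bounded wait is still desired, one option (a sketch only; the `join_timeout` parameter is not part of the existing API, and the attributes and `logger` are assumed from the controller) is to put the timeout inside `release()`:

```python
def release(self, join_timeout: float = 5.0) -> None:
    """Stop the keeper thread, waiting at most `join_timeout` seconds for it to exit."""
    if self._thread and self._thread.is_alive():
        self._stop_evt.set()
        self._thread.join(timeout=join_timeout)
        if self._thread.is_alive():
            # The thread is stuck; surface that instead of hiding it in the test.
            logger.warning("rank %s: keep thread did not stop within %.1fs", self.rank, join_timeout)
        torch.cuda.empty_cache()
        logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
    else:
        logger.warning("rank %s: keep thread not running", self.rank)
```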
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9491f5fc99
```python
if num_elements <= 0:
    raise ValueError("vram_to_keep must be positive")
```
Catch invalid VRAM sizes before worker thread starts
The new num_elements <= 0 guard raises ValueError inside _keep_loop, but this loop only catches RuntimeError, so keep() can report success and then the background thread immediately dies when vram_to_keep resolves to 0 (for example --vram 0 from CLI/API). This leaves the controller inactive without a synchronous failure path; validation should happen before starting the thread or ValueError should be handled and surfaced.
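A sketch of the synchronous-validation option (assuming `keep()` is where the worker thread is started and that `threading`, `logger`, and the controller attributes shown are available; names are illustrative):

```python
def keep(self) -> None:
    """Validate the request, then start the background keeper thread."""
    num_elements = int(self.vram_to_keep)
    if num_elements <= 0:
        # Fail fast in the caller's thread instead of inside the daemon worker.
        raise ValueError(f"vram_to_keep must be positive, got {num_elements}")
    self._stop_evt.clear()
    self._thread = threading.Thread(target=self._keep_loop, daemon=True)
    self._thread.start()
    logger.info("rank %s: keep thread started", self.rank)
```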
```diff
         tensor = None
-        while not self._stop_evt.is_set():
+        attempts = 0
+        while not self._stop_evt.is_set() and attempts < self.max_allocation_retries:
```
Keep retrying ROCm allocation instead of exiting early
The startup allocation loop now stops after max_allocation_retries attempts and returns, which means the keeper thread permanently exits if VRAM is temporarily unavailable at startup; if memory frees up moments later, this controller never resumes unless the caller manually restarts it. This is a functional regression from the previous indefinite retry behavior and can cause keep-alive to silently stop in transiently busy environments.
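One way to reconcile the bounded-attempt accounting with the old keep-alive behaviour (a sketch; the backoff cap and the `_failure_exc` bookkeeping follow the review excerpts, not the actual file, and `time`, `torch`, and `logger` are assumed module imports):

```python
def _allocate_with_retry(self) -> torch.Tensor | None:
    """Retry allocation until it succeeds or release() requests a stop."""
    attempts = 0
    while not self._stop_evt.is_set():
        try:
            return torch.rand(
                int(self.vram_to_keep),
                device=self.device,
                dtype=torch.float32,
                requires_grad=False,
            )
        except RuntimeError as exc:
            attempts += 1
            self._failure_exc = exc  # expose the latest failure via allocation_status()
            logger.warning("rank %s: allocation attempt %d failed: %s", self.rank, attempts, exc)
            # Keep trying with a capped backoff so the keeper resumes once VRAM frees up.
            time.sleep(min(self.interval * attempts, 30.0))
    return None  # stop was requested before an allocation succeeded
```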
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (2)
170-186: ⚠️ Potential issue | 🟡 Minor
`torch.relu_` on `[0, 1)` data is a computational no-op; potential `ZeroDivisionError` on line 185.
`torch.rand` produces values in `[0, 1)`, so `relu_(matrix)` (which clamps negatives to 0) leaves every element unchanged. The GPU still processes kernel launches, but there's essentially zero ALU work — a scheduler could optimize this away. If the goal is to generate real utilization, consider `torch.neg_(matrix)` alternated with `torch.relu_`, or retain the original matmul approach.
Separately, if `self.matmul_iterations` is ever 0, line 185 divides by zero. The ROCm controller guards this with `max(1, self.iterations)`.
Suggested guard for division
```diff
-            (toc - tic) * 1000 / self.matmul_iterations,
+            (toc - tic) * 1000 / max(1, self.matmul_iterations),
```
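A sketch of the alternating-op idea mentioned above (the method name `_run_busy_batch` and the occasional `uniform_()` refresh are illustrative additions, not existing code; `relu_iterations` and `self.device` are assumed from the controller):

```python
@torch.no_grad()
def _run_busy_batch(self, matrix: torch.Tensor) -> None:
    """Alternate in-place ops so each iteration performs real element-wise work."""
    for i in range(max(1, self.relu_iterations)):
        matrix.neg_()    # flip signs so the ReLU below has something to clamp
        matrix.relu_()   # clamp the now-negative values back to zero
        if i % 64 == 0:
            matrix.uniform_()  # refresh values; otherwise everything settles at zero
    torch.cuda.synchronize(self.device)
```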
127-144: ⚠️ Potential issue | 🟠 Major
Allocation retry loop differs from the ROCm controller in three important ways.

1. `ValueError` escapes the loop – Line 139 catches only `RuntimeError`, but the validation on lines 130-131 raises `ValueError`. If `vram_to_keep` resolves to a non-positive integer, the `ValueError` kills the background thread silently. The ROCm controller catches `(RuntimeError, ValueError)`.
2. Unbounded retries – This loop retries forever (`while not self._stop_evt.is_set()`) on allocation `RuntimeError`. The ROCm controller bounds retries with `max_allocation_retries`. A persistent OOM condition here means an infinite retry loop.
3. `raise` in a daemon thread is silently lost – Line 144 raises `RuntimeError` inside the background thread. Since it's a daemon thread, the exception is swallowed and the caller has no way to detect the failure. The ROCm controller stores it in `self._failure_exc` and exposes `allocation_status()`.

Consider aligning with the ROCm controller's pattern for consistency.
Suggested diff
```diff
+        attempts = 0
-        while not self._stop_evt.is_set():
+        while not self._stop_evt.is_set() and attempts < self.max_allocation_retries:
             try:
                 num_elements = int(self.vram_to_keep)
                 if num_elements <= 0:
                     raise ValueError("vram_to_keep must be positive")
                 matrix = torch.rand(
                     num_elements,
                     device=self.device,
                     dtype=torch.float32,
                     requires_grad=False,
                 )
                 break
-            except RuntimeError as e:
-                logger.error("rank %s: failed to allocate matrix: %s", self.rank, e)
+            except (RuntimeError, ValueError) as e:
+                attempts += 1
+                logger.error("rank %s: failed to allocate matrix (attempt %d/%d): %s", self.rank, attempts, self.max_allocation_retries, e)
                 time.sleep(self.interval)

         if matrix is None:
-            logger.error("rank %s: failed to allocate matrix, exiting loop", self.rank)
-            raise RuntimeError("Failed to allocate matrix for GPU keeping")
+            self._failure_exc = RuntimeError(
+                f"rank {self.rank}: failed to allocate matrix after {attempts} attempts"
+            )
+            logger.error("%s", self._failure_exc)
+            return
```
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)
68-80: ⚠️ Potential issue | 🟡 Minor
`release()` logs "stopped & cache cleared" even when the thread was never running.
Line 80's `logger.info("rank %s: keep thread stopped & cache cleared")` executes unconditionally after the if/else. When the thread isn't alive (else branch), the user sees both the "not running" warning and the "stopped & cache cleared" info — which is contradictory.
Suggested fix
```diff
 def release(self) -> None:
     if self._thread and self._thread.is_alive():
         self._stop_evt.set()
         self._thread.join()
         torch.cuda.empty_cache()
+        logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
     else:
         logger.warning("rank %s: keep thread not running", self.rank)
     if self._rocm_smi:
         try:
             self._rocm_smi.rsmi_shut_down()
         except Exception as exc:  # pragma: no cover - best effort
             logger.debug("rsmi_shut_down failed: %s", exc)
-    logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
```
🤖 Fix all issues with AI agents
In `@tests/cuda_controller/context_manager.py`:
- Line 12: The test file tests/cuda_controller/context_manager.py containing the
test function test_cuda_controller_context_manager will not be collected by
pytest because its filename doesn't match the default discovery patterns; fix by
either renaming the file to a pytest-recognized name like
test_context_manager.py (so the existing function
test_cuda_controller_context_manager is discovered) or add a python_files
pattern to test configuration (pyproject.toml or conftest.py) to include
context_manager.py.
🧹 Nitpick comments (2)
tests/cuda_controller/test_keep_and_release.py (1)
22-25: Redundant `join()` after `release()` – masks real bugs in production code.
`CudaGPUController.release()` already calls `self._thread.join()` (unbounded) internally. The extra `ctrl._thread.join(timeout=2)` on line 24 re-joins an already-joined thread, which is a no-op. More importantly, if `release()` ever fails to join (a real bug), this test-side join would paper over it instead of exposing the defect.
Similarly, the `assert ctrl._thread is not None` on line 22 is trivially true given line 20 already asserted `ctrl._thread and ctrl._thread.is_alive()`.
Consider removing the redundant join and None check here and at lines 30-32, keeping only the post-`release()` liveness assertion, which is the meaningful check.
Suggested diff
```diff
     assert ctrl._thread and ctrl._thread.is_alive()
-    assert ctrl._thread is not None
     ctrl.release()
-    ctrl._thread.join(timeout=2)
     assert not (ctrl._thread and ctrl._thread.is_alive())
```
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)
148-150: `allocation_status()` exposes thread-internal failure — consider documenting thread-safety semantics.
Since `_failure_exc` is written by the background thread and read from the main thread, a brief note that the reference assignment is atomic in CPython (GIL) but not guaranteed in other Python implementations would help future maintainers. This is not a bug today but a minor documentation opportunity.
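A possible shape for that note (docstring wording and the return annotation are illustrative; only `_failure_exc` and `allocation_status()` come from the review):

```python
def allocation_status(self) -> Exception | None:
    """Return the last allocation failure, or None if allocation has succeeded.

    Thread-safety: `_failure_exc` is written by the keeper thread and read here
    from the caller's thread. A single attribute assignment is atomic under
    CPython's GIL; other interpreters may need an explicit lock.
    """
    return self._failure_exc
```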
```diff
@@ -10,15 +11,20 @@
 )
 def test_cuda_controller_context_manager():
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -e

echo "=== Checking pytest configuration for python_files setting ==="
for config_file in pyproject.toml setup.cfg pytest.ini tox.ini; do
  if [ -f "$config_file" ]; then
    echo "--- Found $config_file ---"
    rg -n "python_files" "$config_file" 2>/dev/null || echo "No python_files setting in $config_file"
  fi
done

echo -e "\n=== Checking conftest.py for custom collection ==="
find . -name "conftest.py" -type f | while read conftest; do
  echo "--- Checking $conftest ---"
  rg -n "collect_ignore\|pytest_collect_file\|python_files" "$conftest" 2>/dev/null || echo "No custom collection config in $conftest"
done

echo -e "\n=== Checking if context_manager.py file exists and its git status ==="
if [ -f "tests/cuda_controller/context_manager.py" ]; then
  echo "File exists at tests/cuda_controller/context_manager.py"
  git log --oneline --all -- "tests/cuda_controller/context_manager.py" 2>/dev/null | head -5 || echo "File not in git history yet (new file)"
else
  echo "File does not exist at tests/cuda_controller/context_manager.py"
fi

echo -e "\n=== Listing test files in tests/cuda_controller directory ==="
ls -la tests/cuda_controller/ | grep -E "\.py$" || echo "No Python files found"
```

Repository: Wangmerlyn/KeepGPU
Length of output: 958
Test file context_manager.py will not be discovered by pytest.
Pytest uses default collection patterns test_*.py or *_test.py. There is no python_files configuration in pyproject.toml or custom collection logic in conftest.py. Rename the file to test_context_manager.py or add a python_files setting to enable discovery of non-standard test filenames.
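If the configuration route is preferred over renaming, the relevant pytest option is `python_files`; a sketch of what the `pyproject.toml` entry might look like (the extra pattern is only needed if the non-standard filename is kept):

```toml
[tool.pytest.ini_options]
# Default discovery patterns plus the one non-standard test module.
python_files = ["test_*.py", "*_test.py", "context_manager.py"]
```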
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)
68-80: ⚠️ Potential issue | 🟡 Minor
"stopped & cache cleared" logged even when the thread was not running.
When `release()` takes the `else` branch (lines 73-74), execution falls through to line 80, which logs the misleading message. Move the success log inside the `if` block or make it conditional.
Proposed fix
```diff
 def release(self) -> None:
     if self._thread and self._thread.is_alive():
         self._stop_evt.set()
         self._thread.join()
         torch.cuda.empty_cache()
+        logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
     else:
         logger.warning("rank %s: keep thread not running", self.rank)
     if self._rocm_smi:
         try:
             self._rocm_smi.rsmi_shut_down()
         except Exception as exc:  # pragma: no cover - best effort
             logger.debug("rsmi_shut_down failed: %s", exc)
-    logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
```
tests/cuda_controller/test_throttle.py (1)
37-38: ⚠️ Potential issue | 🟡 Minor
Stale comment: still references "matmul" after the ReLU migration.
Line 37 says "no matmul batches should run" but the workload is now ReLU-based.
Proposed fix
```diff
-    # When utilization is always high, no matmul batches should run
+    # When utilization is always high, no ReLU batches should run
```
🤖 Fix all issues with AI agents
In `@src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py`:
- Around line 196-200: The debug log divides by self.relu_iterations and can
raise ZeroDivisionError when relu_iterations is 0; update the calculation used
in logger.debug (the expression using (toc - tic) * 1000 / self.relu_iterations)
to guard against zero—e.g., use max(1, self.relu_iterations) (similar to the
ROCm controller's pattern) or ensure relu_iterations is validated in __init__ to
be >0; apply the same guard for matmul_iterations where applicable so the
division never uses zero.
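A minimal sketch of the `__init__`-validation route from that prompt (the constructor signature and default are assumed, not taken from the file):

```python
def __init__(self, relu_iterations: int = 100, **kwargs) -> None:
    if relu_iterations <= 0:
        raise ValueError(f"relu_iterations must be > 0, got {relu_iterations}")
    self.relu_iterations = relu_iterations
    # ... rest of the existing initialization ...
```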
In `@src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py`:
- Around line 105-137: The ValueError raised for invalid vram_to_keep should not
be treated as transient: in the allocation loop inside rocm_gpu_controller (the
block that creates tensor with torch.rand using self.vram_to_keep and the
surrounding retry logic referencing self.max_allocation_retries, attempts, and
self._failure_exc), stop catching ValueError so it propagates immediately;
change the except clause to only catch RuntimeError (allocation failures), keep
the existing logging and retry/failure behavior for RuntimeError, and ensure
configuration errors (ValueError) are not swallowed or retried.
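A sketch of the narrowed exception handling described in that prompt (loop shape and names are inferred from the review excerpts above, not copied from the file; `time`, `torch`, and `logger` are assumed module imports):

```python
attempts = 0
tensor = None
while not self._stop_evt.is_set() and attempts < self.max_allocation_retries:
    try:
        # A non-positive vram_to_keep raises ValueError here and is *not*
        # caught below, so the configuration error surfaces immediately.
        num_elements = int(self.vram_to_keep)
        if num_elements <= 0:
            raise ValueError("vram_to_keep must be positive")
        tensor = torch.rand(num_elements, device=self.device, dtype=torch.float32)
        break
    except RuntimeError as exc:  # transient allocation failures only
        attempts += 1
        self._failure_exc = exc
        logger.error("rank %s: allocation attempt %d failed: %s", self.rank, attempts, exc)
        time.sleep(self.interval)
```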
🧹 Nitpick comments (1)
src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (1)
134-158: Allocation-failure path raises in a daemon thread on clean shutdown.
When `release()` is called while the allocation-retry loop (lines 144-151) is still spinning, `_stop_evt` gets set, the `while` exits with `matrix is None`, and line 158 raises `RuntimeError` inside the daemon thread. The exception is silently swallowed (daemon thread), but it generates a noisy traceback on stderr.
Consider returning early (or logging) when `_stop_evt` is set, similar to the ROCm controller's graceful-exit pattern.
Proposed fix
```diff
         if matrix is None:
-            logger.error("rank %s: failed to allocate matrix, exiting loop", self.rank)
-            raise RuntimeError("Failed to allocate matrix for GPU keeping")
+            if self._stop_evt.is_set():
+                logger.info("rank %s: allocation cancelled (stop requested)", self.rank)
+                return
+            logger.error("rank %s: failed to allocate matrix, exiting loop", self.rank)
+            raise RuntimeError("Failed to allocate matrix for GPU keeping")
```