[VIRT] Collect VNC screenshot on post-migration verification timeout#3953
[VIRT] Collect VNC screenshot on post-migration verification timeout#3953dshchedr wants to merge 2 commits intoRedHatQE:mainfrom
Conversation
📝 WalkthroughWalkthroughThe change wraps the two VM-post-migration waits inside a single try block in Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Report bugs in Issues Welcome! 🎉This pull request will be automatically processed with the following features: 🔄 Automatic Actions
📋 Available CommandsPR Status Management
Review & Approval
Testing & Validation
Container Operations
Cherry-pick Operations
Label Management
✅ Merge RequirementsThis PR will be automatically approved when the following conditions are met:
📊 Review ProcessApprovers and ReviewersApprovers:
Reviewers:
Available Labels
💡 Tips
For more information, please refer to the project documentation or contact the maintainers. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
utilities/virt.py (1)
1886-1907: 🛠️ Refactor suggestion | 🟠 MajorMEDIUM:
verify_vm_migratedlacks a Google-format docstring despite new observable side effectsPer coding guidelines: "Google-format docstrings REQUIRED for all public functions with non-obvious return values OR side effects." This PR adds a new side effect (VNC screenshot collection on timeout) to a public utility function that previously had none. The function still has no docstring at all.
📝 Proposed docstring
def verify_vm_migrated( vm, node_before, wait_for_interfaces=True, check_ssh_connectivity=False, ): + """ + Verify that a VM has successfully migrated to a new node. + + Asserts the VMI is running on a different node and that its migration state + is marked completed. Optionally waits for guest-agent network interfaces and + SSH connectivity. On timeout, collects a VNC screenshot for debugging before + re-raising the exception. + + Args: + vm: VM object whose VMI was migrated. + node_before: Node object the VMI was running on before migration. + wait_for_interfaces (bool): Wait for guest-agent to report network interfaces. + check_ssh_connectivity (bool): Verify SSH is reachable after migration. + + Raises: + AssertionError: If the VMI is still on the original node or migration is incomplete. + TimeoutExpiredError: If interfaces or SSH connectivity wait times out. + """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utilities/virt.py` around lines 1886 - 1907, Add a Google-style docstring to the public function verify_vm_migrated explaining its parameters (vm, node_before, wait_for_interfaces, check_ssh_connectivity), behavior (asserts migration completed and that VMI moved off node_before, optionally waits for interfaces via wait_for_vm_interfaces and SSH via wait_for_ssh_connectivity), return value (None), raised exceptions (AssertionError, TimeoutExpiredError), and observable side effects including that on TimeoutExpiredError it invokes collect_vnc_screenshot_for_vms to capture VNC screenshots for debugging; mention any preconditions (vm.vmi and vm.vmi.instance expected) and that the function relies on migrationState.completed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@utilities/virt.py`:
- Around line 1905-1907: Add a LOGGER.error call that logs the
TimeoutExpiredError with context (vm.name, vm.namespace) before re-raising, and
protect the diagnostic call so it cannot mask the original exception: in the
except TimeoutExpiredError block, first call LOGGER.error(...) about the
timeout, then call collect_vnc_screenshot_for_vms(...) inside its own try/except
Exception that logs any failure (also with vm context) but does not raise, and
finally re-raise the original TimeoutExpiredError; reference
TimeoutExpiredError, collect_vnc_screenshot_for_vms, and LOGGER.error to locate
the change.
---
Outside diff comments:
In `@utilities/virt.py`:
- Around line 1886-1907: Add a Google-style docstring to the public function
verify_vm_migrated explaining its parameters (vm, node_before,
wait_for_interfaces, check_ssh_connectivity), behavior (asserts migration
completed and that VMI moved off node_before, optionally waits for interfaces
via wait_for_vm_interfaces and SSH via wait_for_ssh_connectivity), return value
(None), raised exceptions (AssertionError, TimeoutExpiredError), and observable
side effects including that on TimeoutExpiredError it invokes
collect_vnc_screenshot_for_vms to capture VNC screenshots for debugging; mention
any preconditions (vm.vmi and vm.vmi.instance expected) and that the function
relies on migrationState.completed.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3953 +/- ##
=======================================
Coverage 98.56% 98.56%
=======================================
Files 25 25
Lines 2297 2307 +10
=======================================
+ Hits 2264 2274 +10
Misses 33 33
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
♻️ Duplicate comments (1)
utilities/virt.py (1)
1905-1908:⚠️ Potential issue | 🟠 MajorHIGH:
collect_vnc_screenshot_for_vmscan mask the root-causeTimeoutExpiredError.
collect_vnc_screenshot_for_vmscallsrun_virtctl_commandunconditionally (seeutilities/data_collector.pylines 113–118). If the virtctl call fails — VNC server unavailable, base directory unwritable, pod unreachable — Python replaces the in-flightTimeoutExpiredErrorwith the new exception before theraiseon line 1908 executes. The CI failure message then points to the screenshot error, not the migration-verification timeout, making diagnosis significantly harder.The
LOGGER.error(line 1906) was a good fix from the previous review round. The remaining unresolved part is protecting the diagnostic side-effect so it can never suppress the original exception.🛡️ Proposed fix — guard the screenshot call
except TimeoutExpiredError: LOGGER.error(f"VM {vm.name} unresponsive after migration; getting VNC screenshot") - collect_vnc_screenshot_for_vms(vm_name=vm.name, vm_namespace=vm.namespace) + try: + collect_vnc_screenshot_for_vms(vm_name=vm.name, vm_namespace=vm.namespace) + except Exception as screenshot_error: + LOGGER.error(f"Failed to collect VNC screenshot for VM {vm.name}: {screenshot_error}") raiseThe inner
except Exceptionis intentional here:collect_vnc_screenshot_for_vmsis a diagnostic side-effect whose failure must never replace the root-cause exception. Based on learnings, the established repo pattern isLOGGER.errorbefore re-raising (servolkov/PR3387 learning).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utilities/virt.py` around lines 1905 - 1908, The TimeoutExpiredError handler currently calls collect_vnc_screenshot_for_vms directly which can raise its own exceptions (from run_virtctl_command) and mask the original TimeoutExpiredError; wrap the call to collect_vnc_screenshot_for_vms(vm_name=vm.name, vm_namespace=vm.namespace) in its own try/except Exception that logs the diagnostic failure via LOGGER.error (including the exception) but does not suppress or replace the original TimeoutExpiredError, then re-raise the original TimeoutExpiredError as before (i.e., keep the existing LOGGER.error("VM ... unresponsive...") and ensure only diagnostic errors are caught and logged locally).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@utilities/virt.py`:
- Around line 1905-1908: The TimeoutExpiredError handler currently calls
collect_vnc_screenshot_for_vms directly which can raise its own exceptions (from
run_virtctl_command) and mask the original TimeoutExpiredError; wrap the call to
collect_vnc_screenshot_for_vms(vm_name=vm.name, vm_namespace=vm.namespace) in
its own try/except Exception that logs the diagnostic failure via LOGGER.error
(including the exception) but does not suppress or replace the original
TimeoutExpiredError, then re-raise the original TimeoutExpiredError as before
(i.e., keep the existing LOGGER.error("VM ... unresponsive...") and ensure only
diagnostic errors are caught and logged locally).
|
/approve |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
utilities/virt.py (1)
1892-1901:⚠️ Potential issue | 🟠 MajorHIGH: Screenshot collection can hide the real timeout
If
collect_vnc_screenshot_for_vms()throws, the originalTimeoutExpiredErrornever re-raises, so CI reports the screenshot failure instead of the real post‑migration timeout. Guard the diagnostic call with a narrow exception list, log the failure, and then re‑raise the original timeout so root cause stays visible.✅ Proposed fix (guard diagnostic side‑effect)
except TimeoutExpiredError: LOGGER.error(f"VM {vm.name} unresponsive after migration; getting VNC screenshot") - collect_vnc_screenshot_for_vms(vm_name=vm.name, vm_namespace=vm.namespace) + try: + collect_vnc_screenshot_for_vms(vm_name=vm.name, vm_namespace=vm.namespace) + except (CommandExecFailed, OSError) as screenshot_error: + LOGGER.error( + f"Failed to collect VNC screenshot for VM {vm.name} in {vm.namespace}: {screenshot_error}" + ) raiseAs per coding guidelines: Exception handling: Do not catch bare
Exception- catch specific exception types only; Exception handling: NEVER silently swallow exceptions - at minimum, log the error before continuing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utilities/virt.py` around lines 1892 - 1901, The TimeoutExpiredError handler currently calls collect_vnc_screenshot_for_vms() directly which can raise and obscure the original TimeoutExpiredError; change the except TimeoutExpiredError block to capture the original exception (e.g., orig_exc = sys.exc_info()[1]), then call collect_vnc_screenshot_for_vms() inside its own narrow try/except (catch diagnostic-specific exceptions or at minimum Exception as diag_err), log the diagnostic failure via LOGGER.exception or LOGGER.error including diag_err, and finally re-raise the original TimeoutExpiredError (raise orig_exc) so the root timeout remains visible; references: wait_for_vm_interfaces, wait_for_ssh_connectivity, collect_vnc_screenshot_for_vms, TimeoutExpiredError, LOGGER.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@utilities/virt.py`:
- Around line 1892-1901: The TimeoutExpiredError handler currently calls
collect_vnc_screenshot_for_vms() directly which can raise and obscure the
original TimeoutExpiredError; change the except TimeoutExpiredError block to
capture the original exception (e.g., orig_exc = sys.exc_info()[1]), then call
collect_vnc_screenshot_for_vms() inside its own narrow try/except (catch
diagnostic-specific exceptions or at minimum Exception as diag_err), log the
diagnostic failure via LOGGER.exception or LOGGER.error including diag_err, and
finally re-raise the original TimeoutExpiredError (raise orig_exc) so the root
timeout remains visible; references: wait_for_vm_interfaces,
wait_for_ssh_connectivity, collect_vnc_screenshot_for_vms, TimeoutExpiredError,
LOGGER.
|
/retest build-container |
|
/build-and-push-container |
|
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-3953 published |
|
/verified smoke tests green |
Test execution plan request detailsCRITICAL: You MUST post an inline review comment on the first changed line of the first file. As an expert software testing engineer, analyze all modified files in this PR and create a targeted test execution plan. Analysis Requirements:
Your deliverable: Test Execution Plan
Guidelines:
Hardware-Related Checks (SR-IOV, GPU, DPDK): When PR modifies fixtures for hardware-specific resources:
CRITICAL WORKFLOW COMPLETION RULES: When responding to this test execution plan request, you MUST follow these rules EXACTLY:
ABSOLUTE PROHIBITIONS (violating these creates empty/meaningless reviews):
Remember: The pull request review is visible to users. Additional comments are redundant noise. CRITICAL: Before setting "Run smoke tests: True", you MUST use shell scripts to: |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 2803 🏁 Script executed: Length of output: 3829 🏁 Script executed: Length of output: 1795 🧠 Learnings used |
|
/lgtm |
|
/approve |
Add VNC screenshot collection when VM non-responsive after migration.
Signed-off-by: dshchedr@redhat.com
Assisted-by: Claude
Short description:
More details:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for reviewer:
jira-ticket:
Summary by CodeRabbit