Fix flakiness of SR-IOV reboot test#3928
Conversation
📝 Walkthrough: Replaces the immediate equality assertion with a TimeoutSampler polling loop (120s timeout, 5s interval) that waits for the post-reboot VMI interface to match the pre-reboot interface state.
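The polling pattern described above can be sketched with a pure-Python stand-in; the real class lives in the `timeout_sampler` package, and the names and numbers below are illustrative only:

```python
import time


# Illustrative stand-in for the TimeoutSampler generator contract: yield the
# result of func every `sleep` seconds until `wait_timeout` elapses, then raise.
class TimeoutExpiredError(Exception):
    pass


def timeout_sampler(wait_timeout, sleep, func):
    deadline = time.monotonic() + wait_timeout
    while time.monotonic() < deadline:
        yield func()
        time.sleep(sleep)
    raise TimeoutExpiredError(f"no truthy sample within {wait_timeout}s")


# Usage mirroring the test's loop: stop as soon as a sample is truthy.
state = {"calls": 0}


def interfaces_match():
    state["calls"] += 1
    return state["calls"] >= 3  # the condition "settles" on the third poll


for sample in timeout_sampler(wait_timeout=2, sleep=0.01, func=interfaces_match):
    if sample:
        break
```

The caller iterates and breaks on the first truthy sample; exhausting the timeout surfaces as an exception rather than a silent failure.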
/verified Verified by:
Actionable comments posted: 1
Inline comments:
In `tests/network/sriov/test_sriov.py`:
- Around lines 97-110: the `except` clause catches the wrong exception, so the diagnostic log never runs. Change the handler to catch `TimeoutExpiredError` from `timeout_sampler` (adding the import) instead of `ValueError`. That way, when the `TimeoutSampler` in the `try` block (polling the lambda that compares `restarted_sriov_vm4.vmi.interfaces[1]` to `vm4_interfaces[1]`) times out, the `except TimeoutExpiredError` block executes and `LOGGER.error` prints the before/after interface states before re-raising.
tests/network/sriov/test_sriov.py (outdated):

```python
    ):
        if sample:
            return
except ValueError:
```
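The bug and its fix can be demonstrated with a small self-contained sketch; `TimeoutExpiredError` is redefined locally as a stand-in for the one exported by the `timeout_sampler` package, and the function names are hypothetical:

```python
# If the sampler raises TimeoutExpiredError but the handler catches ValueError,
# the diagnostic branch is dead code and the timeout surfaces with no context.
class TimeoutExpiredError(Exception):
    pass


def sample_forever():
    # Stand-in for a TimeoutSampler loop whose condition never becomes true.
    raise TimeoutExpiredError("interfaces never matched")


logged = []


def check_with_wrong_handler():
    try:
        sample_forever()
    except ValueError:  # wrong exception type: this branch is never entered
        logged.append("diagnostics")
        raise


def check_with_right_handler():
    try:
        sample_forever()
    except TimeoutExpiredError:  # correct type: log context, then re-raise
        logged.append("before/after interface state")
        raise
```

With the wrong exception type the diagnostic branch is unreachable; the fix is entirely in the `except` clause.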
Shouldn't it be TimeoutExpiredError?
Indeed.
Thank you.
/retest all (auto-triggered: files in this PR were modified by merged PR #3598; overlapping file: tests/network/sriov/test_sriov.py)
Force-pushed 9506466 to 34010d3.
We observed flakiness of `test_sriov_interfaces_post_reboot`: sometimes the interface status report (post reboot) is not yet fully updated, which is acceptable and shouldn't fail the test immediately. The check is therefore given a grace period to allow the interface status to complete before it is evaluated.
Force-pushed 9ac0864 to 42f5206.
```python
# Grace the check with a timeout rather than immediately declaring failure. The reason is that although
# the interface already appears in the VMI status, it's not immediately fully updated with all the
# required fields.
try:
```
Please consider using a helper with a retry decorator, which gives the test a clearer structure.

The helper (roughly; use better naming if you need):

```python
@retry(wait_timeout=120, sleep=5)
def _check_interface_match(vm, expected_iface):
    current_iface = vm.vmi.interfaces[1]
    return current_iface if current_iface == expected_iface else None


def check_interface_match(vm, expected_iface):
    try:
        return _check_interface_match(vm, expected_iface)
    except TimeoutExpiredError:
        # add any logs if needed
        return None
```

and the test will be something like:

```python
def test_sriov_interfaces_post_reboot(
    self,
    vm4_interfaces,
    restarted_sriov_vm4,
):
    actual_iface = check_interface_match(restarted_sriov_vm4, vm4_interfaces[1])
    assert actual_iface == vm4_interfaces[1]
```

It brings readability and avoids complicated code inside the test.

BTW, we already have `lookup_iface_status`; it is probably even better to use it, but the logic is the same.
```python
for sample in TimeoutSampler(
    wait_timeout=120,
    sleep=5,
    func=lambda: restarted_sriov_vm4.vmi.interfaces[1] == vm4_interfaces[1],
```

Edi suggested using `lookup_iface_status` to wait for the interfaces; then we can drop the TimeoutSampler.
```python
# required fields.
try:
    for sample in TimeoutSampler(
        wait_timeout=120,
```

The 120s timeout seems high for this case. The check was flaky before this addition, and going from no wait to 120s adds a lot of extra wait in the failure case.
By the time the test runs, restarted_sriov_vm4 already waited for guest agent + all interfaces via wait_for_vm_interfaces. We're only waiting for remaining fields (IP, MAC) to settle, and for that, the codebase typically uses 30-60s (e.g. IP change after migration: 60s, interface status lookup: 30s). I'd suggest we reduce this to 60s.
EdDev left a comment:
I have 3 points of feedback on this:
- The flakiness of the test is not limited to the check post-restart. The original interface information is by definition also flaky in nature, as it does not wait for specifics. E.g. if an IP is expected, then the test should explicitly wait for it and not assume it is there; otherwise you will sometimes see it in the initial collection and sometimes not.
- We already have good alternatives to `wait_for_vm_interfaces` and custom-made retries. Inventing new ones is not recommended.
- Waiting in several locations (fixture and test body) for different reasons needs good reasoning.