@Nasf-Fan (Contributor) commented Dec 25, 2025

In the old implementation, when a PS leader notifies the check leader that its pool has been checked, the check leader marks that pool as 'done'. Once all required pools have been marked as 'done', the check leader exits. At that point, however, the check engine on the PS leader may not have completed yet: some work (such as restarting the pool server) remains to be done after the pool is checked. The check engine notifies the check leader via CHK IV when it exits, but the check leader does not wait for that notification. In that case, if someone triggers a new check instance, a new IV namespace is created. As a result, some check engines and the check leader use different IV namespaces, so the CHK IV logic cannot recognize the leadership correctly.

The patch adjusts the leader exit logic: the leader scheduler needs to wait for the notifications from all check engines before exiting.
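
To illustrate the change in the exit condition, here is a minimal sketch in C. It is not the DAOS code: `struct leader_state`, its fields, and all four functions are hypothetical names invented for this example, and the real leader scheduler tracks considerably more state. The sketch only contrasts "all pools done" (old behavior) with "all pools done and all check engines have sent their exit notification" (behavior described by this patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the check leader's bookkeeping. */
struct leader_state {
	uint32_t pools_total;    /* pools required by this check instance */
	uint32_t pools_done;     /* pools reported as checked by their PS leader */
	uint32_t engines_total;  /* check engines taking part in this instance */
	uint32_t engines_exited; /* engines that have sent their exit notification */
};

/* A PS leader reports that its pool has been checked. */
void leader_pool_done(struct leader_state *ls)
{
	if (ls->pools_done < ls->pools_total)
		ls->pools_done++;
}

/* A check engine reports (via CHK IV in the real system) that it has exited. */
void leader_engine_exited(struct leader_state *ls)
{
	if (ls->engines_exited < ls->engines_total)
		ls->engines_exited++;
}

/* Old condition: exit as soon as every pool is marked 'done'. */
bool leader_can_exit_old(const struct leader_state *ls)
{
	return ls->pools_done == ls->pools_total;
}

/* New condition: additionally wait for every engine's exit notification. */
bool leader_can_exit_new(const struct leader_state *ls)
{
	return ls->pools_done == ls->pools_total &&
	       ls->engines_exited == ls->engines_total;
}
```

With two pools and two engines, the old predicate becomes true right after the second pool is reported, while the new one stays false until both engines have also reported their exit. That gap is the window in which a new check instance could previously create a new IV namespace.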

Test-tag: recovery

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'recovery/check_start_corner_case.py:DMGCheckStartCornerCaseTest.test_start_back_to_back - dmg check start pool_2 failed after 40 sec'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18355

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from 6f9ab6d to 268abfb Compare December 25, 2025 16:01
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from 268abfb to d7ad13f Compare December 26, 2025 01:15
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1345/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1355/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from d7ad13f to af1b78e Compare December 26, 2025 07:59
@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/301/log

@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/217/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from 4953689 to bc0ea4d Compare December 28, 2025 09:36
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from bc0ea4d to 59b3671 Compare December 28, 2025 14:14
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1176/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1156/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from edc05cf to a17fc07 Compare December 31, 2025 07:37
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log

@Nasf-Fan (Contributor, Author) commented Jan 1, 2026

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log

The failure above is due to DAOS-18388, which is not related to this patch; to be retested.

In the old implementation, when a PS leader notifies the check leader
that its pool has been checked, the check leader marks that pool as
'done'. Once all required pools have been marked as 'done', the check
leader exits. At that point, however, the check engine on the PS
leader may not have completed yet: some work (such as restarting the
pool server) remains to be done after the pool is checked. The check
engine notifies the check leader via CHK IV when it exits, but the
check leader does not wait for that notification. In that case, if
someone triggers a new check instance, a new IV namespace is created.
As a result, some check engines and the check leader use different IV
namespaces, so the CHK IV logic cannot recognize the leadership
correctly.

The patch adjusts the leader exit logic: the leader scheduler needs
to wait for the notifications from all check engines before exiting.

Test-tag: recovery

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from a17fc07 to fc52e02 Compare January 1, 2026 03:13