-
Notifications
You must be signed in to change notification settings - Fork 338
DAOS-18355 chk: check leader waits all check engines before exited #17315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Ticket title is 'recovery/check_start_corner_case.py:DMGCheckStartCornerCaseTest.test_start_back_to_back - dmg check start pool_2 failed after 40 sec' |
6f9ab6d to
268abfb
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/3/testReport/ |
268abfb to
d7ad13f
Compare
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1345/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1355/log |
d7ad13f to
af1b78e
Compare
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/301/log |
|
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/317/log |
|
Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/309/log |
|
Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/405/log |
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/217/log |
|
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/233/log |
|
Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/225/log |
4953689 to
bc0ea4d
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/8/testReport/ |
bc0ea4d to
59b3671
Compare
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1176/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1156/log |
edc05cf to
a17fc07
Compare
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log |
failed for DAOS-18388, not related with the patch, to be retested. |
In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly. The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit. Test-tag: recovery Signed-off-by: Fan Yong <fan.yong@hpe.com>
a17fc07 to
fc52e02
Compare
In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly.
The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit.
Test-tag: recovery
Steps for the author:
After all prior steps are complete: