-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-14012. SCM needs to log safemode exit rules at regular intervals #9376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sreejasahithi for working on this.
The PR title has = instead of - in the JIRA ID.
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
sarvekshayr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you've tested the changes, attach the logs so we can verify the behaviour.
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
|
Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful. |
yes , will add the log examples, I just have a couple of changes to make after which I will add the examples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sreejasahithi IMO, we do not need print safe mode status, its logged in below condition based on event,
- DN registered
- pipeline report
- open pipeline
- container Ratis/EC registeration
So it process the event from DN on HB and validate. If satisfied, exit safemode.
We do not need again at regular interval, but CLI is present to have safemode rule info on need basis from leader. For HB, already we have audit log at SCM, that can be referred for problem analysis.
May be we need have support query safemode status from CLI as requirement from follower node also.
cc: @errose28
The information logged here is not a duplicate of the event triggered rules in
CLI works when you have direct access to the cluster but not for offline analysis where we need to triage an issue from logs.
Yes we should also circle back to HDDS-13832 and get that implemented as well. This is needed for rolling restart scenarios where we want to wait for the restarted follower to exit safemode before restarting another node. |
|
Below is a sample of SCM log messages before the changes made in this patch : Below is a sample of SCM log messages after the changes made in this patch (the SCM logs will still contain above log messages): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. The comparison of the log messages definitely helps show the use case for the improvement since I think the bottom one is much easier to follow. Can we add a test using log capturer to check that each safemode rules's getStatusText is being printed periodically while in safemode?
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsConfigKeys.java
Show resolved
Hide resolved
errose28
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates. The new log format looks good. Left a few more comments based on the new output and tests.
| Set<String> expected = ImmutableSet.<String>builder() | ||
| .add(OZONE_ADMINISTRATORS) | ||
| .add(OZONE_READONLY_ADMINISTRATORS) | ||
| .add(HddsConfigKeys.HDDS_SCM_SAFEMODE_LOG_INTERVAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other config values here have dedicated tests in this suite, let's add one for this new property too.
| } | ||
| } | ||
|
|
||
| private synchronized void stopSafeModePeriodicLogger() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After stopping the periodic logger I think we should print one last summary message immediately when SCM exits safemode. Otherwise if we are grepping for the prefix while tailing the logs, we won't have in indication that it finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes , this is already satisfied for normal exit in validateSafeModeExitRules:
if (validatedRules.size() == exitRules.size()
&& status.compareAndSet(SafeModeStatus.PRE_CHECKS_PASSED, SafeModeStatus.OUT_OF_SAFE_MODE)) {
logSafeModeStatus(); <-- by making a call here
will add similarly for force exit as well.
| * while SCM is in safe mode. | ||
| */ | ||
| @Test | ||
| public void testSafeModePeriodicLogging() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is good for when SCM is in safemode, but we should also test that the logger is stopped and prints one final message when safemode exits normally or is force exited.
| statusLog.append(String.format( | ||
| "%nSCM SafeMode Status | state=%s preCheckComplete=%s validatedPreCheckRules=%d/%d validatedRules=%d/%d", | ||
| safeModeStatus.isInSafeMode() ? | ||
| (safeModeStatus.isPreCheckComplete() ? "PRE_CHECKS_PASSED" : "INITIAL") : "OUT_OF_SAFE_MODE", | ||
| safeModeStatus.isPreCheckComplete(), preCheckValidatedCount, preCheckRules.size(), validatedCount, | ||
| exitRules.size())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hard to read, can we split this out to just use the StringBuilder instead of nesting String.format and ternary operators?
| if (statusText.endsWith(";")) { | ||
| statusText = statusText.substring(0, statusText.length() - 1); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we can just remove the semicolon from the end of AbstractContainerSafeModeRule#getStatusText instead of stripping it here. Neither I nor Cursor can find a case that depends on the semicolon.
…ng(force-exit and normal exit), log interval reconfiguration
What changes were proposed in this pull request?
SCM logs rule statuses at arbitrary time intervals. Sometimes there is one log line per minute, sometimes it will go 5+ minutes without logging anything and then log one line showing a large jump in progress. This is not due to log flushing, the timestamps on the log lines exhibit these gaps too. We need a timer in the safemode manager that gives all safemode information at a configurable interval, probably once a minute by default.
What is the link to the Apache JIRA
HDDS-14012
How was this patch tested?
https://github.com/sreejasahithi/ozone/actions/runs/19693260454