HDDS-14012. SCM needs to log safemode exit rules at regular intervals #9376

sreejasahithi · 2025-11-26T08:58:02Z

What changes were proposed in this pull request?

SCM logs rule statuses at arbitrary time intervals. Sometimes there is one log line per minute; sometimes it goes 5+ minutes without logging anything and then logs one line showing a large jump in progress. This is not due to log flushing: the timestamps on the log lines exhibit these gaps too. We need a timer in the safemode manager that logs all safemode information at a configurable interval, probably once a minute by default.
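A minimal sketch of the kind of timer described above, assuming a single-threaded `ScheduledExecutorService`. The class name `PeriodicStatusLogger`, the thread name, and the `Supplier`/`Consumer` wiring are hypothetical and are not taken from this patch; in SCM the supplier would build the summary of all rules and the consumer would be the logger.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not the class added by this PR. */
final class PeriodicStatusLogger {
  private final Supplier<String> summary; // builds one summary line for all rules
  private final Consumer<String> sink;    // receives the line, e.g. LOG::info
  private ScheduledExecutorService executor;
  private ScheduledFuture<?> task;

  PeriodicStatusLogger(Supplier<String> summary, Consumer<String> sink) {
    this.summary = summary;
    this.sink = sink;
  }

  /** Starts logging immediately and then every intervalMs; no-op if running. */
  synchronized void start(long intervalMs) {
    if (executor != null) {
      return;
    }
    executor = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "SafeMode-Status-Log"); // assumed thread name
      t.setDaemon(true); // do not keep the JVM alive just for logging
      return t;
    });
    task = executor.scheduleAtFixedRate(
        this::logOnce, 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  /** Stops the timer; safe to call more than once. */
  synchronized void stop() {
    if (executor == null) {
      return;
    }
    task.cancel(false);
    executor.shutdownNow();
    executor = null;
    task = null;
  }

  /** Emits one summary line; swallows failures so the timer keeps running. */
  void logOnce() {
    try {
      sink.accept(summary.get());
    } catch (RuntimeException e) {
      // An uncaught exception would cancel all future scheduled runs.
      sink.accept("Failed to build safemode summary: " + e);
    }
  }
}
```

With a 60000 ms interval this yields one summary line per minute regardless of when datanode events arrive, which is the behaviour the description asks for.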

What is the link to the Apache JIRA

HDDS-14012

How was this patch tested?

https://github.com/sreejasahithi/ozone/actions/runs/19693260454

sarvekshayr

Thanks @sreejasahithi for working on this.

The PR title has = instead of - in the JIRA ID.

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

sarvekshayr

If you've tested the changes, attach the logs so we can verify the behaviour.

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

errose28 · 2025-12-05T01:10:04Z

Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful.

sreejasahithi · 2025-12-05T05:15:41Z

> Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful.

Yes, I will add the log examples. I just have a couple of changes to make first.

sumitagrawl

@sreejasahithi IMO, we do not need to print the safe mode status periodically; it is already logged when the following events occur:

DN registered
pipeline report
open pipeline
container Ratis/EC registration

SCM processes these events from the DN on heartbeat (HB) and validates the rules. Once they are satisfied, it exits safemode.

We do not need to log again at a regular interval, since the CLI can provide safemode rule info from the leader on demand. For HB, we already have an audit log at SCM that can be referred to for problem analysis.

Maybe we need to support querying safemode status from the CLI on follower nodes as well.

cc: @errose28

errose28 · 2025-12-05T17:00:49Z

> We do not need again at regular interval, its logged in below condition based on event

The information logged here does not duplicate the event-triggered messages from safemodeExitRule#process. Those display information about only their individual rule. Here we propose adding a summary message of all safemode rule statuses, so you can tail a log file with watch + grep for the summary keyword to get a periodic update. This workflow is not currently possible, which makes tailing logs for safemode exit difficult.

> CLI is present to have safemode rule info on need basis from leader

CLI works when you have direct access to the cluster but not for offline analysis where we need to triage an issue from logs.

> May be we need have support query safemode status from CLI as requirement from follower node also.

Yes we should also circle back to HDDS-13832 and get that implemented as well. This is needed for rolling restart scenarios where we want to wait for the restarted follower to exit safemode before restarting another node.

sreejasahithi · 2025-12-05T18:52:34Z

Below is a sample of SCM log messages before the changes made in this patch:

2025-12-05 10:59:29 2025-12-05 05:29:29,934 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 1 DataNodes registered, 5 required.
2025-12-05 10:59:29 2025-12-05 05:29:29,934 [EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule] INFO safemode.SCMSafeModeManager: RatisContainerSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,935 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: OneReplicaPipelineSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,935 [EventQueue-ContainerRegistrationReportForECContainerSafeModeRule] INFO safemode.SCMSafeModeManager: ECContainerSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,967 [aaffa793-f02c-4e5b-b861-dfe90ff67c94@group-EF33F58B837E-StateMachineUpdater] INFO safemode.SCMSafeModeManager: RatisContainerSafeModeRule rule is success
2025-12-05 10:59:29 2025-12-05 05:29:29,999 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 2 DataNodes registered, 5 required.
2025-12-05 10:59:30 2025-12-05 05:29:30,018 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 3 DataNodes registered, 5 required.
2025-12-05 11:00:05 2025-12-05 05:30:05,491 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 4 DataNodes registered, 5 required.
2025-12-05 11:01:36 2025-12-05 05:31:36,079 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 5 DataNodes registered, 5 required.
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: All SCM safe mode pre check rules have passed
2025-12-05 11:01:36 2025-12-05 05:31:36,088 [aaffa793-f02c-4e5b-b861-dfe90ff67c94@group-EF33F58B837E-StateMachineUpdater] INFO safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,383 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: SCM exiting safe mode.

Below is a sample of SCM log messages after the changes made in this patch (the SCM logs will still contain the log messages above):

2025-12-05 10:59:22 2025-12-05 05:29:22,903 [main] INFO safemode.SCMSafeModeManager: Started periodic Safe Mode logging with interval 60000 ms
2025-12-05 10:59:22 2025-12-05 05:29:22,904 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=0/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=0) >= required datanodes (=5)), RatisContainerSafeModeRule(status=waiting, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=waiting, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=waiting, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:00:22 2025-12-05 05:30:22,905 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=3/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=4) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:22 2025-12-05 05:31:22,905 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=3/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=4) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=true}, preCheckComplete=true, validatedRules=4/5, preCheckValidated=1/1, rules=[DataNodeSafeModeRule(status=validated, registered datanodes (=5) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=true}, preCheckComplete=true, validatedRules=5/5, preCheckValidated=1/1, rules=[DataNodeSafeModeRule(status=validated, registered datanodes (=5) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=validated, healthy Ratis/THREE pipelines (=1) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: Stopped periodic Safe Mode logging

errose28

Thanks for working on this. The comparison of the log messages definitely helps show the use case for the improvement, since I think the bottom one is much easier to follow. Can we add a test using a log capturer to check that each safemode rule's getStatusText is being printed periodically while in safemode?
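The check requested above could be sketched roughly as follows. This stand-in uses a `java.util.logging` handler to keep the example self-contained; it is not the log capturer used by the Ozone test suite, and `CapturingHandler` and `countMessagesContainingAll` are hypothetical names. The idea is to capture log lines while in safemode and count how many of them contain every rule's status text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

/** Hypothetical capturing handler; stands in for a real log capturer. */
final class CapturingHandler extends Handler {
  private final List<String> messages = new ArrayList<>();

  @Override
  public synchronized void publish(LogRecord record) {
    String message = record.getMessage();
    if (message != null) {
      messages.add(message);
    }
  }

  @Override
  public void flush() {
    // Nothing is buffered outside the in-memory list.
  }

  @Override
  public void close() {
    // No resources to release.
  }

  /** Counts captured messages that contain every given status text. */
  synchronized int countMessagesContainingAll(List<String> statusTexts) {
    int count = 0;
    for (String message : messages) {
      boolean containsAll = true;
      for (String text : statusTexts) {
        if (!message.contains(text)) {
          containsAll = false;
          break;
        }
      }
      if (containsAll) {
        count++;
      }
    }
    return count;
  }
}
```

A test would attach the handler, wait for at least two log intervals, and assert that the count is at least 2, which distinguishes periodic output from a single message.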

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsConfigKeys.java

errose28

Thanks for the updates. The new log format looks good. Left a few more comments based on the new output and tests.

errose28 · 2026-01-06T17:36:15Z

.../integration-test/src/test/java/org/apache/hadoop/ozone/reconfig/TestScmReconfiguration.java

    Set<String> expected = ImmutableSet.<String>builder()
        .add(OZONE_ADMINISTRATORS)
        .add(OZONE_READONLY_ADMINISTRATORS)
+        .add(HddsConfigKeys.HDDS_SCM_SAFEMODE_LOG_INTERVAL)


The other config values here have dedicated tests in this suite, let's add one for this new property too.

errose28 · 2026-01-06T18:45:27Z

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

+    }
+  }
+
+  private synchronized void stopSafeModePeriodicLogger() {


After stopping the periodic logger I think we should print one last summary message immediately when SCM exits safemode. Otherwise, if we are grepping for the prefix while tailing the logs, we won't have an indication that it finished.

sreejasahithi

Yes, this is already satisfied for the normal exit path in validateSafeModeExitRules:

    if (validatedRules.size() == exitRules.size()
        && status.compareAndSet(SafeModeStatus.PRE_CHECKS_PASSED, SafeModeStatus.OUT_OF_SAFE_MODE)) {
      logSafeModeStatus();  <-- by making a call here

I will add a similar call for force exit as well.
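A small sketch of the "log once on exit" idea for both exit paths, assuming an `AtomicReference`-based status like the `compareAndSet` call quoted above. The class `SafeModeExitSketch`, its method names, and the message format are hypothetical; only the compare-and-set pattern comes from the thread.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical model of the exit transition; not SCMSafeModeManager itself. */
final class SafeModeExitSketch {
  enum Status { INITIAL, PRE_CHECKS_PASSED, OUT_OF_SAFE_MODE }

  private final AtomicReference<Status> status =
      new AtomicReference<>(Status.INITIAL);
  private final Consumer<String> log; // e.g. LOG::info

  SafeModeExitSketch(Consumer<String> log) {
    this.log = log;
  }

  void preChecksPassed() {
    status.compareAndSet(Status.INITIAL, Status.PRE_CHECKS_PASSED);
  }

  /** Normal exit: only the caller that wins the CAS logs the final summary. */
  boolean exitIfAllValidated(int validatedRules, int totalRules) {
    if (validatedRules == totalRules
        && status.compareAndSet(Status.PRE_CHECKS_PASSED, Status.OUT_OF_SAFE_MODE)) {
      logFinalStatus();
      return true;
    }
    return false;
  }

  /** Force exit: logs the final summary unless safemode was already exited. */
  boolean forceExit() {
    Status previous = status.getAndSet(Status.OUT_OF_SAFE_MODE);
    if (previous != Status.OUT_OF_SAFE_MODE) {
      logFinalStatus();
      return true;
    }
    return false;
  }

  private void logFinalStatus() {
    log.accept("SafeMode status: state=" + status.get());
  }
}
```

Because the transition is atomic, repeated or concurrent exit calls produce exactly one final line, so a `grep` on the summary prefix always ends with an `OUT_OF_SAFE_MODE` entry.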

errose28 · 2026-01-06T19:00:48Z

...dds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/safemode/TestSCMSafeModeManager.java

+   * while SCM is in safe mode.
+   */
+  @Test
+  public void testSafeModePeriodicLogging() throws Exception {


This test is good for when SCM is in safemode, but we should also test that the logger is stopped and prints one final message when safemode exits normally or is force exited.

errose28 · 2026-01-06T19:42:39Z

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

+    statusLog.append(String.format(
+        "%nSCM SafeMode Status | state=%s preCheckComplete=%s validatedPreCheckRules=%d/%d validatedRules=%d/%d",
+        safeModeStatus.isInSafeMode() ? 
+            (safeModeStatus.isPreCheckComplete() ? "PRE_CHECKS_PASSED" : "INITIAL") : "OUT_OF_SAFE_MODE",
+        safeModeStatus.isPreCheckComplete(), preCheckValidatedCount, preCheckRules.size(), validatedCount,
+        exitRules.size()));


This is hard to read, can we split this out to just use the StringBuilder instead of nesting String.format and ternary operators?
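One possible shape for the suggested refactor, written as a standalone sketch. The class `SafeModeStatusLine` and its parameters are hypothetical stand-ins for the fields used in the hunk above; the output text mirrors the format string in the diff, with the state name computed in its own method instead of a nested ternary.

```java
/** Hypothetical standalone version of the status header; names are assumed. */
final class SafeModeStatusLine {

  private SafeModeStatusLine() {
  }

  /** Maps the two flags to the state name used in the log line. */
  static String stateName(boolean inSafeMode, boolean preCheckComplete) {
    if (!inSafeMode) {
      return "OUT_OF_SAFE_MODE";
    }
    if (preCheckComplete) {
      return "PRE_CHECKS_PASSED";
    }
    return "INITIAL";
  }

  /** Builds the header with plain appends, one field per line of code. */
  static String header(boolean inSafeMode, boolean preCheckComplete,
      int preCheckValidated, int preCheckTotal, int validated, int total) {
    StringBuilder sb = new StringBuilder();
    sb.append(System.lineSeparator()); // same effect as the leading %n
    sb.append("SCM SafeMode Status | state=")
        .append(stateName(inSafeMode, preCheckComplete));
    sb.append(" preCheckComplete=").append(preCheckComplete);
    sb.append(" validatedPreCheckRules=")
        .append(preCheckValidated).append('/').append(preCheckTotal);
    sb.append(" validatedRules=").append(validated).append('/').append(total);
    return sb.toString();
  }
}
```

Pulling the state mapping into its own method also makes it straightforward to unit test the three states separately from the formatting.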

errose28 · 2026-01-06T19:52:04Z

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

+      if (statusText.endsWith(";")) {
+        statusText = statusText.substring(0, statusText.length() - 1);
+      }


Looks like we can just remove the semicolon from the end of AbstractContainerSafeModeRule#getStatusText instead of stripping it here. Neither I nor Cursor can find a case that depends on the semicolon.

HDDS=14012. SCM needs to log safemode exit rules at regular intervals

4809504

sarvekshayr reviewed Nov 26, 2025

sreejasahithi changed the title ~~HDDS=14012. SCM needs to log safemode exit rules at regular intervals~~ HDDS-14012. SCM needs to log safemode exit rules at regular intervals Nov 26, 2025

Minor fix

24674f7

sreejasahithi requested a review from sarvekshayr November 26, 2025 10:04

sarvekshayr reviewed Nov 26, 2025

sreejasahithi marked this pull request as draft November 26, 2025 10:23

jojochuang requested a review from sumitagrawl December 1, 2025 17:42

sumitagrawl reviewed Dec 5, 2025

Updated safemode status logging

0cd6226

errose28 reviewed Dec 12, 2025

Sreeja Chintalapati added 3 commits December 16, 2025 14:53

Added testcase to verify periodic logging and updated the log message format

1f4075e

Fixed finbugs issue

1586d71

Made SCM safemode log interval dynamically reconfigurable

dda8b43

sreejasahithi requested a review from errose28 December 30, 2025 04:29

Fixed reconfigurableProperties test failure

d85a0a1

sreejasahithi marked this pull request as ready for review January 5, 2026 04:51

errose28 reviewed Jan 6, 2026

Sreeja Chintalapati added 2 commits January 7, 2026 20:17

Added force-exit logging, and added tests for safemode periodic logging(force-exit and normal exit), log interval reconfiguration

c798866

Added javadoc and fixed pmd issue

7cd6a70

sreejasahithi requested a review from errose28 January 7, 2026 15:15
