Skip to content

[Consensus] Add --require-beacon-key flag to fail fast on missing DKG keys at consensus node startup#8410

Open
zhangchiqing wants to merge 16 commits intomasterfrom
leo/verify-dkg-key-present
Open

[Consensus] Add --require-beacon-key flag to fail fast on missing DKG keys at consensus node startup#8410
zhangchiqing wants to merge 16 commits intomasterfrom
leo/verify-dkg-key-present

Conversation

@zhangchiqing
Copy link
Member

@zhangchiqing zhangchiqing commented Feb 12, 2026

Problem

When DKG/random beacon keys are accidentally deleted from /var/flow/data/secrets, consensus nodes continue to start but fail later during vote signing. This makes debugging difficult.

Solution

  Add `--require-beacon-key` flag to consensus node that verifies beacon key existence and validity at startup:
  - Checks key exists in secrets database
  - Verifies key is marked safe
  - Confirms private key matches expected public key from protocol state

  Node fails immediately with clear error if verification fails.

Summary by CodeRabbit

  • New Features

    • Added a startup flag (--require-beacon-key, default: true) to control optional beacon key verification at node startup.
    • Added beacon key verification for DKG participants to ensure stored beacon keys match protocol expectations and surface errors when mismatched or missing.
  • Tests

    • Added a comprehensive test suite covering verification success, missing/unsafe keys, participation cases, and error paths.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds startup-controlled beacon key verification: a new CLI flag and startup module call, plus a dkg package function VerifyBeaconKeyForEpoch with tests that validate DKG participation, key retrieval, safeness, and public-key matching against protocol state.

Changes

Cohort / File(s) Summary
Startup integration
cmd/consensus/main.go
Adds --require-beacon-key flag (default true) and registers a startup module to call beacon key verification at node startup.
DKG verification implementation
module/dkg/verification.go
New function VerifyBeaconKeyForEpoch(...) error that checks current epoch, DKG participation, optionally requires key presence, retrieves safe beacon private key from storage, enforces non-nil/safe key, and verifies public key matches DKG key share; returns descriptive errors and logs outcomes.
Unit tests
module/dkg/verification_test.go
New testify suite covering successful verification, non-participation, key-present/absent paths, unsafe/nil key, public-key mismatch, and various error propagation scenarios.

Sequence Diagram

sequenceDiagram
    participant Startup as Startup Module
    participant Verify as VerifyBeaconKeyForEpoch
    participant Proto as Protocol State
    participant Storage as Beacon Key Storage

    Startup->>Verify: Invoke VerifyBeaconKeyForEpoch(requireKeyPresent)
    Verify->>Proto: GetCurrentEpoch()
    Proto-->>Verify: epoch, counter
    Verify->>Proto: GetDKGForEpoch(epoch, counter)
    Proto-->>Verify: DKG participants & key shares
    Verify->>Verify: Is node participant?
    alt Not participant
        Verify-->>Startup: return nil
    else Participant
        Verify->>Storage: RetrievePrivateKeyForEpoch(epoch)
        Storage-->>Verify: privateKey or error
        alt retrieval error
            Verify-->>Startup: return error (or nil if requireKeyPresent=false)
        else key returned
            Verify->>Verify: Check key non-nil and marked safe
            Verify->>Proto: ExpectedPublicKeyFromKeyShare(nodeID)
            Proto-->>Verify: expectedPublicKey
            Verify->>Verify: Compare public keys
            alt match
                Verify-->>Startup: return nil (success)
            else mismatch
                Verify-->>Startup: return error
            end
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped at dawn to check the key,
Through epochs, shares, and storage tree,
I sniffed for safety, matched the skin,
And thumped my foot when tests begin.
A tiny hop — the node says "ready"!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding a --require-beacon-key flag to consensus nodes for early detection of missing DKG keys at startup.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch leo/verify-dkg-key-present

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 12, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@codecov-commenter
Copy link

codecov-commenter commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 18 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
cmd/consensus/main.go 0.00% 14 Missing ⚠️
module/dkg/verification.go 90.00% 3 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@zhangchiqing zhangchiqing marked this pull request as ready for review February 12, 2026 21:32
@zhangchiqing zhangchiqing requested a review from a team as a code owner February 12, 2026 21:32
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@module/dkg/verification.go`:
- Around line 63-66: The error handling after calling RetrieveMyBeaconPrivateKey
needs to distinguish a missing-key case from other failures: call
RetrieveMyBeaconPrivateKey(epochCounter) as before but inspect the returned err
(e.g., errors.Is(err, secrets.ErrNotFound) or compare to the package's
ErrNotFound) and return a specific "beacon key for epoch %d not found in secrets
database: %w" only for that not-found case; for any other error return a more
general message like "failed to retrieve beacon key for epoch %d: %w"
(referencing RetrieveMyBeaconPrivateKey, epochCounter, err and the package
ErrNotFound symbol) so non-missing DB/I/O errors are not misreported as “not
found.”

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Comment on lines +67 to +69
if !safe {
return fmt.Errorf("beacon key for epoch %d exists but is marked unsafe - cannot participate in consensus", epochCounter)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we enable the flag by default, there is a case to consider this case valid. This case means:

  • the operator transferred over the secrets database properly
  • but the node happened to fail the DKG

There is just nothing we or the operator can do in this case, and the fact that they are Dynamic Bootstrapping doesn't worsen the overall situation.

Copy link
Member Author

@zhangchiqing zhangchiqing Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense. Then I will change the flag to be true by default after some testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to enable by default, and moved the flag check further down, so that regardless it's set true or false, we will run the check, and the flag is just to control weather it would crash or report an error. @jordanschalm can you review again

Copy link
Member Author

@zhangchiqing zhangchiqing Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested with localnet. That setting the --require-beacon-key flag to true and removing the secret folder would cause a crash. And restoring the secret folder will recover the node.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@module/dkg/verification.go`:
- Around line 64-67: The updated error message in the RetrieveMyBeaconPrivateKey
call correctly generalizes failures, so no code change is needed; simply mark
the review as resolved and remove the duplicate review comment referring to
beaconKeys.RetrieveMyBeaconPrivateKey and the epochCounter handling in
verification.go.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
module/dkg/verification.go (1)

72-74: Prefer log.Warn() over log.Error() for the opted-out missing-key path.

log.Error() implies an unexpected, actionable failure, but this code path is reached only when the operator has deliberately set --require-beacon-key=false. A Warn level (or at most Error with a note that this is a deliberate override) is more appropriate here; otherwise, operators who intentionally opt out will see spurious error-level noise on every restart before the key is available.

-			log.Error().Uint64("epoch", epochCounter).
-				Msg("beacon key not found for current epoch, but --require-beacon-key flag is not set, skipping verification failure")
+			log.Warn().Uint64("epoch", epochCounter).
+				Msg("beacon key not found for current epoch, but --require-beacon-key flag is not set, skipping verification failure")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@module/dkg/verification.go` around lines 72 - 74, The current
missing-beacon-key branch in verification.go uses log.Error() which emits
erroneous error-level noise when the operator deliberately set
--require-beacon-key=false; change the log level to log.Warn() (or log.Info() if
you prefer) in the block that references epochCounter so the message reflects an
opted-out condition (update the call that currently starts with
log.Error().Uint64("epoch", epochCounter) to use log.Warn()) and keep the
descriptive message indicating the operator has opted out of the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@module/dkg/verification_test.go`:
- Around line 74-78: TestRequireKeyPresentFalse fails because
VerifyBeaconKeyForEpoch calls dkg.KeyShare unconditionally even when
requireKeyPresent is false; update VerifyBeaconKeyForEpoch (in verification.go)
to not invoke dkg.KeyShare (or any methods on s.beaconKeys / dkg.KeyShare) when
requireKeyPresent is false and return early, or alternatively update the test to
register an expected mock call on dkg.KeyShare; reference
VerifyBeaconKeyForEpoch, requireKeyPresent, and dkg.KeyShare when making the
change so the test TestRequireKeyPresentFalse no longer triggers an unexpected
mock invocation.

In `@module/dkg/verification.go`:
- Around line 23-78: The docstring promises that passing requireKeyPresent=false
skips verification, but VerifyBeaconKeyForEpoch still calls
protocolState.Final().Epochs().Current(), currentEpoch.DKG(),
dkg.KeyShare(nodeID) and beaconKeys.RetrieveMyBeaconPrivateKey(); change the
function to return early when requireKeyPresent == false (before calling
protocolState.Final()/Current()/DKG()/KeyShare) so no mocks are invoked in
TestRequireKeyPresentFalse, and remove the now-dead soft-fail branch that checks
requireKeyPresent inside the storage.ErrNotFound handling; ensure the early
return returns nil to match the documented behavior.

---

Nitpick comments:
In `@module/dkg/verification.go`:
- Around line 72-74: The current missing-beacon-key branch in verification.go
uses log.Error() which emits erroneous error-level noise when the operator
deliberately set --require-beacon-key=false; change the log level to log.Warn()
(or log.Info() if you prefer) in the block that references epochCounter so the
message reflects an opted-out condition (update the call that currently starts
with log.Error().Uint64("epoch", epochCounter) to use log.Warn()) and keep the
descriptive message indicating the operator has opted out of the requirement.

Comment on lines 74 to 78
func (s *VerifyBeaconKeyForEpochSuite) TestRequireKeyPresentFalse() {
// No mocks should be called since we skip verification
err := VerifyBeaconKeyForEpoch(unittest.Logger(), s.nodeID, s.state, s.beaconKeys, false)
require.NoError(s.T(), err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

TestRequireKeyPresentFalse will fail — this is the observable consequence of the implementation/docstring mismatch flagged in verification.go.

dkg.KeyShare has no On(...) registration for this test. Because the implementation calls dkg.KeyShare unconditionally (regardless of requireKeyPresent), testify will hit t.FailNow() on the unexpected invocation. Resolving either Option A or Option B described in the verification.go comment will also fix this test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@module/dkg/verification_test.go` around lines 74 - 78,
TestRequireKeyPresentFalse fails because VerifyBeaconKeyForEpoch calls
dkg.KeyShare unconditionally even when requireKeyPresent is false; update
VerifyBeaconKeyForEpoch (in verification.go) to not invoke dkg.KeyShare (or any
methods on s.beaconKeys / dkg.KeyShare) when requireKeyPresent is false and
return early, or alternatively update the test to register an expected mock call
on dkg.KeyShare; reference VerifyBeaconKeyForEpoch, requireKeyPresent, and
dkg.KeyShare when making the change so the test TestRequireKeyPresentFalse no
longer triggers an unexpected mock invocation.

Comment on lines 23 to 78
// - requireKeyPresent: if false, verification is skipped and the function returns nil immediately
//
// Returns nil if:
// - requireKeyPresent is false (verification skipped), OR
// - the beacon key exists, is safe, and matches the expected public key, OR
// - the node is not a DKG participant for the current epoch (nothing to verify)
//
// This is a binary validation function and all errors indicate that validation failed, which should be interpreted by the upper layer as an exception.
// Returns an error if:
// - the beacon key is missing from storage
// - the beacon key exists but is marked unsafe
// - the beacon key does not match the expected public key
// - any unexpected error occurs while querying state
func VerifyBeaconKeyForEpoch(
log zerolog.Logger,
nodeID flow.Identifier,
protocolState protocol.State,
beaconKeys storage.SafeBeaconKeys,
requireKeyPresent bool,
) error {
// Get current epoch
currentEpoch, err := protocolState.Final().Epochs().Current()
if err != nil {
return fmt.Errorf("could not get current epoch for beacon key verification: %w", err)
}
epochCounter := currentEpoch.Counter()

// Check if we're in the DKG committee for this epoch
dkg, err := currentEpoch.DKG()
if err != nil {
return fmt.Errorf("could not get DKG info for epoch %d: %w", epochCounter, err)
}

// Check if this node is a DKG participant
expectedPubKey, err := dkg.KeyShare(nodeID)
if protocol.IsIdentityNotFound(err) {
log.Info().Uint64("epoch", epochCounter).
Msg("node is not a DKG participant for current epoch, skipping beacon key verification")
return nil
}
if err != nil {
return fmt.Errorf("could not get DKG key share for node %s in epoch %d: %w", nodeID, epochCounter, err)
}

// Verify beacon key exists and is safe
key, safe, err := beaconKeys.RetrieveMyBeaconPrivateKey(epochCounter)
if err != nil {
if errors.Is(err, storage.ErrNotFound) {
if !requireKeyPresent {
log.Error().Uint64("epoch", epochCounter).
Msg("beacon key not found for current epoch, but --require-beacon-key flag is not set, skipping verification failure")
return nil
}
}
return fmt.Errorf("could not retrieve beacon key for epoch %d from secrets database - cannot participate in consensus: %w", epochCounter, err)
}

This comment was marked as resolved.

) error {
// Get current epoch
currentEpoch, err := protocolState.Final().Epochs().Current()
if err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never fail because we should always be able to retrieve current epoch


// Check if we're in the DKG committee for this epoch
dkg, err := currentEpoch.DKG()
if err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should ever fail, because we should always be able to retrieve the DKG committee for the current epoch.

Msg("node is not a DKG participant for current epoch, skipping beacon key verification")
return nil
}
if err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never fail, because we have checked protocol.IsIdentityNotFound error before.

// Verify beacon key exists and is safe
key, safe, err := beaconKeys.RetrieveMyBeaconPrivateKey(epochCounter)
if err != nil {
if errors.Is(err, storage.ErrNotFound) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could possibly fail, that's why we are using a flag to control, but by default, requireKeyPresent is true, which will hit the error, and crash the node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return dkgmodule.VerifyBeaconKeyForEpoch(node.Logger, node.NodeID, node.State, myBeaconKeyStateMachine, requireBeaconKeyOnStartup)
if err := dkgmodule.VerifyBeaconKeyForEpoch(node.Logger, node.NodeID, node.State, myBeaconKeyStateMachine, requireBeaconKeyOnStartup) {
log.Fatal("This node is configured with --require-beacon-key=true (default), but failed to find a valid beacon key for the current epoch on startup. This default check is used as a safety precaution to prevent many Consensus nodes from being started up, all without valid beacon keys, as this can compromise liveness on the network. If you operate more than one Flow Consensus node, contact Flow Foundation operator support for guidance on how to proceed."
}
return nil

}
return fmt.Errorf("could not retrieve beacon key for epoch %d from secrets database - cannot participate in consensus: %w", epochCounter, err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the new changes is that:

  • we always run VerifyBeaconKeyForEpoch
  • we only crash the node if requireKeyPresent==true and we don't have a valid key

Given that, we need a guard above the safe/nil checks (line 79) to log and return nil

Suggested change
if !requireKeyPresent {
// log
// return nil
}

It is valid for an individual node to not have a key or have a key that is invalid (marked unsafe). We need to allow those nodes to start up if the sanity check flag is turned off.

zhangchiqing and others added 3 commits February 18, 2026 13:46
Co-authored-by: Jordan Schalm <jordan.schalm@flowfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments