[Consensus] Add --require-beacon-key flag to fail fast on missing DKG keys at consensus node startup#8410
[Consensus] Add --require-beacon-key flag to fail fast on missing DKG keys at consensus node startup#8410zhangchiqing wants to merge 16 commits intomasterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds startup-controlled beacon key verification: a new CLI flag and startup module call, plus a dkg package function VerifyBeaconKeyForEpoch with tests that validate DKG participation, key retrieval, safeness, and public-key matching against protocol state. Changes
Sequence DiagramsequenceDiagram
participant Startup as Startup Module
participant Verify as VerifyBeaconKeyForEpoch
participant Proto as Protocol State
participant Storage as Beacon Key Storage
Startup->>Verify: Invoke VerifyBeaconKeyForEpoch(requireKeyPresent)
Verify->>Proto: GetCurrentEpoch()
Proto-->>Verify: epoch, counter
Verify->>Proto: GetDKGForEpoch(epoch, counter)
Proto-->>Verify: DKG participants & key shares
Verify->>Verify: Is node participant?
alt Not participant
Verify-->>Startup: return nil
else Participant
Verify->>Storage: RetrievePrivateKeyForEpoch(epoch)
Storage-->>Verify: privateKey or error
alt retrieval error
Verify-->>Startup: return error (or nil if requireKeyPresent=false)
else key returned
Verify->>Verify: Check key non-nil and marked safe
Verify->>Proto: ExpectedPublicKeyFromKeyShare(nodeID)
Proto-->>Verify: expectedPublicKey
Verify->>Verify: Compare public keys
alt match
Verify-->>Startup: return nil (success)
else mismatch
Verify-->>Startup: return error
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@module/dkg/verification.go`:
- Around line 63-66: The error handling after calling RetrieveMyBeaconPrivateKey
needs to distinguish a missing-key case from other failures: call
RetrieveMyBeaconPrivateKey(epochCounter) as before but inspect the returned err
(e.g., errors.Is(err, secrets.ErrNotFound) or compare to the package's
ErrNotFound) and return a specific "beacon key for epoch %d not found in secrets
database: %w" only for that not-found case; for any other error return a more
general message like "failed to retrieve beacon key for epoch %d: %w"
(referencing RetrieveMyBeaconPrivateKey, epochCounter, err and the package
ErrNotFound symbol) so non-missing DB/I/O errors are not misreported as “not
found.”
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
| if !safe { | ||
| return fmt.Errorf("beacon key for epoch %d exists but is marked unsafe - cannot participate in consensus", epochCounter) | ||
| } |
There was a problem hiding this comment.
If we enable the flag by default, there is a case to consider this case valid. This case means:
- the operator transferred over the secrets database properly
- but the node happened to fail the DKG
There is just nothing we or the operator can do in this case, and the fact that they are Dynamic Bootstrapping doesn't worsen the overall situation.
There was a problem hiding this comment.
OK, makes sense. Then I will change the flag to be true by default after some testing.
There was a problem hiding this comment.
I changed to enable by default, and moved the flag check further down, so that regardless it's set true or false, we will run the check, and the flag is just to control weather it would crash or report an error. @jordanschalm can you review again
There was a problem hiding this comment.
I've tested with localnet. That setting the --require-beacon-key flag to true and removing the secret folder would cause a crash. And restoring the secret folder will recover the node.
Co-authored-by: Jordan Schalm <jordan.schalm@flowfoundation.org>
There was a problem hiding this comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@module/dkg/verification.go`:
- Around line 64-67: The updated error message in the RetrieveMyBeaconPrivateKey
call correctly generalizes failures, so no code change is needed; simply mark
the review as resolved and remove the duplicate review comment referring to
beaconKeys.RetrieveMyBeaconPrivateKey and the epochCounter handling in
verification.go.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
module/dkg/verification.go (1)
72-74: Preferlog.Warn()overlog.Error()for the opted-out missing-key path.
log.Error()implies an unexpected, actionable failure, but this code path is reached only when the operator has deliberately set--require-beacon-key=false. AWarnlevel (or at mostErrorwith a note that this is a deliberate override) is more appropriate here; otherwise, operators who intentionally opt out will see spurious error-level noise on every restart before the key is available.- log.Error().Uint64("epoch", epochCounter). - Msg("beacon key not found for current epoch, but --require-beacon-key flag is not set, skipping verification failure") + log.Warn().Uint64("epoch", epochCounter). + Msg("beacon key not found for current epoch, but --require-beacon-key flag is not set, skipping verification failure")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@module/dkg/verification.go` around lines 72 - 74, The current missing-beacon-key branch in verification.go uses log.Error() which emits erroneous error-level noise when the operator deliberately set --require-beacon-key=false; change the log level to log.Warn() (or log.Info() if you prefer) in the block that references epochCounter so the message reflects an opted-out condition (update the call that currently starts with log.Error().Uint64("epoch", epochCounter) to use log.Warn()) and keep the descriptive message indicating the operator has opted out of the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@module/dkg/verification_test.go`:
- Around line 74-78: TestRequireKeyPresentFalse fails because
VerifyBeaconKeyForEpoch calls dkg.KeyShare unconditionally even when
requireKeyPresent is false; update VerifyBeaconKeyForEpoch (in verification.go)
to not invoke dkg.KeyShare (or any methods on s.beaconKeys / dkg.KeyShare) when
requireKeyPresent is false and return early, or alternatively update the test to
register an expected mock call on dkg.KeyShare; reference
VerifyBeaconKeyForEpoch, requireKeyPresent, and dkg.KeyShare when making the
change so the test TestRequireKeyPresentFalse no longer triggers an unexpected
mock invocation.
In `@module/dkg/verification.go`:
- Around line 23-78: The docstring promises that passing requireKeyPresent=false
skips verification, but VerifyBeaconKeyForEpoch still calls
protocolState.Final().Epochs().Current(), currentEpoch.DKG(),
dkg.KeyShare(nodeID) and beaconKeys.RetrieveMyBeaconPrivateKey(); change the
function to return early when requireKeyPresent == false (before calling
protocolState.Final()/Current()/DKG()/KeyShare) so no mocks are invoked in
TestRequireKeyPresentFalse, and remove the now-dead soft-fail branch that checks
requireKeyPresent inside the storage.ErrNotFound handling; ensure the early
return returns nil to match the documented behavior.
---
Nitpick comments:
In `@module/dkg/verification.go`:
- Around line 72-74: The current missing-beacon-key branch in verification.go
uses log.Error() which emits erroneous error-level noise when the operator
deliberately set --require-beacon-key=false; change the log level to log.Warn()
(or log.Info() if you prefer) in the block that references epochCounter so the
message reflects an opted-out condition (update the call that currently starts
with log.Error().Uint64("epoch", epochCounter) to use log.Warn()) and keep the
descriptive message indicating the operator has opted out of the requirement.
| func (s *VerifyBeaconKeyForEpochSuite) TestRequireKeyPresentFalse() { | ||
| // No mocks should be called since we skip verification | ||
| err := VerifyBeaconKeyForEpoch(unittest.Logger(), s.nodeID, s.state, s.beaconKeys, false) | ||
| require.NoError(s.T(), err) | ||
| } |
There was a problem hiding this comment.
TestRequireKeyPresentFalse will fail — this is the observable consequence of the implementation/docstring mismatch flagged in verification.go.
dkg.KeyShare has no On(...) registration for this test. Because the implementation calls dkg.KeyShare unconditionally (regardless of requireKeyPresent), testify will hit t.FailNow() on the unexpected invocation. Resolving either Option A or Option B described in the verification.go comment will also fix this test.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@module/dkg/verification_test.go` around lines 74 - 78,
TestRequireKeyPresentFalse fails because VerifyBeaconKeyForEpoch calls
dkg.KeyShare unconditionally even when requireKeyPresent is false; update
VerifyBeaconKeyForEpoch (in verification.go) to not invoke dkg.KeyShare (or any
methods on s.beaconKeys / dkg.KeyShare) when requireKeyPresent is false and
return early, or alternatively update the test to register an expected mock call
on dkg.KeyShare; reference VerifyBeaconKeyForEpoch, requireKeyPresent, and
dkg.KeyShare when making the change so the test TestRequireKeyPresentFalse no
longer triggers an unexpected mock invocation.
| // - requireKeyPresent: if false, verification is skipped and the function returns nil immediately | ||
| // | ||
| // Returns nil if: | ||
| // - requireKeyPresent is false (verification skipped), OR | ||
| // - the beacon key exists, is safe, and matches the expected public key, OR | ||
| // - the node is not a DKG participant for the current epoch (nothing to verify) | ||
| // | ||
| // This is a binary validation function and all errors indicate that validation failed, which should be interpreted by the upper layer as an exception. | ||
| // Returns an error if: | ||
| // - the beacon key is missing from storage | ||
| // - the beacon key exists but is marked unsafe | ||
| // - the beacon key does not match the expected public key | ||
| // - any unexpected error occurs while querying state | ||
| func VerifyBeaconKeyForEpoch( | ||
| log zerolog.Logger, | ||
| nodeID flow.Identifier, | ||
| protocolState protocol.State, | ||
| beaconKeys storage.SafeBeaconKeys, | ||
| requireKeyPresent bool, | ||
| ) error { | ||
| // Get current epoch | ||
| currentEpoch, err := protocolState.Final().Epochs().Current() | ||
| if err != nil { | ||
| return fmt.Errorf("could not get current epoch for beacon key verification: %w", err) | ||
| } | ||
| epochCounter := currentEpoch.Counter() | ||
|
|
||
| // Check if we're in the DKG committee for this epoch | ||
| dkg, err := currentEpoch.DKG() | ||
| if err != nil { | ||
| return fmt.Errorf("could not get DKG info for epoch %d: %w", epochCounter, err) | ||
| } | ||
|
|
||
| // Check if this node is a DKG participant | ||
| expectedPubKey, err := dkg.KeyShare(nodeID) | ||
| if protocol.IsIdentityNotFound(err) { | ||
| log.Info().Uint64("epoch", epochCounter). | ||
| Msg("node is not a DKG participant for current epoch, skipping beacon key verification") | ||
| return nil | ||
| } | ||
| if err != nil { | ||
| return fmt.Errorf("could not get DKG key share for node %s in epoch %d: %w", nodeID, epochCounter, err) | ||
| } | ||
|
|
||
| // Verify beacon key exists and is safe | ||
| key, safe, err := beaconKeys.RetrieveMyBeaconPrivateKey(epochCounter) | ||
| if err != nil { | ||
| if errors.Is(err, storage.ErrNotFound) { | ||
| if !requireKeyPresent { | ||
| log.Error().Uint64("epoch", epochCounter). | ||
| Msg("beacon key not found for current epoch, but --require-beacon-key flag is not set, skipping verification failure") | ||
| return nil | ||
| } | ||
| } | ||
| return fmt.Errorf("could not retrieve beacon key for epoch %d from secrets database - cannot participate in consensus: %w", epochCounter, err) | ||
| } |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| ) error { | ||
| // Get current epoch | ||
| currentEpoch, err := protocolState.Final().Epochs().Current() | ||
| if err != nil { |
There was a problem hiding this comment.
This should never fail because we should always be able to retrieve current epoch
|
|
||
| // Check if we're in the DKG committee for this epoch | ||
| dkg, err := currentEpoch.DKG() | ||
| if err != nil { |
There was a problem hiding this comment.
This should ever fail, because we should always be able to retrieve the DKG committee for the current epoch.
| Msg("node is not a DKG participant for current epoch, skipping beacon key verification") | ||
| return nil | ||
| } | ||
| if err != nil { |
There was a problem hiding this comment.
This should never fail, because we have checked protocol.IsIdentityNotFound error before.
| // Verify beacon key exists and is safe | ||
| key, safe, err := beaconKeys.RetrieveMyBeaconPrivateKey(epochCounter) | ||
| if err != nil { | ||
| if errors.Is(err, storage.ErrNotFound) { |
There was a problem hiding this comment.
This could possibly fail, that's why we are using a flag to control, but by default, requireKeyPresent is true, which will hit the error, and crash the node.
cmd/consensus/main.go
Outdated
There was a problem hiding this comment.
| return dkgmodule.VerifyBeaconKeyForEpoch(node.Logger, node.NodeID, node.State, myBeaconKeyStateMachine, requireBeaconKeyOnStartup) | |
| if err := dkgmodule.VerifyBeaconKeyForEpoch(node.Logger, node.NodeID, node.State, myBeaconKeyStateMachine, requireBeaconKeyOnStartup) { | |
| log.Fatal("This node is configured with --require-beacon-key=true (default), but failed to find a valid beacon key for the current epoch on startup. This default check is used as a safety precaution to prevent many Consensus nodes from being started up, all without valid beacon keys, as this can compromise liveness on the network. If you operate more than one Flow Consensus node, contact Flow Foundation operator support for guidance on how to proceed." | |
| } | |
| return nil |
| } | ||
| return fmt.Errorf("could not retrieve beacon key for epoch %d from secrets database - cannot participate in consensus: %w", epochCounter, err) | ||
| } | ||
|
|
There was a problem hiding this comment.
My understanding of the new changes is that:
- we always run
VerifyBeaconKeyForEpoch - we only crash the node if
requireKeyPresent==trueand we don't have a valid key
Given that, we need a guard above the safe/nil checks (line 79) to log and return nil
| if !requireKeyPresent { | |
| // log | |
| // return nil | |
| } | |
It is valid for an individual node to not have a key or have a key that is invalid (marked unsafe). We need to allow those nodes to start up if the sanity check flag is turned off.
Co-authored-by: Jordan Schalm <jordan.schalm@flowfoundation.org>
Problem
When DKG/random beacon keys are accidentally deleted from
/var/flow/data/secrets, consensus nodes continue to start but fail later during vote signing. This makes debugging difficult.Solution
Summary by CodeRabbit
New Features
Tests