fix(): Add validation and retry logic for VPN key rotation race condition #453
Description
This PR fixes a critical race condition during concurrent VPN key rotation and gateway certificate recycling operations that caused ~129 errors per rotation cycle.
Problem Statement
During VPN key rotation, the system was triggering FSM (Finite State Machine) operations before gateway pods had finished reloading their certificates and establishing tunnel connectivity. This resulted in:
The system eventually recovered after ~9 minutes through Kubernetes reconciliation, but this caused significant operational noise and degraded service during the rotation window.
Root Cause
The VpnKeyRotation controller was calling GetPeerGwPodName() and triggering FSM before:
Solution
This PR implements a three-layer validation approach:
GetPeerGwPodName() with detailed error messages that include context
ValidateGatewayPodReadiness() function that validates:
Changes
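As a rough illustration of the readiness validation described under Solution, the sketch below gates an FSM trigger on a few preconditions. It is not the code from this PR: the `GatewayPodStatus` type, its fields, `ErrPodNotReady`, and the lowercase `validateGatewayPodReadiness` helper are all hypothetical, and the three checks are assumptions drawn from the problem statement (pod ready, certificate reloaded, tunnel connectivity established).

```go
package main

import (
	"errors"
	"fmt"
)

// GatewayPodStatus is a hypothetical, simplified view of a gateway pod.
// The real controller works with Kubernetes objects; these fields are
// assumptions that mirror the conditions named in the problem statement.
type GatewayPodStatus struct {
	Name         string
	Ready        bool // pod reports ready
	CertReloaded bool // pod has reloaded the rotated certificate
	TunnelUp     bool // tunnel connectivity is established
}

// ErrPodNotReady lets callers tell "retry later" apart from hard failures.
var ErrPodNotReady = errors.New("gateway pod not ready")

// validateGatewayPodReadiness is an illustrative stand-in, not the PR's
// function. It returns nil only when every precondition holds; otherwise
// the error names the pod and the first check that failed.
func validateGatewayPodReadiness(pod GatewayPodStatus) error {
	switch {
	case pod.Name == "":
		return fmt.Errorf("%w: pod name is empty", ErrPodNotReady)
	case !pod.Ready:
		return fmt.Errorf("%w: pod %q is not ready", ErrPodNotReady, pod.Name)
	case !pod.CertReloaded:
		return fmt.Errorf("%w: pod %q has not reloaded its certificate", ErrPodNotReady, pod.Name)
	case !pod.TunnelUp:
		return fmt.Errorf("%w: pod %q has no tunnel connectivity", ErrPodNotReady, pod.Name)
	}
	return nil
}

func main() {
	pod := GatewayPodStatus{Name: "gw-0", Ready: true, CertReloaded: false}
	if err := validateGatewayPodReadiness(pod); err != nil {
		// The caller would skip the FSM trigger and try again later.
		fmt.Println("skip FSM trigger:", err)
	}
}
```

Wrapping a sentinel error with `%w` keeps the message specific while still letting a caller use `errors.Is` to decide whether the failure is worth retrying.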
Modified Files:
controllers/slicegateway/utils.go: Enhanced GetPeerGwPodName(), added ValidateGatewayPodReadiness()
pkg/hub/controllers/vpnkeyrotation/reconciler.go: Added validation and retry logic
New Files:
controllers/slicegateway/utils_validation_test.go: Comprehensive unit tests (7 test cases)
Impact
Fixes #(issue-number)
How Has This Been Tested?
Unit Tests
TestValidateGatewayPodReadiness - 7 test cases covering all validation scenarios:
Integration Testing Plan
Test Environment Details
Checklist:
go fmt
Does this PR introduce a breaking change?
NO - This PR does not introduce any breaking changes.
This is a backward-compatible bug fix that:
Additional Notes
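For readers unfamiliar with the pattern, the retry logic added to the reconciler can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: `retryUntilReady`, `simulate`, the attempt count, and the doubling delay are all hypothetical choices made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryUntilReady (hypothetical helper) calls check up to maxAttempts times,
// doubling the delay after each failed attempt, and stops early if ctx is
// cancelled. The last check error is wrapped in the returned error.
func retryUntilReady(ctx context.Context, maxAttempts int, delay time.Duration, check func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = check(); lastErr == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up after %d attempt(s): %w", attempt, ctx.Err())
		case <-time.After(delay):
		}
		delay *= 2
	}
	return fmt.Errorf("not ready after %d attempt(s): %w", maxAttempts, lastErr)
}

// simulate models a gateway pod that becomes ready on check number
// readyAfter, and reports how many checks were made.
func simulate(readyAfter, maxAttempts int) (int, error) {
	calls := 0
	err := retryUntilReady(context.Background(), maxAttempts, time.Millisecond, func() error {
		calls++
		if calls < readyAfter {
			return errors.New("tunnel not established")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulate(3, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```

In a controller-runtime reconciler, a common alternative to blocking in-process is to return a result that requeues the request after a delay; which of the two approaches this PR takes is not stated in the description.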
Error Patterns Fixed
All of the following error patterns will no longer appear during VPN key rotation: