Description
Problem
When the gateway controller restarts, the policy engine reconnects successfully but doesn't receive any configuration updates. APIs remain broken until someone manually deploys a new API or makes a change.
Steps to Reproduce:
- Deploy an API successfully
- Restart the gateway controller
- Policy engine reconnects but has no config
- API requests return 404 (broken)
- Only after deploying a NEW API does everything work again
Root Cause (Simple Explanation)
The xDS protocol uses version numbers to track changes. When the controller restarts:
- Controller: Loads configs from database, creates snapshot with version "1"
- Policy Engine: Still has version "1" in memory from before the restart
- xDS Protocol: Compares versions → "1" equals "1" → "no update needed"
- Result: Policy engine never gets the configs until the version changes
The version counter resets to 0 on every controller restart, so the first snapshot after a restart is again version "1" and collides with the version the policy engine already holds.
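The collision can be reproduced in isolation. The sketch below is a simplified model, not the go-control-plane implementation: `controller`, `publish`, and `needsUpdate` are hypothetical names, and the comparison is reduced to plain string equality as described in the steps above.

```go
package main

import "fmt"

// controller is a hypothetical stand-in for the gateway controller.
// It tracks only the in-memory snapshot counter.
type controller struct {
	counter int64
}

// newController models a (re)start: the counter begins at 0 again.
func newController() *controller { return &controller{} }

// publish creates a new snapshot and returns its version string.
func (c *controller) publish() string {
	c.counter++
	return fmt.Sprintf("%d", c.counter)
}

// needsUpdate models the simplified comparison from the root cause:
// an update is sent only when the two versions differ.
func needsUpdate(clientVersion, snapshotVersion string) bool {
	return clientVersion != snapshotVersion
}

func main() {
	before := newController()
	clientVersion := before.publish() // policy engine now holds "1"

	after := newController()           // controller restarts, counter is 0 again
	snapshotVersion := after.publish() // configs reloaded from the database: "1" again

	fmt.Printf("client=%q snapshot=%q update=%v\n",
		clientVersion, snapshotVersion, needsUpdate(clientVersion, snapshotVersion))
	// prints: client="1" snapshot="1" update=false
}
```

Publishing one more snapshot after the restart yields "2", which is why deploying a new API restores service.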
How to Fix
Simple Solution (Recommended)
Make snapshot versions unique across restarts by including a timestamp.
Change version format from:
"1", "2", "3" (just a counter)
To:
"1739750400-1", "1739750400-2" (timestamp + counter)
Files to Change:
- gateway/gateway-controller/pkg/storage/memory.go
  - Add startupTimestamp field to store when the controller started
  - Change IncrementSnapshotVersion() to return "{timestamp}-{counter}" instead of just counter
  - Change return type from int64 to string
- gateway/gateway-controller/pkg/xds/snapshot.go
  - Use the string version directly (already compatible with string versions)
- gateway/gateway-controller/pkg/api/handlers/handlers.go
  - Update status callback to accept string version instead of int64
Why This Works:
- Every restart gets a new timestamp
- Versions are always unique across restarts
- Policy engine sees different version → gets update immediately
- xDS protocol supports string versions natively
Alternative Quick Fix (Client-Side)
Reset policy engine's version memory on every reconnection.
File: gateway/gateway-runtime/policy-engine/internal/xdsclient/client.go
Add after line 258 (after c.setState(StateConnected)):
// Reset versions to force full sync
c.mu.Lock()
c.policyChainVersion = ""
c.apiKeyVersion = ""
c.lazyResourceVersion = ""
c.mu.Unlock()
Trade-off: Simple fix, but policy engine reprocesses everything on every reconnection.
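The reset could also live in a small helper so it is easy to call from the reconnect path and to unit test. This is a sketch under assumptions: the `Client` struct here contains only the three version fields named above, and `resetVersions` and `onConnected` are hypothetical names rather than existing methods in client.go.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical, reduced xDS client holding only the
// version fields referenced in the snippet above.
type Client struct {
	mu                  sync.Mutex
	policyChainVersion  string
	apiKeyVersion       string
	lazyResourceVersion string
}

// resetVersions clears every remembered version so the next response
// from the controller is treated as new, whatever its version string.
func (c *Client) resetVersions() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.policyChainVersion = ""
	c.apiKeyVersion = ""
	c.lazyResourceVersion = ""
}

// onConnected is a hypothetical hook standing in for the point right
// after c.setState(StateConnected).
func (c *Client) onConnected() {
	c.resetVersions()
}

func main() {
	c := &Client{policyChainVersion: "1", apiKeyVersion: "1", lazyResourceVersion: "1"}
	c.onConnected()
	fmt.Printf("%q %q %q\n", c.policyChainVersion, c.apiKeyVersion, c.lazyResourceVersion)
	// prints: "" "" ""
}
```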
Testing the Fix
# 1. Deploy an API
# --data-binary sends the file as-is; -d would strip the newlines and break the YAML
curl -X POST http://localhost:9090/apis --data-binary @test-api.yaml
# 2. Verify it works
curl http://localhost:8080/test-path
# Should return 200 OK
# 3. Restart controller
docker compose restart gateway-controller
# 4. Wait 5 seconds for reconnection
sleep 5
# 5. Test API again (WITHOUT redeploying)
curl http://localhost:8080/test-path
# Should STILL return 200 OK (this is the fix!)
Priority
High - This breaks all deployed APIs on controller restart, requiring manual intervention to restore service.
Additional Context
- The gateway controller uses go-control-plane's State-of-the-World xDS protocol
- Version comparison is done by the go-control-plane library
- This only affects persistent mode (when configs are stored in database)
- Fresh deployments work fine - only restarts are affected