Gateway Controller Restart may not update Policy Engine/Router in some cases #1196

@renuka-fernando

Description

Problem

When the gateway controller restarts, the policy engine reconnects successfully but doesn't receive any configuration updates. APIs remain broken until someone manually deploys a new API or makes a change.

Steps to Reproduce:

  1. Deploy an API successfully
  2. Restart the gateway controller
  3. Policy engine reconnects but has no config
  4. API requests return 404 (broken)
  5. Only after deploying a NEW API does everything work again

Root Cause (Simple Explanation)

The xDS protocol uses version numbers to track changes. When the controller restarts:

  1. Controller: Loads configs from database, creates snapshot with version "1"
  2. Policy Engine: Still has version "1" in memory from before the restart
  3. xDS Protocol: Compares versions → "1" equals "1" → "no update needed"
  4. Result: Policy engine never gets the configs until version changes

The version counter resets to 0 on every controller restart, so the first snapshot after a restart reuses version "1" and collides with the version the policy engine already holds.
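
The collision described above can be reproduced in a few lines. This is a minimal sketch, not the controller's actual code: the `controller` type, `nextVersion`, and `needsUpdate` are hypothetical names, and the comparison is a simplification of the version check that go-control-plane performs.

```go
package main

import (
	"fmt"
	"strconv"
)

// controller is a hypothetical stand-in for the gateway controller's
// snapshot versioning: an in-memory counter that starts at 0 on every boot.
type controller struct {
	counter int64
}

// nextVersion increments the counter and returns it as a string version.
func (c *controller) nextVersion() string {
	c.counter++
	return strconv.FormatInt(c.counter, 10)
}

// needsUpdate is a simplified model of the state-of-the-world check: an
// update is sent only when the snapshot version differs from the version
// the client last acknowledged.
func needsUpdate(clientVersion, serverVersion string) bool {
	return clientVersion != serverVersion
}

func main() {
	before := &controller{}
	clientVersion := before.nextVersion() // policy engine acknowledges "1"

	after := &controller{}               // restart: counter is back at 0
	serverVersion := after.nextVersion() // first snapshot is "1" again

	fmt.Println(needsUpdate(clientVersion, serverVersion)) // false: update skipped
}
```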

How to Fix

Simple Solution (Recommended)

Make snapshot versions unique across restarts by including a timestamp.

Change version format from:

  • "1", "2", "3" (just a counter)

To:

  • "1739750400-1", "1739750400-2" (timestamp + counter)

Files to Change:

  1. gateway/gateway-controller/pkg/storage/memory.go

    • Add startupTimestamp field to store when the controller started
    • Change IncrementSnapshotVersion() to return "{timestamp}-{counter}" instead of just counter
    • Change return type from int64 to string

  2. gateway/gateway-controller/pkg/xds/snapshot.go

    • Use the string version directly (already compatible with string versions)

  3. gateway/gateway-controller/pkg/api/handlers/handlers.go

    • Update status callback to accept string version instead of int64

Why This Works:

  • Every restart gets a new timestamp
  • Versions are always unique across restarts
  • Policy engine sees different version → gets update immediately
  • xDS protocol supports string versions natively

Alternative Quick Fix (Client-Side)

Reset policy engine's version memory on every reconnection.

File: gateway/gateway-runtime/policy-engine/internal/xdsclient/client.go

Add after line 258 (after c.setState(StateConnected)):

// Reset versions to force full sync
c.mu.Lock()
c.policyChainVersion = ""
c.apiKeyVersion = ""
c.lazyResourceVersion = ""
c.mu.Unlock()

Trade-off: Simple fix, but policy engine reprocesses everything on every reconnection.
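
The effect of the reset can be sketched in isolation. The three version fields and `mu` are taken from the snippet above. The `client` type, `ack`, `versionInfo`, and `serverResponds` are hypothetical, and the server-side check is a simplified model rather than go-control-plane's actual logic.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical reduction of the policy engine's xDS client.
type client struct {
	mu                  sync.Mutex
	policyChainVersion  string
	apiKeyVersion       string
	lazyResourceVersion string
}

// ack records the policy chain version the client has applied.
func (c *client) ack(version string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.policyChainVersion = version
}

// versionInfo returns the version the client would report in its next request.
func (c *client) versionInfo() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.policyChainVersion
}

// resetVersions clears all remembered versions to force a full sync.
func (c *client) resetVersions() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.policyChainVersion = ""
	c.apiKeyVersion = ""
	c.lazyResourceVersion = ""
}

// serverResponds models the server answering only when the versions differ.
func serverResponds(requestVersion, snapshotVersion string) bool {
	return requestVersion != snapshotVersion
}

func main() {
	c := &client{}
	c.ack("1") // state before the controller restart

	fmt.Println(serverResponds(c.versionInfo(), "1")) // false: collision

	c.resetVersions() // what the reconnect hook would do
	fmt.Println(serverResponds(c.versionInfo(), "1")) // true: full sync
}
```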

Testing the Fix

# 1. Deploy an API
# --data-binary preserves the YAML newlines that -d would strip
curl -X POST http://localhost:9090/apis --data-binary @test-api.yaml

# 2. Verify it works
curl http://localhost:8080/test-path
# Should return 200 OK

# 3. Restart controller
docker compose restart gateway-controller

# 4. Wait 5 seconds for reconnection
sleep 5

# 5. Test API again (WITHOUT redeploying)
curl http://localhost:8080/test-path
# Should STILL return 200 OK (this is the fix!)

Priority

High - This breaks all deployed APIs on controller restart, requiring manual intervention to restore service.

Additional Context

  • The gateway controller uses go-control-plane's State-of-the-World xDS protocol
  • Version comparison is done by the go-control-plane library
  • This only affects persistent mode (when configs are stored in database)
  • Fresh deployments work fine - only restarts are affected
