feat: tunnel resilience - fatal error detection, reachability gate, pre-restart check, watchdog jitter #77

Merged: dzianisv merged 2 commits into main from feature/issue-62-tunnel-resilience on Feb 14, 2026


@dzianisv (Owner) commented Feb 14, 2026

Summary

  • Jittered watchdog interval: Replace the fixed setInterval(30s) with a self-scheduling setTimeout(30s + random 0-5s jitter) so watchdog ticks do not line up into a thundering herd with launchd ThrottleInterval
  • Fatal error detection: Parse cloudflared stderr for unrecoverable errors (unauthorized, tunnel not found, invalid credentials, etc.) and halt restarts immediately instead of burning through the circuit breaker
  • Re-check before restart: Make a final checkConnected() call right before killing the tunnel process; if the tunnel reconnected during the failure window, skip the restart
  • Reachability-gated endpoints.json: Write the tunnel URL to endpoints.json only after verifyReachable() confirms the URL is actually accessible, so stale or broken URLs are not published
  • CLI health display: Show fatal error details in the opencode-manager health output
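
To illustrate the fatal error detection item above, here is a minimal sketch rather than the code from this PR. The regexes are placeholders: the summary names the error classes (unauthorized, tunnel not found, invalid credentials) but not the exact strings cloudflared prints. `findFatalError`, `onStderr`, and `WatchdogState` are hypothetical names introduced for the example.

```typescript
// Placeholder patterns: the PR lists these error classes, not the exact
// cloudflared log strings, so these regexes are assumptions.
const FATAL_PATTERNS: RegExp[] = [
  /unauthorized/i,
  /tunnel not found/i,
  /invalid credentials/i,
];

/** Returns the first stderr line that matches a fatal pattern, or null. */
function findFatalError(stderrChunk: string): string | null {
  for (const line of stderrChunk.split(/\r?\n/)) {
    if (FATAL_PATTERNS.some((pattern) => pattern.test(line))) {
      return line.trim();
    }
  }
  return null;
}

// Hypothetical watchdog state; the real service may track more fields.
interface WatchdogState {
  fatalError: string | null;
  halted: boolean;
}

/** Halts restarts when a fatal line is seen; other output leaves state as-is. */
function onStderr(state: WatchdogState, chunk: string): WatchdogState {
  const fatal = findFatalError(chunk);
  return fatal ? { fatalError: fatal, halted: true } : state;
}
```

Keeping the matched line in `fatalError` is what would let a health command show the reason restarts stopped.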

Testing Done

  • Build passes: pnpm build
  • All 258 unit tests pass: pnpm test (including 17 tunnel-service tests)

Issue

Closes #62

engineer added 2 commits February 14, 2026 12:11
…e-check before restart, watchdog jitter (#62)

- Fix 1: Parse cloudflared stderr for fatal errors (unauthorized, tunnel not
  found, invalid credentials, etc.) and halt watchdog immediately instead of
  wasting restart cycles on unrecoverable failures
- Fix 2: Gate endpoints.json update on verifyReachable() — tunnel URL is only
  published after confirming the tunnel is actually reachable
- Fix 4a: Re-check ha_connections immediately before doRestart() to avoid
  killing a tunnel that recovered during the threshold window
- Fix 4b: Replace setInterval with self-scheduling setTimeout + random jitter
  (0-5s) to prevent thundering herd when multiple instances run
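
Fixes 4a and 4b can be sketched together as follows. This is an illustration under assumptions, not the merged implementation: the `TunnelOps` interface, `restartIfStillDown`, `nextDelayMs`, and `startWatchdog` are hypothetical names, and the real `checkConnected()` and `doRestart()` signatures may differ.

```typescript
const BASE_INTERVAL_MS = 30000; // 30s base interval, as stated in the PR
const MAX_JITTER_MS = 5000; // 0-5s random jitter, as stated in the PR

/** Delay until the next watchdog tick: base interval plus random jitter. */
function nextDelayMs(random: () => number = Math.random): number {
  return BASE_INTERVAL_MS + Math.floor(random() * MAX_JITTER_MS);
}

// Hypothetical dependency shape for the example.
interface TunnelOps {
  checkConnected(): Promise<boolean>;
  doRestart(): Promise<void>;
}

/** Final re-check right before restarting; resolves true if a restart ran. */
async function restartIfStillDown(ops: TunnelOps): Promise<boolean> {
  if (await ops.checkConnected()) {
    return false; // tunnel recovered during the threshold window
  }
  await ops.doRestart();
  return true;
}

/** Self-scheduling loop: every tick schedules the next one with fresh jitter. */
function startWatchdog(tick: () => Promise<void>): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const schedule = (): void => {
    timer = setTimeout(async () => {
      try {
        await tick();
      } catch (err) {
        console.error("watchdog tick failed", err);
      } finally {
        if (!stopped) schedule();
      }
    }, nextDelayMs());
  };
  schedule();
  // The returned function stops the loop.
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

Unlike setInterval, the next tick is scheduled only after the current one finishes, so a slow check cannot overlap with the following one.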
…opulated

If verifyReachable() fails initially (slow tunnel startup), retry up to 5
times at 10s intervals. If all retries fail, write the endpoint anyway to
satisfy the requirement that endpoints.json MUST contain a tunnel URL.
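
The retry behavior described in this commit message might look roughly like the sketch below. It assumes "retry up to 5 times" means one initial check plus at most five retries. `PublishDeps`, `writeEndpoint`, `sleep`, and `publishTunnelUrl` are hypothetical names, not taken from the repository.

```typescript
// Hypothetical dependencies, injected so the logic can run without a tunnel.
interface PublishDeps {
  verifyReachable(url: string): Promise<boolean>;
  writeEndpoint(url: string): Promise<void>; // stands in for the endpoints.json update
  sleep(ms: number): Promise<void>;
}

const MAX_RETRIES = 5; // from the commit message
const RETRY_DELAY_MS = 10000; // 10s between retries, from the commit message

/**
 * Checks reachability, retrying while the tunnel starts up, then writes the
 * endpoint whether or not a check succeeded. Resolves to the last check result.
 */
async function publishTunnelUrl(url: string, deps: PublishDeps): Promise<boolean> {
  let reachable = await deps.verifyReachable(url);
  for (let retry = 0; !reachable && retry < MAX_RETRIES; retry++) {
    await deps.sleep(RETRY_DELAY_MS);
    reachable = await deps.verifyReachable(url);
  }
  // Written even when every check failed, so endpoints.json always has a tunnel URL.
  await deps.writeEndpoint(url);
  return reachable;
}
```

Returning the final check result lets a caller log that the URL was published without being verified.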

@github-actions posted E2E test recordings (screencasts not preserved):

  • Run #22023558352 | Commit 2604f56: Push Browser E2E, Settings E2E, Browser E2E
  • Run #22023582069 | Commit edab165: Browser E2E, Settings E2E, Push Browser E2E

@dzianisv dzianisv merged commit 4e88da4 into main Feb 14, 2026
5 checks passed
@dzianisv dzianisv deleted the feature/issue-62-tunnel-resilience branch February 14, 2026 23:30
Linked issue

Cloudflare tunnel resilience: circuit breaker, reachability check, PID guard, watchdog improvements