
feat: add spawn circuit breaker, lifecycle logging, and heartbeat auto-recovery#198

Open
vltbaudbot wants to merge 1 commit into modem-dev:main from vltbaudbot:fix/spawn-resilience-and-auto-recovery

Conversation

@vltbaudbot
Contributor

Summary

Addresses remaining resilience items from #185 — the three highest-value improvements that don't require pi core changes.

1. Spawn Circuit Breaker (agent-spawn.ts)

After 3 consecutive spawn failures, the circuit opens and rejects new spawn attempts for 5 minutes. This prevents resource waste when spawns are failing systemically (missing API keys, model unavailability, etc.).

closed → open (after 3 consecutive failures) → half-open (after 5-minute cooldown) → closed (on successful probe)

Failure types that trip the circuit:

  • tmux spawn exit code ≠ 0
  • Readiness timeout (alias/socket never appeared)
  • Readiness aborted (signal cancelled)

Validation errors (bad name, missing model, etc.) do NOT trip the circuit.
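The state machine above is small enough to sketch end-to-end. A minimal, self-contained illustration — function names follow those mentioned in the review (`isCircuitOpen`, `recordFailure`), but the internals here are a guess, not the PR's actual implementation:

```typescript
type CircuitState = "closed" | "open" | "half-open";

const FAILURE_THRESHOLD = 3;
const COOLDOWN_MS = 5 * 60 * 1000;

interface Circuit {
  state: CircuitState;
  consecutiveFailures: number;
  lastFailureAt: number | null;
}

// Returns true while the circuit is rejecting spawns. Once the cooldown has
// elapsed, transitions to half-open and lets a single probe attempt through.
function isCircuitOpen(circuit: Circuit, now = Date.now()): boolean {
  if (circuit.state !== "open") return false;
  if (circuit.lastFailureAt !== null && now - circuit.lastFailureAt >= COOLDOWN_MS) {
    circuit.state = "half-open";
    return false;
  }
  return true;
}

function recordFailure(circuit: Circuit, now = Date.now()): void {
  circuit.consecutiveFailures += 1;
  circuit.lastFailureAt = now;
  if (circuit.consecutiveFailures >= FAILURE_THRESHOLD) circuit.state = "open";
}

function recordSuccess(circuit: Circuit): void {
  circuit.state = "closed";
  circuit.consecutiveFailures = 0;
  circuit.lastFailureAt = null;
}
```

Validation errors simply never call `recordFailure`, which is how they avoid affecting the circuit.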

2. Worker Lifecycle Logging (agent-spawn.ts)

All spawn events are appended to ~/.pi/agent/logs/worker-lifecycle.jsonl:

| Event | When |
| --- | --- |
| `spawn_started` | Spawn attempt begins |
| `spawn_success` | Readiness verified (includes `ready_after_ms`) |
| `spawn_failed` | tmux error, readiness timeout, or abort |
| `circuit_rejected` | Spawn refused by open circuit |

New spawn_status tool exposes circuit breaker state and recent lifecycle events for observability.
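A hedged sketch of the JSONL append and a `spawn_status`-style tail. The real file lives at `~/.pi/agent/logs/worker-lifecycle.jsonl`; the path is a parameter here so the sketch is easy to exercise, and the field names beyond those listed above are assumptions:

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync } from "node:fs";
import { dirname } from "node:path";

type LifecycleEvent = "spawn_started" | "spawn_success" | "spawn_failed" | "circuit_rejected";

// Append one event as a single JSON line (JSONL).
function logLifecycle(path: string, event: LifecycleEvent, fields: Record<string, unknown> = {}): void {
  mkdirSync(dirname(path), { recursive: true });
  const entry = { timestamp: new Date().toISOString(), event, ...fields };
  appendFileSync(path, JSON.stringify(entry) + "\n");
}

// What a spawn_status-style reader might do: parse the last N lines.
function recentEvents(path: string, n = 20): Array<Record<string, unknown>> {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf-8")
    .trimEnd()
    .split("\n")
    .slice(-n)
    .map((line) => JSON.parse(line));
}
```

Reading the tail with `readFileSync` rather than shelling out keeps the reader portable, which is also what the review below recommends.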

3. Heartbeat Auto-Recovery (heartbeat.ts)

Before prompting the control-agent about failures, the heartbeat now attempts automatic recovery for two safe, idempotent failure types:

| Failure | Auto-Recovery Action |
| --- | --- |
| Bridge down | Kill bridge tmux session → clear port 7890 → run `startup-pi.sh` → verify |
| Orphaned dev-agent | Kill tmux session → remove stale alias |

If recovery succeeds, no LLM tokens are consumed — same as a healthy check. Only unrecoverable failures prompt the agent.

Recovery actions are logged to ~/.pi/agent/logs/auto-recovery.jsonl.
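The token-saving behavior boils down to a filter step: only failures without a successful `RecoveryAction` are surfaced to the agent. A minimal sketch (the `Failure` shape and function name are illustrative, not the PR's actual identifiers):

```typescript
interface Failure {
  name: string;
  detail: string;
}

interface RecoveryAction {
  check: string;
  action: string;
  success: boolean;
  detail: string;
}

// Keep only the failures that auto-recovery did not resolve.
function remainingFailures(failures: Failure[], actions: RecoveryAction[]): Failure[] {
  const recovered = new Set(actions.filter((a) => a.success).map((a) => a.check));
  return failures.filter((f) => !recovered.has(f.name));
}
```

If the remaining list is empty, no prompt is injected and no LLM tokens are spent.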

What this addresses from #185

| # | Item | Status |
| --- | --- | --- |
| 1 | Non-blocking spawns | ✅ Already done (#188 `agent_spawn`) |
| 2 | Bash tool timeouts | ⏳ Requires pi core changes |
| 3 | Async message processing | ⏳ Requires pi core changes |
| 4 | Health monitoring & auto-recovery | This PR (heartbeat auto-recovery) |
| 5 | Circuit breaker | This PR (spawn circuit breaker) |
| 6 | Graceful degradation | ⚠️ Partial (operational docs exist) |
| 7 | Observability | This PR (lifecycle log + `spawn_status` tool) |
| 8 | Worker lifecycle tracking | This PR (lifecycle JSONL log) |

Testing

  • Fixed test harness to handle multiple tool registrations
  • Added circuit breaker test: 3 failures → open → 4th spawn rejected
  • Added spawn_status tool registration test
  • All 128 tests pass (73 heartbeat + 6 agent-spawn + 49 memory)
  • TypeScript: `tsc --noEmit` passes
  • Lint: `biome lint` passes

Files changed

| File | Changes |
| --- | --- |
| `pi/extensions/agent-spawn.ts` | Circuit breaker, lifecycle logging, `spawn_status` tool |
| `pi/extensions/heartbeat.ts` | Auto-recovery for bridge-down and orphaned dev-agents |
| `pi/extensions/agent-spawn.test.mjs` | Fixed harness, 2 new tests |

Refs #185

Co-authored-by: Darcy Clarke <darcy@darcyclarke.me>
@greptile-apps

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR adds three resilience features: a spawn circuit breaker (3-strike, 5-minute cooldown) in agent-spawn.ts, a JSONL worker lifecycle log with a spawn_status observability tool, and heartbeat auto-recovery in heartbeat.ts that silently resolves bridge-down and orphaned dev-agent failures before consuming LLM tokens. The changes fit cleanly into the existing extension architecture.

Key findings:

  • Logic bug in bridge recovery (heartbeat.ts lines 614–655): The bridge tmux session and port 7890 processes are killed before the code checks whether the startup script exists or whether live socket UUIDs are available. If either condition fails, no RecoveryAction is pushed and nothing is logged — the bridge is silently left in a dead state while the agent is prompted to fix it without knowing auto-recovery already killed it.
  • Shell injection surface (heartbeat.ts line 676): Session names from the filesystem are interpolated directly into execSync template literals; use execFileSync with an argument array to avoid the shell entirely.
  • Dynamic require() in ESM files (agent-spawn.ts line 545, heartbeat.ts lines 595, 684): Both files use static import at the top but fall back to require() inline for child_process and fs. spawn_status in particular shells out to tail for reading its own JSONL log file, which is non-portable and redundant given readFileSync is already imported.

Confidence Score: 3/5

  • Merge with caution — bridge recovery has a silent partial-failure path that can leave the bridge killed without any recovery record or agent notification of the kill.
  • The circuit breaker and lifecycle logging in agent-spawn.ts are well-implemented and low-risk. The heartbeat auto-recovery has a real logic bug: the bridge is killed before verifying that a restart is possible, and the silent failure paths push no RecoveryAction, leaving the system in a worse state than before recovery was attempted with no audit trail. This is a production-impacting issue for the bridge restart path.
  • pi/extensions/heartbeat.ts — specifically the bridge restart block in tryAutoRecover() (lines 614–667)

Important Files Changed

| Filename | Overview |
| --- | --- |
| `pi/extensions/agent-spawn.ts` | Adds circuit breaker (3-strike, 5-min cooldown), lifecycle JSONL logging, and a new `spawn_status` tool. Logic is sound; minor issue with dynamic `require()` and use of shell `tail` in `spawn_status` instead of `readFileSync`. |
| `pi/extensions/heartbeat.ts` | Adds `tryAutoRecover()` for bridge-down and orphaned-dev-agent cases. Has a logic bug: the bridge session is killed before checking whether the startup script or socket UUIDs are available, and no `RecoveryAction` is pushed in the silent failure paths — leaving the bridge dead with no audit trail. Also uses dynamic `require()` and unsanitized session names in shell commands. |
| `pi/extensions/agent-spawn.test.mjs` | Test harness correctly updated to support multiple registered tools. New circuit breaker test and `spawn_status` registration test are well-structured and cover the expected behaviors. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[agent_spawn called] --> B{Circuit open?}
    B -- Yes --> C[Log circuit_rejected\nReturn circuit_open error]
    B -- No --> D[Validate inputs]
    D -- Invalid --> E[Return validation error\nNo circuit trip]
    D -- Valid --> F[Log spawn_started\nRun tmux new-session]
    F -- tmux code ≠ 0 --> G[recordFailure\nLog spawn_failed stage=spawn]
    F -- tmux code = 0 --> H[waitForSessionReadiness]
    H -- aborted --> I[recordFailure\nLog spawn_failed stage=aborted]
    H -- timeout --> J[recordFailure\nLog spawn_failed stage=readiness]
    H -- ready --> K[recordSuccess\nLog spawn_success]
    G --> L{consecutiveFailures ≥ 3?}
    I --> L
    J --> L
    L -- Yes --> M[state = open]
    L -- No --> N[state unchanged]

    subgraph Heartbeat Auto-Recovery
    P[failures detected] --> Q{failure.name = bridge?}
    Q -- Yes --> R[Kill bridge tmux\nKill port 7890\nRun startup-pi.sh]
    R --> S{Startup script\nexists AND uuids?}
    S -- No --> T[⚠️ No action logged\nBridge remains dead]
    S -- Yes --> U[Verify bridge\nLog RecoveryAction]
    Q -- No --> V{failure.name\nstartsWith orphan:?}
    V -- Yes --> W[Kill tmux session\nRemove alias\nLog success]
    end

    U --> X{All recovered?}
    W --> X
    X -- Yes --> Y[No LLM tokens consumed]
    X -- No --> Z[Inject heartbeat prompt\nwith recovery details]
```

Last reviewed commit: 760d1e3

Comment on lines +614 to +655
```typescript
// Kill existing bridge tmux session
try {
  execSync('tmux kill-session -t baudbot-gateway-bridge 2>/dev/null', { timeout: 5000 });
} catch {
  // May not exist — that's fine
}

// Kill anything holding port 7890
try {
  execSync('lsof -ti :7890 2>/dev/null | xargs kill -9 2>/dev/null', { timeout: 5000 });
} catch {
  // Nothing holding port — fine
}

// Restart via startup script
const startupScript = join(homedir(), ".pi", "agent", "skills", "control-agent", "startup-pi.sh");
if (existsSync(startupScript)) {
  // Get live session UUIDs from session-control dir
  const sockFiles = readdirSync(SOCKET_DIR).filter((f) => f.endsWith(".sock"));
  const uuids = sockFiles.map((f) => f.replace(".sock", "")).join(" ");
  if (uuids) {
    execSync(`bash "${startupScript}" ${uuids} 2>&1`, {
      timeout: 30000,
      encoding: "utf-8",
    });

    // Verify bridge came back
    await new Promise((resolve) => setTimeout(resolve, 3000));
    const verifyResult = await checkBridge();

    const entry: RecoveryAction = {
      timestamp: new Date().toISOString(),
      check: failure.name,
      action: "bridge_restart",
      success: verifyResult.ok,
      detail: verifyResult.ok
        ? "Bridge restarted and verified healthy"
        : `Bridge restart attempted but still failing: ${verifyResult.detail}`,
    };
    actions.push(entry);
    logRecovery(entry);
  }
```

Bridge killed before verifying restart is possible

The bridge session and port 7890 are killed before verifying that either the startup script exists or that there are live socket UUIDs to pass it. If startupScript doesn't exist, or sockFiles is empty (no .sock files), the code falls through both inner if guards without pushing any RecoveryAction entry and without logging anything. The bridge is now dead but recoveryActions contains no entry for "bridge", so:

  1. recoveredChecks will NOT contain "bridge".
  2. The bridge failure remains in remainingFailures — the agent is prompted.
  3. But the agent has no idea the previous bridge tmux session was already killed and port 7890 was cleared — it's working against a ghost.

A minimal fix is to push a "partial failure" action in the missing branches:

```typescript
// After killing bridge/port but before the startup script check:
if (!existsSync(startupScript)) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: `Bridge session killed but startup script not found at ${startupScript} — cannot restart`,
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}

// …and similarly if uuids is empty:
if (!uuids) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: "Bridge session killed but no live socket UUIDs found — cannot restart",
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}
```

This ensures the agent prompt always includes the full picture of what auto-recovery attempted.


Comment on line +676

```typescript
try {
  // Kill the tmux session
  try {
    execSync(`tmux kill-session -t "${sessionName}" 2>/dev/null`, { timeout: 5000 });
```

Shell injection via unsanitized session name from filesystem

sessionName is derived directly from filenames in SOCKET_DIR (e.g. a file named dev-agent-foo"; rm -rf ~; echo ".alias). While the directory is inside ~/.pi/, the shell interpolation via template literal in execSync would execute arbitrary injected commands if such a filename existed.

Prefer the array-argument form of spawnSync / execFileSync which bypasses the shell entirely:

```typescript
import { execFileSync } from "node:child_process";
// …
execFileSync("tmux", ["kill-session", "-t", sessionName], { timeout: 5000 });
```

Similarly on line 635 for the startup script invocation — the UUIDs should each be passed as separate shell-quoted arguments, not space-joined into a single string.
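A quick demonstration of why the array-argument form is safe (assuming a POSIX system with `printf` on `PATH`; the commented-out startup-script line references the PR's `startupScript`/`uuids` variables and is illustrative):

```typescript
import { execFileSync } from "node:child_process";

// A hostile "session name" full of shell metacharacters…
const sessionName = 'dev-agent"; echo INJECTED; "';

// …is delivered to the child process as a single literal argument, because
// execFileSync never involves a shell; printf just echoes it back verbatim.
const out = execFileSync("printf", ["%s", sessionName], { encoding: "utf-8" });

// The startup script invocation generalizes the same way, each UUID its own
// argument instead of a space-joined string:
//   execFileSync("bash", [startupScript, ...uuids], { timeout: 30000 });
```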


Comment on lines +544 to +548
```typescript
try {
  const { execSync } = require("node:child_process");
  const tail = execSync(`tail -20 "${LIFECYCLE_LOG_PATH}" 2>/dev/null`, { encoding: "utf-8" });
  if (tail.trim()) {
    const lines = tail.trim().split("\n");
```

Dynamic require() in an ESM/TypeScript file — use readFileSync instead

This file uses static ES module import declarations at the top, but spawn_status falls back to require("node:child_process") and a shell tail command to read the last 20 lines of the JSONL log. Two problems:

  1. Dynamic require() is a CommonJS pattern and can be surprising in an ESM context.
  2. Shelling out to tail for a simple file-tail operation is non-portable (unavailable on Windows, behaves differently under PATH changes) and redundant — readFileSync is already available as a static import in this file.

A portable alternative using the already-imported readFileSync:

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync, readlinkSync, statSync } from "node:fs";
// …
// Inside spawn_status execute():
let recentEvents = "";
try {
  if (existsSync(LIFECYCLE_LOG_PATH)) {
    const lines = readFileSync(LIFECYCLE_LOG_PATH, "utf-8")
      .trimEnd()
      .split("\n")
      .slice(-20);
    // … same parse/format logic
  }
} catch {
  recentEvents = "  (no lifecycle log)";
}
```

The same require("node:child_process") / require("node:fs") pattern also appears in heartbeat.ts (lines 595, 684) where top-level static imports should be used instead.


Comment on lines +314 to +317
```typescript
// Circuit breaker check
if (isCircuitOpen(circuit)) {
  const cooldownLeft = circuit.lastFailureAt
    ? Math.max(0, CIRCUIT_COOLDOWN_MS - (Date.now() - circuit.lastFailureAt))
```

Bug: The isCircuitOpen function mutates the circuit state to half-open before input validation, allowing invalid requests to waste the single recovery probe attempt.
Severity: MEDIUM

Suggested Fix

Move the circuit state transition to occur after all input validation passes and immediately before the spawn attempt. This ensures only valid requests consume the half-open probe slot.
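One way to realize this split: a pure check that never mutates, plus a separate, explicit transition invoked only after validation passes. Self-contained sketch with illustrative names:

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface Circuit {
  state: CircuitState;
  consecutiveFailures: number;
  lastFailureAt: number | null;
}

const COOLDOWN_MS = 5 * 60 * 1000;

// Pure read: is the circuit currently rejecting spawns? No side effects,
// so an early rejection or a validation failure cannot change state.
function circuitBlocks(circuit: Circuit, now = Date.now()): boolean {
  if (circuit.state !== "open") return false;
  return circuit.lastFailureAt === null || now - circuit.lastFailureAt < COOLDOWN_MS;
}

// Explicit mutation, called only after validation and immediately before the
// spawn attempt, so only a real attempt consumes the half-open probe slot.
function enterHalfOpenIfCooled(circuit: Circuit, now = Date.now()): void {
  if (
    circuit.state === "open" &&
    circuit.lastFailureAt !== null &&
    now - circuit.lastFailureAt >= COOLDOWN_MS
  ) {
    circuit.state = "half-open";
  }
}
```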


Comment on line +695

```typescript
      success: true,
      detail: `Killed tmux session and removed alias for orphaned dev-agent "${sessionName}"`,
    };
    actions.push(entry);
```

Bug: The orphan cleanup logic unconditionally reports success, even when the underlying tmux kill-session command fails. This can leave orphaned processes running without notifying the agent.
Severity: MEDIUM

Suggested Fix

Inspect the error within the catch block for the tmux kill-session command. Only ignore errors indicating the session was already gone. For all other errors (e.g., command timeout, permission issues), the recovery action should be marked as success: false to ensure the failure is correctly reported.
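A sketch of that triage as a pure classifier (the stderr strings matched here are assumptions about tmux's messages, not verified across versions):

```typescript
interface KillResult {
  success: boolean;
  detail: string;
}

// Treat "session already gone" as benign; surface everything else as a
// failed recovery so the agent is told the cleanup did not happen.
function classifyKillError(stderr: string): KillResult {
  if (/can't find session|no server running/i.test(stderr)) {
    return { success: true, detail: "session already gone" };
  }
  return { success: false, detail: `tmux kill-session failed: ${stderr.trim()}` };
}
```

In the catch block, read the error's `stderr` (a Buffer under `execSync` when no encoding is set), convert it to a string, and feed it to the classifier instead of unconditionally reporting `success: true`.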

