
feat: add spawn circuit breaker, lifecycle logging, and heartbeat auto-recovery#198

Open
vltbaudbot wants to merge 1 commit into modem-dev:main from vltbaudbot:fix/spawn-resilience-and-auto-recovery

Conversation

@vltbaudbot
Contributor

Summary

Addresses remaining resilience items from #185 — the three highest-value improvements that don't require pi core changes.

1. Spawn Circuit Breaker (agent-spawn.ts)

After 3 consecutive spawn failures, the circuit opens and rejects new spawn attempts for 5 minutes. This prevents resource waste when spawns are failing systemically (missing API keys, model unavailability, etc.).

closed → open (after 3 consecutive failures) → half-open (after 5-minute cooldown) → closed (on successful probe)

Failure types that trip the circuit:

  • tmux spawn exit code ≠ 0
  • Readiness timeout (alias/socket never appeared)
  • Readiness aborted (signal cancelled)

Validation errors (bad name, missing model, etc.) do NOT trip the circuit.
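The state machine above is small enough to sketch end-to-end. A minimal, self-contained illustration — function names follow those mentioned in the review (`isCircuitOpen`, `recordFailure`), but the internals here are a guess, not the PR's actual implementation:

```typescript
type CircuitState = "closed" | "open" | "half-open";

const FAILURE_THRESHOLD = 3;
const COOLDOWN_MS = 5 * 60 * 1000;

interface Circuit {
  state: CircuitState;
  consecutiveFailures: number;
  lastFailureAt: number | null;
}

// Returns true while the circuit is rejecting spawns. Once the cooldown has
// elapsed, transitions to half-open and lets a single probe attempt through.
function isCircuitOpen(circuit: Circuit, now = Date.now()): boolean {
  if (circuit.state !== "open") return false;
  if (circuit.lastFailureAt !== null && now - circuit.lastFailureAt >= COOLDOWN_MS) {
    circuit.state = "half-open";
    return false;
  }
  return true;
}

function recordFailure(circuit: Circuit, now = Date.now()): void {
  circuit.consecutiveFailures += 1;
  circuit.lastFailureAt = now;
  if (circuit.consecutiveFailures >= FAILURE_THRESHOLD) circuit.state = "open";
}

function recordSuccess(circuit: Circuit): void {
  circuit.state = "closed";
  circuit.consecutiveFailures = 0;
  circuit.lastFailureAt = null;
}
```

Validation errors simply never call `recordFailure`, which is how they avoid affecting the circuit.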

2. Worker Lifecycle Logging (agent-spawn.ts)

All spawn events are appended to ~/.pi/agent/logs/worker-lifecycle.jsonl:

| Event | When |
| --- | --- |
| `spawn_started` | Spawn attempt begins |
| `spawn_success` | Readiness verified (includes `ready_after_ms`) |
| `spawn_failed` | tmux error, readiness timeout, or abort |
| `circuit_rejected` | Spawn refused by open circuit |

New spawn_status tool exposes circuit breaker state and recent lifecycle events for observability.
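A hedged sketch of the JSONL append and a `spawn_status`-style tail. The real file lives at `~/.pi/agent/logs/worker-lifecycle.jsonl`; the path is a parameter here so the sketch is easy to exercise, and the field names beyond those listed above are assumptions:

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync } from "node:fs";
import { dirname } from "node:path";

type LifecycleEvent = "spawn_started" | "spawn_success" | "spawn_failed" | "circuit_rejected";

// Append one event as a single JSON line (JSONL).
function logLifecycle(path: string, event: LifecycleEvent, fields: Record<string, unknown> = {}): void {
  mkdirSync(dirname(path), { recursive: true });
  const entry = { timestamp: new Date().toISOString(), event, ...fields };
  appendFileSync(path, JSON.stringify(entry) + "\n");
}

// What a spawn_status-style reader might do: parse the last N lines.
function recentEvents(path: string, n = 20): Array<Record<string, unknown>> {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf-8")
    .trimEnd()
    .split("\n")
    .slice(-n)
    .map((line) => JSON.parse(line));
}
```

Reading the tail with `readFileSync` rather than shelling out keeps the reader portable, which is also what the review below recommends.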

3. Heartbeat Auto-Recovery (heartbeat.ts)

Before prompting the control-agent about failures, the heartbeat now attempts automatic recovery for two safe, idempotent failure types:

| Failure | Auto-Recovery Action |
| --- | --- |
| Bridge down | Kill bridge tmux session → clear port 7890 → run `startup-pi.sh` → verify |
| Orphaned dev-agent | Kill tmux session → remove stale alias |

If recovery succeeds, no LLM tokens are consumed — same as a healthy check. Only unrecoverable failures prompt the agent.

Recovery actions are logged to ~/.pi/agent/logs/auto-recovery.jsonl.
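The token-saving behavior boils down to a filter step: only failures without a successful `RecoveryAction` are surfaced to the agent. A minimal sketch (the `Failure` shape and function name are illustrative, not the PR's actual identifiers):

```typescript
interface Failure {
  name: string;
  detail: string;
}

interface RecoveryAction {
  check: string;
  action: string;
  success: boolean;
  detail: string;
}

// Keep only the failures that auto-recovery did not resolve.
function remainingFailures(failures: Failure[], actions: RecoveryAction[]): Failure[] {
  const recovered = new Set(actions.filter((a) => a.success).map((a) => a.check));
  return failures.filter((f) => !recovered.has(f.name));
}
```

If the remaining list is empty, no prompt is injected and no LLM tokens are spent.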

What this addresses from #185

| # | Item | Status |
| --- | --- | --- |
| 1 | Non-blocking spawns | ✅ Already done (#188 `agent_spawn`) |
| 2 | Bash tool timeouts | ⏳ Requires pi core changes |
| 3 | Async message processing | ⏳ Requires pi core changes |
| 4 | Health monitoring & auto-recovery | This PR (heartbeat auto-recovery) |
| 5 | Circuit breaker | This PR (spawn circuit breaker) |
| 6 | Graceful degradation | ⚠️ Partial (operational docs exist) |
| 7 | Observability | This PR (lifecycle log + `spawn_status` tool) |
| 8 | Worker lifecycle tracking | This PR (lifecycle JSONL log) |

Testing

  • Fixed test harness to handle multiple tool registrations
  • Added circuit breaker test: 3 failures → open → 4th spawn rejected
  • Added spawn_status tool registration test
  • All 128 tests pass (73 heartbeat + 6 agent-spawn + 49 memory)
  • TypeScript: `tsc --noEmit` passes
  • Lint: `biome lint` passes

Files changed

| File | Changes |
| --- | --- |
| `pi/extensions/agent-spawn.ts` | Circuit breaker, lifecycle logging, `spawn_status` tool |
| `pi/extensions/heartbeat.ts` | Auto-recovery for bridge-down and orphaned dev-agents |
| `pi/extensions/agent-spawn.test.mjs` | Fixed harness, 2 new tests |

Refs #185

Co-authored-by: Darcy Clarke <darcy@darcyclarke.me>
@greptile-apps

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR adds three resilience features: a spawn circuit breaker (3-strike, 5-minute cooldown) in agent-spawn.ts, a JSONL worker lifecycle log with a spawn_status observability tool, and heartbeat auto-recovery in heartbeat.ts that silently resolves bridge-down and orphaned dev-agent failures before consuming LLM tokens. The changes fit cleanly into the existing extension architecture.

Key findings:

  • Logic bug in bridge recovery (heartbeat.ts lines 614–655): The bridge tmux session and port 7890 processes are killed before the code checks whether the startup script exists or whether live socket UUIDs are available. If either condition fails, no RecoveryAction is pushed and nothing is logged — the bridge is silently left in a dead state while the agent is prompted to fix it without knowing auto-recovery already killed it.
  • Shell injection surface (heartbeat.ts line 676): Session names from the filesystem are interpolated directly into execSync template literals; use execFileSync with an argument array to avoid the shell entirely.
  • Dynamic require() in ESM files (agent-spawn.ts line 545, heartbeat.ts lines 595, 684): Both files use static import at the top but fall back to require() inline for child_process and fs. spawn_status in particular shells out to tail for reading its own JSONL log file, which is non-portable and redundant given readFileSync is already imported.

Confidence Score: 3/5

  • Merge with caution — bridge recovery has a silent partial-failure path that can leave the bridge killed without any recovery record or agent notification of the kill.
  • The circuit breaker and lifecycle logging in agent-spawn.ts are well-implemented and low-risk. The heartbeat auto-recovery has a real logic bug: the bridge is killed before verifying that a restart is possible, and the silent failure paths push no RecoveryAction, leaving the system in a worse state than before recovery was attempted with no audit trail. This is a production-impacting issue for the bridge restart path.
  • pi/extensions/heartbeat.ts — specifically the bridge restart block in tryAutoRecover() (lines 614–667)

Important Files Changed

| Filename | Overview |
| --- | --- |
| `pi/extensions/agent-spawn.ts` | Adds circuit breaker (3-strike, 5-min cooldown), lifecycle JSONL logging, and a new `spawn_status` tool. Logic is sound; minor issue with dynamic `require()` and use of shell `tail` in `spawn_status` instead of `readFileSync`. |
| `pi/extensions/heartbeat.ts` | Adds `tryAutoRecover()` for bridge-down and orphaned-dev-agent cases. Has a logic bug: the bridge session is killed before checking whether the startup script or socket UUIDs are available, and no `RecoveryAction` is pushed in the silent failure paths — leaving the bridge dead with no audit trail. Also uses dynamic `require()` and unsanitized session names in shell commands. |
| `pi/extensions/agent-spawn.test.mjs` | Test harness correctly updated to support multiple registered tools. New circuit breaker test and `spawn_status` registration test are well-structured and cover the expected behaviors. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[agent_spawn called] --> B{Circuit open?}
    B -- Yes --> C[Log circuit_rejected\nReturn circuit_open error]
    B -- No --> D[Validate inputs]
    D -- Invalid --> E[Return validation error\nNo circuit trip]
    D -- Valid --> F[Log spawn_started\nRun tmux new-session]
    F -- tmux code ≠ 0 --> G[recordFailure\nLog spawn_failed stage=spawn]
    F -- tmux code = 0 --> H[waitForSessionReadiness]
    H -- aborted --> I[recordFailure\nLog spawn_failed stage=aborted]
    H -- timeout --> J[recordFailure\nLog spawn_failed stage=readiness]
    H -- ready --> K[recordSuccess\nLog spawn_success]
    G --> L{consecutiveFailures ≥ 3?}
    I --> L
    J --> L
    L -- Yes --> M[state = open]
    L -- No --> N[state unchanged]

    subgraph Heartbeat Auto-Recovery
    P[failures detected] --> Q{failure.name = bridge?}
    Q -- Yes --> R[Kill bridge tmux\nKill port 7890\nRun startup-pi.sh]
    R --> S{Startup script\nexists AND uuids?}
    S -- No --> T[⚠️ No action logged\nBridge remains dead]
    S -- Yes --> U[Verify bridge\nLog RecoveryAction]
    Q -- No --> V{failure.name\nstartsWith orphan:?}
    V -- Yes --> W[Kill tmux session\nRemove alias\nLog success]
    end

    U --> X{All recovered?}
    W --> X
    X -- Yes --> Y[No LLM tokens consumed]
    X -- No --> Z[Inject heartbeat prompt\nwith recovery details]
```

Last reviewed commit: 760d1e3

Comment on lines +614 to +655
```typescript
// Kill existing bridge tmux session
try {
  execSync('tmux kill-session -t baudbot-gateway-bridge 2>/dev/null', { timeout: 5000 });
} catch {
  // May not exist — that's fine
}

// Kill anything holding port 7890
try {
  execSync('lsof -ti :7890 2>/dev/null | xargs kill -9 2>/dev/null', { timeout: 5000 });
} catch {
  // Nothing holding port — fine
}

// Restart via startup script
const startupScript = join(homedir(), ".pi", "agent", "skills", "control-agent", "startup-pi.sh");
if (existsSync(startupScript)) {
  // Get live session UUIDs from session-control dir
  const sockFiles = readdirSync(SOCKET_DIR).filter((f) => f.endsWith(".sock"));
  const uuids = sockFiles.map((f) => f.replace(".sock", "")).join(" ");
  if (uuids) {
    execSync(`bash "${startupScript}" ${uuids} 2>&1`, {
      timeout: 30000,
      encoding: "utf-8",
    });

    // Verify bridge came back
    await new Promise((resolve) => setTimeout(resolve, 3000));
    const verifyResult = await checkBridge();

    const entry: RecoveryAction = {
      timestamp: new Date().toISOString(),
      check: failure.name,
      action: "bridge_restart",
      success: verifyResult.ok,
      detail: verifyResult.ok
        ? "Bridge restarted and verified healthy"
        : `Bridge restart attempted but still failing: ${verifyResult.detail}`,
    };
    actions.push(entry);
    logRecovery(entry);
  }
```

Bridge killed before verifying restart is possible

The bridge session and port 7890 are killed before verifying that either the startup script exists or that there are live socket UUIDs to pass it. If startupScript doesn't exist, or sockFiles is empty (no .sock files), the code falls through both inner if guards without pushing any RecoveryAction entry and without logging anything. The bridge is now dead but recoveryActions contains no entry for "bridge", so:

  1. recoveredChecks will NOT contain "bridge".
  2. The bridge failure remains in remainingFailures — the agent is prompted.
  3. But the agent has no idea the previous bridge tmux session was already killed and port 7890 was cleared — it's working against a ghost.

A minimal fix is to push a "partial failure" action in the missing branches:

```typescript
// After killing bridge/port but before the startup script check:
if (!existsSync(startupScript)) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: `Bridge session killed but startup script not found at ${startupScript} — cannot restart`,
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}

// …and similarly if uuids is empty:
if (!uuids) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: "Bridge session killed but no live socket UUIDs found — cannot restart",
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}
```

This ensures the agent prompt always includes the full picture of what auto-recovery attempted.


Comment on line +676

```typescript
try {
  // Kill the tmux session
  try {
    execSync(`tmux kill-session -t "${sessionName}" 2>/dev/null`, { timeout: 5000 });
```

Shell injection via unsanitized session name from filesystem

sessionName is derived directly from filenames in SOCKET_DIR (e.g. a file named dev-agent-foo"; rm -rf ~; echo ".alias). While the directory is inside ~/.pi/, the shell interpolation via template literal in execSync would execute arbitrary injected commands if such a filename existed.

Prefer the array-argument form of spawnSync / execFileSync which bypasses the shell entirely:

```typescript
import { execFileSync } from "node:child_process";
// …
execFileSync("tmux", ["kill-session", "-t", sessionName], { timeout: 5000 });
```

Similarly on line 635 for the startup script invocation — the UUIDs should each be passed as separate shell-quoted arguments, not space-joined into a single string.
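A quick demonstration of why the array-argument form is safe (assuming a POSIX system with `printf` on `PATH`; the commented-out startup-script line references the PR's `startupScript`/`uuids` variables and is illustrative):

```typescript
import { execFileSync } from "node:child_process";

// A hostile "session name" full of shell metacharacters…
const sessionName = 'dev-agent"; echo INJECTED; "';

// …is delivered to the child process as a single literal argument, because
// execFileSync never involves a shell; printf just echoes it back verbatim.
const out = execFileSync("printf", ["%s", sessionName], { encoding: "utf-8" });

// The startup script invocation generalizes the same way, each UUID its own
// argument instead of a space-joined string:
//   execFileSync("bash", [startupScript, ...uuids], { timeout: 30000 });
```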


Comment on lines +544 to +548
```typescript
try {
  const { execSync } = require("node:child_process");
  const tail = execSync(`tail -20 "${LIFECYCLE_LOG_PATH}" 2>/dev/null`, { encoding: "utf-8" });
  if (tail.trim()) {
    const lines = tail.trim().split("\n");
```

Dynamic require() in an ESM/TypeScript file — use readFileSync instead

This file uses static ES module import declarations at the top, but spawn_status falls back to require("node:child_process") and a shell tail command to read the last 20 lines of the JSONL log. Two problems:

  1. Dynamic require() is a CommonJS pattern and can be surprising in an ESM context.
  2. Shelling out to tail for a simple file-tail operation is non-portable (unavailable on Windows, behaves differently under PATH changes) and redundant — readFileSync is already available as a static import in this file.

A portable alternative using the already-imported readFileSync:

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync, readlinkSync, statSync } from "node:fs";
// …
// Inside spawn_status execute():
let recentEvents = "";
try {
  if (existsSync(LIFECYCLE_LOG_PATH)) {
    const lines = readFileSync(LIFECYCLE_LOG_PATH, "utf-8")
      .trimEnd()
      .split("\n")
      .slice(-20);
    // … same parse/format logic
  }
} catch {
  recentEvents = "  (no lifecycle log)";
}
```

The same require("node:child_process") / require("node:fs") pattern also appears in heartbeat.ts (lines 595, 684) where top-level static imports should be used instead.


Comment on lines +314 to +317
```typescript
// Circuit breaker check
if (isCircuitOpen(circuit)) {
  const cooldownLeft = circuit.lastFailureAt
    ? Math.max(0, CIRCUIT_COOLDOWN_MS - (Date.now() - circuit.lastFailureAt))
```

Bug: The isCircuitOpen function mutates the circuit state to half-open before input validation, allowing invalid requests to waste the single recovery probe attempt.
Severity: MEDIUM

Suggested Fix

Move the circuit state transition to occur after all input validation passes and immediately before the spawn attempt. This ensures only valid requests consume the half-open probe slot.
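One way to realize this split: a pure check that never mutates, plus a separate, explicit transition invoked only after validation passes. Self-contained sketch with illustrative names:

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface Circuit {
  state: CircuitState;
  consecutiveFailures: number;
  lastFailureAt: number | null;
}

const COOLDOWN_MS = 5 * 60 * 1000;

// Pure read: is the circuit currently rejecting spawns? No side effects,
// so an early rejection or a validation failure cannot change state.
function circuitBlocks(circuit: Circuit, now = Date.now()): boolean {
  if (circuit.state !== "open") return false;
  return circuit.lastFailureAt === null || now - circuit.lastFailureAt < COOLDOWN_MS;
}

// Explicit mutation, called only after validation and immediately before the
// spawn attempt, so only a real attempt consumes the half-open probe slot.
function enterHalfOpenIfCooled(circuit: Circuit, now = Date.now()): void {
  if (
    circuit.state === "open" &&
    circuit.lastFailureAt !== null &&
    now - circuit.lastFailureAt >= COOLDOWN_MS
  ) {
    circuit.state = "half-open";
  }
}
```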


Comment on line +695

```typescript
      success: true,
      detail: `Killed tmux session and removed alias for orphaned dev-agent "${sessionName}"`,
    };
    actions.push(entry);
```

Bug: The orphan cleanup logic unconditionally reports success, even when the underlying tmux kill-session command fails. This can leave orphaned processes running without notifying the agent.
Severity: MEDIUM

Suggested Fix

Inspect the error within the catch block for the tmux kill-session command. Only ignore errors indicating the session was already gone. For all other errors (e.g., command timeout, permission issues), the recovery action should be marked as success: false to ensure the failure is correctly reported.
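A sketch of that triage as a pure classifier (the stderr strings matched here are assumptions about tmux's messages, not verified across versions):

```typescript
interface KillResult {
  success: boolean;
  detail: string;
}

// Treat "session already gone" as benign; surface everything else as a
// failed recovery so the agent is told the cleanup did not happen.
function classifyKillError(stderr: string): KillResult {
  if (/can't find session|no server running/i.test(stderr)) {
    return { success: true, detail: "session already gone" };
  }
  return { success: false, detail: `tmux kill-session failed: ${stderr.trim()}` };
}
```

In the catch block, read the error's `stderr` (a Buffer under `execSync` when no encoding is set), convert it to a string, and feed it to the classifier instead of unconditionally reporting `success: true`.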

