feat: add spawn circuit breaker, lifecycle logging, and heartbeat auto-recovery #198

vltbaudbot wants to merge 1 commit into modem-dev:main
Conversation
…o-recovery

Addresses remaining items from modem-dev#185 (control-agent resilience):

## Circuit breaker (agent-spawn.ts)

After 3 consecutive spawn failures, the circuit opens and rejects new spawn attempts for 5 minutes (cooldown). This prevents resource waste when spawns are failing due to systemic issues (missing API keys, model unavailability, etc.).

The circuit transitions: closed → open (after 3 failures) → half-open (after cooldown) → closed (on success).

Failure tracking counts tmux spawn failures, readiness timeouts, and aborted readiness checks. Validation errors (bad name, missing model) don't affect the circuit.

## Worker lifecycle logging (agent-spawn.ts)

All spawn events are logged to `~/.pi/agent/logs/worker-lifecycle.jsonl`:

- `spawn_started`: when a spawn attempt begins
- `spawn_success`: readiness verified (includes `ready_after_ms`)
- `spawn_failed`: tmux error, readiness timeout, or abort
- `circuit_rejected`: spawn refused by open circuit

New `spawn_status` tool exposes circuit breaker state and recent lifecycle events for observability.

## Heartbeat auto-recovery (heartbeat.ts)

Before prompting the control-agent about failures, the heartbeat now attempts automatic recovery for two failure types:

- Bridge down: kills existing bridge tmux session, clears port holders, runs startup-pi.sh, verifies bridge comes back
- Orphaned dev-agents: kills tmux session, removes stale alias

If recovery succeeds, no LLM tokens are consumed (same as a healthy check). Only unrecoverable failures prompt the agent. Recovery actions are logged to `~/.pi/agent/logs/auto-recovery.jsonl` for audit and debugging.

## Tests

- Fixed test harness to handle multiple tool registrations
- Added circuit breaker test (3 failures → open → rejected)
- Added `spawn_status` tool registration test
- All 128 tests pass (73 heartbeat + 6 agent-spawn + 49 memory)

Refs modem-dev#185

Co-authored-by: Darcy Clarke <darcy@darcyclarke.me>
Greptile Summary

This PR adds three resilience features: a spawn circuit breaker (3-strike, 5-minute cooldown) in
Key findings:
Confidence Score: 3/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[agent_spawn called] --> B{Circuit open?}
    B -- Yes --> C[Log circuit_rejected\nReturn circuit_open error]
    B -- No --> D[Validate inputs]
    D -- Invalid --> E[Return validation error\nNo circuit trip]
    D -- Valid --> F[Log spawn_started\nRun tmux new-session]
    F -- tmux code ≠ 0 --> G[recordFailure\nLog spawn_failed stage=spawn]
    F -- tmux code = 0 --> H[waitForSessionReadiness]
    H -- aborted --> I[recordFailure\nLog spawn_failed stage=aborted]
    H -- timeout --> J[recordFailure\nLog spawn_failed stage=readiness]
    H -- ready --> K[recordSuccess\nLog spawn_success]
    G --> L{consecutiveFailures ≥ 3?}
    I --> L
    J --> L
    L -- Yes --> M[state = open]
    L -- No --> N[state unchanged]
    subgraph Heartbeat Auto-Recovery
        P[failures detected] --> Q{failure.name = bridge?}
        Q -- Yes --> R[Kill bridge tmux\nKill port 7890\nRun startup-pi.sh]
        R --> S{Startup script\nexists AND uuids?}
        S -- No --> T[⚠️ No action logged\nBridge remains dead]
        S -- Yes --> U[Verify bridge\nLog RecoveryAction]
        Q -- No --> V{failure.name\nstartsWith orphan:?}
        V -- Yes --> W[Kill tmux session\nRemove alias\nLog success]
    end
    U --> X{All recovered?}
    W --> X
    X -- Yes --> Y[No LLM tokens consumed]
    X -- No --> Z[Inject heartbeat prompt\nwith recovery details]
```
Last reviewed commit: 760d1e3
```typescript
// Kill existing bridge tmux session
try {
  execSync('tmux kill-session -t baudbot-gateway-bridge 2>/dev/null', { timeout: 5000 });
} catch {
  // May not exist — that's fine
}

// Kill anything holding port 7890
try {
  execSync('lsof -ti :7890 2>/dev/null | xargs kill -9 2>/dev/null', { timeout: 5000 });
} catch {
  // Nothing holding port — fine
}

// Restart via startup script
const startupScript = join(homedir(), ".pi", "agent", "skills", "control-agent", "startup-pi.sh");
if (existsSync(startupScript)) {
  // Get live session UUIDs from session-control dir
  const sockFiles = readdirSync(SOCKET_DIR).filter((f) => f.endsWith(".sock"));
  const uuids = sockFiles.map((f) => f.replace(".sock", "")).join(" ");
  if (uuids) {
    execSync(`bash "${startupScript}" ${uuids} 2>&1`, {
      timeout: 30000,
      encoding: "utf-8",
    });

    // Verify bridge came back
    await new Promise((resolve) => setTimeout(resolve, 3000));
    const verifyResult = await checkBridge();

    const entry: RecoveryAction = {
      timestamp: new Date().toISOString(),
      check: failure.name,
      action: "bridge_restart",
      success: verifyResult.ok,
      detail: verifyResult.ok
        ? "Bridge restarted and verified healthy"
        : `Bridge restart attempted but still failing: ${verifyResult.detail}`,
    };
    actions.push(entry);
    logRecovery(entry);
  }
```
**Bridge killed before verifying restart is possible**

The bridge session and port 7890 are killed *before* verifying that either the startup script exists or that there are live socket UUIDs to pass it. If `startupScript` doesn't exist, or `sockFiles` is empty (no `.sock` files), the code falls through both inner `if` guards without pushing any `RecoveryAction` entry and without logging anything. The bridge is now dead but `recoveryActions` contains no entry for `"bridge"`, so:

1. `recoveredChecks` will NOT contain `"bridge"`.
2. The bridge failure remains in `remainingFailures` — the agent is prompted.
3. But the agent has no idea the previous bridge tmux session was already killed and port 7890 was cleared — it's working against a ghost.
A minimal fix is to push a "partial failure" action in the missing branches:

```typescript
// After killing bridge/port but before the startup script check:
if (!existsSync(startupScript)) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: `Bridge session killed but startup script not found at ${startupScript} — cannot restart`,
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}
// …and similarly if uuids is empty:
if (!uuids) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: "Bridge session killed but no live socket UUIDs found — cannot restart",
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}
```

This ensures the agent prompt always includes the full picture of what auto-recovery attempted.
Prompt To Fix With AI

This is a comment left during a code review.
Path: pi/extensions/heartbeat.ts
Line: 614-655
How can I resolve this? If you propose a fix, please make it concise.

```typescript
try {
  // Kill the tmux session
  try {
    execSync(`tmux kill-session -t "${sessionName}" 2>/dev/null`, { timeout: 5000 });
```
**Shell injection via unsanitized session name from filesystem**

`sessionName` is derived directly from filenames in `SOCKET_DIR` (e.g. a file named `dev-agent-foo"; rm -rf ~; echo ".alias`). While the directory is inside `~/.pi/`, the shell interpolation via template literal in `execSync` would execute arbitrary injected commands if such a filename existed.

Prefer the array-argument form of `spawnSync` / `execFileSync`, which bypasses the shell entirely:

```typescript
import { execFileSync } from "node:child_process";
// …
execFileSync("tmux", ["kill-session", "-t", sessionName], { timeout: 5000 });
```

Similarly on line 635 for the startup script invocation — the UUIDs should each be passed as separate array arguments (no shell quoting needed), not space-joined into a single string.
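The same no-shell approach could be applied to the startup script invocation. A minimal sketch, where `buildStartupArgs` and `runStartup` are illustrative helpers rather than the PR's actual code:

```typescript
import { execFileSync } from "node:child_process";

// Each UUID becomes its own argv entry; execFileSync never invokes a shell,
// so hostile characters in a UUID or path cannot be interpreted as commands.
function buildStartupArgs(startupScript: string, uuids: string[]): string[] {
  return [startupScript, ...uuids];
}

function runStartup(startupScript: string, uuids: string[]): string {
  return execFileSync("bash", buildStartupArgs(startupScript, uuids), {
    timeout: 30000,
    encoding: "utf-8",
  });
}
```

With this shape, `bash "$1" …` receives the script path and UUIDs exactly as given, regardless of spaces or quotes in the filenames.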
Prompt To Fix With AI

This is a comment left during a code review.
Path: pi/extensions/heartbeat.ts
Line: 676
How can I resolve this? If you propose a fix, please make it concise.

```typescript
try {
  const { execSync } = require("node:child_process");
  const tail = execSync(`tail -20 "${LIFECYCLE_LOG_PATH}" 2>/dev/null`, { encoding: "utf-8" });
  if (tail.trim()) {
    const lines = tail.trim().split("\n");
```
**Dynamic `require()` in an ESM/TypeScript file — use `readFileSync` instead**

This file uses static ES module `import` declarations at the top, but `spawn_status` falls back to `require("node:child_process")` and a shell `tail` command to read the last 20 lines of the JSONL log. Two problems:

1. Dynamic `require()` is a CommonJS pattern and can be surprising in an ESM context.
2. Shelling out to `tail` for a simple file-tail operation is non-portable (unavailable on Windows, behaves differently under PATH changes) and redundant — `readFileSync` is already available as a static import in this file.

A portable alternative using the already-imported `readFileSync`:

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync, readlinkSync, statSync } from "node:fs";
// …
// Inside spawn_status execute():
let recentEvents = "";
try {
  if (existsSync(LIFECYCLE_LOG_PATH)) {
    const lines = readFileSync(LIFECYCLE_LOG_PATH, "utf-8")
      .trimEnd()
      .split("\n")
      .slice(-20);
    // … same parse/format logic
  }
} catch {
  recentEvents = " (no lifecycle log)";
}
```

The same `require("node:child_process")` / `require("node:fs")` pattern also appears in `heartbeat.ts` (lines 595, 684), where top-level static imports should be used instead.
Prompt To Fix With AI

This is a comment left during a code review.
Path: pi/extensions/agent-spawn.ts
Line: 544-548
How can I resolve this? If you propose a fix, please make it concise.

```typescript
// Circuit breaker check
if (isCircuitOpen(circuit)) {
  const cooldownLeft = circuit.lastFailureAt
    ? Math.max(0, CIRCUIT_COOLDOWN_MS - (Date.now() - circuit.lastFailureAt))
```
**Bug:** The `isCircuitOpen` function mutates the circuit state to `half-open` before input validation, allowing invalid requests to waste the single recovery probe attempt.

Severity: MEDIUM

Suggested Fix
Move the circuit state transition to occur after all input validation passes and immediately before the spawn attempt. This ensures only valid requests consume the half-open probe slot.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: pi/extensions/agent-spawn.ts#L314-L317
Potential issue: The `isCircuitOpen` function mutates the circuit breaker's state to
`half-open` as a side effect. This check is performed before input validation.
Consequently, a request with invalid input (e.g., an unsafe `sessionName`) can trigger
the state transition to `half-open` and then fail validation, returning an error. This
consumes the single "probe" attempt allowed in the half-open state without actually
trying to spawn an agent. This violates the intended behavior of allowing exactly one
probe attempt to test system recovery, potentially delaying recovery from an open
circuit.
```typescript
  success: true,
  detail: `Killed tmux session and removed alias for orphaned dev-agent "${sessionName}"`,
};
actions.push(entry);
```
Bug: The orphan cleanup logic unconditionally reports success, even when the underlying tmux kill-session command fails. This can leave orphaned processes running without notifying the agent.
Severity: MEDIUM
Suggested Fix
Inspect the error within the catch block for the tmux kill-session command. Only ignore errors indicating the session was already gone. For all other errors (e.g., command timeout, permission issues), the recovery action should be marked as success: false to ensure the failure is correctly reported.
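One way to sketch that error inspection, assuming tmux's usual "can't find session" / "no server running" messages and an illustrative `killSessionOutcome` helper:

```typescript
// Classify the result of a kill attempt: only "session already gone" errors
// count as success; timeouts, permission errors, etc. surface as failures.
function killSessionOutcome(kill: () => void): { success: boolean; detail: string } {
  try {
    kill();
    return { success: true, detail: "tmux session killed" };
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    if (/can't find session|no server running/i.test(msg)) {
      // Session was already gone — the desired end state, so still a success.
      return { success: true, detail: "session already gone" };
    }
    return { success: false, detail: `kill-session failed: ${msg}` };
  }
}
```

The returned `success` flag would then feed directly into the `RecoveryAction` entry instead of a hard-coded `true`.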
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: pi/extensions/heartbeat.ts#L695
Potential issue: The orphan cleanup recovery action uses `execSync` to kill a tmux
session within a `try...catch` block that silently swallows all errors. This means if
the `tmux kill-session` command fails for any reason other than the session not existing
(e.g., tmux daemon is down, permissions error), the error is ignored. The recovery
action is then unconditionally reported as `success: true`. This prevents the agent from
being notified about the failed cleanup, leaving an orphaned session running on the
server while the system believes it has been successfully removed.
Summary

Addresses remaining resilience items from #185 — the three highest-value improvements that don't require pi core changes.

**1. Spawn Circuit Breaker (`agent-spawn.ts`)**

After 3 consecutive spawn failures, the circuit opens and rejects new spawn attempts for 5 minutes. This prevents resource waste when spawns are failing systemically (missing API keys, model unavailability, etc.).

Failure types that trip the circuit: tmux spawn failures, readiness timeouts, and aborted readiness checks.

Validation errors (bad name, missing model, etc.) do NOT trip the circuit.
**2. Worker Lifecycle Logging (`agent-spawn.ts`)**

All spawn events are appended to `~/.pi/agent/logs/worker-lifecycle.jsonl`:

- `spawn_started`
- `spawn_success` (includes `ready_after_ms`)
- `spawn_failed`
- `circuit_rejected`
New `spawn_status` tool exposes circuit breaker state and recent lifecycle events for observability.

**3. Heartbeat Auto-Recovery (`heartbeat.ts`)**

Before prompting the control-agent about failures, the heartbeat now attempts automatic recovery for two safe, idempotent failure types:

- Bridge down: kills the existing bridge tmux session, clears port holders, runs startup-pi.sh, verifies the bridge comes back
- Orphaned dev-agents: kills the tmux session, removes the stale alias
If recovery succeeds, no LLM tokens are consumed — same as a healthy check. Only unrecoverable failures prompt the agent.
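The token-saving gate can be sketched as a small filter: the agent is prompted only for failures without a successful recovery action. `RecoveryAction` here carries just the fields this logic needs; the names are stand-ins for the PR's code.

```typescript
interface RecoveryAction {
  check: string;    // failure name this action addressed, e.g. "bridge"
  success: boolean; // whether auto-recovery fixed it
}

// Failures still needing the agent's attention after auto-recovery ran.
function remainingFailures(failures: string[], actions: RecoveryAction[]): string[] {
  const recovered = new Set(
    actions.filter((a) => a.success).map((a) => a.check),
  );
  return failures.filter((name) => !recovered.has(name));
}
// If remainingFailures(...) is empty, the heartbeat skips the LLM prompt
// entirely — the same cost profile as a fully healthy check.
```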
Recovery actions are logged to `~/.pi/agent/logs/auto-recovery.jsonl`.

**What this addresses from #185**
- … (`agent_spawn`)

**Testing**

- Fixed test harness to handle multiple tool registrations
- Added circuit breaker test (3 failures → open → rejected)
- Added `spawn_status` tool registration test
- All 128 tests pass (73 heartbeat + 6 agent-spawn + 49 memory)
- `tsc --noEmit` ✅
- `biome lint` ✅

**Files changed**

- `pi/extensions/agent-spawn.ts` — `spawn_status` tool
- `pi/extensions/heartbeat.ts`
- `pi/extensions/agent-spawn.test.mjs`

Refs #185