Closed
19 changes: 18 additions & 1 deletion docs/content/docs/(configuration)/config.mdx
@@ -84,6 +84,12 @@ background_threshold = 0.80 # background summarization
aggressive_threshold = 0.85 # aggressive summarization
emergency_threshold = 0.95 # drop oldest 50%, no LLM

# Deterministic worker task contract timing.
[defaults.worker_contract]
ack_secs = 5 # seconds before first ack checkpoint
progress_secs = 45 # seconds between progress heartbeat nudges
tick_secs = 2 # scheduler tick interval for contract deadline checks

# Cortex (system observer) settings.
[defaults.cortex]
tick_interval_secs = 30
@@ -227,6 +233,7 @@ Most config values are hot-reloaded when their files change. Spacebot watches `c
| `max_concurrent_branches` | Yes | Next branch spawn checks new limit |
| Browser config | Yes | Next worker spawn uses new config |
| Warmup config | Yes | Next warmup pass uses new values |
| `[defaults.worker_contract]` (`ack_secs`, `progress_secs`, `tick_secs`) | Yes | Runtime contract deadlines and polling update without restart |
| Identity files (SOUL.md, etc.) | Yes | Next channel message renders new identity |
| Skills (SKILL.md files) | Yes | Next message / worker spawn sees new skills |
| Bindings | Yes | Next message routes using new bindings |
@@ -471,12 +478,22 @@ Map of model names to ordered fallback chains. Used when the primary model retur

Thresholds are fractions of `context_window`.

### `[defaults.worker_contract]`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ack_secs` | integer | 5 | Deadline to confirm a worker start was surfaced |
| `progress_secs` | integer | 45 | Deadline between meaningful worker progress updates |
| `tick_secs` | integer | 2 | Poll interval for worker contract deadline checks |

Setting `ack_secs`, `progress_secs`, or `tick_secs` to `0` is treated as unset and falls back to the resolved default for that scope.
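
The zero-means-unset rule can be sketched as follows. This is an illustrative Python sketch, not Spacebot's implementation: the `WorkerContract` class and the `resolve_worker_contract` helper are hypothetical names, and the only behavior taken from the docs is that a missing or `0` value falls back to the default already resolved for that scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkerContract:
    # Documented defaults for [defaults.worker_contract].
    ack_secs: int = 5
    progress_secs: int = 45
    tick_secs: int = 2


def resolve_worker_contract(
    override: Optional[dict], fallback: WorkerContract
) -> WorkerContract:
    """Merge an override table onto a fallback; missing or 0 means unset."""
    override = override or {}

    def pick(key: str) -> int:
        value = override.get(key, 0)
        # 0 is treated as unset, so the fallback value wins.
        return value if value > 0 else getattr(fallback, key)

    return WorkerContract(
        ack_secs=pick("ack_secs"),
        progress_secs=pick("progress_secs"),
        tick_secs=pick("tick_secs"),
    )
```

For example, an override of `{"ack_secs": 0, "progress_secs": 90}` on top of the documented defaults resolves to `ack_secs = 5`, `progress_secs = 90`, `tick_secs = 2`.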

### `[defaults.cortex]`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `tick_interval_secs` | integer | 30 | How often the cortex checks system state |
| `worker_timeout_secs` | integer | 300 | Worker timeout before cancellation |
| `worker_timeout_secs` | integer | 300 | Inactivity timeout for worker progress events before forced cancellation |
| `branch_timeout_secs` | integer | 60 | Branch timeout before cancellation |
| `circuit_breaker_threshold` | integer | 3 | Consecutive failures before auto-disable |

2 changes: 2 additions & 0 deletions docs/content/docs/(deployment)/roadmap.mdx
@@ -39,6 +39,8 @@ The full message-in → LLM → response-out pipeline is wired end-to-end across
- **Tools** — 16 tools implement Rig's `Tool` trait with real logic (reply, branch, spawn_worker, route, cancel, skip, react, memory_save, memory_recall, set_status, shell, file, exec, browser, cron, web_search)
- **Workspace containment** — file tool validates paths stay within workspace boundary, shell/exec tools block instance directory traversal, sensitive file access, and secret env var leakage
- **Conversation persistence** — `ConversationLogger` with fire-and-forget SQLite writes, compaction archiving
- **Worker task contracts** — deterministic worker ack/progress/terminal deadlines with one-time SLA nudge and durable terminal convergence (`terminal_acked` / `terminal_failed`)
- **Worker event journal** — append-only `worker_events` persistence for started/status/tool/permission/question/completed lifecycle debugging
- **Cron** — scheduler with timers, active hours, circuit breaker (3 failures → disable), creates real channels. CronTool wired into channel tool factory.
- **Message routing** — full event loop with binding resolution, channel lifecycle, outbound routing
- **Settings store** — redb key-value with WorkerLogMode
54 changes: 51 additions & 3 deletions docs/content/docs/(features)/workers.mdx
@@ -62,13 +62,17 @@ Workers don't get memory tools, channel tools, or branch tools. They can't talk

```
Running ──→ Done (fire-and-forget completed)
Running ──→ Failed (error or cancellation)
Running ──→ Failed (error)
Running ──→ Cancelled (cancelled by channel/system)
Running ──→ timed_out (inactivity timeout elapsed)
Running ──→ WaitingForInput (interactive worker finished initial task)
WaitingForInput ──→ Running (follow-up message received via route)
WaitingForInput ──→ Failed (follow-up processing failed)
WaitingForInput ──→ Cancelled (cancelled by channel/system)
WaitingForInput ──→ timed_out (inactivity timeout elapsed)
```

`Done` and `Failed` are terminal. Illegal transitions are runtime errors.
`Done`, `Failed`, `Cancelled`, and `timed_out` are terminal. Illegal transitions are runtime errors.
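
The transition diagram can be expressed as a lookup table. The following is a minimal Python sketch for illustration only; the `WorkerState` enum, `LEGAL_TRANSITIONS` table, and `transition` function are hypothetical names, and the real state machine may differ. It encodes exactly the arrows listed in the diagram and raises on anything else.

```python
from enum import Enum


class WorkerState(Enum):
    RUNNING = "Running"
    WAITING_FOR_INPUT = "WaitingForInput"
    DONE = "Done"
    FAILED = "Failed"
    CANCELLED = "Cancelled"
    TIMED_OUT = "timed_out"


# One entry per non-terminal state, mirroring the diagram's arrows.
LEGAL_TRANSITIONS = {
    WorkerState.RUNNING: {
        WorkerState.DONE,
        WorkerState.FAILED,
        WorkerState.CANCELLED,
        WorkerState.TIMED_OUT,
        WorkerState.WAITING_FOR_INPUT,
    },
    WorkerState.WAITING_FOR_INPUT: {
        WorkerState.RUNNING,
        WorkerState.FAILED,
        WorkerState.CANCELLED,
        WorkerState.TIMED_OUT,
    },
}


def transition(current: WorkerState, target: WorkerState) -> WorkerState:
    """Return the new state, or raise if the transition is not in the table."""
    # Terminal states have no entry, so every transition out of them fails.
    if target not in LEGAL_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Because terminal states have no outgoing entries, a call such as `transition(WorkerState.DONE, WorkerState.RUNNING)` raises `ValueError`.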

## Context and History

@@ -95,7 +99,7 @@ Workers run in segments of 25 turns each. After each segment:

- If the agent returned a result: done
- If max turns hit: compact if needed, continue with "Continue where you left off"
- If cancelled: state = Failed
- If cancelled: state = Cancelled
- If context overflow: force compact, retry

This prevents runaway workers and handles long tasks that exceed a single agent loop.
@@ -111,10 +115,54 @@ Workers report progress via the `set_status` tool. The status string (max 256 ch

The channel LLM sees this and can decide whether to wait, ask for more info, or cancel.

Spacebot also forwards throttled worker checkpoints to the user-facing adapter:

- Start and completion updates are always surfaced.
- Mid-run checkpoints are deduped and rate-limited (default: at most one every 20s per worker, with urgent states bypassing the limit).
- Adapters that support message editing (for example Discord) update a single progress message in place to avoid channel spam.
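
The dedupe and rate-limit rules for mid-run checkpoints could look roughly like this. It is a hedged Python sketch, not the actual forwarding code: `CheckpointThrottle` and `should_forward` are hypothetical names, the caller is assumed to pass in a monotonic timestamp, and treating a repeated status as a duplicate even when it is urgent is an assumption. Start and completion updates are assumed to bypass this path entirely.

```python
class CheckpointThrottle:
    """Per-worker dedupe plus rate limit for mid-run checkpoints."""

    def __init__(self, min_interval_secs: float = 20.0):
        self.min_interval_secs = min_interval_secs
        # worker_id -> (time of last forwarded checkpoint, its status text)
        self._last = {}

    def should_forward(
        self, worker_id: str, status: str, now: float, urgent: bool = False
    ) -> bool:
        previous = self._last.get(worker_id)
        if previous is not None:
            last_at, last_status = previous
            if status == last_status:
                return False  # identical to the last forwarded checkpoint
            if not urgent and now - last_at < self.min_interval_secs:
                return False  # inside the rate-limit window
        self._last[worker_id] = (now, status)
        return True
```

A suppressed checkpoint is not recorded, so the window is always measured from the last checkpoint that actually reached the adapter.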

## Concurrency

Workers run concurrently. The default limit is `max_concurrent_workers: 5` per channel (configurable per agent). Attempting to spawn beyond the limit returns an error to the LLM so it can wait or cancel an existing worker.

## Timeouts

Worker runs are bounded by `worker_timeout_secs` (default `300`), which acts as an inactivity timeout rather than a limit on total run time. Any worker progress event (a status update, tool activity, or a permission/question prompt) resets the timer.

If no progress arrives within the timeout window, Spacebot marks the worker as `timed_out`, records a terminal result, and removes it from active worker state so the channel can continue delegating work.
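
The reset-on-progress behavior can be illustrated with a small timer. This Python sketch is an assumption about the shape of the logic, not the real code: `InactivityTimer` is a hypothetical name, timestamps are plain seconds supplied by the caller, and whether the comparison is inclusive at exactly `worker_timeout_secs` is not stated in the docs.

```python
class InactivityTimer:
    """Tracks time since the last worker progress event."""

    def __init__(self, timeout_secs: float = 300.0, started_at: float = 0.0):
        self.timeout_secs = timeout_secs
        self.last_progress_at = started_at

    def record_progress(self, now: float) -> None:
        # Any status update, tool activity, or prompt resets the window.
        self.last_progress_at = now

    def is_timed_out(self, now: float) -> bool:
        # Assumption: the worker times out once the full window has elapsed.
        return now - self.last_progress_at >= self.timeout_secs
```

With the default of 300 seconds, a worker that reports progress at t=250 is still alive at t=500 and only times out at t=550 if nothing else arrives.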

## Deterministic Task Contracts

Each worker run gets an internal task contract with three deadlines:

- **Acknowledge deadline** — confirms the worker start was surfaced to the user-facing adapter.
- **Progress deadline** — expects a meaningful heartbeat before the deadline.
- **Terminal deadline** — tracks terminal delivery lifecycle until receipt ack/failure.

If the acknowledge deadline is missed, Spacebot emits a synthesized "running" checkpoint. If the progress deadline is missed, it emits a single synthesized "still working" nudge; the nudge is sent at most once per contract, so it cannot turn into a spam loop. When the terminal receipt is acknowledged or fails, the contract closes as `terminal_acked` or `terminal_failed`, respectively.
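
A per-tick deadline check along these lines would produce the behavior described above. This is an illustrative Python sketch under stated assumptions: `TaskContract` and `check_deadlines` are hypothetical names, the field names loosely follow the `worker_task_contracts` columns, and marking the contract as acknowledged after the synthesized checkpoint is an assumption made so that the checkpoint is not repeated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskContract:
    ack_deadline_at: float
    progress_deadline_at: float
    progress_secs: float = 45.0
    acked: bool = False
    sla_nudge_sent: bool = False

    def record_progress(self, now: float) -> None:
        # A real heartbeat pushes the progress deadline forward.
        self.progress_deadline_at = now + self.progress_secs


def check_deadlines(contract: TaskContract, now: float) -> List[str]:
    """Run once per scheduler tick; return the synthesized messages to emit."""
    emitted = []
    if not contract.acked and now >= contract.ack_deadline_at:
        contract.acked = True  # assumption: emit the checkpoint only once
        emitted.append("running")
    if not contract.sla_nudge_sent and now >= contract.progress_deadline_at:
        contract.sla_nudge_sent = True  # one-time nudge
        emitted.append("still working")
    return emitted
```

The `sla_nudge_sent` flag is what keeps the nudge one-time: later ticks see the flag and emit nothing, however long the worker stays silent.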

## Terminal Delivery Reliability

Terminal worker notices (`done`, `failed`, `timed_out`, `cancelled`) are queued as durable delivery receipts before they are sent to the messaging adapter.

- Receipts are retried with bounded backoff on adapter delivery errors.
- Successful delivery marks the receipt as acknowledged.
- On process restart, in-flight (`sending`) receipts are re-queued so completion notices are not silently dropped.
- Old terminal receipts (`acked`, `failed`) are pruned periodically to keep storage bounded.
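
The retry and restart rules above can be sketched in a few lines. This is a hypothetical Python illustration, not Spacebot's code: `next_retry_delay` and `requeue_in_flight` are made-up names, the base delay, cap, and exponential schedule are assumptions (the docs only say the backoff is bounded), and receipts are modeled as plain dicts with a `status` field.

```python
def next_retry_delay(
    attempt_count: int, base_secs: float = 5.0, max_secs: float = 300.0
) -> float:
    """Exponential backoff capped at max_secs (assumed schedule)."""
    # Cap the exponent so very large attempt counts cannot overflow a float.
    return min(max_secs, base_secs * 2 ** min(attempt_count, 16))


def requeue_in_flight(receipts: list) -> int:
    """On restart, move receipts stuck in 'sending' back to 'pending'."""
    requeued = 0
    for receipt in receipts:
        if receipt["status"] == "sending":
            receipt["status"] = "pending"
            requeued += 1
    return requeued
```

Under these assumed values the delays run 5s, 10s, 20s, 40s, and so on, until they reach the 300s cap.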

## Worker Event Journal

Worker lifecycle updates are also written to an append-only `worker_events` table:

- `started` with task + worker type
- `status` checkpoints
- `tool_started`
- `tool_completed`
- `permission` / `question`
- `completed` with terminal summary

This gives us durable debugging context even after in-memory status blocks are gone. The workers API and `worker_inspect` surface this timeline so long-running task behavior can be audited post-run.
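
To show how such a journal can be read back, here is a small Python sketch using the standard `sqlite3` module against an in-memory database. The table definition copies the `worker_events` columns from the migration but omits the foreign key to `worker_runs` so the example is self-contained; the sample rows, payload shapes, and the `timeline` helper are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE worker_events (
        id TEXT PRIMARY KEY,
        worker_id TEXT NOT NULL,
        channel_id TEXT,
        agent_id TEXT,
        event_type TEXT NOT NULL,
        payload_json TEXT,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# Hypothetical sample events for one worker run.
sample_events = [
    ("e1", "started", {"task": "run test suite"}, "2026-02-25 10:00:00"),
    ("e2", "status", {"status": "installing dependencies"}, "2026-02-25 10:00:05"),
    ("e3", "tool_started", {"tool": "shell"}, "2026-02-25 10:00:06"),
    ("e4", "tool_completed", {"tool": "shell"}, "2026-02-25 10:00:40"),
    ("e5", "completed", {"summary": "all tests passed"}, "2026-02-25 10:00:41"),
]
conn.executemany(
    "INSERT INTO worker_events (id, worker_id, event_type, payload_json, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [(eid, "w1", etype, json.dumps(payload), ts) for eid, etype, payload, ts in sample_events],
)


def timeline(conn: sqlite3.Connection, worker_id: str) -> list:
    """Return (event_type, payload) pairs for one worker, oldest first."""
    rows = conn.execute(
        "SELECT event_type, payload_json FROM worker_events "
        "WHERE worker_id = ? ORDER BY created_at, rowid",
        (worker_id,),
    ).fetchall()
    return [(etype, json.loads(p) if p else None) for etype, p in rows]
```

Ordering by `created_at` and then `rowid` keeps events in insertion order even when several share the same second, which matters because `CURRENT_TIMESTAMP` in SQLite has one-second resolution.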

## Model Routing

Workers default to `anthropic/claude-haiku-4.5-20250514`. Task-type overrides apply — for example, a `coding` task type routes to `anthropic/claude-sonnet-4-20250514`. Fallback chains are supported. All hot-reloadable.
28 changes: 28 additions & 0 deletions migrations/20260224000001_worker_delivery_receipts.sql
@@ -0,0 +1,28 @@
-- Durable delivery receipts for terminal worker notifications.
--
-- Tracks whether a terminal worker completion notice has been delivered to the
-- user-facing channel, with bounded retry metadata for transient adapter
-- failures.

CREATE TABLE IF NOT EXISTS worker_delivery_receipts (
id TEXT PRIMARY KEY,
worker_id TEXT NOT NULL,
channel_id TEXT NOT NULL,
kind TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
terminal_state TEXT NOT NULL,
payload_text TEXT NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_error TEXT,
next_attempt_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
acked_at TIMESTAMP,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
UNIQUE(worker_id, kind)
);

CREATE INDEX idx_worker_delivery_receipts_due
ON worker_delivery_receipts(status, next_attempt_at);

CREATE INDEX idx_worker_delivery_receipts_channel
ON worker_delivery_receipts(channel_id, created_at);
35 changes: 35 additions & 0 deletions migrations/20260224000002_worker_task_contracts.sql
@@ -0,0 +1,35 @@
-- Deterministic worker task contracts.
--
-- Tracks acknowledgement/progress/terminal guarantees for worker executions so
-- long-running tasks always provide bounded feedback and reach terminal states.

CREATE TABLE IF NOT EXISTS worker_task_contracts (
id TEXT PRIMARY KEY,
agent_id TEXT NOT NULL,
channel_id TEXT NOT NULL,
worker_id TEXT NOT NULL UNIQUE,
task_summary TEXT NOT NULL,
state TEXT NOT NULL DEFAULT 'created',
ack_deadline_at TIMESTAMP NOT NULL,
progress_deadline_at TIMESTAMP NOT NULL,
terminal_deadline_at TIMESTAMP NOT NULL,
last_progress_at TIMESTAMP,
last_status_hash TEXT,
attempt_count INTEGER NOT NULL DEFAULT 0,
sla_nudge_sent INTEGER NOT NULL DEFAULT 0,
terminal_state TEXT,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_worker_task_contracts_channel_state
ON worker_task_contracts(channel_id, state);

CREATE INDEX idx_worker_task_contracts_ack_due
ON worker_task_contracts(state, ack_deadline_at);

CREATE INDEX idx_worker_task_contracts_progress_due
ON worker_task_contracts(state, progress_deadline_at);

CREATE INDEX idx_worker_task_contracts_terminal_due
ON worker_task_contracts(state, terminal_deadline_at);
24 changes: 24 additions & 0 deletions migrations/20260225000001_worker_events.sql
@@ -0,0 +1,24 @@
-- Durable worker event journal for debugging and UX timeline recovery.
--
-- Captures lifecycle checkpoints (started/progress/tool activity/completed) as
-- append-only records tied to worker_runs.

CREATE TABLE IF NOT EXISTS worker_events (
id TEXT PRIMARY KEY,
worker_id TEXT NOT NULL,
channel_id TEXT,
agent_id TEXT,
event_type TEXT NOT NULL,
payload_json TEXT,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (worker_id) REFERENCES worker_runs(id) ON DELETE CASCADE
);

CREATE INDEX IF NOT EXISTS idx_worker_events_worker
ON worker_events(worker_id, created_at);

CREATE INDEX IF NOT EXISTS idx_worker_events_channel
ON worker_events(channel_id, created_at);

CREATE INDEX IF NOT EXISTS idx_worker_events_agent
ON worker_events(agent_id, created_at);