
Bug: gaskd worker thread blocks indefinitely when Gemini CCB_DONE marker not detected #94

@bookandlover


Summary

The gaskd Gemini adapter's worker thread can block indefinitely when the CCB_DONE marker is not detected in Gemini's session log, causing all subsequent ask gemini requests to queue up and time out with exit_code=2.

Root Cause

In askd/adapters/gemini.py, handle_task() polls the Gemini session log for the CCB_DONE marker after sending a message. When the log reader fails to detect the marker (even though Gemini has already replied), the worker thread stays in its polling loop until the request timeout expires.

Since worker_pool.py uses a single worker thread per session (BaseSessionWorker with a serial queue), a blocked task prevents ALL subsequent tasks from being processed:

Task 1 (code review) → sent to Gemini → Gemini replies → log reader misses CCB_DONE → worker blocks
Task 2 (ping test)   → enqueued → waiting for Task 1 → done_event.wait() times out → exit_code=2
Task 3 (ping test)   → enqueued → waiting → timeout → exit_code=2
...
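The head-of-line blocking above can be reproduced in isolation. The following is a minimal sketch, not the actual worker_pool.py code: `simulate`, its parameters, and the sleep that stands in for the CCB_DONE polling loop are all hypothetical.

```python
import queue
import threading
import time


def simulate(stuck_for: float, wait_timeout: float) -> list:
    """Run three tasks through one serial worker; report which ones finished in time."""
    q: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = q.get()
            if item is None:  # sentinel: stop the worker
                return
            delay, done = item
            time.sleep(delay)  # stands in for the CCB_DONE polling loop
            done.set()

    threading.Thread(target=worker, daemon=True).start()

    events = []
    # Only the first task is slow; the other two are instant "ping" tasks.
    for delay in (stuck_for, 0.0, 0.0):
        done = threading.Event()
        q.put((delay, done))
        events.append(done)

    # Callers wait in turn here for simplicity; each wait mirrors done_event.wait().
    results = [done.wait(timeout=wait_timeout) for done in events]
    q.put(None)
    return results
```

With a stuck first task (`simulate(1.0, 0.1)`), all three waits time out even though the second and third tasks would finish instantly on their own.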

Observed Behavior

# gaskd.log shows Task 1 started but never completed:
[INFO] start provider=gemini req_id=20260222-120939-577-71605-1 work_dir=/Users/peng/LLM/BitXiongServer
# No corresponding "done" entry

# All subsequent tasks never appear in gaskd.log at all
# ask returns exit_code=2 (result=None) for every new request

Meanwhile, Gemini's tmux pane clearly shows the response with CCB_DONE marker:

CCB_DONE: 20260222-120939-577-71605-1

Why the Log Reader Misses CCB_DONE

The GeminiLogReader reads from Gemini's JSON session file. Possible causes:

  1. Gemini CLI writes the response to a different/new session file than expected
  2. Race condition: log reader starts polling before Gemini writes the response
  3. Session file rotation: Gemini creates a new session file, invalidating the cached path
  4. File read timing: the session JSON may not be flushed to disk when the reader checks
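One way to make the reader tolerant of causes 1, 3, and 4 is to re-resolve the session file on every poll instead of caching a path, and to skip files that are not yet valid JSON. This is only a sketch under assumptions: the directory layout, the `*.json` glob, and the name `find_done_marker` are hypothetical and do not describe Gemini CLI's actual session format. Only the `CCB_DONE: <req_id>` marker shape is taken from the pane output above.

```python
import json
import re
from pathlib import Path
from typing import Optional


def _mtime(path: Path) -> float:
    """Modification time, or 0.0 if the file vanished between glob and stat."""
    try:
        return path.stat().st_mtime
    except OSError:
        return 0.0


def find_done_marker(session_dir: Path, req_id: str) -> Optional[Path]:
    """Scan all session files, newest first, for the marker of this request."""
    # The lookahead stops req_id "x-1" from matching a marker for "x-10".
    pattern = re.compile(re.escape(f"CCB_DONE: {req_id}") + r"(?![\w-])")
    for path in sorted(session_dir.glob("*.json"), key=_mtime, reverse=True):
        try:
            text = path.read_text(encoding="utf-8")
            json.loads(text)  # a half-written file fails here; retry on the next poll
        except (OSError, json.JSONDecodeError):
            continue
        if pattern.search(text):
            return path
    return None
```

Calling this once per poll iteration costs a directory scan each time, but it no longer depends on the response landing in the file the reader first picked.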

Impact

  • Cascading failure: One missed CCB_DONE blocks ALL future Gemini requests
  • Silent failure: No error logged, no timeout warning — requests just silently fail with exit_code=2
  • No recovery: Only fix is to kill and restart askd daemon
  • Hard to diagnose: ccb-ping gemini still reports "OK" because it tests the askd TCP socket, not the worker thread

Suggested Fixes

1. Per-task timeout in worker thread (critical)

Add a per-task timeout in BaseSessionWorker.run() so a stuck task doesn't block the queue forever:

def run(self) -> None:
    while not self._stop_event.is_set():
        try:
            task = self._q.get(timeout=0.2)
        except queue.Empty:
            continue
        try:
            # TODO: enforce a per-task timeout around this call (e.g. via a helper thread)
            task.result = self._handle_task(task)
        except Exception as exc:
            task.result = self._handle_exception(exc, task)
        finally:
            task.done_event.set()
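The snippet above only marks where the timeout would go. One possible shape for it, as a sketch rather than a drop-in patch, is a helper that runs the call in a daemon thread and stops waiting after a deadline. `TaskTimeout` and `call_with_timeout` are hypothetical names, not part of the existing code.

```python
import threading
from typing import Any, Callable


class TaskTimeout(Exception):
    """Raised when a task exceeds its deadline (hypothetical name)."""


def call_with_timeout(fn: Callable[[], Any], timeout_s: float) -> Any:
    """Run fn in a helper thread and give up waiting after timeout_s seconds."""
    outcome: dict = {}

    def target() -> None:
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    helper = threading.Thread(target=target, daemon=True)
    helper.start()
    helper.join(timeout_s)
    if helper.is_alive():
        # Python cannot kill a thread: the helper is abandoned and may keep running.
        raise TaskTimeout(f"task exceeded {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Inside `run()`, the call could become `task.result = call_with_timeout(lambda: self._handle_task(task), timeout_s)`, so that a `TaskTimeout` flows into the existing `_handle_exception` path and `done_event` is still set. Because the abandoned helper keeps polling in the background, a cooperative deadline checked inside the polling loop itself would likely be the cleaner long-term fix.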

2. Fallback to tmux pane scraping

When the Gemini session log doesn't show CCB_DONE within a reasonable time (e.g., 30s), fall back to scraping the tmux pane output directly. In the failure observed above, the pane showed the marker even though the log reader never saw it.
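A rough sketch of such a fallback is shown here. It shells out to `tmux capture-pane -p` and searches the captured text for the marker line. The function names, the `pane_id` argument, and the history depth are assumptions for illustration, and the sketch does not try to strip the echoed prompt from the captured text.

```python
import re
import subprocess
from typing import Optional


def capture_pane(pane_id: str, history_lines: int = 2000) -> str:
    """Return the visible pane plus recent scrollback as text (requires tmux)."""
    proc = subprocess.run(
        ["tmux", "capture-pane", "-p", "-J", "-t", pane_id, "-S", f"-{history_lines}"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout


def reply_from_pane(pane_text: str, req_id: str) -> Optional[str]:
    """Return the text before the last matching CCB_DONE line, or None if absent."""
    pattern = re.compile(
        rf"^[ \t]*CCB_DONE:[ \t]*{re.escape(req_id)}[ \t]*$", re.MULTILINE
    )
    matches = list(pattern.finditer(pane_text))
    if not matches:
        return None
    # Use the last match in case the request text itself echoed the marker.
    return pane_text[: matches[-1].start()].rstrip()
```

The adapter could call `reply_from_pane(capture_pane(pane_id), req_id)` once the log-based deadline passes, and treat a non-`None` result as completion.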

3. Health monitoring

Add a "last completed task" timestamp. If no tasks complete for > N minutes while tasks are queued, log a warning and optionally restart the worker.
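A minimal sketch of such a monitor follows. It varies the idea slightly: besides the last-completed timestamp, it records when the current task started, so that a worker that has simply been idle is not reported as stalled. `WorkerHealth` and its method names are hypothetical, and the clock is injectable only to keep the sketch testable.

```python
import time
from typing import Callable, Optional


class WorkerHealth:
    """Tracks task progress so a watchdog can spot a stuck worker."""

    def __init__(self, stall_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._stall_after_s = stall_after_s
        self._clock = clock
        self._busy_since: Optional[float] = None
        self.last_completed: Optional[float] = None

    def task_started(self) -> None:
        """Call right after a task is taken from the queue."""
        self._busy_since = self._clock()

    def task_completed(self) -> None:
        """Call in the finally block, next to done_event.set()."""
        self._busy_since = None
        self.last_completed = self._clock()

    def is_stalled(self) -> bool:
        """True if the current task has been running longer than the threshold."""
        if self._busy_since is None:
            return False  # idle worker: nothing can be stuck
        return self._clock() - self._busy_since > self._stall_after_s
```

A periodic watchdog could call `is_stalled()` and log a warning, or restart the worker, when it returns true. The same flag could also feed the ping path so that it reflects worker state rather than only socket reachability.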

Environment

  • CCB version: v5.2.4 (ce20d5c)
  • Platform: macOS (Darwin 25.3.0)
  • Gemini CLI: gemini 0.x (exact version unknown)
  • Trigger: Large code review message (~8KB diff) sent to Gemini
