Description
Summary
The gaskd Gemini adapter's worker thread can block indefinitely when the CCB_DONE marker is not detected in Gemini's session log, causing all subsequent ask gemini requests to queue up and time out with exit_code=2.
Root Cause
In askd/adapters/gemini.py, handle_task() polls the Gemini session log for the CCB_DONE marker after sending a message. When the log reader fails to detect the marker (even though Gemini has already replied), the worker thread stays in its polling loop until the request timeout expires.
Since worker_pool.py uses a single worker thread per session (BaseSessionWorker with a serial queue), a blocked task prevents ALL subsequent tasks from being processed:
Task 1 (code review) → sent to Gemini → Gemini replies → log reader misses CCB_DONE → worker blocks
Task 2 (ping test) → enqueued → waiting for Task 1 → done_event.wait() times out → exit_code=2
Task 3 (ping test) → enqueued → waiting → timeout → exit_code=2
...
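The head-of-line blocking above can be reproduced in isolation. The following is a minimal sketch, not the actual worker_pool.py code: `DemoTask`, `serial_worker`, and `run_demo` are hypothetical names, and a `time.sleep` stands in for the polling loop that never sees CCB_DONE.

```python
import queue
import threading
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DemoTask:
    name: str
    work_seconds: float  # how long the handler stays busy before returning
    done_event: threading.Event = field(default_factory=threading.Event)
    result: Optional[str] = None


def serial_worker(q: queue.Queue, stop: threading.Event) -> None:
    """Process tasks strictly one at a time, like a single per-session worker."""
    while not stop.is_set():
        try:
            task = q.get(timeout=0.05)
        except queue.Empty:
            continue
        time.sleep(task.work_seconds)  # stands in for the CCB_DONE polling loop
        task.result = "ok"
        task.done_event.set()


def run_demo() -> dict:
    q: queue.Queue = queue.Queue()
    stop = threading.Event()
    threading.Thread(target=serial_worker, args=(q, stop), daemon=True).start()

    q.put(DemoTask("code review", work_seconds=1.0))  # marker never detected
    ping = DemoTask("ping", work_seconds=0.0)
    q.put(ping)

    # The ping itself is instant, but it cannot start until the first task ends.
    finished = ping.done_event.wait(timeout=0.2)
    stop.set()
    return {"ping_exit_code": 0 if finished else 2}
```

The ping needs no work at all, yet its caller gives up first, because the only worker is still busy with the earlier task.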
Observed Behavior
# gaskd.log shows Task 1 started but never completed:
[INFO] start provider=gemini req_id=20260222-120939-577-71605-1 work_dir=/Users/peng/LLM/BitXiongServer
# No corresponding "done" entry
# All subsequent tasks never appear in gaskd.log at all
# ask returns exit_code=2 (result=None) for every new request
Meanwhile, Gemini's tmux pane clearly shows the response with the CCB_DONE marker:
CCB_DONE: 20260222-120939-577-71605-1
Why the Log Reader Misses CCB_DONE
The GeminiLogReader reads from Gemini's JSON session file. Possible causes:
- Gemini CLI writes the response to a different/new session file than expected
- Race condition: log reader starts polling before Gemini writes the response
- Session file rotation: Gemini creates a new session file, invalidating the cached path
- File read timing: the session JSON may not be flushed to disk when the reader checks
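If any of these causes applies, a more defensive reader might help. The sketch below is an illustration under assumptions, not the real GeminiLogReader: it assumes session files are `*.json` files in one directory and that the marker appears in them as the literal text `CCB_DONE: <req_id>`. The function names and parameters are hypothetical. It re-lists the directory on every poll instead of caching a path, and treats a file that does not parse yet as "not flushed, try again".

```python
import json
import time
from pathlib import Path


def marker_seen(session_dir: Path, req_id: str, started_at: float = 0.0) -> bool:
    """Scan every session file touched since started_at for the req_id marker."""
    marker = f"CCB_DONE: {req_id}"
    for path in session_dir.glob("*.json"):  # re-listed on every call, no cached path
        try:
            if path.stat().st_mtime < started_at:
                continue  # not modified since the request was sent
            text = path.read_text(encoding="utf-8")
            json.loads(text)  # a half-written file fails here and is retried later
        except (OSError, ValueError):
            continue
        if marker in text:
            return True
    return False


def wait_for_marker(session_dir: Path, req_id: str, timeout: float = 5.0,
                    interval: float = 0.1, started_at: float = 0.0) -> bool:
    """Poll until the marker shows up or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if marker_seen(session_dir, req_id, started_at):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

`started_at` is compared against file modification times, so it would be a `time.time()` value taken just before the message is sent.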
Impact
- Cascading failure: One missed CCB_DONE blocks ALL future Gemini requests
- Silent failure: No error logged, no timeout warning; requests just silently fail with exit_code=2
- No recovery: The only fix is to kill and restart the askd daemon
- Hard to diagnose: ccb-ping gemini still reports "OK" because it tests the askd TCP socket, not the worker thread
Suggested Fixes
1. Per-task timeout in worker thread (critical)
Add a per-task timeout in BaseSessionWorker.run() so a stuck task doesn't block the queue forever:
def run(self) -> None:
    while not self._stop_event.is_set():
        try:
            task = self._q.get(timeout=0.2)
        except queue.Empty:
            continue
        try:
            # Add per-task timeout using threading
            task.result = self._handle_task(task)
        except Exception as exc:
            task.result = self._handle_exception(exc, task)
        finally:
            task.done_event.set()
2. Fallback to tmux pane scraping
When the Gemini session log doesn't show CCB_DONE within a reasonable time (e.g., 30s), fall back to scraping the tmux pane output directly (which is more reliable).
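A rough sketch of such a fallback follows. `capture_pane` and `wait_for_done` are hypothetical helpers, not existing CCB functions. The tmux invocation uses `capture-pane` with `-p` (print to stdout), `-J` (join wrapped lines), and `-S` (start line in the scrollback). How the adapter knows the pane target is left open here.

```python
import subprocess
import time
from typing import Callable, Optional


def capture_pane(target: str, history_lines: int = 200) -> str:
    """Return recent text from a tmux pane, or '' if tmux reports an error."""
    proc = subprocess.run(
        ["tmux", "capture-pane", "-p", "-J", "-t", target, "-S", f"-{history_lines}"],
        capture_output=True, text=True, check=False,
    )
    return proc.stdout if proc.returncode == 0 else ""


def wait_for_done(
    req_id: str,
    log_has_marker: Callable[[], bool],  # the existing session-log check
    read_pane: Callable[[], str],        # e.g. lambda: capture_pane("some-target")
    log_grace: float = 30.0,             # trust only the log for this long
    timeout: float = 300.0,
    interval: float = 0.5,
) -> Optional[str]:
    """Return 'log' or 'pane' depending on where the marker was found, else None."""
    marker = f"CCB_DONE: {req_id}"
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < timeout:
        if log_has_marker():
            return "log"
        # Only after the grace period, look at the pane as a second source.
        if elapsed >= log_grace and marker in read_pane():
            return "pane"
        time.sleep(interval)
    return None
```

One caveat: if the text sent to Gemini itself contains the marker string, the pane would show it before Gemini replies, so a real implementation would need to tell the echoed prompt apart from the response.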
3. Health monitoring
Add a "last completed task" timestamp. If no tasks complete for > N minutes while tasks are queued, log a warning and optionally restart the worker.
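Fixes 1 and 3 could be combined roughly as follows. This is a standalone sketch, not a patch against BaseSessionWorker: `TimedWorker`, `Task`, and the result dictionaries are hypothetical. Python cannot kill a thread, so the timeout is implemented by running the handler in a helper thread and abandoning it once the deadline passes. The health check here is a variant of the "last completed" idea: it records when the worker last picked up or finished a task.

```python
import queue
import threading
import time
from typing import Any, Callable


class Task:
    def __init__(self, payload: Any) -> None:
        self.payload = payload
        self.result: Any = None
        self.done_event = threading.Event()


class TimedWorker(threading.Thread):
    def __init__(self, handler: Callable[[Task], Any], task_timeout: float) -> None:
        super().__init__(daemon=True)
        self._handler = handler
        self._task_timeout = task_timeout
        self._q: queue.Queue = queue.Queue()
        self._stop_event = threading.Event()
        self._in_flight = False
        self.last_progress = time.monotonic()

    def enqueue(self, task: Task) -> None:
        self._q.put(task)

    def stop(self) -> None:
        self._stop_event.set()

    def run(self) -> None:
        while not self._stop_event.is_set():
            try:
                task = self._q.get(timeout=0.2)
            except queue.Empty:
                continue
            self._in_flight, self.last_progress = True, time.monotonic()
            try:
                task.result = self._run_with_deadline(task)
            finally:
                self._in_flight, self.last_progress = False, time.monotonic()
                task.done_event.set()  # always release the waiting caller

    def _run_with_deadline(self, task: Task) -> Any:
        box: dict = {}

        def target() -> None:
            try:
                box["result"] = self._handler(task)
            except Exception as exc:
                box["result"] = {"exit_code": 1, "error": repr(exc)}

        helper = threading.Thread(target=target, daemon=True)
        helper.start()
        helper.join(self._task_timeout)
        if helper.is_alive():  # deadline passed; the helper is abandoned, not killed
            return {"exit_code": 2, "error": "task timed out"}
        return box["result"]

    def stalled(self, max_idle: float) -> bool:
        """True if a task has been in flight for longer than max_idle seconds."""
        return self._in_flight and time.monotonic() - self.last_progress > max_idle
```

A monitor thread could call `stalled()` periodically and log a warning. The abandoned helper thread keeps running in the background, so the handler should ideally also check a deadline of its own inside its polling loop.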
Environment
- CCB version: v5.2.4 (ce20d5c)
- Platform: macOS (Darwin 25.3.0)
- Gemini CLI: gemini 0.x (exact version unknown)
- Trigger: Large code review message (~8KB diff) sent to Gemini