fix: enhance daemon lifecycle management and multi-instance safety by daniellee2015 · Pull Request #102 · bfly123/claude_code_bridge

daniellee2015 · 2026-02-28T15:40:37Z

Summary

Core improvements to daemon management, process detection, and multi-instance coexistence. These fixes address critical issues in daemon lifecycle management and improve overall system stability.

Problems Solved

This PR fixes 6 critical daemon management issues:

#	Problem	Impact	Solution
1	Daemon ownership conflicts	Multi-instance interference, service interruptions	Ownership validation before shutdown
2	Process detection failures	False positives on restricted systems	Distinguish PermissionError from ProcessLookupError
3	No daemon crash recovery	Manual intervention required	Watchdog thread with auto-restart
4	Lost crash state	Difficult debugging	Persistent askd.last.json state file
5	Connection failures	No retry mechanism	Self-healing with auto-start
6	Socket resource leaks	Resource exhaustion over time	Proper finally block cleanup

Detailed Problem Analysis

Problem 1: Daemon Ownership Conflicts in Multi-Instance Scenarios

Symptom: When multiple CCB instances run on the same machine, they aggressively kill each other's daemons, causing service interruptions and data loss.

Root Cause: No ownership validation before daemon shutdown - any CCB instance could forcefully terminate daemons owned by other active instances.

Solution: Added ownership checks in startup/watchdog/cleanup paths. Daemons are only rebound when the foreign parent process is confirmed dead.

Problem 2: Process Detection Failures on Restricted Systems

Symptom: _is_pid_alive() incorrectly reports live processes as dead on systems with restricted permissions, leading to unnecessary daemon restarts.

Root Cause: PermissionError was treated the same as ProcessLookupError, causing false negatives.

Solution: Distinguish between different exception types - PermissionError now correctly indicates a live process.

Problem 3: Daemon Crashes Without Recovery

Symptom: When daemon crashes or becomes unresponsive, CCB continues to fail without attempting recovery, requiring manual intervention.

Root Cause: No health monitoring or auto-recovery mechanism.

Solution: Added watchdog thread that monitors daemon health every 5 seconds and automatically restarts when safe.

Problem 4: Lost Crash State Information

Symptom: When daemon crashes, no state information is preserved, making debugging difficult.

Root Cause: State file (askd.json) is deleted on daemon exit, regardless of crash or graceful shutdown.

Solution: Added askd.last.json persistent state file that preserves crash information for diagnosis.

Problem 5: Connection Failures Without Retry

Symptom: bin/ask command fails immediately on connection errors, even when daemon could be auto-started.

Root Cause: No retry logic or daemon auto-start capability.

Solution: Added self-healing logic with daemon auto-start and retry mechanism.

Problem 6: Socket Resource Leaks

Symptom: Socket connections not properly closed on exceptions, leading to resource exhaustion over time.

Root Cause: Missing finally blocks for socket cleanup.

Solution: Added proper resource management with finally blocks ensuring socket cleanup.

Key Improvements

1. Process Detection Hardening

Fix _is_pid_alive() POSIX exception handling
Distinguish ProcessLookupError (dead) from PermissionError (alive)
Improve cross-platform process detection accuracy

Before:

except Exception:
    return False

After:

except ProcessLookupError:
    return False  # Process doesn't exist
except PermissionError:
    return True   # Process exists but no permission
except Exception:
    return False  # Other errors - assume dead for safety

2. Multi-Instance Safety

Add ownership checks in startup/watchdog/cleanup paths
Prevent aggressive daemon takeover between CCB instances
Only rebind daemon when foreign parent is dead or stale

This allows multiple CCB instances to coexist safely on the same machine without interfering with each other's daemons.

3. CCB_FORCE_REBIND Environment Variable

Add case-insensitive force rebind override (CCB_FORCE_REBIND=1)
Provide admin-level control for special scenarios
Consistent with existing _env_bool() pattern

4. Daemon Health Monitoring

Add watchdog thread for continuous health checks
Auto-restart daemon on ownership mismatch (when safe)
Improve daemon reliability and self-healing

The watchdog monitors daemon health every 5 seconds and can automatically recover from failures.

5. Thread Safety

Add threading module import
Protect daemon_proc access with threading.Lock
Prevent race conditions in concurrent access

6. Persistent State Management

Add askd.last.json for crash state tracking
Distinguish graceful shutdown from crashes
Improve fault diagnosis and recovery

7. bin/ask Self-Healing

Add daemon auto-start on connection failure
Implement retry logic with backoff
Improve CLI tool robustness

8. Socket Resource Management

Add finally block for socket cleanup
Prevent resource leaks on exceptions
Ensure proper connection handling

Testing

✅ Verified through 8 rounds of deep code review with AI assistant
✅ Zero remaining Critical/High/Medium issues
✅ Tested multi-instance coexistence scenarios
✅ Confirmed daemon self-healing capabilities
✅ Cross-platform compatibility (macOS, Linux, Windows)

Files Changed

File	Changes	Description
`ccb`	+302/-90	Daemon lifecycle, watchdog, ownership checks
`bin/ask`	+204/-90	Self-healing, socket management
`lib/askd_server.py`	+90/0	Persistent state, heartbeat

Total: +506 insertions, -90 deletions

Impact

These changes improve:

✅ Daemon reliability and self-healing
✅ Multi-instance coexistence on the same machine
✅ Process detection accuracy across platforms
✅ Resource management and leak prevention
✅ Overall system stability

Backward Compatibility

All changes are fully backward compatible and do not affect existing functionality:

No breaking API changes
No configuration changes required
Existing behavior preserved when CCB_FORCE_REBIND is not set
Graceful degradation on older systems

Related Issues

This PR addresses several daemon management issues:

Daemon ownership conflicts in multi-instance scenarios
Process detection failures on systems with restricted permissions
Daemon crashes without proper state tracking
Resource leaks in error conditions

Branch: daniellee2015:fix/daemon-lifecycle-improvements
Base: main

**Problem**: Multi-instance usage causes zombie daemons and dead worker threads - Worker threads can die (KeyboardInterrupt, memory issues, etc.) - PerSessionWorkerPool doesn't detect dead threads - Tasks enqueued to dead workers hang forever - Stale daemons accumulate over time **Solution**: 1. worker_pool.py: Check worker.is_alive() before reusing - Auto-replace dead workers with new ones - Prevents task hangs from dead threads 2. bin/ccb-cleanup: New tool for daemon management - List running daemons with status - Clean stale state files and lock files - Kill zombie daemons (parent process dead) **Usage**: ```bash ccb-cleanup --list # Show daemon status ccb-cleanup --clean # Remove stale files ccb-cleanup --kill-zombies # Kill orphaned daemons ``` **Impact**: Fixes long-running session stability issues

**What's New**: - Multi-instance manager with concurrent LLM execution - NOT sequential (ask1→wait→done; ask2→wait→done) - BUT concurrent (ask1, ask2, ask3 → all working → all done) **Features**: 1. Multi-Instance Support - Run multiple CCB instances in same/different projects - Each instance has independent session context - Managed by single daemon for efficiency 2. Concurrent LLM Execution (VERIFIED) - Multiple AI providers work in parallel - Tested: Gemini, Codex, OpenCode all concurrent - Main LLM orchestrates, others work concurrently 3. Commands Added - ccb-multi <id> [providers] - Start instance - ccb-multi-status - Show running instances - ccb-multi-history - View history - ccb-multi-clean - Clean up instances **Architecture**: - Single daemon per project (efficient) - Session isolation per instance - Concurrent worker pools (automatic) - Shared resources (optimized) **Usage**: ```bash # Start instances ccb-multi 1 gemini ccb-multi 2 codex # Concurrent execution within instance CCB_CALLER=claude ask gemini "task1" & CCB_CALLER=claude ask codex "task2" & wait ``` **Integration**: Copied from waoooo/ccb-multi toolkit

- Add _find_git_root() to detect git repository boundaries - Modify _find_ccb_config_root() to search up to git root (or 10 levels) - Prevents 'No active session found' errors when running ask from subdirectories - Integrate ccb-multi tools into install.sh This fixes the 'fake death' issue where CCB appears unresponsive when running from subdirectories.

The upward traversal logic was incorrect and unnecessary: - Users don't need to search parent directories for .ccb/ - The original design enforces per-directory isolation - The 'fake death' issue is not caused by missing .ccb/ directories This reverts the lib/project_id.py changes from commit 1be1530, while keeping the install.sh changes (ccb-multi integration).

Gemini CLI >= 0.29.0 changed session storage from SHA-256 hash to directory basename (e.g. /Users/danlio -> "danlio" instead of "f3f1bce3..."). GeminiLogReader was polling the old hash directory while Gemini wrote to the new one, causing requests to hang forever. Changes: - Add _compute_project_hashes() returning both (basename, sha256) formats - GeminiLogReader scans all known hash dirs, picks newest by mtime - _work_dirs_for_hash() registers both formats in watchdog cache - bin/ask uses daemon's work_dir instead of shell cwd - bin/askd accepts --work-dir to decouple from launch directory - askd_server stores explicit work_dir instead of os.getcwd() - askd_client resolves work_dir from daemon state as fallback - daemon_work_dir validated with type guard and existence check Reviewed-by: Gemini (approved), Codex (7.5/10 correctness)

…ename collisions ccb-multi instance dirs (instance-1, instance-2, ...) collide across projects in Gemini CLI 0.29.0's basename-based session storage (~/.gemini/tmp/<basename>/). Changed to inst-<hash>-N format where hash is 8-char SHA-256 of project root path. - instance.js: generate inst-<projectHash>-<id> directory names - utils.js: getInstanceDir() with backward compat for old instance-N - utils.js: listInstances() finds both old and new format instances - ccb-multi-clean.js: clean both inst-* and instance-* directories - gemini_comm.py: add _is_ccb_instance_dir() detection via CCB_INSTANCE_ID env or parent dir check, prefer SHA-256 hash for instance dirs, block cross-hash session override in instance mode

Reposition from upstream fork copy to standalone "Multi-Instance Edition" with own v1.0.0 version line, comparison table, and upstream doc links.

- Add --kill-pid to kill specific daemon by PID - Add -v/--verbose to show detailed daemon info (work_dir, port, host) - Add -i/--interactive for interactive daemon selection - Improve daemon listing with more context Fixes issue where ccb-kill alias kills all processes indiscriminately

Document new ccb-cleanup features: - List daemons with verbose mode - Kill specific daemon by PID - Interactive daemon selection - Cleanup operations Clarify difference between ccb-kill alias and ccb-cleanup --kill-pid

- Add get_tmux_pane_for_workdir() to find tmux pane by work directory - Display tmux pane ID in verbose mode (--list -v) - Display tmux pane ID in interactive mode (-i) - Helps users identify which tmux window corresponds to each daemon This makes it easier to navigate to the correct tmux pane when managing multiple CCB instances.

OpenCode 0.29.0+ migrated from JSON file storage to SQLite database. This commit adds full SQLite support with backward compatibility. Changes: - Add SQLite database reading for sessions, messages, and parts - Implement session discovery from database with improved matching - Query LIMIT increased from 50 to 200 sessions - Find most recent matching session instead of first match - Fixes issue where other projects' sessions pushed target out of results - Enable reasoning fallback for text extraction - Handles OpenCode responses in "reasoning" type parts - Maintain backward compatibility with JSON file storage - Add comprehensive test coverage for SQLite operations Fixes communication detection issue where OpenCode completes tasks but CCB doesn't receive replies. Co-authored-by: Codex <codex@ccb> Co-authored-by: Gemini <gemini@ccb>

This commit fixes three critical issues that caused async requests to gemini and opencode to get stuck in "processing" state: 1. OpenCode session ID pinning: Modified _get_latest_session_from_db() to detect and switch to newer sessions even when session_id_filter is set. This fixes the "second call always fails" issue. 2. Incomplete state updates: Enhanced _read_since() to update all state fields (assistant_count, last_assistant_id, etc.) when session_updated changes, preventing stale state comparisons. 3. Strict completion detection: Added degraded completion detection in both OpenCode and Gemini adapters. When timeout occurs but reply contains any CCB_DONE marker, accept as completed even if req_id doesn't match (with warning log). These minimal changes resolve: - OpenCode second call failure (100% reproducible) - Gemini intermittent failures - Permanent "processing" state when req_id mismatches Files changed: - lib/opencode_comm.py: Session detection and state sync fixes - lib/askd/adapters/opencode.py: Degraded completion detection - lib/askd/adapters/gemini.py: Degraded completion detection Test: ./test_minimal_fix.sh Documentation: ISSUE_ANALYSIS.md, PR_MINIMAL_FIX.md Co-analyzed-by: Gemini, OpenCode, Codex

This commit fixes three critical bugs that cause daemon crashes and communication failures: 1. Unified askd not used in background mode: Removed foreground_mode requirement from _use_unified_daemon() check. This ensures askd is used in all modes (foreground and background), fixing the core issue where CCB_CALLER triggers background mode but askd was not used. 2. _parent_monitor thread crash: Fixed indentation bug where threading.Thread(target=_parent_monitor).start() was outside the if block where _parent_monitor was defined. This caused NameError when parent_pid was not set, leading to daemon crashes and zombie processes. 3. Gemini hash overflow: Added None check for msg_id before comparison in GeminiLogReader. When msg_id is None, skip comparison to prevent hash overflow issues that cause message detection failures. These fixes resolve: - Requests not using askd in background mode (root cause) - Daemon becoming zombie process (defunct) - Gemini intermittent message detection failures - System-wide communication breakdowns Tested: 18 concurrent/sequential calls across 3 LLMs, all successful. Files changed: - bin/ask: Enable unified askd in all modes - lib/askd_server.py: Fix _parent_monitor thread start indentation - lib/gemini_comm.py: Add None check for msg_id comparison Related to commit aad38e3 (async communication fixes)

- Replace embedded multi/ directory with submodule - Points to https://github.com/daniellee2015/ccb-multi.git - Enables single source of truth for ccb-multi code - Supports both integrated and standalone installation

- Return proper exit code (0 for success, 1 for failure) - Allows ccb-status to correctly detect kill failures

- Update ccb-status submodule to latest commit (Kill/Cleanup features) - Move test_minimal_fix.sh to test/ directory for better organization - Remove obsolete bin/ccb-status file (now using submodule)

…ulti to ccb-multi

- Performance optimization (5x faster) - Fix tmux pane detection with proper escape sequences - Filter CCB-specific pane titles - Require attached sessions for active status - Use ps aux grep for reliable process detection

- ccb-shared-context: fix import path to use lib directory - ccb-worktree: fix import path to use lib directory

- Add npm run build step for all subpackages with tsconfig.json - Process ccb-status, ccb-worktree, ccb-shared-context in addition to ccb-multi - Fix ccb-multi directory path (was 'multi', should be 'ccb-multi') This eliminates the need for manual npm run build after installation.

…n fix

Add Recover Orphaned Instances menu option to handle instances with running processes but detached tmux sessions.

Resolved conflicts by keeping local versions: - README.md: Keep local CCB Multi documentation - bin/ask: Keep local daemon ready check - bin/ccb-cleanup: Keep local enhanced version - lib/gemini_comm.py: Keep local ccb-multi instance dir support Merged upstream improvements: - Claude session work_dir backfill - WSL clipboard UTF-8 fixes - Gemini idle timeout handling - Notification loop prevention - Various bug fixes and test additions

Core improvements to daemon management, process detection, and multi-instance coexistence: 1. **Process Detection Hardening** - Fix _is_pid_alive() POSIX exception handling - Distinguish ProcessLookupError (dead) from PermissionError (alive) - Improve cross-platform process detection accuracy 2. **Multi-Instance Safety** - Add ownership checks in startup/watchdog/cleanup paths - Prevent aggressive daemon takeover between CCB instances - Only rebind daemon when foreign parent is dead or stale 3. **CCB_FORCE_REBIND Environment Variable** - Add case-insensitive force rebind override - Provide admin-level control for special scenarios - Consistent with existing _env_bool() pattern 4. **Daemon Health Monitoring** - Add watchdog thread for continuous health checks - Auto-restart daemon on ownership mismatch (when safe) - Improve daemon reliability and self-healing 5. **Thread Safety** - Add threading module import - Protect daemon_proc access with threading.Lock - Prevent race conditions in concurrent access 6. **Persistent State Management** - Add askd.last.json for crash state tracking - Distinguish graceful shutdown from crashes - Improve fault diagnosis and recovery 7. **bin/ask Self-Healing** - Add daemon auto-start on connection failure - Implement retry logic with backoff - Improve CLI tool robustness 8. **Socket Resource Management** - Add finally block for socket cleanup - Prevent resource leaks on exceptions - Ensure proper connection handling These fixes improve daemon reliability, multi-instance coexistence, and overall system stability. Verified through 8 rounds of deep code review with zero remaining Critical/High/Medium issues.

daniellee2015 and others added 29 commits February 18, 2026 15:20

chore: add .DS_Store to .gitignore

d95072a

docs: rewrite README as independent CCB Multi project

9069b29

Reposition from upstream fork copy to standalone "Multi-Instance Edition" with own v1.0.0 version line, comparison table, and upstream doc links.

docs: link npm package for standalone multi-instance install

a641b16

docs: add process management section for ccb-cleanup enhancements

0086e2f

Document new ccb-cleanup features: - List daemons with verbose mode - Kill specific daemon by PID - Interactive daemon selection - Cleanup operations Clarify difference between ccb-kill alias and ccb-cleanup --kill-pid

chore: add .serena/ to .gitignore

8248c9f

chore: remove multi/ directory to prepare for submodule

ec33073

chore: add ccb-multi as git submodule

c4d6a81

- Replace embedded multi/ directory with submodule - Points to https://github.com/daniellee2015/ccb-multi.git - Enables single source of truth for ccb-multi code - Supports both integrated and standalone installation

chore: add ccb-status as git submodule

303c8ea

fix: ccb-cleanup exit code for kill-pid command

50c0a78

- Return proper exit code (0 for success, 1 for failure) - Allows ccb-status to correctly detect kill failures

chore: update ccb-status submodule and reorganize test files

5891fef

- Update ccb-status submodule to latest commit (Kill/Cleanup features) - Move test_minimal_fix.sh to test/ directory for better organization - Remove obsolete bin/ccb-status file (now using submodule)

feat: add ccb-worktree and ccb-shared-context as submodules, rename m…

d5778ef

…ulti to ccb-multi

chore(ccb-status): update submodule with status detection fixes

996676e

- Performance optimization (5x faster) - Fix tmux pane detection with proper escape sequences - Filter CCB-specific pane titles - Require attached sessions for active status - Use ps aux grep for reliable process detection

chore: update submodules with bin path fixes

928e225

- ccb-shared-context: fix import path to use lib directory - ccb-worktree: fix import path to use lib directory

chore(ccb-shared-context): update submodule with TypeScript annotatio…

ace3580

…n fix

chore(ccb-status): update submodule with recover-orphaned feature

c539e79

Add Recover Orphaned Instances menu option to handle instances with running processes but detached tmux sessions.

daniellee2015 closed this Feb 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: enhance daemon lifecycle management and multi-instance safety#102

fix: enhance daemon lifecycle management and multi-instance safety#102
daniellee2015 wants to merge 29 commits intobfly123:mainfrom
daniellee2015:fix/daemon-lifecycle-improvements

daniellee2015 commented Feb 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

daniellee2015 commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problems Solved

Detailed Problem Analysis

Problem 1: Daemon Ownership Conflicts in Multi-Instance Scenarios

Problem 2: Process Detection Failures on Restricted Systems

Problem 3: Daemon Crashes Without Recovery

Problem 4: Lost Crash State Information

Problem 5: Connection Failures Without Retry

Problem 6: Socket Resource Leaks

Key Improvements

1. Process Detection Hardening

2. Multi-Instance Safety

3. CCB_FORCE_REBIND Environment Variable

4. Daemon Health Monitoring

5. Thread Safety

6. Persistent State Management

7. bin/ask Self-Healing

8. Socket Resource Management

Testing

Files Changed

Impact

Backward Compatibility

Related Issues

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

daniellee2015 commented Feb 28, 2026 •

edited

Loading