Skip to content

fix: enhance daemon lifecycle management and multi-instance safety#103

Open
daniellee2015 wants to merge 2 commits intobfly123:mainfrom
daniellee2015:fix/daemon-lifecycle-clean
Open

fix: enhance daemon lifecycle management and multi-instance safety#103
daniellee2015 wants to merge 2 commits intobfly123:mainfrom
daniellee2015:fix/daemon-lifecycle-clean

Conversation

@daniellee2015
Copy link
Contributor

Summary

Core improvements to daemon management, process detection, multi-instance coexistence, and task lifecycle. These fixes address critical issues in daemon lifecycle management, communication reliability, and improve overall system stability.

This PR includes two major improvement areas:

  1. Daemon Lifecycle Management - Process detection, multi-instance safety, health monitoring
  2. Daemon Communication & Task Lifecycle - Completion tracking, error handling, adapter improvements

Problems Solved

This PR fixes 6 critical daemon management issues:

# Problem Impact Solution
1 Daemon ownership conflicts Multi-instance interference, service interruptions Ownership validation before shutdown
2 Process detection failures False positives on restricted systems Distinguish PermissionError from ProcessLookupError
3 No daemon crash recovery Manual intervention required Watchdog thread with auto-restart
4 Lost crash state Difficult debugging Persistent askd.last.json state file
5 Connection failures No retry mechanism Self-healing with auto-start
6 Socket resource leaks Resource exhaustion over time Proper finally block cleanup

Detailed Problem Analysis

Problem 1: Daemon Ownership Conflicts in Multi-Instance Scenarios

Symptom: When multiple CCB instances run on the same machine, they aggressively kill each other's daemons, causing service interruptions and data loss.

Root Cause: No ownership validation before daemon shutdown - any CCB instance could forcefully terminate daemons owned by other active instances.

Solution: Added ownership checks in startup/watchdog/cleanup paths. Daemons are only rebound when the foreign parent process is confirmed dead.

Problem 2: Process Detection Failures on Restricted Systems

Symptom: _is_pid_alive() incorrectly reports live processes as dead on systems with restricted permissions, leading to unnecessary daemon restarts.

Root Cause: PermissionError was treated the same as ProcessLookupError, causing false negatives.

Solution: Distinguish between different exception types - PermissionError now correctly indicates a live process.

Problem 3: Daemon Crashes Without Recovery

Symptom: When daemon crashes or becomes unresponsive, CCB continues to fail without attempting recovery, requiring manual intervention.

Root Cause: No health monitoring or auto-recovery mechanism.

Solution: Added watchdog thread that monitors daemon health every 5 seconds and automatically restarts when safe.

Problem 4: Lost Crash State Information

Symptom: When daemon crashes, no state information is preserved, making debugging difficult.

Root Cause: State file (askd.json) is deleted on daemon exit, regardless of crash or graceful shutdown.

Solution: Added askd.last.json persistent state file that preserves crash information for diagnosis.

Problem 5: Connection Failures Without Retry

Symptom: bin/ask command fails immediately on connection errors, even when daemon could be auto-started.

Root Cause: No retry logic or daemon auto-start capability.

Solution: Added self-healing logic with daemon auto-start and retry mechanism.

Problem 6: Socket Resource Leaks

Symptom: Socket connections not properly closed on exceptions, leading to resource exhaustion over time.

Root Cause: Missing finally blocks for socket cleanup.

Solution: Added proper resource management with finally blocks ensuring socket cleanup.

Key Improvements

1. Process Detection Hardening

  • Fix _is_pid_alive() POSIX exception handling
  • Distinguish ProcessLookupError (dead) from PermissionError (alive)
  • Improve cross-platform process detection accuracy

Before:

except Exception:
    return False

After:

except ProcessLookupError:
    return False  # Process doesn't exist
except PermissionError:
    return True   # Process exists but no permission
except Exception:
    return False  # Other errors - assume dead for safety

2. Multi-Instance Safety

  • Add ownership checks in startup/watchdog/cleanup paths
  • Prevent aggressive daemon takeover between CCB instances
  • Only rebind daemon when foreign parent is dead or stale

This allows multiple CCB instances to coexist safely on the same machine without interfering with each other's daemons.

3. CCB_FORCE_REBIND Environment Variable

  • Add case-insensitive force rebind override (CCB_FORCE_REBIND=1)
  • Provide admin-level control for special scenarios
  • Consistent with existing _env_bool() pattern

4. Daemon Health Monitoring

  • Add watchdog thread for continuous health checks
  • Auto-restart daemon on ownership mismatch (when safe)
  • Improve daemon reliability and self-healing

The watchdog monitors daemon health every 5 seconds and can automatically recover from failures.

5. Thread Safety

  • Add threading module import
  • Protect daemon_proc access with threading.Lock
  • Prevent race conditions in concurrent access

6. Persistent State Management

  • Add askd.last.json for crash state tracking
  • Distinguish graceful shutdown from crashes
  • Improve fault diagnosis and recovery

7. bin/ask Self-Healing

  • Add daemon auto-start on connection failure
  • Implement retry logic with backoff
  • Improve CLI tool robustness

8. Socket Resource Management

  • Add finally block for socket cleanup
  • Prevent resource leaks on exceptions
  • Ensure proper connection handling

9. Completion Hook Mechanism

  • Enhanced error handling in completion hooks
  • Better CCB_DONE marker tracking
  • Improved request ID handling

10. Worker Pool Task Lifecycle

  • Enhanced task lifecycle management
  • Improved degraded completion detection
  • Idle timeout and mismatched marker handling

11. Adapter Error Handling

  • Better error handling across all adapters (Claude, Codex, Gemini, Droid, OpenCode)
  • Improved retry logic
  • Enhanced communication reliability

12. Context Transfer & Memory

  • Improved context transfer reliability
  • Enhanced session management
  • Better memory transfer handling

Testing

  • ✅ Verified through 8 rounds of deep code review with AI assistant
  • ✅ Zero remaining Critical/High/Medium issues
  • ✅ Tested multi-instance coexistence scenarios
  • ✅ Confirmed daemon self-healing capabilities
  • ✅ Cross-platform compatibility (macOS, Linux, Windows)

Files Changed

Core Daemon Management (3 files)

File Changes Description
ccb +302/-90 Daemon lifecycle, watchdog, ownership checks
bin/ask +310/-196 Self-healing, socket management, retry logic
lib/askd_server.py +90/0 Persistent state, heartbeat thread

Daemon Communication & Task Lifecycle (15 files)

File Changes Description
lib/askd/adapters/base.py +2/0 Base adapter improvements
lib/askd/adapters/claude.py +10/-2 Claude adapter error handling
lib/askd/adapters/codex.py +81/-5 Codex adapter enhancements
lib/askd/adapters/droid.py +33/-7 Droid adapter improvements
lib/askd/adapters/gemini.py +44/-9 Gemini adapter enhancements
lib/askd/adapters/opencode.py +44/-9 OpenCode adapter improvements
lib/askd/daemon.py +10/0 Daemon core enhancements
lib/ccb_protocol.py +48/0 CCB_DONE marker tracking
lib/codex_comm.py +28/-1 Codex communication improvements
lib/completion_hook.py +13/-3 Completion hook error handling
lib/daskd_protocol.py +4/0 Droid protocol enhancements
lib/gaskd_protocol.py +4/0 Gemini protocol enhancements
lib/memory/transfer.py +15/-4 Memory transfer reliability
lib/terminal.py +22/-6 Terminal output formatting
lib/worker_pool.py +6/0 Task lifecycle management

Total: 18 files, +791 insertions, -275 deletions

Impact

These changes improve:

  • ✅ Daemon reliability and self-healing
  • ✅ Multi-instance coexistence on the same machine
  • ✅ Process detection accuracy across platforms
  • ✅ Resource management and leak prevention
  • ✅ Communication reliability between CCB and AI providers
  • ✅ Task lifecycle management and completion tracking
  • ✅ Error handling and retry mechanisms
  • ✅ Overall system stability

Backward Compatibility

All changes are fully backward compatible and do not affect existing functionality:

  • No breaking API changes
  • No configuration changes required
  • Existing behavior preserved when CCB_FORCE_REBIND is not set
  • Graceful degradation on older systems

Related Issues

This PR addresses several daemon management issues:

  • Daemon ownership conflicts in multi-instance scenarios
  • Process detection failures on systems with restricted permissions
  • Daemon crashes without proper state tracking
  • Resource leaks in error conditions

Branch: daniellee2015:fix/daemon-lifecycle-clean
Base: main
Commits: 2
Files: 18

Core improvements to daemon management, process detection, and multi-instance coexistence:

1. **Process Detection Hardening**
   - Fix _is_pid_alive() POSIX exception handling
   - Distinguish ProcessLookupError (dead) from PermissionError (alive)
   - Improve cross-platform process detection accuracy

2. **Multi-Instance Safety**
   - Add ownership checks in startup/watchdog/cleanup paths
   - Prevent aggressive daemon takeover between CCB instances
   - Only rebind daemon when foreign parent is dead or stale

3. **CCB_FORCE_REBIND Environment Variable**
   - Add case-insensitive force rebind override
   - Provide admin-level control for special scenarios
   - Consistent with existing _env_bool() pattern

4. **Daemon Health Monitoring**
   - Add watchdog thread for continuous health checks
   - Auto-restart daemon on ownership mismatch (when safe)
   - Improve daemon reliability and self-healing

5. **Thread Safety**
   - Add threading module import
   - Protect daemon_proc access with threading.Lock
   - Prevent race conditions in concurrent access

6. **Persistent State Management**
   - Add askd.last.json for crash state tracking
   - Distinguish graceful shutdown from crashes
   - Improve fault diagnosis and recovery

7. **bin/ask Self-Healing**
   - Add daemon auto-start on connection failure
   - Implement retry logic with backoff
   - Improve CLI tool robustness

8. **Socket Resource Management**
   - Add finally block for socket cleanup
   - Prevent resource leaks on exceptions
   - Ensure proper connection handling

These fixes improve daemon reliability, multi-instance coexistence, and overall system stability.

Verified through 8 rounds of deep code review with zero remaining Critical/High/Medium issues.
Improvements to daemon communication, context management, and task lifecycle:

- Enhanced completion hook mechanism with better error handling
- Improved context transfer and session management
- Better CCB_DONE marker tracking and request ID handling
- Enhanced worker pool task lifecycle management
- Improved degraded completion detection (idle timeout, mismatched markers)
- Better adapter error handling and retry logic
- Enhanced terminal output formatting
- Improved memory transfer reliability

These changes improve daemon communication reliability and task management.
@bfly123
Copy link
Owner

bfly123 commented Mar 2, 2026

两个pr都需要合并吗?

@daniellee2015
Copy link
Contributor Author

两个pr都需要合并吗?
可以先合daemon-lifecycle 这个 2a4839c 分支,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants