fix: enhance daemon lifecycle management and multi-instance safety#103
Open
daniellee2015 wants to merge 2 commits intobfly123:mainfrom
Open
fix: enhance daemon lifecycle management and multi-instance safety#103daniellee2015 wants to merge 2 commits intobfly123:mainfrom
daniellee2015 wants to merge 2 commits intobfly123:mainfrom
Conversation
Core improvements to daemon management, process detection, and multi-instance coexistence: 1. **Process Detection Hardening** - Fix _is_pid_alive() POSIX exception handling - Distinguish ProcessLookupError (dead) from PermissionError (alive) - Improve cross-platform process detection accuracy 2. **Multi-Instance Safety** - Add ownership checks in startup/watchdog/cleanup paths - Prevent aggressive daemon takeover between CCB instances - Only rebind daemon when foreign parent is dead or stale 3. **CCB_FORCE_REBIND Environment Variable** - Add case-insensitive force rebind override - Provide admin-level control for special scenarios - Consistent with existing _env_bool() pattern 4. **Daemon Health Monitoring** - Add watchdog thread for continuous health checks - Auto-restart daemon on ownership mismatch (when safe) - Improve daemon reliability and self-healing 5. **Thread Safety** - Add threading module import - Protect daemon_proc access with threading.Lock - Prevent race conditions in concurrent access 6. **Persistent State Management** - Add askd.last.json for crash state tracking - Distinguish graceful shutdown from crashes - Improve fault diagnosis and recovery 7. **bin/ask Self-Healing** - Add daemon auto-start on connection failure - Implement retry logic with backoff - Improve CLI tool robustness 8. **Socket Resource Management** - Add finally block for socket cleanup - Prevent resource leaks on exceptions - Ensure proper connection handling These fixes improve daemon reliability, multi-instance coexistence, and overall system stability. Verified through 8 rounds of deep code review with zero remaining Critical/High/Medium issues.
Improvements to daemon communication, context management, and task lifecycle: - Enhanced completion hook mechanism with better error handling - Improved context transfer and session management - Better CCB_DONE marker tracking and request ID handling - Enhanced worker pool task lifecycle management - Improved degraded completion detection (idle timeout, mismatched markers) - Better adapter error handling and retry logic - Enhanced terminal output formatting - Improved memory transfer reliability These changes improve daemon communication reliability and task management.
Owner
|
两个pr都需要合并吗? |
Contributor
Author
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Core improvements to daemon management, process detection, multi-instance coexistence, and task lifecycle. These fixes address critical issues in daemon lifecycle management, communication reliability, and improve overall system stability.
This PR includes two major improvement areas:
Problems Solved
This PR fixes 6 critical daemon management issues:
Detailed Problem Analysis
Problem 1: Daemon Ownership Conflicts in Multi-Instance Scenarios
Symptom: When multiple CCB instances run on the same machine, they aggressively kill each other's daemons, causing service interruptions and data loss.
Root Cause: No ownership validation before daemon shutdown - any CCB instance could forcefully terminate daemons owned by other active instances.
Solution: Added ownership checks in startup/watchdog/cleanup paths. Daemons are only rebound when the foreign parent process is confirmed dead.
Problem 2: Process Detection Failures on Restricted Systems
Symptom:
_is_pid_alive()incorrectly reports live processes as dead on systems with restricted permissions, leading to unnecessary daemon restarts.Root Cause:
PermissionErrorwas treated the same asProcessLookupError, causing false negatives.Solution: Distinguish between different exception types -
PermissionErrornow correctly indicates a live process.Problem 3: Daemon Crashes Without Recovery
Symptom: When daemon crashes or becomes unresponsive, CCB continues to fail without attempting recovery, requiring manual intervention.
Root Cause: No health monitoring or auto-recovery mechanism.
Solution: Added watchdog thread that monitors daemon health every 5 seconds and automatically restarts when safe.
Problem 4: Lost Crash State Information
Symptom: When daemon crashes, no state information is preserved, making debugging difficult.
Root Cause: State file (
askd.json) is deleted on daemon exit, regardless of crash or graceful shutdown.Solution: Added
askd.last.jsonpersistent state file that preserves crash information for diagnosis.Problem 5: Connection Failures Without Retry
Symptom:
bin/askcommand fails immediately on connection errors, even when daemon could be auto-started.Root Cause: No retry logic or daemon auto-start capability.
Solution: Added self-healing logic with daemon auto-start and retry mechanism.
Problem 6: Socket Resource Leaks
Symptom: Socket connections not properly closed on exceptions, leading to resource exhaustion over time.
Root Cause: Missing finally blocks for socket cleanup.
Solution: Added proper resource management with finally blocks ensuring socket cleanup.
Key Improvements
1. Process Detection Hardening
_is_pid_alive()POSIX exception handlingProcessLookupError(dead) fromPermissionError(alive)Before:
After:
2. Multi-Instance Safety
This allows multiple CCB instances to coexist safely on the same machine without interfering with each other's daemons.
3. CCB_FORCE_REBIND Environment Variable
CCB_FORCE_REBIND=1)_env_bool()pattern4. Daemon Health Monitoring
The watchdog monitors daemon health every 5 seconds and can automatically recover from failures.
5. Thread Safety
threadingmodule importdaemon_procaccess withthreading.Lock6. Persistent State Management
askd.last.jsonfor crash state tracking7. bin/ask Self-Healing
8. Socket Resource Management
9. Completion Hook Mechanism
10. Worker Pool Task Lifecycle
11. Adapter Error Handling
12. Context Transfer & Memory
Testing
Files Changed
Core Daemon Management (3 files)
ccbbin/asklib/askd_server.pyDaemon Communication & Task Lifecycle (15 files)
lib/askd/adapters/base.pylib/askd/adapters/claude.pylib/askd/adapters/codex.pylib/askd/adapters/droid.pylib/askd/adapters/gemini.pylib/askd/adapters/opencode.pylib/askd/daemon.pylib/ccb_protocol.pylib/codex_comm.pylib/completion_hook.pylib/daskd_protocol.pylib/gaskd_protocol.pylib/memory/transfer.pylib/terminal.pylib/worker_pool.pyTotal: 18 files, +791 insertions, -275 deletions
Impact
These changes improve:
Backward Compatibility
All changes are fully backward compatible and do not affect existing functionality:
CCB_FORCE_REBINDis not setRelated Issues
This PR addresses several daemon management issues:
Branch:
daniellee2015:fix/daemon-lifecycle-cleanBase:
mainCommits: 2
Files: 18