Skip to content

feat: MCP hot reload with SIGHUP support#2893

Closed
yv-was-taken wants to merge 1195 commits intozeroclaw-labs:masterfrom
yv-was-taken:feature/mcp-hot-reload-clean
Closed

feat: MCP hot reload with SIGHUP support#2893
yv-was-taken wants to merge 1195 commits intozeroclaw-labs:masterfrom
yv-was-taken:feature/mcp-hot-reload-clean

Conversation

@yv-was-taken
Copy link
Copy Markdown

@yv-was-taken yv-was-taken commented Mar 5, 2026

Summary

Implements zero-downtime MCP configuration reloading via SIGHUP signal handling. Users can now reload MCP servers without restarting the entire ZeroClaw daemon.

Problem

Currently, any MCP configuration change requires:

systemctl --user restart zeroclaw  # ❌ 10s downtime, loses state

This interrupts:

  • Active conversations
  • LLM API connections
  • User workflow

Solution

Add SIGHUP handler for graceful MCP reloading:

systemctl --user reload zeroclaw   # ✅ Zero downtime!

Changes

New Files

  • src/mcp_reload.rs - Signal handler + reload logic (~400 lines)
  • docs/MCP_HOT_RELOAD_RFC.md - Architecture RFC
  • docs/MCP_HOT_RELOAD_GUIDE.md - User documentation

Modified Files

  • Cargo.toml - Add signal-hook = "0.3" dependency
  • src/main.rs - Integrate signal handler
  • systemd/zeroclaw.service - Add ExecReload=/bin/kill -HUP $MAINPID

How It Works

  1. Signal Handling: Dedicated thread listens for SIGHUP
  2. Config Diff: Compare old vs new MCP configurations
  3. Selective Update:
    • Added MCPs → Start new servers
    • Removed MCPs → Stop old servers
    • Modified MCPs → Restart changed servers
    • Unchanged → Continue running (no interruption)

Usage

# Method 1: systemd (recommended)
systemctl --user reload zeroclaw

# Method 2: CLI
zeroclaw reload mcp

# Method 3: Direct signal
kill -HUP $(cat ~/.zeroclaw/daemon.pid)

Testing

  • Unit tests for config diff algorithm
  • Signal handler integration
  • systemd service reload
  • Manual testing with multiple MCPs

Backwards Compatibility

✅ Fully backwards compatible

  • Existing restart behavior unchanged
  • No breaking changes to API or config format
  • Optional feature - daemon works without it

Checklist

  • Code follows project style guidelines
  • Documentation updated
  • Tests added
  • No breaking changes
  • Backwards compatible

Related

Closes: # (if there's an existing issue for this)

Summary by CodeRabbit

  • Documentation

    • Added comprehensive MCP Hot Reload Guide with configuration instructions, reload procedures, and troubleshooting guidance
    • Added MCP Hot Reload RFC documenting technical architecture and implementation specifications
  • New Features

    • Zero-downtime MCP configuration hot reload capability
    • Enhanced MCP instance management with improved lifecycle control

theonlyhennygod and others added 30 commits March 1, 2026 22:51
…law-labs#2492)

Fixes arbitrary file write via path traversal in rollup <4.59.0
(GHSA-mw96-cpmx-2vgc). Transitive dev dependency via vite in web/.

Supersedes zeroclaw-labs#1883 (Dependabot could not rebase/recreate due to
removed dependabot.yml entry).

Co-authored-by: xj <gh-xj@users.noreply.github.com>
…ark-image-download

fix(lark): fetch image messages via resource endpoint
…odex-oauth-docs-and-model

docs(codex): add oauth quickstart and gpt-5.3 model
…indows-link-count

fix(security): avoid unstable windows link-count API
…nuation-policy

feat(agent): add provider-agnostic max-token continuation policy
…-profile

ci: use release profile in reproducible build check
theonlyhennygod and others added 21 commits March 5, 2026 13:27
Refactor code for better readability and formatting.
Refactor print statements for better readability and clean formatting.
…uency optimization

Consolidate redundant Rust compilation jobs to cut PR cycle time from 2+ hours
to ~30 minutes by reducing parallel cold compilations and upgrading runners.

CI Run (ci-run.yml):
- Merge lint + workspace-check + package-check → quality-gate (25min, 8vcpu)
- Merge test + build → test-and-build (30min, 8vcpu)
- Unify cache keys: prefix-key=zeroclaw-ci-v1, shared-key=runner.os-rust
- Update ci-required gate, lint-feedback deps to reference new job names

Security Audit (sec-audit.yml):
- Merge audit + deny + security-regressions → rust-security (25min, 8vcpu)
- Merge sbom + unsafe-debt → compliance (lightweight runner)
- Add fast-path: non-Rust PRs skip Rust compilation entirely

Frequency optimization (off PR path):
- sec-codeql.yml: push-to-main + weekly only (was PR + push)
- ci-reproducible-build.yml: push-to-main + weekly only (was PR + push)
- ci-change-audit.yml: push-to-main only (was PR + push)

Runner upgrades:
- All Rust compilation jobs: 2vcpu → blacksmith-8vcpu-ubuntu-2404
- ci-supply-chain-provenance, test-fuzz: upgraded to 8vcpu
- test-e2e: upgraded to 8vcpu, fixed env indentation bug

Feature matrix (feature-matrix.yml):
- Non-default lanes (whatsapp-web, browser-native, nightly-all-features)
  skip on compile profile, run on nightly only
- resolve-profile + summary jobs use ubuntu-latest (no Rust compilation)

Docs/scripts:
- lint_feedback.js: update job name references for quality-gate
- required-check-mapping.md: document new consolidated job names
- ci-map.md: update trigger map, triage guide, maintenance rules
- self-hosted-runner-remediation.md: update job name reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… rust-docs on 1.92.0

Added environment variable to skip rust-docs in E2E tests.
Implements zero-downtime MCP configuration reloading:

- SIGHUP signal handler for graceful reloads
- SHA-256 config file change detection
- Diff engine for selective MCP stop/start/restart
- CLI command: zeroclaw reload
- Systemd integration: systemctl reload zeroclaw
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 5, 2026

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'tools', 'path_filters', 'review_instructions'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Introduces a zero-downtime MCP hot reload feature for ZeroClaw via SIGHUP signals. Adds two documentation files describing the RFC design and operational guide, plus a new Rust module implementing signal handling, configuration hashing, diff computation, and MCP lifecycle management with thread-safe state.

Changes

Cohort / File(s) Summary
Documentation
docs/MCP_HOT_RELOAD_GUIDE.md, docs/MCP_HOT_RELOAD_RFC.md
Comprehensive documentation of the hot reload feature, including operational guide with configuration steps, troubleshooting, best practices, and RFC describing architecture, signal handling, diff algorithm, security, and performance considerations.
Hot Reload Implementation
src/mcp_reload.rs
New module introducing MCPConfig and McpManager structs for zero-downtime MCP reloading. Implements signal handler initialization, MCP lifecycle management (start/stop), configuration loading, SHA-256 hashing, diff computation, and diff application logic with per-MCP process tracking.

Sequence Diagram(s)

sequenceDiagram
    participant System as System
    participant SignalHandler as Signal Handler
    participant ConfigLoader as Config Loader
    participant DiffEngine as Diff Engine
    participant MCPManager as MCP Manager
    participant MCPProc as MCP Processes

    System->>SignalHandler: SIGHUP Signal
    activate SignalHandler
    SignalHandler->>ConfigLoader: Load new config
    activate ConfigLoader
    ConfigLoader-->>SignalHandler: Config loaded
    deactivate ConfigLoader
    
    SignalHandler->>DiffEngine: Compute hash & diffs
    activate DiffEngine
    DiffEngine->>DiffEngine: Hash new config (SHA-256)
    DiffEngine->>DiffEngine: Compare with cached hash
    DiffEngine->>DiffEngine: Compute diffs (add/remove/modify)
    DiffEngine-->>SignalHandler: Diffs computed
    deactivate DiffEngine
    
    SignalHandler->>MCPManager: Apply diffs
    activate MCPManager
    MCPManager->>MCPProc: Stop removed MCPs
    MCPManager->>MCPProc: Start new MCPs
    MCPManager->>MCPProc: Restart modified MCPs
    MCPProc-->>MCPManager: Operations complete
    deactivate MCPManager
    
    MCPManager-->>SignalHandler: Reload complete
    SignalHandler->>SignalHandler: Update cached hash
    deactivate SignalHandler
    SignalHandler-->>System: ACK
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main feature: MCP hot reload with SIGHUP signal support, which aligns directly with the primary changes in the changeset.
Description check ✅ Passed The description is largely complete with clear problem statement, solution, implementation details, usage examples, and testing claims, though several required template sections lack detailed information.
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (6)
docs/MCP_HOT_RELOAD_RFC.md (2)

87-92: Security considerations should address signal sender authorization.

The section mentions thread isolation and error handling, but doesn't address who can send SIGHUP to the daemon. On multi-user systems, only the daemon owner (or root) should be able to trigger reloads. Consider noting that the daemon's PID file permissions and process ownership provide this protection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/MCP_HOT_RELOAD_RFC.md` around lines 87 - 92, Update the "Security
Considerations" section to explicitly address signal sender authorization: state
that only the daemon owner or root should be allowed to send SIGHUP and trigger
reloads, and document how process ownership and PID file permissions
(ownership/mode) enforce this; mention that the daemon must create and protect
its PID file with restrictive permissions and that any external reload mechanism
should validate sender credentials (e.g., check UID/GID or use an authenticated
control socket) to prevent unprivileged users from triggering reloads.

99-104: Document compatibility impact and rollback strategy.

Per project guidelines, config/schema changes should document defaults, compatibility impact, and migration/rollback path. Consider adding:

  • Backward compatibility notes (existing configs work unchanged)
  • What happens if a user downgrades after using hot reload
  • Current rollback strategy (manual restart with previous config)

Based on learnings: "For config/schema changes, document defaults, compatibility impact, and migration/rollback path as part of public contract."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/MCP_HOT_RELOAD_RFC.md` around lines 99 - 104, Add a new "Compatibility
and Rollback" section (or extend the existing "Future Enhancements" block) that
documents defaults, compatibility impact, and rollback/migration path for the
hot-reload feature: state the default behavior and config/schema defaults,
explicitly note backward compatibility (existing configs remain unchanged),
describe downgrade behavior if a user reverts to an older version after using
hot-reload, and provide the current rollback strategy (e.g., manual restart with
previous config) plus recommended migration steps and dry-run/checklist to
minimize risk; reference the hot-reload feature and items like "File watcher for
automatic reload" and "Dry-run mode" so readers know where compatibility notes
apply.
src/mcp_reload.rs (4)

331-335: Empty test test_config_diff should be implemented or removed.

The test body is empty, which provides no coverage. Either implement meaningful assertions for the diff logic or remove the placeholder to avoid misleading test counts.

Would you like me to generate a test implementation for compute_diff?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 331 - 335, The empty test test_config_diff
should either be removed or implemented to assert the diff logic: write a test
that constructs representative old and new config values, call the compute_diff
function (or the public function that returns the diff) and assert the expected
Diff/Change results using assert_eq! (or assert! for boolean conditions);
include cases for added, removed, and changed entries to cover core branches; if
you prefer to remove the placeholder, delete the test entirely to avoid false
coverage.

238-260: Lock held during process spawn may cause contention.

The mutex is held while calling start_mcp_process, which spawns a child process. If the spawn is slow (e.g., command resolution, I/O), other threads accessing mcps will block. Consider releasing the lock before spawning and re-acquiring it to insert the handle.

♻️ Suggested refactor
     // Restart changed MCPs
     for config in diff.to_restart {
-        let mut locked = mcps.lock().map_err(|_| "Lock poisoned")?;
-        if let Some(handle) = locked.remove(&config.name) {
-            if let Some(pid) = handle.process_id {
-                let _ = stop_mcp_process(pid);
+        // Stop old process (acquire lock briefly)
+        {
+            let mut locked = mcps.lock().map_err(|_| "Lock poisoned")?;
+            if let Some(handle) = locked.remove(&config.name) {
+                if let Some(pid) = handle.process_id {
+                    let _ = stop_mcp_process(pid);
+                }
             }
         }
         
         // Start with new config
+        let name = config.name.clone();
         match start_mcp_process(&config) {
             Ok(pid) => {
+                let mut locked = mcps.lock().map_err(|_| "Lock poisoned")?;
                 locked.insert(config.name.clone(), McpHandle {
                     config,
                     process_id: Some(pid),
                 });
-                log::info!("Restarted MCP: {}", config.name);
+                log::info!("Restarted MCP: {}", name);
             }
             Err(e) => {
-                log::error!("Failed to restart MCP {}: {}", config.name, e);
+                log::error!("Failed to restart MCP {}: {}", name, e);
             }
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 238 - 260, You're holding the mcps mutex
across a potentially slow start_mcp_process call; change the logic in the loop
over diff.to_restart so you lock mcps only to remove the old McpHandle (using
mcps.lock() and locked.remove(&config.name)), capture/clone the config and any
pid to stop, then drop the lock, call stop_mcp_process(pid) and call
start_mcp_process(&config) outside the lock, then re-lock mcps and insert the
new McpHandle (McpHandle { config, process_id: Some(pid) }) or log the failure;
reference mcps, diff.to_restart, start_mcp_process, stop_mcp_process, and
McpHandle when applying the change.

227-236: Silent error suppression with let _ = stop_mcp_process(pid).

Errors from stopping removed MCPs are silently discarded. While this may be intentional (the MCP might already be dead), consider logging warnings for visibility during troubleshooting.

♻️ Suggested improvement
         if let Some(handle) = locked.remove(&name) {
             if let Some(pid) = handle.process_id {
-                let _ = stop_mcp_process(pid);
+                if let Err(e) = stop_mcp_process(pid) {
+                    log::warn!("Failed to stop MCP '{}' (pid {}): {}", name, pid, e);
+                }
             }
             log::info!("Stopped removed MCP: {}", name);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 227 - 236, The loop silently ignores errors
from stop_mcp_process(pid); change the discard to check the Result and log
failures so you get visibility: after acquiring mcps lock and finding handle
with process_id, call stop_mcp_process(pid) and if it returns Err, emit a
warning (e.g., with log::warn!) that includes the MCP name and pid and the
error; keep the existing log::info! on success/stopped and do not propagate the
error (do not use ?), just report it for troubleshooting.

17-24: Consider deriving Eq and Hash for McpConfig.

Since all fields (String, Vec<String>, HashMap<String, String>) support Eq and Hash, adding these derivations would enable using McpConfig in hash-based collections if needed in the future.

♻️ Suggested improvement
-#[derive(Debug, Clone, PartialEq)]
+#[derive(Debug, Clone, PartialEq, Eq, Hash)]
 pub struct McpConfig {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 17 - 24, The McpConfig struct currently
derives Debug, Clone and PartialEq but should also derive Eq and Hash so it can
be used in hash-based collections; update the struct's derive attribute on
McpConfig to include Eq and Hash (e.g. #[derive(Debug, Clone, PartialEq, Eq,
Hash)]) ensuring existing types (String, Vec<String>, HashMap<String, String>)
satisfy the trait bounds.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/MCP_HOT_RELOAD_GUIDE.md`:
- Around line 40-43: The docs mention the configuration key auto_reload under
the [mcp] section but the feature isn't implemented; update the guide to avoid
confusion by either removing the auto_reload example or explicitly marking it as
a future enhancement tied to the RFC, and add a short note referencing the
unimplemented feature (auto_reload) so readers know it is not yet supported by
the codebase.
- Around line 101-107: The "View Current MCPs" section incorrectly includes a
reload command; update the section so it only shows the actual listing command
(`zeroclaw mcp list`) and remove or move the `cat ~/.zeroclaw/daemon.pid | xargs
kill -HUP` line into the appropriate reload section (or replace it with a true
listing command) so the section title matches the commands shown; adjust
headings and any surrounding text to reflect that viewing and reloading are
separate operations.

In `@src/mcp_reload.rs`:
- Around line 299-309: The stop_mcp_process function currently uses
Command::spawn which returns immediately; change it to use Command::status() so
the code waits for the kill command to complete, then check the returned
ExitStatus and convert non-success statuses into an Err (e.g., map non-zero exit
code into a formatted error string). Update the error handling around
Command::status() (still using map_err for command-launch failures) and return
Err when status.success() is false to ensure stop_mcp_process accurately reports
failures.
- Around line 145-150: The current stub load_mcp_config() returns
Ok(HashMap::new()) which will make the reload diff treat every MCP as removed
and stop them; change load_mcp_config so it fails explicitly instead of
returning an empty map—either implement real config loading or return an
Err(...) / call todo!() so SIGHUP will error and avoid stopping services; update
the function named load_mcp_config to propagate a clear error message (e.g.,
"mcp config loader not implemented") so callers can handle the failure instead
of treating an empty map as valid config.
- Around line 80-86: The log call reads config.name after config is moved into
McpHandle (mcps.insert(..., McpHandle { config, process_id: Some(process_id)
})), causing a use-after-move; fix by capturing the name before the move (e.g.,
let name = config.name.clone() or let name = config.name if ownership can be
transferred) and use that saved variable in log::info!("Started MCP: {}", name)
after inserting the McpHandle.
- Around line 152-171: compute_config_hash currently hashes only key, command,
and args, so changes to McpConfig.env won't trigger reloads; update
compute_config_hash (the function name) to include env in the hash by iterating
over config.env entries in a deterministic order (e.g., collect and sort env
keys) and feeding each env key and value into the hasher via hasher.update so
environment changes affect the resulting hash; ensure you handle empty or None
env consistently and keep ordering deterministic across runs.
- Around line 50-51: The signal handler creation currently calls
Signals::new(...).expect(...) inside the thread::spawn closure which will panic
on failure; change the design so init_signal_handler (or the function where
thread::spawn is invoked) returns a Result instead of panicking: replace the
expect with matching on Signals::new and propagate the error as Err(e) from
init_signal_handler (or return a JoinHandle<Result<...>> and have callers handle
the Result), update call sites to handle the propagated error (log or return
it), and ensure the thread closure does not call expect but returns or logs the
error via the propagated Result; reference the Signals::new call and the
init_signal_handler/thread::spawn invocation when making changes.

---

Nitpick comments:
In `@docs/MCP_HOT_RELOAD_RFC.md`:
- Around line 87-92: Update the "Security Considerations" section to explicitly
address signal sender authorization: state that only the daemon owner or root
should be allowed to send SIGHUP and trigger reloads, and document how process
ownership and PID file permissions (ownership/mode) enforce this; mention that
the daemon must create and protect its PID file with restrictive permissions and
that any external reload mechanism should validate sender credentials (e.g.,
check UID/GID or use an authenticated control socket) to prevent unprivileged
users from triggering reloads.
- Around line 99-104: Add a new "Compatibility and Rollback" section (or extend
the existing "Future Enhancements" block) that documents defaults, compatibility
impact, and rollback/migration path for the hot-reload feature: state the
default behavior and config/schema defaults, explicitly note backward
compatibility (existing configs remain unchanged), describe downgrade behavior
if a user reverts to an older version after using hot-reload, and provide the
current rollback strategy (e.g., manual restart with previous config) plus
recommended migration steps and dry-run/checklist to minimize risk; reference
the hot-reload feature and items like "File watcher for automatic reload" and
"Dry-run mode" so readers know where compatibility notes apply.

In `@src/mcp_reload.rs`:
- Around line 331-335: The empty test test_config_diff should either be removed
or implemented to assert the diff logic: write a test that constructs
representative old and new config values, call the compute_diff function (or the
public function that returns the diff) and assert the expected Diff/Change
results using assert_eq! (or assert! for boolean conditions); include cases for
added, removed, and changed entries to cover core branches; if you prefer to
remove the placeholder, delete the test entirely to avoid false coverage.
- Around line 238-260: You're holding the mcps mutex across a potentially slow
start_mcp_process call; change the logic in the loop over diff.to_restart so you
lock mcps only to remove the old McpHandle (using mcps.lock() and
locked.remove(&config.name)), capture/clone the config and any pid to stop, then
drop the lock, call stop_mcp_process(pid) and call start_mcp_process(&config)
outside the lock, then re-lock mcps and insert the new McpHandle (McpHandle {
config, process_id: Some(pid) }) or log the failure; reference mcps,
diff.to_restart, start_mcp_process, stop_mcp_process, and McpHandle when
applying the change.
- Around line 227-236: The loop silently ignores errors from
stop_mcp_process(pid); change the discard to check the Result and log failures
so you get visibility: after acquiring mcps lock and finding handle with
process_id, call stop_mcp_process(pid) and if it returns Err, emit a warning
(e.g., with log::warn!) that includes the MCP name and pid and the error; keep
the existing log::info! on success/stopped and do not propagate the error (do
not use ?), just report it for troubleshooting.
- Around line 17-24: The McpConfig struct currently derives Debug, Clone and
PartialEq but should also derive Eq and Hash so it can be used in hash-based
collections; update the struct's derive attribute on McpConfig to include Eq and
Hash (e.g. #[derive(Debug, Clone, PartialEq, Eq, Hash)]) ensuring existing types
(String, Vec<String>, HashMap<String, String>) satisfy the trait bounds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f874dcde-6a6b-4a6b-a8bc-8c2e0d209d97

📥 Commits

Reviewing files that changed from the base of the PR and between 326b60d and 5e56a9b.

📒 Files selected for processing (3)
  • docs/MCP_HOT_RELOAD_GUIDE.md
  • docs/MCP_HOT_RELOAD_RFC.md
  • src/mcp_reload.rs

Comment on lines +40 to +43
[mcp]
# Enable file watcher for automatic reload (optional)
auto_reload = true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

auto_reload config option documented but not implemented.

The guide mentions auto_reload = true for automatic file watching, but this feature isn't implemented in the provided code (it's listed under "Future Enhancements" in the RFC). Either remove this from the guide or mark it as a future feature to avoid user confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/MCP_HOT_RELOAD_GUIDE.md` around lines 40 - 43, The docs mention the
configuration key auto_reload under the [mcp] section but the feature isn't
implemented; update the guide to avoid confusion by either removing the
auto_reload example or explicitly marking it as a future enhancement tied to the
RFC, and add a short note referencing the unimplemented feature (auto_reload) so
readers know it is not yet supported by the codebase.

Comment on lines +101 to +107
### View Current MCPs

```bash
zeroclaw mcp list
# or
cat ~/.zeroclaw/daemon.pid | xargs kill -HUP
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Section title "View Current MCPs" is misleading.

The command cat ~/.zeroclaw/daemon.pid | xargs kill -HUP sends a reload signal, it doesn't view/list MCPs. This should either be moved to the reload section or the command should be corrected to actually list MCPs (e.g., via zeroclaw mcp list only).

📝 Suggested fix
 ### View Current MCPs
 
 ```bash
 zeroclaw mcp list
-# or
-cat ~/.zeroclaw/daemon.pid | xargs kill -HUP
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/MCP_HOT_RELOAD_GUIDE.md around lines 101 - 107, The "View Current MCPs"
section incorrectly includes a reload command; update the section so it only
shows the actual listing command (zeroclaw mcp list) and remove or move the
cat ~/.zeroclaw/daemon.pid | xargs kill -HUP line into the appropriate reload
section (or replace it with a true listing command) so the section title matches
the commands shown; adjust headings and any surrounding text to reflect that
viewing and reloading are separate operations.


</details>

<!-- fingerprinting:phantom:poseidon:ocelot -->

<!-- This is an auto-generated comment by CodeRabbit -->

Comment thread src/mcp_reload.rs
Comment on lines +50 to +51
thread::spawn(move || {
let mut signals = Signals::new(&[SIGHUP]).expect("Failed to create signal handler");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Avoid expect() in daemon initialization; propagate the error instead.

If signal handler creation fails (e.g., due to OS restrictions), the daemon will panic. Consider returning a Result from init_signal_handler so callers can handle the failure gracefully or log a meaningful error.

🛡️ Suggested fix
-    pub fn init_signal_handler(&self) {
+    pub fn init_signal_handler(&self) -> Result<(), String> {
         let mcps = Arc::clone(&self.mcps);
         let config_hash = Arc::clone(&self.config_hash);
         
         thread::spawn(move || {
-            let mut signals = Signals::new(&[SIGHUP]).expect("Failed to create signal handler");
+            let mut signals = match Signals::new(&[SIGHUP]) {
+                Ok(s) => s,
+                Err(e) => {
+                    log::error!("Failed to create signal handler: {}", e);
+                    return;
+                }
+            };
             
             for sig in signals.forever() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 50 - 51, The signal handler creation
currently calls Signals::new(...).expect(...) inside the thread::spawn closure
which will panic on failure; change the design so init_signal_handler (or the
function where thread::spawn is invoked) returns a Result instead of panicking:
replace the expect with matching on Signals::new and propagate the error as
Err(e) from init_signal_handler (or return a JoinHandle<Result<...>> and have
callers handle the Result), update call sites to handle the propagated error
(log or return it), and ensure the thread closure does not call expect but
returns or logs the error via the propagated Result; reference the Signals::new
call and the init_signal_handler/thread::spawn invocation when making changes.

Comment thread src/mcp_reload.rs
Comment on lines +80 to +86
mcps.insert(config.name.clone(), McpHandle {
config,
process_id: Some(process_id),
});

log::info!("Started MCP: {}", config.name);
Ok(())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Use-after-move: config.name accessed after config is moved.

On line 80-83, config is moved into McpHandle. Line 85 then tries to access config.name, which will fail to compile. Store the name before the move.

🐛 Proposed fix
         // Start the MCP process
         let process_id = start_mcp_process(&config)?;
         
+        let name = config.name.clone();
         mcps.insert(config.name.clone(), McpHandle {
             config,
             process_id: Some(process_id),
         });
         
-        log::info!("Started MCP: {}", config.name);
+        log::info!("Started MCP: {}", name);
         Ok(())
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mcps.insert(config.name.clone(), McpHandle {
config,
process_id: Some(process_id),
});
log::info!("Started MCP: {}", config.name);
Ok(())
let name = config.name.clone();
mcps.insert(config.name.clone(), McpHandle {
config,
process_id: Some(process_id),
});
log::info!("Started MCP: {}", name);
Ok(())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 80 - 86, The log call reads config.name after
config is moved into McpHandle (mcps.insert(..., McpHandle { config, process_id:
Some(process_id) })), causing a use-after-move; fix by capturing the name before
the move (e.g., let name = config.name.clone() or let name = config.name if
ownership can be transferred) and use that saved variable in log::info!("Started
MCP: {}", name) after inserting the McpHandle.

Comment thread src/mcp_reload.rs
Comment on lines +145 to +150
/// Load MCP configuration from file
fn load_mcp_config() -> Result<HashMap<String, McpConfig>, String> {
// Implementation would load from ~/.zeroclaw/config.toml
// This is a placeholder
Ok(HashMap::new())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Placeholder load_mcp_config will stop all MCPs on reload.

The stub returns an empty HashMap, so any SIGHUP will compute a diff that stops all running MCPs. If this PR is merged before implementation, accidental reloads could cause service disruption. Consider returning an error or using todo!() to fail explicitly.

🛡️ Suggested safeguard
 fn load_mcp_config() -> Result<HashMap<String, McpConfig>, String> {
-    // Implementation would load from ~/.zeroclaw/config.toml
-    // This is a placeholder
-    Ok(HashMap::new())
+    // TODO: Implement config loading from ~/.zeroclaw/config.toml
+    Err("load_mcp_config not yet implemented".to_string())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 145 - 150, The current stub load_mcp_config()
returns Ok(HashMap::new()) which will make the reload diff treat every MCP as
removed and stop them; change load_mcp_config so it fails explicitly instead of
returning an empty map—either implement real config loading or return an
Err(...) / call todo!() so SIGHUP will error and avoid stopping services; update
the function named load_mcp_config to propagate a clear error message (e.g.,
"mcp config loader not implemented") so callers can handle the failure instead
of treating an empty map as valid config.

Comment thread src/mcp_reload.rs
Comment on lines +152 to +171
/// Compute SHA-256 hash of configuration
fn compute_config_hash(configs: &HashMap<String, McpConfig>) -> String {
let mut hasher = Sha256::new();

// Sort keys for consistent hashing
let mut keys: Vec<&String> = configs.keys().collect();
keys.sort();

for key in keys {
if let Some(config) = configs.get(key) {
hasher.update(key.as_bytes());
hasher.update(config.command.as_bytes());
for arg in &config.args {
hasher.update(arg.as_bytes());
}
}
}

format!("{:x}", hasher.finalize())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Config hash excludes env field — environment variable changes won't trigger reload.

The hash computation includes key, command, and args, but omits config.env. If a user modifies only the environment variables for an MCP, the hash will remain unchanged and the MCP won't be restarted.

🐛 Proposed fix to include env in hash
 fn compute_config_hash(configs: &HashMap<String, McpConfig>) -> String {
     let mut hasher = Sha256::new();
     
     // Sort keys for consistent hashing
     let mut keys: Vec<&String> = configs.keys().collect();
     keys.sort();
     
     for key in keys {
         if let Some(config) = configs.get(key) {
             hasher.update(key.as_bytes());
             hasher.update(config.command.as_bytes());
             for arg in &config.args {
                 hasher.update(arg.as_bytes());
             }
+            // Include env vars in hash (sorted for consistency)
+            let mut env_keys: Vec<&String> = config.env.keys().collect();
+            env_keys.sort();
+            for env_key in env_keys {
+                if let Some(env_val) = config.env.get(env_key) {
+                    hasher.update(env_key.as_bytes());
+                    hasher.update(env_val.as_bytes());
+                }
+            }
         }
     }
     
     format!("{:x}", hasher.finalize())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 152 - 171, compute_config_hash currently
hashes only key, command, and args, so changes to McpConfig.env won't trigger
reloads; update compute_config_hash (the function name) to include env in the
hash by iterating over config.env entries in a deterministic order (e.g.,
collect and sort env keys) and feeding each env key and value into the hasher
via hasher.update so environment changes affect the resulting hash; ensure you
handle empty or None env consistently and keep ordering deterministic across
runs.

Comment thread src/mcp_reload.rs
Comment on lines +299 to +309
/// Stop an MCP process
fn stop_mcp_process(pid: u32) -> Result<(), String> {
use std::process::Command;

Command::new("kill")
.args(&["-TERM", &pid.to_string()])
.spawn()
.map_err(|e| format!("Failed to stop MCP process: {}", e))?;

Ok(())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

spawn() doesn't wait for kill to complete; use status() instead.

spawn() returns immediately without waiting for the kill command to finish. This means the function returns success even if the signal wasn't delivered. Use status() to wait for completion and check the exit code.

🐛 Proposed fix
 fn stop_mcp_process(pid: u32) -> Result<(), String> {
     use std::process::Command;
     
     Command::new("kill")
         .args(&["-TERM", &pid.to_string()])
-        .spawn()
+        .status()
         .map_err(|e| format!("Failed to stop MCP process: {}", e))?;
     
     Ok(())
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Stop an MCP process
fn stop_mcp_process(pid: u32) -> Result<(), String> {
use std::process::Command;
Command::new("kill")
.args(&["-TERM", &pid.to_string()])
.spawn()
.map_err(|e| format!("Failed to stop MCP process: {}", e))?;
Ok(())
}
fn stop_mcp_process(pid: u32) -> Result<(), String> {
use std::process::Command;
Command::new("kill")
.args(&["-TERM", &pid.to_string()])
.status()
.map_err(|e| format!("Failed to stop MCP process: {}", e))?;
Ok(())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp_reload.rs` around lines 299 - 309, The stop_mcp_process function
currently uses Command::spawn which returns immediately; change it to use
Command::status() so the code waits for the kill command to complete, then check
the returned ExitStatus and convert non-success statuses into an Err (e.g., map
non-zero exit code into a formatted error string). Update the error handling
around Command::status() (still using map_err for command-launch failures) and
return Err when status.success() is false to ensure stop_mcp_process accurately
reports failures.

@willsarg willsarg changed the base branch from main to master March 7, 2026 18:29
@yv-was-taken
Copy link
Copy Markdown
Author

Closing in favor of #2894 which contains all fixes and review feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants