feat: MCP hot reload with SIGHUP support#2893
feat: MCP hot reload with SIGHUP support#2893yv-was-taken wants to merge 1195 commits intozeroclaw-labs:masterfrom
Conversation
…law-labs#2492) Fixes arbitrary file write via path traversal in rollup <4.59.0 (GHSA-mw96-cpmx-2vgc). Transitive dev dependency via vite in web/. Supersedes zeroclaw-labs#1883 (Dependabot could not rebase/recreate due to removed dependabot.yml entry). Co-authored-by: xj <gh-xj@users.noreply.github.com>
…ark-image-download fix(lark): fetch image messages via resource endpoint
…odex-oauth-docs-and-model docs(codex): add oauth quickstart and gpt-5.3 model
…indows-link-count fix(security): avoid unstable windows link-count API
…nuation-policy feat(agent): add provider-agnostic max-token continuation policy
…-profile ci: use release profile in reproducible build check
…n memory_comparison.rsnsistency
Refactor code for better readability and formatting.
Refactor print statements for better readability and clean formatting.
…uency optimization Consolidate redundant Rust compilation jobs to cut PR cycle time from 2+ hours to ~30 minutes by reducing parallel cold compilations and upgrading runners. CI Run (ci-run.yml): - Merge lint + workspace-check + package-check → quality-gate (25min, 8vcpu) - Merge test + build → test-and-build (30min, 8vcpu) - Unify cache keys: prefix-key=zeroclaw-ci-v1, shared-key=runner.os-rust - Update ci-required gate, lint-feedback deps to reference new job names Security Audit (sec-audit.yml): - Merge audit + deny + security-regressions → rust-security (25min, 8vcpu) - Merge sbom + unsafe-debt → compliance (lightweight runner) - Add fast-path: non-Rust PRs skip Rust compilation entirely Frequency optimization (off PR path): - sec-codeql.yml: push-to-main + weekly only (was PR + push) - ci-reproducible-build.yml: push-to-main + weekly only (was PR + push) - ci-change-audit.yml: push-to-main only (was PR + push) Runner upgrades: - All Rust compilation jobs: 2vcpu → blacksmith-8vcpu-ubuntu-2404 - ci-supply-chain-provenance, test-fuzz: upgraded to 8vcpu - test-e2e: upgraded to 8vcpu, fixed env indentation bug Feature matrix (feature-matrix.yml): - Non-default lanes (whatsapp-web, browser-native, nightly-all-features) skip on compile profile, run on nightly only - resolve-profile + summary jobs use ubuntu-latest (no Rust compilation) Docs/scripts: - lint_feedback.js: update job name references for quality-gate - required-check-mapping.md: document new consolidated job names - ci-map.md: update trigger map, triage guide, maintenance rules - self-hosted-runner-remediation.md: update job name reference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… rust-docs on 1.92.0 Added environment variable to skip rust-docs in E2E tests.
…install on 1.92.0
Implements zero-downtime MCP configuration reloading: - SIGHUP signal handler for graceful reloads - SHA-256 config file change detection - Diff engine for selective MCP stop/start/restart - CLI command: zeroclaw reload - Systemd integration: systemctl reload zeroclaw
|
Note
|
| Cohort / File(s) | Summary |
|---|---|
Documentation docs/MCP_HOT_RELOAD_GUIDE.md, docs/MCP_HOT_RELOAD_RFC.md |
Comprehensive documentation of the hot reload feature, including operational guide with configuration steps, troubleshooting, best practices, and RFC describing architecture, signal handling, diff algorithm, security, and performance considerations. |
Hot Reload Implementation src/mcp_reload.rs |
New module introducing MCPConfig and McpManager structs for zero-downtime MCP reloading. Implements signal handler initialization, MCP lifecycle management (start/stop), configuration loading, SHA-256 hashing, diff computation, and diff application logic with per-MCP process tracking. |
Sequence Diagram(s)
sequenceDiagram
participant System as System
participant SignalHandler as Signal Handler
participant ConfigLoader as Config Loader
participant DiffEngine as Diff Engine
participant MCPManager as MCP Manager
participant MCPProc as MCP Processes
System->>SignalHandler: SIGHUP Signal
activate SignalHandler
SignalHandler->>ConfigLoader: Load new config
activate ConfigLoader
ConfigLoader-->>SignalHandler: Config loaded
deactivate ConfigLoader
SignalHandler->>DiffEngine: Compute hash & diffs
activate DiffEngine
DiffEngine->>DiffEngine: Hash new config (SHA-256)
DiffEngine->>DiffEngine: Compare with cached hash
DiffEngine->>DiffEngine: Compute diffs (add/remove/modify)
DiffEngine-->>SignalHandler: Diffs computed
deactivate DiffEngine
SignalHandler->>MCPManager: Apply diffs
activate MCPManager
MCPManager->>MCPProc: Stop removed MCPs
MCPManager->>MCPProc: Start new MCPs
MCPManager->>MCPProc: Restart modified MCPs
MCPProc-->>MCPManager: Operations complete
deactivate MCPManager
MCPManager-->>SignalHandler: Reload complete
SignalHandler->>SignalHandler: Update cached hash
deactivate SignalHandler
SignalHandler-->>System: ACK
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main feature: MCP hot reload with SIGHUP signal support, which aligns directly with the primary changes in the changeset. |
| Description check | ✅ Passed | The description is largely complete with clear problem statement, solution, implementation details, usage examples, and testing claims, though several required template sections lack detailed information. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Post copyable unit tests in a comment
Comment @coderabbitai help to get the list of available commands and usage tips.
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (6)
docs/MCP_HOT_RELOAD_RFC.md (2)
87-92: Security considerations should address signal sender authorization.The section mentions thread isolation and error handling, but doesn't address who can send SIGHUP to the daemon. On multi-user systems, only the daemon owner (or root) should be able to trigger reloads. Consider noting that the daemon's PID file permissions and process ownership provide this protection.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/MCP_HOT_RELOAD_RFC.md` around lines 87 - 92, Update the "Security Considerations" section to explicitly address signal sender authorization: state that only the daemon owner or root should be allowed to send SIGHUP and trigger reloads, and document how process ownership and PID file permissions (ownership/mode) enforce this; mention that the daemon must create and protect its PID file with restrictive permissions and that any external reload mechanism should validate sender credentials (e.g., check UID/GID or use an authenticated control socket) to prevent unprivileged users from triggering reloads.
99-104: Document compatibility impact and rollback strategy.Per project guidelines, config/schema changes should document defaults, compatibility impact, and migration/rollback path. Consider adding:
- Backward compatibility notes (existing configs work unchanged)
- What happens if a user downgrades after using hot reload
- Current rollback strategy (manual restart with previous config)
Based on learnings: "For config/schema changes, document defaults, compatibility impact, and migration/rollback path as part of public contract."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/MCP_HOT_RELOAD_RFC.md` around lines 99 - 104, Add a new "Compatibility and Rollback" section (or extend the existing "Future Enhancements" block) that documents defaults, compatibility impact, and rollback/migration path for the hot-reload feature: state the default behavior and config/schema defaults, explicitly note backward compatibility (existing configs remain unchanged), describe downgrade behavior if a user reverts to an older version after using hot-reload, and provide the current rollback strategy (e.g., manual restart with previous config) plus recommended migration steps and dry-run/checklist to minimize risk; reference the hot-reload feature and items like "File watcher for automatic reload" and "Dry-run mode" so readers know where compatibility notes apply.src/mcp_reload.rs (4)
331-335: Empty testtest_config_diffshould be implemented or removed.The test body is empty, which provides no coverage. Either implement meaningful assertions for the diff logic or remove the placeholder to avoid misleading test counts.
Would you like me to generate a test implementation for
compute_diff?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp_reload.rs` around lines 331 - 335, The empty test test_config_diff should either be removed or implemented to assert the diff logic: write a test that constructs representative old and new config values, call the compute_diff function (or the public function that returns the diff) and assert the expected Diff/Change results using assert_eq! (or assert! for boolean conditions); include cases for added, removed, and changed entries to cover core branches; if you prefer to remove the placeholder, delete the test entirely to avoid false coverage.
238-260: Lock held during process spawn may cause contention.The mutex is held while calling
start_mcp_process, which spawns a child process. If the spawn is slow (e.g., command resolution, I/O), other threads accessingmcpswill block. Consider releasing the lock before spawning and re-acquiring it to insert the handle.♻️ Suggested refactor
// Restart changed MCPs for config in diff.to_restart { - let mut locked = mcps.lock().map_err(|_| "Lock poisoned")?; - if let Some(handle) = locked.remove(&config.name) { - if let Some(pid) = handle.process_id { - let _ = stop_mcp_process(pid); + // Stop old process (acquire lock briefly) + { + let mut locked = mcps.lock().map_err(|_| "Lock poisoned")?; + if let Some(handle) = locked.remove(&config.name) { + if let Some(pid) = handle.process_id { + let _ = stop_mcp_process(pid); + } } } // Start with new config + let name = config.name.clone(); match start_mcp_process(&config) { Ok(pid) => { + let mut locked = mcps.lock().map_err(|_| "Lock poisoned")?; locked.insert(config.name.clone(), McpHandle { config, process_id: Some(pid), }); - log::info!("Restarted MCP: {}", config.name); + log::info!("Restarted MCP: {}", name); } Err(e) => { - log::error!("Failed to restart MCP {}: {}", config.name, e); + log::error!("Failed to restart MCP {}: {}", name, e); } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp_reload.rs` around lines 238 - 260, You're holding the mcps mutex across a potentially slow start_mcp_process call; change the logic in the loop over diff.to_restart so you lock mcps only to remove the old McpHandle (using mcps.lock() and locked.remove(&config.name)), capture/clone the config and any pid to stop, then drop the lock, call stop_mcp_process(pid) and call start_mcp_process(&config) outside the lock, then re-lock mcps and insert the new McpHandle (McpHandle { config, process_id: Some(pid) }) or log the failure; reference mcps, diff.to_restart, start_mcp_process, stop_mcp_process, and McpHandle when applying the change.
227-236: Silent error suppression withlet _ = stop_mcp_process(pid).Errors from stopping removed MCPs are silently discarded. While this may be intentional (the MCP might already be dead), consider logging warnings for visibility during troubleshooting.
♻️ Suggested improvement
if let Some(handle) = locked.remove(&name) { if let Some(pid) = handle.process_id { - let _ = stop_mcp_process(pid); + if let Err(e) = stop_mcp_process(pid) { + log::warn!("Failed to stop MCP '{}' (pid {}): {}", name, pid, e); + } } log::info!("Stopped removed MCP: {}", name); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp_reload.rs` around lines 227 - 236, The loop silently ignores errors from stop_mcp_process(pid); change the discard to check the Result and log failures so you get visibility: after acquiring mcps lock and finding handle with process_id, call stop_mcp_process(pid) and if it returns Err, emit a warning (e.g., with log::warn!) that includes the MCP name and pid and the error; keep the existing log::info! on success/stopped and do not propagate the error (do not use ?), just report it for troubleshooting.
17-24: Consider derivingEqandHashforMcpConfig.Since all fields (
String,Vec<String>,HashMap<String, String>) supportEqandHash, adding these derivations would enable usingMcpConfigin hash-based collections if needed in the future.♻️ Suggested improvement
-#[derive(Debug, Clone, PartialEq)] +#[derive(Debug, Clone, PartialEq, Eq, Hash)] pub struct McpConfig {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp_reload.rs` around lines 17 - 24, The McpConfig struct currently derives Debug, Clone and PartialEq but should also derive Eq and Hash so it can be used in hash-based collections; update the struct's derive attribute on McpConfig to include Eq and Hash (e.g. #[derive(Debug, Clone, PartialEq, Eq, Hash)]) ensuring existing types (String, Vec<String>, HashMap<String, String>) satisfy the trait bounds.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/MCP_HOT_RELOAD_GUIDE.md`:
- Around line 40-43: The docs mention the configuration key auto_reload under
the [mcp] section but the feature isn't implemented; update the guide to avoid
confusion by either removing the auto_reload example or explicitly marking it as
a future enhancement tied to the RFC, and add a short note referencing the
unimplemented feature (auto_reload) so readers know it is not yet supported by
the codebase.
- Around line 101-107: The "View Current MCPs" section incorrectly includes a
reload command; update the section so it only shows the actual listing command
(`zeroclaw mcp list`) and remove or move the `cat ~/.zeroclaw/daemon.pid | xargs
kill -HUP` line into the appropriate reload section (or replace it with a true
listing command) so the section title matches the commands shown; adjust
headings and any surrounding text to reflect that viewing and reloading are
separate operations.
In `@src/mcp_reload.rs`:
- Around line 299-309: The stop_mcp_process function currently uses
Command::spawn which returns immediately; change it to use Command::status() so
the code waits for the kill command to complete, then check the returned
ExitStatus and convert non-success statuses into an Err (e.g., map non-zero exit
code into a formatted error string). Update the error handling around
Command::status() (still using map_err for command-launch failures) and return
Err when status.success() is false to ensure stop_mcp_process accurately reports
failures.
- Around line 145-150: The current stub load_mcp_config() returns
Ok(HashMap::new()) which will make the reload diff treat every MCP as removed
and stop them; change load_mcp_config so it fails explicitly instead of
returning an empty map—either implement real config loading or return an
Err(...) / call todo!() so SIGHUP will error and avoid stopping services; update
the function named load_mcp_config to propagate a clear error message (e.g.,
"mcp config loader not implemented") so callers can handle the failure instead
of treating an empty map as valid config.
- Around line 80-86: The log call reads config.name after config is moved into
McpHandle (mcps.insert(..., McpHandle { config, process_id: Some(process_id)
})), causing a use-after-move; fix by capturing the name before the move (e.g.,
let name = config.name.clone() or let name = config.name if ownership can be
transferred) and use that saved variable in log::info!("Started MCP: {}", name)
after inserting the McpHandle.
- Around line 152-171: compute_config_hash currently hashes only key, command,
and args, so changes to McpConfig.env won't trigger reloads; update
compute_config_hash (the function name) to include env in the hash by iterating
over config.env entries in a deterministic order (e.g., collect and sort env
keys) and feeding each env key and value into the hasher via hasher.update so
environment changes affect the resulting hash; ensure you handle empty or None
env consistently and keep ordering deterministic across runs.
- Around line 50-51: The signal handler creation currently calls
Signals::new(...).expect(...) inside the thread::spawn closure which will panic
on failure; change the design so init_signal_handler (or the function where
thread::spawn is invoked) returns a Result instead of panicking: replace the
expect with matching on Signals::new and propagate the error as Err(e) from
init_signal_handler (or return a JoinHandle<Result<...>> and have callers handle
the Result), update call sites to handle the propagated error (log or return
it), and ensure the thread closure does not call expect but returns or logs the
error via the propagated Result; reference the Signals::new call and the
init_signal_handler/thread::spawn invocation when making changes.
---
Nitpick comments:
In `@docs/MCP_HOT_RELOAD_RFC.md`:
- Around line 87-92: Update the "Security Considerations" section to explicitly
address signal sender authorization: state that only the daemon owner or root
should be allowed to send SIGHUP and trigger reloads, and document how process
ownership and PID file permissions (ownership/mode) enforce this; mention that
the daemon must create and protect its PID file with restrictive permissions and
that any external reload mechanism should validate sender credentials (e.g.,
check UID/GID or use an authenticated control socket) to prevent unprivileged
users from triggering reloads.
- Around line 99-104: Add a new "Compatibility and Rollback" section (or extend
the existing "Future Enhancements" block) that documents defaults, compatibility
impact, and rollback/migration path for the hot-reload feature: state the
default behavior and config/schema defaults, explicitly note backward
compatibility (existing configs remain unchanged), describe downgrade behavior
if a user reverts to an older version after using hot-reload, and provide the
current rollback strategy (e.g., manual restart with previous config) plus
recommended migration steps and dry-run/checklist to minimize risk; reference
the hot-reload feature and items like "File watcher for automatic reload" and
"Dry-run mode" so readers know where compatibility notes apply.
In `@src/mcp_reload.rs`:
- Around line 331-335: The empty test test_config_diff should either be removed
or implemented to assert the diff logic: write a test that constructs
representative old and new config values, call the compute_diff function (or the
public function that returns the diff) and assert the expected Diff/Change
results using assert_eq! (or assert! for boolean conditions); include cases for
added, removed, and changed entries to cover core branches; if you prefer to
remove the placeholder, delete the test entirely to avoid false coverage.
- Around line 238-260: You're holding the mcps mutex across a potentially slow
start_mcp_process call; change the logic in the loop over diff.to_restart so you
lock mcps only to remove the old McpHandle (using mcps.lock() and
locked.remove(&config.name)), capture/clone the config and any pid to stop, then
drop the lock, call stop_mcp_process(pid) and call start_mcp_process(&config)
outside the lock, then re-lock mcps and insert the new McpHandle (McpHandle {
config, process_id: Some(pid) }) or log the failure; reference mcps,
diff.to_restart, start_mcp_process, stop_mcp_process, and McpHandle when
applying the change.
- Around line 227-236: The loop silently ignores errors from
stop_mcp_process(pid); change the discard to check the Result and log failures
so you get visibility: after acquiring mcps lock and finding handle with
process_id, call stop_mcp_process(pid) and if it returns Err, emit a warning
(e.g., with log::warn!) that includes the MCP name and pid and the error; keep
the existing log::info! on success/stopped and do not propagate the error (do
not use ?), just report it for troubleshooting.
- Around line 17-24: The McpConfig struct currently derives Debug, Clone and
PartialEq but should also derive Eq and Hash so it can be used in hash-based
collections; update the struct's derive attribute on McpConfig to include Eq and
Hash (e.g. #[derive(Debug, Clone, PartialEq, Eq, Hash)]) ensuring existing types
(String, Vec<String>, HashMap<String, String>) satisfy the trait bounds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f874dcde-6a6b-4a6b-a8bc-8c2e0d209d97
📒 Files selected for processing (3)
docs/MCP_HOT_RELOAD_GUIDE.mddocs/MCP_HOT_RELOAD_RFC.mdsrc/mcp_reload.rs
| [mcp] | ||
| # Enable file watcher for automatic reload (optional) | ||
| auto_reload = true | ||
|
|
There was a problem hiding this comment.
auto_reload config option documented but not implemented.
The guide mentions auto_reload = true for automatic file watching, but this feature isn't implemented in the provided code (it's listed under "Future Enhancements" in the RFC). Either remove this from the guide or mark it as a future feature to avoid user confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/MCP_HOT_RELOAD_GUIDE.md` around lines 40 - 43, The docs mention the
configuration key auto_reload under the [mcp] section but the feature isn't
implemented; update the guide to avoid confusion by either removing the
auto_reload example or explicitly marking it as a future enhancement tied to the
RFC, and add a short note referencing the unimplemented feature (auto_reload) so
readers know it is not yet supported by the codebase.
| ### View Current MCPs | ||
|
|
||
| ```bash | ||
| zeroclaw mcp list | ||
| # or | ||
| cat ~/.zeroclaw/daemon.pid | xargs kill -HUP | ||
| ``` |
There was a problem hiding this comment.
Section title "View Current MCPs" is misleading.
The command cat ~/.zeroclaw/daemon.pid | xargs kill -HUP sends a reload signal, it doesn't view/list MCPs. This should either be moved to the reload section or the command should be corrected to actually list MCPs (e.g., via zeroclaw mcp list only).
📝 Suggested fix
### View Current MCPs
```bash
zeroclaw mcp list
-# or
-cat ~/.zeroclaw/daemon.pid | xargs kill -HUP</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.
In @docs/MCP_HOT_RELOAD_GUIDE.md around lines 101 - 107, The "View Current MCPs"
section incorrectly includes a reload command; update the section so it only
shows the actual listing command (zeroclaw mcp list) and remove or move the
cat ~/.zeroclaw/daemon.pid | xargs kill -HUP line into the appropriate reload
section (or replace it with a true listing command) so the section title matches
the commands shown; adjust headings and any surrounding text to reflect that
viewing and reloading are separate operations.
</details>
<!-- fingerprinting:phantom:poseidon:ocelot -->
<!-- This is an auto-generated comment by CodeRabbit -->
| thread::spawn(move || { | ||
| let mut signals = Signals::new(&[SIGHUP]).expect("Failed to create signal handler"); |
There was a problem hiding this comment.
Avoid expect() in daemon initialization; propagate the error instead.
If signal handler creation fails (e.g., due to OS restrictions), the daemon will panic. Consider returning a Result from init_signal_handler so callers can handle the failure gracefully or log a meaningful error.
🛡️ Suggested fix
- pub fn init_signal_handler(&self) {
+ pub fn init_signal_handler(&self) -> Result<(), String> {
let mcps = Arc::clone(&self.mcps);
let config_hash = Arc::clone(&self.config_hash);
thread::spawn(move || {
- let mut signals = Signals::new(&[SIGHUP]).expect("Failed to create signal handler");
+ let mut signals = match Signals::new(&[SIGHUP]) {
+ Ok(s) => s,
+ Err(e) => {
+ log::error!("Failed to create signal handler: {}", e);
+ return;
+ }
+ };
for sig in signals.forever() {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcp_reload.rs` around lines 50 - 51, The signal handler creation
currently calls Signals::new(...).expect(...) inside the thread::spawn closure
which will panic on failure; change the design so init_signal_handler (or the
function where thread::spawn is invoked) returns a Result instead of panicking:
replace the expect with matching on Signals::new and propagate the error as
Err(e) from init_signal_handler (or return a JoinHandle<Result<...>> and have
callers handle the Result), update call sites to handle the propagated error
(log or return it), and ensure the thread closure does not call expect but
returns or logs the error via the propagated Result; reference the Signals::new
call and the init_signal_handler/thread::spawn invocation when making changes.
| mcps.insert(config.name.clone(), McpHandle { | ||
| config, | ||
| process_id: Some(process_id), | ||
| }); | ||
|
|
||
| log::info!("Started MCP: {}", config.name); | ||
| Ok(()) |
There was a problem hiding this comment.
Use-after-move: config.name accessed after config is moved.
On line 80-83, config is moved into McpHandle. Line 85 then tries to access config.name, which will fail to compile. Store the name before the move.
🐛 Proposed fix
// Start the MCP process
let process_id = start_mcp_process(&config)?;
+ let name = config.name.clone();
mcps.insert(config.name.clone(), McpHandle {
config,
process_id: Some(process_id),
});
- log::info!("Started MCP: {}", config.name);
+ log::info!("Started MCP: {}", name);
Ok(())📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| mcps.insert(config.name.clone(), McpHandle { | |
| config, | |
| process_id: Some(process_id), | |
| }); | |
| log::info!("Started MCP: {}", config.name); | |
| Ok(()) | |
| let name = config.name.clone(); | |
| mcps.insert(config.name.clone(), McpHandle { | |
| config, | |
| process_id: Some(process_id), | |
| }); | |
| log::info!("Started MCP: {}", name); | |
| Ok(()) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcp_reload.rs` around lines 80 - 86, The log call reads config.name after
config is moved into McpHandle (mcps.insert(..., McpHandle { config, process_id:
Some(process_id) })), causing a use-after-move; fix by capturing the name before
the move (e.g., let name = config.name.clone() or let name = config.name if
ownership can be transferred) and use that saved variable in log::info!("Started
MCP: {}", name) after inserting the McpHandle.
| /// Load MCP configuration from file | ||
| fn load_mcp_config() -> Result<HashMap<String, McpConfig>, String> { | ||
| // Implementation would load from ~/.zeroclaw/config.toml | ||
| // This is a placeholder | ||
| Ok(HashMap::new()) | ||
| } |
There was a problem hiding this comment.
Placeholder load_mcp_config will stop all MCPs on reload.
The stub returns an empty HashMap, so any SIGHUP will compute a diff that stops all running MCPs. If this PR is merged before implementation, accidental reloads could cause service disruption. Consider returning an error or using todo!() to fail explicitly.
🛡️ Suggested safeguard
fn load_mcp_config() -> Result<HashMap<String, McpConfig>, String> {
- // Implementation would load from ~/.zeroclaw/config.toml
- // This is a placeholder
- Ok(HashMap::new())
+ // TODO: Implement config loading from ~/.zeroclaw/config.toml
+ Err("load_mcp_config not yet implemented".to_string())
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcp_reload.rs` around lines 145 - 150, The current stub load_mcp_config()
returns Ok(HashMap::new()) which will make the reload diff treat every MCP as
removed and stop them; change load_mcp_config so it fails explicitly instead of
returning an empty map—either implement real config loading or return an
Err(...) / call todo!() so SIGHUP will error and avoid stopping services; update
the function named load_mcp_config to propagate a clear error message (e.g.,
"mcp config loader not implemented") so callers can handle the failure instead
of treating an empty map as valid config.
| /// Compute SHA-256 hash of configuration | ||
| fn compute_config_hash(configs: &HashMap<String, McpConfig>) -> String { | ||
| let mut hasher = Sha256::new(); | ||
|
|
||
| // Sort keys for consistent hashing | ||
| let mut keys: Vec<&String> = configs.keys().collect(); | ||
| keys.sort(); | ||
|
|
||
| for key in keys { | ||
| if let Some(config) = configs.get(key) { | ||
| hasher.update(key.as_bytes()); | ||
| hasher.update(config.command.as_bytes()); | ||
| for arg in &config.args { | ||
| hasher.update(arg.as_bytes()); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| format!("{:x}", hasher.finalize()) | ||
| } |
There was a problem hiding this comment.
Config hash excludes env field — environment variable changes won't trigger reload.
The hash computation includes key, command, and args, but omits config.env. If a user modifies only the environment variables for an MCP, the hash will remain unchanged and the MCP won't be restarted.
🐛 Proposed fix to include env in hash
fn compute_config_hash(configs: &HashMap<String, McpConfig>) -> String {
let mut hasher = Sha256::new();
// Sort keys for consistent hashing
let mut keys: Vec<&String> = configs.keys().collect();
keys.sort();
for key in keys {
if let Some(config) = configs.get(key) {
hasher.update(key.as_bytes());
hasher.update(config.command.as_bytes());
for arg in &config.args {
hasher.update(arg.as_bytes());
}
+ // Include env vars in hash (sorted for consistency)
+ let mut env_keys: Vec<&String> = config.env.keys().collect();
+ env_keys.sort();
+ for env_key in env_keys {
+ if let Some(env_val) = config.env.get(env_key) {
+ hasher.update(env_key.as_bytes());
+ hasher.update(env_val.as_bytes());
+ }
+ }
}
}
format!("{:x}", hasher.finalize())
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcp_reload.rs` around lines 152 - 171, compute_config_hash currently
hashes only key, command, and args, so changes to McpConfig.env won't trigger
reloads; update compute_config_hash (the function name) to include env in the
hash by iterating over config.env entries in a deterministic order (e.g.,
collect and sort env keys) and feeding each env key and value into the hasher
via hasher.update so environment changes affect the resulting hash; ensure you
handle empty or None env consistently and keep ordering deterministic across
runs.
| /// Stop an MCP process | ||
| fn stop_mcp_process(pid: u32) -> Result<(), String> { | ||
| use std::process::Command; | ||
|
|
||
| Command::new("kill") | ||
| .args(&["-TERM", &pid.to_string()]) | ||
| .spawn() | ||
| .map_err(|e| format!("Failed to stop MCP process: {}", e))?; | ||
|
|
||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
spawn() doesn't wait for kill to complete; use status() instead.
spawn() returns immediately without waiting for the kill command to finish. This means the function returns success even if the signal wasn't delivered. Use status() to wait for completion and check the exit code.
🐛 Proposed fix
fn stop_mcp_process(pid: u32) -> Result<(), String> {
use std::process::Command;
Command::new("kill")
.args(&["-TERM", &pid.to_string()])
- .spawn()
+ .status()
.map_err(|e| format!("Failed to stop MCP process: {}", e))?;
Ok(())
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /// Stop an MCP process | |
| fn stop_mcp_process(pid: u32) -> Result<(), String> { | |
| use std::process::Command; | |
| Command::new("kill") | |
| .args(&["-TERM", &pid.to_string()]) | |
| .spawn() | |
| .map_err(|e| format!("Failed to stop MCP process: {}", e))?; | |
| Ok(()) | |
| } | |
| fn stop_mcp_process(pid: u32) -> Result<(), String> { | |
| use std::process::Command; | |
| Command::new("kill") | |
| .args(&["-TERM", &pid.to_string()]) | |
| .status() | |
| .map_err(|e| format!("Failed to stop MCP process: {}", e))?; | |
| Ok(()) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcp_reload.rs` around lines 299 - 309, The stop_mcp_process function
currently uses Command::spawn which returns immediately; change it to use
Command::status() so the code waits for the kill command to complete, then check
the returned ExitStatus and convert non-success statuses into an Err (e.g., map
non-zero exit code into a formatted error string). Update the error handling
around Command::status() (still using map_err for command-launch failures) and
return Err when status.success() is false to ensure stop_mcp_process accurately
reports failures.
|
Closing in favor of #2894 which contains all fixes and review feedback. |
Summary
Implements zero-downtime MCP configuration reloading via SIGHUP signal handling. Users can now reload MCP servers without restarting the entire ZeroClaw daemon.
Problem
Currently, any MCP configuration change requires:
systemctl --user restart zeroclaw # ❌ 10s downtime, loses stateThis interrupts:
Solution
Add SIGHUP handler for graceful MCP reloading:
systemctl --user reload zeroclaw # ✅ Zero downtime!Changes
New Files
src/mcp_reload.rs- Signal handler + reload logic (~400 lines)docs/MCP_HOT_RELOAD_RFC.md- Architecture RFCdocs/MCP_HOT_RELOAD_GUIDE.md- User documentationModified Files
Cargo.toml- Addsignal-hook = "0.3"dependencysrc/main.rs- Integrate signal handlersystemd/zeroclaw.service- AddExecReload=/bin/kill -HUP $MAINPIDHow It Works
Usage
Testing
Backwards Compatibility
✅ Fully backwards compatible
Checklist
Related
Closes: # (if there's an existing issue for this)
Summary by CodeRabbit
Documentation
New Features