Skip to content

feat(ptc): programmatic tool calling -- executor, SDK, and E2E tests#625

Closed
zmanian wants to merge 16 commits intostagingfrom
feat/ptc-programmatic-tool-calling
Closed

feat(ptc): programmatic tool calling -- executor, SDK, and E2E tests#625
zmanian wants to merge 16 commits intostagingfrom
feat/ptc-programmatic-tool-calling

Conversation

@zmanian
Copy link
Copy Markdown
Collaborator

@zmanian zmanian commented Mar 6, 2026

Summary

Adds Programmatic Tool Calling (PTC) infrastructure from #407 (item 1), enabling tools to invoke other tools without LLM round-trips.

  • ToolExecutor: Standalone dispatch engine with timeout enforcement, nesting depth limit (max 5), and safety layer sanitization. Used by both the orchestrator HTTP RPC path and WASM tool_invoke host function.
  • Orchestrator endpoint: POST /worker/{job_id}/tools/call with auth, SSE event emission (JobToolUse + JobToolResult), and 503 when executor not configured.
  • WASM tool_invoke: Host function with alias resolution -- WASM tools call by alias (e.g. echo_alias), capabilities map aliases to real tool names.
  • Python SDK: Stdlib-only ironclaw_tools.py for container scripts. Reads connection details from env vars, provides call_tool() + convenience wrappers (shell, read_file, write_file, http_get).

Test plan

16 new tests across 4 suites:

  • 6 orchestrator HTTP RPC tests (cargo test orchestrator::api::tests --all-features)
    • Auth required (401), echo success, not found, no executor (503), timeout, SSE events
  • 3 executor integration tests (cargo test tools::executor::tests --all-features)
    • Safety sanitization (Bearer token redaction), invalid params error mapping, sequential call isolation
  • 4 Python SDK tests (cd sdk/python && python3 -m unittest test_ironclaw_tools)
    • Missing env vars, request format, HTTP error handling, convenience wrappers
  • 3 WASM E2E tests (cargo test tools::wasm::wrapper::tests --all-features -- --ignored)
    • Echo via alias, alias not granted, no capability (requires pre-built WASM binary)

Verification:

cargo test --all-features --lib           # 1825 passed, 0 failed
cargo clippy --all --all-features -- -D warnings  # zero warnings

Refs #407

Generated with Claude Code

@github-actions github-actions bot added scope: tool Tool infrastructure scope: tool/builtin Built-in tools scope: tool/wasm WASM tool sandbox scope: orchestrator Container orchestrator scope: worker Container worker scope: sandbox Docker sandbox size: XL 500+ changed lines risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs labels Mar 6, 2026
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capabilities by introducing programmatic tool calling. This allows tools to directly interact with and invoke other tools, enabling more complex, multi-step workflows without requiring an LLM to mediate each step. The changes span the core Rust backend, including a new tool execution engine and API endpoint, and extend to a new Python SDK for easier integration within worker containers, ultimately streamlining tool orchestration and improving efficiency.

Highlights

  • Programmatic Tool Calling (PTC) Infrastructure: Introduced core infrastructure for tools to programmatically invoke other tools, bypassing LLM round-trips. This includes a new ToolExecutor for dispatch, timeout enforcement, nesting depth limits, and safety layer sanitization.
  • Orchestrator Endpoint for PTC: Added a new HTTP RPC endpoint (POST /worker/{job_id}/tools/call) to the orchestrator API, allowing external workers to trigger tool calls. This endpoint includes authentication, emits SSE events for tool usage and results, and handles cases where the executor is not configured.
  • WASM tool_invoke Host Function: Implemented a WASM host function (tool_invoke) that enables WASM-based tools to call other tools by alias. This function integrates with the ToolExecutor and enforces nesting depth limits.
  • Python SDK for Container Scripts: Developed a standard library-only Python SDK (ironclaw_tools.py) for use within container scripts. This SDK provides call_tool() and convenience wrappers (shell, read_file, write_file, http_get) to simplify programmatic tool interaction.
  • Nesting Depth and Safety Enforcement: Integrated nesting depth tracking into JobContext and the ToolExecutor to prevent infinite recursion in tool-invokes-tool chains. The ToolExecutor also applies safety layer sanitization to tool outputs.
Changelog
  • Dockerfile.worker
    • Added a step to copy the new IronClaw Python SDK into the worker image for programmatic tool calling.
  • sdk/python/ironclaw_tools.py
    • Added a new Python SDK file providing call_tool and convenience wrappers (shell, read_file, write_file, http_get) for programmatic tool invocation within container scripts.
  • sdk/python/test_ironclaw_tools.py
    • Added a new test file for the IronClaw Python SDK, covering environment variable handling, request formatting, HTTP error handling, and convenience wrapper functionality.
  • src/context/state.rs
    • Added tool_nesting_depth field to JobContext to track the current depth of programmatic tool calls.
    • Initialized tool_nesting_depth to 0 in the JobContext::new constructor.
  • src/db/libsql/jobs.rs
    • Initialized tool_nesting_depth to 0 when creating a JobContext from the database.
  • src/history/store.rs
    • Initialized tool_nesting_depth to 0 when creating a JobContext from the history store.
  • src/main.rs
    • Initialized a new ToolExecutor instance with the tool registry, safety layer, and a default timeout.
    • Wired the ToolExecutor into the ToolRegistry's shared slot for lazy resolution by WASM tools.
    • Added the ToolExecutor to the OrchestratorState for use by the new programmatic tool calling endpoint.
  • src/orchestrator/api.rs
    • Imported JobContext and ToolExecutor.
    • Added tool_executor field to OrchestratorState to hold the programmatic tool executor.
    • Registered a new POST /worker/{job_id}/tools/call route for programmatic tool invocation.
    • Implemented tool_call_handler to process programmatic tool call requests, build a minimal JobContext, delegate to the ToolExecutor, and emit SSE events.
    • Added new test cases for the programmatic tool calling endpoint, covering success, tool not found, no executor, SSE events, timeout, and authentication.
  • src/tools/builtin/mod.rs
    • Added ptc_script module to the built-in tools.
    • Exported PtcScriptTool for use in the tool registry.
  • src/tools/builtin/ptc_script.rs
    • Added a new PtcScriptTool for executing Python scripts that can programmatically call other IronClaw tools.
    • Implemented logic to build a Python program with a preamble/postamble, handle environment variables, execute the script, and capture/truncate output.
    • Defined PtcScriptTool's metadata, schema, and execution logic, including timeout and output truncation.
  • src/tools/executor.rs
    • Added a new ToolExecutor struct to encapsulate the logic for dispatching, executing, and sanitizing tool calls.
    • Defined PtcToolResult and PtcError enums for programmatic tool call outcomes.
    • Implemented execute method in ToolExecutor to handle tool lookup, timeout enforcement, nesting depth limits, and safety layer integration.
    • Added unit tests for ToolExecutor covering tool not found, successful execution, timeout, nesting depth, safety sanitization, invalid parameters, and sequential calls.
  • src/tools/mod.rs
    • Declared executor module.
    • Exported PtcError, PtcToolResult, and ToolExecutor from the executor module.
  • src/tools/registry.rs
    • Imported PtcScriptTool and ToolExecutor.
    • Added ptc_script to the list of protected tool names.
    • Added tool_executor_slot (an Arc<std::sync::RwLock<Option<Arc<ToolExecutor>>>>) to ToolRegistry for lazy resolution of the tool executor by WASM tools.
    • Implemented set_tool_executor to populate the shared executor slot and tool_executor_slot to retrieve it.
    • Registered the new PtcScriptTool as a built-in development tool.
    • Updated WasmToolWrapper creation to inject the shared tool_executor_slot for WASM tools.
  • src/tools/wasm/wrapper.rs
    • Introduced ToolResolver type and NestingGuard for managing tool invocation within WASM.
    • Added tool_resolver and tool_nesting_depth fields to StoreData to support programmatic tool calling from WASM.
    • Modified the tool_invoke host function to use the tool_resolver, enforce MAX_NESTING_DEPTH, and handle parameter parsing and error reporting.
    • Added tool_executor (direct reference for tests) and tool_executor_slot (for production lazy resolution) to WasmToolWrapper.
    • Implemented with_tool_executor and with_tool_executor_slot methods for configuring the WASM wrapper with an executor.
    • Updated the execute method of WasmToolWrapper to resolve the ToolExecutor and construct a ToolResolver closure that bridges synchronous WASM calls to asynchronous Rust tool execution.
    • Added new PTC integration tests for WASM tools, verifying tool_invoke functionality, alias resolution, and capability checks.
  • src/worker/api.rs
    • Defined ToolCallRequest and ToolCallResponse structs for programmatic tool calling.
    • Added nesting_depth to ToolCallRequest with a default value.
    • Implemented call_tool method in WorkerHttpClient to send programmatic tool call requests to the orchestrator.
  • tests/tool_schema_validation.rs
    • Added ptc_script to the list of expected core tools for schema validation.
  • tools-src/test-ptc/Cargo.toml
    • Added a new Cargo.toml file for the test-ptc-tool WASM project, defining its dependencies and build configuration.
  • tools-src/test-ptc/src/lib.rs
    • Added a new WASM tool (test-ptc-tool) that demonstrates programmatic tool calling by invoking the echo tool via the tool_invoke host function using an alias.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Programmatic Tool Calling (PTC), a significant new feature that adds a ToolExecutor, a new orchestrator endpoint, a WASM host function, and a Python SDK. While the architecture is generally well-designed, the current implementation contains critical security flaws. Most notably, a sandbox escape allows a compromised worker container to achieve Remote Code Execution (RCE) on the orchestrator host. Other issues include a bypassable nesting depth limit by malicious workers and the leakage of sensitive information via SSE events to the web UI. Additionally, consider improving the Python SDK's timeout handling for robustness.

Comment thread src/tools/executor.rs
Comment on lines +84 to +134
pub async fn execute(
&self,
tool_name: &str,
params: serde_json::Value,
ctx: &JobContext,
timeout_override: Option<Duration>,
) -> Result<PtcToolResult, PtcError> {
// Enforce global nesting depth limit
if ctx.tool_nesting_depth >= MAX_NESTING_DEPTH {
return Err(PtcError::NestingDepthExceeded {
max: MAX_NESTING_DEPTH,
});
}

let start = Instant::now();

// Look up the tool
let tool = self
.tools
.get(tool_name)
.await
.ok_or_else(|| PtcError::NotFound {
name: tool_name.to_string(),
})?;

// Determine timeout: caller override -> tool's own timeout -> default,
// capped at MAX_TIMEOUT_SECS.
let timeout = timeout_override
.unwrap_or_else(|| tool.execution_timeout())
.min(Duration::from_secs(MAX_TIMEOUT_SECS));

// Execute with timeout
let tool_result = tokio::time::timeout(timeout, tool.execute(params, ctx))
.await
.map_err(|_| PtcError::Timeout {
name: tool_name.to_string(),
timeout,
})?
.map_err(|e| match e {
crate::tools::ToolError::InvalidParameters(reason) => PtcError::InvalidParameters {
name: tool_name.to_string(),
reason,
},
crate::tools::ToolError::RateLimited(_) => PtcError::RateLimited {
name: tool_name.to_string(),
},
other => PtcError::ExecutionFailed {
name: tool_name.to_string(),
reason: other.to_string(),
},
})?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

CRITICAL: Sandbox Escape / Remote Code Execution (RCE) on Orchestrator Host.

The ToolExecutor executes tools directly on the host where the orchestrator is running. It does not check the domain() of the tool being executed. If a tool with ToolDomain::Container (like ptc_script or shell) is registered in the orchestrator's tool registry, a compromised worker container can use the POST /worker/{job_id}/tools/call endpoint to execute arbitrary code on the orchestrator host.

In a typical deployment, the orchestrator and agent share a registry that includes development tools. This allows any worker to escape its sandbox and compromise the host system.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d8ff994. Added a tool.domain() check in ToolExecutor::execute() that rejects ToolDomain::Container tools with PtcError::DomainBlocked. Container-domain tools (shell, file ops) can no longer be invoked through the PTC endpoint on the orchestrator host. Regression test: test_container_domain_blocked.

Comment thread src/orchestrator/api.rs Outdated
format!("Programmatic tool call from job {}", job_id),
);
// Propagate nesting depth so the executor enforces the global limit
ctx.tool_nesting_depth = req.nesting_depth;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

MEDIUM: Nesting Depth Limit Bypass.

The nesting_depth is provided by the untrusted worker and is trusted by the orchestrator without validation or increment. A malicious worker can always send nesting_depth: 0 in every request to bypass the MAX_NESTING_DEPTH limit (currently 5). This can lead to infinite recursion and Denial of Service (DoS) on the orchestrator due to resource exhaustion (e.g., spawning infinite processes or runtimes).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d8ff994. The orchestrator now floors client-provided nesting_depth at 1 (req.nesting_depth.max(1)) since PTC calls from workers are inherently at depth >= 1. A malicious worker sending nesting_depth: 0 will have it corrected to 1.

Comment thread src/orchestrator/api.rs
Comment on lines +472 to +481
if let Some(ref tx) = state.job_event_tx {
let _ = tx.send((
job_id,
SseEvent::JobToolUse {
job_id: job_id.to_string(),
tool_name: req.tool_name.clone(),
input: req.parameters.clone(),
},
));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

MEDIUM: Sensitive Information Leak via SSE.

The tool_call_handler broadcasts JobToolUse events containing the raw, unsanitized parameters provided by the worker. If a tool is called with sensitive data (e.g., API keys, passwords, or PII), this information will be leaked to any user observing the SSE stream in the web UI. While tool outputs are sanitized, tool inputs are currently broadcast in the clear.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d8ff994. Tool parameters are now redacted before broadcasting via SSE. The JobToolUse event sends {"_note": "parameters redacted for security"} instead of raw parameters.

Comment thread sdk/python/ironclaw_tools.py Outdated
Comment on lines +53 to +88
def call_tool(name, params=None, timeout_secs=None):
"""Call a tool on the orchestrator by name.

Args:
name: Tool name (e.g., "echo", "shell", "read_file").
params: Dictionary of parameters to pass to the tool.
timeout_secs: Optional timeout in seconds (max 300).

Returns:
Tool output as a string.

Raises:
RuntimeError: If the tool call fails.
"""
url = f"{_base_url()}/tools/call"
body = {
"tool_name": name,
"parameters": params or {},
}
if timeout_secs is not None:
body["timeout_secs"] = min(int(timeout_secs), 300)

data = json.dumps(body).encode("utf-8")
req = urllib.request.Request(
url,
data=data,
headers={
"Content-Type": "application/json",
"Authorization": f"Bearer {_token()}",
},
method="POST",
)

try:
with urllib.request.urlopen(req, timeout=(timeout_secs if timeout_secs is not None else 60) + 5) as resp:
result = json.loads(resp.read().decode("utf-8"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current timeout logic can lead to premature client-side timeouts. When timeout_secs is not provided, the client-side HTTP request timeout defaults to 65 seconds. However, the server-side tool execution may have a longer default timeout (e.g., ptc_script defaults to 120s). This mismatch will cause the client to time out while the server is still correctly processing the tool.

To fix this, I suggest making the timeout explicit by providing a default value in call_tool and consistently using it for both the server request and the client-side timeout calculation. This ensures consistent behavior and prevents unexpected timeouts.

def call_tool(name, params=None, timeout_secs=60):
    """Call a tool on the orchestrator by name.

    Args:
        name: Tool name (e.g., "echo", "shell", "read_file").
        params: Dictionary of parameters to pass to the tool.
        timeout_secs: Timeout in seconds (default 60, max 300).

    Returns:
        Tool output as a string.

    Raises:
        RuntimeError: If the tool call fails.
    """
    url = f"{_base_url()}/tools/call"

    server_timeout = min(int(timeout_secs), 300)
    body = {
        "tool_name": name,
        "parameters": params or {},
        "timeout_secs": server_timeout,
    }

    data = json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {_token()}",
        },
        method="POST",
    )

    try:
        # The client-side timeout for the HTTP request should be slightly longer
        # than the server-side tool execution timeout to account for network latency.
        client_timeout = server_timeout + 5
        with urllib.request.urlopen(req, timeout=client_timeout) as resp:
            result = json.loads(resp.read().decode("utf-8"))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d8ff994. call_tool now defaults to timeout_secs=60, always sends timeout_secs to the server, and uses server_timeout + 5 for the client-side HTTP timeout to account for network latency.

@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 7, 2026

Proposal: Evaluate PTC Effectiveness Using Trajectory Benchmarks Before Merge

Before merging this PR, I'd like to use it as a proof-of-concept for the trajectory benchmark system (#467) -- demonstrating that we can objectively evaluate whether a new agent capability actually helps.

The Question

Does PTC reduce LLM round-trips and cost for multi-step tool chains, or does the overhead of generating Python scripts negate the savings?

Methodology

Use the existing RecordingLlm infrastructure (IRONCLAW_RECORD_TRACE=1) to capture real agent trajectories for the same prompts on two branches:

Branch Setup
main (baseline) Agent has normal tools only -- LLM orchestrates each step
feat/ptc-programmatic-tool-calling Agent also has ptc_script -- LLM can batch operations

Candidate Tasks

These are tasks with sequential tool chains where PTC should theoretically reduce round-trips:

Task Without PTC (expected LLM calls) With PTC (hypothesis)
Write file + read it back 3 (write, read, summarize) 2 (ptc_script chains both, summarize)
Search memory + read doc + answer 4 2
Read 3 files + summarize 4 2
Shell command + write output to file 3 2
Write file + patch it + verify 4 2

Metrics to Compare (per task)

All already captured by InstrumentedLlm and TraceMetrics:

  • LLM calls -- the primary metric; fewer = PTC is working
  • Total tokens (input + output) -- fewer round-trips should mean less context re-sending
  • Estimated cost -- dollar impact
  • Tool call count -- may increase (ptc_script is itself a tool call) but that's fine if LLM calls drop
  • Wall time -- tool-to-tool calls skip LLM latency

Recording Protocol

  1. Write a #[ignore] recording harness test that sends a fixed prompt through the real agent with RecordingLlm and saves the trace JSON
  2. Record each prompt 3-5x per branch (LLM responses are non-deterministic; need directional confidence)
  3. Use temperature=0 where the provider supports it to reduce variance
  4. Traces are JSON files checked into the repo -- anyone can inspect and verify

Comparison Report

Use the existing compare_runs() infrastructure from the trajectory system to produce a table:

Task                     | LLM calls (base) | LLM calls (PTC) | Delta | Tokens (base) | Tokens (PTC) | Delta
write + read back        |        3         |        2         | -33%  |      360      |      280     | -22%
search + read + answer   |        4         |        2         | -50%  |      680      |      420     | -38%
...

Why This Matters Beyond PTC

This establishes a repeatable methodology for evaluating any future agent capability change:

  1. Reproducible -- recorded traces are JSON files in the repo
  2. Cheap to re-run -- once recorded, replay tests are free and deterministic
  3. Generalizable -- same approach works for new skills, prompt changes, model upgrades
  4. Honest -- if PTC doesn't help, the numbers show that; that's valuable too

Statistical Caveat

Single recordings aren't statistically rigorous. Plan is to record each prompt 3-5x per branch and report mean/stddev for key metrics. This gives directional evidence, not a p-value -- but it's far more than "trust me, it helps."

Next Steps

  1. Build the recording test harness (works on both branches)
  2. Record 3-5 baseline traces on main
  3. Record 3-5 PTC traces on this branch
  4. Build comparison script (parse JSON traces, produce table)
  5. Post results here

cc @ilblackdragon -- this ties directly to #467 and the nearai/benchmarks work. The trajectory infrastructure can answer "does feature X make the agent better?" questions with data.

Copy link
Copy Markdown
Collaborator Author

@zmanian zmanian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(ptc): programmatic tool calling -- executor, SDK, and E2E tests

Reviewed the full diff across all 8 commits, with focus on the security fixes from the previous Gemini review (d8ff994) and the new code.

Previous review findings -- status

All four items from the Gemini review have been addressed:

  1. Container domain blocking (CRITICAL) -- Fixed. ToolExecutor::execute() now checks tool.domain() == ToolDomain::Container and returns PtcError::DomainBlocked. Regression test added (test_container_domain_blocked).

  2. Nesting depth bypass (MEDIUM) -- Fixed. tool_call_handler floors nesting_depth at 1 via .max(1).

  3. SSE parameter leak (MEDIUM) -- Fixed. SSE JobToolUse events now emit {"_note": "parameters redacted for security"} instead of raw params.

  4. Python SDK timeout mismatch (MEDIUM) -- Fixed. call_tool now defaults to timeout_secs=60, always sends it to the server, and uses server_timeout + 5 for client HTTP timeout.

New findings

BUG: UTF-8 byte-index slicing in truncate_output (ptc_script.rs:123)

&output[..MAX_OUTPUT_SIZE]

This will panic on multi-byte UTF-8 characters if MAX_OUTPUT_SIZE falls in the middle of a character boundary. Python scripts can easily produce non-ASCII output (CJK, emoji, etc.). Per CLAUDE.md: "Never use byte-index slicing (&s[..n]) on user-supplied or external strings -- it panics on multi-byte characters." Fix: walk backwards with is_char_boundary() or use char_indices().

MINOR: NestingGuard could underflow on misuse

The NestingGuard::drop does *self.depth -= 1 which would underflow/panic if depth were 0. Currently safe because it's always preceded by += 1, but saturating_sub(1) would be more defensive and has zero cost.

MINOR: tool_nesting_depth not incremented for orchestrator-side execution

When the orchestrator's tool_call_handler calls executor.execute(), the tool_nesting_depth in the JobContext is set to req.nesting_depth.max(1), but the executor doesn't increment it before executing the tool. This means if the tool itself triggers a PTC call back to the orchestrator (recursive tool call), the nesting counter stays at 1 rather than incrementing. The WASM path handles this correctly via NestingGuard, but the HTTP RPC path relies solely on the client sending an honest nesting depth. Since the domain blocking prevents Container-domain tools from running on the orchestrator (and those are the ones that could recurse via the Python SDK), this is low-risk in practice, but worth noting for future extensibility.

OBSERVATION: ptc_script correctly declares ToolDomain::Container

This is important -- ptc_script is Container domain, so ToolExecutor will block it from running via the orchestrator PTC endpoint. This means it can only run inside worker containers (where the Python SDK env vars are set), which is the intended design. Good.

OBSERVATION: extra_env forwarding in ptc_script

The tool forwards ctx.extra_env to the Python subprocess. This is the correct mechanism for worker-injected credentials, but note that any key-value pair in extra_env will be visible to the Python script. This is acceptable since ptc_script already requires ApprovalRequirement::Always, but worth documenting.

Code quality

  • Error handling is clean -- thiserror for PtcError, proper error mapping throughout
  • No .unwrap() in production code (the ones in executor.rs are in test functions or .unwrap_or() safe variants)
  • Good test coverage: 16 tests across 4 modules covering happy paths, errors, security boundaries
  • std::sync::RwLock usage for the executor slot is correctly motivated (WASM spawn_blocking context)
  • Python SDK is stdlib-only as claimed, clean and minimal
  • WASM test fixture (tools-src/test-ptc) is well-structured

Verdict

The UTF-8 byte-slicing bug in truncate_output should be fixed before merge -- it's a crash on non-ASCII output, which is a real scenario for a tool that runs arbitrary Python scripts. The rest is solid.

@ilblackdragon
Copy link
Copy Markdown
Member

Next Steps

  1. Build the recording test harness (works on both branches)
  2. Record 3-5 baseline traces on main
  3. Record 3-5 PTC traces on this branch
  4. Build comparison script (parse JSON traces, produce table)
  5. Post results here

These steps make sense.
We should def consider a set of e2e tasks that we can run benchmarks with real models as well in nearai/benchmarks and compare on amount of tokens required. Ideally should be an easy tool that can do comparison main vs branch.

@zmanian zmanian force-pushed the feat/ptc-programmatic-tool-calling branch from 38a974d to 68b1f22 Compare March 12, 2026 21:52
@zmanian zmanian closed this Mar 12, 2026
@zmanian zmanian reopened this Mar 12, 2026
@zmanian zmanian force-pushed the feat/ptc-programmatic-tool-calling branch from f1d7951 to 467904e Compare March 12, 2026 22:42
@zmanian zmanian enabled auto-merge (squash) March 12, 2026 23:14
@zmanian zmanian force-pushed the feat/ptc-programmatic-tool-calling branch 2 times, most recently from dbe11cb to 291d34a Compare March 13, 2026 00:58
Copy link
Copy Markdown
Member

@ilblackdragon ilblackdragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(ptc): programmatic tool calling -- executor, SDK, and E2E tests

This is a well-architected PR. The ToolExecutor abstraction, lazy slot pattern for deferred wiring, container domain blocking, SSE redaction, and the Python SDK (stdlib-only, zero deps) are all thoughtfully designed. The follow-up security commits (54e37590, bde17202) address the critical server-side nesting-depth trust issue and SSE parameter leakage -- good work closing those.

That said, there are a few issues remaining, one of which is a blocker.


1. BLOCKING: UTF-8 panic in truncate_output (src/tools/builtin/ptc_script.rs, line ~123)

&output[..MAX_OUTPUT_SIZE]

Indexing a &str at a byte offset that falls inside a multi-byte UTF-8 codepoint panics at runtime. Python scripts routinely produce non-ASCII output (emoji, CJK, accented characters, even repr() of bytes). When MAX_OUTPUT_SIZE (65536) lands mid-character, the process crashes.

Fix options (pick one):

  • Walk backward with output.floor_char_boundary(MAX_OUTPUT_SIZE) (nightly, or hand-roll with is_char_boundary)
  • Use output.char_indices().take_while(|(i, _)| *i < MAX_OUTPUT_SIZE).last() to find the safe cut point
  • String::from_utf8_lossy(&output.as_bytes()[..MAX_OUTPUT_SIZE]) (replaces partial char with U+FFFD -- acceptable for truncation)

The existing test test_truncate_output only uses ASCII "x" characters and does not catch this. Please add a test with multi-byte characters (e.g., a string of "\u{1F600}" repeated past the limit).


2. Should fix: NestingGuard::drop uses raw subtraction (src/tools/wasm/wrapper.rs, line 49)

impl Drop for NestingGuard<'_> {
    fn drop(&mut self) {
        *self.depth -= 1;
    }
}

If *self.depth is somehow 0 when the guard drops (e.g., a logic error sets it to 0 before the guard runs, or a future refactor introduces a double-drop path), this panics in debug builds and wraps to u32::MAX in release -- silently disabling the nesting limit.

Use *self.depth = self.depth.saturating_sub(1); -- it compiles to the same machine code on the happy path and eliminates the edge case. The orchestrator side already uses saturating_add(1) (line 483 of api.rs), so the defensive style is consistent.


3. Should fix: set_tool_executor silently swallows lock poisoning (src/tools/registry.rs, line ~147)

pub fn set_tool_executor(&self, executor: Arc<ToolExecutor>) {
    if let Ok(mut guard) = self.tool_executor_slot.write() {
        *guard = Some(executor);
    }
}

A poisoned RwLock means a previous holder panicked. If this silently fails, every subsequent WASM PTC call gets a cryptic "no tool executor configured" error with zero diagnostic trail. At minimum, add:

} else {
    tracing::error!("tool_executor_slot RwLock is poisoned; PTC will be unavailable");
}

This gives operators a single log line pointing at the root cause instead of a stream of mysterious PTC failures.


4. Should fix: stray // ci fix comment (src/tools/builtin/memory.rs, line 639)

The file ends with a bare // ci fix after the closing brace of the test module. This looks like a leftover from a CI troubleshooting session. Please remove it.


5. Architecture notes (positive)

  • ToolExecutor + lazy slot: Clean separation of concerns. WASM tools registered before the executor exists still get PTC access at execution time -- avoids circular dependency.
  • Container domain blocking: ToolDomain::Container check in executor.rs prevents sandbox-escape via PTC. The regression test (test_container_domain_blocked) is good.
  • SSE redaction (54e37590): Closing the parameter-leak path on worker-reported events is important -- glad this was caught and fixed.
  • Python SDK: stdlib-only, clear docstrings, structured output via ptc_output(). The _env() helper with descriptive error messages is user-friendly.

6. Nice to have (non-blocking)

  • Add a test for truncate_output with multi-byte characters (4-byte emoji strings that cross the boundary).
  • Add a test that creates a NestingGuard with depth = 0 and drops it, verifying it does not panic (once saturating_sub is in place).
  • Consider whether MAX_OUTPUT_SIZE should be configurable (env var or tool param) -- 64KB is reasonable as a default but some scripts may legitimately produce more.

Summary: One blocker (UTF-8 panic), two defensive-coding fixes (NestingGuard, lock poisoning), and one cleanup (stray comment). The overall design is solid.

@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 16, 2026

Thanks for the thorough review @ilblackdragon. Addressing each point:

  1. UTF-8 panic in truncate_output — Agreed, this is the blocker. Will fix by walking back from MAX_OUTPUT_SIZE with is_char_boundary() to find a safe split point, and add a test covering multi-byte characters (e.g. emoji, CJK).

  2. NestingGuard::drop underflow — Will switch to saturating_sub(1).

  3. set_tool_executor lock poisoning — Will add error logging instead of silently swallowing it.

  4. Stray // ci fix comment — Will remove.

The inline findings from @gemini (container domain blocking, nesting depth bypass, SSE leak, timeout mismatch) have already been addressed in previous commits.

Will push the fixes shortly.

@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 16, 2026

Addressing review feedback from @ilblackdragon

All 4 items from your review have been addressed in commit 8ffc721:

1. BLOCKING: UTF-8 panic in truncate_output (ptc_script.rs)

Fixed. truncate_output now walks backward with is_char_boundary() to find a safe cut point:

let mut i = MAX_OUTPUT_SIZE;
while i > 0 && !output.is_char_boundary(i) {
    i -= 1;
}

Added test_truncate_output_multibyte_boundary that builds a string of 4-byte emoji characters crossing the boundary -- verifies no panic and that the kept portion contains only complete characters.

2. NestingGuard::drop uses raw subtraction (wrapper.rs)

Fixed. Changed to *self.depth = self.depth.saturating_sub(1); -- consistent with the saturating_add(1) used on the orchestrator side.

3. set_tool_executor silently swallows lock poisoning (registry.rs)

Fixed. Added an else branch with tracing::error!("tool_executor_slot RwLock is poisoned; PTC will be unavailable") so operators get a clear diagnostic instead of mysterious PTC failures.

4. Stray // ci fix comment (memory.rs)

Removed.


All changes verified: cargo fmt, cargo clippy --all --benches --tests --examples --all-features (zero warnings), and all PTC-related tests pass (ptc_script, executor, registry).

@zmanian zmanian force-pushed the feat/ptc-programmatic-tool-calling branch 2 times, most recently from f65fcf9 to 77a50b2 Compare March 17, 2026 03:06
@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 20, 2026

All review feedback has been addressed in the latest commits:

  • UTF-8 truncation panic: Fixed with is_char_boundary() backward scan + multibyte test
  • NestingGuard::drop: Changed to saturating_sub(1)
  • Lock poisoning in set_tool_executor: Now logs a diagnostic error
  • Stray // ci fix comment: Removed

Ready for re-review when you get a chance. Thanks!

zmanian and others added 16 commits March 21, 2026 08:11
Add ToolExecutor for standalone tool dispatch used by both the
orchestrator HTTP RPC endpoint and the WASM tool_invoke host function.
Includes Python SDK for container scripts, WASM test fixture, and
comprehensive E2E test coverage across all PTC paths.

Implementation:
- ToolExecutor with timeout, nesting depth limit, safety sanitization
- Orchestrator POST /worker/{job_id}/tools/call endpoint with SSE events
- WASM tool_invoke host function with alias resolution
- Python SDK (stdlib-only) with call_tool + convenience wrappers

Tests (16 new):
- 6 orchestrator HTTP RPC tests (auth, echo, not-found, timeout, SSE, no-executor)
- 3 executor integration tests (sanitization, invalid params, sequential)
- 4 Python SDK tests (env vars, request format, HTTP error, wrappers)
- 3 WASM E2E tests (echo via alias, alias not granted, no capability)

Refs #407

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r SDK

- Fix WASM tool_invoke production wiring: change tool_executor to a
  shared Arc<std::sync::RwLock> slot with lazy resolution so WASM tools
  registered during build_all() can access the executor set afterward
- Add nesting_depth field to ToolCallRequest and propagate it through
  the orchestrator's tool_call_handler into JobContext
- Add ptc_script built-in tool: runs Python scripts with ironclaw_tools
  SDK pre-imported, env-scrubbed subprocess, structured output support
- Copy Python SDK into Docker worker image at dist-packages path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Python SDK: remove 60s minimum timeout enforcement, respect
  requested timeout with 5s network buffer
- Rust executor: cap timeout at MAX_TIMEOUT_SECS instead of
  falling back to default when exceeded
- WASM wrapper: use RAII guard for nesting depth to prevent
  leak on panic

Addresses Gemini review feedback on PR #408.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Python SDK client timeout to use actual timeout_secs + 5s buffer
  instead of enforcing 60s minimum
- Cap tool execution timeout at MAX_TIMEOUT_SECS instead of falling
  back to default when exceeded
- Use RAII guard for tool_nesting_depth to ensure decrement on panic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alphabetize PromptQueue before PtcScriptTool to pass cargo fmt check.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… redaction

Three security fixes from code review:

1. CRITICAL: Block Container-domain tools from executing on the
   orchestrator host. The ToolExecutor now checks tool.domain() and
   rejects Container tools with PtcError::DomainBlocked, preventing
   sandbox escape / RCE.

2. MEDIUM: Floor client-provided nesting_depth at 1 instead of trusting
   the worker's value. A malicious worker can no longer send
   nesting_depth=0 to bypass MAX_NESTING_DEPTH.

3. MEDIUM: Redact tool parameters in SSE JobToolUse events to prevent
   leaking sensitive data (API keys, passwords) to web UI observers.

4. Python SDK: Always send timeout_secs to server and use server_timeout+5
   for client-side HTTP timeout to prevent premature client timeouts.

Regression test: test_container_domain_blocked verifies Container-domain
tools are rejected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The import was dropped during rebase onto staging, causing compilation
failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Staging moved schema hint logic to display-time in schema() method,
so the construction-time call from PTC is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Nesting depth: change from .max(1) to .saturating_add(1) so the
  orchestrator always increments the depth server-side rather than
  trusting the client-supplied value. This prevents a malicious worker
  from bypassing the nesting limit by always sending 0.

- SSE redaction: redact raw input parameters from worker-reported
  tool_use events in job_event_handler before broadcasting via SSE.
  Previously only the PTC path redacted; worker-reported events leaked
  raw parameters (potentially containing API keys, passwords, PII)
  to the web UI.

- Domain check and Python SDK timeout were already addressed in the
  current branch.

- Add regression tests for both fixes.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… lock poison logging, cleanup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…for_schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove redundant #![cfg(test)] inner attribute from codex_test_helpers.rs
(already gated by #[cfg(test)] in mod.rs), fixing the clippy
duplicated_attributes warning. Also apply cargo fmt import reordering
in tools/mod.rs.

https://claude.ai/code/session_017ckCCurNiBL8uzE4dJg59K
@zmanian zmanian force-pushed the feat/ptc-programmatic-tool-calling branch from f689ec0 to 71d5c49 Compare March 21, 2026 09:11
@github-actions github-actions bot added the scope: llm LLM integration label Mar 21, 2026
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 21, 2026

All four review items from @ilblackdragon have been addressed (in commit 786df99):

  1. UTF-8 panic in truncate_output (BLOCKER): Fixed — uses is_char_boundary() to find a safe cut point. Multibyte emoji test added (test_truncate_output_multibyte).
  2. NestingGuard::drop: Now uses self.depth.saturating_sub(1).
  3. set_tool_executor lock poisoning: Added tracing::error! logging in the else branch.
  4. Stray // ci fix comment: Removed.

Ready for re-review.

@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 21, 2026

Thanks for the thorough review @ilblackdragon. All four items addressed in db1e83c:

1. UTF-8 panic in truncate_output (BLOCKING) -- Fixed. The byte-index slice &output[..MAX_OUTPUT_SIZE] now walks backward with is_char_boundary() to find a safe cut point before slicing. Added two regression tests: one with 4-byte emoji (\u{1F600}) and one with 2-byte accented characters (\u{00E9}), both producing strings that cross the MAX_OUTPUT_SIZE boundary mid-character.

2. NestingGuard::drop underflow -- Changed *self.depth -= 1 to *self.depth = self.depth.saturating_sub(1). Added two tests: one verifying depth=0 does not panic on drop, one verifying normal decrement (3 -> 2).

3. set_tool_executor silent lock poisoning -- Added tracing::error! logging in both the write path (set_tool_executor in registry.rs) and the read path (slot.read() in wrapper.rs line ~756) where the lock is actually consumed. Both now log an explicit error message pointing at RwLock poisoning as the root cause.

4. Stray // ci fix comment -- This was already removed in a prior commit (not present on the branch).

@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Mar 23, 2026

Closing in favor of #1557 (v2 architecture), which introduces a unified execution model with CodeAct that supersedes the PTC approach. The v2 engine's Capability primitive and Monty-based tool composition provide a more general solution for tool-to-tool dispatch.

@zmanian zmanian closed this Mar 23, 2026
auto-merge was automatically disabled March 23, 2026 03:48

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: llm LLM integration scope: orchestrator Container orchestrator scope: sandbox Docker sandbox scope: tool/builtin Built-in tools scope: tool/wasm WASM tool sandbox scope: tool Tool infrastructure scope: worker Container worker size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants