Skip to content

feat(beta): add ToolPolicyMiddleware for tool access control#2533

Open
amabito wants to merge 8 commits intoag2ai:mainfrom
amabito:feat/tool-policy-middleware
Open

feat(beta): add ToolPolicyMiddleware for tool access control#2533
amabito wants to merge 8 commits intoag2ai:mainfrom
amabito:feat/tool-policy-middleware

Conversation

@amabito
Copy link
Copy Markdown
Contributor

@amabito amabito commented Mar 31, 2026

Why are these changes needed?

Pulls the tool allow/block list logic out of GovernanceMiddleware (#2501,
closed) into its own middleware. Sits on on_tool_execution, checks the
tool name against blocked and allowed lists, returns ToolErrorEvent if denied.

blocked_tools overrides allowed_tools. allowed_tools=None means no
restriction; allowed_tools=[] means deny all. Config is frozen after
construction.

Related issue number

Replaces the tool policy portion of #2501 (closed).
Related: #2531 (BudgetMiddleware), #2532 (CircuitBreakerMiddleware).

Checks

@Lancetnik
Copy link
Copy Markdown
Member

@amabito hi! Thank you for the middlewares contribution, but I don't see a sense of the current one

  1. If you don't wont to let an Agent execute a tool, you can just do not pass it to context. Provide models with information about tool and forbid its execution looks strange to me
  2. Global counters _total_tool_calls and _total_blocked also could slowly lead to memory leaks because we have no options to drop them. Also, I don't see a reason to have such counters

@amabito
Copy link
Copy Markdown
Contributor Author

amabito commented Apr 9, 2026

Hey @Lancetnik, thanks for the pushback -- both points are worth addressing directly.

"Just don't pass the tool to context"

Static exclusion at registration time works when the policy is fixed for the session, and for that case you're right that a middleware adds nothing. Where it stops working:

  1. Audit of denied attempts. If a tool was advertised to the model earlier in the session (cached context, multi-turn history, a prior planning step) and policy changes mid-session, the model may still try to call it. Static exclusion at registration doesn't intercept that call or leave any record. The middleware catches the attempt, returns a structured ToolErrorEvent via _make_tool_error (tool_policy.py L37-L47), and surfaces it on the event stream where observability hooks can pick it up. "The model tried to call delete_all and was blocked" is a signal you want to see, not silence.

  2. Dynamic policy updates. Policies driven by external state -- rate limiting, user role, per-tenant rules, kill-switches triggered by an ops dashboard -- can't be expressed by omitting the tool at construction time, because the agent is already built. The middleware holds _policy as a swappable reference, so a supervisor can update it without rebuilding the agent. I haven't exposed a setter yet, but the shape is there.

  3. Deny-by-default sandboxes. An allowlist-based agent (allowed_tools=["search", "calc"]) is a cleaner primitive than "register every tool except these", especially when the tool set is assembled dynamically from plugins or MCP servers. The check in _ToolPolicy.check (L59-L68) enforces deny-beats-allow precedence, which is the usual policy semantics.

If even the audit case feels out of scope for AG2's middleware layer and you'd rather see this as a dedicated capability or event hook rather than a middleware, I'm open to that -- say the word and I'll repackage it. But I don't think "don't pass the tool" covers these cases.

Counters and the leak concern

The counters at L90-L91 are instance attributes on ToolPolicyMiddleware, not module-level globals, so they don't persist across unrelated factories. What they do is grow unbounded over the lifetime of a long-lived factory with no reset path, which is a real API design smell even if it's not strictly a leak.

I'd rather just drop the counters. They were a convenience for testing and a half-hearted observability hook, and neither justifies stateful middleware. Concrete change:

  • Remove _total_tool_calls, _total_blocked, _lock, and the two properties (L90-L104).
  • Add an optional on_blocked: Callable[[ToolCallEvent, str], None] | None parameter. Users who want metrics wire it to their own Prometheus counter / log / whatever. The middleware stays stateless.
  • Tests that check counter values get rewritten against a mock callback.

Landing it in the next revision alongside the constructor simplification from #2531 / #2532.

amabito added a commit to amabito/ag2 that referenced this pull request Apr 9, 2026
Address review feedback from @Lancetnik on ag2ai#2533:

- Flatten constructor API: pass blocked_tools/allowed_tools/on_blocked
  directly instead of requiring callers to build a ToolPolicyConfig
- Drop stateful counters (_total_tool_calls, _total_blocked, _lock)
  and their accessor properties -- the middleware is now stateless
- Add optional on_blocked callback so callers can wire their own
  observability (metrics, audit log) without the library retaining
  any per-factory state

The ToolPolicyConfig dataclass is kept as an internal struct for
immutability (tuple freezing in __post_init__).
@Lancetnik
Copy link
Copy Markdown
Member

  • Dynamic policy updates. Policies driven by external state -- rate limiting, user role, per-tenant rules, kill-switches triggered by an ops dashboard -- can't be expressed by omitting the tool at construction time, because the agent is already built. The middleware holds _policy as a swappable reference, so a supervisor can update it without rebuilding the agent. I haven't exposed a setter yet, but the shape is there.

Fair, it seems the commonest case for such feature - but I don't see a public API covers it. Can we introduce a some?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer end-to-end test here using public API (or at least do not use non-public objects). Such tests are not so implementation-tied and not so fragile. Also, they cover real usecases and could be used as a some kind of documentation

Please, try to implement each test using TestClient / TracingClient. Reference: https://github.com/ag2ai/ag2/blob/main/test/beta/middleware/test_tool_execution.py

amabito added a commit to amabito/ag2 that referenced this pull request Apr 11, 2026
Address review feedback from @Lancetnik on ag2ai#2533:

- Flatten constructor API: pass blocked_tools/allowed_tools/on_blocked
  directly instead of requiring callers to build a ToolPolicyConfig
- Drop stateful counters (_total_tool_calls, _total_blocked, _lock)
  and their accessor properties -- the middleware is now stateless
- Add optional on_blocked callback so callers can wire their own
  observability (metrics, audit log) without the library retaining
  any per-factory state

The ToolPolicyConfig dataclass is kept as an internal struct for
immutability (tuple freezing in __post_init__).
amabito added a commit to amabito/ag2 that referenced this pull request Apr 11, 2026
Address @Lancetnik follow-up on ag2ai#2533: the dynamic-policy justification
needs a public API to be useful. Adds:

- ToolPolicyMiddleware.update_policy(config: ToolPolicyConfig)
  Atomically replaces the active policy. Subsequent tool calls see the
  new config; per-invocation instances already inside on_tool_execution
  keep their captured snapshot, so in-flight executions are never torn
  across a config change.

- ToolPolicyMiddleware.config property
  Read-only access to the active ToolPolicyConfig.

Intended hooks: rate-limit kill switches, per-tenant rule updates from
an ops dashboard, role changes during a long-running session -- all
without rebuilding the agent.

Thread-safety: _policy is assigned last so readers in __call__ never
observe a state where _config is new but _policy is stale (Python
attribute assignment is atomic for single references).

Tests: 19/19 pass, covering update-then-permit, in-flight-snapshot
isolation, restriction clearing, and the config property.
@amabito amabito force-pushed the feat/tool-policy-middleware branch from 3253b29 to 5984625 Compare April 11, 2026 08:58
@amabito
Copy link
Copy Markdown
Contributor Author

amabito commented Apr 11, 2026

@Lancetnik Thanks -- fair point, "it's in the shape" is not a public API. Added one in 5984625.

ToolPolicyMiddleware.update_policy(config)

mw = ToolPolicyMiddleware(allowed_tools=["search", "calc"])
agent = MyAgent(middleware=[mw])

# Later, from an ops callback, tenant-context hook, rate-limit watchdog, etc.:
mw.update_policy(ToolPolicyConfig(
    blocked_tools=["delete_all", "shell_exec"],
    allowed_tools=["search", "calc"],
))

The swap is atomic with a specific contract:

  • Subsequent calls to __call__ (i.e. new events arriving at the agent) pick up the new policy.
  • Per-invocation _ToolPolicyInstance objects that are already inside on_tool_execution keep the policy snapshot they captured at construction time. In-flight tool executions are never torn across a config change -- you won't see a call half-evaluated against the old rules and half against the new.
  • _policy is assigned last (after _config) so a concurrent reader in __call__ never observes a state where _config is new but _policy is stale. Single-reference attribute assignment is atomic in CPython, so no lock is needed for the happy path.

Also added mw.config as a read-only property for supervisors that want to log or diff the active configuration.

Tests cover:

  1. test_update_policy_takes_effect_on_next_call -- deny shell_exec, swap policy, next call permits it.
  2. test_update_policy_in_flight_instance_uses_snapshot -- construct instance, swap to "no restrictions", confirm the in-flight instance still enforces the old allowlist.
  3. test_update_policy_clears_restrictions -- passing a fresh ToolPolicyConfig() drops all rules.
  4. test_config_property_returns_active_config.

Branch is also rebased onto current main -- the CONFLICTING state is resolved.

Other changes carried from the earlier reply (already in the branch via 1d96b4a): _total_tool_calls / _total_blocked / _lock and their properties are gone; observation goes through the on_blocked callback.

19/19 tests pass, pre-commit clean.

@amabito
Copy link
Copy Markdown
Contributor Author

amabito commented Apr 11, 2026

Sorry for the rapid-fire -- ran the middleware through another review pass and wanted to send the consolidated state.

Since the last update:

  • update_policy() contract. Docstring now states clearly that self._policy is the sole authority for enforcement in __call__; self._config is updated for the .config property. Under CPython the two STORE_ATTR are not a single transaction, so a .config reader may briefly see the new config while __call__ uses the previous policy for one final instantiation. test_concurrent_update_policy_consistent_decisions exercises 20 parallel on_tool_execution calls while a background writer alternates policies and confirms every result is a valid decision against some snapshot.
  • Callback isolation. on_blocked is now wrapped in contextlib.suppress(Exception), so a failing audit callback does not prevent the ToolErrorEvent from being returned.
  • Export. ToolPolicyMiddleware and ToolPolicyConfig are re-exported from both autogen.beta.middleware and autogen.beta.middleware.builtin. Regression test test_tool_policy_importable_from_middleware_top_level covers the top-level path. Note: denials currently produce ToolErrorEvent(PermissionError(reason)), not a dedicated ToolDeniedError -- happy to add a named exception in a follow-up if that matters for callers.
  • Edge-case tests added. test_empty_tool_name_denied_by_allowlist and test_empty_allowlist_blocks_all_via_public_interface cover corner cases through the public middleware interface.

25 tests, all passing. Pre-commit clean.

amabito added 5 commits April 14, 2026 14:55
Single-responsibility middleware that blocks disallowed tool calls
based on configurable allow/block lists. Extracted from
GovernanceMiddleware to align with beta v2 composable middleware design.

- ToolPolicyConfig: blocked_tools, allowed_tools
- blocked_tools overrides allowed_tools
- allowed_tools=None means no restriction, [] means deny all
- Immutable config (lists frozen to tuples)
Address review feedback from @Lancetnik on ag2ai#2533:

- Flatten constructor API: pass blocked_tools/allowed_tools/on_blocked
  directly instead of requiring callers to build a ToolPolicyConfig
- Drop stateful counters (_total_tool_calls, _total_blocked, _lock)
  and their accessor properties -- the middleware is now stateless
- Add optional on_blocked callback so callers can wire their own
  observability (metrics, audit log) without the library retaining
  any per-factory state

The ToolPolicyConfig dataclass is kept as an internal struct for
immutability (tuple freezing in __post_init__).
Address @Lancetnik follow-up on ag2ai#2533: the dynamic-policy justification
needs a public API to be useful. Adds:

- ToolPolicyMiddleware.update_policy(config: ToolPolicyConfig)
  Atomically replaces the active policy. Subsequent tool calls see the
  new config; per-invocation instances already inside on_tool_execution
  keep their captured snapshot, so in-flight executions are never torn
  across a config change.

- ToolPolicyMiddleware.config property
  Read-only access to the active ToolPolicyConfig.

Intended hooks: rate-limit kill switches, per-tenant rule updates from
an ops dashboard, role changes during a long-running session -- all
without rebuilding the agent.

Thread-safety: _policy is assigned last so readers in __call__ never
observe a state where _config is new but _policy is stale (Python
attribute assignment is atomic for single references).

Tests: 19/19 pass, covering update-then-permit, in-flight-snapshot
isolation, restriction clearing, and the config property.
…eware at top level

TP-1 (update_policy atomicity):
The previous docstring said "Assign _policy last so readers in __call__
never observe a state where _config is new but _policy is stale", but
the assignment order was _config then _policy -- which is exactly right.
The comment was misleading.  Clarify: _config is assigned first so the
config property never lags behind the active policy; under CPython's GIL
each assignment is atomic at the bytecode level, so __call__ readers
cannot observe a mismatched pair.

TP-2 (export gap):
autogen.beta.middleware top-level __init__ did not re-export
ToolPolicyMiddleware or ToolPolicyConfig.  Add imports and __all__
entries so callers can reach them from the canonical import path.
Add test_tool_policy_importable_from_middleware_top_level regression
test.
…ersarial tests

- Wrap on_blocked callback in contextlib.suppress(Exception) so a
  failing audit observer does not prevent the ToolErrorEvent from
  being returned to the caller.
- Add test_on_blocked_callback_exception_does_not_prevent_denial.
- Add test_concurrent_update_policy_consistent_decisions: 20 parallel
  coroutines interleaved with a background policy-writer; asserts
  every result is a valid ToolErrorEvent or ToolResultEvent.
- Add test_empty_tool_name_denied_by_allowlist.
- Add test_empty_allowlist_blocks_all_via_public_interface covering
  ToolPolicyMiddleware(allowed_tools=[]) through the public interface.
@amabito amabito force-pushed the feat/tool-policy-middleware branch from 5043ecc to 2cd9dd9 Compare April 14, 2026 05:55
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
autogen/beta/middleware/__init__.py 100.00% <ø> (ø)
autogen/beta/middleware/builtin/__init__.py 93.75% <100.00%> (+0.41%) ⬆️
autogen/beta/middleware/builtin/tool_policy.py 100.00% <100.00%> (ø)

... and 48 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants