fix(beta): persist server-side builtin tool calls in history#2695

Merged
Lancetnik merged 18 commits into ag2ai:main from vvlrff:fix/anthropic-builtin-tool-history on Apr 27, 2026
Conversation

@vvlrff vvlrff commented Apr 16, 2026

Summary

Provider-executed builtin tools (web search, web fetch, code execution, image generation, etc.) used to disappear from agent history. They were dispatched and executed server-side by the provider — but the client silently dropped the resulting blocks (ServerToolUseBlock, executable_code, code_execution_result, ResponseFunctionWebSearch, ImageGenerationCall, …). On a chained reply.ask(), the model had no record of its own prior server-side tool activity and would either hallucinate, repeat the call, or reject the follow-up because the assistant message it received back didn't match what it had emitted.

This PR introduces a uniform mechanism for capturing those server-side tool calls and results as typed events, persisting them through the agent's stream/history, and round-tripping them back into the provider's API on subsequent turns. It covers Anthropic, Google Gemini, and OpenAI Responses providers, and cleans up several related issues uncovered along the way.

Problem

  1. The client loops in all three providers only handled "regular" content blocks (text, thinking, function/tool calls). Server-side blocks were dropped on the floor:
    • Anthropic: ServerToolUseBlock, WebSearchToolResultBlock, WebFetchToolResultBlock, CodeExecutionToolResultBlock, BashCodeExecutionToolResultBlock, TextEditorCodeExecutionToolResultBlock.
    • Gemini: Part.executable_code, Part.code_execution_result, candidate.grounding_metadata (Google Search / URL Context).
    • OpenAI Responses: ResponseFunctionWebSearch, ResponseCodeInterpreterToolCall, ImageGenerationCall, ResponseReasoningItem.
  2. convert_messages() in each provider's mapper had no branches for these events, so even if they had been emitted they wouldn't be serialised back into the API on the next turn.
  3. Anthropic's pause_turn continuation loop appended intermediate responses to a local messages list without sending them through the stream, so history never saw them.
  4. Once BuiltinToolCallEvent (a ToolCallEvent subclass) started being emitted, the executor's _tool_not_found fallback fired for tools that actually ran on the server — producing spurious ToolNotFoundEvent entries in history.
  5. ShellTool on Anthropic was silently broken — Anthropic's bash tool is client-side, but ShellTool.register() was a no-op, so the executor raised ToolNotFoundError and the model hallucinated output.

Solution

Each provider gets a small, isolated events.py module that defines provider-specific subclasses of BuiltinToolCallEvent / BuiltinToolResultEvent. The events:

  • Carry the provider's native object (block, part, item) as a non-repr, non-default Field, so it can be re-serialised verbatim on the next turn (no lossy reconstruction).
  • Expose factory classmethods (from_block, from_item, from_executable_code, from_grounding, from_code_execution_result) that map provider-side names/IDs/inputs into the framework-level BuiltinToolCallEvent / BuiltinToolResultEvent shape.
  • Are emitted by the client through context.send() so they end up in the persistent stream / conversation history just like every other event.
  • Are recognised by each provider's convert_messages() and converted back into the provider-specific message format on subsequent turns.

The result: server-side tool activity becomes a first-class part of the conversation history and survives across reply.ask() chains, multi-turn flows, and persistent stream backends.
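The event shape described above can be sketched as follows. This is a minimal illustration using plain dataclasses in place of the framework's pydantic events; `FakeServerToolUseBlock` stands in for `anthropic.types.ServerToolUseBlock`, and the `_NAME_MAP` contents are illustrative, not the real mapping table.

```python
from dataclasses import dataclass, field
from typing import Any

WEB_SEARCH_TOOL_NAME = "web_search"  # illustrative canonical name


@dataclass
class BuiltinToolCallEvent:
    """Simplified stand-in for the framework-level event."""
    name: str
    call_id: str
    arguments: dict[str, Any]
    # The provider's native object, carried verbatim (and kept out of
    # repr) so it can be re-serialised losslessly on the next turn.
    raw: Any = field(default=None, repr=False)


@dataclass
class FakeServerToolUseBlock:
    """Stand-in for anthropic.types.ServerToolUseBlock."""
    id: str
    name: str
    input: dict[str, Any]


class AnthropicServerToolCallEvent(BuiltinToolCallEvent):
    _NAME_MAP = {"web_search": WEB_SEARCH_TOOL_NAME}

    @classmethod
    def from_block(cls, block: FakeServerToolUseBlock) -> "AnthropicServerToolCallEvent":
        # Map the provider-side name/id/input into the framework shape,
        # keeping the original block for round-tripping.
        return cls(
            name=cls._NAME_MAP.get(block.name, block.name),
            call_id=block.id,
            arguments=block.input,
            raw=block,
        )


event = AnthropicServerToolCallEvent.from_block(
    FakeServerToolUseBlock(id="srvtoolu_1", name="web_search", input={"query": "ag2"})
)
print(event.name, event.call_id)  # web_search srvtoolu_1
```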

Changes by Provider

Anthropic

New: autogen/beta/config/anthropic/events.py

  • AnthropicServerToolCallEvent(BuiltinToolCallEvent) — wraps ServerToolUseBlock. from_block() maps web_search / web_fetch / code_execution / bash_code_execution / text_editor_code_execution block names to canonical AG2 tool names (WEB_SEARCH_TOOL_NAME, WEB_FETCH_TOOL_NAME, CODE_EXECUTION_TOOL_NAME).
  • AnthropicServerToolResultEvent(BuiltinToolResultEvent) — wraps the union of *ToolResultBlock types via AnthropicServerToolResultBlockType. from_block() dispatches by isinstance to the canonical tool name.

Modified: autogen/beta/config/anthropic/anthropic_client.py

  • Imports ServerToolUseBlock and the result block types from anthropic.types.
  • New helper _emit_builtin_tool_events() — converts server-side blocks into typed events.
  • _process_response() and _process_stream() now handle server_tool_use and *_tool_result content blocks for both streaming and non-streaming paths.
  • The non-streaming pause_turn continuation loop now calls _emit_builtin_tool_events() on every intermediate response so the events show up in the stream (the streaming loop already routed through _process_stream()).
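The dispatch performed by the helper reduces to something like the sketch below. Block types and the `send` callback are simplified stand-ins for the real client/context API; only the branching idea is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class FakeBlock:
    type: str
    name: str = ""


def emit_builtin_tool_events(blocks, send):
    """Convert server-side blocks into (kind, block) events and send them."""
    for block in blocks:
        if block.type == "server_tool_use":
            # real code: context.send(AnthropicServerToolCallEvent.from_block(block))
            send(("call", block))
        elif block.type.endswith("_tool_result"):
            # real code: context.send(AnthropicServerToolResultEvent.from_block(block))
            send(("result", block))
        # text / thinking / client-side tool_use blocks follow existing paths


seen = []
emit_builtin_tool_events(
    [
        FakeBlock("text"),
        FakeBlock("server_tool_use", "web_search"),
        FakeBlock("web_search_tool_result"),
    ],
    seen.append,
)
print([kind for kind, _ in seen])  # ['call', 'result']
```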

Modified: autogen/beta/config/anthropic/mappers.py

  • convert_messages() now recognises AnthropicServerToolCallEvent / AnthropicServerToolResultEvent and re-serialises them via block.model_dump(exclude_none=True, mode="json"). Both attach to the same assistant message, matching Anthropic's expected wire format.
  • ShellToolSchema now raises UnsupportedToolError("shell", "anthropic") — Anthropic's bash is a client-side tool. Users should switch to LocalShellTool, which works with any provider via subprocess.
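The round-trip back into the API can be sketched like this. `dataclasses.asdict` stands in for pydantic's `block.model_dump(exclude_none=True, mode="json")`, and the event/block shapes are illustrative.

```python
from dataclasses import dataclass, asdict, field
from typing import Any


@dataclass
class ServerToolBlock:
    type: str
    id: str
    name: str
    input: dict[str, Any] = field(default_factory=dict)


@dataclass
class ServerToolEvent:
    block: ServerToolBlock  # native block carried verbatim by the event


def events_to_assistant_content(events):
    """Re-serialise captured server-tool blocks onto one assistant message,
    matching the provider's expected wire format."""
    return [asdict(ev.block) for ev in events]


content = events_to_assistant_content(
    [ServerToolEvent(ServerToolBlock("server_tool_use", "srvtoolu_1", "web_search", {"query": "ag2"}))]
)
print(content[0]["type"])  # server_tool_use
```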

Google Gemini

New: autogen/beta/config/gemini/events.py

  • GeminiServerToolCallEvent(BuiltinToolCallEvent) — wraps types.Part and/or types.GroundingMetadata. Two factories:
    • from_executable_code(part) — converts Part.executable_code (code interpreter calls) into a tool call with {"code": ..., "language": ...} arguments.
    • from_grounding(gm, name=...) — converts GroundingMetadata (Google Search / URL Context calls) into a tool call with {"queries": [...]} arguments. Generates a synthetic UUID since Gemini doesn't return an id.
  • GeminiServerToolResultEvent(BuiltinToolResultEvent) — same wrapping, with from_code_execution_result() and from_grounding() factories.
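The grounding factory can be sketched as below. `FakeGroundingMetadata` stubs `types.GroundingMetadata`, the returned dict stands in for the event, and the synthetic-id format is illustrative.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeGroundingMetadata:
    """Stub for google.genai types.GroundingMetadata."""
    web_search_queries: list[str] = field(default_factory=list)


def from_grounding(gm: FakeGroundingMetadata, name: str) -> dict:
    # Gemini returns no id for grounding calls, so synthesise a UUID
    # to link the call event with its result event.
    return {
        "name": name,
        "call_id": str(uuid.uuid4()),
        "arguments": {"queries": list(gm.web_search_queries)},
    }


call = from_grounding(FakeGroundingMetadata(["ag2 release notes"]), name="web_search")
print(call["arguments"])  # {'queries': ['ag2 release notes']}
```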

Modified: autogen/beta/config/gemini/gemini_client.py

  • _process_response() and _process_stream() now walk every candidate's parts and emit:
    • A GeminiServerToolCallEvent for executable_code parts, immediately followed by a matching GeminiServerToolResultEvent when a code_execution_result part appears (linked via pending_code_call_id).
    • A grounding call/result pair when the candidate has grounding_metadata (deferred to the end of the stream so the final, fully-populated metadata is used).

Modified: autogen/beta/config/gemini/mappers.py

  • convert_messages() now recognises GeminiServerToolCallEvent / GeminiServerToolResultEvent. When the wrapped event carries a native types.Part, it's appended back to the previous model-role Content so the conversation history matches Gemini's expected shape.
  • New grounding_tool_name(gm) helper — chooses web_search vs web_fetch based on whether web_search_queries is set.
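The helper's logic reduces to a one-liner; `gm` is stubbed here and the string names follow the description above.

```python
def grounding_tool_name(gm) -> str:
    # populated web_search_queries => Google Search; otherwise URL Context
    return "web_search" if getattr(gm, "web_search_queries", None) else "web_fetch"


class GM:  # stub with queries set
    web_search_queries = ["ag2"]


print(grounding_tool_name(GM()))      # web_search
print(grounding_tool_name(object()))  # web_fetch
```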

OpenAI Responses

Modified: autogen/beta/config/openai/openai_responses_client.py

  • Removed ~135 lines of inline server-tool handling logic. The client now delegates to the typed events introduced in events.py (OpenAIServerToolCallEvent, OpenAIServerToolResultEvent, OpenAIReasoningEvent) via their from_item() factories — both for the non-streaming _process_response() and the streaming _process_stream().
  • ImageGenerationCall.result is still decoded into BinaryResult files alongside the typed event, preserving backwards compatibility for image outputs.

Modified: autogen/beta/config/openai/events.py

  • OpenAIServerToolCallEvent.from_item() now handles ResponseFunctionWebSearch, ResponseCodeInterpreterToolCall, and ImageGenerationCall uniformly.
  • OpenAIServerToolResultEvent.from_item() mirrors the dispatch.
  • OpenAIReasoningEvent carries the original ResponseReasoningItem so it can be replayed verbatim.

Modified: autogen/beta/config/openai/mappers.py

  • events_to_responses_input() recognises OpenAIReasoningEvent and OpenAIServerToolCallEvent and serialises them via message.item.model_dump(exclude_none=True, mode="json") — the Responses API accepts the same dict back as input.

Builtin Tools

All builtin tools that proxy to a server-executed provider tool now register a no-op sub_scope listener in their register() method, scoped to BuiltinToolCallEvent.name == <TOOL_NAME>:

  • code_execution.py, image_generation.py, mcp_server.py, memory.py, skills.py, web_fetch.py, shell.py

This consumes the synthesised builtin-tool-call event so the executor's _tool_not_found fallback no longer fires for tools that ran on the server. The function is module-level (no nested closures inside hot paths) per project conventions.

code_execution.py also adds a CodeExecutionVersions TypeAlias and registers the new code_execution_20260120 Anthropic version alongside the existing code_execution_20250825.

shell.py docstrings are rewritten to reflect that only OpenAI Responses API executes shell server-side; Anthropic users are pointed to LocalShellTool.

@github-actions github-actions Bot added the beta label Apr 16, 2026
@Lancetnik Lancetnik self-assigned this Apr 21, 2026
vvlrff added 4 commits April 26, 2026 15:13
…egrations

- Added `from_block` and `from_grounding` methods to `AnthropicServerToolCallEvent` and `AnthropicServerToolResultEvent` for improved event creation from tool use blocks.
- Introduced `GeminiServerToolCallEvent` and `GeminiServerToolResultEvent` classes to handle tool calls and results in the Gemini integration, including methods for creating events from executable code and grounding metadata.
- Updated `gemini_client.py` to process responses and streams, emitting appropriate tool call and result events.
- Enhanced `mappers.py` to support new event types and ensure proper conversion of messages.
- Removed unused imports and cleaned up event handling in OpenAI integration, streamlining the response processing logic.
- Added comprehensive tests for new event handling in both Anthropic and Gemini configurations, ensuring correct behavior for tool calls and results.
@vvlrff vvlrff marked this pull request as ready for review April 26, 2026 16:33
@vvlrff vvlrff changed the title fix(beta): persist Anthropic server-side builtin tool calls in history fix(beta): persist server-side builtin tool calls in history Apr 26, 2026
@Lancetnik Lancetnik enabled auto-merge April 27, 2026 20:40

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 72.57384% with 65 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
autogen/beta/config/anthropic/anthropic_client.py 20.83% 16 Missing and 3 partials ⚠️
...ogen/beta/config/openai/openai_responses_client.py 40.00% 13 Missing and 2 partials ⚠️
autogen/beta/config/anthropic/events.py 75.00% 5 Missing and 4 partials ⚠️
autogen/beta/config/openai/events.py 88.57% 2 Missing and 2 partials ⚠️
autogen/beta/tools/builtin/image_generation.py 25.00% 3 Missing ⚠️
autogen/beta/tools/builtin/mcp_server.py 25.00% 3 Missing ⚠️
autogen/beta/tools/builtin/memory.py 25.00% 3 Missing ⚠️
autogen/beta/tools/builtin/shell.py 25.00% 3 Missing ⚠️
autogen/beta/tools/builtin/web_fetch.py 25.00% 3 Missing ⚠️
autogen/beta/config/gemini/gemini_client.py 93.54% 0 Missing and 2 partials ⚠️
... and 1 more
Files with missing lines Coverage Δ
autogen/beta/config/anthropic/mappers.py 86.99% <100.00%> (+1.28%) ⬆️
autogen/beta/config/gemini/events.py 100.00% <100.00%> (ø)
autogen/beta/config/gemini/mappers.py 83.66% <100.00%> (+2.09%) ⬆️
autogen/beta/config/openai/mappers.py 82.81% <100.00%> (+0.95%) ⬆️
autogen/beta/tools/builtin/code_execution.py 88.46% <100.00%> (+0.46%) ⬆️
autogen/beta/tools/builtin/skills.py 96.42% <75.00%> (-3.58%) ⬇️
autogen/beta/config/gemini/gemini_client.py 72.99% <93.54%> (+15.98%) ⬆️
autogen/beta/tools/builtin/image_generation.py 93.18% <25.00%> (-4.38%) ⬇️
autogen/beta/tools/builtin/mcp_server.py 92.85% <25.00%> (-4.58%) ⬇️
autogen/beta/tools/builtin/memory.py 88.00% <25.00%> (-7.46%) ⬇️
... and 6 more

... and 52 files with indirect coverage changes


@Lancetnik Lancetnik added this pull request to the merge queue Apr 27, 2026
Merged via the queue into ag2ai:main with commit 52187d8 Apr 27, 2026
31 of 34 checks passed