
feat: add web_fetch built-in tool#435

Merged
henrypark133 merged 6 commits into main from worktree-web-fetch on Mar 1, 2026

Conversation

@henrypark133
Collaborator

Summary

  • Adds a new web_fetch built-in tool distinct from the generic http tool
  • GET-only, auto-approved (ApprovalRequirement::Never), purpose-built for reading web pages and documentation
  • Returns structured output: {url, final_url, status, title, content, word_count}
  • HTML → Markdown conversion via the existing Readability pipeline (html-to-markdown feature)
  • Chrome-like User-Agent and Accept: text/markdown, text/html header so sites that block the default reqwest User-Agent string return real content
  • Follows up to 3 redirects — each Location URL is SSRF-validated through validate_url() before the next request is sent, so SSRF protection applies to every hop identically to the original request
  • final_url in the output reflects the canonical URL after any redirects (equals url when none occurred)
  • Reuses existing security infrastructure: validate_url() (HTTPS-only, no private IPs, DNS rebinding defence), LeakDetector, 5 MB response cap, inbound sanitization
  • web_fetch added to PROTECTED_TOOL_NAMES to prevent shadowing by WASM/MCP tools
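
The structured output listed above can be sketched as a Rust struct. This is illustrative only: the field names come from this PR summary, but the struct name and the word-count helper are assumptions, not the actual implementation.

```rust
/// Illustrative shape of the web_fetch structured output.
/// Field names match the PR summary; the struct name and the
/// `new` helper are assumptions, not the PR's actual code.
#[derive(Debug)]
struct WebFetchOutput {
    url: String,
    final_url: String, // equals `url` when no redirects occurred
    status: u16,
    title: Option<String>,
    content: String, // Markdown after Readability conversion
    word_count: usize,
}

impl WebFetchOutput {
    fn new(url: &str, final_url: &str, status: u16, title: Option<String>, content: String) -> Self {
        // Derive word_count from the converted content.
        let word_count = content.split_whitespace().count();
        Self {
            url: url.to_string(),
            final_url: final_url.to_string(),
            status,
            title,
            content,
            word_count,
        }
    }
}

fn main() {
    let out = WebFetchOutput::new(
        "https://example.com",
        "https://example.com/docs", // after one redirect
        200,
        Some("Example Docs".to_string()),
        "Hello world from the docs.".to_string(),
    );
    assert_eq!(out.word_count, 5);
    println!("{:?}", out);
}
```

Including `status` alongside the content is what lets the agent distinguish a 404 error page from a 200 success (see the commit notes below).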

Test plan

Verified fetching websites that previously failed over HTTP GET because bot detection rejected the default reqwest User-Agent. The Chrome UA plus Accept header fixes real-world compatibility.

🤖 Generated with Claude Code

henrypark133 and others added 5 commits February 28, 2026 18:16
- New web_fetch Rust built-in tool (GET-only, auto-approved, structured
  output: url/title/content/word_count) with HTML to Markdown via Readability
- Full SSRF protection: HTTPS-only, no private IPs, DNS rebinding defence,
  outbound/inbound leak scanning, 5 MB cap, no redirect following
- Rate limited: 30 req/min, 500/hr (same as http tool)
- Protected tool name; registered in register_builtin_tools()
- validate_url made pub(crate) so web_fetch can reuse it from http.rs
- New skills/web-fetch/SKILL.md for agent guidance on web browsing
- Fixes unicode panic in extract_title: use to_ascii_lowercase not
  to_lowercase to preserve byte offsets when indexing original string

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
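
The unicode fix in the last bullet can be demonstrated with a minimal, self-contained sketch (not the PR's `extract_title` code): full Unicode lowercasing can change a string's byte length, so an offset found in the lowercased copy may not be a valid index into the original, while ASCII-only lowercasing rewrites bytes in place and keeps offsets stable.

```rust
// Why `to_lowercase` breaks byte-offset indexing while
// `to_ascii_lowercase` does not: non-ASCII case mappings can
// change the UTF-8 byte length of the string.
fn main() {
    // U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is 2 bytes in
    // UTF-8 but lowercases to "i" + U+0307, which is 3 bytes.
    let original = "<HTML><TİTLE>x</TİTLE>";
    let unicode_lower = original.to_lowercase();
    let ascii_lower = original.to_ascii_lowercase();

    // Byte lengths diverge after full Unicode lowercasing...
    assert_ne!(original.len(), unicode_lower.len());
    // ...but are preserved by ASCII-only lowercasing, so a byte
    // offset found in `ascii_lower` is valid in `original`.
    assert_eq!(original.len(), ascii_lower.len());

    // Searching the ASCII-lowered copy yields an offset that is
    // safe to use for slicing the original string.
    let pos = ascii_lower.find("<html>").unwrap();
    assert_eq!(&original[pos..pos + 6], "<HTML>");
}
```

Slicing the original string at an offset taken from a `to_lowercase` copy can land mid-character and panic, which is the class of bug the commit fixes.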
The web_fetch tool's schema description already tells the LLM when and
how to use it. A SKILL.md would only add redundant prompt context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The LLM had no way to distinguish a 404 error page from a 200 success.
Including status in the structured output (alongside url/title/content/
word_count) lets the agent report failures correctly and matches the
behaviour of the http tool which always returns status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Set a Chrome-like User-Agent so sites that block the default reqwest
  string return real content instead of bot-rejection pages.
- Add Accept: text/markdown, text/html header (mirrors OpenClaw).
- Follow up to 3 redirects manually instead of blocking all 3xx.
  Every Location URL is run through validate_url() before the next
  request is sent, so SSRF protection applies to every hop identically
  to how it applies to the original URL.
- Resolve relative Location values against the current URL before
  SSRF-validating them.
- Log each followed hop at DEBUG level.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
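
The "resolve relative Location values" step above can be sketched as follows. This is a simplified, std-only illustration; the actual implementation presumably uses proper URL types (e.g. `Url::join`), and the resolved value must still pass `validate_url()` before the next hop is requested.

```rust
// Simplified sketch of resolving a `Location` header against the
// current URL before SSRF-validating it. Handles only the three
// common cases; a real implementation would use full RFC 3986
// reference resolution via a URL library.
fn resolve_location(current: &str, location: &str) -> String {
    if location.starts_with("http://") || location.starts_with("https://") {
        // Absolute URL: use as-is (still must pass validate_url()).
        location.to_string()
    } else if location.starts_with('/') {
        // Absolute path: keep the origin, replace the whole path.
        let scheme_end = current.find("://").map(|i| i + 3).unwrap_or(0);
        let origin_end = current[scheme_end..]
            .find('/')
            .map(|i| scheme_end + i)
            .unwrap_or(current.len());
        format!("{}{}", &current[..origin_end], location)
    } else {
        // Relative path: replace the last path segment.
        let base = current.rsplit_once('/').map(|(b, _)| b).unwrap_or(current);
        format!("{}/{}", base, location)
    }
}

fn main() {
    assert_eq!(
        resolve_location("https://a.example/docs/page", "/new"),
        "https://a.example/new"
    );
    assert_eq!(
        resolve_location("https://a.example/docs/page", "other"),
        "https://a.example/docs/other"
    );
    assert_eq!(
        resolve_location("https://a.example/docs", "https://b.example/x"),
        "https://b.example/x"
    );
}
```

Running every resolved URL through `validate_url()` before the next request is what makes each redirect hop as safe as the original request.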
When redirects are followed, the original `url` field no longer
reflects where the content actually came from. Add `final_url` so
the LLM can cite the canonical source correctly. Equals `url` when
no redirects occurred.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions Bot added labels on Mar 1, 2026: scope: tool (Tool infrastructure), scope: tool/builtin (Built-in tools), size: L (200-499 changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: experienced (6-19 merged PRs)
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized web_fetch tool designed to enhance the agent's ability to safely and effectively retrieve and process web content. It provides a more robust and user-friendly alternative to the general http tool for common web browsing tasks, focusing on security, content transformation, and compatibility.

Highlights

  • New web_fetch Tool: Added a new web_fetch built-in tool, distinct from the generic http tool, specifically designed for reading web pages and documentation.
  • Enhanced Web Content Retrieval: Implemented web_fetch as GET-only, auto-approved, returning structured output including url, final_url, status, title, content, and word_count.
  • HTML to Markdown Conversion: Integrated HTML to Markdown conversion via the existing Readability pipeline to provide clean, readable content.
  • Improved Compatibility and Security: Configured a Chrome-like User-Agent and Accept header for better site compatibility, and implemented manual redirect following (up to 3 redirects) with SSRF validation for each hop. The tool reuses existing security infrastructure like HTTPS-only, no private IPs, DNS rebinding defense, LeakDetector, and a 5 MB response cap.
  • Tool Protection: Added web_fetch to PROTECTED_TOOL_NAMES to prevent shadowing by WASM/MCP tools.
Changelog
  • src/tools/builtin/http.rs
    • Changed the visibility of the validate_url function to pub(crate) to allow its use by other modules within the builtin crate.
  • src/tools/builtin/mod.rs
    • Added the web_fetch module.
    • Exported WebFetchTool for external use.
  • src/tools/builtin/web_fetch.rs
    • Added a new module implementing the WebFetchTool.
    • Defined the tool's functionality, including URL validation, manual redirect handling with SSRF checks, HTTP request configuration (User-Agent, Accept header), response body streaming with size limits, HTML title extraction, and HTML-to-Markdown conversion.
    • Included unit tests for title extraction and tool properties.
  • src/tools/registry.rs
    • Imported WebFetchTool.
    • Added "web_fetch" to the PROTECTED_TOOL_NAMES constant.
    • Registered a new instance of WebFetchTool in the ToolRegistry.
Activity
  • No human activity (comments, reviews, etc.) has been recorded on this pull request yet.
Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new web_fetch built-in tool, designed specifically for fetching and parsing web page content. The implementation is robust, with strong security considerations like manual, validated redirects to prevent SSRF attacks. My review focuses on a performance improvement and a code simplification to enhance maintainability. Overall, this is a great addition.

Comment on lines +62 to +80
pub struct WebFetchTool {
client: Client,
}

impl WebFetchTool {
/// Create a new `WebFetchTool` with a Chrome-like UA and no auto-redirects.
///
/// Redirects are followed manually (up to [`MAX_REDIRECTS`] hops) so that
/// each `Location` URL is SSRF-validated before the next request is sent.
pub fn new() -> Self {
let client = Client::builder()
.timeout(Duration::from_secs(30))
.redirect(reqwest::redirect::Policy::none())
.user_agent(USER_AGENT)
.build()
.expect("Failed to create HTTP client for web_fetch");

Self { client }
}
Contributor


high

For performance, the LeakDetector should be initialized once when WebFetchTool is created, rather than on every execute call. LeakDetector::new() is an expensive operation as it compiles multiple regular expressions.

Store an instance of LeakDetector in the WebFetchTool struct and initialize it in new(). You will also need to update its usage in the execute method.

pub struct WebFetchTool {
    client: Client,
    leak_detector: LeakDetector,
}

impl WebFetchTool {
    /// Create a new `WebFetchTool` with a Chrome-like UA and no auto-redirects.
    ///
    /// Redirects are followed manually (up to [`MAX_REDIRECTS`] hops) so that
    /// each `Location` URL is SSRF-validated before the next request is sent.
    pub fn new() -> Self {
        let client = Client::builder()
            .timeout(Duration::from_secs(30))
            .redirect(reqwest::redirect::Policy::none())
            .user_agent(USER_AGENT)
            .build()
            .expect("Failed to create HTTP client for web_fetch");

        Self { client, leak_detector: LeakDetector::new() }
    }

Collaborator Author


Fixed in b34d859: LeakDetector is now stored in the WebFetchTool struct and initialized once in new().

Comment thread src/tools/builtin/web_fetch.rs Outdated
Comment on lines +131 to +134
let detector = LeakDetector::new();
detector
.scan_http_request(current_url.as_str(), &[], None)
.map_err(|e| ToolError::NotAuthorized(e.to_string()))?;
Contributor


high

Following the change to store LeakDetector in the struct, remove the local instantiation here and use self.leak_detector. Note that you will need to update the other usage of detector in this function as well (at line 195).

        self.leak_detector
            .scan_http_request(current_url.as_str(), &[], None)
            .map_err(|e| ToolError::NotAuthorized(e.to_string()))?;

Collaborator Author


Fixed in b34d859 — removed the local detector binding and switched both usages (initial scan + redirect re-validation loop) to self.leak_detector.

Comment thread src/tools/builtin/web_fetch.rs Outdated
Comment on lines +255 to +278
#[cfg(feature = "html-to-markdown")]
let (content, title) = if is_html {
let title = extract_title(&raw_text);
match convert_html_to_markdown(&raw_text, current_url.as_str()) {
Ok(md) => (md, title),
Err(e) => {
tracing::warn!(
url = %current_url,
error = %e,
"HTML-to-markdown conversion failed, returning raw text"
);
(raw_text.clone(), title)
}
}
} else {
(raw_text.clone(), None)
};

#[cfg(not(feature = "html-to-markdown"))]
let (content, title) = if is_html {
(raw_text.clone(), extract_title(&raw_text))
} else {
(raw_text.clone(), None)
};
Contributor


medium

The logic for handling HTML content can be simplified to reduce code duplication between the cfg blocks. You can handle the is_html check once, and then use cfg attributes internally to determine the content.

        let (content, title) = if is_html {
            let title = extract_title(&raw_text);

            #[cfg(feature = "html-to-markdown")]
            let content = match convert_html_to_markdown(&raw_text, current_url.as_str()) {
                Ok(md) => md,
                Err(e) => {
                    tracing::warn!(
                        url = %current_url,
                        error = %e,
                        "HTML-to-markdown conversion failed, returning raw text"
                    );
                    raw_text.clone()
                }
            };

            #[cfg(not(feature = "html-to-markdown"))]
            let content = raw_text.clone();

            (content, title)
        } else {
            (raw_text.clone(), None)
        };

Collaborator Author


Fixed in b34d859 — collapsed the two duplicate #[cfg] blocks into a single if is_html with internal cfg attributes, as suggested.

- Store LeakDetector in WebFetchTool struct (init once in new(), not per execute() call)
- Use self.leak_detector for both outbound scan and redirect re-validation
- Simplify HTML/cfg blocks to reduce duplication (gemini-code-assist suggestion)
- Fix pub use ordering in mod.rs (cargo fmt)
- Add web_fetch to core_registration_covers_expected_tools snapshot test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@nickpismenkov left a comment


LGTM

@henrypark133 henrypark133 merged commit 8751a5a into main Mar 1, 2026
13 checks passed
@henrypark133 henrypark133 deleted the worktree-web-fetch branch March 1, 2026 03:58
@github-actions github-actions Bot mentioned this pull request Mar 1, 2026
zmanian pushed a commit to zmanian/ironclaw that referenced this pull request Mar 1, 2026
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026
