
feat: add web_fetch built-in tool#435

Merged
henrypark133 merged 6 commits into main from worktree-web-fetch on Mar 1, 2026

Conversation

@henrypark133
Collaborator

Summary

  • Adds a new web_fetch built-in tool distinct from the generic http tool
  • GET-only, auto-approved (ApprovalRequirement::Never), purpose-built for reading web pages and documentation
  • Returns structured output: {url, final_url, status, title, content, word_count}
  • HTML → Markdown conversion via the existing Readability pipeline (html-to-markdown feature)
  • Chrome-like User-Agent and Accept: text/markdown, text/html header so sites that block the default reqwest User-Agent string return real content
  • Follows up to 3 redirects — each Location URL is SSRF-validated through validate_url() before the next request is sent, so SSRF protection applies to every hop identically to the original request
  • final_url in the output reflects the canonical URL after any redirects (equals url when none occurred)
  • Reuses existing security infrastructure: validate_url() (HTTPS-only, no private IPs, DNS rebinding defence), LeakDetector, 5 MB response cap, inbound sanitization
  • web_fetch added to PROTECTED_TOOL_NAMES to prevent shadowing by WASM/MCP tools
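
The structured output listed above can be sketched as a Rust struct. This is illustrative only: the field names come from this PR summary, but the struct name and the word-count helper are assumptions, not the actual implementation.

```rust
/// Illustrative shape of the web_fetch structured output.
/// Field names match the PR summary; the struct name and the
/// `new` helper are assumptions, not the PR's actual code.
#[derive(Debug)]
struct WebFetchOutput {
    url: String,
    final_url: String, // equals `url` when no redirects occurred
    status: u16,
    title: Option<String>,
    content: String, // Markdown after Readability conversion
    word_count: usize,
}

impl WebFetchOutput {
    fn new(url: &str, final_url: &str, status: u16, title: Option<String>, content: String) -> Self {
        // Derive word_count from the converted content.
        let word_count = content.split_whitespace().count();
        Self {
            url: url.to_string(),
            final_url: final_url.to_string(),
            status,
            title,
            content,
            word_count,
        }
    }
}

fn main() {
    let out = WebFetchOutput::new(
        "https://example.com",
        "https://example.com/docs", // after one redirect
        200,
        Some("Example Docs".to_string()),
        "Hello world from the docs.".to_string(),
    );
    assert_eq!(out.word_count, 5);
    println!("{:?}", out);
}
```

Including `status` alongside the content is what lets the agent distinguish a 404 error page from a 200 success (see the commit notes below).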

Test plan

Verified fetching websites that previously failed over HTTP GET because bot detection rejected the default reqwest User-Agent. The Chrome UA plus Accept header fixes real-world compatibility.

🤖 Generated with Claude Code

henrypark133 and others added 5 commits February 28, 2026 18:16
- New web_fetch Rust built-in tool (GET-only, auto-approved, structured
  output: url/title/content/word_count) with HTML to Markdown via Readability
- Full SSRF protection: HTTPS-only, no private IPs, DNS rebinding defence,
  outbound/inbound leak scanning, 5 MB cap, no redirect following
- Rate limited: 30 req/min, 500/hr (same as http tool)
- Protected tool name; registered in register_builtin_tools()
- validate_url made pub(crate) so web_fetch can reuse it from http.rs
- New skills/web-fetch/SKILL.md for agent guidance on web browsing
- Fixes unicode panic in extract_title: use to_ascii_lowercase not
  to_lowercase to preserve byte offsets when indexing original string

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
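
The unicode fix in the last bullet can be demonstrated with a minimal, self-contained sketch (not the PR's `extract_title` code): full Unicode lowercasing can change a string's byte length, so an offset found in the lowercased copy may not be a valid index into the original, while ASCII-only lowercasing rewrites bytes in place and keeps offsets stable.

```rust
// Why `to_lowercase` breaks byte-offset indexing while
// `to_ascii_lowercase` does not: non-ASCII case mappings can
// change the UTF-8 byte length of the string.
fn main() {
    // U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is 2 bytes in
    // UTF-8 but lowercases to "i" + U+0307, which is 3 bytes.
    let original = "<HTML><TİTLE>x</TİTLE>";
    let unicode_lower = original.to_lowercase();
    let ascii_lower = original.to_ascii_lowercase();

    // Byte lengths diverge after full Unicode lowercasing...
    assert_ne!(original.len(), unicode_lower.len());
    // ...but are preserved by ASCII-only lowercasing, so a byte
    // offset found in `ascii_lower` is valid in `original`.
    assert_eq!(original.len(), ascii_lower.len());

    // Searching the ASCII-lowered copy yields an offset that is
    // safe to use for slicing the original string.
    let pos = ascii_lower.find("<html>").unwrap();
    assert_eq!(&original[pos..pos + 6], "<HTML>");
}
```

Slicing the original string at an offset taken from a `to_lowercase` copy can land mid-character and panic, which is the class of bug the commit fixes.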
The web_fetch tool's schema description already tells the LLM when and
how to use it. A SKILL.md would only add redundant prompt context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The LLM had no way to distinguish a 404 error page from a 200 success.
Including status in the structured output (alongside url/title/content/
word_count) lets the agent report failures correctly and matches the
behaviour of the http tool which always returns status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Set a Chrome-like User-Agent so sites that block the default reqwest
  string return real content instead of bot-rejection pages.
- Add Accept: text/markdown, text/html header (mirrors OpenClaw).
- Follow up to 3 redirects manually instead of blocking all 3xx.
  Every Location URL is run through validate_url() before the next
  request is sent, so SSRF protection applies to every hop identically
  to how it applies to the original URL.
- Resolve relative Location values against the current URL before
  SSRF-validating them.
- Log each followed hop at DEBUG level.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
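
The "resolve relative Location values" step above can be sketched as follows. This is a simplified, std-only illustration; the actual implementation presumably uses proper URL types (e.g. `Url::join`), and the resolved value must still pass `validate_url()` before the next hop is requested.

```rust
// Simplified sketch of resolving a `Location` header against the
// current URL before SSRF-validating it. Handles only the three
// common cases; a real implementation would use full RFC 3986
// reference resolution via a URL library.
fn resolve_location(current: &str, location: &str) -> String {
    if location.starts_with("http://") || location.starts_with("https://") {
        // Absolute URL: use as-is (still must pass validate_url()).
        location.to_string()
    } else if location.starts_with('/') {
        // Absolute path: keep the origin, replace the whole path.
        let scheme_end = current.find("://").map(|i| i + 3).unwrap_or(0);
        let origin_end = current[scheme_end..]
            .find('/')
            .map(|i| scheme_end + i)
            .unwrap_or(current.len());
        format!("{}{}", &current[..origin_end], location)
    } else {
        // Relative path: replace the last path segment.
        let base = current.rsplit_once('/').map(|(b, _)| b).unwrap_or(current);
        format!("{}/{}", base, location)
    }
}

fn main() {
    assert_eq!(
        resolve_location("https://a.example/docs/page", "/new"),
        "https://a.example/new"
    );
    assert_eq!(
        resolve_location("https://a.example/docs/page", "other"),
        "https://a.example/docs/other"
    );
    assert_eq!(
        resolve_location("https://a.example/docs", "https://b.example/x"),
        "https://b.example/x"
    );
}
```

Running every resolved URL through `validate_url()` before the next request is what makes each redirect hop as safe as the original request.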
When redirects are followed, the original `url` field no longer
reflects where the content actually came from. Add `final_url` so
the LLM can cite the canonical source correctly. Equals `url` when
no redirects occurred.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions Bot added labels on Mar 1, 2026: scope: tool (Tool infrastructure), scope: tool/builtin (Built-in tools), size: L (200-499 changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: experienced (6-19 merged PRs)
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized web_fetch tool designed to enhance the agent's ability to safely and effectively retrieve and process web content. It provides a more robust and user-friendly alternative to the general http tool for common web browsing tasks, focusing on security, content transformation, and compatibility.

Highlights

  • New web_fetch Tool: Added a new web_fetch built-in tool, distinct from the generic http tool, specifically designed for reading web pages and documentation.
  • Enhanced Web Content Retrieval: Implemented web_fetch as GET-only, auto-approved, returning structured output including url, final_url, status, title, content, and word_count.
  • HTML to Markdown Conversion: Integrated HTML to Markdown conversion via the existing Readability pipeline to provide clean, readable content.
  • Improved Compatibility and Security: Configured a Chrome-like User-Agent and Accept header for better site compatibility, and implemented manual redirect following (up to 3 redirects) with SSRF validation for each hop. The tool reuses existing security infrastructure like HTTPS-only, no private IPs, DNS rebinding defense, LeakDetector, and a 5 MB response cap.
  • Tool Protection: Added web_fetch to PROTECTED_TOOL_NAMES to prevent shadowing by WASM/MCP tools.
Changelog
  • src/tools/builtin/http.rs
    • Changed the visibility of the validate_url function to pub(crate) to allow its use by other modules within the builtin crate.
  • src/tools/builtin/mod.rs
    • Added the web_fetch module.
    • Exported WebFetchTool for external use.
  • src/tools/builtin/web_fetch.rs
    • Added a new module implementing the WebFetchTool.
    • Defined the tool's functionality, including URL validation, manual redirect handling with SSRF checks, HTTP request configuration (User-Agent, Accept header), response body streaming with size limits, HTML title extraction, and HTML-to-Markdown conversion.
    • Included unit tests for title extraction and tool properties.
  • src/tools/registry.rs
    • Imported WebFetchTool.
    • Added "web_fetch" to the PROTECTED_TOOL_NAMES constant.
    • Registered a new instance of WebFetchTool in the ToolRegistry.
Activity
  • No human activity (comments, reviews, etc.) has been recorded on this pull request yet.
Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new web_fetch built-in tool, designed specifically for fetching and parsing web page content. The implementation is robust, with strong security considerations like manual, validated redirects to prevent SSRF attacks. My review focuses on a performance improvement and a code simplification to enhance maintainability. Overall, this is a great addition.

Comment on lines +62 to +80
pub struct WebFetchTool {
client: Client,
}

impl WebFetchTool {
/// Create a new `WebFetchTool` with a Chrome-like UA and no auto-redirects.
///
/// Redirects are followed manually (up to [`MAX_REDIRECTS`] hops) so that
/// each `Location` URL is SSRF-validated before the next request is sent.
pub fn new() -> Self {
let client = Client::builder()
.timeout(Duration::from_secs(30))
.redirect(reqwest::redirect::Policy::none())
.user_agent(USER_AGENT)
.build()
.expect("Failed to create HTTP client for web_fetch");

Self { client }
}
Contributor


high

For performance, the LeakDetector should be initialized once when WebFetchTool is created, rather than on every execute call. LeakDetector::new() is an expensive operation as it compiles multiple regular expressions.

Store an instance of LeakDetector in the WebFetchTool struct and initialize it in new(). You will also need to update its usage in the execute method.

pub struct WebFetchTool {
    client: Client,
    leak_detector: LeakDetector,
}

impl WebFetchTool {
    /// Create a new `WebFetchTool` with a Chrome-like UA and no auto-redirects.
    ///
    /// Redirects are followed manually (up to [`MAX_REDIRECTS`] hops) so that
    /// each `Location` URL is SSRF-validated before the next request is sent.
    pub fn new() -> Self {
        let client = Client::builder()
            .timeout(Duration::from_secs(30))
            .redirect(reqwest::redirect::Policy::none())
            .user_agent(USER_AGENT)
            .build()
            .expect("Failed to create HTTP client for web_fetch");

        Self { client, leak_detector: LeakDetector::new() }
    }

Collaborator Author


Fixed in b34d859: LeakDetector is now stored in the WebFetchTool struct and initialized once in new().

Comment thread src/tools/builtin/web_fetch.rs Outdated
Comment on lines +131 to +134
let detector = LeakDetector::new();
detector
.scan_http_request(current_url.as_str(), &[], None)
.map_err(|e| ToolError::NotAuthorized(e.to_string()))?;
Contributor


high

Following the change to store LeakDetector in the struct, remove the local instantiation here and use self.leak_detector. Note that you will need to update the other usage of detector in this function as well (at line 195).

        self.leak_detector
            .scan_http_request(current_url.as_str(), &[], None)
            .map_err(|e| ToolError::NotAuthorized(e.to_string()))?;

Collaborator Author


Fixed in b34d859 — removed the local detector binding and switched both usages (initial scan + redirect re-validation loop) to self.leak_detector.

Comment thread src/tools/builtin/web_fetch.rs Outdated
Comment on lines +255 to +278
#[cfg(feature = "html-to-markdown")]
let (content, title) = if is_html {
let title = extract_title(&raw_text);
match convert_html_to_markdown(&raw_text, current_url.as_str()) {
Ok(md) => (md, title),
Err(e) => {
tracing::warn!(
url = %current_url,
error = %e,
"HTML-to-markdown conversion failed, returning raw text"
);
(raw_text.clone(), title)
}
}
} else {
(raw_text.clone(), None)
};

#[cfg(not(feature = "html-to-markdown"))]
let (content, title) = if is_html {
(raw_text.clone(), extract_title(&raw_text))
} else {
(raw_text.clone(), None)
};
Contributor


medium

The logic for handling HTML content can be simplified to reduce code duplication between the cfg blocks. You can handle the is_html check once, and then use cfg attributes internally to determine the content.

        let (content, title) = if is_html {
            let title = extract_title(&raw_text);

            #[cfg(feature = "html-to-markdown")]
            let content = match convert_html_to_markdown(&raw_text, current_url.as_str()) {
                Ok(md) => md,
                Err(e) => {
                    tracing::warn!(
                        url = %current_url,
                        error = %e,
                        "HTML-to-markdown conversion failed, returning raw text"
                    );
                    raw_text.clone()
                }
            };

            #[cfg(not(feature = "html-to-markdown"))]
            let content = raw_text.clone();

            (content, title)
        } else {
            (raw_text.clone(), None)
        };

Collaborator Author


Fixed in b34d859 — collapsed the two duplicate #[cfg] blocks into a single if is_html with internal cfg attributes, as suggested.

- Store LeakDetector in WebFetchTool struct (init once in new(), not per execute() call)
- Use self.leak_detector for both outbound scan and redirect re-validation
- Simplify HTML/cfg blocks to reduce duplication (gemini-code-assist suggestion)
- Fix pub use ordering in mod.rs (cargo fmt)
- Add web_fetch to core_registration_covers_expected_tools snapshot test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@nickpismenkov left a comment


LGTM

@henrypark133 henrypark133 merged commit 8751a5a into main Mar 1, 2026
13 checks passed
@henrypark133 henrypark133 deleted the worktree-web-fetch branch March 1, 2026 03:58
@github-actions github-actions Bot mentioned this pull request Mar 1, 2026
zmanian pushed a commit to zmanian/ironclaw that referenced this pull request Mar 1, 2026
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026
