Conversation
- New web_fetch Rust built-in tool (GET-only, auto-approved, structured output: url/title/content/word_count) with HTML-to-Markdown conversion via Readability
- Full SSRF protection: HTTPS-only, no private IPs, DNS rebinding defence, outbound/inbound leak scanning, 5 MB cap, no redirect following
- Rate limited: 30 req/min, 500/hr (same as the http tool)
- Protected tool name; registered in register_builtin_tools()
- validate_url made pub(crate) so web_fetch can reuse it from http.rs
- New skills/web-fetch/SKILL.md for agent guidance on web browsing
- Fixes a unicode panic in extract_title: use to_ascii_lowercase instead of to_lowercase to preserve byte offsets when indexing the original string

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The web_fetch tool's schema description already tells the LLM when and how to use it. A SKILL.md would only add redundant prompt context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The LLM had no way to distinguish a 404 error page from a 200 success. Including status in the structured output (alongside url/title/content/word_count) lets the agent report failures correctly and matches the behaviour of the http tool, which always returns status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Set a Chrome-like User-Agent so sites that block the default reqwest string return real content instead of bot-rejection pages.
- Add an Accept: text/markdown, text/html header (mirrors OpenClaw).
- Follow up to 3 redirects manually instead of blocking all 3xx. Every Location URL is run through validate_url() before the next request is sent, so SSRF protection applies to every hop identically to how it applies to the original URL.
- Resolve relative Location values against the current URL before SSRF-validating them.
- Log each followed hop at DEBUG level.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When redirects are followed, the original `url` field no longer reflects where the content actually came from. Add `final_url` so the LLM can cite the canonical source correctly. Equals `url` when no redirects occurred.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a new web_fetch built-in tool, designed specifically for fetching and parsing web page content. The implementation is robust, with strong security considerations like manual, validated redirects to prevent SSRF attacks. My review focuses on a performance improvement and a code simplification to enhance maintainability. Overall, this is a great addition.
```rust
pub struct WebFetchTool {
    client: Client,
}

impl WebFetchTool {
    /// Create a new `WebFetchTool` with a Chrome-like UA and no auto-redirects.
    ///
    /// Redirects are followed manually (up to [`MAX_REDIRECTS`] hops) so that
    /// each `Location` URL is SSRF-validated before the next request is sent.
    pub fn new() -> Self {
        let client = Client::builder()
            .timeout(Duration::from_secs(30))
            .redirect(reqwest::redirect::Policy::none())
            .user_agent(USER_AGENT)
            .build()
            .expect("Failed to create HTTP client for web_fetch");

        Self { client }
    }
}
```
For performance, the LeakDetector should be initialized once when WebFetchTool is created, rather than on every execute call. LeakDetector::new() is an expensive operation as it compiles multiple regular expressions.
Store an instance of LeakDetector in the WebFetchTool struct and initialize it in new(). You will also need to update its usage in the execute method.
```rust
pub struct WebFetchTool {
    client: Client,
    leak_detector: LeakDetector,
}

impl WebFetchTool {
    /// Create a new `WebFetchTool` with a Chrome-like UA and no auto-redirects.
    ///
    /// Redirects are followed manually (up to [`MAX_REDIRECTS`] hops) so that
    /// each `Location` URL is SSRF-validated before the next request is sent.
    pub fn new() -> Self {
        let client = Client::builder()
            .timeout(Duration::from_secs(30))
            .redirect(reqwest::redirect::Policy::none())
            .user_agent(USER_AGENT)
            .build()
            .expect("Failed to create HTTP client for web_fetch");

        Self { client, leak_detector: LeakDetector::new() }
    }
}
```
Fixed in b34d859 — LeakDetector is now stored in the WebFetchTool struct and initialized once in new().
```rust
let detector = LeakDetector::new();
detector
    .scan_http_request(current_url.as_str(), &[], None)
    .map_err(|e| ToolError::NotAuthorized(e.to_string()))?;
```
Following the change to store LeakDetector in the struct, remove the local instantiation here and use self.leak_detector. Note that you will need to update the other usage of detector in this function as well (at line 195).
```rust
self.leak_detector
    .scan_http_request(current_url.as_str(), &[], None)
    .map_err(|e| ToolError::NotAuthorized(e.to_string()))?;
```
Fixed in b34d859 — removed the local detector binding and switched both usages (initial scan + redirect re-validation loop) to self.leak_detector.
```rust
#[cfg(feature = "html-to-markdown")]
let (content, title) = if is_html {
    let title = extract_title(&raw_text);
    match convert_html_to_markdown(&raw_text, current_url.as_str()) {
        Ok(md) => (md, title),
        Err(e) => {
            tracing::warn!(
                url = %current_url,
                error = %e,
                "HTML-to-markdown conversion failed, returning raw text"
            );
            (raw_text.clone(), title)
        }
    }
} else {
    (raw_text.clone(), None)
};

#[cfg(not(feature = "html-to-markdown"))]
let (content, title) = if is_html {
    (raw_text.clone(), extract_title(&raw_text))
} else {
    (raw_text.clone(), None)
};
```
The logic for handling HTML content can be simplified to reduce code duplication between the cfg blocks. You can handle the is_html check once, and then use cfg attributes internally to determine the content.
```rust
let (content, title) = if is_html {
    let title = extract_title(&raw_text);

    #[cfg(feature = "html-to-markdown")]
    let content = match convert_html_to_markdown(&raw_text, current_url.as_str()) {
        Ok(md) => md,
        Err(e) => {
            tracing::warn!(
                url = %current_url,
                error = %e,
                "HTML-to-markdown conversion failed, returning raw text"
            );
            raw_text.clone()
        }
    };

    #[cfg(not(feature = "html-to-markdown"))]
    let content = raw_text.clone();

    (content, title)
} else {
    (raw_text.clone(), None)
};
```
Fixed in b34d859 — collapsed the two duplicate #[cfg] blocks into a single if is_html with internal cfg attributes, as suggested.
- Store LeakDetector in the WebFetchTool struct (init once in new(), not per execute() call)
- Use self.leak_detector for both the outbound scan and redirect re-validation
- Simplify the HTML/cfg blocks to reduce duplication (gemini-code-assist suggestion)
- Fix pub use ordering in mod.rs (cargo fmt)
- Add web_fetch to the core_registration_covers_expected_tools snapshot test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add web_fetch built-in tool and web-fetch skill

  - New web_fetch Rust built-in tool (GET-only, auto-approved, structured output: url/title/content/word_count) with HTML-to-Markdown conversion via Readability
  - Full SSRF protection: HTTPS-only, no private IPs, DNS rebinding defence, outbound/inbound leak scanning, 5 MB cap, no redirect following
  - Rate limited: 30 req/min, 500/hr (same as the http tool)
  - Protected tool name; registered in register_builtin_tools()
  - validate_url made pub(crate) so web_fetch can reuse it from http.rs
  - New skills/web-fetch/SKILL.md for agent guidance on web browsing
  - Fixes a unicode panic in extract_title: use to_ascii_lowercase instead of to_lowercase to preserve byte offsets when indexing the original string

* chore: remove web-fetch skill (tool description is self-sufficient)

  The web_fetch tool's schema description already tells the LLM when and how to use it. A SKILL.md would only add redundant prompt context.

* fix: include HTTP status in web_fetch output

  The LLM had no way to distinguish a 404 error page from a 200 success. Including status in the structured output (alongside url/title/content/word_count) lets the agent report failures correctly and matches the behaviour of the http tool, which always returns status.

* feat(web_fetch): add Chrome UA and safe redirect following

  - Set a Chrome-like User-Agent so sites that block the default reqwest string return real content instead of bot-rejection pages.
  - Add an Accept: text/markdown, text/html header (mirrors OpenClaw).
  - Follow up to 3 redirects manually instead of blocking all 3xx. Every Location URL is run through validate_url() before the next request is sent, so SSRF protection applies to every hop identically to how it applies to the original URL.
  - Resolve relative Location values against the current URL before SSRF-validating them.
  - Log each followed hop at DEBUG level.

* fix(web_fetch): expose final_url after redirect following

  When redirects are followed, the original `url` field no longer reflects where the content actually came from. Add `final_url` so the LLM can cite the canonical source correctly. Equals `url` when no redirects occurred.

* fix(web_fetch): address review comments and fix CI failures

  - Store LeakDetector in the WebFetchTool struct (init once in new(), not per execute() call)
  - Use self.leak_detector for both the outbound scan and redirect re-validation
  - Simplify the HTML/cfg blocks to reduce duplication (gemini-code-assist suggestion)
  - Fix pub use ordering in mod.rs (cargo fmt)
  - Add web_fetch to the core_registration_covers_expected_tools snapshot test

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- New `web_fetch` built-in tool distinct from the generic `http` tool
- Auto-approved (`ApprovalRequirement::Never`), purpose-built for reading web pages and documentation
- Structured output: `{url, final_url, status, title, content, word_count}`
- HTML-to-Markdown conversion via Readability (`html-to-markdown` feature)
- Chrome-like User-Agent and `Accept: text/markdown, text/html` header so sites that block default `reqwest` strings return real content
- Redirects followed manually: every `Location` URL is SSRF-validated through `validate_url()` before the next request is sent, so SSRF protection applies to every hop identically to the original request
- `final_url` in the output reflects the canonical URL after any redirects (equals `url` when none occurred)
- Security: `validate_url()` (HTTPS-only, no private IPs, DNS rebinding defence), `LeakDetector` scanning, 5 MB response cap, inbound sanitization
- `web_fetch` added to `PROTECTED_TOOL_NAMES` to prevent shadowing by WASM/MCP tools

Test plan
Was able to fetch websites that were previously failing with HTTP GET due to bot-detection on the default reqwest User-Agent. The Chrome UA + Accept header fixes real-world compatibility.

🤖 Generated with Claude Code