Conversation
- Implement fetchWebContent function to scrape HTTP(S) link content
- Support automatic identification and parsing of Markdown files and regular web pages
- Integrate Cheerio library for HTML content extraction and cleaning
- Add content length limit and truncation functionality
- Support proxy configuration and HTTPS proxy
- Add web page title and metadata extraction
- Implement content fallback mechanism for SPA pages
- Register the fetchWebContent tool in tool settings
- Update README to describe the new features
- Add unit tests and integration tests
- Configure test scripts for content-fetching validation
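The Markdown detection and truncation behavior listed above might look roughly like the sketch below. The helper names (`isMarkdownUrl`, `truncateContent`), the default limit, and the truncation marker are illustrative assumptions, not the PR's actual code:

```typescript
// Assumed default character limit for fetched content (illustrative value).
const DEFAULT_MAX_CHARS = 8000;

// Identify Markdown files by their URL path extension.
function isMarkdownUrl(url: string): boolean {
  const pathname = new URL(url).pathname.toLowerCase();
  return pathname.endsWith('.md') || pathname.endsWith('.markdown');
}

// Cap extracted text at maxChars and flag that truncation occurred.
function truncateContent(text: string, maxChars: number = DEFAULT_MAX_CHARS): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + '\n\n[Content truncated]';
}
```

Keeping these as small pure functions makes the length-limit behavior easy to cover in the unit tests mentioned below.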
- Add normalizeEngineName function to handle different client representations of engine names
- Standardize engine names such as Bing, DuckDuckGo, and linux.do
- Integrate engine-name conversion logic into the Zod validation schema
- Add dedicated engine-normalization test files and test cases
- Add new test command test:engine-normalization in package.json
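A minimal sketch of what `normalizeEngineName` could look like; the alias table below is illustrative and is not the PR's actual mapping:

```typescript
// Hypothetical alias table: maps client spellings to canonical engine names.
const ENGINE_ALIASES: Record<string, string> = {
  bing: 'bing',
  duckduckgo: 'duckduckgo',
  ddg: 'duckduckgo',
  'linux.do': 'linux.do',
  linuxdo: 'linux.do',
};

// Normalize case and whitespace, then resolve known aliases;
// unknown names pass through lowercased.
function normalizeEngineName(raw: string): string {
  const key = raw.trim().toLowerCase();
  return ENGINE_ALIASES[key] ?? key;
}
```

In a Zod schema this kind of helper is typically wired in with `z.string().transform(normalizeEngineName)`, so every validated request sees a canonical engine name.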
And adding general web page content scraping functionality
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d1a702eff4
```ts
const validateWebUrl = (url: string): boolean => {
  try {
    const urlObj = new URL(url);
    return urlObj.protocol === 'http:' || urlObj.protocol === 'https:';
```
Block private-network targets in web URL validation
fetchWebContent is exposed to MCP clients as a “public HTTP(S)” fetch tool, but this validator only checks the scheme and therefore accepts hosts like 127.0.0.1, localhost, RFC1918 ranges, or cloud metadata endpoints (for example 169.254.169.254). In deployments where the server can reach internal services, this becomes an SSRF primitive that lets callers read internal-only resources through the tool.
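One possible shape for such a guard is sketched below. The function name, the blocked ranges, and the IPv6 handling are assumptions; note that a complete fix must also re-check the address *after* DNS resolution, since a public hostname can resolve to a private IP:

```typescript
import { isIP } from 'node:net';

// Literal-IP patterns to reject: loopback, RFC1918, link-local/metadata.
const PRIVATE_V4 = [
  /^127\./,                      // loopback
  /^10\./,                       // RFC1918
  /^172\.(1[6-9]|2\d|3[01])\./,  // RFC1918
  /^192\.168\./,                 // RFC1918
  /^169\.254\./,                 // link-local, incl. cloud metadata
  /^0\./,                        // "this network"
];

function isBlockedHost(hostname: string): boolean {
  // URL.hostname wraps IPv6 literals in brackets; strip them first.
  const host = hostname.replace(/^\[|\]$/g, '').toLowerCase();
  if (host === 'localhost' || host.endsWith('.localhost')) return true;
  if (isIP(host) === 4) return PRIVATE_V4.some((re) => re.test(host));
  if (isIP(host) === 6) {
    // Loopback, link-local, and unique-local prefixes (coarse check).
    return host === '::1' || host.startsWith('fe80') || host.startsWith('fc') || host.startsWith('fd');
  }
  return false; // hostname: must be re-validated after DNS resolution
}
```

`validateWebUrl` could then reject any URL whose `hostname` fails this check in addition to the existing scheme check.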
```ts
    requestOptions.httpsAgent = proxyAgent;
  }

  const response = await axios.get(parsedUrl.toString(), requestOptions);
```
Enforce response size limits before fetching page bodies
This request fetches arbitrary URLs with responseType: 'text' but does not set any download/body limits, so Axios will buffer the full response in memory before the later maxChars truncation is applied. A large file or intentionally oversized response can exhaust memory or stall the process even when callers request a small maxChars value.
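Axios's `maxContentLength` request option is one way to bound the buffered body. Alternatively, the response can be streamed and the read stopped at a byte cap, as in the sketch below (the `readCapped` name and the 2 MiB limit are assumptions for illustration):

```typescript
import { Readable } from 'node:stream';

// Assumed cap on downloaded bytes (illustrative value).
const MAX_BYTES = 2 * 1024 * 1024;

// Read a stream into a string, aborting the download once maxBytes is reached,
// so an oversized response cannot be fully buffered in memory.
async function readCapped(stream: Readable, maxBytes: number = MAX_BYTES): Promise<string> {
  const chunks: Buffer[] = [];
  let total = 0;
  for await (const chunk of stream) {
    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk);
    total += buf.length;
    chunks.push(buf);
    if (total >= maxBytes) {
      stream.destroy(); // stop pulling further data from the socket
      break;
    }
  }
  return Buffer.concat(chunks).subarray(0, maxBytes).toString('utf8');
}
```

With `responseType: 'stream'`, the Axios response body could be passed through a cap like this before the `maxChars` character truncation is applied.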
Closes #19