Improve WebFetchTool text extraction with HTML entity decoding and structure preservation#647
Improve WebFetchTool text extraction with HTML entity decoding and structure preservation#647liangzhang-keepmoving wants to merge 3 commits intosipeed:mainfrom
Conversation
…ructure preservation
nikolasdehor
left a comment
There was a problem hiding this comment.
This is a subset of the web.go changes from PR #649 — same author, same code. Reviewing independently:
1. Use html.UnescapeString instead of a hand-rolled entity map
Go standard library has html.UnescapeString() in the html package which handles all named entities, numeric entities (&#NNN;), and hex entities (&#xHHHH;). The manual map only covers 9 of the hundreds of HTML entities. For example, —, …, «, ​ (zero-width space) are all missed.
Replace the entire decodeHTMLEntities method with:
import "html"
result = html.UnescapeString(result)2. Performance: 30+ regexps compiled per call
removeHTMLTags compiles a new regexp for each of the 20 block tags (opening + closing = 40 regexp compilations), plus the final catch-all. These should be compiled once as package-level variables. On a typical web page, extractText is called once per fetch, but regexp compilation is expensive relative to the replacement itself.
3. Comments in Chinese
The codebase uses English comments throughout. Please translate the Chinese comments.
4. decodeHTMLEntities runs before tag removal
HTML entities inside attribute values (e.g., <a href="foo&bar">) are decoded first, then the tag-removal regex runs. This ordering is fine in practice since the regex just strips the whole tag, but it is worth noting that decoding entities before stripping tags is the correct order (content entities need to be decoded, attribute entities get stripped anyway).
5. No tests
There are no tests for the new extraction behavior. At minimum, test that extractText correctly handles: block tag structure preservation, entity decoding, nested tags, and edge cases like self-closing tags.
Please use html.UnescapeString (item 1) and compile regexps once (item 2).
|
Thank you for your detailed feedback. I've addressed all your suggestions as follows: 1. ✅ Use html.UnescapeString instead of custom entity map
2. ✅ Compile regexps once as package-level variables
3. ✅ Translated Chinese comments to English
4. ✅ Adjusted processing order (decode after tag removal)
5. ✅ Added comprehensive tests
Additional Improvement
|
|
The results of the test are listed below: === RUN TestWebFetchTool_extractText |
|
Hello maintainers, I have completed all the requested changes for this PR: 1) Used html.UnescapeString instead of custom entity map, 2) Compiled regexes as package-level variables for better performance, 3) Translated all comments to English, 4) Adjusted processing order for better entity handling, and 5) Added comprehensive tests. All tests are passing successfully. Could you please review and merge this PR when you have time? Thank you! |
|
Your Name seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
|
@liangzhang-keepmoving Hi! This PR has had no activity for over 2 weeks, so I'm closing it for now to keep things tidy. If it's still relevant, feel free to reopen it anytime and we'll pick it back up. |
Summary
This PR improves the
WebFetchTooltext extraction capabilities by:&,<,>,", etc.Changes Made
pkg/tools/web.go:extractText()method to use new helper functionsdecodeHTMLEntities()method for HTML entity decodingremoveHTMLTags()method for smarter tag removalBenefits
This change enhances PicoClaw's ability to fetch and process web content, making it more reliable for tasks that require up-to-date information from the internet.