Skip to content

Improve WebFetchTool text extraction with HTML entity decoding and structure preservation#647

Closed
liangzhang-keepmoving wants to merge 3 commits intosipeed:mainfrom
liangzhang-keepmoving:main
Closed

Improve WebFetchTool text extraction with HTML entity decoding and structure preservation#647
liangzhang-keepmoving wants to merge 3 commits intosipeed:mainfrom
liangzhang-keepmoving:main

Conversation

@liangzhang-keepmoving
Copy link
Copy Markdown

Summary

This PR improves the WebFetchTool text extraction capabilities by:

  1. Adding HTML entity decoding - Now properly handles entities like &, <, >, ", etc.
  2. Preserving content structure - Adds newlines for block-level elements to maintain readability
  3. Refactoring extraction logic - Split into smaller, more manageable methods
  4. Improving text cleaning - Better handling of whitespace and empty lines

Changes Made

  • Modified pkg/tools/web.go:
    • Updated extractText() method to use new helper functions
    • Added decodeHTMLEntities() method for HTML entity decoding
    • Added removeHTMLTags() method for smarter tag removal

Benefits

  • More accurate text extraction from HTML pages
  • Better readability of extracted content
  • Support for properly displaying HTML entities
  • More maintainable code structure

This change enhances PicoClaw's ability to fetch and process web content, making it more reliable for tasks that require up-to-date information from the internet.

Copy link
Copy Markdown

@nikolasdehor nikolasdehor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a subset of the web.go changes from PR #649 — same author, same code. Reviewing independently:

1. Use html.UnescapeString instead of a hand-rolled entity map

Go standard library has html.UnescapeString() in the html package which handles all named entities, numeric entities (&#NNN;), and hex entities (&#xHHHH;). The manual map only covers 9 of the hundreds of HTML entities. For example, —, …, «, ​ (zero-width space) are all missed.

Replace the entire decodeHTMLEntities method with:

import "html"
result = html.UnescapeString(result)

2. Performance: 30+ regexps compiled per call

removeHTMLTags compiles a new regexp for each of the 20 block tags (opening + closing = 40 regexp compilations), plus the final catch-all. These should be compiled once as package-level variables. On a typical web page, extractText is called once per fetch, but regexp compilation is expensive relative to the replacement itself.

3. Comments in Chinese

The codebase uses English comments throughout. Please translate the Chinese comments.

4. decodeHTMLEntities runs before tag removal

HTML entities inside attribute values (e.g., <a href="foo&amp;bar">) are decoded first, then the tag-removal regex runs. This ordering is fine in practice since the regex just strips the whole tag, but it is worth noting that decoding entities before stripping tags is the correct order (content entities need to be decoded, attribute entities get stripped anyway).

5. No tests

There are no tests for the new extraction behavior. At minimum, test that extractText correctly handles: block tag structure preservation, entity decoding, nested tags, and edge cases like self-closing tags.

Please use html.UnescapeString (item 1) and compile regexps once (item 2).

@liangzhang-keepmoving
Copy link
Copy Markdown
Author

Thank you for your detailed feedback. I've addressed all your suggestions as follows:

1. ✅ Use html.UnescapeString instead of custom entity map

  • Change : Replaced the custom decodeHTMLEntities method with html.UnescapeString()
  • Code Reference :
    • Added html package import: pkg/tools/web.go:8
    • Updated extractText function: pkg/tools/web.go:635
    • Removed custom decodeHTMLEntities method entirely
  • Benefit : Now supports all standard HTML entities, including — , … , « , etc.

2. ✅ Compile regexps once as package-level variables

  • Change : Moved all regex compilations to package-level variables with init() function
  • Code Reference :
    • Added package-level regex variables: pkg/tools/web.go:21-48
    • Updated extractText function to use precompiled regexes: pkg/tools/web.go:624-640
    • Updated removeHTMLTags function: pkg/tools/web.go:656-661
  • Benefit : Reduced regex compilation overhead from ~40 compilations per call to 1-time compilation

3. ✅ Translated Chinese comments to English

  • Change : Updated all comments to English for consistency
  • Code Reference :
    • Updated comments in extractText function: pkg/tools/web.go:622-653
    • Updated comments in removeHTMLTags function: pkg/tools/web.go:657-664
  • Benefit : Maintains consistency with the codebase's English comment style

4. ✅ Adjusted processing order (decode after tag removal)

  • Change : Modified the processing sequence to remove tags first, then decode entities
  • Code Reference :
    • Updated extractText function processing order: pkg/tools/web.go:632-635
  • Benefit : Ensures content entities are decoded while attribute entities are properly stripped

5. ✅ Added comprehensive tests

  • Change : Added extensive test cases for the new functionality
  • Code Reference :
    • Added test cases in web_test.go : pkg/tools/web_test.go:306-361
  • Tests Added :
    • HTML entity decoding test
    • Nested tags test
    • Self-closing tags test
    • Complex HTML structure test
    • Edge cases test
  • Result : All tests pass successfully

Additional Improvement

  • Added self-closing tag handling : Added special handling for
    tags to preserve text flow
  • Code Reference : pkg/tools/web.go:627-629
    All changes have been implemented and tested, with all tests passing successfully. The code now follows best practices and addresses all your concerns.

@liangzhang-keepmoving
Copy link
Copy Markdown
Author

The results of the test are listed below:

=== RUN TestWebFetchTool_extractText
=== RUN TestWebFetchTool_extractText/preserves_newlines_between_block_elements
=== RUN TestWebFetchTool_extractText/removes_script_and_style_tags
=== RUN TestWebFetchTool_extractText/collapses_excessive_blank_lines
=== RUN TestWebFetchTool_extractText/collapses_horizontal_whitespace
=== RUN TestWebFetchTool_extractText/empty_input
=== RUN TestWebFetchTool_extractText/HTML_entity_decoding
=== RUN TestWebFetchTool_extractText/nested_tags
=== RUN TestWebFetchTool_extractText/self-closing_tags
=== RUN TestWebFetchTool_extractText/complex_HTML_with_multiple_block_elements
=== RUN TestWebFetchTool_extractText/HTML_entities_in_attributes
--- PASS: TestWebFetchTool_extractText (0.00s)
--- PASS: TestWebFetchTool_extractText/preserves_newlines_between_block_elements (0.00s)
--- PASS: TestWebFetchTool_extractText/removes_script_and_style_tags (0.00s)
--- PASS: TestWebFetchTool_extractText/collapses_excessive_blank_lines (0.00s)
--- PASS: TestWebFetchTool_extractText/collapses_horizontal_whitespace (0.00s)
--- PASS: TestWebFetchTool_extractText/empty_input (0.00s)
--- PASS: TestWebFetchTool_extractText/HTML_entity_decoding (0.00s)
--- PASS: TestWebFetchTool_extractText/nested_tags (0.00s)
--- PASS: TestWebFetchTool_extractText/self-closing_tags (0.00s)
--- PASS: TestWebFetchTool_extractText/complex_HTML_with_multiple_block_elements (0.00s)
--- PASS: TestWebFetchTool_extractText/HTML_entities_in_attributes (0.00s)

@liangzhang-keepmoving
Copy link
Copy Markdown
Author

Hello maintainers, I have completed all the requested changes for this PR: 1) Used html.UnescapeString instead of custom entity map, 2) Compiled regexes as package-level variables for better performance, 3) Translated all comments to English, 4) Adjusted processing order for better entity handling, and 5) Added comprehensive tests. All tests are passing successfully. Could you please review and merge this PR when you have time? Thank you!

@sipeed-bot sipeed-bot bot added type: enhancement New feature or request domain: tool labels Mar 3, 2026
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 5, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ liangzhang-keepmoving
❌ Your Name


Your Name seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@sipeed-bot
Copy link
Copy Markdown

sipeed-bot bot commented Mar 26, 2026

@liangzhang-keepmoving Hi! This PR has had no activity for over 2 weeks, so I'm closing it for now to keep things tidy. If it's still relevant, feel free to reopen it anytime and we'll pick it back up.

@sipeed-bot sipeed-bot bot closed this Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: tool type: enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants