Improve WebFetchTool text extraction with HTML entity decoding and structure preservation by liangzhang-keepmoving · Pull Request #647 · sipeed/picoclaw

liangzhang-keepmoving · 2026-02-22T17:56:29Z

Summary

This PR improves the WebFetchTool text extraction capabilities by:

Adding HTML entity decoding - Now properly handles entities like &, <, >, ", etc.
Preserving content structure - Adds newlines for block-level elements to maintain readability
Refactoring extraction logic - Split into smaller, more manageable methods
Improving text cleaning - Better handling of whitespace and empty lines

Changes Made

Modified pkg/tools/web.go:
- Updated extractText() method to use new helper functions
- Added decodeHTMLEntities() method for HTML entity decoding
- Added removeHTMLTags() method for smarter tag removal

Benefits

More accurate text extraction from HTML pages
Better readability of extracted content
Support for properly displaying HTML entities
More maintainable code structure

This change enhances PicoClaw's ability to fetch and process web content, making it more reliable for tasks that require up-to-date information from the internet.

…ructure preservation

nikolasdehor

This is a subset of the web.go changes from PR #649 — same author, same code. Reviewing independently:

1. Use html.UnescapeString instead of a hand-rolled entity map

Go standard library has html.UnescapeString() in the html package which handles all named entities, numeric entities (&#NNN;), and hex entities (&#xHHHH;). The manual map only covers 9 of the hundreds of HTML entities. For example, —, …, «,  (zero-width space) are all missed.

Replace the entire decodeHTMLEntities method with:

import "html"
result = html.UnescapeString(result)

2. Performance: 30+ regexps compiled per call

removeHTMLTags compiles a new regexp for each of the 20 block tags (opening + closing = 40 regexp compilations), plus the final catch-all. These should be compiled once as package-level variables. On a typical web page, extractText is called once per fetch, but regexp compilation is expensive relative to the replacement itself.

3. Comments in Chinese

The codebase uses English comments throughout. Please translate the Chinese comments.

4. decodeHTMLEntities runs before tag removal

HTML entities inside attribute values (e.g., <a href="foo&bar">) are decoded first, then the tag-removal regex runs. This ordering is fine in practice since the regex just strips the whole tag, but it is worth noting that decoding entities before stripping tags is the correct order (content entities need to be decoded, attribute entities get stripped anyway).

5. No tests

There are no tests for the new extraction behavior. At minimum, test that extractText correctly handles: block tag structure preservation, entity decoding, nested tags, and edge cases like self-closing tags.

Please use html.UnescapeString (item 1) and compile regexps once (item 2).

liangzhang-keepmoving · 2026-02-23T08:52:48Z

Thank you for your detailed feedback. I've addressed all your suggestions as follows:

1. ✅ Use html.UnescapeString instead of custom entity map

Change : Replaced the custom decodeHTMLEntities method with html.UnescapeString()
Code Reference :
- Added html package import: pkg/tools/web.go:8
- Updated extractText function: pkg/tools/web.go:635
- Removed custom decodeHTMLEntities method entirely
Benefit : Now supports all standard HTML entities, including — , … , « , etc.

2. ✅ Compile regexps once as package-level variables

Change : Moved all regex compilations to package-level variables with init() function
Code Reference :
- Added package-level regex variables: pkg/tools/web.go:21-48
- Updated extractText function to use precompiled regexes: pkg/tools/web.go:624-640
- Updated removeHTMLTags function: pkg/tools/web.go:656-661
Benefit : Reduced regex compilation overhead from ~40 compilations per call to 1-time compilation

3. ✅ Translated Chinese comments to English

Change : Updated all comments to English for consistency
Code Reference :
- Updated comments in extractText function: pkg/tools/web.go:622-653
- Updated comments in removeHTMLTags function: pkg/tools/web.go:657-664
Benefit : Maintains consistency with the codebase's English comment style

4. ✅ Adjusted processing order (decode after tag removal)

Change : Modified the processing sequence to remove tags first, then decode entities
Code Reference :
- Updated extractText function processing order: pkg/tools/web.go:632-635
Benefit : Ensures content entities are decoded while attribute entities are properly stripped

5. ✅ Added comprehensive tests

Change : Added extensive test cases for the new functionality
Code Reference :
- Added test cases in web_test.go : pkg/tools/web_test.go:306-361
Tests Added :
- HTML entity decoding test
- Nested tags test
- Self-closing tags test
- Complex HTML structure test
- Edge cases test
Result : All tests pass successfully

Additional Improvement

Added self-closing tag handling : Added special handling for
tags to preserve text flow
Code Reference : pkg/tools/web.go:627-629
All changes have been implemented and tested, with all tests passing successfully. The code now follows best practices and addresses all your concerns.

liangzhang-keepmoving · 2026-02-23T08:56:20Z

The results of the test are listed below:

=== RUN TestWebFetchTool_extractText
=== RUN TestWebFetchTool_extractText/preserves_newlines_between_block_elements
=== RUN TestWebFetchTool_extractText/removes_script_and_style_tags
=== RUN TestWebFetchTool_extractText/collapses_excessive_blank_lines
=== RUN TestWebFetchTool_extractText/collapses_horizontal_whitespace
=== RUN TestWebFetchTool_extractText/empty_input
=== RUN TestWebFetchTool_extractText/HTML_entity_decoding
=== RUN TestWebFetchTool_extractText/nested_tags
=== RUN TestWebFetchTool_extractText/self-closing_tags
=== RUN TestWebFetchTool_extractText/complex_HTML_with_multiple_block_elements
=== RUN TestWebFetchTool_extractText/HTML_entities_in_attributes
--- PASS: TestWebFetchTool_extractText (0.00s)
--- PASS: TestWebFetchTool_extractText/preserves_newlines_between_block_elements (0.00s)
--- PASS: TestWebFetchTool_extractText/removes_script_and_style_tags (0.00s)
--- PASS: TestWebFetchTool_extractText/collapses_excessive_blank_lines (0.00s)
--- PASS: TestWebFetchTool_extractText/collapses_horizontal_whitespace (0.00s)
--- PASS: TestWebFetchTool_extractText/empty_input (0.00s)
--- PASS: TestWebFetchTool_extractText/HTML_entity_decoding (0.00s)
--- PASS: TestWebFetchTool_extractText/nested_tags (0.00s)
--- PASS: TestWebFetchTool_extractText/self-closing_tags (0.00s)
--- PASS: TestWebFetchTool_extractText/complex_HTML_with_multiple_block_elements (0.00s)
--- PASS: TestWebFetchTool_extractText/HTML_entities_in_attributes (0.00s)

liangzhang-keepmoving · 2026-03-02T11:15:00Z

Hello maintainers, I have completed all the requested changes for this PR: 1) Used html.UnescapeString instead of custom entity map, 2) Compiled regexes as package-level variables for better performance, 3) Translated all comments to English, 4) Adjusted processing order for better entity handling, and 5) Added comprehensive tests. All tests are passing successfully. Could you please review and merge this PR when you have time? Thank you!

CLAassistant · 2026-03-05T15:07:48Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ liangzhang-keepmoving
❌ Your Name

Your Name seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

sipeed-bot · 2026-03-26T08:50:17Z

@liangzhang-keepmoving Hi! This PR has had no activity for over 2 weeks, so I'm closing it for now to keep things tidy. If it's still relevant, feel free to reopen it anytime and we'll pick it back up.

Improve WebFetchTool text extraction with HTML entity decoding and st…

99ba905

…ructure preservation

nikolasdehor suggested changes Feb 23, 2026

View reviewed changes

Address reviewer feedback: Improve WebFetchTool text extraction

e1687b7

liangzhang-keepmoving requested a review from nikolasdehor February 23, 2026 11:15

sipeed-bot bot added type: enhancement New feature or request domain: tool labels Mar 3, 2026

Merge branch 'main' into main

75f1537

sipeed-bot bot closed this Mar 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve WebFetchTool text extraction with HTML entity decoding and structure preservation#647

Improve WebFetchTool text extraction with HTML entity decoding and structure preservation#647
liangzhang-keepmoving wants to merge 3 commits intosipeed:mainfrom
liangzhang-keepmoving:main

liangzhang-keepmoving commented Feb 22, 2026

Uh oh!

nikolasdehor left a comment

Uh oh!

liangzhang-keepmoving commented Feb 23, 2026

Uh oh!

liangzhang-keepmoving commented Feb 23, 2026

Uh oh!

liangzhang-keepmoving commented Mar 2, 2026

Uh oh!

CLAassistant commented Mar 5, 2026 •

edited

Loading

Uh oh!

sipeed-bot bot commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

liangzhang-keepmoving commented Feb 22, 2026

Summary

Changes Made

Benefits

Uh oh!

nikolasdehor left a comment

Choose a reason for hiding this comment

Uh oh!

liangzhang-keepmoving commented Feb 23, 2026

1. ✅ Use html.UnescapeString instead of custom entity map

2. ✅ Compile regexps once as package-level variables

3. ✅ Translated Chinese comments to English

4. ✅ Adjusted processing order (decode after tag removal)

5. ✅ Added comprehensive tests

Additional Improvement

Uh oh!

liangzhang-keepmoving commented Feb 23, 2026

Uh oh!

liangzhang-keepmoving commented Mar 2, 2026

Uh oh!

CLAassistant commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sipeed-bot bot commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CLAassistant commented Mar 5, 2026 •

edited

Loading