fix(llm): prevent UTF-8 panic in line_bounds() (fixes #1669)#1679
fix(llm): prevent UTF-8 panic in line_bounds() (fixes #1669)#1679ilblackdragon merged 2 commits intonearai:stagingfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a crucial fix to enhance the robustness of text processing, specifically addressing a potential panic in the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors thread sorting logic in chat_threads_handler to use sort_by_key with std::cmp::Reverse for improved idiomatic Rust. More significantly, it enhances the line_bounds function to be UTF-8 safe by correctly handling positions within multi-byte characters and clamping out-of-bounds inputs, preventing potential panics. Comprehensive unit tests were added to validate these changes. The review suggests strengthening two of these new tests, test_line_bounds_mid_multibyte_char and test_line_bounds_emoji_before_newline, by adding explicit assertions for the expected return values to confirm correct boundary calculation, rather than just verifying non-panicking behavior.
| let _ = line_bounds(text, 1); // would panic before fix | ||
| let _ = line_bounds(text, 2); | ||
| let _ = line_bounds(text, 3); |
There was a problem hiding this comment.
This test is great for ensuring the function doesn't panic. To make it more robust, consider also asserting the expected return value. This would verify that line_bounds not only avoids panicking but also correctly snaps to the valid character boundary as intended.
| let _ = line_bounds(text, 1); // would panic before fix | |
| let _ = line_bounds(text, 2); | |
| let _ = line_bounds(text, 3); | |
| assert_eq!(line_bounds(text, 1), (0, 4)); | |
| assert_eq!(line_bounds(text, 2), (0, 4)); | |
| assert_eq!(line_bounds(text, 3), (0, 4)); |
References
- When handling UTF-8 strings, especially when dealing with truncation or boundary calculations, it's crucial to use character-aware methods and ensure that operations snap to valid character boundaries to prevent panics and ensure correctness.
| let text = "Result: 🔥\n<tool_call>"; | ||
| let newline_pos = text.find('\n').unwrap(); | ||
| // saturating_sub(1) lands inside '🔥' (byte 11 → 10, but char ends at 12) | ||
| let _ = line_bounds(text, newline_pos.saturating_sub(1)); |
There was a problem hiding this comment.
Similar to the test_line_bounds_mid_multibyte_char test, this test would be stronger if it asserted the expected output. This ensures the function is not just avoiding a panic but is also correctly calculating the line bounds when the position is inside a multi-byte character.
| let _ = line_bounds(text, newline_pos.saturating_sub(1)); | |
| assert_eq!(line_bounds(text, newline_pos.saturating_sub(1)), (0, 12)); |
References
- When handling UTF-8 strings, especially when dealing with truncation or boundary calculations, it's crucial to use character-aware methods and ensure that operations snap to valid character boundaries to prevent panics and ensure correctness.
serrrfirat
left a comment
There was a problem hiding this comment.
Clean, focused fix. The backward walk to is_char_boundary is the standard Rust idiom for this class of bug, and the 5 regression tests cover the key scenarios well (mid-char positions, emoji before newline triggering the actual saturating_sub(1) path, pos beyond length, baseline ASCII).
The clippy sort_by → sort_by_key(Reverse(...)) change is correct (DateTime<Utc> implements Ord + Copy).
Re: gemini-bot comments — both are false positives. The tests at lines 2336 and 2349 already assert exact expected return values (0, 4) and (0, 12). The bot's "suggestions" are identical to the existing code.
One cosmetic nit (non-blocking): the comment on line 1388 says "at most 3 bytes for UTF-8" — this refers to the walk-back distance (up to 3 steps for a 4-byte char), not the max char width. Could be slightly clearer, but it's technically correct.
LGTM ✅
serrrfirat
left a comment
There was a problem hiding this comment.
Clean, focused fix. The backward walk to is_char_boundary is the standard Rust idiom for this class of bug, and the 5 regression tests cover the key scenarios well (mid-char positions, emoji before newline triggering the actual saturating_sub(1) path, pos beyond length, baseline ASCII).
The clippy sort_by to sort_by_key(Reverse(...)) change is correct (DateTime<Utc> implements Ord + Copy).
Re: gemini-bot comments — both are false positives. The tests at lines 2336 and 2349 already assert exact expected return values (0, 4) and (0, 12). The bot's "suggestions" are identical to the existing code.
One cosmetic nit (non-blocking): the comment on line 1388 says "at most 3 bytes for UTF-8" — this refers to the walk-back distance (up to 3 steps for a 4-byte char), not the max char width. Could be slightly clearer, but it's technically correct.
LGTM
zmanian
left a comment
There was a problem hiding this comment.
Review: fix(llm): prevent UTF-8 panic in line_bounds()
Correct fix, well-tested. The is_char_boundary() walk-back pattern follows project conventions.
Findings
Low (nit): The inline walk-back loop duplicates crate::util::floor_char_boundary(). Could simplify to:
let safe = crate::util::floor_char_boundary(text, pos);Not blocking -- the inline version is correct and the comment is helpful.
Medium (out of scope): src/llm/nearai_chat.rs:454 has the same unsafe pattern:
&response_text[..response_text.len().min(300)]This operates on HTTP response text from an external API. Should use floor_char_boundary. Worth a follow-up issue.
Verified
- Core fix correctly prevents UTF-8 panic via
is_char_boundary()walk-back - 5 regression tests covering ASCII, boundary at len, mid-multibyte (positions 1/2/3 of 4-byte emoji), emoji-before-newline (exact crash scenario from #1669), and pos-beyond-len
- Clippy
sort_by->sort_by_key(Reverse(...))fixes are correct
Approve. Ship it.
`line_bounds()` used `text[..pos]` slicing which panics when `pos` lands inside a multi-byte UTF-8 character. This happens when `end.saturating_sub(1)` in `is_recoverable_tool_call_segment()` steps back into a multi-byte char like emoji. Fix: clamp `pos` to `text.len()` and walk backward to the nearest char boundary before slicing. Add 5 regression tests covering mid-char positions, emoji boundaries, and out-of-bounds pos. Also fix pre-existing clippy `unnecessary_sort_by` warnings in web gateway handlers. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>
Address Gemini review: strengthen regression tests to verify correct return values (not just absence of panic) when pos lands mid-char. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>
1d888be to
bb6be89
Compare
|
Hi @zmanian, I've rebased this branch onto the latest staging. Could you please approve the workflow run so CI can execute? Thanks! |
…earai#1679) * fix(llm): prevent UTF-8 panic in line_bounds() (fixes nearai#1669) `line_bounds()` used `text[..pos]` slicing which panics when `pos` lands inside a multi-byte UTF-8 character. This happens when `end.saturating_sub(1)` in `is_recoverable_tool_call_segment()` steps back into a multi-byte char like emoji. Fix: clamp `pos` to `text.len()` and walk backward to the nearest char boundary before slicing. Add 5 regression tests covering mid-char positions, emoji boundaries, and out-of-bounds pos. Also fix pre-existing clippy `unnecessary_sort_by` warnings in web gateway handlers. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> * test: assert expected values in line_bounds UTF-8 tests Address Gemini review: strengthen regression tests to verify correct return values (not just absence of panic) when pos lands mid-char. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> --------- Co-authored-by: willamhou <willamhou@ceresman.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Happy <yesreply@happy.engineering>
…earai#1679) * fix(llm): prevent UTF-8 panic in line_bounds() (fixes nearai#1669) `line_bounds()` used `text[..pos]` slicing which panics when `pos` lands inside a multi-byte UTF-8 character. This happens when `end.saturating_sub(1)` in `is_recoverable_tool_call_segment()` steps back into a multi-byte char like emoji. Fix: clamp `pos` to `text.len()` and walk backward to the nearest char boundary before slicing. Add 5 regression tests covering mid-char positions, emoji boundaries, and out-of-bounds pos. Also fix pre-existing clippy `unnecessary_sort_by` warnings in web gateway handlers. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> * test: assert expected values in line_bounds UTF-8 tests Address Gemini review: strengthen regression tests to verify correct return values (not just absence of panic) when pos lands mid-char. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> --------- Co-authored-by: willamhou <willamhou@ceresman.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Happy <yesreply@happy.engineering>
Summary
line_bounds()(src/llm/reasoning.rs) whenposlands inside a multi-byte UTF-8 characteris_recoverable_tool_call_segment()callsline_bounds(text, end.saturating_sub(1))which can step into the middle of a multi-byte char (e.g., emoji🔥is 4 bytes — subtracting 1 from the byte after it lands at byte 3 of the char)postotext.len()and walk backward to the nearestis_char_boundary()before slicingAlso fixes pre-existing clippy
unnecessary_sort_bywarnings in web gateway handlers (zero-warning policy).Test plan
cargo test -- tests::test_line_bounds— 5 new tests passcargo clippy --all --benches --tests --examples --all-features— zero warningscargo fmt --check— cleanCloses #1669
🤖 Generated with Claude Code