Skip to content

fix(llm): prevent UTF-8 panic in line_bounds() (fixes #1669)#1679

Merged
ilblackdragon merged 2 commits intonearai:stagingfrom
willamhou:fix/utf8-panic-1669
Mar 27, 2026
Merged

fix(llm): prevent UTF-8 panic in line_bounds() (fixes #1669)#1679
ilblackdragon merged 2 commits intonearai:stagingfrom
willamhou:fix/utf8-panic-1669

Conversation

@willamhou
Copy link
Copy Markdown
Contributor

Summary

  • Fix potential panic in line_bounds() (src/llm/reasoning.rs) when pos lands inside a multi-byte UTF-8 character
  • is_recoverable_tool_call_segment() calls line_bounds(text, end.saturating_sub(1)) which can step into the middle of a multi-byte char (e.g., emoji 🔥 is 4 bytes — subtracting 1 from the byte after it lands at byte 3 of the char)
  • Fix: clamp pos to text.len() and walk backward to the nearest is_char_boundary() before slicing
  • 5 regression tests covering: mid-char positions, emoji before newline, at-length, and out-of-bounds

Also fixes pre-existing clippy unnecessary_sort_by warnings in web gateway handlers (zero-warning policy).

Test plan

  • cargo test -- tests::test_line_bounds — 5 new tests pass
  • cargo clippy --all --benches --tests --examples --all-features — zero warnings
  • cargo fmt --check — clean

Closes #1669

🤖 Generated with Claude Code

@github-actions github-actions Bot added scope: channel/web Web gateway channel scope: llm LLM integration size: M 50-199 changed lines risk: medium Business logic, config, or moderate-risk modules contributor: new First-time contributor labels Mar 26, 2026
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial fix to enhance the robustness of text processing, specifically addressing a potential panic in the line_bounds() function when dealing with multi-byte UTF-8 characters. By ensuring that string slicing always occurs at valid character boundaries, the change prevents crashes in scenarios involving complex text, such as emojis. The update also includes minor code quality improvements by resolving Clippy warnings, contributing to a cleaner and more stable codebase.

Highlights

  • UTF-8 Safety Fix: Prevented a potential panic in the line_bounds() function within src/llm/reasoning.rs when the input position falls inside a multi-byte UTF-8 character.
  • Robust Position Handling: Implemented logic to clamp the input position to text.len() and walk backward to the nearest character boundary, ensuring safe string slicing.
  • Comprehensive Testing: Added five new regression tests specifically for line_bounds() to cover various edge cases, including mid-character positions, emojis before newlines, and out-of-bounds positions.
  • Code Quality Improvement: Addressed and resolved pre-existing Clippy unnecessary_sort_by warnings in web gateway handlers (src/channels/web/handlers/chat.rs and src/channels/web/server.rs).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors thread sorting logic in chat_threads_handler to use sort_by_key with std::cmp::Reverse for improved idiomatic Rust. More significantly, it enhances the line_bounds function to be UTF-8 safe by correctly handling positions within multi-byte characters and clamping out-of-bounds inputs, preventing potential panics. Comprehensive unit tests were added to validate these changes. The review suggests strengthening two of these new tests, test_line_bounds_mid_multibyte_char and test_line_bounds_emoji_before_newline, by adding explicit assertions for the expected return values to confirm correct boundary calculation, rather than just verifying non-panicking behavior.

Comment thread src/llm/reasoning.rs Outdated
Comment on lines +2334 to +2336
let _ = line_bounds(text, 1); // would panic before fix
let _ = line_bounds(text, 2);
let _ = line_bounds(text, 3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test is great for ensuring the function doesn't panic. To make it more robust, consider also asserting the expected return value. This would verify that line_bounds not only avoids panicking but also correctly snaps to the valid character boundary as intended.

Suggested change
let _ = line_bounds(text, 1); // would panic before fix
let _ = line_bounds(text, 2);
let _ = line_bounds(text, 3);
assert_eq!(line_bounds(text, 1), (0, 4));
assert_eq!(line_bounds(text, 2), (0, 4));
assert_eq!(line_bounds(text, 3), (0, 4));
References
  1. When handling UTF-8 strings, especially when dealing with truncation or boundary calculations, it's crucial to use character-aware methods and ensure that operations snap to valid character boundaries to prevent panics and ensure correctness.

Comment thread src/llm/reasoning.rs Outdated
let text = "Result: 🔥\n<tool_call>";
let newline_pos = text.find('\n').unwrap();
// saturating_sub(1) lands inside '🔥' (byte 11 → 10, but char ends at 12)
let _ = line_bounds(text, newline_pos.saturating_sub(1));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the test_line_bounds_mid_multibyte_char test, this test would be stronger if it asserted the expected output. This ensures the function is not just avoiding a panic but is also correctly calculating the line bounds when the position is inside a multi-byte character.

Suggested change
let _ = line_bounds(text, newline_pos.saturating_sub(1));
assert_eq!(line_bounds(text, newline_pos.saturating_sub(1)), (0, 12));
References
  1. When handling UTF-8 strings, especially when dealing with truncation or boundary calculations, it's crucial to use character-aware methods and ensure that operations snap to valid character boundaries to prevent panics and ensure correctness.

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, focused fix. The backward walk to is_char_boundary is the standard Rust idiom for this class of bug, and the 5 regression tests cover the key scenarios well (mid-char positions, emoji before newline triggering the actual saturating_sub(1) path, pos beyond length, baseline ASCII).

The clippy sort_bysort_by_key(Reverse(...)) change is correct (DateTime<Utc> implements Ord + Copy).

Re: gemini-bot comments — both are false positives. The tests at lines 2336 and 2349 already assert exact expected return values (0, 4) and (0, 12). The bot's "suggestions" are identical to the existing code.

One cosmetic nit (non-blocking): the comment on line 1388 says "at most 3 bytes for UTF-8" — this refers to the walk-back distance (up to 3 steps for a 4-byte char), not the max char width. Could be slightly clearer, but it's technically correct.

LGTM ✅

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, focused fix. The backward walk to is_char_boundary is the standard Rust idiom for this class of bug, and the 5 regression tests cover the key scenarios well (mid-char positions, emoji before newline triggering the actual saturating_sub(1) path, pos beyond length, baseline ASCII).

The clippy sort_by to sort_by_key(Reverse(...)) change is correct (DateTime<Utc> implements Ord + Copy).

Re: gemini-bot comments — both are false positives. The tests at lines 2336 and 2349 already assert exact expected return values (0, 4) and (0, 12). The bot's "suggestions" are identical to the existing code.

One cosmetic nit (non-blocking): the comment on line 1388 says "at most 3 bytes for UTF-8" — this refers to the walk-back distance (up to 3 steps for a 4-byte char), not the max char width. Could be slightly clearer, but it's technically correct.

LGTM

Copy link
Copy Markdown
Collaborator

@zmanian zmanian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix(llm): prevent UTF-8 panic in line_bounds()

Correct fix, well-tested. The is_char_boundary() walk-back pattern follows project conventions.

Findings

Low (nit): The inline walk-back loop duplicates crate::util::floor_char_boundary(). Could simplify to:

let safe = crate::util::floor_char_boundary(text, pos);

Not blocking -- the inline version is correct and the comment is helpful.

Medium (out of scope): src/llm/nearai_chat.rs:454 has the same unsafe pattern:

&response_text[..response_text.len().min(300)]

This operates on HTTP response text from an external API. Should use floor_char_boundary. Worth a follow-up issue.

Verified

  • Core fix correctly prevents UTF-8 panic via is_char_boundary() walk-back
  • 5 regression tests covering ASCII, boundary at len, mid-multibyte (positions 1/2/3 of 4-byte emoji), emoji-before-newline (exact crash scenario from #1669), and pos-beyond-len
  • Clippy sort_by -> sort_by_key(Reverse(...)) fixes are correct

Approve. Ship it.

willamhou and others added 2 commits March 27, 2026 14:17
`line_bounds()` used `text[..pos]` slicing which panics when `pos`
lands inside a multi-byte UTF-8 character. This happens when
`end.saturating_sub(1)` in `is_recoverable_tool_call_segment()` steps
back into a multi-byte char like emoji.

Fix: clamp `pos` to `text.len()` and walk backward to the nearest
char boundary before slicing. Add 5 regression tests covering
mid-char positions, emoji boundaries, and out-of-bounds pos.

Also fix pre-existing clippy `unnecessary_sort_by` warnings in
web gateway handlers.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Address Gemini review: strengthen regression tests to verify correct
return values (not just absence of panic) when pos lands mid-char.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
@willamhou willamhou force-pushed the fix/utf8-panic-1669 branch from 1d888be to bb6be89 Compare March 27, 2026 06:17
@willamhou
Copy link
Copy Markdown
Contributor Author

Hi @zmanian, I've rebased this branch onto the latest staging. Could you please approve the workflow run so CI can execute? Thanks!

@ilblackdragon ilblackdragon merged commit 7234700 into nearai:staging Mar 27, 2026
14 checks passed
DougAnderson444 pushed a commit to DougAnderson444/ironclaw that referenced this pull request Mar 29, 2026
…earai#1679)

* fix(llm): prevent UTF-8 panic in line_bounds() (fixes nearai#1669)

`line_bounds()` used `text[..pos]` slicing which panics when `pos`
lands inside a multi-byte UTF-8 character. This happens when
`end.saturating_sub(1)` in `is_recoverable_tool_call_segment()` steps
back into a multi-byte char like emoji.

Fix: clamp `pos` to `text.len()` and walk backward to the nearest
char boundary before slicing. Add 5 regression tests covering
mid-char positions, emoji boundaries, and out-of-bounds pos.

Also fix pre-existing clippy `unnecessary_sort_by` warnings in
web gateway handlers.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

* test: assert expected values in line_bounds UTF-8 tests

Address Gemini review: strengthen regression tests to verify correct
return values (not just absence of panic) when pos lands mid-char.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

---------

Co-authored-by: willamhou <willamhou@ceresman.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
drchirag1991 pushed a commit to drchirag1991/ironclaw that referenced this pull request Apr 8, 2026
…earai#1679)

* fix(llm): prevent UTF-8 panic in line_bounds() (fixes nearai#1669)

`line_bounds()` used `text[..pos]` slicing which panics when `pos`
lands inside a multi-byte UTF-8 character. This happens when
`end.saturating_sub(1)` in `is_recoverable_tool_call_segment()` steps
back into a multi-byte char like emoji.

Fix: clamp `pos` to `text.len()` and walk backward to the nearest
char boundary before slicing. Add 5 regression tests covering
mid-char positions, emoji boundaries, and out-of-bounds pos.

Also fix pre-existing clippy `unnecessary_sort_by` warnings in
web gateway handlers.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

* test: assert expected values in line_bounds UTF-8 tests

Address Gemini review: strengthen regression tests to verify correct
return values (not just absence of panic) when pos lands mid-char.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

---------

Co-authored-by: willamhou <willamhou@ceresman.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: new First-time contributor risk: medium Business logic, config, or moderate-risk modules scope: channel/web Web gateway channel scope: llm LLM integration size: M 50-199 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CRITICAL] Potential UTF-8 panic in is_recoverable_tool_call_segment() at line 1405 in src/

4 participants