[BUG]: use byte offset in full-text reader rather than token position #4531

codetheweb · 2025-05-12T20:56:04Z

Description of changes

The full-text reader had a bug where it used the positions of tokens to determine whether the tokens appeared sequentially, rather than using the byte offset of tokens. This was problematic because the index stores byte offsets. Since a single Unicode character can be several bytes wide, the stored byte offsets of adjacent tokens are not guaranteed to be consecutive integers.
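To illustrate the mismatch, here is a minimal sketch that is not taken from the Chroma codebase. It assumes a character n-gram tokenizer (the function `char_ngrams` is hypothetical) and records, for each token, both its position in the token stream and the byte offset at which it starts.

```rust
/// Hypothetical helper: split `text` into overlapping character n-grams.
/// Each entry is (token position, starting byte offset, token text).
fn char_ngrams(text: &str, n: usize) -> Vec<(usize, usize, &str)> {
    // Byte offset of every character boundary, plus the end of the string.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());

    let mut tokens = Vec::new();
    if n == 0 || bounds.len() <= n {
        // Fewer than `n` characters: no complete n-gram exists.
        return tokens;
    }
    for position in 0..bounds.len() - n {
        let start = bounds[position];
        let end = bounds[position + n];
        tokens.push((position, start, &text[start..end]));
    }
    tokens
}
```

For the ASCII string `"hello"` with `n = 3`, positions and byte offsets are both `0, 1, 2`. For `"héllo"`, where `é` takes two bytes in UTF-8, the positions are still `0, 1, 2` but the byte offsets are `0, 1, 3`, so a check that expects offsets to advance by one per token would reject a genuine match.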

Test plan

How are these changes tested?

Added a test that fails on main.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

n/a

github-actions · 2025-05-12T20:56:12Z

propel-code-bot · 2025-05-12T21:08:08Z

Fix for Full-Text Search with Unicode Characters

This PR fixes a bug in the full-text reader where token positions were used instead of byte offsets when determining if tokens appeared sequentially. Since Unicode characters can be variable-width (multiple bytes), using token positions led to incorrect search results when texts contained multi-byte characters. The fix uses byte offsets to properly handle Unicode text.

Key Changes:
• Changed position adjustment logic to use byte offsets instead of token positions
• Simplified the token position adjustment algorithm
• Added a test case for documents containing multi-byte characters
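
The adjustment described in the changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `types.rs`: the toy index, the character n-gram tokenizer, and the names `ngrams_with_offsets`, `build_index`, and `find_phrase` are all hypothetical. The `use_byte_offsets` flag exists only to contrast the two behaviours.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical tokenizer: character n-grams paired with their starting byte offset.
fn ngrams_with_offsets(text: &str, n: usize) -> Vec<(usize, &str)> {
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    if n == 0 || bounds.len() <= n {
        return Vec::new();
    }
    (0..bounds.len() - n)
        .map(|i| (bounds[i], &text[bounds[i]..bounds[i + n]]))
        .collect()
}

/// Toy index: token -> byte offsets at which the token starts in the document.
fn build_index(doc: &str, n: usize) -> HashMap<&str, Vec<usize>> {
    let mut index: HashMap<&str, Vec<usize>> = HashMap::new();
    for (offset, token) in ngrams_with_offsets(doc, n) {
        index.entry(token).or_default().push(offset);
    }
    index
}

/// Byte offsets in the document at which `query` starts.
/// Each posting is shifted back by the query token's own offset, and the
/// shifted sets are intersected. With `use_byte_offsets == false` the shift
/// is the token position instead, which mimics the behaviour described as buggy.
fn find_phrase(
    index: &HashMap<&str, Vec<usize>>,
    query: &str,
    n: usize,
    use_byte_offsets: bool,
) -> Vec<usize> {
    let mut candidates: Option<BTreeSet<usize>> = None;
    for (position, (query_offset, token)) in
        ngrams_with_offsets(query, n).into_iter().enumerate()
    {
        let shift = if use_byte_offsets { query_offset } else { position };
        let starts: BTreeSet<usize> = index
            .get(token)
            .map(|offsets| offsets.iter().filter_map(|o| o.checked_sub(shift)).collect())
            .unwrap_or_default();
        candidates = Some(match candidates {
            None => starts,
            Some(prev) => prev.intersection(&starts).copied().collect(),
        });
    }
    candidates.map(|s| s.into_iter().collect()).unwrap_or_default()
}
```

With the document `"héllo wörld"` and the query `"wörld"`, shifting by byte offset yields a single match at byte 7, while shifting by token position yields no match because `ö` is two bytes wide. For a purely ASCII query the two modes agree, which is consistent with the bug only surfacing on multi-byte text.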

Affected Areas:
• Full-text search implementation in rust/index/src/fulltext/types.rs

This summary was automatically generated by @propel-code-bot

[BUG]: use byte offset in full-text reader rather than token position

8891e68

codetheweb force-pushed the bug-fts-reader-byte-offset-vs-token-position branch from 69dca61 to 8891e68 Compare May 12, 2025 20:56

codetheweb marked this pull request as ready for review May 12, 2025 21:07

codetheweb requested review from HammadB and rescrv May 12, 2025 21:07

rescrv approved these changes May 12, 2025

codetheweb enabled auto-merge (squash) May 12, 2025 21:37

codetheweb merged commit 796610e into main May 12, 2025
70 checks passed

codetheweb deleted the bug-fts-reader-byte-offset-vs-token-position branch May 12, 2025 21:55

github-actions bot commented May 12, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality
