[BUG]: use byte offset in full-text reader rather than token position #4531
Force-pushed from 69dca61 to 8891e68.
Fix for Full-Text Search with Unicode Characters
This PR fixes a bug in the full-text reader where token positions were used instead of byte offsets when determining if tokens appeared sequentially. Since Unicode characters can be variable-width (multiple bytes), using token positions led to incorrect search results when texts contained multi-byte characters. The fix uses byte offsets to properly handle Unicode text.
This summary was automatically generated by @propel-code-bot
Description of changes
The full-text reader had a bug: it used token positions, rather than token byte offsets, to determine whether tokens appeared sequentially. This was a problem because the index stores byte offsets. Since a Unicode character can be several bytes wide, the stored byte offsets of adjacent tokens are not guaranteed to be strictly consecutive.
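To illustrate the failure mode, here is a minimal Python sketch. It is not the reader's actual code: it assumes a character-trigram tokenizer, and the names `char_trigrams`, `adjacent_assuming_unit_steps`, and `adjacent_by_byte_offset` are hypothetical. It shows that a check which expects stored offsets to grow by one per token breaks as soon as a multi-byte character appears, while a check based on the real byte widths does not.

```python
def char_trigrams(text):
    """Return (trigram, byte_offset, position) for each character trigram.

    byte_offset is the UTF-8 byte index where the trigram starts;
    position is the ordinal index of the trigram in the token stream.
    (Hypothetical tokenizer, for illustration only.)
    """
    offsets = []
    offset = 0
    for ch in text:
        offsets.append(offset)
        offset += len(ch.encode("utf-8"))
    return [(text[i:i + 3], offsets[i], i) for i in range(len(text) - 2)]


def adjacent_assuming_unit_steps(tokens):
    # Position-style check: assumes each stored offset is exactly
    # one greater than the previous one. Only true for ASCII text.
    return all(b[1] - a[1] == 1 for a, b in zip(tokens, tokens[1:]))


def adjacent_by_byte_offset(tokens):
    # Byte-offset check: the next trigram must start right after the
    # first character of the previous trigram, however wide it is.
    return all(
        b[1] - a[1] == len(a[0][0].encode("utf-8"))
        for a, b in zip(tokens, tokens[1:])
    )


ascii_tokens = char_trigrams("hello")
unicode_tokens = char_trigrams("h\u00e9llo")  # U+00E9 is 2 bytes in UTF-8

print(unicode_tokens)
print(adjacent_assuming_unit_steps(ascii_tokens))    # ASCII text passes
print(adjacent_assuming_unit_steps(unicode_tokens))  # multi-byte text fails
print(adjacent_by_byte_offset(unicode_tokens))       # byte-aware check passes
```

In this sketch the trigrams of `"h\u00e9llo"` start at byte offsets 0, 1, and 3, so the gap between the second and third token is two bytes even though the tokens are adjacent in the text.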
Test plan
How are these changes tested?
Added a test that fails on main.
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?
n/a