Skip to content

Conversation

@codetheweb
Copy link
Contributor

@codetheweb codetheweb commented Apr 29, 2025

Description of changes

I saw a test_collection_sqlite proptest failure in CI and was able to reproduce it locally by updating rust/frontend/tests/test_collection.proptest-regressions.

SQLite has special handling for two characters in a LIKE query: % and _. We were previously stripping out % from user input and passing _ through. However, because _ acts as a wildcard, this can lead to unexpected results.

After discussing with @Sicheng-Pan we decided to escape both characters, so users can provide % and _ in their $contains query and it will work as expected (a character literal).

Test plan

How are these changes tested?

Added proptest case was previously failing.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

n/a

@github-actions
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@codetheweb codetheweb requested a review from Sicheng-Pan April 29, 2025 22:58
@codetheweb codetheweb marked this pull request as ready for review April 29, 2025 22:58
Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah. The python code explicitly handled this IIRC

@codetheweb
Copy link
Contributor Author

Ah yeah. The python code explicitly handled this IIRC

I don't think it did unless PyPika had something built in.

search_term = f"%{v}%"

@codetheweb codetheweb enabled auto-merge (squash) April 29, 2025 23:14
@HammadB
Copy link
Collaborator

HammadB commented Apr 30, 2025

Ah yeah. The python code explicitly handled this IIRC

I don't think it did unless PyPika had something built in.

search_term = f"%{v}%"

https://github.com/chroma-core/chroma/pull/965/files#diff-403dd784c71c55196b6483bc3977c92f24a327a20b3d73429c1ba9a99ef9ef43R87

Pretty sure this is making an API behavior change? It is currently covered by tests

This could be a breaking change AFAICT and should be thought about + communicated. I think its worth thinking about if we want parity with distributed also.

@propel-code-bot
Copy link
Contributor

Proper Escaping of Special Characters in SQLite $contains Filters

This PR fixes string filtering logic for SQLite collections by correctly escaping % and _ characters in $contains (LIKE) filters. The previous behavior stripped % and left _ unescaped, which could unintentionally match more rows than desired; now both are explicitly escaped, so user queries will interpret these characters as literals.

Key Changes:
• Updated LIKE expression in DocumentOperator::Contains and NotContains to escape % and _ according to SQLite rules, using \ as the escape character.
• Adjusted the test reference matcher to compare the raw pattern, matching the updated SQL logic.
• Added a new failing proptest regression case to ensure coverage of the correct escaping behavior.
• Slight lockfile dependency updates not directly related to this PR's SQLite logic.

Affected Areas:
• rust/segment/src/sqlite_metadata.rs
• rust/segment/src/test.rs
• rust/frontend/tests/test_collection.proptest-regressions
• Cargo.lock

This summary was automatically generated by @propel-code-bot

@codetheweb codetheweb disabled auto-merge June 5, 2025 20:54
@codetheweb
Copy link
Contributor Author

This could be a breaking change AFAICT and should be thought about + communicated. I think its worth thinking about if we want parity with distributed also.

Yes, it's technically a breaking change.

We currently do not have parity between local and distributed in terms of handling _. After this PR, we will have parity.

@codetheweb codetheweb enabled auto-merge (squash) June 5, 2025 21:03
@codetheweb codetheweb merged commit c755e9f into main Jun 5, 2025
71 checks passed
codetheweb added a commit that referenced this pull request Jun 6, 2025
…4775)

## Description of changes

#4402 caused
`test_filtering.py` to start flaking because the test assertions around
`_` handling were not updated.

## Test plan

_How are these changes tested?_

- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust
Inventrohyder pushed a commit to Inventrohyder/chroma that referenced this pull request Aug 5, 2025
…QLite `$contains` filter (chroma-core#4402)

## Description of changes

I saw a `test_collection_sqlite` proptest failure in CI and was able to
reproduce it locally by updating
`rust/frontend/tests/test_collection.proptest-regressions`.

SQLite has special handling for two characters in a `LIKE` query: `%`
and `_`. We were previously stripping out `%` from user input and
passing `_` through. However, because `_` acts as a wildcard, this can
lead to unexpected results.

After discussing with @Sicheng-Pan we decided to escape both characters,
so users can provide `%` and `_` in their `$contains` query and it will
work as expected (a character literal).

## Test plan

_How are these changes tested?_

Added proptest case was previously failing.

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
Inventrohyder pushed a commit to Inventrohyder/chroma that referenced this pull request Aug 5, 2025
…hroma-core#4775)

## Description of changes

chroma-core#4402 caused
`test_filtering.py` to start flaking because the test assertions around
`_` handling were not updated.

## Test plan

_How are these changes tested?_

- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants