feat: Unify tokenizer API and add support for pdb.icu and pdb.edge_ngram #53

Merged
isaacvando merged 23 commits into main from fix-schema-compat
Apr 20, 2026

Conversation

@isaacvando (Collaborator) commented Apr 16, 2026

Ticket(s) Closed

I realized that we had a set of tokenizer functions in the indexing module that were used only for indexing, and a completely different way to specify tokenizers for queries. This is a poor user experience, since a tokenizer is the same thing in both cases. The function-based approach is nicer: it makes it clearer what is required and allowed, and it makes preventing SQL injection more direct. I moved the previously indexing-specific tokenizer functions out of the indexing module into a new tokenizer module, and updated the query functions to use those tokenizer functions instead of their old idiosyncratic mechanism.
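
To illustrate why a function-based tokenizer API helps with validation and SQL injection, here is a minimal, self-contained sketch. It is not the code from this PR: `TokenizerSpec`, its `render` method, the signature of `edge_ngram`, the option names `lowercase` and `stemmer`, and the rendered `'key=value'` argument form are all assumptions made for illustration. The idea is that a single spec object, built by a function with explicit parameters, can be rendered the same way for indexing and for querying, with every named argument validated and quoted in one place.

```python
import re
from dataclasses import dataclass

# Hypothetical rule: tokenizer names and option keys must be plain identifiers.
_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def _quote(text: str) -> str:
    """Render text as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def _plain(value) -> str:
    """Render an option value as plain text; booleans become true/false."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, str)):
        return str(value)
    raise TypeError(f"unsupported option value: {value!r}")


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical tokenizer description shared by indexing and querying."""

    name: str
    positional: tuple = ()
    named: tuple = ()  # (key, value) pairs, kept in the order they were given

    def render(self) -> str:
        if not _IDENT.fullmatch(self.name):
            raise ValueError(f"invalid tokenizer name: {self.name!r}")
        # Positional arguments are restricted to integers in this sketch.
        parts = [str(int(v)) for v in self.positional]
        # Each named argument becomes its own quoted 'key=value' literal
        # (an assumed output format), so several of them can be combined.
        for key, value in self.named:
            if not _IDENT.fullmatch(key):
                raise ValueError(f"invalid option name: {key!r}")
            parts.append(_quote(f"{key}={_plain(value)}"))
        if not parts:
            return f"pdb.{self.name}"
        return f"pdb.{self.name}({', '.join(parts)})"


def edge_ngram(min_gram: int, max_gram: int, **options) -> TokenizerSpec:
    """Hypothetical builder: required arguments are explicit parameters."""
    if not (isinstance(min_gram, int) and isinstance(max_gram, int)):
        raise TypeError("min_gram and max_gram must be integers")
    if not 1 <= min_gram <= max_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    return TokenizerSpec("edge_ngram", (min_gram, max_gram), tuple(options.items()))


spec = edge_ngram(2, 5, lowercase=True, stemmer="english")
print(spec.render())
```

Because user-supplied values only ever pass through `_quote`, a value containing a single quote stays inside its string literal rather than terminating it, and an invalid option name is rejected before any SQL is produced.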

The examples in the docs that use tokenizers will need to be updated when this is released.

What

  • Unify tokenizer use across indexing and querying
  • Add edge_ngram tokenizer
  • Fix bug with tokenizer compilation that improperly handled multiple named arguments
  • Remove redundant unicode tokenizer use in examples
  • Bump PDB version to 0.23
  • Convert schema compat checks to use json5
  • Update the API coverage script to grep the source instead of parsing it to make it less brittle
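
The last item, grepping the source instead of parsing it, can be sketched roughly as follows. The PR does not show the coverage script, so this is only an assumed shape: the helper names `public_names` and `missing_coverage` and the rule "top-level `def`/`class` names that do not start with an underscore" are hypothetical. The point is that a line-oriented pattern match keeps working even when a file uses syntax that a stricter parser would reject.

```python
import re

# Matches unindented `def name` / `class name` lines whose name starts with a
# letter, which skips private names such as `_helper` and indented methods.
_PUBLIC_DEF = re.compile(r"^(?:def|class)\s+([A-Za-z][A-Za-z0-9_]*)", re.MULTILINE)


def public_names(source: str) -> set[str]:
    """Collect top-level public function and class names by text match."""
    return set(_PUBLIC_DEF.findall(source))


def missing_coverage(source: str, covered: set[str]) -> list[str]:
    """Return public names found in the source that are not in `covered`."""
    return sorted(public_names(source) - set(covered))


example = (
    "def unicode():\n    pass\n\n"
    "def _helper():\n    pass\n\n"
    "class Spec:\n    def method(self):\n        pass\n\n"
    "def edge_ngram(a, b):\n    pass\n"
)
print(missing_coverage(example, {"unicode"}))
```

The trade-off is precision: a text match can be fooled by definitions inside strings or conditionals, which is usually acceptable for a coverage report but would not be for a tool that rewrites code.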

Why

How

Tests

@codecov (bot) commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 95.87629% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.54%. Comparing base (abb42f3) to head (c06be59).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paradedb/sqlalchemy/tokenizer.py | 94.44% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #53      +/-   ##
==========================================
+ Coverage   86.43%   89.54%   +3.10%     
==========================================
  Files          16       17       +1     
  Lines        1342     1262      -80     
  Branches      288      262      -26     
==========================================
- Hits         1160     1130      -30     
+ Misses         99       75      -24     
+ Partials       83       57      -26     

| Flag | Coverage Δ |
|---|---|
| pg15 | 89.54% <95.87%> (+3.10%) ⬆️ |
| pg16 | 89.54% <95.87%> (+3.10%) ⬆️ |
| pg17 | 89.54% <95.87%> (+3.10%) ⬆️ |
| pg18 | 89.54% <95.87%> (+3.10%) ⬆️ |
| py3.10 | 89.54% <95.87%> (+3.10%) ⬆️ |
| py3.11 | 89.54% <95.87%> (+3.10%) ⬆️ |
| py3.12 | 89.54% <95.87%> (+3.10%) ⬆️ |
| py3.13 | 89.54% <95.87%> (+3.10%) ⬆️ |
| py3.14 | 89.54% <95.87%> (+3.10%) ⬆️ |
| sqlalchemy-paradedb | 89.54% <95.87%> (+3.10%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| paradedb/__init__.py | 100.00% <100.00%> (ø) |
| paradedb/sqlalchemy/__init__.py | 100.00% <100.00%> (ø) |
| paradedb/sqlalchemy/_pdb_cast.py | 88.88% <100.00%> (+1.01%) ⬆️ |
| paradedb/sqlalchemy/expr.py | 100.00% <ø> (+10.00%) ⬆️ |
| paradedb/sqlalchemy/indexing.py | 85.81% <100.00%> (+7.51%) ⬆️ |
| paradedb/sqlalchemy/search.py | 92.99% <100.00%> (-0.14%) ⬇️ |
| paradedb/sqlalchemy/tokenizer.py | 94.44% <94.44%> (ø) |

@isaacvando changed the title from "chore: Fix schema compat" to "feat: Unify tokenizer API and add support for pdb.icu and pdb.edge_ngram" on Apr 20, 2026

@isaacvando (Collaborator, Author)

@philippemnoel I've made a lot of changes since you last reviewed this. Mind taking another look? It should be good to go now.

@philippemnoel (Member) left a comment

lgtm

@isaacvando merged commit 4dbc836 into main on Apr 20, 2026
17 checks passed
@isaacvando deleted the fix-schema-compat branch on April 20, 2026 20:54
Development

Successfully merging this pull request may close these issues.

Schema compat failure for ParadeDB v0.23.0

3 participants