
feat: use token-based sizing for embedding chunking#749

Merged
lfnovo merged 4 commits into lfnovo:main from unendless314:feat-token-based-chunking-cjk on Apr 19, 2026

Conversation

@unendless314 (Contributor)

Description

This PR replaces character-based sizing with token-based sizing in the embedding chunking pipeline.

Specifically, it updates chunking and embedding threshold checks to use the repository's existing token_count() contract instead of len(...), which improves chunk sizing consistency for CJK and mixed-language content.

As part of this behavior change, the default OPEN_NOTEBOOK_CHUNK_SIZE is also reduced from 1200 to 512, since chunk size semantics now refer to tokens rather than characters and 512 is a safer cross-provider baseline.
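The token-vs-character distinction the description draws can be sketched in a few lines. This is a hypothetical illustration, not the repository's code: the real `token_count()` presumably wraps a proper tokenizer (the commit history mentions o200k_base), while this stand-in simply counts word runs and CJK characters individually.

```python
import re

CHUNK_SIZE = 512  # new token-based default (formerly 1200 characters)

def token_count(text: str) -> int:
    # Hypothetical stand-in for the repository's token_count() contract:
    # CJK characters count individually, other word runs count as one.
    return len(re.findall(r"[\u4e00-\u9fff]|\w+", text))

def needs_chunking(text: str) -> bool:
    # The PR's change in a nutshell: compare token_count(text), not len(text).
    return token_count(text) > CHUNK_SIZE

english = "token based sizing " * 64   # 1216 characters, 192 pseudo-tokens
assert len(english) == 1216            # the old len() check would chunk this
assert token_count(english) == 192     # the token check leaves it whole
assert not needs_chunking(english)
```

For CJK text the mismatch runs the other way: a few hundred CJK characters can expand to far more subword tokens than characters, which is why character-based sizing tended to produce oversized chunks for CJK and mixed-language content.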

Related Issue

Fixes #542

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

How Has This Been Tested?

  • Tested locally with Docker
  • Tested locally with development setup
  • Added new unit tests
  • Existing tests pass (uv run pytest)
  • Manual testing performed (describe below)

Test Details:

  • Updated unit tests for token-based chunk sizing in tests/test_chunking.py
  • Updated embedding tests in tests/test_embedding.py to validate token-based long-text routing
  • Ran:
    • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/test_chunking.py
    • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/test_embedding.py
  • Result: 50 passed

Design Alignment

Which design principles does this PR support? (See DESIGN_PRINCIPLES.md)

  • Privacy First
  • Simplicity Over Features
  • API-First Architecture
  • Multi-Provider Flexibility
  • Extensibility Through Standards
  • Async-First for Performance

Explanation:
This change keeps the implementation simple by reusing the repository's existing token_count() contract instead of introducing a second tokenizer path. It also better supports multi-provider embedding usage by using token-based chunk sizing and a more conservative
default of 512 tokens.

Checklist

Code Quality

  • My code follows PEP 8 style guidelines (Python)
  • My code follows TypeScript best practices (Frontend)
  • I have added type hints to my code (Python)
  • I have added JSDoc comments where appropriate (TypeScript)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran linting: make ruff or ruff check . --fix
  • I ran type checking: make lint or uv run python -m mypy .

Documentation

  • I have updated the relevant documentation in /docs (if applicable)
  • I have added/updated docstrings for new/modified functions
  • I have updated the API documentation (if API changes were made)
  • I have added comments to complex logic

Database Changes

  • I have created migration scripts for any database schema changes (in /migrations)
  • Migration includes both up and down scripts
  • Migration has been tested locally

Breaking Changes

  • This PR includes breaking changes
  • I have documented the migration path for users
  • I have updated MIGRATION.md (if applicable)

Screenshots (if applicable)

N/A

Additional Context

This PR intentionally keeps the existing token_count() fallback behavior unchanged. The goal here is to remove character-based chunk sizing from the embedding pipeline, not to redesign tokenizer fallback behavior in the same change set.

Pre-Submission Verification

Before submitting, please verify:

  • I have read CONTRIBUTING.md
  • I have read DESIGN_PRINCIPLES.md
  • This PR addresses an approved issue that was assigned to me
  • I have not included unrelated changes in this PR
  • My PR title follows conventional commits format (e.g., "feat: add user authentication")


@cubic-dev-ai (Bot) left a comment


1 issue found across 5 files

Confidence score: 4/5

  • This PR looks safe to merge with minimal risk overall, but there is a moderate performance concern (severity 5/10, high confidence) rather than a functional correctness break.
  • In open_notebook/utils/embedding.py, debug-metric tokenization runs unconditionally, which can add avoidable CPU/latency overhead on every embedding batch call.
  • Pay close attention to open_notebook/utils/embedding.py: gate or defer debug tokenization so the normal embedding path does not pay the extra cost.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="open_notebook/utils/embedding.py">

<violation number="1" location="open_notebook/utils/embedding.py:124">
P2: Expensive tokenization for debug metrics is executed unconditionally, adding avoidable overhead on every embedding batch call.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
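The gating the review asks for is a standard logging guard. The sketch below is illustrative, not the actual open_notebook/utils/embedding.py: the function name `embed_batch` and the simplified `token_count` helper are assumptions.

```python
import logging

logger = logging.getLogger("open_notebook.embedding")

def token_count(text: str) -> int:
    # Simplified stand-in for the repository's tokenizer helper.
    return len(text.split())

def embed_batch(texts: list[str]) -> None:
    # Guard the debug-metric tokenization: at the default WARNING level,
    # isEnabledFor(DEBUG) is False, so the per-text token counting
    # (the overhead flagged by the review) is skipped entirely.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("batch token counts: %s", [token_count(t) for t in texts])
    # ... the embedding provider call would follow here ...

embed_batch(["token based sizing", "embedding chunk"])
```

The `isEnabledFor` check makes the cost opt-in: users who want the metrics enable DEBUG logging, and everyone else pays nothing on the hot path.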

Comment thread on open_notebook/utils/embedding.py (outdated)
unendless314 and others added 4 commits April 19, 2026 13:37
The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 while still being a large
improvement over the old character-based default for most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
Target release is 1.8.5 — moving the Changed section out of Unreleased.
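The headroom reasoning in the commit message above can be checked with quick arithmetic. The constant names here are illustrative, not repository constants:

```python
EMBEDDER_CONTEXT = 512   # BERT-family context window (e.g. mxbai-embed-large)
SPECIAL_TOKENS = 2       # [CLS] and [SEP] also consume context budget
CHUNK_SIZE = 400         # revised OPEN_NOTEBOOK_CHUNK_SIZE default

# Tokens of slack left for tokenizer mismatch and splitter overshoot:
slack = EMBEDDER_CONTEXT - SPECIAL_TOKENS - CHUNK_SIZE   # 110 tokens
# Fractional headroom below the context window:
margin = 1 - CHUNK_SIZE / EMBEDDER_CONTEXT               # ~0.219, i.e. ~20%
```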
@lfnovo force-pushed the feat-token-based-chunking-cjk branch from dd94b8b to 05ce03d on April 19, 2026 16:39
@lfnovo merged commit 6aabacf into lfnovo:main on Apr 19, 2026
8 checks passed


Development

Successfully merging this pull request may close these issues.

[Feature]: Improve Chunking Strategy Using Token-Based Measurement
