
feat: use token-based sizing for embedding chunking#749

Merged
lfnovo merged 4 commits into lfnovo:main from unendless314:feat-token-based-chunking-cjk on Apr 19, 2026

Conversation

@unendless314 (Contributor)

Description

This PR replaces character-based sizing with token-based sizing in the embedding chunking pipeline.

Specifically, it updates chunking and embedding threshold checks to use the repository's existing token_count() contract instead of len(...), which improves chunk sizing consistency for CJK and mixed-language content.

As part of this behavior change, the default OPEN_NOTEBOOK_CHUNK_SIZE is also reduced from 1200 to 512, since chunk size semantics now refer to tokens rather than characters and 512 is a safer cross-provider baseline.
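The token-vs-character distinction the description draws can be sketched in a few lines. This is a hypothetical illustration, not the repository's code: the real `token_count()` presumably wraps a proper tokenizer (the commit history mentions o200k_base), while this stand-in simply counts word runs and CJK characters individually.

```python
import re

CHUNK_SIZE = 512  # new token-based default (formerly 1200 characters)

def token_count(text: str) -> int:
    # Hypothetical stand-in for the repository's token_count() contract:
    # CJK characters count individually, other word runs count as one.
    return len(re.findall(r"[\u4e00-\u9fff]|\w+", text))

def needs_chunking(text: str) -> bool:
    # The PR's change in a nutshell: compare token_count(text), not len(text).
    return token_count(text) > CHUNK_SIZE

english = "token based sizing " * 64   # 1216 characters, 192 pseudo-tokens
assert len(english) == 1216            # the old len() check would chunk this
assert token_count(english) == 192     # the token check leaves it whole
assert not needs_chunking(english)
```

For CJK text the mismatch runs the other way: a few hundred CJK characters can expand to far more subword tokens than characters, which is why character-based sizing tended to produce oversized chunks for CJK and mixed-language content.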

Related Issue

Fixes #542

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

How Has This Been Tested?

  • Tested locally with Docker
  • Tested locally with development setup
  • Added new unit tests
  • Existing tests pass (uv run pytest)
  • Manual testing performed (describe below)

Test Details:

  • Updated unit tests for token-based chunk sizing in tests/test_chunking.py
  • Updated embedding tests in tests/test_embedding.py to validate token-based long-text routing
  • Ran:
    • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/test_chunking.py
    • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/test_embedding.py
  • Result: 50 passed

Design Alignment

Which design principles does this PR support? (See DESIGN_PRINCIPLES.md)

  • Privacy First
  • Simplicity Over Features
  • API-First Architecture
  • Multi-Provider Flexibility
  • Extensibility Through Standards
  • Async-First for Performance

Explanation:
This change keeps the implementation simple by reusing the repository's existing token_count() contract instead of introducing a second tokenizer path. It also better supports multi-provider embedding usage by using token-based chunk sizing and a more conservative
default of 512 tokens.

Checklist

Code Quality

  • My code follows PEP 8 style guidelines (Python)
  • My code follows TypeScript best practices (Frontend)
  • I have added type hints to my code (Python)
  • I have added JSDoc comments where appropriate (TypeScript)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran linting: make ruff or ruff check . --fix
  • I ran type checking: make lint or uv run python -m mypy .

Documentation

  • I have updated the relevant documentation in /docs (if applicable)
  • I have added/updated docstrings for new/modified functions
  • I have updated the API documentation (if API changes were made)
  • I have added comments to complex logic

Database Changes

  • I have created migration scripts for any database schema changes (in /migrations)
  • Migration includes both up and down scripts
  • Migration has been tested locally

Breaking Changes

  • This PR includes breaking changes
  • I have documented the migration path for users
  • I have updated MIGRATION.md (if applicable)

Screenshots (if applicable)

N/A

Additional Context

This PR intentionally keeps the existing token_count() fallback behavior unchanged. The goal here is to remove character-based chunk sizing from the embedding pipeline, not to redesign tokenizer fallback behavior in the same change set.

Pre-Submission Verification

Before submitting, please verify:

  • I have read CONTRIBUTING.md
  • I have read DESIGN_PRINCIPLES.md
  • This PR addresses an approved issue that was assigned to me
  • I have not included unrelated changes in this PR
  • My PR title follows conventional commits format (e.g., "feat: add user authentication")


@cubic-dev-ai (Bot) left a comment


1 issue found across 5 files

Confidence score: 4/5

  • This PR looks safe to merge with minimal risk overall, but there is a moderate performance concern (severity 5/10, high confidence) rather than a functional correctness break.
  • In open_notebook/utils/embedding.py, debug-metric tokenization runs unconditionally, which can add avoidable CPU/latency overhead on every embedding batch call.
  • Pay close attention to open_notebook/utils/embedding.py: gate or defer debug tokenization so the normal embedding path does not pay the extra cost.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="open_notebook/utils/embedding.py">

<violation number="1" location="open_notebook/utils/embedding.py:124">
P2: Expensive tokenization for debug metrics is executed unconditionally, adding avoidable overhead on every embedding batch call.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
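The gating the review asks for is a standard logging guard. The sketch below is illustrative, not the actual open_notebook/utils/embedding.py: the function name `embed_batch` and the simplified `token_count` helper are assumptions.

```python
import logging

logger = logging.getLogger("open_notebook.embedding")

def token_count(text: str) -> int:
    # Simplified stand-in for the repository's tokenizer helper.
    return len(text.split())

def embed_batch(texts: list[str]) -> None:
    # Guard the debug-metric tokenization: at the default WARNING level,
    # isEnabledFor(DEBUG) is False, so the per-text token counting
    # (the overhead flagged by the review) is skipped entirely.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("batch token counts: %s", [token_count(t) for t in texts])
    # ... the embedding provider call would follow here ...

embed_batch(["token based sizing", "embedding chunk"])
```

The `isEnabledFor` check makes the cost opt-in: users who want the metrics enable DEBUG logging, and everyone else pays nothing on the hot path.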

Comment thread on open_notebook/utils/embedding.py (outdated)
unendless314 and others added 4 commits April 19, 2026 13:37
The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 while still being a large
improvement over the old character-based default for most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
Target release is 1.8.5 — moving the Changed section out of Unreleased.
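The headroom reasoning in the commit message above can be checked with quick arithmetic. The constant names here are illustrative, not repository constants:

```python
EMBEDDER_CONTEXT = 512   # BERT-family context window (e.g. mxbai-embed-large)
SPECIAL_TOKENS = 2       # [CLS] and [SEP] also consume context budget
CHUNK_SIZE = 400         # revised OPEN_NOTEBOOK_CHUNK_SIZE default

# Tokens of slack left for tokenizer mismatch and splitter overshoot:
slack = EMBEDDER_CONTEXT - SPECIAL_TOKENS - CHUNK_SIZE   # 110 tokens
# Fractional headroom below the context window:
margin = 1 - CHUNK_SIZE / EMBEDDER_CONTEXT               # ~0.219, i.e. ~20%
```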
@lfnovo force-pushed the feat-token-based-chunking-cjk branch from dd94b8b to 05ce03d on April 19, 2026 16:39
@lfnovo merged commit 6aabacf into lfnovo:main on Apr 19, 2026
8 checks passed


Development

Successfully merging this pull request may close these issues.

[Feature]: Improve Chunking Strategy Using Token-Based Measurement
