Skip to content

fix(chunker): correctly determine chunk midpoint when empty chunks are present#1800

Merged
collindutter merged 1 commit intomainfrom
fix/chunker-empty-subchunks
Mar 4, 2025
Merged

fix(chunker): correctly determine chunk midpoint when empty chunks are present#1800
collindutter merged 1 commit intomainfrom
fix/chunker-empty-subchunks

Conversation

@collindutter
Copy link
Copy Markdown
Member

Describe your changes

Problem:

["foo", '', "bar", 'baz'] is token counted as 'foobarbaz' rather than 'foo bar baz' when getting the midpoint index:

subchunk_tokens_count = self.tokenizer.count_tokens("".join(subchunks[: index + 1]))

This leads to an incorrect midpoint index which results in an incorrect chunk split. In certain cases this can lead to hitting recursive max depth.

Solution:

Join the chunks on the separator that we originally split them on:

subchunks = chunk.strip().split(separator.value)

subchunk_tokens_count = self.tokenizer.count_tokens(separator.value.join(subchunks[: index + 1]))

This correctly calculates the midpoint index which results in a correct chunk split.

Other changes in the PR are updates to the tests because chunk boundaries have changed slightly.

Issue ticket number and link

Closes #1796

@collindutter collindutter added this to the 1.5 milestone Mar 4, 2025
@collindutter collindutter requested a review from a team March 4, 2025 20:09
@collindutter collindutter self-assigned this Mar 4, 2025
@collindutter collindutter enabled auto-merge March 4, 2025 20:10
@collindutter collindutter force-pushed the fix/chunker-empty-subchunks branch from 4b7bb05 to 8f11146 Compare March 4, 2025 21:51
…e present

Previously ["foo", '', "bar", 'baz'] would be token counted as
'foobarbaz' rather than 'foo  bar baz' when getting the midpoint index
@collindutter collindutter force-pushed the fix/chunker-empty-subchunks branch from 8f11146 to fd7d8fb Compare March 4, 2025 22:23
@collindutter collindutter added this pull request to the merge queue Mar 4, 2025
Merged via the queue into main with commit 8ec2a8a Mar 4, 2025
15 checks passed
@collindutter collindutter deleted the fix/chunker-empty-subchunks branch March 4, 2025 22:32
collindutter added a commit that referenced this pull request Mar 4, 2025
…e present (#1800)

Previously ["foo", '', "bar", 'baz'] would be token counted as
'foobarbaz' rather than 'foo  bar baz' when getting the midpoint index
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

maximum recursion depth exceeded using MarkdownChunker.chunk with certain inputs

2 participants