fix(chunker): correctly determine chunk midpoint when empty chunks are present #1800
Merged
collindutter merged 1 commit into main on Mar 4, 2025
cjkindel approved these changes on Mar 4, 2025
4b7bb05 to 8f11146
…e present Previously ["foo", '', "bar", 'baz'] would be token counted as 'foobarbaz' rather than 'foo bar baz' when getting the midpoint index
8f11146 to fd7d8fb
collindutter added a commit that referenced this pull request on Mar 4, 2025
…e present (#1800) Previously ["foo", '', "bar", 'baz'] would be token counted as 'foobarbaz' rather than 'foo bar baz' when getting the midpoint index
Describe your changes
Problem:
["foo", '', "bar", 'baz'] is token counted as 'foobarbaz' rather than 'foo bar baz' when getting the midpoint index:
griptape/griptape/chunkers/base_chunker.py, Line 106 in 41ad7f5
This produces an incorrect midpoint index, which in turn produces an incorrect chunk split. In certain cases the bad split can cause the chunker to hit the maximum recursion depth.
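To make the miscount concrete, here is a minimal sketch. It assumes a single-space separator and uses a hypothetical whitespace-based `count_tokens` as a stand-in for the real tokenizer; neither is taken from base_chunker.py, and a real tokenizer would give different absolute numbers.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


separator = " "  # assumed separator, for illustration only
text = "foo  bar baz"  # two spaces after "foo" produce an empty subchunk

subchunks = text.split(separator)  # ["foo", "", "bar", "baz"]

# Joining on the empty string fuses the words together.
joined_without_separator = "".join(subchunks)  # "foobarbaz"
# Joining on the original separator round-trips to the original text.
joined_with_separator = separator.join(subchunks)

print(count_tokens(joined_without_separator))  # 1
print(count_tokens(joined_with_separator))  # 3
```

With this stand-in, the fused string is counted as a single token while the original text has three, so any midpoint derived from the fused count is compared against the wrong totals.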
Solution:
Join the chunks on the separator that we originally split them on:
griptape/griptape/chunkers/base_chunker.py, Line 56 in 41ad7f5
griptape/griptape/chunkers/base_chunker.py, Line 106 in 4b7bb05
The midpoint index is then calculated correctly, which results in a correct chunk split.
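The effect on the midpoint search can be sketched as follows. This is not the code from base_chunker.py: `find_midpoint_index` and `count_tokens` are hypothetical names, the tokenizer is a whitespace-based stand-in, and the search simply picks the subchunk index whose prefix token count is closest to half the total.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def find_midpoint_index(subchunks: list[str], joiner: str, half_token_count: int) -> int:
    """Return the index whose joined prefix is closest to half_token_count tokens."""
    best_index = -1
    best_diff = float("inf")
    for index in range(len(subchunks)):
        # Rebuild the text up to and including this subchunk.
        prefix = joiner.join(subchunks[: index + 1])
        diff = abs(count_tokens(prefix) - half_token_count)
        if diff < best_diff:
            best_index, best_diff = index, diff
    return best_index


separator = " "  # assumed separator, for illustration only
text = "one  two three four five six"  # double space yields an empty subchunk
subchunks = text.split(separator)  # ["one", "", "two", "three", "four", "five", "six"]
half_token_count = count_tokens(text) // 2  # 6 // 2 == 3

print(find_midpoint_index(subchunks, separator, half_token_count))  # 3
print(find_midpoint_index(subchunks, "", half_token_count))  # 0
```

Joining on the separator puts the split after "three", which balances the two halves. Joining on the empty string makes every prefix look like one token, so the search never improves on index 0 and the split is lopsided, the kind of split that can keep a recursive chunker from making progress.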
The other changes in this PR update the tests, because chunk boundaries have shifted slightly.
Issue ticket number and link
Closes #1796