Skip to content

Fix mixed samples for schema extraction.#1004

Merged
mdwelsh merged 3 commits into
mainfrom
matt/fix-schema-samples
Nov 7, 2024
Merged

Fix mixed samples for schema extraction.#1004
mdwelsh merged 3 commits into
mainfrom
matt/fix-schema-samples

Conversation

@mdwelsh
Copy link
Copy Markdown
Contributor

@mdwelsh mdwelsh commented Nov 6, 2024

This PR fixes issues where the samples retrieved from OpenSearch are mixed in terms of their types. It addresses two cases:

  • In cases where there is a mix of singleton and list types, we promote to a list.
  • In cases where there is a mix of singleton types (e.g., bools and ints), we promote to string.

elif sample_type == float and t == int:
# No need to change.
pass
elif sample_type == list and t != list:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to work for the query side of it? When we retrieve a non-example document, it doesn't actually matter if a value is an int or a float. Python will silently promote as necessary. However, when we retrieve a value and it's a string and we need a list, I would expect there to be code somewhere else that forces the promotion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. This schema description is only used to populate the LLM prompt used to guide the query planner. There is nothing (currently) in Luna that coerces types as they flow through operators. That would be a bigger change than this.

# Need to upgrade our sample type to match the new list type.
sample_type = t
samples = {str([x]) for x in samples}
elif sample_type != t:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're going to want to have an:

elif sample_type == str and t != str:
   # silently promote to string otherwise mixed type samples will generate lots of warnings
   sample_value = str(sample_value)

In order to avoid tons of log messages on this type of mismatch. I tolerated it on the other one on the expectation you needed to see all the errors to figure out what fix should be done. It's not clear from having tried using this made clear it didn't work.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean just to special case when sample_type is already str?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I want to not get lots of log messages since we're auto handling this case.
Maybe you add a

if not reported[key]:
  logger.warning("Auto coercing <type> to str")
  reported[key] = True

to make sure that we log in the case where the str example was found first would be better (and then set reported[key] = True in the other branch)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done.

Comment thread lib/sycamore/sycamore/tests/unit/query/test_schema.py
@mdwelsh mdwelsh requested a review from eric-anderson November 6, 2024 23:22
@mdwelsh mdwelsh enabled auto-merge (squash) November 7, 2024 00:12
@mdwelsh mdwelsh merged commit 53f5f29 into main Nov 7, 2024
@HenryL27 HenryL27 deleted the matt/fix-schema-samples branch August 30, 2025 00:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants