Fix mixed samples for schema extraction.#1004
Conversation
| elif sample_type == float and t == int: | ||
| # No need to change. | ||
| pass | ||
| elif sample_type == list and t != list: |
There was a problem hiding this comment.
Is this going to work for the query side of it? When we retrieve a non-example document, it doesn't actually matter if a value is an int or a float. Python will silently promote as necessary. However, when we retrieve a value and it's a string and we need a list, I would expect there to be code somewhere else that forces the promotion.
There was a problem hiding this comment.
Right. This schema description is only used to populate the LLM prompt used to guide the query planner. There is nothing (currently) in Luna that coerces types as they flow through operators. That would be a bigger change than this.
| # Need to upgrade our sample type to match the new list type. | ||
| sample_type = t | ||
| samples = {str([x]) for x in samples} | ||
| elif sample_type != t: |
There was a problem hiding this comment.
I think you're going to want to have an:
elif sample_type == str and t != str:
# silently promote to string otherwise mixed type samples will generate lots of warnings
sample_value = str(sample_value)
In order to avoid tons of log messages on this type of mismatch. I tolerated it on the other one on the expectation you needed to see all the errors to figure out what fix should be done. It's not clear from having tried using this made clear it didn't work.
There was a problem hiding this comment.
Oh, you mean just to special case when sample_type is already str?
There was a problem hiding this comment.
Yes, I want to not get lots of log messages since we're auto handling this case.
Maybe you add a
if not reported[key]:
logger.warning("Auto coercing <type> to str")
reported[key] = True
to make sure that we log in the case where the str example was found first would be better (and then set reported[key] = True in the other branch)
This PR fixes issues where the samples retrieved from OpenSearch are mixed in terms of their types. It addresses two cases: