Configure cudf-polars options through environment variables#19369
Configure cudf-polars options through environment variables#19369rapids-bot[bot] merged 7 commits intorapidsai:branch-25.08from
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| ``` | ||
|
|
||
| You can configure the default value to use through environment variables | ||
| with the prefix `CUDF_POLARS__STREAMING__{option_name}`. For example, |
There was a problem hiding this comment.
Probably more consistent to make this prefix CUDF_POLARS__EXECUTOR__ rather than __STREAMING__.
Then the rule is that the environment variable matches the "fully qualified" path in python, relative to config_options. config_options.executor.max_rows_per_partition would be set with CUDF_POLARS__EXECUTOR__MAX_ROWS_PER_PARTITION.
This updates our configuration handling to enable setting the default value through environment variables for ~all of our configuration options. Follow-up to rapidsai#19316. Closes rapidsai#19330
99d74dc to
e9068f5
Compare
|
dd4f456 handles the rename from |
jameslamb
left a comment
There was a problem hiding this comment.
Giving you a ci-codeowners approval (just required because you touched scripts in ci/). Deferring to other cudf maintainers on the substance of these changes.
|
A note on the implementation: I wanted to avoid repeating the name of the field in both the field name and the environment variable name (e.g. "chunked" appearing twice in import dataclasses
import os
import typing
T = typing.TypeVar("T")
class EnvironmentOption[T]:
def __init__(
self, converter: typing.Callable[[str], T], default: T, env_prefix: str
) -> None:
self._converter = converter
self._default = default
self._env_prefix = env_prefix
def __set_name__(self, owner: typing.Any, name: str) -> None:
self._name = name
def __get__(self, obj: typing.Any, type: typing.Any) -> T:
key = f"{self._env_prefix}__{self._name.upper()}"
env_value = os.environ.get(key)
if env_value is None:
return self._default
else:
return self._converter(env_value)
@dataclasses.dataclass(frozen=True)
class Config:
a: int = EnvironmentOption(int, 1, "CUDF_POLARS") # type: ignore[assignment]
config = Config()
print("env", config.a)
config = Config(a=2)
print("arg", config.a)We need the The bigger problem to me was that the defaults from the environment are bound at import time when the class is created, rather than when the instance is created. This made testing it slightly harder. If people want, I can try finishing the descriptor-based implementation and see if anything else breaks. Otherwise, we just live with duplicating the field name in the code. |
| user_executor = os.environ.get(f"{env_prefix}__EXECUTOR", "streaming") | ||
| user_executor_options = engine.config.get("executor_options", {}) | ||
| user_parquet_options = engine.config.get("parquet_options", {}) | ||
| # This is set in polars, and so can't be overridden by the environment |
There was a problem hiding this comment.
Just to emphasize, we can't do
CUDF_POLARS__RAISE_ON_FAIL=1 python ...and have the default be looked up from the environment. That's set to False in polars. If we want, we can change that to None and pass it through here.
rjzamora
left a comment
There was a problem hiding this comment.
I have some smallish questions, but the change looks good. Thanks tom!
| T = TypeVar("T") | ||
|
|
||
|
|
||
| def _make_default_factory( |
There was a problem hiding this comment.
This is probably a big "Nit", but is the typing a bit "off" for sink_to_directory?
For sink_to_directory, the default type is None (rather than bool).
| ) | ||
| sink_to_directory: bool | None = dataclasses.field( | ||
| default_factory=_make_default_factory( | ||
| f"{_env_prefix}__SINK_TO_DIRECTORY", _bool_converter, None |
There was a problem hiding this comment.
See my earlier note/question about whether the None default needs to be reflected in the _make_default_factory type-hinting.
There was a problem hiding this comment.
I think that things are correct, but let me check:
- We let the default
sink_to_directory=None, which means "True ifscheduler=distributed, False forscheduler="synchronous". No change frombranch-25.08here - If you configure
sink_to_directorythrough an environment variable, it must resolve to a boolean. In other words,CUDF_POLARS__EXECUTOR__SINK_TO_DIRETORY=0is allowed (and will be checked later to ensure it's compatible with the scheduler), butCUDF_POLARS__EXECUTOR__SINK_TO_DIRETORY=noneis not allowed. I think this is OK, since AFAICT there's no reason to set that (it's already the default).
So assuming we're OK with not allowing CUDF_POLARS__EXECUTOR__SINK_TO_DIRETORY=none then we should be good. If we do want to allow that, we just need another converter for bool_or_none
There was a problem hiding this comment.
Just to clarify, the functionality definitely seems correct to me.
I'm just noting that the Callable[[], T] return type of _make_default_factory may not be entirely "correct" in this case, because T is bool, but the callable may return None if no environment variable is set?
Feel free to ignore me if I'm just confused :)
There was a problem hiding this comment.
Ah, I see. Somehow mypy infers the type of T here to be bool | None. Adding:
reveal_type(_make_default_factory("foo", _bool_converter, default=None))shows
python/cudf_polars/cudf_polars/utils/config.py:564: note: Revealed type is "def () -> Union[builtins.bool, None]"
That's actually want in this case, but I don't see how it's type safe. I guess that bool | None is the only type that would work here... You've gotten me confused too :)
|
/merge |
4cad4c7
into
rapidsai:branch-25.08
|
Just to note, we might have considered using something like pydantic-settings, but I think we perhaps didn't need that much generality? |
|
It wasn't explicitly discussed, though I considered it. At the I lean against using something like pydantic-settings since IMO the cost of the dependency wouldn't quite make it worth it. But if we want to add more features ( |
Description
This updates our configuration handling to enable setting the default value through environment variables for ~all of our configuration options, rather than just specific ones like
target_partition_size_default.Follow-up to #19316.
Closes #19330