Skip to content

Configure cudf-polars options through environment variables#19369

Merged
rapids-bot[bot] merged 7 commits intorapidsai:branch-25.08from
TomAugspurger:tom/env-config
Jul 16, 2025
Merged

Configure cudf-polars options through environment variables#19369
rapids-bot[bot] merged 7 commits intorapidsai:branch-25.08from
TomAugspurger:tom/env-config

Conversation

@TomAugspurger
Copy link
Copy Markdown
Contributor

Description

This updates our configuration handling to enable setting the default value through environment variables for ~all of our configuration options, rather than just specific ones like target_partition_size_default.

Follow-up to #19316.

Closes #19330

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Jul 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@TomAugspurger TomAugspurger added the non-breaking Non-breaking change label Jul 14, 2025
@TomAugspurger TomAugspurger added the improvement Improvement / enhancement to an existing function label Jul 14, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jul 14, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jul 14, 2025
```

You can configure the default value to use through environment variables
with the prefix `CUDF_POLARS__STREAMING__{option_name}`. For example,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more consistent to make this prefix CUDF_POLARS__EXECUTOR__ rather than __STREAMING__.

Then the rule is that the environment variable matches the "fully qualified" path in python, relative to config_options. config_options.executor.max_rows_per_partition would be set with CUDF_POLARS__EXECUTOR__MAX_ROWS_PER_PARTITION.

This updates our configuration handling to enable setting the default
value through environment variables for ~all of our configuration
options.

Follow-up to rapidsai#19316.

Closes rapidsai#19330
@TomAugspurger
Copy link
Copy Markdown
Contributor Author

dd4f456 handles the rename from CUDF_POLARS__STREAMING to CUDF_POLARS__EXECUTOR, if we want to go with that. That was never released, so we don't need to worry about compatibility shims.

@TomAugspurger TomAugspurger marked this pull request as ready for review July 14, 2025 21:20
@TomAugspurger TomAugspurger requested review from a team as code owners July 14, 2025 21:20
Copy link
Copy Markdown
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving you a ci-codeowners approval (just required because you touched scripts in ci/). Deferring to other cudf maintainers on the substance of these changes.

@TomAugspurger
Copy link
Copy Markdown
Contributor Author

A note on the implementation: I wanted to avoid repeating the name of the field in both the field name and the environment variable name (e.g. "chunked" appearing twice in chunked: bool = dataclasses.field(default_factory=_make_default_factory(f"{_env_prefix}__CHUNKED", _bool_converter, default=True))). I had hoped to use a descriptor field, which lets you know what the field's name is (so we can look it up from the environment). This kind of works:

import dataclasses
import os
import typing

T = typing.TypeVar("T")


class EnvironmentOption[T]:
    def __init__(
        self, converter: typing.Callable[[str], T], default: T, env_prefix: str
    ) -> None:
        self._converter = converter
        self._default = default
        self._env_prefix = env_prefix

    def __set_name__(self, owner: typing.Any, name: str) -> None:
        self._name = name

    def __get__(self, obj: typing.Any, type: typing.Any) -> T:
        key = f"{self._env_prefix}__{self._name.upper()}"
        env_value = os.environ.get(key)
        if env_value is None:
            return self._default
        else:
            return self._converter(env_value)


@dataclasses.dataclass(frozen=True)
class Config:
    a: int = EnvironmentOption(int, 1, "CUDF_POLARS")  # type: ignore[assignment]


config = Config()
print("env", config.a)

config = Config(a=2)
print("arg", config.a)

We need the type: ignore[assignment] there since the actual type the user sees is the result of EnvironmentOption.__get__, rather than EnvironmentOption. I think that's unavoidable, but fine.

The bigger problem to me was that the defaults from the environment are bound at import time when the class is created, rather than when the instance is created. This made testing it slightly harder. If people want, I can try finishing the descriptor-based implementation and see if anything else breaks. Otherwise, we just live with duplicating the field name in the code.

user_executor = os.environ.get(f"{env_prefix}__EXECUTOR", "streaming")
user_executor_options = engine.config.get("executor_options", {})
user_parquet_options = engine.config.get("parquet_options", {})
# This is set in polars, and so can't be overridden by the environment
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to emphasize, we can't do

CUDF_POLARS__RAISE_ON_FAIL=1 python ...

and have the default be looked up from the environment. That's set to False in polars. If we want, we can change that to None and pass it through here.

Copy link
Copy Markdown
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some smallish questions, but the change looks good. Thanks tom!

T = TypeVar("T")


def _make_default_factory(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a big "Nit", but is the typing a bit "off" for sink_to_directory?
For sink_to_directory, the default type is None (rather than bool).

)
sink_to_directory: bool | None = dataclasses.field(
default_factory=_make_default_factory(
f"{_env_prefix}__SINK_TO_DIRECTORY", _bool_converter, None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my earlier note/question about whether the None default needs to be reflected in the _make_default_factory type-hinting.

Copy link
Copy Markdown
Contributor Author

@TomAugspurger TomAugspurger Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that things are correct, but let me check:

  • We let the default sink_to_directory=None, which means "True if scheduler=distributed, False for scheduler="synchronous". No change from branch-25.08 here
  • If you configure sink_to_directory through an environment variable, it must resolve to a boolean. In other words, CUDF_POLARS__EXECUTOR__SINK_TO_DIRETORY=0 is allowed (and will be checked later to ensure it's compatible with the scheduler), but CUDF_POLARS__EXECUTOR__SINK_TO_DIRETORY=none is not allowed. I think this is OK, since AFAICT there's no reason to set that (it's already the default).

So assuming we're OK with not allowing CUDF_POLARS__EXECUTOR__SINK_TO_DIRETORY=none then we should be good. If we do want to allow that, we just need another converter for bool_or_none

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, the functionality definitely seems correct to me.

I'm just noting that the Callable[[], T] return type of _make_default_factory may not be entirely "correct" in this case, because T is bool, but the callable may return None if no environment variable is set?

Feel free to ignore me if I'm just confused :)

Copy link
Copy Markdown
Contributor Author

@TomAugspurger TomAugspurger Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Somehow mypy infers the type of T here to be bool | None. Adding:

reveal_type(_make_default_factory("foo", _bool_converter, default=None))

shows

python/cudf_polars/cudf_polars/utils/config.py:564: note: Revealed type is "def () -> Union[builtins.bool, None]"

That's actually want in this case, but I don't see how it's type safe. I guess that bool | None is the only type that would work here... You've gotten me confused too :)

@TomAugspurger
Copy link
Copy Markdown
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 4cad4c7 into rapidsai:branch-25.08 Jul 16, 2025
172 of 174 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Jul 16, 2025
@wence-
Copy link
Copy Markdown
Contributor

wence- commented Jul 24, 2025

Just to note, we might have considered using something like pydantic-settings, but I think we perhaps didn't need that much generality?

@TomAugspurger
Copy link
Copy Markdown
Contributor Author

It wasn't explicitly discussed, though I considered it. At the I lean against using something like pydantic-settings since IMO the cost of the dependency wouldn't quite make it worth it. But if we want to add more features (.env file parsing, ...) then we could reevaluate. The design here should be consistent with the environment variables that users set.

@TomAugspurger TomAugspurger deleted the tom/env-config branch July 24, 2025 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[FEA] Allow setting all cudf-polars configuration options with environment variables

5 participants