[FEA] Allow setting all cudf-polars configuration options with environment variables #19330
Description
Is your feature request related to a problem? Please describe.
cudf-polars includes many configuration options (https://github.com/rapidsai/cudf/blob/4d2d0ae4a41165568148d31e2f19bf19129c879f/python/cudf_polars/cudf_polars/utils/config.py). When using the polars API, these are provided to pl.GPUEngine(), which is then passed along to .collect(), .sink_*(), etc.
Describe the solution you'd like
It would be helpful to configure those options when you don't have the ability to change the source code, or when using cudf-polars via a wrapper like narwhals. A way to do this is through environment variables.
#19316 is doing this for StreamingExecutor.target_partition_size. We could consider extending that to any configuration option. The environment variable would be CUDF_POLARS__<option_name>, where option_name is the "fully qualified" name of the option relative to ConfigOptions (i.e. for executor options, it includes the name of the executor: streaming or in_memory).
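To make the naming scheme concrete, here is a rough sketch of how a fully qualified option name could map to an environment variable. The upper-casing, the double-underscore separator between path segments, and the helper names env_var_name and read_int_option are all assumptions for illustration; none of this is existing cudf-polars API.

```python
import os

PREFIX = "CUDF_POLARS"  # prefix proposed in this issue


def env_var_name(*path: str) -> str:
    """Build the variable name for an option path (hypothetical helper).

    Assumes segments are joined with a double underscore and upper-cased.
    """
    return "__".join((PREFIX, *path)).upper()


def read_int_option(default: int, *path: str) -> int:
    """Return the integer override from the environment, or the default."""
    raw = os.environ.get(env_var_name(*path))
    return default if raw is None else int(raw)


# The executor name is part of the fully qualified option name.
name = env_var_name("streaming", "target_partition_size")
```

With this scheme, a user who cannot touch the source would export the variable named by `name` before starting the program.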
Describe alternatives you've considered
We could have a global configuration object, or some Python API for managing the defaults used:
import cudf_polars.utils.config
cudf_polars.utils.config.streaming__target_partition_size = ...
which users could set at the start of their program. This would require the ability to modify the source code.
Additional context
As for the actual implementation, I'd like to see a couple things:
- Automatically derive the environment variable name from the type declaration. If we have to do this manually, we'd risk typos causing issues.
- Automatically derive the "converter" (from the string environment variable to the in-memory type) from the type declaration.