[Data] Add credential provider abstraction for Databricks UC datasource#60457
Conversation
Introduce a credential provider pattern for Databricks authentication, enabling custom credential sources while maintaining backward compatibility.

Changes:
- Add DatabricksCredentialProvider base class and StaticCredentialProvider
- Add credential_provider parameter to DatabricksUCDatasource and read_databricks_tables()
- Add retry on 401 with credential invalidation
- Move Databricks tests to a dedicated test file

The credential provider abstraction allows users to implement custom credential sources that support token refresh and other authentication patterns beyond static tokens.

Backward compatibility: Existing code using the token parameter or the DATABRICKS_TOKEN environment variable continues to work unchanged.

Signed-off-by: ankur <ankur@anyscale.com>
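To illustrate the pattern the description outlines, here is a minimal sketch of what a provider base class and a static-token implementation could look like. This is a hypothetical illustration, not the exact Ray Data API; method names such as invalidate() are assumptions based on the description's mention of credential invalidation.

```python
import abc


class DatabricksCredentialProvider(abc.ABC):
    """Supplies the token used for Databricks REST calls (hypothetical sketch)."""

    @abc.abstractmethod
    def get_token(self) -> str:
        """Return a currently valid API token."""

    def invalidate(self) -> None:
        """Hook called after a 401 so stateful providers can refresh; no-op by default."""


class StaticCredentialProvider(DatabricksCredentialProvider):
    """Wraps a fixed token; invalidate() stays a no-op since nothing can be refreshed."""

    def __init__(self, token: str):
        self._token = token

    def get_token(self) -> str:
        return self._token
```

A refresh-capable provider would override invalidate() to drop its cached token, so the next get_token() call fetches a fresh one.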
Code Review
This pull request introduces a new, extensible Databricks credential provider system for Ray Data. It defines an abstract DatabricksCredentialProvider base class and provides concrete implementations for static tokens (StaticCredentialProvider) and environment variables (EnvironmentCredentialProvider), the latter including logic to detect the Databricks host from the runtime environment.

The DatabricksUCDatasource was refactored to utilize this new provider, dynamically fetching authentication tokens for each request and implementing a single-retry mechanism for 401 Unauthorized responses during both statement polling and chunk resolution. The read_databricks_tables API was updated to accept an optional credential_provider parameter, replacing its previous direct environment variable parsing. Corresponding unit tests were added for the new credential providers and their integration, including tests for the 401 retry logic.

Review comments highlighted areas for improvement, such as making the EnvironmentCredentialProvider's host resolution error message more precise, evaluating the portability of the _detect_databricks_host method's reliance on IPython, and enhancing the 401 retry logic in both the polling and chunk resolution loops with a maximum retry limit or backoff strategy to prevent potential infinite loops. Additionally, a comment noted the importance of ensuring the DatabricksCredentialProvider instance is serializable if it holds state for remote workers.
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 005cdf5265
Force-pushed 324b3e0 to 5bcf76b
goutamvenkat-anyscale left a comment:

left a few comments
    token_env_var: str = "DATABRICKS_TOKEN",
    host_env_var: str = "DATABRICKS_HOST",

Let's make these constants.
    except Exception:
        pass

Why are we bypassing all exceptions?
Logged the exception here. If the data runtime is not available and the host is also not set, then it is handled by the current flow.
    headers = _build_headers(credential_provider_for_tasks)
    resolve_response = requests.get(
        resolve_external_link_url, headers=headers
    )

Make this into a reusable function.
    _STATEMENT_EXEC_POLL_TIME_S = 1


    def _build_headers(credential_provider: DatabricksCredentialProvider) -> dict:

Nit: annotate the return type as dict[str, str].
    )


    def test_databricks_uc_datasource():

Why is this being nuked?
This was just moved to another file. I have updated the tests per the current guidelines as well as your comments.
        .reset_index(drop=True)
    )

    pd.testing.assert_frame_equal(result, expected_result_df)

Use rows_same.
        .sort_values("c1")
        .reset_index(drop=True)
    )
    pd.testing.assert_frame_equal(result, expected_result_df)

Use rows_same.
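The exact signature of the rows_same test utility suggested above is not shown in this thread, so here is a hypothetical order-insensitive stand-in that captures the idea: compare row sets rather than exact row order, removing the need for the sort_values/reset_index boilerplate in each test.

```python
import pandas as pd


def rows_same(actual: pd.DataFrame, expected: pd.DataFrame) -> bool:
    """True if both frames contain the same rows, ignoring row order."""
    cols = sorted(actual.columns)
    a = actual[cols].sort_values(cols).reset_index(drop=True)
    b = expected[cols].sort_values(cols).reset_index(drop=True)
    return a.equals(b)
```

With a helper like this, the assertions above collapse to assert rows_same(result, expected_result_df).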
- Add module-level constants DEFAULT_TOKEN_ENV_VAR and DEFAULT_HOST_ENV_VAR for environment variable names
- Add logging to _detect_databricks_host() instead of silently swallowing exceptions
- Extract 401 retry logic into a reusable _request_with_401_retry() function to reduce code duplication
Force-pushed b110106 to 8acd939
Force-pushed f56d718 to 3ae45c6
Force-pushed 3ae45c6 to 4ba7c89
Force-pushed 17e4adb to afbf7a5
/gemini summary
Summary of Changes

This pull request significantly enhances the Databricks Unity Catalog datasource by introducing a robust and extensible credential provider abstraction. This change improves authentication flexibility, allowing users to define custom credential sources and enabling automatic token refresh mechanisms. It also adds resilience to transient authentication failures by implementing a 401 retry mechanism with credential invalidation, ensuring smoother data operations.
Activity
…ce (ray-project#60457)

Introduce a credential provider pattern for Databricks authentication, enabling custom credential sources while maintaining backward compatibility.

Changes:
- Add DatabricksCredentialProvider base class with StaticCredentialProvider and EnvironmentCredentialProvider implementations
- Add credential_provider parameter to DatabricksUCDatasource and read_databricks_tables()
- Add UnityCatalogConnector class for reading Unity Catalog tables directly (supports Delta/Parquet formats with AWS, Azure, and GCP credential handoff)
- Add retry on 401 with credential invalidation via shared request_with_401_retry() helper
- Centralize common code (build_headers, request_with_401_retry) in databricks_credentials.py
- Move Databricks tests to dedicated test files with shared test utilities

The credential provider abstraction allows users to implement custom credential sources that support token refresh and other authentication patterns beyond static tokens.

Backward compatibility: Existing code using the DATABRICKS_TOKEN and DATABRICKS_HOST environment variables continues to work unchanged.

Signed-off-by: ankur <ankur@anyscale.com>