datalake: add DescribeCatalog admin endpoint#29677
Conversation
70c3173 to
0028a86
Compare
Replace iobuf_istream/stringstream with iobuf::linearize_to_string for extracting HTTP error response body. Change err_msg type from std::string to ss::sstring. Remove unused is_transport_error(). No functional changes intended.
The failure struct used two bools (can_be_retried, aborted) to classify retry outcomes, where aborted took precedence over can_be_retried. This made the state space larger than necessary (4 combinations for 3 states) and left the specific cause of retriable failures opaque. Replace with an error_kind enum that captures why the failure happened: unrecoverable, aborted, retriable_http_status, network_error, or timeout. Rename the struct from failure to request_error. Slim down retries_exhausted to store only the error_kinds and the last error instead of a full http_call_error per attempt.
The http_call_error status variant carried only the HTTP status code, while the truncated response body lived in a separate err_msg field on request_error. This split the error context across two places, making http_call_error incomplete for downstream propagation. Introduce http_status_error to pair the HTTP status with the response body, and use it as the status variant in http_call_error. This makes http_call_error self-contained and lets us drop the redundant err_msg field from request_error.
Extract the HTTP request logic for the GET /v1/config endpoint out of maybe_configure into a standalone public get_config method. This allows callers to fetch catalog configuration independently, and also fixes the probe endpoint which was incorrectly recorded as create_namespace.
Verify that when get_config receives a non-retriable HTTP error, both the status code and the response body are propagated through the http_status_error variant to the caller.
Instead of unconditionally mapping retries_exhausted to timedout, propagate the last HTTP error when available.
0028a86 to
e93e25c
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds an admin endpoint to probe Iceberg catalog connectivity and credential validity, enabling operators and tests to verify catalog configuration without waiting for a full translation cycle. The implementation also includes substantial refactoring of the REST client error handling to better track and propagate error information.
Changes:
- Added
DescribeCatalogRPC to the datalake admin service that checks catalog connectivity by calling the catalog'sget_configendpoint - Refactored REST client error handling to use structured
error_kindenum instead of boolean flags, with better tracking of retry reasons and last error - Moved
catalog_errcenum to a separatecatalog_errors.hheader to support richer error reporting withcatalog_describe_errorstruct
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| proto/redpanda/core/admin/internal/datalake/v1/datalake.proto | Adds DescribeCatalog RPC definition with empty request/response messages |
| src/v/iceberg/catalog_errors.h | New header defining catalog_errc enum and catalog_describe_error struct for rich error reporting |
| src/v/iceberg/catalog.h | Adds describe_catalog() virtual method to catalog interface |
| src/v/iceberg/rest_catalog.{h,cc} | Implements describe_catalog() by calling get_config on the REST client |
| src/v/iceberg/filesystem_catalog.{h,cc} | Implements describe_catalog() as no-op (always succeeds) |
| src/v/iceberg/rest_client/error.h | Refactors error handling with error_kind enum and http_status_error struct |
| src/v/iceberg/rest_client/retry_policy.{h,cc} | Updates retry logic to use error_kind instead of boolean flags |
| src/v/iceberg/rest_client/catalog_client.{h,cc} | Adds public get_config method and refactors error tracking in perform_request |
| src/v/datalake/coordinator/frontend.{h,cc} | Adds describe_catalog method that delegates to coordinator_manager |
| src/v/datalake/coordinator/coordinator_manager.{h,cc} | Implements describe_catalog by calling catalog's describe_catalog method |
| src/v/redpanda/admin/services/datalake/datalake.{h,cc} | Implements RPC handler that calls coordinator frontend's describe_catalog |
| tests/rptest/clients/admin/proto/* | Generated Python protobuf bindings for the new RPC |
| tests/rptest/tests/datalake/rest_catalog_connection_test.py | Adds test that verifies describe_catalog succeeds and fails appropriately with firewall block |
Move the error enum out of the catalog class into a standalone catalog_errc enum in catalog_errors.h so it can be used without pulling in the full catalog interface. Add to_string_view and fmt::formatter support.
e93e25c to
a3b0138
Compare
Add a DescribeCatalog RPC to the datalake admin service that probes the iceberg catalog for connectivity and credential validity. Introduces catalog_describe_error as a rich error type carrying both an error code and a human-readable message, and adds describe_catalog() as a virtual method on the catalog base class. The rest_catalog implementation calls get_config to verify the catalog is reachable. The response is currently empty but may include additional catalog information such as configuration details or capabilities in the future.
a3b0138 to
c9114be
Compare
andrwng
left a comment
There was a problem hiding this comment.
Seems like the beginnings of a nice Iceberg REST catalog client proxy 😄
|
|
||
| /// Rich error for catalog describe, carrying a human-readable message | ||
| /// suitable for surfacing to the user via the admin API. | ||
| struct catalog_describe_error { |
There was a problem hiding this comment.
nit: maybe utils/detailed_error is useful for your needs here? Been using that in some of the newer cloud topics metastore code
There was a problem hiding this comment.
Just tried that. Doesn't bring any benefit for this code as we don't do much with the error/add additional annotations to it. Will not commit it.
| auto h = co_await lock_.get_units(); | ||
| auto res = co_await client_->get_config(rtc); | ||
| if (res.has_value()) { | ||
| co_return outcome::success(); |
There was a problem hiding this comment.
Given the name is "describe" it's a little surprising this doesn't return something (like the catalog config or something, or some general key-value map). I don't mind it, but wondering if you gave some thought to a non-void return type
There was a problem hiding this comment.
Yeah, long term it will probably return stuff. Wanted to get something already useful without investing too much.
We probably want to return catalog reachability error in response rather than as error message but again don't want to overthink it now.
| ss::future<proto::admin::describe_catalog_response> | ||
| datalake_service_impl::describe_catalog( | ||
| serde::pb::rpc::context, proto::admin::describe_catalog_request) { | ||
| if (!_coordinator_fe->local_is_initialized()) { |
There was a problem hiding this comment.
nit: I typically think of the frontend as the "RPC frontend" or "leader router". If we're only trying to test local connectivity, maybe we can plumb the coordinator manager into the datalake service impl and call it directly. That'd also cut down on some work in the future if we want to expand the kinds of requests we want to try
There was a problem hiding this comment.
Had the same thoughts but I'm not sure single catalog idea will stick so this is a bit of future proofing while at the same time avoiding a larger refactor.
Add an admin RPC that sends a config request to the iceberg catalog, allowing operators and tests to verify catalog connectivity and credentials without waiting for a full translation cycle.
Backports Required
Release Notes