Skip to content

datalake: add DescribeCatalog admin endpoint#29677

Merged
nvartolomei merged 8 commits into
redpanda-data:devfrom
nvartolomei:nv/iceberg-catalog-check-config
Feb 26, 2026
Merged

datalake: add DescribeCatalog admin endpoint#29677
nvartolomei merged 8 commits into
redpanda-data:devfrom
nvartolomei:nv/iceberg-catalog-check-config

Conversation

@nvartolomei
Copy link
Copy Markdown
Contributor

Add an admin RPC that sends a config request to the iceberg catalog, allowing operators and tests to verify catalog connectivity and credentials without waiting for a full translation cycle.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Replace iobuf_istream/stringstream with iobuf::linearize_to_string
for extracting HTTP error response body. Change err_msg type from
std::string to ss::sstring. Remove unused is_transport_error().

No functional changes intended.
The failure struct used two bools (can_be_retried, aborted) to classify
retry outcomes, where aborted took precedence over can_be_retried. This
made the state space larger than necessary (4 combinations for 3 states)
and left the specific cause of retriable failures opaque.

Replace with an error_kind enum that captures why the failure happened:
unrecoverable, aborted, retriable_http_status, network_error, or
timeout. Rename the struct from failure to request_error.

Slim down retries_exhausted to store only the error_kinds and the last
error instead of a full http_call_error per attempt.
The http_call_error status variant carried only the HTTP status code,
while the truncated response body lived in a separate err_msg field on
request_error. This split the error context across two places, making
http_call_error incomplete for downstream propagation.

Introduce http_status_error to pair the HTTP status with the response
body, and use it as the status variant in http_call_error. This makes
http_call_error self-contained and lets us drop the redundant err_msg
field from request_error.
Extract the HTTP request logic for the GET /v1/config endpoint out of
maybe_configure into a standalone public get_config method. This allows
callers to fetch catalog configuration independently, and also fixes
the probe endpoint which was incorrectly recorded as create_namespace.
Verify that when get_config receives a non-retriable HTTP error, both
the status code and the response body are propagated through the
http_status_error variant to the caller.
Instead of unconditionally mapping retries_exhausted to timedout,
propagate the last HTTP error when available.
@nvartolomei nvartolomei force-pushed the nv/iceberg-catalog-check-config branch from 0028a86 to e93e25c Compare February 25, 2026 13:48
@nvartolomei nvartolomei marked this pull request as ready for review February 25, 2026 13:49
Copilot AI review requested due to automatic review settings February 25, 2026 13:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an admin endpoint to probe Iceberg catalog connectivity and credential validity, enabling operators and tests to verify catalog configuration without waiting for a full translation cycle. The implementation also includes substantial refactoring of the REST client error handling to better track and propagate error information.

Changes:

  • Added DescribeCatalog RPC to the datalake admin service that checks catalog connectivity by calling the catalog's get_config endpoint
  • Refactored REST client error handling to use structured error_kind enum instead of boolean flags, with better tracking of retry reasons and last error
  • Moved catalog_errc enum to a separate catalog_errors.h header to support richer error reporting with catalog_describe_error struct

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
proto/redpanda/core/admin/internal/datalake/v1/datalake.proto Adds DescribeCatalog RPC definition with empty request/response messages
src/v/iceberg/catalog_errors.h New header defining catalog_errc enum and catalog_describe_error struct for rich error reporting
src/v/iceberg/catalog.h Adds describe_catalog() virtual method to catalog interface
src/v/iceberg/rest_catalog.{h,cc} Implements describe_catalog() by calling get_config on the REST client
src/v/iceberg/filesystem_catalog.{h,cc} Implements describe_catalog() as no-op (always succeeds)
src/v/iceberg/rest_client/error.h Refactors error handling with error_kind enum and http_status_error struct
src/v/iceberg/rest_client/retry_policy.{h,cc} Updates retry logic to use error_kind instead of boolean flags
src/v/iceberg/rest_client/catalog_client.{h,cc} Adds public get_config method and refactors error tracking in perform_request
src/v/datalake/coordinator/frontend.{h,cc} Adds describe_catalog method that delegates to coordinator_manager
src/v/datalake/coordinator/coordinator_manager.{h,cc} Implements describe_catalog by calling catalog's describe_catalog method
src/v/redpanda/admin/services/datalake/datalake.{h,cc} Implements RPC handler that calls coordinator frontend's describe_catalog
tests/rptest/clients/admin/proto/* Generated Python protobuf bindings for the new RPC
tests/rptest/tests/datalake/rest_catalog_connection_test.py Adds test that verifies describe_catalog succeeds and fails appropriately with firewall block

Comment thread src/v/iceberg/catalog_errors.h
Comment thread proto/redpanda/core/admin/internal/datalake/v1/datalake.proto
@nvartolomei nvartolomei changed the title datalake: add GetCatalogConfig admin endpoint datalake: add DescribeCatalog admin endpoint Feb 25, 2026
Move the error enum out of the catalog class into a standalone
catalog_errc enum in catalog_errors.h so it can be used without
pulling in the full catalog interface. Add to_string_view and
fmt::formatter support.
@nvartolomei nvartolomei force-pushed the nv/iceberg-catalog-check-config branch from e93e25c to a3b0138 Compare February 25, 2026 14:46
Add a DescribeCatalog RPC to the datalake admin service that probes
the iceberg catalog for connectivity and credential validity.

Introduces catalog_describe_error as a rich error type carrying both
an error code and a human-readable message, and adds describe_catalog()
as a virtual method on the catalog base class.

The rest_catalog implementation calls get_config to verify the catalog
is reachable.

The response is currently empty but may include additional catalog
information such as configuration details or capabilities in the
future.
@nvartolomei nvartolomei force-pushed the nv/iceberg-catalog-check-config branch from a3b0138 to c9114be Compare February 25, 2026 14:57
@nvartolomei nvartolomei requested a review from andrwng February 25, 2026 17:23
Copy link
Copy Markdown
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the beginnings of a nice Iceberg REST catalog client proxy 😄


/// Rich error for catalog describe, carrying a human-readable message
/// suitable for surfacing to the user via the admin API.
struct catalog_describe_error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe utils/detailed_error is useful for your needs here? Been using that in some of the newer cloud topics metastore code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried that. Doesn't bring any benefit for this code as we don't do much with the error/add additional annotations to it. Will not commit it.

auto h = co_await lock_.get_units();
auto res = co_await client_->get_config(rtc);
if (res.has_value()) {
co_return outcome::success();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the name is "describe" it's a little surprising this doesn't return something (like the catalog config or something, or some general key-value map). I don't mind it, but wondering if you gave some thought to a non-void return type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, long term it will probably return stuff. Wanted to get something already useful without investing too much.

We probably want to return catalog reachability error in response rather than as error message but again don't want to overthink it now.

ss::future<proto::admin::describe_catalog_response>
datalake_service_impl::describe_catalog(
serde::pb::rpc::context, proto::admin::describe_catalog_request) {
if (!_coordinator_fe->local_is_initialized()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I typically think of the frontend as the "RPC frontend" or "leader router". If we're only trying to test local connectivity, maybe we can plumb the coordinator manager into the datalake service impl and call it directly. That'd also cut down on some work in the future if we want to expand the kinds of requests we want to try

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had the same thoughts but I'm not sure single catalog idea will stick so this is a bit of future proofing while at the same time avoiding a larger refactor.

@nvartolomei nvartolomei merged commit 8bbb3a9 into redpanda-data:dev Feb 26, 2026
31 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants