Fix audit log client on upgrade by michael-redpanda · Pull Request #28852 · redpanda-data/redpanda

michael-redpanda · 2025-12-04T20:56:15Z

Fixes: CORE-14903

This PR does the following:

Unsets the include authorization options in metadata/describe_cluster requests for non SL clients (reduce prevlance of authz failures in logs)
Reduces log severity of authz checks for that flag from INFO to DEBUG
Permits users of kafka::client::cluster to provide an error mitigation function to be called if there is an error refreshing metadata - reduces prevlance of sasl authentication failed error messages

Backports Required

Release Notes

Bug Fixes

Variety of fixes to reduce noise in log concerning internal kafka client authentication and authorization

When building a list of authorized operations for a metadata response, describe cluster response or a describe groups response, make the authz checks quiet so they do not clutter the log at INFO level. Signed-off-by: Michael Boquard <michael@redpanda.com>

Copilot

Pull request overview

This PR fixes issues related to internal Kafka client authentication and authorization logging by reducing log noise and handling errors more gracefully during metadata operations. The changes specifically address problems encountered when the audit log client upgrades or performs metadata refreshes.

Key changes:

Adds error mitigation support for metadata refresh failures in the Kafka client
Disables authorized operations requests for internal (non-SL) clients to reduce authorization failures
Reduces log severity of authorization checks from INFO to DEBUG

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/rptest/tests/audit_log_test.py	Enables kafka/client trace logging and reduces audit log partitions from 3 to 1 for testing
src/v/security/audit/client.cc	Adds metadata error mitigation callback to kafka_client_impl constructor
src/v/kafka/server/handlers/details/security.h	Changes authorized_operations to use quiet authorization checks
src/v/kafka/client/configuration.h	Adds include_authorized_operations flag to control whether clients request authorized operations
src/v/kafka/client/configuration.cc	Sets include_authorized_operations to false for internal clients
src/v/kafka/client/cluster.h	Adds external_metadata_mitigate callback support for handling metadata update failures
src/v/kafka/client/cluster.cc	Implements metadata error mitigation and conditionally includes authorized operations based on config
src/v/kafka/client/client.h	Adds metadata_mitigater parameter to client constructors
src/v/kafka/client/client.cc	Passes metadata_mitigater through to cluster constructor

Copilot · 2025-12-04T20:56:55Z

+          [this](std::exception_ptr eptr) { return mitigate_error(eptr); },
          [this](std::exception_ptr eptr) { return mitigate_error(eptr); }) {}


The same lambda [this](std::exception_ptr eptr) { return mitigate_error(eptr); } is duplicated on consecutive lines. This suggests the constructor now takes two mitigation callbacks where previously it took one. Consider extracting this to a named variable or using a helper function to avoid duplication and make the intent clearer.

Suggested change

[this](std::exception_ptr eptr) { return mitigate_error(eptr); },

[this](std::exception_ptr eptr) { return mitigate_error(eptr); }) {}

error_mitigator,

error_mitigator)

{

auto error_mitigator = [this](std::exception_ptr eptr) { return mitigate_error(eptr); };

}

Copilot · 2025-12-04T20:56:56Z

+    metadata_update mu;
+    if (describe_cluster_version) {
+        auto describe_cluster_resp = co_await dispatch_describe_cluster_request(
+          broker);


[nitpick] The indentation change here appears inconsistent with the original formatting. The opening of the function call should align with the assignment or follow project style conventions.

Suggested change

broker);

broker);

michael-redpanda · 2025-12-04T20:58:10Z

Force push:

FIxed python lint

michael-redpanda · 2025-12-05T13:13:36Z

Force push:

Fixed issue causing test failures

pgellert · 2025-12-05T13:27:29Z

  , _external_mitigate(std::move(mitigater))
-  , _cluster(std::make_unique<cluster>(cfg.connection_cfg))
+  , _cluster(
+      std::make_unique<cluster>(
+        cfg.connection_cfg, std::move(metadata_mitigater)))


Could we reuse the same mitigater instead of introducing a separate one?

noncopyable_function

We could wrap that in a ss::shared_ptr or switch to a std::function

pgellert · 2025-12-05T13:29:46Z


    vlog(_logger.debug, "Starting cluster metadata dispatch and update");

+    auto f = ss::now();


I think now that this is not a coroutine, the gate holder on L146 no longer works.

Could we leave this as a coroutine instead of using continuation style? It's so much easier to reason about coroutines than continuation style. Maybe ss::coroutine::as_future is what we need here to make the code clear with coroutines + exceptions.

oh right good call thanks

pgellert · 2025-12-05T13:36:17Z

      : audit_client(sink, controller)
      , _client(
          config::to_yaml(client_config, config::redact_secrets::no),
+          [this](std::exception_ptr eptr) { return mitigate_error(eptr); },


I think the SR client would have the same issue as well, right? I think it would be simpler to just reuse the existing mitigator function because I can't see why we wouldn't want to mitigate metadata update errors if we want to mitigate all other errors.

+1 for reusing the existing mitigator instead of passing another one.

pgellert · 2025-12-05T13:41:24Z

        security = AuditLogTestSecurityConfig.default_credentials()
        audit_config = AuditLogConfig(
            enabled=True,
-            num_partitions=3,


I think it would be worth extracting the num_partitions + num_brokers into constants with a note about the reasoning why we need a lower num_partitions than num_brokers for the completeness of this test.

Is the reasoning that the audit log client prefers to send data to the partitions that are local to the node, and so this auth issue only surfaces if there are no local partitions to produce to?

I'm also wondering if it would be worth disabling the "prefer partitions led by the local node" optimization in debug builds to get more test coverage over the scenario when the data needs to go to a different node.

Is the reasoning that the audit log client prefers to send data to the partitions that are local to the node, and so this auth issue only surfaces if there are no local partitions to produce to?

Partially - there really wasn't an issue in the first place - the logs were mostly scary and hard to reason about whether there was an issue or not. Reducing the partition count will help ensure that our mitigation efforts work and I like the idea of disabling the "preference" on debug builds.

vbotbuildovich · 2025-12-05T13:54:15Z

CI test results

test results on build#77406

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
MasterTestSuite	test_remote_partition_read_cached_index		unit	https://buildkite.com/redpanda/redpanda/builds/77406#019aeea5-3d98-43f0-a92c-dd78c2c6eddf	FAIL	0/1

test results on build#77539

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
AuditLogTestAdminApi	test_audit_log_functioning	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de90-4e75-9d9a-57864aa3f3e1	FLAKY	7/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestAdminApi&test_method=test_audit_log_functioning
AuditLogTestAdminApi	test_audit_log_functioning	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0639-4ee1-b78e-c24e1f2ebe10	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestAdminApi&test_method=test_audit_log_functioning
AuditLogTestAdminAuthApi	test_excluded_principal	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de88-4753-be01-dc9232441327	FLAKY	7/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestAdminAuthApi&test_method=test_excluded_principal
AuditLogTestKafkaApi	test_audit_topic_protections	null	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de85-4c2d-a85e-120e2fde38b7	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_audit_topic_protections
AuditLogTestKafkaApi	test_consume	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de87-49d2-bb54-ee13288f4317	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_consume
AuditLogTestKafkaAuthnApi	test_authn_failure_messages	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de85-4c2d-a85e-120e2fde38b7	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaAuthnApi&test_method=test_authn_failure_messages
AuditLogTestKafkaAuthnApi	test_excluded_principal	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8c-4e7e-b1f4-546411a2ac87	FLAKY	7/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaAuthnApi&test_method=test_excluded_principal
AuditLogTestKafkaAuthnApi	test_no_audit_user_authn	null	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1	FLAKY	6/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaAuthnApi&test_method=test_no_audit_user_authn
AuditLogTestReproducer	test_sanctioning_mode	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de92-4b71-ab70-9b6bdf82d86d	FLAKY	9/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestReproducer&test_method=test_sanctioning_mode
AuditLogTestSchemaRegistry	test_sr_audit	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de88-4753-be01-dc9232441327	FLAKY	3/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistry&test_method=test_sr_audit
AuditLogTestSchemaRegistry	test_sr_audit_bad_authn	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8c-4e7e-b1f4-546411a2ac87	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistry&test_method=test_sr_audit_bad_authn
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "DELETE_CONFIG_SUBJECT"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de85-4c2d-a85e-120e2fde38b7	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "DELETE_SUBJECT"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8c-4e7e-b1f4-546411a2ac87	FLAKY	6/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "DELETE_SUBJECT_VERSION"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1	FLAKY	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "GET_CONFIG_SUBJECT"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0631-4d56-8436-c6de6b854568	FLAKY	18/21	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "GET_MODE"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0633-4012-919a-2f2aadd4dcdb	FLAKY	19/21	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "GET_SUBJECT_VERSIONS_VERSION_SCHEMA"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1	FLAKY	5/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "POST_SUBJECT"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de92-4b71-ab70-9b6bdf82d86d	FLAKY	7/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz	{"audit_transport_mode": "kclient", "endpoint_name": "PUT_MODE"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1	FLAKY	5/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs	test_sr_audit_authz_get_subjects	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de88-4753-be01-dc9232441327	FLAKY	9/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz_get_subjects
AuditLogTestsAppLifecycle	test_recovery_mode	{"audit_transport_mode": "kclient"}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0636-4321-bb47-e9f4dced3730	FLAKY	9/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestsAppLifecycle&test_method=test_recovery_mode
ShadowLinkingReplicationTests	test_topic_delete	{"source_cluster_spec": {"cluster_type": "redpanda"}}	integration	https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0638-4ae0-8deb-9fb21374c5dd	FLAKY	14/21	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0826, p0=0.0045, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_topic_delete

vbotbuildovich · 2025-12-05T15:08:57Z

Retry command for Build#77406

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT_VERSION"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SUBJECT_VERSIONS_VERSION_SCHEMA"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_public@{"audit_transport_mode":"kclient","endpoint_name":"SCHEMA_REGISTRY_STATUS_READY"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_audit_topic_protections
tests/rptest/tests/audit_log_test.py::AuditLogTestSmallBuffers.test_bypass@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestReproducer.test_sanctioning_mode@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"POST_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"PUT_MODE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestOauth.test_admin_oauth@{"audit_transport_mode":"kclient"}
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_reassignments_cancel
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminAuthApi.test_excluded_principal@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_MODE"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz_get_subjects@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_management@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistry.test_sr_audit_bad_authn@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestOauth.test_kafka_oauth@{"audit_transport_mode":"kclient","authz_match":"acl"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_MODE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SUBJECT_VERSIONS_VERSION_REFERENCED_BY"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"PUT_CONFIG_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminApi.test_config_rejected
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SCHEMAS_IDS_ID_VERSIONS"}
tests/rptest/tests/audit_log_test.py::AuditLogTestOauth.test_kafka_oauth@{"audit_transport_mode":"kclient","authz_match":"rbac"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_MODE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSanctionMode.test_sanctioning_mode@{"audit_transport_mode":"kclient"}

IoannisRP

A couple of question on the mitigate on the update_metadata path...

Is it there just for the inform call to happen earlier? i.e. we don't have to wait for a a dispatch call to end up calling inform
dispatch calls happen with a retry. The first fails, informs and then it "succeeds".
There is no retry around update_metadata, is there? Is the logic that the first fails and the next (either timer or mitigator call) update_metadata succeeds?

IoannisRP · 2025-12-08T15:33:54Z

+        .include_cluster_authorized_operations
+        = _config.include_authorized_operations ? true : false,
+        .include_topic_authorized_operations
+        = _config.include_authorized_operations ? true : false}},


nit:

Suggested change

.include_cluster_authorized_operations

= _config.include_authorized_operations ? true : false,

.include_topic_authorized_operations

= _config.include_authorized_operations ? true : false}},

.include_cluster_authorized_operations

= _config.include_authorized_operations(),

.include_topic_authorized_operations

= _config.include_authorized_operations()}},

same below

IoannisRP · 2025-12-08T15:58:54Z

+                      .handle_exception([this](std::exception_ptr ex) {
+                          vlog(
+                            _logger.warn,
+                            "External metadata mitigation failed - {}",


if the external mitigator is set, we wouldn't reach the warning bellow. Good to have the context of what failed in the warning, as well.

Suggested change

"External metadata mitigation failed - {}",

"Failed to dispatch metadata request: External metadata mitigation failed - {}",

IoannisRP · 2025-12-08T16:29:50Z

      : audit_client(sink, controller)
      , _client(
          config::to_yaml(client_config, config::redact_secrets::no),
+          [this](std::exception_ptr eptr) { return mitigate_error(eptr); },


+1 for reusing the existing mitigator instead of passing another one.

michael-redpanda · 2025-12-08T18:53:02Z

@IoannisRP

Is it there just for the inform call to happen earlier? i.e. we don't have to wait for a a dispatch call to end up calling inform

The issue is that brokers may shut down and come back and in that time they'll lose the ephemeral credentials they previous held, so we need to react to these situations with infrom

dispatch calls happen with a retry. The first fails, informs and then it "succeeds".
There is no retry around update_metadata, is there? Is the logic that the first fails and the next (either timer or mitigator call) update_metadata succeeds?

Previously, the behavior would have been that the metadata request would have failed and thrown an exception and continued the retry.. The changes within kafka::client::cluster now 'swallow' that exception. The alternative would be to throw from the kafka::clinet::cluster but that change in behavior led to some strange behavior with Shadow Linking.

IoannisRP · 2025-12-08T19:33:37Z

The issue is that brokers may shut down and come back and in that time they'll lose the ephemeral credentials they previous held, so we need to react to these situations with infrom

I have seen the same behavior in SR, as well. It is normally resolved through the next dispatch-fail-inform-retry cycle. If it can help to resolve this more efficiently, we should go with the existing external mitigator (as discussed here).

Previously, the behavior would have been that the metadata request would have failed and thrown an exception and continued the retry

which path are we discussing?

In the client, the client::dispatch, client::update_metadata and most other calls come with mitigation+retry from the client side. So after the first failure, inform would be called, and the retry would happen.

The paths that I see that don't do mitigation (so no inform) in the cluster, is the periodic update_metadata from update_timer_callback and the update_metadata that happens as part of the mitigation. However, the later should be happening after the external_mitigate_error, so after the inform has already happened.

Is it the periodic metadata update that is the target of this fix? i.e. to not have logs full of sasl errors, even when not in the presence of other calls?

Not all uses of the kafka client need metadata or describe_cluster requests to include the bitmask of authorized operations. If the authenticated principal used by the client doesn't have the appropriate level of authorization to get that information this clutters up the broker logs with failed authorization checks. This new option permits users of k/client to opt out of getting that information. Signed-off-by: Michael Boquard <michael@redpanda.com>

All uses of the kafka client outside of Shadow Linking do not require the authorized operations flag to be set. This change ensures that any use disables that flag. Signed-off-by: Michael Boquard <michael@redpanda.com>

michael-redpanda · 2025-12-08T20:15:16Z

Is it the periodic metadata update that is the target of this fix? i.e. to not have logs full of sasl errors, even when not in the presence of other calls?

that would be the desire

michael-redpanda · 2025-12-08T20:15:54Z

Force push:

Addressed pr comments

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Copilot · 2025-12-08T20:20:34Z

+cluster::cluster(
+  connection_configuration config,
+  std::optional<external_metadata_mitigate> mitigater)


Corrected spelling of 'mitigater' to 'mitigator'.

Copilot · 2025-12-08T20:20:34Z

+    // The mitigater can be used to perform actions when metadata update fails.
+    explicit cluster(
+      connection_configuration config,
+      std::optional<external_metadata_mitigate> mitigater = std::nullopt);
+
+    // The mitigater can be used to perform actions when metadata update fails.


Corrected spelling of 'mitigater' to 'mitigator'.

Suggested change

// The mitigater can be used to perform actions when metadata update fails.

explicit cluster(

connection_configuration config,

std::optional<external_metadata_mitigate> mitigater = std::nullopt);

// The mitigater can be used to perform actions when metadata update fails.

// The mitigator can be used to perform actions when metadata update fails.

explicit cluster(

connection_configuration config,

std::optional<external_metadata_mitigate> mitigater = std::nullopt);

// The mitigator can be used to perform actions when metadata update fails.

Copilot · 2025-12-08T20:20:35Z

+    // The mitigater can be used to perform actions when metadata update fails.
+    explicit cluster(
+      connection_configuration config,
+      std::optional<external_metadata_mitigate> mitigater = std::nullopt);
+
+    // The mitigater can be used to perform actions when metadata update fails.


Corrected spelling of 'mitigater' to 'mitigator'.

Suggested change

// The mitigater can be used to perform actions when metadata update fails.

explicit cluster(

connection_configuration config,

std::optional<external_metadata_mitigate> mitigater = std::nullopt);

// The mitigater can be used to perform actions when metadata update fails.

// The mitigator can be used to perform actions when metadata update fails.

explicit cluster(

connection_configuration config,

std::optional<external_metadata_mitigate> mitigater = std::nullopt);

// The mitigator can be used to perform actions when metadata update fails.

Copilot · 2025-12-08T20:20:35Z

  : client(
      client_configuration::from_config_store(configuration(cfg)),
      std::move(mitigater)) {}


Corrected spelling of 'mitigater' to 'mitigator'.

vbotbuildovich · 2025-12-08T21:57:18Z

Retry command for Build#77539

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_consume@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestReproducer.test_sanctioning_mode@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"POST_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminApi.test_audit_log_functioning@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_audit_topic_protections
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_authn_failure_messages@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_CONFIG_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminAuthApi.test_excluded_principal@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistry.test_sr_audit@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz_get_subjects@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_no_audit_user_authn
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT_VERSION"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SUBJECT_VERSIONS_VERSION_SCHEMA"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"PUT_MODE"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_excluded_principal@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistry.test_sr_audit_bad_authn@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestsAppLifecycle.test_recovery_mode@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_MODE"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_CONFIG_SUBJECT"}
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_topic_delete@{"source_cluster_spec":{"cluster_type":"redpanda"}}

IoannisRP

looking good 👍

IoannisRP · 2025-12-09T13:07:47Z

+    auto dispatch_f = co_await ss::coroutine::as_future(
+      do_dispatch_and_apply_metadata_updates(std::move(topics_request_list)));
+
+    if (dispatch_f.failed()) {
+        auto err = dispatch_f.get_exception();
+        if (_external_metadata_mitigate) {
+            auto external_f = co_await ss::coroutine::as_future(
+              _external_metadata_mitigate.value()(err));
+            if (external_f.failed()) {
+                auto external_err = external_f.get_exception();
+                vlog(
+                  _logger.warn,
+                  "Failed to dispatch metadata request: External metadata "
+                  "mitigation failed - {}",
+                  external_err);
+            }
        } else {
-            // If describe cluster isn't supported, then dispatch a metadata
-            // request with no topics at API version 10 in order to get the
-            // cluster level authorized operations
-            auto metadata_resp = co_await dispatch_metadata_request(
-              broker, {}, api_version(10));
-            mu = to_metadata_update(
-              std::move(metadata_resp), topic_metadata_included_t::no);
+            vlog(_logger.warn, "Failed to dispatch metadata request - {}", err);
        }
+    }
+}


it's a shame we can't reuse some of the "client/utils" here. The closest would be retry_with_mitigation, but it only does mitigation on retry, and if you add a retry, it would retry on every failure (not just sasl).

Signed-off-by: Michael Boquard <michael@redpanda.com>

michael-redpanda · 2025-12-10T15:43:34Z

Force push:

Removed mitigator as this was causing test flakiness
Reduced log severity of broker error in metadata refresh to 'debug' to reduce occurence of benign but alarming error messages

vbotbuildovich · 2025-12-10T18:00:09Z

/backport v25.3.x

michael-redpanda requested review from BenPope, IoannisRP and pgellert December 4, 2025 20:56

michael-redpanda self-assigned this Dec 4, 2025

Copilot AI review requested due to automatic review settings December 4, 2025 20:56

michael-redpanda requested a review from a team as a code owner December 4, 2025 20:56

github-actions Bot added the area/redpanda label Dec 4, 2025

Copilot AI reviewed Dec 4, 2025

View reviewed changes

michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from f033e8d to e23848d Compare December 4, 2025 20:58

michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from e23848d to 98f6061 Compare December 5, 2025 13:12

pgellert reviewed Dec 5, 2025

View reviewed changes

IoannisRP reviewed Dec 8, 2025

View reviewed changes

michael-redpanda added 2 commits December 8, 2025 15:13

k/client: Non SL clients do not include authorized ops

8a929cc

All uses of the kafka client outside of Shadow Linking do not require the authorized operations flag to be set. This change ensures that any use disables that flag. Signed-off-by: Michael Boquard <michael@redpanda.com>

michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from 98f6061 to 4984b58 Compare December 8, 2025 20:14

michael-redpanda requested review from IoannisRP, Copilot and pgellert December 8, 2025 20:19

Copilot AI reviewed Dec 8, 2025

View reviewed changes

IoannisRP reviewed Dec 9, 2025

View reviewed changes

michael-redpanda added this to the v25.3.2 milestone Dec 9, 2025

k/client: Reduce log severity on metadata error

d7f6816

Signed-off-by: Michael Boquard <michael@redpanda.com>

michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from 4984b58 to d7f6816 Compare December 10, 2025 15:42

michael-redpanda requested a review from IoannisRP December 10, 2025 15:42

pgellert approved these changes Dec 10, 2025

View reviewed changes

michael-redpanda enabled auto-merge December 10, 2025 16:42

IoannisRP approved these changes Dec 10, 2025

View reviewed changes

michael-redpanda merged commit 296e00e into redpanda-data:dev Dec 10, 2025
19 checks passed

vbotbuildovich mentioned this pull request Dec 10, 2025

[v25.3.x] Fix audit log client on upgrade #28927

Merged

		[this](std::exception_ptr eptr) { return mitigate_error(eptr); },
		[this](std::exception_ptr eptr) { return mitigate_error(eptr); }) {}

-          [this](std::exception_ptr eptr) { return mitigate_error(eptr); },
-          [this](std::exception_ptr eptr) { return mitigate_error(eptr); }) {}
+          error_mitigator,
+          error_mitigator)
+    {
+        auto error_mitigator = [this](std::exception_ptr eptr) { return mitigate_error(eptr); };
+    }


		vlog(_logger.debug, "Starting cluster metadata dispatch and update");

		auto f = ss::now();

	"External metadata mitigation failed - {}",
	"Failed to dispatch metadata request: External metadata mitigation failed - {}",

Conversation

michael-redpanda commented Dec 4, 2025

Backports Required

Release Notes

Bug Fixes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

michael-redpanda commented Dec 4, 2025

Uh oh!

michael-redpanda commented Dec 5, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vbotbuildovich commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI test results

Uh oh!

vbotbuildovich commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Retry command for Build#77406

Uh oh!

IoannisRP left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michael-redpanda commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

IoannisRP commented Dec 8, 2025

Uh oh!

michael-redpanda commented Dec 8, 2025

Uh oh!

michael-redpanda commented Dec 8, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 8, 2025

vbotbuildovich commented Dec 5, 2025 •

edited

Loading

vbotbuildovich commented Dec 5, 2025 •

edited

Loading

michael-redpanda commented Dec 8, 2025 •

edited

Loading

vbotbuildovich commented Dec 8, 2025 •

edited

Loading