Skip to content

Fix audit log client on upgrade#28852

Merged
michael-redpanda merged 4 commits into
redpanda-data:devfrom
michael-redpanda:fix-audit-log-client-on-upgrade
Dec 10, 2025
Merged

Fix audit log client on upgrade#28852
michael-redpanda merged 4 commits into
redpanda-data:devfrom
michael-redpanda:fix-audit-log-client-on-upgrade

Conversation

@michael-redpanda
Copy link
Copy Markdown
Contributor

Fixes: CORE-14903

This PR does the following:

  • Unsets the include authorization options in metadata/describe_cluster requests for non SL clients (reduce prevlance of authz failures in logs)
  • Reduces log severity of authz checks for that flag from INFO to DEBUG
  • Permits users of kafka::client::cluster to provide an error mitigation function to be called if there is an error refreshing metadata - reduces prevlance of sasl authentication failed error messages

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Bug Fixes

  • Variety of fixes to reduce noise in log concerning internal kafka client authentication and authorization

When building a list of authorized operations for a metadata response,
describe cluster response or a describe groups response, make the authz
checks quiet so they do not clutter the log at INFO level.

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda self-assigned this Dec 4, 2025
Copilot AI review requested due to automatic review settings December 4, 2025 20:56
@michael-redpanda michael-redpanda requested a review from a team as a code owner December 4, 2025 20:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes issues related to internal Kafka client authentication and authorization logging by reducing log noise and handling errors more gracefully during metadata operations. The changes specifically address problems encountered when the audit log client upgrades or performs metadata refreshes.

Key changes:

  • Adds error mitigation support for metadata refresh failures in the Kafka client
  • Disables authorized operations requests for internal (non-SL) clients to reduce authorization failures
  • Reduces log severity of authorization checks from INFO to DEBUG

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/rptest/tests/audit_log_test.py Enables kafka/client trace logging and reduces audit log partitions from 3 to 1 for testing
src/v/security/audit/client.cc Adds metadata error mitigation callback to kafka_client_impl constructor
src/v/kafka/server/handlers/details/security.h Changes authorized_operations to use quiet authorization checks
src/v/kafka/client/configuration.h Adds include_authorized_operations flag to control whether clients request authorized operations
src/v/kafka/client/configuration.cc Sets include_authorized_operations to false for internal clients
src/v/kafka/client/cluster.h Adds external_metadata_mitigate callback support for handling metadata update failures
src/v/kafka/client/cluster.cc Implements metadata error mitigation and conditionally includes authorized operations based on config
src/v/kafka/client/client.h Adds metadata_mitigater parameter to client constructors
src/v/kafka/client/client.cc Passes metadata_mitigater through to cluster constructor

Comment thread src/v/security/audit/client.cc Outdated
Comment on lines 190 to 191
[this](std::exception_ptr eptr) { return mitigate_error(eptr); },
[this](std::exception_ptr eptr) { return mitigate_error(eptr); }) {}
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same lambda [this](std::exception_ptr eptr) { return mitigate_error(eptr); } is duplicated on consecutive lines. This suggests the constructor now takes two mitigation callbacks where previously it took one. Consider extracting this to a named variable or using a helper function to avoid duplication and make the intent clearer.

Suggested change
[this](std::exception_ptr eptr) { return mitigate_error(eptr); },
[this](std::exception_ptr eptr) { return mitigate_error(eptr); }) {}
error_mitigator,
error_mitigator)
{
auto error_mitigator = [this](std::exception_ptr eptr) { return mitigate_error(eptr); };
}

Copilot uses AI. Check for mistakes.
Comment thread src/v/kafka/client/cluster.cc Outdated
metadata_update mu;
if (describe_cluster_version) {
auto describe_cluster_resp = co_await dispatch_describe_cluster_request(
broker);
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The indentation change here appears inconsistent with the original formatting. The opening of the function call should align with the assignment or follow project style conventions.

Suggested change
broker);
broker);

Copilot uses AI. Check for mistakes.
@michael-redpanda michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from f033e8d to e23848d Compare December 4, 2025 20:58
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • FIxed python lint

@michael-redpanda michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from e23848d to 98f6061 Compare December 5, 2025 13:12
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Fixed issue causing test failures

Comment thread src/v/kafka/client/cluster.cc Outdated
Comment thread src/v/kafka/client/configuration.h
Comment thread src/v/kafka/client/client.cc Outdated
Comment on lines +59 to +67
, _external_mitigate(std::move(mitigater))
, _cluster(std::make_unique<cluster>(cfg.connection_cfg))
, _cluster(
std::make_unique<cluster>(
cfg.connection_cfg, std::move(metadata_mitigater)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we reuse the same mitigater instead of introducing a separate one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noncopyable_function

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could wrap that in a ss::shared_ptr or switch to a std::function

Comment thread src/v/kafka/client/cluster.cc Outdated

vlog(_logger.debug, "Starting cluster metadata dispatch and update");

auto f = ss::now();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now that this is not a coroutine, the gate holder on L146 no longer works.

Could we leave this as a coroutine instead of using continuation style? It's so much easier to reason about coroutines than continuation style. Maybe ss::coroutine::as_future is what we need here to make the code clear with coroutines + exceptions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right good call thanks

Comment thread src/v/security/audit/client.cc Outdated
: audit_client(sink, controller)
, _client(
config::to_yaml(client_config, config::redact_secrets::no),
[this](std::exception_ptr eptr) { return mitigate_error(eptr); },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the SR client would have the same issue as well, right? I think it would be simpler to just reuse the existing mitigator function because I can't see why we wouldn't want to mitigate metadata update errors if we want to mitigate all other errors.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for reusing the existing mitigator instead of passing another one.

security = AuditLogTestSecurityConfig.default_credentials()
audit_config = AuditLogConfig(
enabled=True,
num_partitions=3,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth extracting the num_partitions + num_brokers into constants with a note about the reasoning why we need a lower num_partitions than num_brokers for the completeness of this test.

Is the reasoning that the audit log client prefers to send data to the partitions that are local to the node, and so this auth issue only surfaces if there are no local partitions to produce to?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also wondering if it would be worth disabling the "prefer partitions led by the local node" optimization in debug builds to get more test coverage over the scenario when the data needs to go to a different node.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reasoning that the audit log client prefers to send data to the partitions that are local to the node, and so this auth issue only surfaces if there are no local partitions to produce to?

Partially - there really wasn't an issue in the first place - the logs were mostly scary and hard to reason about whether there was an issue or not. Reducing the partition count will help ensure that our mitigation efforts work and I like the idea of disabling the "preference" on debug builds.

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Dec 5, 2025

CI test results

test results on build#77406
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
MasterTestSuite test_remote_partition_read_cached_index unit https://buildkite.com/redpanda/redpanda/builds/77406#019aeea5-3d98-43f0-a92c-dd78c2c6eddf FAIL 0/1
test results on build#77539
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
AuditLogTestAdminApi test_audit_log_functioning {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de90-4e75-9d9a-57864aa3f3e1 FLAKY 7/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestAdminApi&test_method=test_audit_log_functioning
AuditLogTestAdminApi test_audit_log_functioning {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0639-4ee1-b78e-c24e1f2ebe10 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestAdminApi&test_method=test_audit_log_functioning
AuditLogTestAdminAuthApi test_excluded_principal {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de88-4753-be01-dc9232441327 FLAKY 7/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestAdminAuthApi&test_method=test_excluded_principal
AuditLogTestKafkaApi test_audit_topic_protections null integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de85-4c2d-a85e-120e2fde38b7 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_audit_topic_protections
AuditLogTestKafkaApi test_consume {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de87-49d2-bb54-ee13288f4317 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_consume
AuditLogTestKafkaAuthnApi test_authn_failure_messages {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de85-4c2d-a85e-120e2fde38b7 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaAuthnApi&test_method=test_authn_failure_messages
AuditLogTestKafkaAuthnApi test_excluded_principal {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8c-4e7e-b1f4-546411a2ac87 FLAKY 7/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaAuthnApi&test_method=test_excluded_principal
AuditLogTestKafkaAuthnApi test_no_audit_user_authn null integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1 FLAKY 6/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaAuthnApi&test_method=test_no_audit_user_authn
AuditLogTestReproducer test_sanctioning_mode {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de92-4b71-ab70-9b6bdf82d86d FLAKY 9/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestReproducer&test_method=test_sanctioning_mode
AuditLogTestSchemaRegistry test_sr_audit {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de88-4753-be01-dc9232441327 FLAKY 3/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistry&test_method=test_sr_audit
AuditLogTestSchemaRegistry test_sr_audit_bad_authn {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8c-4e7e-b1f4-546411a2ac87 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistry&test_method=test_sr_audit_bad_authn
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "DELETE_CONFIG_SUBJECT"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de85-4c2d-a85e-120e2fde38b7 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "DELETE_SUBJECT"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8c-4e7e-b1f4-546411a2ac87 FLAKY 6/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "DELETE_SUBJECT_VERSION"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1 FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "GET_CONFIG_SUBJECT"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0631-4d56-8436-c6de6b854568 FLAKY 18/21 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "GET_MODE"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0633-4012-919a-2f2aadd4dcdb FLAKY 19/21 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "GET_SUBJECT_VERSIONS_VERSION_SCHEMA"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1 FLAKY 5/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "POST_SUBJECT"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de92-4b71-ab70-9b6bdf82d86d FLAKY 7/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz {"audit_transport_mode": "kclient", "endpoint_name": "PUT_MODE"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de8e-44f9-b40f-5eed0227eba1 FLAKY 5/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz
AuditLogTestSchemaRegistryACLs test_sr_audit_authz_get_subjects {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc3-de88-4753-be01-dc9232441327 FLAKY 9/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestSchemaRegistryACLs&test_method=test_sr_audit_authz_get_subjects
AuditLogTestsAppLifecycle test_recovery_mode {"audit_transport_mode": "kclient"} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0636-4321-bb47-e9f4dced3730 FLAKY 9/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestsAppLifecycle&test_method=test_recovery_mode
ShadowLinkingReplicationTests test_topic_delete {"source_cluster_spec": {"cluster_type": "redpanda"}} integration https://buildkite.com/redpanda/redpanda/builds/77539#019affc9-0638-4ae0-8deb-9fb21374c5dd FLAKY 14/21 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0826, p0=0.0045, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_topic_delete

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Dec 5, 2025

Retry command for Build#77406

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT_VERSION"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SUBJECT_VERSIONS_VERSION_SCHEMA"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_public@{"audit_transport_mode":"kclient","endpoint_name":"SCHEMA_REGISTRY_STATUS_READY"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_audit_topic_protections
tests/rptest/tests/audit_log_test.py::AuditLogTestSmallBuffers.test_bypass@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestReproducer.test_sanctioning_mode@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"POST_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"PUT_MODE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestOauth.test_admin_oauth@{"audit_transport_mode":"kclient"}
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_reassignments_cancel
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminAuthApi.test_excluded_principal@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_MODE"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz_get_subjects@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_management@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistry.test_sr_audit_bad_authn@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestOauth.test_kafka_oauth@{"audit_transport_mode":"kclient","authz_match":"acl"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_MODE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SUBJECT_VERSIONS_VERSION_REFERENCED_BY"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"PUT_CONFIG_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminApi.test_config_rejected
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SCHEMAS_IDS_ID_VERSIONS"}
tests/rptest/tests/audit_log_test.py::AuditLogTestOauth.test_kafka_oauth@{"audit_transport_mode":"kclient","authz_match":"rbac"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_MODE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSanctionMode.test_sanctioning_mode@{"audit_transport_mode":"kclient"}

Copy link
Copy Markdown
Contributor

@IoannisRP IoannisRP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of question on the mitigate on the update_metadata path...

  1. Is it there just for the inform call to happen earlier? i.e. we don't have to wait for a a dispatch call to end up calling inform
  2. dispatch calls happen with a retry. The first fails, informs and then it "succeeds".
    There is no retry around update_metadata, is there? Is the logic that the first fails and the next (either timer or mitigator call) update_metadata succeeds?

Comment thread src/v/kafka/client/cluster.cc Outdated
Comment on lines +363 to +366
.include_cluster_authorized_operations
= _config.include_authorized_operations ? true : false,
.include_topic_authorized_operations
= _config.include_authorized_operations ? true : false}},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
.include_cluster_authorized_operations
= _config.include_authorized_operations ? true : false,
.include_topic_authorized_operations
= _config.include_authorized_operations ? true : false}},
.include_cluster_authorized_operations
= _config.include_authorized_operations(),
.include_topic_authorized_operations
= _config.include_authorized_operations()}},

same below

Comment thread src/v/kafka/client/cluster.cc Outdated
.handle_exception([this](std::exception_ptr ex) {
vlog(
_logger.warn,
"External metadata mitigation failed - {}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the external mitigator is set, we wouldn't reach the warning bellow. Good to have the context of what failed in the warning, as well.

Suggested change
"External metadata mitigation failed - {}",
"Failed to dispatch metadata request: External metadata mitigation failed - {}",

Comment thread src/v/security/audit/client.cc Outdated
: audit_client(sink, controller)
, _client(
config::to_yaml(client_config, config::redact_secrets::no),
[this](std::exception_ptr eptr) { return mitigate_error(eptr); },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for reusing the existing mitigator instead of passing another one.

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

michael-redpanda commented Dec 8, 2025

@IoannisRP

Is it there just for the inform call to happen earlier? i.e. we don't have to wait for a a dispatch call to end up calling inform

The issue is that brokers may shut down and come back and in that time they'll lose the ephemeral credentials they previous held, so we need to react to these situations with infrom

dispatch calls happen with a retry. The first fails, informs and then it "succeeds".
There is no retry around update_metadata, is there? Is the logic that the first fails and the next (either timer or mitigator call) update_metadata succeeds?

Previously, the behavior would have been that the metadata request would have failed and thrown an exception and continued the retry.. The changes within kafka::client::cluster now 'swallow' that exception. The alternative would be to throw from the kafka::clinet::cluster but that change in behavior led to some strange behavior with Shadow Linking.

@IoannisRP
Copy link
Copy Markdown
Contributor

The issue is that brokers may shut down and come back and in that time they'll lose the ephemeral credentials they previous held, so we need to react to these situations with infrom

I have seen the same behavior in SR, as well. It is normally resolved through the next dispatch-fail-inform-retry cycle. If it can help to resolve this more efficiently, we should go with the existing external mitigator (as discussed here).

Previously, the behavior would have been that the metadata request would have failed and thrown an exception and continued the retry

which path are we discussing?

In the client, the client::dispatch, client::update_metadata and most other calls come with mitigation+retry from the client side. So after the first failure, inform would be called, and the retry would happen.

The paths that I see that don't do mitigation (so no inform) in the cluster, is the periodic update_metadata from update_timer_callback and the update_metadata that happens as part of the mitigation. However, the later should be happening after the external_mitigate_error, so after the inform has already happened.

Is it the periodic metadata update that is the target of this fix? i.e. to not have logs full of sasl errors, even when not in the presence of other calls?

Not all uses of the kafka client need metadata or describe_cluster
requests to include the bitmask of authorized operations.  If the
authenticated principal used by the client doesn't have the appropriate
level of authorization to get that information this clutters up the
broker logs with failed authorization checks.  This new option permits
users of k/client to opt out of getting that information.

Signed-off-by: Michael Boquard <michael@redpanda.com>
All uses of the kafka client outside of Shadow Linking do not require
the authorized operations flag to be set.  This change ensures that any
use disables that flag.

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from 98f6061 to 4984b58 Compare December 8, 2025 20:14
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Is it the periodic metadata update that is the target of this fix? i.e. to not have logs full of sasl errors, even when not in the presence of other calls?

that would be the desire

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Addressed pr comments

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment thread src/v/kafka/client/cluster.cc Outdated
Comment on lines +67 to +69
cluster::cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater)
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'mitigater' to 'mitigator'.

Copilot uses AI. Check for mistakes.
Comment thread src/v/kafka/client/cluster.h Outdated
Comment on lines +35 to +40
// The mitigater can be used to perform actions when metadata update fails.
explicit cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater = std::nullopt);

// The mitigater can be used to perform actions when metadata update fails.
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'mitigater' to 'mitigator'.

Suggested change
// The mitigater can be used to perform actions when metadata update fails.
explicit cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater = std::nullopt);
// The mitigater can be used to perform actions when metadata update fails.
// The mitigator can be used to perform actions when metadata update fails.
explicit cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater = std::nullopt);
// The mitigator can be used to perform actions when metadata update fails.

Copilot uses AI. Check for mistakes.
Comment thread src/v/kafka/client/cluster.h Outdated
Comment on lines +35 to +40
// The mitigater can be used to perform actions when metadata update fails.
explicit cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater = std::nullopt);

// The mitigater can be used to perform actions when metadata update fails.
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'mitigater' to 'mitigator'.

Suggested change
// The mitigater can be used to perform actions when metadata update fails.
explicit cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater = std::nullopt);
// The mitigater can be used to perform actions when metadata update fails.
// The mitigator can be used to perform actions when metadata update fails.
explicit cluster(
connection_configuration config,
std::optional<external_metadata_mitigate> mitigater = std::nullopt);
// The mitigator can be used to perform actions when metadata update fails.

Copilot uses AI. Check for mistakes.
Comment on lines 49 to 51
: client(
client_configuration::from_config_store(configuration(cfg)),
std::move(mitigater)) {}
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'mitigater' to 'mitigator'.

Copilot uses AI. Check for mistakes.
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Dec 8, 2025

Retry command for Build#77539

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_consume@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestReproducer.test_sanctioning_mode@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"POST_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminApi.test_audit_log_functioning@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaApi.test_audit_topic_protections
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_authn_failure_messages@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_CONFIG_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestAdminAuthApi.test_excluded_principal@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistry.test_sr_audit@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz_get_subjects@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_no_audit_user_authn
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT_VERSION"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_SUBJECT_VERSIONS_VERSION_SCHEMA"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"PUT_MODE"}
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_excluded_principal@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistry.test_sr_audit_bad_authn@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"DELETE_SUBJECT"}
tests/rptest/tests/audit_log_test.py::AuditLogTestsAppLifecycle.test_recovery_mode@{"audit_transport_mode":"kclient"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_MODE"}
tests/rptest/tests/audit_log_test.py::AuditLogTestSchemaRegistryACLs.test_sr_audit_authz@{"audit_transport_mode":"kclient","endpoint_name":"GET_CONFIG_SUBJECT"}
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_topic_delete@{"source_cluster_spec":{"cluster_type":"redpanda"}}

Copy link
Copy Markdown
Contributor

@IoannisRP IoannisRP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good 👍

Comment thread src/v/kafka/client/cluster.cc Outdated
Comment on lines +154 to +174
auto dispatch_f = co_await ss::coroutine::as_future(
do_dispatch_and_apply_metadata_updates(std::move(topics_request_list)));

if (dispatch_f.failed()) {
auto err = dispatch_f.get_exception();
if (_external_metadata_mitigate) {
auto external_f = co_await ss::coroutine::as_future(
_external_metadata_mitigate.value()(err));
if (external_f.failed()) {
auto external_err = external_f.get_exception();
vlog(
_logger.warn,
"Failed to dispatch metadata request: External metadata "
"mitigation failed - {}",
external_err);
}
} else {
// If describe cluster isn't supported, then dispatch a metadata
// request with no topics at API version 10 in order to get the
// cluster level authorized operations
auto metadata_resp = co_await dispatch_metadata_request(
broker, {}, api_version(10));
mu = to_metadata_update(
std::move(metadata_resp), topic_metadata_included_t::no);
vlog(_logger.warn, "Failed to dispatch metadata request - {}", err);
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a shame we can't reuse some of the "client/utils" here. The closest would be retry_with_mitigation, but it only does mitigation on retry, and if you add a retry, it would retry on every failure (not just sasl).

@michael-redpanda michael-redpanda added this to the v25.3.2 milestone Dec 9, 2025
Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda force-pushed the fix-audit-log-client-on-upgrade branch from 4984b58 to d7f6816 Compare December 10, 2025 15:42
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Removed mitigator as this was causing test flakiness
  • Reduced log severity of broker error in metadata refresh to 'debug' to reduce occurence of benign but alarming error messages

@michael-redpanda michael-redpanda merged commit 296e00e into redpanda-data:dev Dec 10, 2025
19 checks passed
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.3.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants