Skip to content

Do not audit ephemeral users#26965

Merged
michael-redpanda merged 3 commits intoredpanda-data:devfrom
michael-redpanda:do-not-audit-ephemeral-users
Jul 29, 2025
Merged

Do not audit ephemeral users#26965
michael-redpanda merged 3 commits intoredpanda-data:devfrom
michael-redpanda:do-not-audit-ephemeral-users

Conversation

@michael-redpanda
Copy link
Copy Markdown
Contributor

@michael-redpanda michael-redpanda commented Jul 23, 2025

Redpanda will no longer generate audit events for ephemeral users. Ephemeral users are only used by internal Redpanda services (SR and auditing).

Fixes: CORE-12910

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • No longer generating audit events for internal Redpanda services

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda requested review from a team and pgellert July 23, 2025 19:14
@michael-redpanda michael-redpanda self-assigned this Jul 23, 2025
Comment on lines +1615 to +1633
@skip_fips_mode
@cluster(num_nodes=5)
def test_no_ephemeral_user(self):
"""
Verifies that ephemeral users do not generate audit messages
"""
self.setup_cluster()

user_rpk = self.get_rpk()

_ = user_rpk.list_topics()

start_time = time.time()

# Read all records that have the audit log user for two seconds - should not get any records
records = self.read_all_from_audit_log(partial(self.authn_filter_function, self.kafka_rpc_service_name, "__auditing", 99, "SASL-SCRAM"),
lambda records: time.time() >= start_time + 2)

assert len(records) == 0, f'Expected 0 records: {records}'
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did verify that if the changes I made in audit_log_manager.cc were removed that this test fails

rockwotj
rockwotj previously approved these changes Jul 23, 2025
Copy link
Copy Markdown
Contributor

@rockwotj rockwotj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had a discussion about this when first doing the project? When did we start producing audit logs for the audit log, doesn't that result in cycles where we always write to the audit log because we keep writing we wrote to it?

start_time = time.time()

# Read all records that have the audit log user for two seconds - should not get any records
records = self.read_all_from_audit_log(partial(self.authn_filter_function, self.kafka_rpc_service_name, "__auditing", 99, "SASL-SCRAM"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have produce auditing on?

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

I thought we had a discussion about this when first doing the project? When did we start producing audit logs for the audit log, doesn't that result in cycles where we always write to the audit log because we keep writing we wrote to it?

We already excluded the produces to the audit log. This is just expands that to include all audit events for all ephemeral users

@michael-redpanda michael-redpanda force-pushed the do-not-audit-ephemeral-users branch from 2384986 to f6cd5f1 Compare July 23, 2025 19:29
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Fixed python lint

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • ACTUALLY fixed python lint

@rockwotj
Copy link
Copy Markdown
Contributor

I thought we had a discussion about this when first doing the project? When did we start producing audit logs for the audit log, doesn't that result in cycles where we always write to the audit log because we keep writing we wrote to it?

We already excluded the produces to the audit log. This is just expands that to include all audit events for all ephemeral users

Does that mean there is logic we can remove/cleanup now?

rockwotj
rockwotj previously approved these changes Jul 23, 2025
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Jul 23, 2025

Retry command for Build#69563

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/audit_log_test.py::AuditLogTestKafkaAuthnApi.test_no_audit_user_authn

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Jul 23, 2025

CI test results

test results on build#69563
test_class test_method test_arguments test_kind job_url test_status passed reason
AuditLogTestKafkaAuthnApi test_no_audit_user_authn null integration https://buildkite.com/redpanda/redpanda/builds/69563#019838ef-963a-45fd-94d7-5dc96f2d828b FAIL 0/21 The test has failed across all retries
AuditLogTestKafkaAuthnApi test_no_audit_user_authn null integration https://buildkite.com/redpanda/redpanda/builds/69563#019838f0-03ba-4d5e-8062-2faa70a373a6 FAIL 0/21 The test has failed across all retries
EndToEndCloudTopicsTxTest test_write null integration https://buildkite.com/redpanda/redpanda/builds/69563#019838f0-03b3-4594-b616-d60ba60711ea FLAKY 19/21 upstream reliability is '95.52238805970148'. current run reliability is '90.47619047619048'. drift is 5.0462 and the allowed drift is set to 50. The test should PASS
FeaturesMultiNodeTest test_license_upload_and_query null integration https://buildkite.com/redpanda/redpanda/builds/69563#019838ef-9639-4b4f-8706-8224a6709ee3 FLAKY 20/21 upstream reliability is '96.93251533742331'. current run reliability is '95.23809523809523'. drift is 1.69442 and the allowed drift is set to 50. The test should PASS
VerifyConsumerOffsetsThruUpgrades test_consumer_group_offsets {"versions_to_upgrade": 3} integration https://buildkite.com/redpanda/redpanda/builds/69563#019838ef-9638-4af6-98e6-2459537a4b49 FLAKY 20/21 upstream reliability is '99.62962962962963'. current run reliability is '95.23809523809523'. drift is 4.39153 and the allowed drift is set to 50. The test should PASS
test results on build#69665
test_class test_method test_arguments test_kind job_url test_status passed reason
PartitionBalancerTest test_unavailable_nodes null integration https://buildkite.com/redpanda/redpanda/builds/69665#019841d4-5005-43f2-9a32-85cbe7bc35ab FLAKY 20/21 upstream reliability is '99.43820224719101'. current run reliability is '95.23809523809523'. drift is 4.20011 and the allowed drift is set to 50. The test should PASS

Copy link
Copy Markdown
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

The new test is failing in CI, I suspect it's because of the stop condition. It's probably worth running it on repeat a couple times on CI to make sure it never fails.

Comment on lines +40 to +44
_audit_user.type_id = _principal.type() == principal_type::user
? audit::user::type::user
: _principal.type() == principal_type::ephemeral_user
? audit::user::type::system
: audit::user::type::other;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

switch statement instead of if/elseif ternary

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

I thought we had a discussion about this when first doing the project? When did we start producing audit logs for the audit log, doesn't that result in cycles where we always write to the audit log because we keep writing we wrote to it?

We already excluded the produces to the audit log. This is just expands that to include all audit events for all ephemeral users

Does that mean there is logic we can remove/cleanup now?

Oh good point, yeah probably

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Looks good.

The new test is failing in CI, I suspect it's because of the stop condition. It's probably worth running it on repeat a couple times on CI to make sure it never fails.

Actually that's an older test that did validate that the audit user did authenticate and generate an audit event. This change makes that part of the test unecessary.

@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Fixed failing audit_log DT test
  • Updated skip_auditing on Kafka API to skip any ephemeral user
  • Updated the new DT test

If ephemeral user, make the user type by system in OCSF

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda force-pushed the do-not-audit-ephemeral-users branch from 2a16a3c to 08de292 Compare July 25, 2025 12:51
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Remove ternary to use switch statement

Ephemeral users are used by auditing and schema registry.  There is no
reason to audit the auditing ephemeral user as that does not correspond
to a user derived action.

For schema registry, there are already audit events tracked when the
user access SR via the SR API.  It isn't necessary to audit that the SR
client has performed an action.

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda force-pushed the do-not-audit-ephemeral-users branch from 08de292 to 6b1d48e Compare July 25, 2025 12:51
@michael-redpanda
Copy link
Copy Markdown
Contributor Author

Force push:

  • Fixed python lint error

== security::audit::event_type::produce
&& principal == security::audit_principal;
/// Skips auditing ephemeral users
inline bool skip_auditing(const security::acl_principal& principal) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this check at both levels?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows the Kafka API to short circuit earlier in the auditing checks. The other check would apply to all other interfaces (existing or future) that need to generate audit events. I'd say it's a slight optimization just on the Kafka API.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be useful to add a comment in the source to that effect that it's only an optimization.

== security::audit::event_type::produce
&& principal == security::audit_principal;
/// Skips auditing ephemeral users
inline bool skip_auditing(const security::acl_principal& principal) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be useful to add a comment in the source to that effect that it's only an optimization.

@michael-redpanda michael-redpanda merged commit 2b3ed44 into redpanda-data:dev Jul 29, 2025
19 checks passed
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.2.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.1.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v24.3.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26965-v25.1.x-604 remotes/upstream/v25.1.x
git cherry-pick -x c7e4ec3f15 271095fdec 6b1d48ebbd

Workflow run logs.

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26965-v24.3.x-43 remotes/upstream/v24.3.x
git cherry-pick -x c7e4ec3f15 271095fdec 6b1d48ebbd

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants