cluster/feature_manager: use insert_or_assign for version updates#29824

Merged
pgellert merged 2 commits into redpanda-data:dev from pgellert:fix/alter-flake-after-upgrade
Mar 17, 2026
Conversation

@pgellert
Contributor

QuotaManagementUpgradeTest.test_quotas_during_upgrade was flaky because it can take longer than 10s after a restart for the logical version to update and for the user quotas feature to activate, which then surfaces as a timeout in this test.

This PR addresses two issues:

  1. The 10s timeout is generally too low: health reports are collected on a 10s interval, so there is a non-zero chance of exceeding the timeout even when nothing is wrong.

  2. The feature_manager's handling of node version updates had a bug: the logic used emplace instead of insert_or_assign, which meant that a node's version update could be delayed further if a stale update for that node had been queued earlier.

INFO  2026-03-09 06:21:03,170 [shard 0:clus] cluster - feature_manager.cc:103 - Starting...
DEBUG 2026-03-09 06:21:03,170 [shard 0:clus] cluster - feature_manager.cc:143 - Controller leader notification term 1
DEBUG 2026-03-09 06:21:03,170 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:03,174 [shard 0:main] cluster - feature_manager.cc:356 - Verifying enterprise license...
INFO  2026-03-09 06:21:03,174 [shard 0:main] cluster - feature_manager.cc:395 - Verifying enterprise license: active_version=17, latest_version=18, enterprise_features=[core_balancing_continuous, cloud_storage, partition_auto_balancing_continuous], license_missing_or_expired=false
DEBUG 2026-03-09 06:21:03,174 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:03,400 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=17 *** <- inserted into _updates
DEBUG 2026-03-09 06:21:03,400 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=1 version=18
DEBUG 2026-03-09 06:21:06,787 [shard 0:rr  ] cluster - feature_manager.cc:143 - Controller leader notification term 3
DEBUG 2026-03-09 06:21:13,446 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=18 *** <- this is ignored by .emplace
DEBUG 2026-03-09 06:21:13,446 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=1 version=18
DEBUG 2026-03-09 06:21:13,912 [shard 0:rs  ] cluster - feature_manager.cc:143 - Controller leader notification term 3
DEBUG 2026-03-09 06:21:15,545 [shard 0:rs  ] cluster - feature_manager.cc:143 - Controller leader notification term 5
DEBUG 2026-03-09 06:21:15,545 [shard 0:rs  ] cluster - feature_manager.cc:161 - generating version update for controller leader 1 (18)
DEBUG 2026-03-09 06:21:15,545 [shard 0:rs  ] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=1 version=18
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=1 version=18
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=2 version=17
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:602 - Can't update active version to 18 because node 2 version is too low (17)

Fixes https://redpandadata.atlassian.net/browse/CORE-15478

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Improvements

  • Minor improvement to speed up feature activation after major version upgrades.

emplace is a no-op when the key already exists, so a stale version
entry can shadow a newer one queued before the loop drains _updates.
This delays feature activation by an extra health monitor tick (~10s)
because the controller must wait for the next health collection to
re-learn the node's version.

The alter helper uses wait_until with a 10s timeout, but during a
rolling upgrade user_based_client_quota activation is gated on all
nodes reporting the new version. The controller collects node health
on a 10s tick (health_monitor_tick_interval), so a single slow node
restart can exhaust the timeout. Increase to 30s.
@pgellert pgellert requested a review from a team March 13, 2026 12:20
@pgellert pgellert self-assigned this Mar 13, 2026
@pgellert pgellert requested review from BenPope and Copilot and removed request for a team March 13, 2026 12:20
@pgellert pgellert requested review from nguyen-andrew and removed request for BenPope March 13, 2026 12:20
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses flakiness in quota upgrade tests by making version update propagation more reliable and increasing the retry timeout to better match the system’s health-reporting cadence during upgrades.

Changes:

  • Increase the retry timeout for rpk.alter_cluster_quotas when running with retries (10s → 30s).
  • Fix feature_manager node version update queuing to overwrite stale entries by switching from emplace to insert_or_assign.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • tests/rptest/tests/quota_management_test.py: raises the retry timeout to reduce upgrade-related flakiness when cluster state takes longer to stabilize.
  • src/v/cluster/feature_manager.cc: ensures newer node version updates replace older queued updates, avoiding delayed active-version advancement.

@vbotbuildovich
Collaborator

CI test results

Test results on build #81756:

  • CloudTopicsL0GCNodeFailureTest.test_node_failure_mid_gc {"cloud_storage_type": 1} (integration): FLAKY, passed 10/11. Test passes after retries; no significant increase in flaky rate (baseline=0.0382, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1104, p1=0.3105, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/81756#019ce730-9884-42c6-ac70-d8650f3fcbc5 History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCNodeFailureTest&test_method=test_node_failure_mid_gc
  • WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration): FLAKY, passed 8/11. Test passes after retries; no significant increase in flaky rate (baseline=0.0950, p0=0.2445, reject_threshold=0.0100; adj_baseline=0.2587, p1=0.4995, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/81756#019ce730-9888-4de4-9d58-f8a63529e91c History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

Member

@nguyen-andrew nguyen-andrew left a comment


👍

-    _updates.emplace(update_node, v);
+    _updates.insert_or_assign(update_node, v);
Member


how is this normally updated? the commit message makes it sound like this change improves feature activation latency, but presumably there is a different slow path?

Contributor Author


The feature still gets activated, just on the next health report update, which triggers the callback in feature_manager::start. This is the continuation of the logs from the PR header:

DEBUG 2026-03-09 06:21:24,172 [shard 0:main] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=18
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=2 version=18
INFO  2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:640 - Auto-activating feature group_based_authorization (logical version 18)
INFO  2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:640 - Auto-activating feature user_based_client_quota (logical version 18)

Member


i guess i meant: how does the update ever happen if emplace is used? does _updates get cleared first?

Contributor Author


Yes, the std::exchange on L548 inside do_maybe_update_active_version clears _updates, which is why emplace still eventually works.

Contributor Author


@dotnwat I am going to merge this in because I believe this is a clarifying question from you rather than a concern, but let me know if you see any concerns, and I'm happy to course-correct

Member


whoops, yes thanks merging is the right thing to do. i should have made it clear it was a clarifying comment. thanks!

@pgellert pgellert requested a review from dotnwat March 13, 2026 15:10
@pgellert pgellert merged commit 8a3e2e5 into redpanda-data:dev Mar 17, 2026
26 checks passed
@vbotbuildovich
Collaborator

/backport v25.3.x

@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29824-v25.3.x-392 remotes/upstream/v25.3.x
git cherry-pick -x 1ea6e4e51e c77c86edde

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29824-v25.2.x-621 remotes/upstream/v25.2.x
git cherry-pick -x 1ea6e4e51e c77c86edde

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29824-v25.1.x-833 remotes/upstream/v25.1.x
git cherry-pick -x 1ea6e4e51e c77c86edde

Workflow run logs.
