cluster/feature_manager: use insert_or_assign for version updates#29824

Merged
pgellert merged 2 commits into redpanda-data:dev from pgellert:fix/alter-flake-after-upgrade
Mar 17, 2026
Conversation

@pgellert
Contributor

QuotaManagementUpgradeTest.test_quotas_during_upgrade was flaky because it can take longer than 10s after a restart for the logical version to update and for the user quotas feature to activate, which then surfaces as a timeout in this test.

This PR addresses two issues:

  1. The 10s timeout is generally too low: health reports are collected on a 10s interval, so there is a non-zero chance of exceeding the timeout even when nothing is wrong.

  2. The feature_manager's handling of node version updates had a bug: the logic used emplace instead of insert_or_assign, which meant that a node's version update could be delayed further if a stale update for that node had been queued earlier.

INFO  2026-03-09 06:21:03,170 [shard 0:clus] cluster - feature_manager.cc:103 - Starting...
DEBUG 2026-03-09 06:21:03,170 [shard 0:clus] cluster - feature_manager.cc:143 - Controller leader notification term 1
DEBUG 2026-03-09 06:21:03,170 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:03,174 [shard 0:main] cluster - feature_manager.cc:356 - Verifying enterprise license...
INFO  2026-03-09 06:21:03,174 [shard 0:main] cluster - feature_manager.cc:395 - Verifying enterprise license: active_version=17, latest_version=18, enterprise_features=[core_balancing_continuous, cloud_storage, partition_auto_balancing_continuous], license_missing_or_expired=false
DEBUG 2026-03-09 06:21:03,174 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:03,400 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=17 *** <- inserted into _updates
DEBUG 2026-03-09 06:21:03,400 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=1 version=18
DEBUG 2026-03-09 06:21:06,787 [shard 0:rr  ] cluster - feature_manager.cc:143 - Controller leader notification term 3
DEBUG 2026-03-09 06:21:13,446 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=18 *** <- this is ignored by .emplace
DEBUG 2026-03-09 06:21:13,446 [shard 0:admi] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=1 version=18
DEBUG 2026-03-09 06:21:13,912 [shard 0:rs  ] cluster - feature_manager.cc:143 - Controller leader notification term 3
DEBUG 2026-03-09 06:21:15,545 [shard 0:rs  ] cluster - feature_manager.cc:143 - Controller leader notification term 5
DEBUG 2026-03-09 06:21:15,545 [shard 0:rs  ] cluster - feature_manager.cc:161 - generating version update for controller leader 1 (18)
DEBUG 2026-03-09 06:21:15,545 [shard 0:rs  ] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=1 version=18
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=1 version=18
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=2 version=17
DEBUG 2026-03-09 06:21:15,546 [shard 0:clus] cluster - feature_manager.cc:602 - Can't update active version to 18 because node 2 version is too low (17)

Fixes https://redpandadata.atlassian.net/browse/CORE-15478

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Improvements

  • Minor improvement to speed up feature activation after major version upgrades.

emplace is a no-op when the key already exists, so a stale version
entry can shadow a newer one queued before the loop drains _updates.
This delays feature activation by an extra health monitor tick (~10s)
because the controller must wait for the next health collection to
re-learn the node's version.

The alter helper uses wait_until with a 10s timeout, but during a
rolling upgrade user_based_client_quota activation is gated on all
nodes reporting the new version. The controller collects node health
on a 10s tick (health_monitor_tick_interval), so a single slow node
restart can exhaust the timeout. Increase to 30s.
@pgellert pgellert requested a review from a team March 13, 2026 12:20
@pgellert pgellert self-assigned this Mar 13, 2026
@pgellert pgellert requested review from BenPope and Copilot and removed request for a team March 13, 2026 12:20
@pgellert pgellert requested review from nguyen-andrew and removed request for BenPope March 13, 2026 12:20
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses flakiness in quota upgrade tests by making version update propagation more reliable and increasing the retry timeout to better match the system’s health-reporting cadence during upgrades.

Changes:

  • Increase the retry timeout for rpk.alter_cluster_quotas when running with retries (10s → 30s).
  • Fix feature_manager node version update queuing to overwrite stale entries by switching from emplace to insert_or_assign.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • tests/rptest/tests/quota_management_test.py: raises the retry timeout to reduce upgrade-related flakiness when cluster state takes longer to stabilize.
  • src/v/cluster/feature_manager.cc: ensures newer node version updates replace older queued updates, avoiding delayed active-version advancement.

@vbotbuildovich
Collaborator

CI test results

Test results on build #81756:

  • CloudTopicsL0GCNodeFailureTest.test_node_failure_mid_gc {"cloud_storage_type": 1} (integration): FLAKY, passed 10/11. Test passes after retries; no significant increase in flaky rate (baseline=0.0382, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1104, p1=0.3105, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/81756#019ce730-9884-42c6-ac70-d8650f3fcbc5 History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCNodeFailureTest&test_method=test_node_failure_mid_gc
  • WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration): FLAKY, passed 8/11. Test passes after retries; no significant increase in flaky rate (baseline=0.0950, p0=0.2445, reject_threshold=0.0100; adj_baseline=0.2587, p1=0.4995, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/81756#019ce730-9888-4de4-9d58-f8a63529e91c History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

Member

@nguyen-andrew nguyen-andrew left a comment


👍

-    _updates.emplace(update_node, v);
+    _updates.insert_or_assign(update_node, v);
Member


how is this normally updated? the commit message makes it sound like this change improves feature activation latency, but presumably there is a different slow path?

Contributor Author


The feature still gets activated, just on the next health report update, which triggers the callback in feature_manager::start. This is the continuation of the logs from the PR header:

DEBUG 2026-03-09 06:21:24,172 [shard 0:main] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=18
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=2 version=18
INFO  2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:640 - Auto-activating feature group_based_authorization (logical version 18)
INFO  2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:640 - Auto-activating feature user_based_client_quota (logical version 18)

Member


i guess i meant: how does the update ever happen if emplace is used? does _updates get cleared first?

Contributor Author


Yes, the std::exchange on L548 inside do_maybe_update_active_version clears _updates, which is why emplace still eventually works.

Contributor Author


@dotnwat I am going to merge this in because I believe this is a clarifying question from you rather than a concern, but let me know if you see any concerns, and I'm happy to course-correct

Member


whoops, yes thanks merging is the right thing to do. i should have made it clear it was a clarifying comment. thanks!

@pgellert pgellert requested a review from dotnwat March 13, 2026 15:10
@pgellert pgellert merged commit 8a3e2e5 into redpanda-data:dev Mar 17, 2026
26 checks passed
@vbotbuildovich
Collaborator

/backport v25.3.x

@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29824-v25.3.x-392 remotes/upstream/v25.3.x
git cherry-pick -x 1ea6e4e51e c77c86edde

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29824-v25.2.x-621 remotes/upstream/v25.2.x
git cherry-pick -x 1ea6e4e51e c77c86edde

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29824-v25.1.x-833 remotes/upstream/v25.1.x
git cherry-pick -x 1ea6e4e51e c77c86edde

Workflow run logs.
