cluster/feature_manager: use insert_or_assign for version updates #29824
pgellert merged 2 commits into `redpanda-data:dev`
Conversation
emplace is a no-op when the key already exists, so a stale version entry can shadow a newer one queued before the loop drains _updates. This delays feature activation by an extra health monitor tick (~10s) because the controller must wait for the next health collection to re-learn the node's version.
The alter helper uses wait_until with a 10s timeout, but during a rolling upgrade user_based_client_quota activation is gated on all nodes reporting the new version. The controller collects node health on a 10s tick (health_monitor_tick_interval), so a single slow node restart can exhaust the timeout. Increase to 30s.
Pull request overview
This PR addresses flakiness in quota upgrade tests by making version update propagation more reliable and increasing the retry timeout to better match the system’s health-reporting cadence during upgrades.
Changes:
- Increase the retry timeout for `rpk.alter_cluster_quotas` when running with retries (10s → 30s).
- Fix `feature_manager` node version update queuing to overwrite stale entries by switching from `emplace` to `insert_or_assign`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tests/rptest/tests/quota_management_test.py` | Raises the retry timeout to reduce upgrade-related flakiness when cluster state takes longer to stabilize. |
| `src/v/cluster/feature_manager.cc` | Ensures newer node version updates replace older queued updates, avoiding delayed active-version advancement. |
CI test results: test results on build #81756
```diff
-    _updates.emplace(update_node, v);
+    _updates.insert_or_assign(update_node, v);
```
how is this normally updated? the commit message makes it sound like this change improves feature activation latency, but presumably there is a different slow path?
The feature still gets activated, just on the next health report update, which triggers the callback in `feature_manager::start`. This is the continuation of the logs from the PR header:
DEBUG 2026-03-09 06:21:24,172 [shard 0:main] cluster - feature_manager.cc:529 - update_node_version: enqueuing update node=2 version=18
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:413 - Waiting for enterprise license to be verified before checking for active version updates...
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:420 - Checking for active version update...
DEBUG 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:557 - Processing update node=2 version=18
INFO 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:640 - Auto-activating feature group_based_authorization (logical version 18)
INFO 2026-03-09 06:21:24,172 [shard 0:clus] cluster - feature_manager.cc:640 - Auto-activating feature user_based_client_quota (logical version 18)
i guess i meant: how does the update ever happen if `emplace` is used? does `_updates` get cleared first?
Yes, the `std::exchange` on L548 inside `do_maybe_update_active_version` clears `_updates`, which is why `emplace` still eventually works.
@dotnwat I am going to merge this in because I believe this is a clarifying question from you rather than a concern, but let me know if you see any concerns, and I'm happy to course-correct
whoops, yes thanks merging is the right thing to do. i should have made it clear it was a clarifying comment. thanks!
/backport v25.3.x

/backport v25.2.x

/backport v25.1.x

Failed to create a backport PR to v25.3.x branch. I tried:

Failed to create a backport PR to v25.2.x branch. I tried:

Failed to create a backport PR to v25.1.x branch. I tried:
`QuotaManagementUpgradeTest.test_quotas_during_upgrade` was flaky because it can take longer than 10s for the logical version to update after the restart and for the user quotas feature to activate, which then surfaces as a timeout for this test.

There are 2 issues here that are addressed by this PR:
1. The 10s timeout is generally fairly low because the health reports are collected on a 10s interval, so there is a non-zero chance of hitting this 10s timeout.
2. There was a bug in the `feature_manager`'s handling of node version updates: the logic was using `emplace` instead of `insert_or_assign`, which meant that node version updates could be delayed further if a stale update had been queued earlier.
Fixes https://redpandadata.atlassian.net/browse/CORE-15478
Backports Required
Release Notes
Improvements