Commit efe044d
authored
fix(shard distributor): remove heartbeat write cooldown (#7513)
<!-- Describe what has changed in this PR -->
**What changed?**
- Removed `_heartbeatRefreshRate` and the early-return block in
service/sharddistributor/handler/executor.go, so executor heartbeats are
always persisted.
- Updated TestHeartbeat in
service/sharddistributor/handler/executor_test.go: replaced the
"within/after refresh rate" cases that existed before the removal
of `_heartbeatRefreshRate` with a single test that expects
`RecordHeartbeat` to be called on a second heartbeat with the same
status.
<!-- Tell your future self why have you made these changes -->
**Why?**
- Previously, the handler skipped writing a heartbeat when all of the
following held:
  - the migration mode was 'onboarding',
  - a previous heartbeat existed and its status was unchanged,
  - and, most importantly, the new heartbeat arrived within
`_heartbeatRefreshRate` (2s) of the last one.
- This meant the stored LastHeartbeat in etcd could lag behind the real
heartbeat rate. With the heartbeat TTL set to the same value (2s),
healthy executors could be misclassified as stale and removed, which
matched what we saw in the canary (executors disappearing from GetState,
and shards "collapsing" onto only a few executors).
- Removing the write cooldown gives simpler and safer behavior:
  - executors heartbeat roughly once a second (in development),
  - every heartbeat is persisted,
  - the staleness check is based on the real heartbeat frequency and can
no longer be undermined by a separate setting elsewhere in the code that
gates writes.
Two alternatives were considered instead of removing the check and
cooldown:
- increasing the heartbeat TTL (e.g., to 5–10s) while keeping the
cooldown, or
- decreasing `_heartbeatRefreshRate` (e.g., to 1s).
Both alternatives would reduce the chance of misclassifying healthy
executors as stale, but they keep a hidden coupling between the
heartbeat TTL and the write cooldown. Removing the cooldown entirely
makes the behavior easier to reason about and avoids this subtle
configuration pitfall.
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
- Updated and ran unit tests in
service/sharddistributor/handler/executor_test.go (TestHeartbeat).
- Ran the sharddistributor canary with multiple executors and observed
stable executor counts and shard distribution.
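The updated TestHeartbeat expectation can be sketched as follows. The real test uses the mocks in executor_test.go; `fakeStore` and `handler` here are hypothetical stand-ins that only demonstrate the new expectation (a second heartbeat with the same status is still persisted):

```go
package main

import "fmt"

// fakeStore is a hypothetical stand-in for the storage mock, counting
// RecordHeartbeat calls.
type fakeStore struct{ recordCalls int }

func (s *fakeStore) RecordHeartbeat(executorID, status string) { s.recordCalls++ }

// handler sketches the post-change behavior: no cooldown, so every
// heartbeat is written through to the store.
type handler struct{ store *fakeStore }

func (h *handler) Heartbeat(executorID, status string) {
	h.store.RecordHeartbeat(executorID, status)
}

func main() {
	h := &handler{store: &fakeStore{}}
	h.Heartbeat("executor-1", "ACTIVE")
	h.Heartbeat("executor-1", "ACTIVE") // same status, still persisted
	fmt.Println(h.store.recordCalls)    // both heartbeats recorded
}
```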
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
- Higher etcd write load for heartbeats: we now persist every heartbeat
instead of at most one per `_heartbeatRefreshRate` window.
- Adopters without external rate limiting may see more frequent
heartbeat writes than before. They should be aware that this change
removes a local write throttle, so they need to rely on their own rate
limiting.
<!-- Is it notable for release? e.g. schema updates, configuration or
data migration required? If so, please mention it, and also update
CHANGELOG.md -->
**Release notes**
<!-- Is there any documentation updates should be made for config,
https://cadenceworkflow.io/docs/operation-guide/setup/ ? If so, please
open an PR in https://github.com/cadence-workflow/cadence-docs -->
**Documentation Changes**
Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
File tree: 2 files changed (+12, −52 lines) in service/sharddistributor/handler