docs/tiered-storage-pvc-resize.md
56 additions & 30 deletions
@@ -7,70 +7,95 @@ with the rolling upgrade machinery so only one broker is affected at a time.
 
 ---
 
-## Annotations
+## State tracking
 
-Two annotations are written on PVC objects to carry state across reconcile cycles.
-They survive reconciler restarts, making every step re-entrant.
+Resize state is stored in the `KafkaCluster` CR status under
+`status.brokersState[<brokerId>].cacheVolumeStates`, keyed by mount path.
+This keeps the KafkaCluster CR the single source of truth for all in-flight
+broker operations and avoids a second, parallel state store on PVC objects.
 
-| Annotation | Value | Written on | Meaning |
-|------------|-------|------------|---------|
-| `koperator.adobe.com/cache-resize-state` | `pending-deletion` | Old PVC | Being replaced; excluded from pod spec; deleted once broker pod stops |
-| `koperator.adobe.com/cache-resize-state` | `replacement` | New PVC | Replacement PVC; rolling upgrade must complete before annotations are stripped |
-| `koperator.adobe.com/replaces-pvc` | `<old-pvc-name>` | New PVC | Traceability — records which PVC is being replaced |
+| Field | Value | Meaning |
+|-------|-------|---------|
+| `status.brokersState[N].cacheVolumeStates[<mountPath>]` | `pending-deletion` | A resize is in flight for this mount path. The old PVC (larger size) is waiting to be deleted once the broker pod stops; the replacement PVC (desired smaller size) has already been created. |
+
+The entry is cleared once the old PVC has been deleted and the broker pod has
+restarted. An empty map means no resize is in progress.
+
+Two PVC annotations that describe what a PVC **is** (not operational state) are
+always present on cache PVCs:
+
+| Annotation | Value | Purpose |
+|------------|-------|---------|
+| `mountPath` | `<path>` | Used throughout reconcile logic to match PVCs to storage configs |
+| `tieredStorageCache` | `"true"` | Identifies cache PVCs for special handling: skipped from `log.dirs` and CC capacity config |
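
For readers following the state-tracking change, here is a minimal Go sketch of how a per-broker `cacheVolumeStates` map keyed by mount path could be modelled. The type and method names (`CacheVolumeState`, `BrokerState`, `MarkPendingDeletion`, `ClearResize`, `ResizeInProgress`) and the `/kafka-cache` path are assumptions made for illustration; the diff does not show the operator's actual Go types.

```go
package main

import "fmt"

// CacheVolumeState is a hypothetical string type for the per-mount-path resize state.
type CacheVolumeState string

// CacheVolumePendingDeletion is the only value the document describes.
const CacheVolumePendingDeletion CacheVolumeState = "pending-deletion"

// BrokerState is a trimmed-down, hypothetical stand-in for one entry of
// status.brokersState; only the field discussed in the document is modelled.
type BrokerState struct {
	// CacheVolumeStates is keyed by mount path. A nil or empty map means
	// no resize is in progress.
	CacheVolumeStates map[string]CacheVolumeState `json:"cacheVolumeStates,omitempty"`
}

// MarkPendingDeletion records that a resize is in flight for mountPath.
func (b *BrokerState) MarkPendingDeletion(mountPath string) {
	if b.CacheVolumeStates == nil {
		b.CacheVolumeStates = map[string]CacheVolumeState{}
	}
	b.CacheVolumeStates[mountPath] = CacheVolumePendingDeletion
}

// ClearResize removes the entry for mountPath. Calling it twice is harmless,
// which keeps the step re-entrant.
func (b *BrokerState) ClearResize(mountPath string) {
	delete(b.CacheVolumeStates, mountPath)
}

// ResizeInProgress reports whether any mount path has an in-flight resize.
func (b *BrokerState) ResizeInProgress() bool {
	return len(b.CacheVolumeStates) > 0
}

func main() {
	b := &BrokerState{}
	b.MarkPendingDeletion("/kafka-cache")
	fmt.Println("in progress:", b.ResizeInProgress())
	b.ClearResize("/kafka-cache")
	fmt.Println("in progress:", b.ResizeInProgress())
}
```

Keying by mount path rather than PVC name means the entry stays valid while both the old and the replacement PVC exist for the same path.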
 
 ---
 
 ## Resize flow
 
 ### Cycle N — resize detected, pod running
 
-1. The old PVC is annotated `pending-deletion`.
-2. A replacement PVC with the new (smaller) size is created and annotated `replacement`. Provisioning starts immediately.
-3. The broker's `ConfigurationState` is set to `ConfigOutOfSync` to trigger a rolling restart via `handleRollingUpgrade`.
-4. `handleRollingUpgrade` evaluates health gates (replica health, concurrent restart limit, rack awareness). If all pass the broker pod is deleted and the cycle requeues. If any gate fails the state is preserved in PVC annotations and retried next cycle.
+1. `status.brokersState[N].cacheVolumeStates[<mountPath>]` is set to `pending-deletion`
+in the KafkaCluster CR status. This is the durable record that a resize is in flight.
+2. A replacement PVC with the new (smaller) size is created. Provisioning starts immediately.
+3. The broker's `ConfigurationState` is set to `ConfigOutOfSync` to trigger a rolling restart
+via `handleRollingUpgrade`.
+4. `handleRollingUpgrade` evaluates health gates (replica health, concurrent restart limit,
+rack awareness). If all pass the broker pod is deleted and the cycle requeues. If any gate
+fails the state persists in the CR and is retried next cycle.
 
 ### Cycle N+1 — pod is absent
 
-A pod is considered absent when it either does not exist or has a non-nil `DeletionTimestamp` (Terminating). Treating a Terminating pod as absent allows cleanup to start during the pod's Terminating window rather than waiting for it to fully disappear from etcd.
+A pod is considered absent when it either does not exist or has a non-nil
+`DeletionTimestamp` (Terminating). Treating a Terminating pod as absent allows
+cleanup to start during the pod's Terminating window rather than waiting for it
+to fully disappear from etcd.
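
The absence rule above can be sketched in Go as a single predicate. This is an illustration under stated assumptions: `Pod` is a minimal stand-in for the Kubernetes pod type (where the timestamp lives in the object metadata), and `podAbsent` and `terminatingPod` are hypothetical helper names, not functions taken from the operator.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a minimal stand-in for a Kubernetes pod, holding only the field
// this check reads. It is not the real API type.
type Pod struct {
	Name              string
	DeletionTimestamp *time.Time
}

// podAbsent reports whether the broker pod counts as absent: either it does
// not exist (nil) or it is Terminating (non-nil DeletionTimestamp).
func podAbsent(pod *Pod) bool {
	return pod == nil || pod.DeletionTimestamp != nil
}

// terminatingPod builds an example pod that is already marked for deletion.
func terminatingPod(name string) *Pod {
	now := time.Now()
	return &Pod{Name: name, DeletionTimestamp: &now}
}

func main() {
	fmt.Println("missing pod absent:", podAbsent(nil))
	fmt.Println("terminating pod absent:", podAbsent(terminatingPod("kafka-0")))
	fmt.Println("live pod absent:", podAbsent(&Pod{Name: "kafka-0"}))
}
```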
 
-1. The pending-deletion PVC is deleted.
-2. A new broker pod is created referencing the replacement PVC. Because provisioning started in cycle N the PVC is likely already `Bound`, minimising startup latency.
+1. The old PVC (the one whose size differs from the desired size at that mount path)
+is deleted.
+2. The `cacheVolumeStates` entry for that mount path is cleared from the CR status.
+3. A new broker pod is created referencing the replacement PVC. Because provisioning
+started in cycle N the PVC is likely already `Bound`, minimising startup latency.
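
Because the old PVC no longer carries an annotation, step 1 has to identify it by comparing sizes at the mount path. A rough Go sketch of that lookup follows; the `PVC` struct, the `findOldPVC` name, and the example PVC names and sizes are all hypothetical and only illustrate the selection rule stated in the list.

```go
package main

import "fmt"

// PVC is a hypothetical, minimal stand-in for a PersistentVolumeClaim.
type PVC struct {
	Name      string
	MountPath string // value of the mountPath annotation
	SizeBytes int64  // requested storage size
}

// findOldPVC returns the PVC at mountPath whose size differs from
// desiredBytes. It returns false when no such PVC exists, for example when
// the old PVC was already deleted in an earlier attempt.
func findOldPVC(pvcs []PVC, mountPath string, desiredBytes int64) (PVC, bool) {
	for _, p := range pvcs {
		if p.MountPath == mountPath && p.SizeBytes != desiredBytes {
			return p, true
		}
	}
	return PVC{}, false
}

func main() {
	const gi = int64(1) << 30
	pvcs := []PVC{
		{Name: "cache-old", MountPath: "/kafka-cache", SizeBytes: 100 * gi},
		{Name: "cache-new", MountPath: "/kafka-cache", SizeBytes: 50 * gi},
		{Name: "logs", MountPath: "/kafka-logs", SizeBytes: 500 * gi},
	}
	if old, ok := findOldPVC(pvcs, "/kafka-cache", 50*gi); ok {
		fmt.Println("delete:", old.Name)
	}
}
```

Returning `false` when nothing matches lets a repeated reconcile skip the delete instead of failing, which fits the re-entrancy the document asks for.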
 
 ### Cycle N+2 — pod is present again
 
-The strip fires as soon as a non-Terminating pod exists for the broker and no pending-deletion PVC remains — the pod does not need to be fully Running.
-
-1. No pending-deletion PVC remains and the replacement PVC exists → resize is complete.
-2. The `cache-resize-state` and `replaces-pvc` annotations are stripped from the replacement PVC, which becomes an ordinary PVC from this point forward.
+1. No `cacheVolumeStates` entry remains for the mount path → resize is complete.
+2. The replacement PVC is now an ordinary cache PVC with no special state attached.
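
Taken together, the three cycles amount to a small decision table over what the reconciler observes. The Go sketch below is a simplification for illustration only: `Observation` and `nextSteps` are hypothetical names, the steps are returned as plain strings rather than executed, and the case of a shrink being detected while the pod is already absent is not treated separately.

```go
package main

import "fmt"

// Observation is a hypothetical summary of what one reconcile cycle sees for
// a single broker and mount path.
type Observation struct {
	PodAbsent      bool // pod missing or Terminating
	ResizePending  bool // cacheVolumeStates[mountPath] is pending-deletion
	ShrinkDetected bool // an existing PVC is larger than the desired size
}

// nextSteps maps an observation to the steps the document lists for that cycle.
func nextSteps(o Observation) []string {
	switch {
	case o.ResizePending && o.PodAbsent:
		// Cycle N+1: the pod is gone, so the old PVC can be removed.
		return []string{"delete old PVC", "clear cacheVolumeStates entry", "create broker pod"}
	case o.ResizePending:
		// Pod still up: health gates have not all passed yet, so retry.
		return []string{"ensure ConfigOutOfSync", "handleRollingUpgrade"}
	case o.ShrinkDetected:
		// Cycle N: record state first, then start provisioning.
		return []string{"set cacheVolumeStates entry", "create replacement PVC", "set ConfigOutOfSync", "handleRollingUpgrade"}
	default:
		// Cycle N+2 and steady state: nothing left to do.
		return nil
	}
}

func main() {
	fmt.Println(nextSteps(Observation{ShrinkDetected: true}))
	fmt.Println(nextSteps(Observation{ResizePending: true, ShrinkDetected: true}))
	fmt.Println(nextSteps(Observation{ResizePending: true, PodAbsent: true, ShrinkDetected: true}))
	fmt.Println(len(nextSteps(Observation{})))
}
```

Checking `ResizePending` before `ShrinkDetected` matters in this sketch: while the old PVC still exists the size mismatch is still visible, and the pending entry is what stops a second replacement PVC from being created.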
 
 ---
 
 ## Grow vs shrink
 
-A cache PVC **grow** takes the normal Kubernetes in-place expansion path: the PVC spec is updated with the larger size and Kubernetes expands the volume without a pod restart (requires `allowVolumeExpansion: true` on the StorageClass). No annotations are written and no rolling restart is triggered.
+A cache PVC **grow** takes the normal Kubernetes in-place expansion path: the PVC
+spec is updated with the larger size and Kubernetes expands the volume without a
+pod restart (requires `allowVolumeExpansion: true` on the StorageClass). No
+`cacheVolumeStates` entry is written and no rolling restart is triggered.
 
-A cache PVC **shrink** uses the delete-and-recreate flow described above. Shrinking is only supported for tiered storage cache volumes — regular Kafka log volumes reject any size decrease.
+A cache PVC **shrink** uses the delete-and-recreate flow described above.
+Shrinking is only supported for tiered storage cache volumes — regular Kafka log
+volumes reject any size decrease with an error.
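
The grow/shrink split can be expressed as one classification function. The Go sketch below is an assumption-laden illustration: `ResizeAction`, `planResize`, and the error text are invented names, sizes are plain byte counts rather than Kubernetes quantities, and growth of regular log volumes is assumed to take the same in-place path, which the document states only for cache PVCs.

```go
package main

import (
	"errors"
	"fmt"
)

// ResizeAction is a hypothetical enumeration of the paths the document describes.
type ResizeAction int

const (
	NoChange        ResizeAction = iota // sizes already match
	ExpandInPlace                       // grow: update the PVC spec, no restart
	RecreateSmaller                     // shrink: delete-and-recreate flow
)

// planResize picks the resize path for one volume.
func planResize(currentBytes, desiredBytes int64, tieredStorageCache bool) (ResizeAction, error) {
	switch {
	case desiredBytes == currentBytes:
		return NoChange, nil
	case desiredBytes > currentBytes:
		return ExpandInPlace, nil
	case tieredStorageCache:
		return RecreateSmaller, nil
	default:
		// Regular Kafka log volumes reject any size decrease.
		return NoChange, errors.New("shrinking is only supported for tiered storage cache volumes")
	}
}

func main() {
	const gi = int64(1) << 30
	fmt.Println(planResize(50*gi, 100*gi, true))  // grow a cache volume
	fmt.Println(planResize(100*gi, 50*gi, true))  // shrink a cache volume
	fmt.Println(planResize(100*gi, 50*gi, false)) // shrink a log volume
}
```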
 
 ---
 
 ## Properties of this design
 
 | Property | Value |
 |----------|-------|
-| State survives reconciler crash | Mostly — PVC annotations are durable in etcd; the one non-re-entrant window is between annotating the old PVC and creating the replacement, but `ConfigOutOfSync` set in that cycle persists in broker status so the rolling upgrade still proceeds |
-| Atomicity gap | Eliminated — new PVC is created before old is deleted |
-| Provisioning overlaps gate evaluation | Yes — new PVC created in cycle N, not N+1 |
-| Observable via kubectl | Yes — `kubectl get pvc -o yaml` shows resize state directly |
-| ConfigOutOfSync overloading | Reduced — `ConfigOutOfSync` still used, but the *reason* is legible in PVC annotations |
-| CC disk rebalance for cache PVCs | Fixed — tiered cache PVCs are explicitly excluded from `GracefulDiskRebalanceRequired` logic |
+| State survives reconciler crash | Yes — `cacheVolumeStates` is written to the KafkaCluster CR (etcd) before the replacement PVC is created; every step is re-entrant |
+| Single source of truth | Yes — all broker state (configuration, graceful actions, cache resize) lives in `status.brokersState` |
+| Atomicity gap | Eliminated — replacement PVC is created before old is deleted |
+| Provisioning overlaps gate evaluation | Yes — replacement PVC created in cycle N, not N+1 |
+| Observable via kubectl | Yes — `kubectl get kafkacluster <name> -o jsonpath='{.status.brokersState}'` shows resize state; an empty `cacheVolumeStates` means no resize is in progress |
+| CC disk rebalance for cache PVCs | Excluded — tiered cache PVCs are explicitly skipped from `GracefulDiskRebalanceRequired` and CC capacity config |
+| `log.dirs` for cache PVCs | Excluded — `generateStorageConfig` skips volumes with `TieredStorageCache: true` |
 
 ---
 
 ## Sequence diagram
 
 ```
 Cycle N (pod UP, resize detected)
-├─ annotate old PVC: pending-deletion
+├─ set cacheVolumeStates[mountPath] = pending-deletion in CR status
 ├─ create replacement PVC (provisioning starts)
 ├─ set ConfigOutOfSync
 └─ handleRollingUpgrade
@@ -81,9 +106,10 @@ Cycle N+k (pod UP, gates failing — any number of cycles)
 └─ ensure ConfigOutOfSync, requeue
 
 Cycle N+k+1 (pod ABSENT — gone or Terminating)
-├─ delete pending-deletion PVC
+├─ delete old PVC (identified as the PVC at mountPath whose size ≠ desired)
+├─ clear cacheVolumeStates[mountPath] from CR status
 └─ create new pod bound to replacement PVC
 
 Cycle N+k+2 (pod PRESENT — non-Terminating, not necessarily Running)