
feat(actions): Add cloudnativepg reload, restart, promote, suspend and resume actions #24192

Merged
blakepettersson merged 32 commits into argoproj:master from rouke-broersma:cloudnativepg-actions
Dec 10, 2025

Conversation

@rouke-broersma
Contributor

@rouke-broersma rouke-broersma commented Aug 18, 2025

Improves custom action result normalization to also check the apiVersion group. This reduces the chance that a Kind with the same name but a different API group is accidentally normalized in the resource actions tests.
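
To illustrate the idea of keying normalization on group plus kind rather than kind alone, here is a minimal Python sketch. It is not the Go code from util/lua/custom_actions_test.go; the `NORMALIZERS` table, the helper names and the `example.test/timestamp` annotation key are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Resource = dict
Normalizer = Callable[[Resource], None]


def group_kind(resource: Resource) -> Tuple[str, str]:
    """Return (API group, kind); the core group ("v1") maps to an empty group."""
    api_version = resource.get("apiVersion", "")
    group = api_version.rsplit("/", 1)[0] if "/" in api_version else ""
    return group, resource.get("kind", "")


def strip_annotation(key: str) -> Normalizer:
    """Build a normalizer that removes one (hypothetical) volatile annotation."""
    def _normalize(resource: Resource) -> None:
        resource.get("metadata", {}).get("annotations", {}).pop(key, None)
    return _normalize


# Keyed on (group, kind), so a "Cluster" from another API group is left untouched.
NORMALIZERS: Dict[Tuple[str, str], Normalizer] = {
    ("postgresql.cnpg.io", "Cluster"): strip_annotation("example.test/timestamp"),
}


def normalize(resource: Resource) -> Resource:
    fn = NORMALIZERS.get(group_kind(resource))
    if fn is not None:
        fn(resource)
    return resource
```

With a kind-only lookup, both a `postgresql.cnpg.io` Cluster and any other Cluster kind would hit the same normalizer; with the tuple key only the intended one does.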

Adds specific actions for cloudnativepg that trigger the operator to execute an operational task that should not be governed by gitops. The following tasks are added:

  • Reload - instructs the cloudnativepg operator to check that all Cluster child resources are still up to date
  • Restart - effectively kubectl rollout restart, but for the Cluster CRD
  • Promote - starts promoting one of the healthy standby replicas to primary. The target can be given in one of three ways: the full replica pod name, the replica instance number, or any, which selects the next available instance number automatically
  • Suspend - suspends resource reconciliation by the operator
  • Resume - resumes resource reconciliation by the operator
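
The three ways of choosing a promotion target can be sketched roughly as follows. This is Python for illustration only, not the Lua action from this PR. The function names are hypothetical, the `<cluster>-<number>` pod naming is taken from the status examples in this description, and treating any as "lowest-numbered healthy replica" is an assumption.

```python
def instance_number(pod_name: str) -> int:
    """Extract the trailing instance number from a pod name like 'test-cluster-2'."""
    return int(pod_name.rsplit("-", 1)[1])


def select_promotion_target(status: dict, cluster_name: str, requested: str) -> str:
    """Resolve a full pod name, an instance number, or 'any' to a healthy replica."""
    primary = status.get("currentPrimary")
    healthy = (status.get("instancesStatus") or {}).get("healthy") or []
    # Only healthy instances other than the current primary can be promoted.
    candidates = [name for name in healthy if name != primary]
    if not candidates:
        raise ValueError("no healthy replicas available to promote")
    if requested == "any":
        # Assumption: "next available" means the lowest-numbered candidate.
        return min(candidates, key=instance_number)
    target = f"{cluster_name}-{requested}" if requested.isdigit() else requested
    if target not in candidates:
        raise ValueError(f"{target} is not a healthy replica")
    return target
```

Rejecting anything that is not in the healthy list, including the current primary, keeps the error path explicit instead of silently picking a different instance.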

The cloudnativepg health check is extended to detect suspended reconciliation.
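
As a rough sketch of what such a check could look like (Python for illustration, not the actual health.lua): the annotation name `cnpg.io/reconciliationLoop` with value `disabled` and the phase string `Cluster in healthy state` are assumptions about the operator, and the fallback branch is simplified.

```python
# Assumed annotation used by the operator to pause reconciliation.
RECONCILIATION_ANNOTATION = "cnpg.io/reconciliationLoop"


def cluster_health(cluster: dict) -> dict:
    """Map a Cluster resource to an Argo CD style health result."""
    annotations = (cluster.get("metadata") or {}).get("annotations") or {}
    if annotations.get(RECONCILIATION_ANNOTATION) == "disabled":
        # Checked first so a paused cluster is never reported as simply healthy.
        return {"status": "Suspended", "message": "Cluster reconciliation is suspended"}
    phase = (cluster.get("status") or {}).get("phase", "")
    if phase == "Cluster in healthy state":  # assumed phase string
        return {"status": "Healthy", "message": phase}
    return {"status": "Progressing", "message": phase or "Waiting for status"}
```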

The promotion action is disabled if no healthy replicas are available to promote:

image

The promotion action is enabled if there are healthy replicas:

image
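
The enable/disable decision for the promote action could be sketched like this (Python for illustration, not the discovery.lua from this PR). The `{"disabled": bool}` shape mirrors what Argo CD discovery scripts return; the function name is hypothetical and the other actions are left out.

```python
def discover_promote(cluster: dict) -> dict:
    """Return the promote action entry, disabled when no healthy replica exists."""
    status = cluster.get("status") or {}
    primary = status.get("currentPrimary")
    healthy = (status.get("instancesStatus") or {}).get("healthy") or []
    # A replica is any healthy instance that is not the current primary.
    has_healthy_replica = any(name != primary for name in healthy)
    return {"promote": {"disabled": not has_healthy_replica}}
```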

Status before any promotion:

targetPrimary: test-cluster-1
targetPrimaryTimestamp: "2025-08-18T16:34:03Z"
currentPrimary: test-cluster-1
currentPrimaryTimestamp: "2025-08-18T16:34:08.311908Z"
instancesStatus:
  healthy:
    - test-cluster-1
    - test-cluster-2

Status during promotion:

targetPrimary: test-cluster-2
targetPrimaryTimestamp: "2025-08-18T16:41:38Z"
currentPrimary: test-cluster-1
currentPrimaryTimestamp: "2025-08-18T16:34:08.311908Z"
instancesStatus:
  healthy:
    - test-cluster-2
  replicating:
    - test-cluster-1

After promotion:

targetPrimary: test-cluster-2
targetPrimaryTimestamp: "2025-08-18T16:41:38Z"
currentPrimary: test-cluster-2
currentPrimaryTimestamp: "2025-08-18T16:41:42.441994Z"
instancesStatus:
  healthy:
    - test-cluster-1
    - test-cluster-2
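
Reading the three snapshots above, a promotion appears to be in flight exactly while targetPrimary and currentPrimary differ. A small Python sketch of that observation (the function name is hypothetical):

```python
def promotion_in_progress(status: dict) -> bool:
    """True while the requested primary has not yet become the current primary."""
    target = status.get("targetPrimary")
    current = status.get("currentPrimary")
    return bool(target) and bool(current) and target != current


before = {"targetPrimary": "test-cluster-1", "currentPrimary": "test-cluster-1"}
during = {"targetPrimary": "test-cluster-2", "currentPrimary": "test-cluster-1"}
after = {"targetPrimary": "test-cluster-2", "currentPrimary": "test-cluster-2"}
```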

The reload action triggers the operator to reconcile the cluster resources:

{"level":"info","ts":"2025-08-18T17:16:06.310914875Z","logger":"cluster-resource","msg":"Defaulting for Cluster","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:16:06.346904494Z","logger":"cluster-resource","msg":"Validation for Cluster upon update","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:17:03.739202478Z","logger":"cluster-resource","msg":"Defaulting for Cluster","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:17:03.769107314Z","logger":"cluster-resource","msg":"Validation for Cluster upon update","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:17:14.529150851Z","logger":"cluster-resource","msg":"Defaulting for Cluster","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:17:14.564404404Z","logger":"cluster-resource","msg":"Validation for Cluster upon update","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:17:19.913762103Z","logger":"cluster-resource","msg":"Defaulting for Cluster","version":"v1","name":"test-cluster","namespace":"test"}
{"level":"info","ts":"2025-08-18T17:17:19.945209921Z","logger":"cluster-resource","msg":"Validation for Cluster upon update","version":"v1","name":"test-cluster","namespace":"test"}

And the restart action triggers a rolling restart without a primary failover:

{"level":"info","ts":"2025-08-18T17:18:52.646611677Z","msg":"Pod rollout required","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"bc77ac22-9ae4-40ef-bef9-19110d3c64da","podName":"test-cluster-1","reason":"cluster has been explicitly restarted via annotation"}
{"level":"info","ts":"2025-08-18T17:18:52.667735942Z","msg":"Cluster has become unhealthy","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"bc77ac22-9ae4-40ef-bef9-19110d3c64da"}
{"level":"info","ts":"2025-08-18T17:18:52.667806284Z","msg":"Recreating instance pod","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"bc77ac22-9ae4-40ef-bef9-19110d3c64da","pod":"test-cluster-1","to":"ghcr.io/cloudnative-pg/postgresql:17.5","reason":"Restarting instance test-cluster-1, because: cluster has been explicitly restarted via annotation"}
{"level":"info","ts":"2025-08-18T17:18:54.254567272Z","msg":"Creating new Pod to reattach a PVC","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"f51265aa-3a73-402b-968f-0e8c8bc58648","pod":"test-cluster-1","pvc":"test-cluster-1"}
{"level":"info","ts":"2025-08-18T17:18:59.843779751Z","msg":"Setting replica label","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"fa372546-a72a-49ae-a5cb-aaed83e78097","pod":"test-cluster-1"}
{"level":"info","ts":"2025-08-18T17:19:07.404906306Z","msg":"Pod rollout required","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"119e45d3-1c8e-4fab-bd63-95b48bc54760","podName":"test-cluster-2","reason":"cluster has been explicitly restarted via annotation"}
{"level":"info","ts":"2025-08-18T17:19:07.404979398Z","msg":"Restarting primary instance without a switchover first","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"119e45d3-1c8e-4fab-bd63-95b48bc54760","primaryPod":"test-cluster-2","reason":"cluster has been explicitly restarted via annotation"}
{"level":"info","ts":"2025-08-18T17:19:07.426877123Z","msg":"Recreating instance pod","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"119e45d3-1c8e-4fab-bd63-95b48bc54760","pod":"test-cluster-2","to":"ghcr.io/cloudnative-pg/postgresql:17.5","reason":"cluster has been explicitly restarted via annotation"}
{"level":"info","ts":"2025-08-18T17:19:16.746789739Z","msg":"Setting primary label","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"a98b9a04-46e3-4c3f-b526-ba1725633422","pod":"test-cluster-2"}
{"level":"info","ts":"2025-08-18T17:19:16.786118784Z","msg":"Waiting for the Kubelet to refresh the readiness probe","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"a98b9a04-46e3-4c3f-b526-ba1725633422","mostAdvancedInstanceName":"test-cluster-2","hasHTTPStatus":true,"isPodReady":false}
{"level":"info","ts":"2025-08-18T17:19:23.475217292Z","msg":"All instances ready, will proceed","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"b2f95c7a-0da3-4992-87ad-ce777aa666fc","currentPrimary":"test-cluster-2","targetPrimary":"test-cluster-2"}
{"level":"info","ts":"2025-08-18T17:19:23.511452375Z","msg":"Cluster has become healthy","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"test-cluster","namespace":"test"},"namespace":"test","name":"test-cluster","reconcileID":"b2f95c7a-0da3-4992-87ad-ce777aa666fc"}
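
The logs above mention that the cluster was "explicitly restarted via annotation". A minimal Python sketch of building such a timestamp annotation patch follows; the annotation key `kubectl.kubernetes.io/restartedAt` is an assumption about what the operator watches, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed annotation key; a changed timestamp value is what signals a restart.
RESTART_ANNOTATION = "kubectl.kubernetes.io/restartedAt"


def restart_patch(now: Optional[datetime] = None) -> dict:
    """Build a metadata patch that stamps the Cluster with the current UTC time."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"metadata": {"annotations": {RESTART_ANNOTATION: stamp}}}
```

Because the value changes on every invocation, tests comparing action output have to normalize it, which is what the annotation normalization for Cluster in the action tests is for.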

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Title of the PR
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>

Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
…ions

Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
@codecov

codecov bot commented Aug 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.58%. Comparing base (1e9f4aa) to head (403cda9).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #24192      +/-   ##
==========================================
- Coverage   62.58%   62.58%   -0.01%     
==========================================
  Files         352      352              
  Lines       49759    49759              
==========================================
- Hits        31142    31141       -1     
- Misses      15640    15641       +1     
  Partials     2977     2977              


@rouke-broersma rouke-broersma marked this pull request as ready for review August 19, 2025 11:41
@rouke-broersma rouke-broersma requested review from a team as code owners August 19, 2025 11:41
Member

@blakepettersson blakepettersson left a comment

I don't have insights into cloudnativepg but from a healthcheck perspective it LGTM

@rouke-broersma
Contributor Author

@blakepettersson thanks! Do you want me to get someone from cloudnativepg to review this from their pov before this is merged?

@blakepettersson
Member

@rouke-broersma if you have/know someone that can chime in that'd be highly appreciated! If that's not (easily) possible I can still merge this PR

@rouke-broersma
Contributor Author

@sxd since you contributed the original health check and you're a member of cnpg I think you would be most suited for reviewing this pr. If you have the time I would really appreciate it if you could take a look!

@sxd
Contributor

sxd commented Oct 7, 2025

Hi @rouke-broersma
I'm more than happy to help with this! I'm currently traveling but by the end of the week I should be able to take a deep look into this!

@sxd
Contributor

sxd commented Oct 23, 2025

Hi @rouke-broersma

Sorry for the delay! I took a quick look and I have some questions/observations:
1.- Promote: this looks really good in the interface, but there's a risk of promoting an instance that is not aligned. This should be used very carefully. Is there a way to show an alert/warning to users before performing this action?
2.- Suspend/Resume: From a user perspective, this may give the wrong impression that the cluster is not working, when in fact it is working, just not being reconciled. This will definitely lead to data corruption if the cluster is unmanaged for a long time, so I would avoid having this option one click away. If the intention is to scale to zero, or just turn off the cluster, you can always use the hibernation option, which really turns everything off.
3.- Reload: This is better done using the cnpg.io/reload label set to true, since that will also reload the secrets and configmaps. You can set both; there's no harm in doing it.

I'm going to try to do more practical testing on this this week and I'll keep you posted!

Regards,

@rouke-broersma
Contributor Author

Hi @sxd

No worries, thanks for taking a look!

  1. The promote action should not allow promoting to an instance that cannot fulfill the role. Feel free to double check the code to see if I missed any cases where an unhealthy node might get promoted. When an instance is selected that is not valid or if no valid instances are available an error is shown to the user. You can find the tests and the expected errors here: https://github.com/argoproj/argo-cd/pull/24192/files#diff-b93f7f951bf844857337cab0b466d566dd069ef6e349d6687beadeeb7482336b
  2. When the cluster reconcile is suspended this will be reflected in argocd with a pause icon. This is done through the argocd health check here: https://github.com/argoproj/argo-cd/pull/24192/files#diff-7d643409c0eda3fcd7d01974db264041fbb51d55241f39399bd058f4edfbf60f. The reason I added this instead of hibernate is because imo hibernation is such a conscious action that it can always be done through gitops. Suspend reconciliation is something you might want to do as a quick action during an incident to make sure cnpg does not touch the database for whatever reason. I agree that a prolonged suspended reconciliation could be bad, but in that case I think cnpg should add a time-based suspension mechanism so a user can add an end-time or something like that and cnpg can automatically start reconciling again. In that case we can extend the action to take an input for the time the suspension should be active. I don't think we need to keep this functionality from the user just because it might be bad, and I think the resource health gives the user sufficient information.
  3. I followed what the kubectl plugin does: https://github.com/cloudnative-pg/cloudnative-pg/blob/main/internal/cmd/plugin/reload/reload.go I think this should also reload secrets and configmaps? In any case labels imo should not be modified by actions as they influence selectors, they always belong to the user and should usually be object lifetime stable. Annotations are I think the appropriate structure to add time-based metadata.

@rouke-broersma
Contributor Author

@sxd have you been able to test out the custom actions? :)

Copilot AI review requested due to automatic review settings December 10, 2025 08:37
@rouke-broersma rouke-broersma requested a review from a team as a code owner December 10, 2025 08:37

Copilot AI left a comment

Pull request overview

This PR adds custom actions for CloudNativePG Cluster resources, enabling operational tasks like reload, restart, promote, suspend, and resume directly from ArgoCD. It also improves the test normalizer to use apiVersion group disambiguation to prevent Kind name collisions, and extends the health check to detect reconciliation suspension.

  • Adds five new CloudNativePG Cluster actions (reload, restart, promote, suspend, resume) with conditional availability based on cluster state
  • Refactors test normalizer from kind-only to group+kind-based resource identification
  • Extends health check to detect and report suspended reconciliation state

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.

Summary per file
File | Description
util/lua/custom_actions_test.go | Refactors normalizer to use API group disambiguation; adds CloudNativePG Cluster normalization for action annotations
resource_customizations/postgresql.cnpg.io/Cluster/health.lua | Removes duplicate status entry; adds reconciliation suspension detection
resource_customizations/postgresql.cnpg.io/Cluster/health_test.yaml | Adds test case for suspended reconciliation state
resource_customizations/postgresql.cnpg.io/Cluster/actions/discovery.lua | Implements action discovery with conditional promotion based on healthy replicas
resource_customizations/postgresql.cnpg.io/Cluster/actions/*/action.lua | Implements reload, restart, promote, suspend, and resume actions
resource_customizations/postgresql.cnpg.io/Cluster/actions/action_test.yaml | Adds comprehensive action tests for all new CloudNativePG actions
resource_customizations/postgresql.cnpg.io/Cluster/actions/testdata/*.yaml | Provides test data for action validation
resource_customizations/postgresql.cnpg.io/Cluster/testdata/cluster_reconcile_suspended.yaml | Test data for health check with suspended reconciliation
docs/operator-manual/resource_actions_builtin.md | Documents the five new CloudNativePG Cluster actions

Signed-off-by: Rouke Broersma <rouke.broersma@infosupport.com>
…hy instance as next primary

Signed-off-by: Rouke Broersma <rouke.broersma@infosupport.com>
Signed-off-by: Rouke Broersma <rouke.broersma@infosupport.com>
@rouke-broersma
Contributor Author

@blakepettersson I've addressed the feedback by @sxd and there hasn't been any new feedback so I consider this ready to merge. We can always adjust later in a patch if necessary based on user feedback. Better to get it out there so we can get feedback from more people imo.

@sxd
Contributor

sxd commented Dec 10, 2025

Sorry @rouke-broersma I was able to test only half of it but it should be ok, will ask some users for more feedback on this

@blakepettersson
Member

@rouke-broersma sounds good to me! Thanks for your patience!

@blakepettersson blakepettersson merged commit e50dd00 into argoproj:master Dec 10, 2025
28 checks passed
yuehaii pushed a commit to yuehaii/argo-cd that referenced this pull request Dec 11, 2025
…d resume actions (argoproj#24192)

Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <rouke.broersma@infosupport.com>
Signed-off-by: hai.yue <hai.yue@ingka.com>
Elyytscha pushed a commit to WhizUs/argo-cd that referenced this pull request Dec 12, 2025
…d resume actions (argoproj#24192)

Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <rouke.broersma@infosupport.com>
rumstead pushed a commit to rumstead/argo-cd that referenced this pull request Dec 14, 2025
…d resume actions (argoproj#24192)

Signed-off-by: Rouke Broersma <mobrockers@gmail.com>
Signed-off-by: Rouke Broersma <rouke.broersma@infosupport.com>
Signed-off-by: rumstead <37445536+rumstead@users.noreply.github.com>
3 participants