Skip to content

Remove toleration whitelist for PodVolumeBackup and data mover pods#9575

Open
kaovilai wants to merge 3 commits intovelero-io:mainfrom
kaovilai:9476-remove-toleration-whitelist
Open

Remove toleration whitelist for PodVolumeBackup and data mover pods#9575
kaovilai wants to merge 3 commits intovelero-io:mainfrom
kaovilai:9476-remove-toleration-whitelist

Conversation

@kaovilai
Copy link
Copy Markdown
Collaborator

@kaovilai kaovilai commented Mar 4, 2026

Summary

Remove the hardcoded toleration whitelist (ThirdPartyTolerations) and instead inherit all tolerations from the node-agent daemonset for PodVolumeBackup/Restore and DataUpload/Download hosting pods, and from the Velero deployment for maintenance jobs.

Previously, only two toleration keys were allowed (kubernetes.azure.com/scalesetpriority and CriticalAddonsOnly), which prevented backups and restores on nodes with custom NoExecute taints. This change enables Velero to work on nodes with any user-defined taints, as long as the node-agent daemonset is configured to tolerate them.

Changes

  • pkg/nodeagent/node_agent.go: Replace GetToleration(key) with GetTolerations() that returns all tolerations from the daemonset
  • 4 controllers (pod_volume_backup, pod_volume_restore, data_upload, data_download): Replace whitelist loop with single GetTolerations() call
  • pkg/repository/maintenance/maintenance.go: Inherit all tolerations from the Velero deployment instead of filtering through the allowlist
  • pkg/util/third_party.go: Remove ThirdPartyTolerations variable
  • Updated all associated tests

Does your change fix a particular issue?

Fixes #9476

Please indicate you've done the following:

Note

Responses generated with Claude

Copilot AI review requested due to automatic review settings March 4, 2026 21:44
@github-actions github-actions Bot requested review from anshulahuja98 and sseago March 4, 2026 21:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the hardcoded toleration allowlist previously applied to PodVolumeBackup/Restore and data-mover pods, and instead inherits full toleration sets from the node-agent DaemonSet (and from the Velero Deployment for maintenance jobs) to support backups/restores on nodes with custom taints (fixes #9476).

Changes:

  • Removed the ThirdPartyTolerations allowlist and updated controllers to fetch all node-agent tolerations in one call.
  • Added nodeagent.GetTolerations() to return the full toleration list from the node-agent DaemonSet (vs. looking up a single key).
  • Updated maintenance job toleration inheritance to include all tolerations from the Velero Deployment and adjusted tests accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/util/third_party.go Removes the toleration allowlist constant.
pkg/nodeagent/node_agent.go Replaces single-key toleration lookup with returning all tolerations from the DaemonSet.
pkg/nodeagent/node_agent_test.go Updates tests for the new GetTolerations() behavior.
pkg/controller/pod_volume_backup_controller.go Switches hosting pod tolerations to inherit all from node-agent.
pkg/controller/pod_volume_restore_controller.go Switches hosting pod tolerations to inherit all from node-agent.
pkg/controller/data_upload_controller.go Switches hosting pod tolerations to inherit all from node-agent.
pkg/controller/data_download_controller.go Switches hosting pod tolerations to inherit all from node-agent.
pkg/repository/maintenance/maintenance.go Maintenance jobs now inherit all tolerations from the Velero Deployment.
pkg/repository/maintenance/maintenance_test.go Updates tests to expect full toleration inheritance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 209 to +221
@@ -218,13 +218,7 @@ func GetToleration(ctx context.Context, kubeClient kubernetes.Interface, namespa
return nil, errors.Wrapf(err, "error getting %s daemonset", dsName)
}

for i, t := range ds.Spec.Template.Spec.Tolerations {
if t.Key == key {
return &ds.Spec.Template.Spec.Tolerations[i], nil
}
}

return nil, ErrNodeAgentTolerationNotFound
return ds.Spec.Template.Spec.Tolerations, nil
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Windows clusters, node-agent-windows DaemonSet already includes the os=windows NoSchedule/NoExecute tolerations (see pkg/install/daemonset.go), and the data-mover pod creation logic also unconditionally appends those same Windows tolerations when nodeOS == windows (in the exposer). With GetTolerations now returning all DaemonSet tolerations, Windows hosting pods will likely end up with duplicate os=windows tolerations. Consider filtering out the built-in Windows os=windows tolerations in GetTolerations when osType is Windows, or otherwise de-duplicating tolerations before setting them on the pod spec.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Responses generated with Claude

The default node-agent-windows daemonset installed by Velero does not include os=windows tolerations (see pkg/install/daemonset.go). Those tolerations are only added by the exposer layer when building the hosting pod.

If a user were to manually add os=windows tolerations to the daemonset, the result would be duplicate tolerations on the hosting pod. However, duplicate tolerations are harmless in Kubernetes — the scheduler and kubelet simply evaluate all tolerations and match against taints; duplicates have no adverse effect.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Responses generated with Claude

Correction to my previous reply: the node-agent-windows daemonset does include os=windows tolerations at pkg/install/daemonset.go:260-273. So on Windows nodes there would be duplicate os=windows tolerations on the hosting pod (from both the daemonset and the exposer).

However, duplicate tolerations are harmless in Kubernetes — the scheduler simply evaluates all tolerations and checks if any match a given taint. This is also the existing behavior for the previous code before this PR, since the old GetToleration would have returned the os=windows toleration from the daemonset too, if the os key were in the allowlist.

If desired, deduplication could be added as a follow-up, but it is not functionally necessary.

Instead of filtering tolerations through a hardcoded allowlist
(ThirdPartyTolerations), inherit all tolerations from the node-agent
daemonset for PodVolumeBackup/Restore and DataUpload/Download pods,
and from the Velero deployment for maintenance jobs.

This enables backups and restores on nodes with custom NoExecute taints,
which was previously impossible since only two specific toleration keys
were whitelisted.

Fixes velero-io#9476

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
@kaovilai kaovilai force-pushed the 9476-remove-toleration-whitelist branch from 8f72867 to 15f9332 Compare March 5, 2026 19:34
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 86.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.62%. Comparing base (b7bc16f) to head (cc38909).
⚠️ Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
pkg/exposer/csi_snapshot.go 0.00% 2 Missing ⚠️
pkg/exposer/pod_volume.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9575      +/-   ##
==========================================
- Coverage   60.64%   60.62%   -0.03%     
==========================================
  Files         387      388       +1     
  Lines       36518    36568      +50     
==========================================
+ Hits        22148    22169      +21     
- Misses      12782    12806      +24     
- Partials     1588     1593       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

… function

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
} else {
hostingPodTolerations = append(hostingPodTolerations, *v)
}
hostingPodTolerations, err := nodeagent.GetTolerations(context.Background(), r.kubeClient, dd.Namespace, nodeOS)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to keep the toleration allow list?
The annotation and label have similar logic. The allow list stills work when there is no configuration provided or overwritten them.
https://github.com/vmware-tanzu/velero/blob/245525c26b939bde2549d92f7f3a62d8fea489bb/pkg/controller/data_upload_controller.go#L967-L982

Copy link
Copy Markdown

@michalskrivanek michalskrivanek Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to keep the toleration allow list? The annotation and label have similar logic. The allow list stills work when there is no configuration provided or overwritten them.

I think it makes less sense for tolerations. They really are the other way around, I believe - you want to by default inherit all of them from the parent (rather than needing extra node-agent config), as these are more like denylists.
The pods should tolerate whatever taints are set on the parent (so that they can run the same), and only then perhaps as an additional thing some extra tolerations to add on top (i.e. not to filter out, but to add new ones).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe - you want to by default inherit all of them from the parent (rather than needing extra node-agent config)

No, I agree to read from an extra node-agent configuration.
I meant to combine the node-agent configuration and the existing internal allow list.
https://github.com/vmware-tanzu/velero/blob/bca6afada731e5bce03a26d5d8b6d87b2c849ec4/pkg/util/third_party.go#L27-L30

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if by default we inherit all, allowlist doesn't do anything.

So you're saying don't inherit all but allow using a specific existing configuration option?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blackpiglet but what is the use case of blocking mover pod execution by default? The configuration of loadAffinity/selectors for choosing where to run make sense, but the lack of inheriting tolerations is just limiting the number of nodes further down (to 0, currently).

I think it may make sense to have a denylist in node-agent-configmap to remove some of the parent tolerations instead? I.e. inherit tolerations by default, and allow configuration to remove some of them - effectively making the mover pods non-executable on subset of nodes. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if by default we inherit all, allowlist doesn't do anything.

So you're saying don't inherit all but allow using a specific existing configuration option?

By the current implementation, the pod's toleration setting is totally decided by the ConfigMap setting.
I was suggesting to also consider inheriting the tolerations from the node-agent in the third-party allow list.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalskrivanek
How about mainly relying on the new node-agent ConfigMap's toleration configurations, and also inheriting the tolerations specified in the third-party allow list from the node-agent DaemonSet?

Copy link
Copy Markdown

@michalskrivanek michalskrivanek Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blackpiglet that would work as well.
The main point about tolerations was mentioned in the original issue:
But tolerations only have one explicit purpose, allow running on a node. If I allow the daemonset to run on a node, I obviously want to create backups there, so why not simply inherit the tolerations from the daemonset
so I think it's still more simple to just copy from the "parent" and there's no harm in that, but I think both ways would do the job.


func GetToleration(ctx context.Context, kubeClient kubernetes.Interface, namespace string, key string, osType string) (*corev1api.Toleration, error) {
// GetTolerations returns all tolerations from the node-agent daemonset.
func GetTolerations(ctx context.Context, kubeClient kubernetes.Interface, namespace string, osType string) ([]corev1api.Toleration, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tolerations shouldn't be read from the DaemonSet.
We should introduce a new configuration field in the node-agent-configmap to support the toleration setting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove whitelist for tolerations of PodVolumeBackup Pod

4 participants