
Fix Job Cleanup Ownership Bug in k8up#1075

Open
alezkv wants to merge 1 commit into k8up-io:master from alezkv:jobsHystoryLimit-filter-by-owners

Conversation

@alezkv

@alezkv alezkv commented Jun 30, 2025

Summary

Two k8up schedulers in the same namespace (daily remote jobs, hourly local jobs) were interfering with each other's cleanup cycles. The local scheduler's cleanup was deleting the remote scheduler's jobs because the cleanup logic only filtered by namespace, not by job ownership.

Root Cause

In operator/executor/generic.go, the listOldResources() method was only filtering jobs by namespace:

err := g.Client.List(ctx, objList, &client.ListOptions{
    Namespace: namespace,  // Only namespace filtering - no ownership check
})

This allowed one scheduler's cleanup to see and delete jobs belonging to other schedulers in the same namespace.

Solution

Fixed ownership filtering using the existing label-based approach:

  1. Modified listOldResources() to filter by the k8up.io/owned-by label:

     err := g.Client.List(ctx, objList, client.MatchingLabels{k8upv1.LabelK8upOwnedBy: ownedBy}, client.InNamespace(namespace))

  2. Enhanced CleanupOldResources() to construct the ownership identifier:

     jobObj, ok := limits.(k8upv1.JobObject)
     ownedBy := fmt.Sprintf("%s_%s", jobObj.GetType().String(), jobObj.GetName())

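To illustrate the effect of the fix, here is a minimal stdlib-only sketch of the filtering logic (the job struct and the hard-coded label key are simplified stand-ins, not the actual k8up or controller-runtime types):

```go
package main

import "fmt"

// job is a simplified stand-in for a Kubernetes Job object.
type job struct {
	name      string
	namespace string
	labels    map[string]string
}

// listOldResources filters jobs by namespace AND by the
// k8up.io/owned-by label, mirroring the fixed List call.
func listOldResources(jobs []job, namespace, ownedBy string) []job {
	var out []job
	for _, j := range jobs {
		if j.namespace == namespace && j.labels["k8up.io/owned-by"] == ownedBy {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	jobs := []job{
		{"backup-local-1", "prod", map[string]string{"k8up.io/owned-by": "backup_local"}},
		{"backup-remote-1", "prod", map[string]string{"k8up.io/owned-by": "backup_remote"}},
		{"backup-local-2", "prod", map[string]string{"k8up.io/owned-by": "backup_local"}},
	}
	// Before the fix, a namespace-only filter would see all three jobs;
	// with the owned-by filter, each scheduler only sees its own.
	for _, j := range listOldResources(jobs, "prod", "backup_local") {
		fmt.Println(j.name)
	}
}
```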
Fixes #1053

Checklist

For Code changes

  • Categorize the PR by setting a good title and adding one of the labels:
    bug, enhancement, documentation, change, breaking, dependency
    as they show up in the changelog
  • PR contains the label area:operator
  • Commits are signed off
  • Link this PR to related issues
  • I have not made any changes in the charts/ directory.

@alezkv alezkv requested a review from a team as a code owner June 30, 2025 10:48
@alezkv alezkv requested review from Kidswiss and wejdross and removed request for a team June 30, 2025 10:48
@Kidswiss
Contributor

Hi @alezkv

Thanks for the contribution!

Can you please do the DCO sign-off? Run git rebase HEAD~1 --signoff and then force-push.

Once all the checks go through this can be merged!


Signed-off-by: Aleksey Zhukov <353748+alezkv@users.noreply.github.com>
@alezkv alezkv force-pushed the jobsHystoryLimit-filter-by-owners branch from 07d2747 to 445f552 Compare June 30, 2025 20:46
@Kidswiss Kidswiss added the area:operator and bug labels Jul 1, 2025
@Kidswiss
Contributor

Kidswiss commented Jul 1, 2025

Hi @alezkv

The linter is not quite happy yet. There are some functions that seem unused, can you please remove them?

For the error about using a switch statement: feel free to ignore it via a //nolint comment.

@alezkv
Author

alezkv commented Jul 1, 2025

@Kidswiss Yes, I understand. I've already resolved the issue locally, but I'm still trying to identify the end-to-end failures.

@alezkv
Author

alezkv commented Jul 2, 2025

It seems that my initial assumption was incorrect. The currently proposed changes did not resolve the issue and may have introduced a regression in the cleanup. It appears this involves more than just a bug fix; it may require architectural refactoring.

I would appreciate any input from the code owners @Kidswiss

Current state of the code:

  • the cleanup logic runs in the generic executor (Backup, Archive, ... objects), removing old resources of the same type from the namespace according to the settings => a new Backup cleans up old Backups
  • Schedule doesn't have any cleanup logic of its own and depends on the generic executor logic (with multiple Schedule objects the cleanup will interfere, removing more than needed)
  • Schedule has no configuration for history; only the child objects have one
  • the cleanup logic is invoked during each object's provisioning phase

I haven't tested this flow, but based on the code I believe it currently works as follows. Let's assume all jobs succeed.

  • backup1 created, dst: backup1, successfulJobsHistoryLimit: 20
  • controller provisioning kicks in, history check runs - 0 deletions
  • backup2 created, dst: backup2, successfulJobsHistoryLimit: 100
  • controller provisioning kicks in, history check runs - 0 deletions => 1 < 100
  • backup3 created, dst: backup3, successfulJobsHistoryLimit: 1
  • controller provisioning kicks in, history check runs - 2 deletions => 2 > 1
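The walkthrough above can be sketched as a small simulation. This reflects one plausible reading of the current behavior (keep the newest `limit` jobs among all owned jobs in the namespace, using only the newest object's limit), not the actual k8up code:

```go
package main

import "fmt"

// deletions models the current cleanup at provisioning time: keep the
// newest `limit` jobs among ALL jobs in the namespace (counting the new
// one), using only the NEWEST object's successfulJobsHistoryLimit.
func deletions(totalJobs, limit int) int {
	if totalJobs <= limit {
		return 0
	}
	return totalJobs - limit
}

func main() {
	fmt.Println(deletions(1, 20))  // backup1: 0 deletions
	fmt.Println(deletions(2, 100)) // backup2: 0 deletions
	fmt.Println(deletions(3, 1))   // backup3: 2 deletions, despite earlier limits of 20 and 100
}
```

The last step shows the flaw: backup3's limit of 1 wipes out jobs that backup1 and backup2 intended to keep.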

Issues I see in the current approach:

  • cleanup removes neighboring objects, even though the setting semantically refers to the child Job resource (which also doesn't make sense, because an object can only have a single child job)
  • cleanup is bound to the newly created object's setting, ignoring what was set on previous objects
  • the scheduler doesn't have its own cleanup mechanism

@Kidswiss
Contributor

Kidswiss commented Jul 2, 2025

Hi @alezkv

Thanks for looking into this.

the scheduler doesn't have its own cleanup mechanism

I'd say that's the main issue. It should probably be the schedule's responsibility to clean up the jobs. Especially now that we support multiple schedules per namespace.

So ideally it should do it like this:

  • the schedule gets a configuration for the cleanup
  • we remove it from each individual job logic

This way the schedule will clean all the jobs that belong to it. But any manually created jobs should also be manually cleaned up again.
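A rough sketch of what schedule-level cleanup could look like (purely illustrative: the ownedJob struct and function names are assumptions, not k8up API):

```go
package main

import (
	"fmt"
	"sort"
)

// ownedJob is a hypothetical stand-in for a finished job in the namespace.
type ownedJob struct {
	name     string
	owner    string // owning schedule; empty for manually created jobs
	finished int64  // completion timestamp
}

// scheduleCleanup keeps the newest `keep` jobs owned by the given schedule
// and returns the names of jobs that would be deleted. Manually created
// jobs (empty owner) are never touched, matching the proposal above.
func scheduleCleanup(jobs []ownedJob, schedule string, keep int) []string {
	var owned []ownedJob
	for _, j := range jobs {
		if j.owner == schedule {
			owned = append(owned, j)
		}
	}
	// Newest first, then everything past the first `keep` entries goes.
	sort.Slice(owned, func(i, k int) bool { return owned[i].finished > owned[k].finished })
	var deleted []string
	for i := keep; i < len(owned); i++ {
		deleted = append(deleted, owned[i].name)
	}
	return deleted
}

func main() {
	jobs := []ownedJob{
		{"daily-1", "schedule-daily", 1},
		{"daily-2", "schedule-daily", 2},
		{"manual-1", "", 3},
		{"daily-3", "schedule-daily", 4},
	}
	// Only the oldest schedule-owned job is removed; manual-1 is untouched.
	fmt.Println(scheduleCleanup(jobs, "schedule-daily", 2))
}
```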

What do you think?

@arska
Member

arska commented Mar 22, 2026

Thanks for this contribution and the thorough write-up! The fix for the job cleanup ownership filtering looks solid.

It looks like CI is failing (lint and e2e) — likely due to the branch being based on an older master. Could you rebase onto current master and force-push? That should resolve the CI issues.

Also, to be able to merge it, we need the commits to be signed off (DCO — Developer Certificate of Origin), which is a requirement for all CNCF projects. You can do this by rebasing with:

git rebase --signoff HEAD~N

(where N is the number of commits in this PR)

and then force-pushing to your branch. If you have any questions about this, feel free to ask!



Development

Successfully merging this pull request may close these issues.

successfulJobsHistoryLimit Applies Namespace-Wide Instead of Per Schedule
