Skip to content

feat: skip work applier status updates if possible#375

Merged
michaelawyu merged 4 commits intokubefleet-dev:mainfrom
michaelawyu:feat/deep-equal-before-status-update-work-applier
Dec 29, 2025
Merged

feat: skip work applier status updates if possible#375
michaelawyu merged 4 commits intokubefleet-dev:mainfrom
michaelawyu:feat/deep-equal-before-status-update-work-applier

Conversation

@michaelawyu
Copy link
Member

Description of your changes

As a performance improvement, set the work applier to skip status updates if possible.

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

  • Integration tests

Special notes for your reviewer

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
@michaelawyu
Copy link
Member Author

Note: as anticipated after this PR the status update logic in work applier has reached the linter code complexity limit; to unblock progress that specific method is now exempted from the linter, will submit separate PRs to refactor the method.

@michaelawyu
Copy link
Member Author

This PR precedes #118.

Copy link
Member

@ryanzhang-oss ryanzhang-oss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we make sure that a work's status in cache is actually the same in etcd? If we just compare with the cache and stop updating, we run the risk of the status never gets updated to the correct state (although the chance is small).

@michaelawyu
Copy link
Member Author

how can we make sure that a work's status in cache is actually the same in etcd? If we just compare with the cache and stop updating, we run the risk of the status never gets updated to the correct state (although the chance is small).

Hi Ryan! For the work applier we already requeue periodically (with back-off) so unless the client-side cache becomes significantly out-of-sync with the API server (in this case all writes will be rejected with optimistic locks), any inconsistent status updates will be overwritten within at max. 15 minutes.

return controller.NewAPIServerError(false, err)

// Skip the status update if no change found.
if equality.Semantic.DeepEqual(originalStatus, &appliedWork.Status) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have an idea of how many calls this generates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! Do you mean the # of status updates this setup will skip? Or do you mean the time complexity about the DeepEqual calls?

For the former, in the target perf test environment each member agent was generating roughly 950K status updates per 24h (though this number might be a bit biased due to the agents being restarted occasionally).

For the latter, deep-equaling is usu. expensive, esp. when the object is complex. I don't have specifics on deep-equaling Work object status data now, but I could do a mini benchmarking if you are interested.

Copy link
Member

@ryanzhang-oss ryanzhang-oss Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was asking the former. However, just to make sure that we are on the same page, this is appliedWork so it's an update on the member cluster API. 1M call per day isn't too bad though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! This part is about the AppliedWork, which is updated via the member cluster API server. So far we haven't seen any cases of member cluster API server being overloaded by such updates.

}

func shouldSkipStatusUpdate(isDriftedOrDiffed, isStatusBackReportingOn bool, originalStatus, currentStatus *fleetv1beta1.WorkStatus) bool {
if isDriftedOrDiffed || isStatusBackReportingOn {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will reduce the effectiveness of this PR by a lot, is there anyway to soften the blow a bit further?

Copy link
Member Author

@michaelawyu michaelawyu Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! We might need a flag to omit observation timestamps if needed? Though with exponential backoff, as long as the number of drifted/diffed placements + status back-reporting placements is low, and the system does not restart often, the # of calls should be relatively limited (~96 writes per placement) when everything stablizes.

ryanzhang-oss
ryanzhang-oss previously approved these changes Dec 22, 2025
Comment on lines +62 to +85
MockGet: func(ctx context.Context, key client.ObjectKey, obj client.Object) error {
return realClient.Get(ctx, key, obj)
},
MockList: func(ctx context.Context, list client.ObjectList, opts ...client.ListOption) error {
return realClient.List(ctx, list, opts...)
},
MockCreate: func(ctx context.Context, obj client.Object, opts ...client.CreateOption) error {
return realClient.Create(ctx, obj, opts...)
},
MockDelete: func(ctx context.Context, obj client.Object, opts ...client.DeleteOption) error {
return realClient.Delete(ctx, obj, opts...)
},
MockDeleteAllOf: func(ctx context.Context, obj client.Object, opts ...client.DeleteAllOfOption) error {
return realClient.DeleteAllOf(ctx, obj, opts...)
},
MockUpdate: func(ctx context.Context, obj client.Object, opts ...client.UpdateOption) error {
return realClient.Update(ctx, obj, opts...)
},
MockPatch: func(ctx context.Context, obj client.Object, patch client.Patch, opts ...client.PatchOption) error {
return realClient.Patch(ctx, obj, patch, opts...)
},
MockApply: func(ctx context.Context, config runtime.ApplyConfiguration, opts ...client.ApplyOption) error {
return realClient.Apply(ctx, config, opts...)
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are those not default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! The default is a nil function, whose invocation would trigger an error. There are some methods (e.g., DeleteAllOf) that are not used in our controllers at all; I just added them for completeness reasons.

@codecov
Copy link

codecov bot commented Dec 29, 2025

Codecov Report

❌ Patch coverage is 73.91304% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/controllers/workapplier/status.go 73.91% 4 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@michaelawyu
Copy link
Member Author

Merging this PR to unblock progress (approval acquired before merge conflicts are resolved) -> if there's any concern, please let me know.

@michaelawyu michaelawyu merged commit 9291964 into kubefleet-dev:main Dec 29, 2025
13 of 15 checks passed
@michaelawyu michaelawyu deleted the feat/deep-equal-before-status-update-work-applier branch December 29, 2025 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants