feat: skip work applier status updates if possible (#375)
Conversation
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Note: as anticipated, after this PR the status update logic in the work applier has reached the linter's code complexity limit; to unblock progress, that specific method is now exempted from the linter. Separate PRs will follow to refactor the method.

This PR precedes #118.
ryanzhang-oss left a comment
How can we make sure that a work's status in the cache is actually the same as in etcd? If we just compare with the cache and stop updating, we run the risk of the status never being updated to the correct state (although the chance is small).
Hi Ryan! The work applier already requeues periodically (with back-off), so unless the client-side cache becomes significantly out of sync with the API server (in which case all writes would be rejected by optimistic locking), any inconsistent status will be overwritten within at most 15 minutes.
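The interplay described above (skip the write when nothing changed, but rely on the periodic requeue to bound how long a stale cache read can persist) can be sketched as follows. The `WorkStatus` struct, the `write` callback, and the return shape are simplified stand-ins for the real fleet types, not the merged code:

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// WorkStatus is a simplified stand-in for fleetv1beta1.WorkStatus.
type WorkStatus struct {
	Conditions []string
}

// reconcileStatus writes the status only when it differs from the cached
// copy. Even if the cache is stale and a needed write is skipped, the
// returned requeue interval guarantees the status is re-evaluated (and
// corrected) within at most 15 minutes.
func reconcileStatus(cached, computed *WorkStatus, write func(*WorkStatus) error) (time.Duration, bool, error) {
	if reflect.DeepEqual(cached, computed) {
		// No change observed; skip the API call.
		return 15 * time.Minute, false, nil
	}
	if err := write(computed); err != nil {
		return 0, true, err
	}
	return 15 * time.Minute, true, nil
}

func main() {
	cached := &WorkStatus{Conditions: []string{"Applied"}}
	same := &WorkStatus{Conditions: []string{"Applied"}}
	changed := &WorkStatus{Conditions: []string{"Applied", "Available"}}
	write := func(*WorkStatus) error { return nil }

	_, wrote, _ := reconcileStatus(cached, same, write)
	fmt.Println("wrote (unchanged):", wrote) // wrote (unchanged): false
	_, wrote, _ = reconcileStatus(cached, changed, write)
	fmt.Println("wrote (changed):", wrote) // wrote (changed): true
}
```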
    return controller.NewAPIServerError(false, err)

    // Skip the status update if no change found.
    if equality.Semantic.DeepEqual(originalStatus, &appliedWork.Status) {
Do we have an idea of how many calls this generates?
Hi Ryan! Do you mean the number of status updates this setup will skip, or the time cost of the DeepEqual calls?
For the former: in the target perf test environment, each member agent was generating roughly 950K status updates per 24 hours (though this number might be a bit biased due to the agents being restarted occasionally).
For the latter: deep-equaling is usually expensive, especially when the object is complex. I don't have specifics on deep-equaling Work object status data right now, but I could do a mini benchmark if you are interested.
Thanks, I was asking about the former. However, just to make sure we are on the same page: this is AppliedWork, so it's an update on the member cluster API. 1M calls per day isn't too bad, though.
Hi Ryan! This part is about the AppliedWork, which is updated via the member cluster API server. So far we haven't seen any cases of a member cluster API server being overloaded by such updates.
    }

    func shouldSkipStatusUpdate(isDriftedOrDiffed, isStatusBackReportingOn bool, originalStatus, currentStatus *fleetv1beta1.WorkStatus) bool {
        if isDriftedOrDiffed || isStatusBackReportingOn {
This will reduce the effectiveness of this PR by a lot; is there any way to soften the blow a bit further?
Hi Ryan! We could add a flag to omit observation timestamps if needed. Though with exponential backoff, as long as the number of drifted/diffed placements plus status back-reporting placements is low and the system does not restart often, the number of calls should remain relatively limited (~96 writes per placement per day, i.e. one per 15-minute requeue interval) once everything stabilizes.
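For reference, the guard under discussion can be sketched as below. The signature matches the quoted diff, but the body and the simplified `WorkStatus` type are an illustrative reconstruction, not the merged code (which uses `equality.Semantic.DeepEqual` on the real fleet type):

```go
package main

import (
	"fmt"
	"reflect"
)

// WorkStatus is a simplified stand-in for fleetv1beta1.WorkStatus.
type WorkStatus struct {
	Conditions []string
}

// shouldSkipStatusUpdate reports whether the status write can be skipped.
// Drifted/diffed placements and status back-reporting carry observation
// timestamps that change on every pass, so those updates are never skipped;
// otherwise the write is skipped when the status is unchanged.
func shouldSkipStatusUpdate(isDriftedOrDiffed, isStatusBackReportingOn bool, originalStatus, currentStatus *WorkStatus) bool {
	if isDriftedOrDiffed || isStatusBackReportingOn {
		return false
	}
	return reflect.DeepEqual(originalStatus, currentStatus)
}

func main() {
	s := &WorkStatus{Conditions: []string{"Applied"}}
	fmt.Println(shouldSkipStatusUpdate(false, false, s, &WorkStatus{Conditions: []string{"Applied"}})) // true
	fmt.Println(shouldSkipStatusUpdate(true, false, s, s))                                             // false
}
```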
    MockGet: func(ctx context.Context, key client.ObjectKey, obj client.Object) error {
        return realClient.Get(ctx, key, obj)
    },
    MockList: func(ctx context.Context, list client.ObjectList, opts ...client.ListOption) error {
        return realClient.List(ctx, list, opts...)
    },
    MockCreate: func(ctx context.Context, obj client.Object, opts ...client.CreateOption) error {
        return realClient.Create(ctx, obj, opts...)
    },
    MockDelete: func(ctx context.Context, obj client.Object, opts ...client.DeleteOption) error {
        return realClient.Delete(ctx, obj, opts...)
    },
    MockDeleteAllOf: func(ctx context.Context, obj client.Object, opts ...client.DeleteAllOfOption) error {
        return realClient.DeleteAllOf(ctx, obj, opts...)
    },
    MockUpdate: func(ctx context.Context, obj client.Object, opts ...client.UpdateOption) error {
        return realClient.Update(ctx, obj, opts...)
    },
    MockPatch: func(ctx context.Context, obj client.Object, patch client.Patch, opts ...client.PatchOption) error {
        return realClient.Patch(ctx, obj, patch, opts...)
    },
    MockApply: func(ctx context.Context, config runtime.ApplyConfiguration, opts ...client.ApplyOption) error {
        return realClient.Apply(ctx, config, opts...)
    },
Hi Ryan! The default is a nil function, whose invocation triggers an error. Some methods (e.g., DeleteAllOf) are not used in our controllers at all; I added them for completeness.
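The pattern being described here (each mock method is a nil func field by default, and invoking an unset method surfaces an error rather than silently succeeding) can be sketched with a hypothetical single-method client; the names below are illustrative, not the actual test utility:

```go
package main

import (
	"errors"
	"fmt"
)

// MockClient holds one overridable func field per client method. Fields
// left nil cause the corresponding call to fail loudly, so a test cannot
// accidentally rely on a method it never stubbed.
type MockClient struct {
	MockGet func(key string) (string, error)
}

// Get delegates to MockGet when it is set, and errors otherwise.
func (m *MockClient) Get(key string) (string, error) {
	if m.MockGet == nil {
		return "", errors.New("Get called but MockGet is not set")
	}
	return m.MockGet(key)
}

func main() {
	unset := &MockClient{}
	if _, err := unset.Get("foo"); err != nil {
		fmt.Println("unset:", err) // unset: Get called but MockGet is not set
	}

	delegating := &MockClient{
		MockGet: func(key string) (string, error) { return "value-of-" + key, nil },
	}
	v, _ := delegating.Get("foo")
	fmt.Println("set:", v) // set: value-of-foo
}
```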
Merging this PR to unblock progress (approval was acquired before the merge conflicts were resolved); if there are any concerns, please let me know.
Description of your changes
As a performance improvement, set the work applier to skip status updates if possible.
I have:

- Run `make reviewable` to ensure this PR is ready for review.

How has this code been tested
Special notes for your reviewer