Skip to content

feat: handle the reportDiff Condition in the rollout controller #1038

Merged
ryanzhang-oss merged 3 commits intoAzure:mainfrom
ryanzhang-oss:handle-report-diff
Feb 20, 2025
Merged

feat: handle the reportDiff Condition in the rollout controller #1038
ryanzhang-oss merged 3 commits intoAzure:mainfrom
ryanzhang-oss:handle-report-diff

Conversation

@ryanzhang-oss
Copy link
Contributor

@ryanzhang-oss ryanzhang-oss commented Feb 11, 2025

Description of your changes

  1. Special handle the reportDiff Condition in the rollout controller
  2. Special handle the reportDiff Condition in the eviction controller.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@ryanzhang-oss ryanzhang-oss marked this pull request as draft February 11, 2025 02:18
Copy link
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments, PTAL 🙏

if bindingutils.HasBindingFailed(binding) {
klog.V(3).InfoS("Found a failed to be ready unscheduled binding", "clusterResourcePlacement", crpKObj, "binding", bindingKObj)
} else {
} else if !bindingutils.IsBindingReportDiff(binding) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! For this part,

  • If a binding has an CSA/SSA apply strategy, and the Applied/Available conditions are stale/unset/True, it will be considered as a canBeReadyBinding;
  • If a binding has a ReportDiff apply strategy, and the DiffReported condition is stale/unset, it will be considered as a canBeReadyBinding;
    • As long as the DiffReported condition is set and refresh (True or False), it will not be considered as a canBeReadyBinding

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since canBeReadyBinding is used only when calculating the maxNumberToAdd, it would mean that for N target clusters, there will be at max. 2N (+maxSurge) clusters at any time being selected, as long as the DiffReported condition is populated for all N existing bindings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has no side effect per se, as ReportDiff mode does not make any change in member clusters; it is just that the behavior is a bit inconsistent with what would happen when CSA/SSA apply strategy is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right. The reason that I added this is to not let ReportDiff to block the rollout based on max surge as we are not actually applying the resources on the clusters. AFAIK, the main reason for max surge is to prevent too many copies of a resource to exist.

klog.V(3).InfoS("Found a failed to be ready bound binding", "clusterResourcePlacement", crpKObj, "binding", bindingKObj)
bindingFailed = true
} else {
} else if !bindingutils.IsBindingReportDiff(binding) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar pattern happens here as well:

Since HasBindingFailed will only return True when either Applied or Available condition is set to False and fresh, any binding in ReportDiff mode will never be considered to be failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ReportDiff bindings are always updated per the maxUnavailable setting, even if the DiffReported condition has turned False (an error has occurred).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to say that it is a wrong behavior though. Again, it is just an inconsistency.

}

// IsBindingReportDiff checks if the binding is in diffReported state.
func IsBindingReportDiff(binding *placementv1beta1.ClusterResourceBinding) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! Just a nit here: would the name be better with IsBindingDiffReported? Since it's only evaluating the DiffReported condition?

@michaelawyu
Copy link
Contributor

michaelawyu commented Feb 11, 2025

Hi Ryan! I've gotten another minor concern for the new logic: at this moment we are only comparing the conditions -> for regular cases this works greatly, but in situations where apply strategy types are switched (ReportDiff <-> CSA/SSA), there are many intermediate states that are going to get difficult to reason about.

For example, when people switch from CSA/SSA to ReportDiff, it could happen that some bindings are still using the CSA/SSA apply strategy (and have fresh Applied/Available condition), some bindings are already using the ReportDiff mode (and have fresh DiffReported condition), while some bindings are in between (have stale Applied/Available condition and stale/unset DiffReported condition). I am having a bit of trouble of having my mind wrapped around how we should treat the rollout process in this case.

Since the CRP is always the source of truth, would it be better (possible) for us to always reference that?

@ryanzhang-oss
Copy link
Contributor Author

Since the CRP is always the source of truth, would it be better (possible) for us to always reference that?

Agree, the intermediate state is confusing. I wonder how can we reference CRP as the truth?

@ryanzhang-oss ryanzhang-oss marked this pull request as ready for review February 13, 2025 21:14
Copy link
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ;)

Have some minor comments, will add them later; approve now to unblock the progress.

wantWaitTime: -1,
},

"binding diff report false on different generation with successful current apply should be ready": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan! A nit: this should be available?

readyTimeCutOff: now,
wantReady: true,
wantWaitTime: 0,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably also need a diffReported is current but available is stale case.

})
}
}

Copy link
Contributor

@michaelawyu michaelawyu Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit (when I was resolving conflicts): the case on line 248 is essentially the same as the case on line 278. I assume that the former case was more of a partially stale case?

@ryanzhang-oss ryanzhang-oss merged commit 5947d8b into Azure:main Feb 20, 2025
12 checks passed
@ryanzhang-oss ryanzhang-oss deleted the handle-report-diff branch April 22, 2025 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants