feat: handle the reportDiff Condition in the rollout controller #1038
feat: handle the reportDiff Condition in the rollout controller #1038ryanzhang-oss merged 3 commits intoAzure:mainfrom
Conversation
michaelawyu
left a comment
There was a problem hiding this comment.
Added a few comments, PTAL 🙏
| if bindingutils.HasBindingFailed(binding) { | ||
| klog.V(3).InfoS("Found a failed to be ready unscheduled binding", "clusterResourcePlacement", crpKObj, "binding", bindingKObj) | ||
| } else { | ||
| } else if !bindingutils.IsBindingReportDiff(binding) { |
There was a problem hiding this comment.
Hi Ryan! For this part,
- If a binding has an CSA/SSA apply strategy, and the
Applied/Availableconditions are stale/unset/True, it will be considered as acanBeReadyBinding; - If a binding has a ReportDiff apply strategy, and the
DiffReportedcondition is stale/unset, it will be considered as acanBeReadyBinding;- As long as the
DiffReportedcondition is set and refresh (True or False), it will not be considered as acanBeReadyBinding
- As long as the
There was a problem hiding this comment.
Since canBeReadyBinding is used only when calculating the maxNumberToAdd, it would mean that for N target clusters, there will be at max. 2N (+maxSurge) clusters at any time being selected, as long as the DiffReported condition is populated for all N existing bindings.
There was a problem hiding this comment.
This has no side effect per se, as ReportDiff mode does not make any change in member clusters; it is just that the behavior is a bit inconsistent with what would happen when CSA/SSA apply strategy is used.
There was a problem hiding this comment.
yes, you are right. The reason that I added this is to not let ReportDiff to block the rollout based on max surge as we are not actually applying the resources on the clusters. AFAIK, the main reason for max surge is to prevent too many copies of a resource to exist.
| klog.V(3).InfoS("Found a failed to be ready bound binding", "clusterResourcePlacement", crpKObj, "binding", bindingKObj) | ||
| bindingFailed = true | ||
| } else { | ||
| } else if !bindingutils.IsBindingReportDiff(binding) { |
There was a problem hiding this comment.
A similar pattern happens here as well:
Since HasBindingFailed will only return True when either Applied or Available condition is set to False and fresh, any binding in ReportDiff mode will never be considered to be failing.
There was a problem hiding this comment.
So ReportDiff bindings are always updated per the maxUnavailable setting, even if the DiffReported condition has turned False (an error has occurred).
There was a problem hiding this comment.
Not to say that it is a wrong behavior though. Again, it is just an inconsistency.
pkg/utils/binding/binding.go
Outdated
| } | ||
|
|
||
| // IsBindingReportDiff checks if the binding is in diffReported state. | ||
| func IsBindingReportDiff(binding *placementv1beta1.ClusterResourceBinding) bool { |
There was a problem hiding this comment.
Hi Ryan! Just a nit here: would the name be better with IsBindingDiffReported? Since it's only evaluating the DiffReported condition?
|
Hi Ryan! I've gotten another minor concern for the new logic: at this moment we are only comparing the conditions -> for regular cases this works greatly, but in situations where apply strategy types are switched (ReportDiff <-> CSA/SSA), there are many intermediate states that are going to get difficult to reason about. For example, when people switch from CSA/SSA to ReportDiff, it could happen that some bindings are still using the CSA/SSA apply strategy (and have fresh Applied/Available condition), some bindings are already using the ReportDiff mode (and have fresh DiffReported condition), while some bindings are in between (have stale Applied/Available condition and stale/unset DiffReported condition). I am having a bit of trouble of having my mind wrapped around how we should treat the rollout process in this case. Since the CRP is always the source of truth, would it be better (possible) for us to always reference that? |
Agree, the intermediate state is confusing. I wonder how can we reference CRP as the truth? |
michaelawyu
left a comment
There was a problem hiding this comment.
LGTM ;)
Have some minor comments, will add them later; approve now to unblock the progress.
| wantWaitTime: -1, | ||
| }, | ||
|
|
||
| "binding diff report false on different generation with successful current apply should be ready": { |
There was a problem hiding this comment.
Hi Ryan! A nit: this should be available?
| readyTimeCutOff: now, | ||
| wantReady: true, | ||
| wantWaitTime: 0, | ||
| }, |
There was a problem hiding this comment.
We probably also need a diffReported is current but available is stale case.
| }) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
A nit (when I was resolving conflicts): the case on line 248 is essentially the same as the case on line 278. I assume that the former case was more of a partially stale case?
Description of your changes
Fixes #
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer