feat: wire namespace affinity scheduler plugin#461
feat: wire namespace affinity scheduler plugin#461weng271190436 merged 26 commits intokubefleet-dev:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
eebd744 to
8fa620f
Compare
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
32318e6 to
08353f0
Compare
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
| return nil | ||
| } | ||
| Eventually(rpStatusUpdatedActual, eventuallyDuration, eventuallyInterval).Should(Succeed(), "Failed to update RP %s status as expected", rpName) | ||
| rpStatusUpdatedActual := rpStatusUpdatedActual(appConfigMapIdentifiers(), []string{memberCluster2EastCanaryName, memberCluster3WestProdName}, nil, "0") |
There was a problem hiding this comment.
Previously we expect memberCluster1EastProdName to fail. Now with namespace affinity plugin, memberCluster1EastProdName won't be selected so the test expectation changes
| } | ||
| }) | ||
|
|
||
| It("should wait for namespace collection to sync on all member clusters", func() { |
There was a problem hiding this comment.
Namespace collection can take a while to finish so placing this in all RP tests where a CRP was created to place a namespace to member clusters where RP resources go
There was a problem hiding this comment.
What happens if the namepace is not reported in the MC yet? The scheduler should be triggered again when the namespace is reported. Does the eventually block in the following IT not enough?
There was a problem hiding this comment.
When the namespace is finally reported in MC, the member cluster watcher can detect it and triggers placement reconciliation.
The eventually block can work if we wait long enough. The current 10 second wait time is not enough. And waiting for namespace collection makes it more deterministic
There was a problem hiding this comment.
After pulling most waitForNamespaceCollectionOnClusters into its own It block based on
michaelawyu's comment, many of the waits are after createRP because many createRP happens inside BeforeAll. So waitForNamespaceCollectionOnClusters doesn't have to happen before createRP.
There are exceptions. For example in "Test cluster scale out and shrink using pickN policy with namespaced staged update run," we expect PickN where N = 1 to pick kind-cluster-3. This can also be deterministic if the namespaces are ready before createRP.
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
| } | ||
|
|
||
| // TestNewProfileWithOptions tests the creation of a scheduling profile with custom options. | ||
| func TestNewProfileWithOptions(t *testing.T) { |
There was a problem hiding this comment.
what does this test actually test?
There was a problem hiding this comment.
I am testing calling NewProfile with option
| // default plugin list | ||
| clusterAffinityPlugin := clusteraffinity.New() | ||
| if opts.ClusterAffinityPlugin != nil { | ||
| clusterAffinityPlugin = *opts.ClusterAffinityPlugin |
There was a problem hiding this comment.
I wonder why is ClusterAffinityPlugin customizable while others are static?
There was a problem hiding this comment.
We want to avoid merge conflict when backporting so allowing customization here while using the customization capability in Azure/fleet
Azure/fleet uses a customized cluster affinity plugin to support cluster affinity based on VM SKU availability
| } | ||
| }) | ||
|
|
||
| It("should wait for namespace collection to sync on all member clusters", func() { |
There was a problem hiding this comment.
What happens if the namepace is not reported in the MC yet? The scheduler should be triggered again when the namespace is reported. Does the eventually block in the following IT not enough?
michaelawyu
left a comment
There was a problem hiding this comment.
Added some comments, PTAL.
| Eventually(crpStatusUpdatedActual, eventuallyDuration, eventuallyInterval).Should(Succeed(), "Failed to update CRP status as expected") | ||
|
|
||
| By("waiting for namespace to be collected on all member clusters") | ||
| waitForNamespaceCollectionOnClusters(nsName, allMemberClusterNames) |
There was a problem hiding this comment.
Hi weng271190436! Just a nit: I understand that this might be a must considering that it takes ~15 seconds for a namespace to be reported and our eventually block runs only for 10 seconds; but since it is a necessary step, would it be better if we make it a separate It block when applicable?
There was a problem hiding this comment.
I pulled them into It blocks in most cases. There are 3 cases where waiting in It block doesn't work and I commented below
| --set workApplierRequeueRateLimiterMaxFastBackoffDelaySeconds=5 \ | ||
| --set propertyProvider=$PROPERTY_PROVIDER \ | ||
| --set region=${REGIONS[$i]} \ | ||
| --set enableNamespaceCollectionInPropertyProvider=true \ |
There was a problem hiding this comment.
Hi weng271190436! Since this is enabled per member agent and as a part of the property provider, we might need a check in the setup as well to make sure that it is running properly.
There was a problem hiding this comment.
The member clusters are not joined to the hub during setup.sh. If they are joined I can verify that the namespaces are populated on the hub membercluster CRs. But since they are not joined, I can't think of a good way to make sure that it is running properly here. When they are joined, I check NamespaceCollectionSucceededCondType in utils_test.go to make sure that it is working.
There was a problem hiding this comment.
Hi Wei! I mean that there was a separate part of the check at the environment setup step, checkIfAzurePropertyProviderIsWorking(), for verifying if property provider is working properly before the tests start up.
There was a problem hiding this comment.
Come to think of that, for the test steps that require the property provider, we also need to mark them specifically so that they can be skipped if the environment has not provider available (isAzurePropertyProviderEnabled).
|
|
||
| // default plugin list | ||
| clusterAffinityPlugin := clusteraffinity.New() | ||
| if opts.ClusterAffinityPlugin != nil { |
There was a problem hiding this comment.
Yeah, a note on this part: originally we added the profile support so that developers (or users) can switch scheduler profiles as needed (e.g., for some placements, use profile A, and the others profile B). Each profile can have its own plugin chain (e.g., B has cluster affinity support, A does not); it seems that the options part was added to achieve the same purpose? Of course we haven't exposed the profile switch API yet.
There was a problem hiding this comment.
I checked the code and the options is not in use at all. Was it added back then to faciliate the development of the RP API?
There was a problem hiding this comment.
Doesn't seem to be RP related. But it was added back recently https://teams.microsoft.com/l/message/19:13f7d68c3be840eb9281e41b6d1c15ea@thread.skype/1763091181077?tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47&groupId=e121dbfd-0ec1-40ea-8af5-26075f6a731b&parentMessageId=1763068248219&teamName=Azure%20Container%20Compute&channelName=SIG%20Multi-Cluster%20-%20Dataplane&createdTime=1763091181077
There was a problem hiding this comment.
I think I get it. We want to avoid merge conflict when backporting so allowing customization here while using the customization capability in Azure/fleet
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
…namespace-affinity-plugin
c848cb8 to
b1c572d
Compare
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
| Expect(memberCluster.KubeClient.Create(ctx, &ns)).To(Succeed()) | ||
| } | ||
|
|
||
| By("waiting for namespace to be collected on all member clusters") |
There was a problem hiding this comment.
Didn't convert this to It block (based on a comment) because the RP created right below depends on namespaces created above. So need to wait inside BeforeAll
| // Ensure the namespace collection is synced on all member clusters before creating the RP. | ||
| // We expect kind-cluster-3 to be picked by PickN with N=1 policy due to the alphabetical order of cluster names. | ||
| // If kind-cluster-3 does not have namespace synced during schedule time, the RP might pick kind-cluster-1 or kind-cluster-2. | ||
| By("should wait for namespace collection to sync on all member clusters") |
There was a problem hiding this comment.
Here also needs to wait inside BeforeAll. Commented on why
| crpStatusUpdatedActual := crpStatusUpdatedActual(workNamespaceIdentifiers(), allMemberClusterNames, nil, "0") | ||
| Eventually(crpStatusUpdatedActual, eventuallyDuration, eventuallyInterval).Should(Succeed(), "Failed to update CRP status as expected") | ||
|
|
||
| By("waiting for namespace collection to sync on all member clusters") |
There was a problem hiding this comment.
In custom test, we expect "should not update RP status immediately". If I wait for namespace collection after the RP creation, the RP is likely to be updated when we expect it to be not updated yet.
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Description of your changes
Namespace affinity scheduler plugin that schedules RP to clusters with namespace based on MemberCluster namespaces status.
The user scenario:
Fixes #501
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer