[8.4][ML] Previously assigned models should get at least one allocati…#89068
Merged
dimitris-athanasiou merged 1 commit intoelastic:8.4from Aug 3, 2022
Conversation
…on (elastic#88855) When for some reason ML nodes are replaced (cluster resize, upgrade, etc.), it is possible that some models cannot be allocated at all. Then, while the cluster is temporarily undersized, all cores are given for allocations of the models that have survived. If those ML nodes return later, there may be model deployments that were previously allocated that now do not get any allocations. The reason is that our planner will try to preserve all current allocations. Operationally, this is not what serves best our users. Instead, as we are already in a cluster that does not have enough resources to fully allocate all model deployments, we should try to give at least one allocation to each model that has previously been allocated. In order to know a model has previously been allocated, this commit adds a field to `TrainedModelAssignment` called `max_assigned_allocations` which records the max number of allocations a deployment has received in its life. We can then use this to establish whether a deployment has ever been allocated. Finally, we modify the `AssignmentPlanner` so that after computing a plan we check whether the plan gives at least one allocation to all previously allocated models. If not, we then compute a plan that tries to give at least one allocation to each previously allocated model. We can solve this just using bin-packing. Having that plan we can invoke the planner one more time to optimize the rest of the allocations whilst preserving the single allocations for previously allocated models. Backport of elastic#88855
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…on (#88855)
When for some reason ML nodes are replaced (cluster resize, upgrade, etc.),
it is possible that some models cannot be allocated at all. Then, while
the cluster is temporarily undersized, all cores are given for allocations
of the models that have survived. If those ML nodes return later, there may
be model deployments that were previously allocated that now do not get any
allocations. The reason is that our planner will try to preserve all current
allocations.
Operationally, this is not what serves best our users. Instead, as we are
already in a cluster that does not have enough resources to fully allocate
all model deployments, we should try to give at least one allocation to each
model that has previously been allocated.
In order to know a model has previously been allocated, this commit adds a field
to
TrainedModelAssignmentcalledmax_assigned_allocationswhich records themax number of allocations a deployment has received in its life. We can then use
this to establish whether a deployment has ever been allocated.
Finally, we modify the
AssignmentPlannerso that after computing a plan wecheck whether the plan gives at least one allocation to all previously allocated models.
If not, we then compute a plan that tries to give at least one allocation to each
previously allocated model. We can solve this just using bin-packing. Having that
plan we can invoke the planner one more time to optimize the rest of the allocations
whilst preserving the single allocations for previously allocated models.
Backport of #88855