
[8.4][ML] Previously assigned models should get at least one allocation #89068

Merged
dimitris-athanasiou merged 1 commit into elastic:8.4 from dimitris-athanasiou:previously-assigned-models-should-get-at-least-one-allocation-8_4 on Aug 3, 2022

Conversation

dimitris-athanasiou (Contributor) commented Aug 3, 2022


When ML nodes are replaced (for example, during a cluster resize or upgrade),
it is possible that some models cannot be allocated at all. While the cluster
is temporarily undersized, all cores are given to allocations of the models
that have survived. If those ML nodes return later, there may be model
deployments that were previously allocated that now do not get any
allocations, because our planner tries to preserve all current allocations.

Operationally, this does not serve our users well. Since the cluster already
lacks the resources to fully allocate all model deployments, we should instead
try to give at least one allocation to each model that has previously been
allocated.

To know whether a model has previously been allocated, this commit adds a field
to `TrainedModelAssignment` called `max_assigned_allocations`, which records the
maximum number of allocations a deployment has received over its lifetime. We can
then use this to establish whether a deployment has ever been allocated.
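The idea behind this field can be sketched as follows. This is a minimal, hypothetical illustration, not the actual `TrainedModelAssignment` code; the class and method names below are invented for clarity.

```java
// Hypothetical sketch: track the maximum number of allocations a deployment
// has ever received, so we can later tell whether it was ever allocated.
final class AssignmentStats {
    private int maxAssignedAllocations = 0;

    // Called whenever the current allocation count of the deployment changes.
    void recordAllocations(int currentAllocations) {
        maxAssignedAllocations = Math.max(maxAssignedAllocations, currentAllocations);
    }

    // A deployment was "previously allocated" if the max ever seen is > 0,
    // even if its current allocation count has since dropped to zero.
    boolean hasEverBeenAllocated() {
        return maxAssignedAllocations > 0;
    }

    int maxAssignedAllocations() {
        return maxAssignedAllocations;
    }
}
```

Note that the maximum is monotonic: once a deployment has been allocated, losing all its allocations later (e.g. when nodes disappear) does not reset it, which is exactly the property the planner needs.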

Finally, we modify the `AssignmentPlanner` so that, after computing a plan, we
check whether the plan gives at least one allocation to every previously
allocated model. If not, we compute a plan that tries to give at least one
allocation to each previously allocated model; this reduces to a bin-packing
problem. Given that plan, we invoke the planner one more time to optimize the
remaining allocations whilst preserving the single allocations for previously
allocated models.
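The bin-packing fallback step can be sketched roughly as below. This is an illustrative first-fit-decreasing packer under the simplifying assumption that memory is the only constrained resource; it is not the actual `AssignmentPlanner` implementation, and all names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of the fallback step: place exactly one allocation of
// each previously allocated model onto the nodes via first-fit-decreasing
// bin-packing over node memory capacities.
final class SingleAllocationPacker {

    // Returns model -> node index, or an empty map if not every model fits.
    static Map<String, Integer> packOnePerModel(long[] nodeFreeMemory,
                                                Map<String, Long> modelMemory) {
        long[] free = nodeFreeMemory.clone();
        Map<String, Integer> assignment = new LinkedHashMap<>();
        // Pack larger models first (first-fit-decreasing); this tends to
        // succeed more often than packing in arbitrary order.
        List<Map.Entry<String, Long>> models = new ArrayList<>(modelMemory.entrySet());
        models.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Long> model : models) {
            boolean placed = false;
            for (int n = 0; n < free.length; n++) {
                if (free[n] >= model.getValue()) {
                    free[n] -= model.getValue();
                    assignment.put(model.getKey(), n);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                // Cluster too small to give even one allocation to each model.
                return Collections.emptyMap();
            }
        }
        return assignment;
    }
}
```

Once such a placement exists, the planner can be re-run with those single allocations pinned in place, distributing any remaining capacity among the models as usual.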

Backport of #88855

@dimitris-athanasiou dimitris-athanasiou merged commit d2e56b0 into elastic:8.4 Aug 3, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the previously-assigned-models-should-get-at-least-one-allocation-8_4 branch August 3, 2022 10:54
