
[8.4][ML] Previously assigned models should get at least one allocation #89068

Merged
dimitris-athanasiou merged 1 commit into elastic:8.4 from dimitris-athanasiou:previously-assigned-models-should-get-at-least-one-allocation-8_4 on Aug 3, 2022

Conversation

dimitris-athanasiou (Contributor) commented Aug 3, 2022


When ML nodes are replaced (for example, during a cluster resize or upgrade),
it is possible that some models cannot be allocated at all. While the cluster
is temporarily undersized, all cores are given to allocations of the models
that have survived. If those ML nodes return later, there may be model
deployments that were previously allocated that now do not get any
allocations, because our planner tries to preserve all current allocations.

Operationally, this does not serve our users well. Since the cluster already
lacks the resources to fully allocate all model deployments, we should instead
try to give at least one allocation to each model that has previously been
allocated.

To know whether a model has previously been allocated, this commit adds a field
to `TrainedModelAssignment` called `max_assigned_allocations`, which records the
maximum number of allocations a deployment has received over its lifetime. We can
then use this to establish whether a deployment has ever been allocated.
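The idea behind this field can be sketched as follows. This is a minimal, hypothetical illustration, not the actual `TrainedModelAssignment` code; the class and method names below are invented for clarity.

```java
// Hypothetical sketch: track the maximum number of allocations a deployment
// has ever received, so we can later tell whether it was ever allocated.
final class AssignmentStats {
    private int maxAssignedAllocations = 0;

    // Called whenever the current allocation count of the deployment changes.
    void recordAllocations(int currentAllocations) {
        maxAssignedAllocations = Math.max(maxAssignedAllocations, currentAllocations);
    }

    // A deployment was "previously allocated" if the max ever seen is > 0,
    // even if its current allocation count has since dropped to zero.
    boolean hasEverBeenAllocated() {
        return maxAssignedAllocations > 0;
    }

    int maxAssignedAllocations() {
        return maxAssignedAllocations;
    }
}
```

Note that the maximum is monotonic: once a deployment has been allocated, losing all its allocations later (e.g. when nodes disappear) does not reset it, which is exactly the property the planner needs.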

Finally, we modify the `AssignmentPlanner` so that, after computing a plan, we
check whether the plan gives at least one allocation to every previously
allocated model. If not, we compute a plan that tries to give at least one
allocation to each previously allocated model; this reduces to a bin-packing
problem. Given that plan, we invoke the planner one more time to optimize the
remaining allocations whilst preserving the single allocations for previously
allocated models.
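The bin-packing fallback step can be sketched roughly as below. This is an illustrative first-fit-decreasing packer under the simplifying assumption that memory is the only constrained resource; it is not the actual `AssignmentPlanner` implementation, and all names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of the fallback step: place exactly one allocation of
// each previously allocated model onto the nodes via first-fit-decreasing
// bin-packing over node memory capacities.
final class SingleAllocationPacker {

    // Returns model -> node index, or an empty map if not every model fits.
    static Map<String, Integer> packOnePerModel(long[] nodeFreeMemory,
                                                Map<String, Long> modelMemory) {
        long[] free = nodeFreeMemory.clone();
        Map<String, Integer> assignment = new LinkedHashMap<>();
        // Pack larger models first (first-fit-decreasing); this tends to
        // succeed more often than packing in arbitrary order.
        List<Map.Entry<String, Long>> models = new ArrayList<>(modelMemory.entrySet());
        models.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Long> model : models) {
            boolean placed = false;
            for (int n = 0; n < free.length; n++) {
                if (free[n] >= model.getValue()) {
                    free[n] -= model.getValue();
                    assignment.put(model.getKey(), n);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                // Cluster too small to give even one allocation to each model.
                return Collections.emptyMap();
            }
        }
        return assignment;
    }
}
```

Once such a placement exists, the planner can be re-run with those single allocations pinned in place, distributing any remaining capacity among the models as usual.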

Backport of #88855

@dimitris-athanasiou dimitris-athanasiou merged commit d2e56b0 into elastic:8.4 Aug 3, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the previously-assigned-models-should-get-at-least-one-allocation-8_4 branch August 3, 2022 10:54
