
[RFC][WLM] Approaches to enforcement of system resource limits #11846

@kaushalmahi12


Author: Kaushal Kumar

Is your feature request related to a problem? Please describe

This is a meta request related to the QSB feature request, intended to gather community feedback on possible approaches.

Describe the solution you'd like

Approaches to Enforce Resource Limits

I can think of two basic ways to enforce resource consumption limits. The first focuses on allocating and maintaining a fixed share of a resource for each sandbox, while the second is more flexible and aims to make optimum use of the resources available.

  1. Reserved - With this approach we assign a fixed percentage of a resource to each sandbox. The limits of all sandboxes together must not exceed 100. With this approach, a sandbox can see cancellations as soon as it hits its own limit, even while other sandboxes are underutilised.
    • Pros
      1. It makes cancellation a bit easier, as we only need to cancel when a sandbox exceeds its own limit.
      2. It frees us from the pain of tracking sandbox resource usage cumulatively. We need not employ a hierarchical topology for sandboxes.
      3. Sandboxes need not have a priority.
      4. Overall efficacy of system resources is higher, since #rejections > #cancellations.
    • Cons
      1. This can lead to underutilization of resources system-wide.
      2. There is additional overhead in validating the individual sandbox resource limits each time a customer (Cx) creates a sandbox.
  2. Constrained - With this approach we assign each sandbox a limit which we will always honor. To make optimum use of the available resources, however, the limits for a resource across all sandboxes need not sum up to the system-level duress limit. This creates the problem of deciding which sandbox should have its queries cancelled. To solve it, each sandbox gets a priority that is used when contention happens.
    • Pros
      1. Optimum use of available system resources.
      2. It can cause more cancellations than rejections if not configured properly (free-flowing limits, e.g. every sandbox configured with the maximum limit).
    • Cons
      1. It is complex to maintain a tree topology and priority-based cancellation in case of contention.
      2. Efficacy of system resources is lower, as #cancellations > #rejections in cases where none of the sandboxes hit their configured limits but cumulatively they put the node under duress. Cancelling a task wastes the resources already spent on that task's progress.

Let's understand them with the help of some examples. For the sake of simplicity I am using only a single value for each resource limit, but there will be two limits for each system resource: low and high.

Constrained

Let's say we have 3 sandboxes in the system:

  • Sandbox1 - { ResourceLimit: 60, Priority: 1}
  • Sandbox2 - { ResourceLimit: 20, Priority: 3}
  • Sandbox3 - { ResourceLimit: 40, Priority: 2}

System wide resource limit: 90

Let's capture the current resource usage of the sandboxes at different times.

Cancellation Case: sandbox limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    50        25        10

At T2, Sandbox2 (usage 25 against a limit of 20) will start rejecting new requests and will cancel some running ones.

Cancellation Case: system limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    50        15        30

Here the cumulative usage at T2 (50 + 15 + 30 = 95) breaches the system limit of 90, so cancellation is triggered. This means that Sandbox2 (usage 15 at T2) will face cancellation even though no sandbox-level limit is breached.

Rejection Case:

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    35        22        30

In this case Sandbox2 will face rejections, as its sandbox-level limit is breached (22 against a limit of 20).
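
The Constrained cases above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: `Sandbox` and `constrained_victims` are hypothetical names, a larger priority number is assumed to mean lower priority (consistent with Sandbox2 being treated as the lowest-priority sandbox in this RFC), and a single limit per sandbox is used, as in the examples.

```python
from dataclasses import dataclass


@dataclass
class Sandbox:
    name: str
    limit: int     # per-sandbox resource limit
    priority: int  # assumption: larger number = lower priority


def constrained_victims(sandboxes, usage, system_limit):
    """Return names of sandboxes that should shed load (hypothetical helper)."""
    # Sandbox-level breach: only the offending sandboxes are affected.
    breached = [s.name for s in sandboxes if usage[s.name] > s.limit]
    if breached:
        return breached
    # System-level breach: no sandbox is over its own limit, but the node
    # as a whole is under duress, so pick the lowest-priority active sandbox.
    if sum(usage.values()) > system_limit:
        active = [s for s in sandboxes if usage[s.name] > 0]
        return [max(active, key=lambda s: s.priority).name]
    return []


sandboxes = [
    Sandbox("Sandbox1", 60, 1),
    Sandbox("Sandbox2", 20, 3),
    Sandbox("Sandbox3", 40, 2),
]
# Sandbox limit breached (25 > 20).
print(constrained_victims(sandboxes, {"Sandbox1": 50, "Sandbox2": 25, "Sandbox3": 10}, 90))
# System limit breached (95 > 90) with no sandbox over its own limit.
print(constrained_victims(sandboxes, {"Sandbox1": 50, "Sandbox2": 15, "Sandbox3": 30}, 90))
```

Both calls select Sandbox2, but for different reasons: the first because of its own limit, the second because it has the lowest priority while the system limit is breached.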

Reserved

Let's say we have 3 sandboxes in the system:

  • Sandbox1 - { ResourceLimit: 50, Priority: 1}
  • Sandbox2 - { ResourceLimit: 20, Priority: 3}
  • Sandbox3 - { ResourceLimit: 30, Priority: 2}

The sandbox limits in this example are chosen so that the resource limits of all sandboxes sum up to 100, as is inherent in the approach.

Cancellation Case: sandbox limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    40        20(1)     20
T3    45        25(2)     10

(1) At this point Sandbox2 will start rejecting new incoming requests.
(2) At this point we will also start cancelling running requests from Sandbox2, due to the sandbox-level resource limit breach.
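
Footnotes (1) and (2) suggest that rejection starts once usage reaches the limit and cancellation starts once usage exceeds it. A minimal sketch of that reading is below. The function name `reserved_action` and its return labels are hypothetical, and the thresholds are inferred from the table rather than stated in this RFC, which also mentions separate low and high limits that are not modelled here.

```python
def reserved_action(usage: int, limit: int) -> str:
    """Classify one sandbox's state under the Reserved approach (illustrative)."""
    if usage > limit:
        # Limit breached: reject new requests and cancel some running ones.
        return "reject+cancel"
    if usage == limit:
        # Limit reached but not breached: only stop admitting new requests.
        return "reject"
    return "admit"


sandbox2_limit = 20
for time, usage in [("T1", 10), ("T2", 20), ("T3", 25)]:
    print(time, reserved_action(usage, sandbox2_limit))
```

Note that the decision depends only on the sandbox's own usage and limit, which is why this approach needs neither priorities nor cumulative tracking.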

Cancellation Case: system level limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    50        15        25
T3    50        18        30

In this case Sandbox2 will start cancelling requests because it is the lowest-priority sandbox.

Decision-driving factors for selecting one of the above approaches

  • We want to improve the overall efficacy of system resources, which means we should avoid wasting resources on tasks that could potentially shoot beyond the enforced limits. Basically, we will favor rejections over cancellations.
  • Our system should try its best to honor the user-assigned limits for these sandboxes, even though this can lead to underutilisation in the system. For example, say there are 3 sandboxes in the system with limits of 60, 20 and 10 respectively. There might be a time when only sandbox 2 receives traffic and is inundated with it; it will then start rejecting requests even though the system is not under duress.
  • At any point in time the limit assigned to a sandbox should be honored. For example, a sandbox should not face cancellation or rejection until its defined limit is breached.

Personal Verdict

  • We will go ahead with the Reserved approach for enforcing resource limits, considering the above points.

Problems with the selected approach to enforce system resource limits and possible solutions

The only open question with this approach is how to keep the cumulative resource limit at or below 100, since the user can supply any value for a new sandbox.

To understand this with the help of an example, let's say that at some point in time we have 3 sandboxes in the system:

  • sandbox1: { limit: 40 }
  • sandbox2: { limit: 30 }
  • sandbox3: { limit: 20 }

Now let's say the user wants to create a new sandbox with a resource limit of 30. The new cumulative sum would become 120 (> 100). This warrants either readjusting the existing sandbox limits or creating the new sandbox with a limit of 10.

There are two ways I can think of to resolve this conflict:

  1. We readjust the resource limits of all sandboxes in the same proportion on the user's behalf. E.g. in the above scenario we can let the new sandbox be created with a limit of 30 * 10/12 and readjust the other sandbox limits to 40 * 10/12, 30 * 10/12 and 20 * 10/12.
  2. We error out the request to create the new sandbox and ask the user to readjust the limits of the existing sandboxes to accommodate the new one.
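
The two options can be sketched as follows. This is a minimal illustration only: the function names `rescale_limits` and `create_or_reject` and the sandbox name `sandbox4` are hypothetical, and limits are treated as plain numbers that must sum to at most 100.

```python
def rescale_limits(existing, new_name, new_limit, total=100.0):
    """Option 1: shrink all limits proportionally so they sum to `total`."""
    proposed = {**existing, new_name: new_limit}
    requested = sum(proposed.values())
    if requested <= total:
        return proposed  # everything fits, nothing to readjust
    factor = total / requested  # 100/120 = 10/12 in the example
    return {name: limit * factor for name, limit in proposed.items()}


def create_or_reject(existing, new_name, new_limit, total=100.0):
    """Option 2: fail the request if the new limit does not fit."""
    available = total - sum(existing.values())
    if new_limit > available:
        raise ValueError(
            f"requested limit {new_limit} exceeds available {available}; "
            "reduce the limits of existing sandboxes first"
        )
    return {**existing, new_name: new_limit}


existing = {"sandbox1": 40, "sandbox2": 30, "sandbox3": 20}

rescaled = rescale_limits(existing, "sandbox4", 30)
print({name: round(limit, 2) for name, limit in rescaled.items()})

try:
    create_or_reject(existing, "sandbox4", 30)
except ValueError as err:
    print("rejected:", err)
```

With option 1 the user ends up with limits they never asked for (roughly 33.3, 25, 16.7 and 25 here), while option 2 keeps every configured limit exactly as the user set it and makes the conflict explicit.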

Personally I think the 2nd option provides the better user experience, but I am looking forward to hearing from folks on this.

I am using the keyword "Sandbox" because we started envisioning this feature with it, but it is not the final name for the construct to be used in the implementation.

Main Issues

Related component

Search:Resiliency

Describe alternatives you've considered

No response

Additional context

No response

Labels

  • RFC - Issues requesting major changes
  • Roadmap:Stability/Availability/Resiliency - Project-wide roadmap label
  • Search:Resiliency
  • discuss - Issues intended to help drive brainstorming and decision making
  • enhancement - Enhancement or improvement to existing feature or request
