[RFC] Query Sandboxing for search requests #11061

**Author - Kaushal Kumar**
Parent RFC - https://github.com/opensearch-project/OpenSearch/issues/8879
### Introduction
In most information retrieval (IR) systems it is common for a few tenants to degrade performance for the others: a few tenants can consume a significant share of system resources and leave the rest deprived. It is therefore critically important for an IR system to minimize the performance impact that tenants have on one another. A natural way to do this is to let the user define and configure limits and priorities for the tenants of the system.
Node drops and poor performance caused by badly behaving tenant queries are common pain points in OpenSearch clusters. Tenant-based performance isolation in OpenSearch is therefore critically important for improving the resiliency and stability of the cluster, with Cx value as a bonus.

**Here I assume a tenant is a User or an Index, but tenants are not limited to these.**


### Problem 
OpenSearch currently has no mechanism for tenant-based performance isolation of search workloads. We want to enable admin users of an OpenSearch cluster to manage tenant-based `Sandboxes` that enforce resource-based limits on tenant queries. Each Sandbox will have a priority, which determines the cancellation order when a node is under duress.


### Scope
Since we want to partition resources among the tenants on a node, it makes sense to confine this feature to the node level so that node-level resiliency is achieved.


### Use cases
* User-based performance isolation.
* Index-based resource usage enforcement for search workloads, e.g. hot and warm.
* Throttling of skewed search traffic, i.e. if a single index receives most of the queries on a node and causes other search requests to be throttled or rejected, we can avoid this by confining the queries for that index to a sandbox.


### Proposal
We propose a reactive mechanism that actively tracks resource usage per tenant and cancels queries when a tenant oversubscribes system resources. This will help us identify and cancel rogue queries and maintain node stability.

As part of this proposal we will introduce a new software construct called `Sandbox`, which will be attribute based and which admin users of an OpenSearch cluster can manage (CRUD) at node level. The attributes we select will be generic across all users (domains/clusters). For system resources we will track `jvmAllocations` (allocations rather than current usage, because of a JDK API limitation for thread-level current JVM usage) and `cpuUtilisation`.
We are not considering other system resources such as network IO and disk IO, for the following reasons:
* There is no Java API for getting these per thread.
* We can get them from `/proc`, but that data is loaded from kernel data structures that store it in binary form, and reading the per-thread IO stats costs 3 system calls (open, read, close) per thread.
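
To make the `/proc` cost concrete, here is a minimal sketch of sampling IO counters for one thread on Linux. It is written in Python purely for illustration and is not part of the proposal. The path `/proc/self/task/<tid>/io` and the helper names `parse_task_io` and `read_thread_io` are assumptions introduced here; the RFC itself does not name a specific file.

```python
import os
import threading
from typing import Dict, Optional


def parse_task_io(text: str) -> Dict[str, int]:
    """Parse the 'key: value' lines of a per-thread io file into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = int(value)
    return stats


def read_thread_io(tid: Optional[int] = None) -> Optional[Dict[str, int]]:
    """Return IO counters for one thread, or None if the file is unavailable.

    Assumption: a Linux kernel that exposes /proc/self/task/<tid>/io.
    """
    if tid is None:
        tid = threading.get_native_id()
    path = f"/proc/self/task/{tid}/io"
    try:
        fd = os.open(path, os.O_RDONLY)   # syscall 1: open
        try:
            data = os.read(fd, 4096)      # syscall 2: read
        finally:
            os.close(fd)                  # syscall 3: close
    except OSError:
        return None
    return parse_task_io(data.decode("ascii"))
```

Every sample of every tracked thread repeats this open/read/close sequence, which is the overhead the second bullet refers to.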

A Sandbox will track resource usage for all requests associated with it and will try to enforce its resource usage limits. In node duress scenarios we will cancel queries from low-priority sandboxes first. Since tracking resource usage could become an overhead when there are too many sandboxes in the system, we can cap the number of sandboxes per node with a cluster-level setting.
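
The tracking and cancellation flow described above could look roughly like the following sketch, written in Python for brevity rather than as OpenSearch code. All names (`Sandbox`, `resolve`, `tasks_to_cancel`) and the policy details are assumptions made for illustration: a larger `priority` value means more important, only CPU is tracked, and cancellation happens only under node duress, taking the most expensive tasks first from sandboxes that exceed their limit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sandbox:
    name: str
    priority: int                 # assumption: larger value = more important
    cpu_limit: float              # share of node CPU this sandbox may use
    attributes: Dict[str, str]    # e.g. {"index": "logs-hot"}
    task_cpu: Dict[str, float] = field(default_factory=dict)  # task id -> CPU share

    def matches(self, request_attrs: Dict[str, str]) -> bool:
        """A request belongs here if it carries all of the sandbox attributes."""
        return all(request_attrs.get(k) == v for k, v in self.attributes.items())

    def cpu_usage(self) -> float:
        return sum(self.task_cpu.values())


def resolve(sandboxes: List[Sandbox], request_attrs: Dict[str, str]) -> Optional[Sandbox]:
    """Return the first sandbox whose attributes match the request, if any."""
    for sandbox in sandboxes:
        if sandbox.matches(request_attrs):
            return sandbox
    return None


def tasks_to_cancel(sandboxes: List[Sandbox], node_in_duress: bool) -> List[str]:
    """Pick tasks to cancel, visiting the lowest-priority sandboxes first."""
    if not node_in_duress:
        return []
    victims = []
    for sandbox in sorted(sandboxes, key=lambda s: s.priority):
        usage = sandbox.cpu_usage()
        by_cost = sorted(sandbox.task_cpu.items(), key=lambda kv: kv[1], reverse=True)
        for task_id, cpu in by_cost:
            if usage <= sandbox.cpu_limit:
                break  # sandbox is back under its limit
            victims.append(task_id)
            usage -= cpu
    return victims


hot = Sandbox("hot", priority=10, cpu_limit=0.5, attributes={"index": "logs-hot"})
warm = Sandbox("warm", priority=1, cpu_limit=0.2, attributes={"index": "logs-warm"})
sandboxes = [hot, warm]

# Attribute the observed usage of three running tasks to their sandboxes.
resolve(sandboxes, {"index": "logs-warm", "user": "alice"}).task_cpu.update({"t1": 0.25, "t2": 0.05})
resolve(sandboxes, {"index": "logs-hot"}).task_cpu["t3"] = 0.6

print(tasks_to_cancel(sandboxes, node_in_duress=True))  # ['t1', 't3']
```

In this sketch the `warm` sandbox is visited first because it has the lowest priority, and cancelling `t1` alone brings it back under its limit, so `t2` survives.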

We plan to start with a reactive mechanism, i.e. track and cancel on contention or threshold breaches. Going forward, we want to build a robust search query cost estimation framework so that the majority of such search queries can be cancelled upfront.

### Future Improvements
* **Hard cancellation** - This feature depends on hard cancellation to be highly effective, since the most it can do on its own is hint towards cancellation.
* **Search Query Cost Estimation** - This component will help us estimate the resource usage of search queries, so that queries can be rejected upfront based on their estimated cost.
* **Async Completion of cancellable queries** - We can punt these queries to a sandbox dedicated to async queries, where they can complete at some later point in time.
