
[RFC] Tiering/Migration of indices from hot to warm where warm indices are mutable #13294

@neetikasinghal

Description


Is your feature request related to a problem? Please describe

This proposal presents the different design choices for tiering an index from hot to warm, where warm indices remain writable, along with the recommended design. It is related to RFC #12809. The tiering APIs that shape the customer experience have already been discussed in #12501.

Describe the solution you'd like

Hot to Warm Tiering

API: POST /<indexNameOrPattern>/_tier/_warm

{
 "" : ""  // supporting body to make it extensible for future use-cases
}

Response:

Success:

{
"acknowledged": true
}

Failure:

{
    "error": {
        "root_cause": [
            {
                "type": "",
                "reason": ""
            }
        ]
    },
    "status": xxx
}
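As a minimal sketch of how a client might construct this request, the helper below builds the method, path, and body per the API shape above (the function name is illustrative, not part of the proposal):

```python
import json

def build_warm_tier_request(index_name_or_pattern: str) -> tuple[str, str, str]:
    """Build the hot-to-warm tiering request described above.

    Returns (HTTP method, request path, JSON body). The empty body keeps
    the API extensible for future use-cases, as the proposal notes.
    """
    path = f"/{index_name_or_pattern}/_tier/_warm"
    body = json.dumps({})  # extensible request body
    return ("POST", path, body)

method, path, body = build_warm_tier_request("logs-2024-*")
```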

Each of the design choices below considers two cases:
Case 1: Dedicated setup - a cluster with dedicated warm nodes
Case 2: Non-dedicated/shared setup - a cluster without dedicated warm nodes

DESIGN 1: Requests served via cluster manager node with cluster state entry and push notification from shards

In this design, a custom cluster state entry is introduced to store the metadata of the indices undergoing tiering.
In the non-dedicated setup, the hot-warm node listens for cluster state updates; on detecting a change of index.store.data_locality (introduced in the PR) from FULL to PARTIAL, it triggers the shard-level update on the composite directory. Allocation routing settings relocate the shards from data nodes to dedicated search nodes, and setting the index locality value to PARTIAL initializes each shard on the dedicated node as a PARTIAL shard.
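The node-side reaction in this design can be sketched as follows. This is a hypothetical illustration, not an OpenSearch API: on each cluster state update, the old and new index settings are compared, and the shard-level update on the composite directory fires only on a FULL-to-PARTIAL transition of index.store.data_locality.

```python
# Hypothetical sketch of Design 1's node-side listener. The setting key is
# taken from the proposal; the function and callback names are illustrative.
DATA_LOCALITY = "index.store.data_locality"

def on_cluster_state_update(old_settings: dict, new_settings: dict,
                            trigger_shard_update) -> bool:
    """Return True when the shard-level update was triggered."""
    if (old_settings.get(DATA_LOCALITY) == "FULL"
            and new_settings.get(DATA_LOCALITY) == "PARTIAL"):
        # Shard-level update on the composite directory.
        trigger_shard_update()
        return True
    return False
```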

(diagram)

Pros

  • The master can deduplicate requests by checking the cluster state entry
  • The cluster state entry is useful for tracking the tiering status
  • On master flips, accepted tiering requests are not lost, as the custom cluster state is replicated across all master nodes

Cons

  • The number of indices that can be migrated at a time is limited by the space occupied by each cluster state entry and the number of cluster state updates.
  • If the master goes down, the notification from a shard can be missed, marking a migration as stuck even though it actually completed (non-dedicated setup).
  • Accepted migration requests are lost on failure or reboot of all master nodes.
  • It is difficult for the TieringService to track failures and trigger retries in the shared-node setup, and it might need another cluster state update for retries. The alternative is an API to trigger retries for the data-locality update, which exposes two ways of doing the same thing - one via detecting a cluster state change and the other via an API.

(Preferred) DESIGN 2: Requests served via cluster manager node with no cluster state entry and internal API call for shard level update

In this design, there is no custom cluster state entry stored in the cluster state.
In case of dedicated warm nodes setup, TieringService adds allocation settings to the index (index.routing.allocation.require._role : search, index.routing.allocation.exclude._role : data) along with the other tiering related settings as shown in the diagram below. When the re-route is triggered, the allocation deciders run and decides to relocate the shard from hot node to warm node. index.store.data_locality: PARTIAL
helps in initializing the shard in the PARTIAL mode on the warm node during shard relocation.
In non-dedicated setup, there is a new API (more details will be provided in a separate issue) that is used to trigger the shard level update on the Composite Directory. This api can also be used for retrying in case of failures.
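For the dedicated case, the index settings the TieringService would apply can be illustrated as a simple settings map. The setting keys and values are taken from the proposal text; the function itself is a hypothetical sketch:

```python
# Illustrative sketch of the Design 2 settings update for the dedicated
# warm-node setup. Keys/values come from the proposal; the helper is not
# part of any real API.
def warm_tiering_settings() -> dict:
    return {
        "index.routing.allocation.require._role": "search",  # require warm/search nodes
        "index.routing.allocation.exclude._role": "data",    # move off hot data nodes
        "index.store.data_locality": "PARTIAL",              # init shard as PARTIAL on the warm node
    }
```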

To track the shard-level details of the tiering, the status API will provide insight into relocation status in the dedicated setup and shard-level updates in the non-dedicated setup. More details on the status API will be covered in a follow-up issue.

(diagram)

Pros

  • No cluster state entry is kept for migrations, saving the space a custom cluster state entry would occupy
  • On master flips, or on failure or reboot of all master nodes, accepted migration requests are not lost
  • Failures and timeouts are easily tracked via the TieringService, and retries are triggered via the same API used for the data-locality update

Cons

  • There could still be a limit on the number of parallel migrations supported at a time, as there are still two or more cluster state updates per requested index
  • Since there is no custom cluster state entry, checking which indices are undergoing migration requires traversing the entire list of indices. This can be optimized by keeping in-memory metadata on the master node, which must be re-computed on a master switch or restart.

DESIGN 3: Requests served via cluster manager node with no Cluster state entry and Polling Service

In this design, instead of relying on notifications from the shards, a polling service runs periodically and checks the status of the shards of the indices undergoing tiering.

About the Tiering Polling Service
The tiering polling service runs on the master node on a schedule, every x seconds as defined by an interval setting. A cluster-wide dynamic setting can configure the polling interval, and the interval can differ between the dedicated and shared-node setups, since migrations are expected to be faster in the shared-node setup.
The polling service begins by checking whether any migrations are in progress. If none are, it does nothing. Otherwise, it calls an API on the composite directory to check the status of each shard of the in-progress migrations.
Once all shards of an index succeed, the polling service updates that index's settings.

The caveat with this design is that with multiple indices in progress, the polling service has to call the status API for all shards of the in-progress indices. These can include shards that were already successful in a previous run; because one or two shards were still relocating, the status check must be redone in the next run. To optimize this, in-memory metadata can store the in-progress indices and the status of their shards. However, this metadata is lost on a master switch or a master going down; Design 4 tries to address this limitation.
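The in-memory optimization described above can be sketched as a polling step that remembers shards already reported successful, so subsequent runs only query the remaining shards. All names here are illustrative; check_shard_status stands in for the composite directory status API, and this cache would live on the master node and be lost on a master switch, as noted.

```python
# Hypothetical sketch of one run of the tiering polling service with an
# in-memory cache of already-successful shards (Design 3 optimization).
def poll_once(in_progress: dict, done: dict, check_shard_status) -> list:
    """Return the indices whose shards have all completed after this run.

    in_progress: index name -> set of shard ids undergoing tiering
    done:        index name -> set of shard ids already seen as SUCCESS
    """
    completed_indices = []
    for index, shards in in_progress.items():
        finished = done.setdefault(index, set())
        for shard in shards - finished:        # skip already-successful shards
            if check_shard_status(index, shard) == "SUCCESS":
                finished.add(shard)
        if finished == shards:
            completed_indices.append(index)    # ready for the settings update
    return completed_indices
```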

(diagram)

Pros

  • If the master goes down, the migration state will still be updated, since it does not rely on notifications. This rules out migrations that succeed but are marked stuck.
  • The polling service can batch the cluster state updates for indices that had an update within the given interval

Cons

  • When one shard is stuck, the polling service must trigger an API call for all the other shards that already succeeded and were found successful in the previous run.
  • Determining the polling interval can be tricky and needs performance benchmarking; it could vary across domains depending on migration traffic.
  • Since there is no custom cluster state entry, checking which indices are undergoing migration requires traversing the entire list of indices. This can be optimized by keeping in-memory metadata on the master node, which must be re-computed on a master switch or restart.

DESIGN 4: Requests served via cluster manager node with Cluster state entry and Polling Service

This design is similar to Design 3, except a custom cluster state entry stores the in-progress migrations, removing the need to keep local metadata on the master node and avoiding re-computation of that metadata on a master switch.

(diagram)

Pros

  • On a master switch, the custom cluster state persists, so shard statuses do not need to be re-checked
  • If the master goes down, the migration state will still be updated, since it does not rely on notifications. This rules out migrations that succeed but are marked stuck.
  • The cluster state entry can be used to report the status of in-progress migrations to the user
  • The polling service can batch the cluster state updates for indices that had an update within the given interval

Cons

  • The cluster state stores shard-level status, which can increase the overall number of cluster state updates per requested index
  • There would be a limit on the number of in-progress migration entries supported at a time, to prevent the cluster state from bloating.
  • On failure/restart of all master nodes, the custom cluster state is lost and must be re-computed.
  • Determining the polling interval can be tricky and needs performance benchmarking; it could vary across domains depending on migration traffic.

Open Questions and next steps

  • List the validations to be performed on the request, and check whether they can be done on the coordinator node instead of the master node.
  • Determine the number of parallel migrations that can be supported at a time for the chosen design
  • More details to follow on warm-to-hot tiering and on tracking the status of the tiering

Related component

Search:Remote Search
