[Remote Translog] Fail requests on isolated shard
Problem statement
Document type replication
For an index using document replication, if a shard is configured with one or more replicas, an isolated primary can realize that it is no longer the current primary when it fans out the incoming request to the replicas and a replica performs primary term validation. If no replicas are configured, the isolated primary can continue to accept incoming writes, depending on the `cluster.no_cluster_manager_block` setting. If `cluster.no_cluster_manager_block` is `write`, indexing requests are returned with an HTTP 5xx status code. If it is `metadata_write`, metadata writes are rejected but indexing continues indefinitely. Since there is no auto recovery of the failed primary (without replicas), rejoining the isolated primary to the cluster requires manual intervention.
Remote backed index
For an index backed by remote store, if a shard is configured with one or more replicas, an isolated primary can still realize that it is no longer the primary while doing primary term validation. This ensures that indexing requests are not acknowledged once the cluster manager promotes an in-sync replica to primary. However, when a shard is configured with no replicas and `cluster.no_cluster_manager_block` is `metadata_write`, the isolated primary can continue to accept indexing requests indefinitely. As part of the remote store vision, we also plan to auto restore a red index from the remote store, since remote-backed indexes have request-level durability. However, a shard with no replica hinders the auto restore of a red index: the isolated primary can keep accepting and acknowledging writes while the new primary shard restores from the remote store and starts accepting writes as well. The isolated primary continues to accept writes when it is acting as the coordinator; if the request lands on a node that is not isolated, that node forwards the request to the right primary shard.
If the cluster manager block setting `cluster.no_cluster_manager_block` is set to `write`, writes start failing after the LeaderChecker on the concerned node has exhausted all retries and applied the global-level cluster block. This ensures that no new requests are accepted. However, if requests were accepted before the global-level cluster block was applied and long GC pauses follow, it is possible that auto restore kicks in before such a request is acknowledged. This can lead to data loss.
Note - Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must also ensure there is a check at the end of the call, so that if auto restore has happened, we do not acknowledge an upload that started long ago.
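The double-check described in the note could look roughly like the sketch below. The class and method names here are illustrative assumptions, not the actual OpenSearch transport code; the point is only that the block is consulted both before and after the (possibly long) remote upload.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the entry + exit cluster-block check. Names are
// illustrative; they do not correspond to real OpenSearch classes.
class RemoteTranslogWritePath {
    // Set when the LeaderChecker exhausts retries and the node applies the
    // global-level cluster block.
    private final AtomicBoolean clusterBlocked = new AtomicBoolean(false);

    void applyNoClusterManagerBlock() {
        clusterBlocked.set(true);
    }

    /**
     * Returns true only if the write may be acknowledged. The block is
     * checked on entry and again after the remote upload returns, so an
     * upload that started before isolation (e.g. delayed by a long GC
     * pause) is never acknowledged once the block is in place.
     */
    boolean writeAndAck(Runnable uploadToRemoteTranslog) {
        if (clusterBlocked.get()) {
            return false; // rejected at the entry check
        }
        uploadToRemoteTranslog.run(); // may take arbitrarily long
        // Exit check: auto restore may have promoted a new primary while
        // this upload was in flight, so re-validate before acknowledging.
        return !clusterBlocked.get();
    }
}
```

Without the second check, a request that passed the entry check before isolation would still be acknowledged after the upload, which is exactly the data-loss window described above.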
Approach
- Re-enable `cluster.no_cluster_manager_block` to `write` for remote-translog-backed indexes.
- Use the last successful leader check call time: if the last leader check succeeded more than x minutes (the leader check time threshold) ago, fail the request even though the remote translog sync has gone through. This ensures that once auto restore has kicked in, no further requests are acknowledged. The auto restore delay should also be higher than the leader check time threshold, with enough buffer for clock errors and resolution.
The above approach has the most benefit when there are no remaining active replicas or the configured replica count is 0.
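The staleness test above, and the required relationship between the auto restore delay and the threshold, can be sketched as follows. This is a minimal illustration with assumed names; the real implementation would live in the write path and read the threshold from a setting.

```java
// Illustrative only: fail a write when the last successful leader check is
// stale, even if the remote translog sync itself succeeded.
class LeaderCheckStaleness {
    private final long thresholdMillis;
    private volatile long lastSuccessfulLeaderCheckMillis;

    LeaderCheckStaleness(long thresholdMillis, long nowMillis) {
        this.thresholdMillis = thresholdMillis;
        this.lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    void onLeaderCheckSuccess(long nowMillis) {
        lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    /** True if the request must be failed despite a successful remote sync. */
    boolean shouldFailWrite(long nowMillis) {
        return nowMillis - lastSuccessfulLeaderCheckMillis > thresholdMillis;
    }

    /**
     * Invariant from the text: auto restore must not kick in before stale
     * writes start failing, with a buffer for clock error and resolution.
     */
    static boolean autoRestoreDelayIsSafe(long autoRestoreDelayMillis,
                                          long thresholdMillis,
                                          long clockBufferMillis) {
        return autoRestoreDelayMillis > thresholdMillis + clockBufferMillis;
    }
}
```

On a healthy node the leader check succeeds continuously, so `shouldFailWrite` stays false; only an isolated node accumulates enough staleness to trip the check.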
Execution
Planning to implement this in 2 phases -
- Phase 1 - Fail writes whose elapsed time since the last successful leader check exceeds the threshold.
- Phase 2 - Fail the shard so it stops accepting any writes once we discover that the last successful leader check has exceeded the threshold defined above.
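The two phases could compose as in the sketch below (again, hypothetical names): phase 1 fails the individual write past the threshold, and phase 2 additionally latches the shard into a failed state so no further writes are accepted until the shard rejoins and recovers.

```java
// Hypothetical sketch combining both phases; not the actual implementation.
class IsolatedShardGuard {
    private final long thresholdMillis;
    private volatile long lastSuccessfulLeaderCheckMillis;
    private volatile boolean shardFailed = false; // phase 2 latch

    IsolatedShardGuard(long thresholdMillis, long nowMillis) {
        this.thresholdMillis = thresholdMillis;
        this.lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    void onLeaderCheckSuccess(long nowMillis) {
        lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    /** Returns true if the write may proceed. */
    boolean acceptWrite(long nowMillis) {
        if (shardFailed) {
            return false; // phase 2: shard already failed, reject outright
        }
        if (nowMillis - lastSuccessfulLeaderCheckMillis > thresholdMillis) {
            shardFailed = true; // phase 2: stop accepting further writes
            return false;       // phase 1: fail this write
        }
        return true;
    }
}
```

The latch models the "fail the shards" behavior: a later successful leader check does not revive the shard; it must go through recovery to accept writes again.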
Usecases
- Network partition when replicas are configured
- Network partition without any replicas
- Asymmetric network partitions
Considerations
- Search / Indexing behaviour during isolation
- Both sync and async durability modes should adhere to the invariant of not acknowledging any writes that can be lost.