[Remote Translog] Fail requests on isolated shard
Problem statement
Document type replication
For an index using document replication, if a shard is configured with one or more replicas, an isolated primary can realize that it is no longer the current primary when it fans out the incoming request to the replicas and a replica performs primary term validation. If no replicas are configured, the isolated primary can continue to accept incoming writes, depending on the `cluster.no_cluster_manager_block` setting. If `cluster.no_cluster_manager_block` is `write`, indexing requests are returned with an HTTP 5xx status code. If it is `metadata_write`, metadata writes are rejected but indexing continues indefinitely. Since there is no auto recovery of the failed primary (without replicas), rejoining the isolated primary to the cluster requires manual intervention.
Remote backed index
For an index backed by remote store, if a shard is configured with one or more replicas, an isolated primary can still realize that it is no longer the primary while doing primary term validation. This ensures that indexing requests are not acknowledged once the cluster manager promotes an in-sync replica to primary. However, when a shard is configured with no replicas and `cluster.no_cluster_manager_block` is `metadata_write`, the isolated primary can continue to accept indexing requests indefinitely. As part of the remote store vision, we also plan to auto restore a red index from the remote store, since remote-backed indexes have request-level durability. However, a shard with no replica hinders the auto restore of a red index: the isolated primary can keep accepting and acknowledging writes while the new primary shard restores from the remote store and starts accepting writes as well. The isolated primary continues to accept writes when it is acting as the coordinator; if the request lands on a node that is not isolated, that node forwards the request to the right primary shard.
If the cluster manager block setting `cluster.no_cluster_manager_block` is set to `write`, writes start failing after the LeaderChecker on the concerned node has exhausted all retries and applied the global-level cluster block. This ensures that no new requests are accepted. However, if requests were accepted before the global-level cluster block was applied and long GC pauses follow, it is possible that auto restore kicks in before such a request is acknowledged. This can lead to data loss.
Note - Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must also ensure there is a check at the end of the call, so that if auto restore has happened, we do not acknowledge an upload that started long ago.
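The double-check described in the note could look roughly like the sketch below. The class and method names here are illustrative assumptions, not the actual OpenSearch transport code; the point is only that the block is consulted both before and after the (possibly long) remote upload.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the entry + exit cluster-block check. Names are
// illustrative; they do not correspond to real OpenSearch classes.
class RemoteTranslogWritePath {
    // Set when the LeaderChecker exhausts retries and the node applies the
    // global-level cluster block.
    private final AtomicBoolean clusterBlocked = new AtomicBoolean(false);

    void applyNoClusterManagerBlock() {
        clusterBlocked.set(true);
    }

    /**
     * Returns true only if the write may be acknowledged. The block is
     * checked on entry and again after the remote upload returns, so an
     * upload that started before isolation (e.g. delayed by a long GC
     * pause) is never acknowledged once the block is in place.
     */
    boolean writeAndAck(Runnable uploadToRemoteTranslog) {
        if (clusterBlocked.get()) {
            return false; // rejected at the entry check
        }
        uploadToRemoteTranslog.run(); // may take arbitrarily long
        // Exit check: auto restore may have promoted a new primary while
        // this upload was in flight, so re-validate before acknowledging.
        return !clusterBlocked.get();
    }
}
```

Without the second check, a request that passed the entry check before isolation would still be acknowledged after the upload, which is exactly the data-loss window described above.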
Approach
- Re-enable `cluster.no_cluster_manager_block` to `write` for remote-translog-backed indexes.
- Use the last successful leader check call time: if the last leader check succeeded more than x minutes (the leader check time threshold) ago, fail the request even though the remote translog sync has gone through. This ensures that once auto restore has kicked in, no further requests are acknowledged. The auto restore delay should also be higher than the leader check time threshold, with enough buffer for clock errors and resolution.
The above approach has the most benefit when there are no remaining active replicas or the configured replica count is 0.
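The staleness test above, and the required relationship between the auto restore delay and the threshold, can be sketched as follows. This is a minimal illustration with assumed names; the real implementation would live in the write path and read the threshold from a setting.

```java
// Illustrative only: fail a write when the last successful leader check is
// stale, even if the remote translog sync itself succeeded.
class LeaderCheckStaleness {
    private final long thresholdMillis;
    private volatile long lastSuccessfulLeaderCheckMillis;

    LeaderCheckStaleness(long thresholdMillis, long nowMillis) {
        this.thresholdMillis = thresholdMillis;
        this.lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    void onLeaderCheckSuccess(long nowMillis) {
        lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    /** True if the request must be failed despite a successful remote sync. */
    boolean shouldFailWrite(long nowMillis) {
        return nowMillis - lastSuccessfulLeaderCheckMillis > thresholdMillis;
    }

    /**
     * Invariant from the text: auto restore must not kick in before stale
     * writes start failing, with a buffer for clock error and resolution.
     */
    static boolean autoRestoreDelayIsSafe(long autoRestoreDelayMillis,
                                          long thresholdMillis,
                                          long clockBufferMillis) {
        return autoRestoreDelayMillis > thresholdMillis + clockBufferMillis;
    }
}
```

On a healthy node the leader check succeeds continuously, so `shouldFailWrite` stays false; only an isolated node accumulates enough staleness to trip the check.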
Execution
Planning to implement this in 2 phases -
- Phase 1 - Fail writes whose elapsed time since the last successful leader check exceeds the threshold.
- Phase 2 - Fail the shard so it stops accepting any writes once we discover that the last successful leader check has exceeded the threshold defined above.
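The two phases could compose as in the sketch below (again, hypothetical names): phase 1 fails the individual write past the threshold, and phase 2 additionally latches the shard into a failed state so no further writes are accepted until the shard rejoins and recovers.

```java
// Hypothetical sketch combining both phases; not the actual implementation.
class IsolatedShardGuard {
    private final long thresholdMillis;
    private volatile long lastSuccessfulLeaderCheckMillis;
    private volatile boolean shardFailed = false; // phase 2 latch

    IsolatedShardGuard(long thresholdMillis, long nowMillis) {
        this.thresholdMillis = thresholdMillis;
        this.lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    void onLeaderCheckSuccess(long nowMillis) {
        lastSuccessfulLeaderCheckMillis = nowMillis;
    }

    /** Returns true if the write may proceed. */
    boolean acceptWrite(long nowMillis) {
        if (shardFailed) {
            return false; // phase 2: shard already failed, reject outright
        }
        if (nowMillis - lastSuccessfulLeaderCheckMillis > thresholdMillis) {
            shardFailed = true; // phase 2: stop accepting further writes
            return false;       // phase 1: fail this write
        }
        return true;
    }
}
```

The latch models the "fail the shards" behavior: a later successful leader check does not revive the shard; it must go through recovery to accept writes again.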
Usecases
- Network partition when replicas are configured
- Network partition without any replicas
- Asymmetric network partitions
Considerations
- Search / Indexing behaviour during isolation
- Both sync and async durability modes should adhere to the invariant of not acknowledging any writes that can be lost.