[BUG] [Segment Replication] Resync failure results in removal of in-sync allocation id #7163

**Describe the bug**
While working on https://github.com/opensearch-project/OpenSearch/issues/6761, a few tests turned out to be flaky because the set of in-sync allocation ids ends up smaller than expected. In the failing test, a network partition is created between the replica nodes and the primary node is then stopped, which promotes one of the replicas from one side of the partition. The promoted replica [performs](https://github.com/opensearch-project/OpenSearch/blob/632eb44a541b28ef16ed261904b45d74b84f3b9f/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L689) `resync` operations on the existing replicas. Because of the partition, the resync fails on the replicas on the other side, and those replicas are then removed from the in-sync allocation ids set, which trips the test assertion.

**Background**
When the cluster manager allocates a shard copy to a node, it assigns that copy a unique id, called the allocation id. The cluster manager keeps the list of all active allocation ids (also called in-sync allocation ids) belonging to the replication group in the cluster state, which is persisted on disk. An inactive replica is one that cannot keep up with the primary and therefore shouldn't be used during failover. During failover, when the primary dies, the cluster manager pings all nodes that hold shard data, keeps only the copies whose allocation id is in the in-sync set, and [selects](https://github.com/opensearch-project/OpenSearch/blob/632eb44a541b28ef16ed261904b45d74b84f3b9f/server/src/main/java/org/opensearch/gateway/PrimaryShardAllocator.java#L395) one node.
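
To make the failover filtering concrete, here is a minimal sketch of the idea. It is a toy model, not the `PrimaryShardAllocator` code: `ShardCopy`, `selectPrimaryCandidate`, and the node and allocation id values are hypothetical names, and picking the first match is a simplification of the real selection logic.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Toy model only: these types are hypothetical and are not OpenSearch classes.
final class InSyncSelectionSketch {

    // A shard copy found on some node, identified by its allocation id.
    record ShardCopy(String nodeName, String allocationId) {}

    // Keep only copies whose allocation id is in the in-sync set,
    // then pick the first one (the real allocator applies more criteria).
    static Optional<ShardCopy> selectPrimaryCandidate(List<ShardCopy> found, Set<String> inSyncIds) {
        return found.stream()
            .filter(copy -> inSyncIds.contains(copy.allocationId()))
            .findFirst();
    }

    public static void main(String[] args) {
        List<ShardCopy> found = List.of(
            new ShardCopy("node-1", "alloc-a"),
            new ShardCopy("node-2", "alloc-b"));
        // "alloc-a" is not in the in-sync set, so only the copy on node-2 is eligible.
        System.out.println(selectPrimaryCandidate(found, Set.of("alloc-b")));
    }
}
```

The point of the sketch is that a copy whose allocation id has been dropped from the in-sync set is never considered, even if its node is reachable and holds data.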

**Impact**
Low. This failure occurs only when there are already other communication problems between nodes, which is not a likely case.

**To Reproduce**
`PrimaryAllocationIT.testPrimaryReplicaResyncFailed` fails reliably.

**Expected behavior**
A resync failure between the primary and a replica should not result in the removal of the replica's allocation id from the in-sync ids.
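
The difference between the reported and the expected behavior can be illustrated with a small toy model. This is only a sketch under stated assumptions: `inSyncAfterResync`, the `removeOnFailure` flag, and the allocation id values are hypothetical and do not correspond to OpenSearch code or to a proposed fix.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only: hypothetical names, not the OpenSearch implementation.
final class ResyncFailureSketch {

    // Returns the in-sync set after the promoted primary attempts a resync.
    // unreachable: allocation ids of replicas the resync could not reach.
    // removeOnFailure: true models the reported behavior, false the expected one.
    static Set<String> inSyncAfterResync(Set<String> inSync, Set<String> unreachable, boolean removeOnFailure) {
        Set<String> result = new LinkedHashSet<>(inSync);
        if (removeOnFailure) {
            result.removeAll(unreachable);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> inSync = Set.of("alloc-promoted", "alloc-same-side", "alloc-other-side");
        Set<String> unreachable = Set.of("alloc-other-side");
        System.out.println("reported: " + inSyncAfterResync(inSync, unreachable, true));
        System.out.println("expected: " + inSyncAfterResync(inSync, unreachable, false));
    }
}
```

In the reported case the replica on the other side of the partition disappears from the in-sync set; in the expected case the set is unchanged after the failed resync.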

