[BUG] [Segment Replication] Resync failures results in removal of in-sync allocation id  #7163

@dreamer-89

Describe the bug
Coming from the #6761 exercise, a few tests are flaky because the set of in-sync ids is smaller than expected. In the failing test, a network disruption (partition) is introduced between the replica nodes and the primary node is then stopped, which results in the promotion of one of the replicas from one partition. The promoted replica performs resync operations on the existing replicas. Because of the partition, this operation fails on the replicas in the other partition, and those replicas are then removed from the in-sync allocation ids set, which trips the test assertion.

Background
When a shard is allocated to a node, the cluster manager node assigns it a unique id, called the allocation id. The cluster manager keeps the list of all active allocation ids (also called in-sync allocation ids) belonging to the replication group in the cluster state, which is persisted on disk. An inactive replica is one that is not able to keep up with the primary and thus shouldn't be used during failover. During failover, when the primary dies, the cluster manager pings all nodes containing shard data, keeps only those whose copy is in the in-sync id set, and selects one node.
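To make the failover step concrete, here is a minimal sketch of the filtering described above. It is an illustrative model, not OpenSearch code: `FailoverSketch`, `ShardCopy`, and `selectNewPrimary` are hypothetical names, and picking the first matching copy is an assumption, since the issue does not say how the cluster manager chooses among in-sync candidates.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model for illustration only; not the OpenSearch implementation.
final class FailoverSketch {

    // A shard copy reported by a node, identified by its allocation id.
    static final class ShardCopy {
        final String nodeName;
        final String allocationId;

        ShardCopy(String nodeName, String allocationId) {
            this.nodeName = nodeName;
            this.allocationId = allocationId;
        }
    }

    // Keep only copies whose allocation id is in the in-sync set, then pick one.
    // Taking the first match is an assumption made for this sketch.
    static Optional<ShardCopy> selectNewPrimary(List<ShardCopy> copiesFound,
                                                Set<String> inSyncAllocationIds) {
        return copiesFound.stream()
            .filter(copy -> inSyncAllocationIds.contains(copy.allocationId))
            .findFirst();
    }
}
```

In this model, a copy whose allocation id has been dropped from the in-sync set is never considered for promotion, which is why a shrinking in-sync set matters for failover.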

Impact
Low. This failure happens only when there are other communication problems between nodes, and is not a likely case.

To Reproduce
PrimaryAllocationIT.testPrimaryReplicaResyncFailed fails reliably

Expected behavior
Failure of resync between the primary and a replica should not result in removal of the replica's allocation id from the in-sync ids.
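The difference between the observed and the expected behavior can be sketched with a small model of the in-sync set. This is a hypothetical illustration, not how OpenSearch tracks in-sync ids: `ResyncSketch`, `onResyncFailure`, and the `removeOnResyncFailure` flag are names invented for this example.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model for illustration only; not the OpenSearch implementation.
final class ResyncSketch {
    private final Set<String> inSyncIds;
    // true models the behavior observed in this issue, false the expected behavior.
    private final boolean removeOnResyncFailure;

    ResyncSketch(Set<String> initialInSyncIds, boolean removeOnResyncFailure) {
        this.inSyncIds = new LinkedHashSet<>(initialInSyncIds);
        this.removeOnResyncFailure = removeOnResyncFailure;
    }

    // Called when the promoted primary's resync to a replica fails.
    void onResyncFailure(String replicaAllocationId) {
        if (removeOnResyncFailure) {
            inSyncIds.remove(replicaAllocationId);
        }
    }

    // Read-only view of the current in-sync allocation ids.
    Set<String> inSyncIds() {
        return Collections.unmodifiableSet(inSyncIds);
    }
}
```

With `removeOnResyncFailure` set to `true`, a failed resync to a partitioned replica shrinks the set, as the test observes. With `false`, the replica's allocation id stays in the set, which is the behavior this issue asks for.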
