[BUG] [Segment Replication] Resync failures results in removal of in-sync allocation id #7163
Description
Describe the bug
Coming out of the #6761 exercise, a few tests are flaky because they end up with a smaller set of in-sync ids than the test expects. In the failing test, a network disruption (partition) is introduced between the replica nodes, and then the primary node is stopped, which results in the promotion of a replica from one side of the partition. The promoted replica performs resync operations on the existing replicas. Because of the partition, the resync fails on the replicas on the other side of the partition, and those replicas are then removed from the in-sync allocation id set, which trips the test assertion.
Background
The cluster manager node assigns a unique id, called the allocation id, when a shard is allocated to a node. The cluster manager keeps the list of all active allocation ids (also called in-sync allocation ids) belonging to the replication group in the cluster state, which is persisted on disk. An inactive replica is one that is not able to keep up with the primary and thus should not be used during failover. During failover, when the primary dies, the cluster manager pings all nodes containing shard data, keeps only those whose copy is in the in-sync id set, and selects one node.
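The failover selection described above can be modelled with a small sketch. This is a simplified illustration in Python, not OpenSearch code: `ShardCopy`, `select_primary`, the node names, and the allocation ids are all hypothetical, and picking the first eligible copy is an assumption made only to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class ShardCopy:
    """A shard copy found on a node (hypothetical model, not an OpenSearch class)."""
    node: str
    allocation_id: str


def select_primary(copies: List[ShardCopy], in_sync_ids: Set[str]) -> Optional[ShardCopy]:
    """Return the first copy whose allocation id is in the in-sync set, or None."""
    candidates = [c for c in copies if c.allocation_id in in_sync_ids]
    return candidates[0] if candidates else None


copies = [
    ShardCopy("node-1", "a1"),
    ShardCopy("node-2", "a2"),
    ShardCopy("node-3", "a3"),
]
# "a2" is not in the in-sync set, so node-2 is not eligible for promotion.
in_sync_ids = {"a1", "a3"}
chosen = select_primary(copies, in_sync_ids)
```

The point of the sketch is that a copy dropped from the in-sync set can no longer be chosen during failover, which is why an unnecessary removal matters.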
Impact
Low. This failure occurs only when there are other communication problems between nodes, and is not a likely case.
To Reproduce
PrimaryAllocationIT.testPrimaryReplicaResyncFailed fails reliably
Expected behavior
A resync failure between the primary and a replica should not result in removal of the replica's allocation id from the in-sync ids.
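To make the difference between the observed and the expected outcome concrete, here is a minimal sketch. It is a toy model in Python, not the OpenSearch implementation: the function `in_sync_after_resync`, the `drop_on_failure` flag, and the allocation ids are hypothetical and exist only for illustration.

```python
from typing import Dict, Set


def in_sync_after_resync(
    in_sync_ids: Set[str],
    resync_succeeded: Dict[str, bool],
    drop_on_failure: bool,
) -> Set[str]:
    """Return the in-sync set after a resync round (toy model).

    resync_succeeded maps a replica's allocation id to True if the resync
    reached it. Ids with no entry (such as the new primary) are kept.
    """
    if not drop_on_failure:
        # Expected behavior: a failed resync leaves the in-sync set unchanged.
        return set(in_sync_ids)
    # Observed behavior: replicas whose resync failed are dropped.
    return {aid for aid in in_sync_ids if resync_succeeded.get(aid, True)}


in_sync_ids = {"new-primary", "replica-1", "replica-2"}
# replica-2 is on the other side of the partition, so its resync fails.
resync_succeeded = {"replica-1": True, "replica-2": False}

observed = in_sync_after_resync(in_sync_ids, resync_succeeded, drop_on_failure=True)
expected = in_sync_after_resync(in_sync_ids, resync_succeeded, drop_on_failure=False)
```

In this model `observed` loses `replica-2`, while `expected` keeps all three ids, which is the state the test assertion checks for.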