[BUG] Dual Replication - Failover to remote replica from remote primary fails when the replication group contains a docrep index

### Describe the bug

When a replication group has at least one docrep shard copy, failover from a remote primary to a remote replica fails with `no retention lease for tracked shard`.

During the dual replication phase, retention leases generated on the primary shard are synced to the docrep copies through `RetentionLeaseBackgroundSyncAction`, but the replication call to remote-enabled replica copies is blocked, so those replicas never receive the leases. When the primary fails over to one of these remote-enabled replicas, the `invariant()` check fails.
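To make the failure mode concrete, here is a toy model rather than OpenSearch code. `Copy`, `syncLeases`, and the lease id strings are hypothetical stand-ins. The only behavior taken from the description above is that leases reach docrep copies and are not sent to remote-enabled ones.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeaseSyncSketch {

    /** Hypothetical stand-in for one shard copy in the replication group. */
    static final class Copy {
        final String allocationId;
        final boolean remoteEnabled;
        final Set<String> leases = new HashSet<>();

        Copy(String allocationId, boolean remoteEnabled) {
            this.allocationId = allocationId;
            this.remoteEnabled = remoteEnabled;
        }
    }

    /** Pushes the primary's lease ids to docrep replicas only; remote-enabled replicas are skipped. */
    static void syncLeases(Copy primary, List<Copy> replicas) {
        for (Copy replica : replicas) {
            if (replica.remoteEnabled) {
                continue; // models the blocked replication call
            }
            replica.leases.clear();
            replica.leases.addAll(primary.leases);
        }
    }

    public static void main(String[] args) {
        Copy primary = new Copy("remote-primary", true);
        Copy docrep = new Copy("docrep-replica", false);
        Copy remote = new Copy("remote-replica", true);

        // Illustrative lease ids; the real id format is not modeled here.
        primary.leases.add("lease-for-remote-primary");
        primary.leases.add("lease-for-docrep-replica");

        syncLeases(primary, List.of(docrep, remote));

        System.out.println("docrep replica holds " + docrep.leases.size() + " lease(s)");
        System.out.println("remote replica holds " + remote.leases.size() + " lease(s)");
    }
}
```

In this model the remote replica ends up with an empty lease set, which is the state it would carry into a promotion.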

The code flows as follows. During a failover, the `activatePrimaryMode()` method of `ReplicationTracker` is invoked:

https://github.com/opensearch-project/OpenSearch/blob/7103e56a00ff4a17fb8e4e2087a8e69d916ed40c/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L784-L790

This enables the `primaryMode` flag on the `ReplicationTracker` instance, updates the local and global checkpoints, creates a retention lease for the shard itself, and runs the `invariant()` checks:

https://github.com/opensearch-project/OpenSearch/blob/7103e56a00ff4a17fb8e4e2087a8e69d916ed40c/server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java#L1359-L1364

The `invariant()` method checks for retention leases against all `replicated` shard copies. During dual replication, all docrep shard copies are marked as replicated.

https://github.com/opensearch-project/OpenSearch/blob/7103e56a00ff4a17fb8e4e2087a8e69d916ed40c/server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java#L958-L975

Since the retention leases were never copied over from the previous primary shard instance, the assertion trips here.
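The check can be reduced to a few lines for illustration. This is a simplified sketch, not the `ReplicationTracker` implementation: `CopyState`, `leaseIdFor`, and the `lease/` id scheme are hypothetical, and only the "tracked and replicated copies must have a lease" rule is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class InvariantSketch {

    /** Hypothetical, trimmed-down view of a shard copy as seen by the tracker. */
    static final class CopyState {
        final String allocationId;
        final boolean tracked;
        final boolean replicated;

        CopyState(String allocationId, boolean tracked, boolean replicated) {
            this.allocationId = allocationId;
            this.tracked = tracked;
            this.replicated = replicated;
        }
    }

    /** Hypothetical lease id scheme, used only for this example. */
    static String leaseIdFor(CopyState copy) {
        return "lease/" + copy.allocationId;
    }

    /** Returns the allocation ids of tracked, replicated copies that have no lease. */
    static List<String> copiesMissingLease(List<CopyState> copies, Set<String> leaseIds) {
        List<String> missing = new ArrayList<>();
        for (CopyState copy : copies) {
            if (copy.tracked && copy.replicated && !leaseIds.contains(leaseIdFor(copy))) {
                missing.add(copy.allocationId);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<CopyState> copies = List.of(
            new CopyState("new-primary", true, true),
            new CopyState("docrep-replica", true, true)
        );
        // After promotion the new primary only holds the lease it created for itself.
        Set<String> leaseIds = Set.of("lease/new-primary");

        System.out.println("copies missing a lease: " + copiesMissingLease(copies, leaseIds));
    }
}
```

Any non-empty result corresponds to the case where the real assertion would report `no retention lease for tracked shard`.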

We need to re-create the retention leases for docrep shard copies and hold off on invoking this assertion until the leases have been created.
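One possible ordering for that proposal is sketched below. It is an assumption-laden illustration, not the actual fix: `recreateMissingLeases`, `invariantHolds`, `CopyState`, and the `lease/` id scheme are all hypothetical names, and the real change would have to work with the tracker's existing lease handling.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeaseRecreationSketch {

    /** Hypothetical, trimmed-down view of a shard copy as seen by the tracker. */
    static final class CopyState {
        final String allocationId;
        final boolean tracked;
        final boolean replicated;

        CopyState(String allocationId, boolean tracked, boolean replicated) {
            this.allocationId = allocationId;
            this.tracked = tracked;
            this.replicated = replicated;
        }
    }

    /** Hypothetical lease id scheme, used only for this example. */
    static String leaseIdFor(CopyState copy) {
        return "lease/" + copy.allocationId;
    }

    /** Adds a lease for every tracked, replicated copy that lacks one; returns how many were added. */
    static int recreateMissingLeases(List<CopyState> copies, Set<String> leaseIds) {
        int added = 0;
        for (CopyState copy : copies) {
            // Set.add returns true only when the id was not already present.
            if (copy.tracked && copy.replicated && leaseIds.add(leaseIdFor(copy))) {
                added++;
            }
        }
        return added;
    }

    /** Simplified form of the invariant: every tracked, replicated copy has a lease. */
    static boolean invariantHolds(List<CopyState> copies, Set<String> leaseIds) {
        for (CopyState copy : copies) {
            if (copy.tracked && copy.replicated && !leaseIds.contains(leaseIdFor(copy))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<CopyState> copies = List.of(
            new CopyState("new-primary", true, true),
            new CopyState("docrep-replica", true, true)
        );
        Set<String> leaseIds = new HashSet<>();
        leaseIds.add("lease/new-primary"); // the lease the new primary created for itself

        System.out.println("invariant before re-creation: " + invariantHolds(copies, leaseIds));
        int added = recreateMissingLeases(copies, leaseIds);
        System.out.println("leases re-created: " + added);
        System.out.println("invariant after re-creation: " + invariantHolds(copies, leaseIds));
    }
}
```

The point of the ordering is that the check runs only after the missing leases exist, so it is deferred rather than weakened.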

### Related component

Storage:Remote

### To Reproduce

N/A

### Expected behavior

Failover from a remote primary to either a docrep or a remote replica should work seamlessly during the dual replication phase.

### Additional Details



Snippet referenced above, `IndexShard.java#L784-L790`:

    replicationTracker.activatePrimaryMode(getLocalCheckpoint());
    if (indexSettings.isSegRepEnabledOrRemoteNode()) {
        // force publish a checkpoint once in primary mode so that replicas not caught up to previous primary
        // are brought up to date.
        checkpointPublisher.publish(this, getLatestReplicationCheckpoint());
    }
    postActivatePrimaryMode();

Snippet referenced above, `ReplicationTracker.java#L1359-L1364`:

    primaryMode = true;
    updateLocalCheckpoint(shardAllocationId, checkpoints.get(shardAllocationId), localCheckpoint);
    updateGlobalCheckpointOnPrimary();

    addPeerRecoveryRetentionLeaseForSolePrimary();
    assert invariant();

Snippet referenced above, `ReplicationTracker.java#L958-L975`:

    if (primaryMode && indexSettings.isSoftDeleteEnabled() && hasAllPeerRecoveryRetentionLeases) {
        // all tracked shard copies have a corresponding peer-recovery retention lease
        for (final ShardRouting shardRouting : routingTable.assignedShards()) {
            final CheckpointState cps = checkpoints.get(shardRouting.allocationId().getId());
            if (cps.tracked && cps.replicated) {
                assert retentionLeases.contains(getPeerRecoveryRetentionLeaseId(shardRouting))
                    : "no retention lease for tracked shard [" + shardRouting + "] in " + retentionLeases;
                assert PEER_RECOVERY_RETENTION_LEASE_SOURCE.equals(
                    retentionLeases.get(getPeerRecoveryRetentionLeaseId(shardRouting)).source()
                ) : "incorrect source ["
                    + retentionLeases.get(getPeerRecoveryRetentionLeaseId(shardRouting)).source()
                    + "] for ["
                    + shardRouting
                    + "] in "
                    + retentionLeases;
            }
        }
    }

[BUG] Dual Replication - Failover to remote replica from remote primary fails when the replication group contains a docrep index #13158
