Describe the bug
The replication lag metric appears to grow indefinitely even though no documents have been indexed:
"segment_replication" : {
"max_bytes_behind" : 0,
"total_bytes_behind" : 0,
"max_replication_lag" : 7555216
},
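The snippet above is from the /_cluster/stats response; the block can be pulled out directly with a response filter, e.g. (just one convenient way to do it):
GET /_cluster/stats?filter_path=**.segment_replication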
Some logs I captured:
[2023-11-09T09:19:33,698][INFO ][o.o.i.r.RecoverySourceHandler] [45d10d39f1d283b11e920bc74e55fdf8] [test-index][1][recover to 2c02e270b7490c83796b2e271a4c490e] finalizing recovery took [46.6ms]
[2023-11-09T09:34:57,336][INFO ][o.o.i.r.RecoverySourceHandler] [45d10d39f1d283b11e920bc74e55fdf8] [test-index][1][recover to 49ec3af290fd4b0794a6bdb24d5bcda3] finalizing recovery took [62.2ms]
[2023-11-09T09:34:58,466][INFO ][o.o.i.r.RecoverySourceHandler] [45d10d39f1d283b11e920bc74e55fdf8] [test-index][1][recover to 10d328ac6d4e995c52df28caf143325c] finalizing recovery took [777.1ms]
Where the node assignment is as follows:
- Recovered shard to 2c02e270b7490c83796b2e271a4c490e - the old primary
- Recovered shard to 49ec3af290fd4b0794a6bdb24d5bcda3 - the replica
- Recovered shard to 10d328ac6d4e995c52df28caf143325c - the new primary
From the replica's logs, it is trying to notify the old primary of its state instead of the new one, even after the primary has relocated:
[2023-11-09T09:34:59,172][ERROR][o.o.i.r.SegmentReplicationTargetService] [49ec3af290fd4b0794a6bdb24d5bcda3] Failed to update visible checkpoint for replica [test-index][1], ReplicationCheckpoint{shardId=[test-index][1], primaryTerm=1, segmentsGen=6, version=9, size=0, codec=Lucene95}:
RemoteTransportException[[45d10d39f1d283b11e920bc74e55fdf8][internal:index/shard/replication/update_visible_checkpoint]]; nested: IndexNotFoundException[no such index [test-index]];
Caused by: [test-index/01RD3t9jZMSnCHvntC8p9DlQ] IndexNotFoundException[no such index [test-index]]
at org.opensearch.indices.IndicesService.indexServiceSafe(IndicesService.java:734)
at org.opensearch.indices.replication.SegmentReplicationSourceService$UpdateVisibleCheckpointRequestHandler.messageReceived(SegmentReplicationSourceService.java:157)
at org.opensearch.indices.replication.SegmentReplicationSourceService$UpdateVisibleCheckpointRequestHandler.messageReceived(SegmentReplicationSourceService.java:153)
at ...
I've found from the logs that this happens after a primary relocation. It looks like the primary moves to a new node and refreshes, publishing a checkpoint to its replicas and starting its timers; the replica then syncs or discards the checkpoint and calls back to the old primary, rather than the new one, to update its state.
SegmentReplicationTargetService identifies the primary using:
ShardRouting primaryShard = clusterService.state().routingTable().shardRoutingTable(replicaShard.shardId()).primaryShard();
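That lookup is answered from the replica's locally applied cluster state. The same line, reformatted and annotated with my reading of the failure:
// Resolved from the replica's *local* cluster state. If the replica applies
// cluster state updates late (as SlowClusterStateProcessing forces in the IT
// below), this routing entry still names the old primary node after the
// relocation, so the internal:index/shard/replication/update_visible_checkpoint
// call lands on a node that has already removed the shard and fails with
// IndexNotFoundException, while the new primary's lag timer keeps running.
ShardRouting primaryShard = clusterService.state()
    .routingTable()
    .shardRoutingTable(replicaShard.shardId())
    .primaryShard();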
To Reproduce
Steps to reproduce the behavior:
This is not 100% reproducible...
- Create an index with n replicas.
- Continuously relocate the primary to a new node (see the reroute call below).
- Check /_cluster/stats: we will see lag even though no document has been indexed.
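For a manual reproduction, the relocation in the second step can be driven through the reroute API; the node names here are placeholders for the actual node IDs:
POST /_cluster/reroute
{
  "commands" : [
    {
      "move" : {
        "index" : "test-index",
        "shard" : 0,
        "from_node" : "<current-primary-node>",
        "to_node" : "<target-node>"
      }
    }
  ]
}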
I have also been able to reproduce this case with an IT that slows cluster state processing on the replica (SlowClusterStateProcessing). Adding the following to SegmentReplicationUsingRemoteStoreDisruptionIT fails 100% of the time:
public void testUpdateVisibleCheckpointWithLaggingClusterStateUpdates() throws Exception {
    Path location = randomRepoPath().toAbsolutePath();
    Settings nodeSettings = Settings.builder()
        .put(buildRemoteStoreNodeAttributes(location, 0d, "metadata", Long.MAX_VALUE))
        .build();
    internalCluster().startClusterManagerOnlyNode(nodeSettings);
    internalCluster().startDataOnlyNodes(2, nodeSettings);
    final Settings indexSettings = Settings.builder()
        .put(indexSettings())
        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1)
        .build();
    createIndex(INDEX_NAME, indexSettings);
    ensureGreen(INDEX_NAME);
    final Set<String> dataNodeNames = internalCluster().getDataNodeNames();
    final String replicaNode = getNode(dataNodeNames, false);
    final String oldPrimary = getNode(dataNodeNames, true);

    // Index a doc and refresh so a checkpoint is published to the replica.
    client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", randomInt()).get();
    refresh(INDEX_NAME);

    logger.info("--> start another node");
    final String newPrimary = internalCluster().startDataOnlyNode(nodeSettings);
    ClusterHealthResponse clusterHealthResponse = client().admin()
        .cluster()
        .prepareHealth()
        .setWaitForEvents(Priority.LANGUID)
        .setWaitForNodes("4")
        .get();
    assertFalse(clusterHealthResponse.isTimedOut());

    // Delay cluster state application on the replica so its routing table
    // lags behind the primary relocation.
    SlowClusterStateProcessing disruption = new SlowClusterStateProcessing(replicaNode, random(), 0, 0, 1000, 2000);
    internalCluster().setDisruptionScheme(disruption);
    disruption.startDisrupting();

    // Relocate the primary while the replica's cluster state is lagging.
    logger.info("--> relocate the shard");
    client().admin()
        .cluster()
        .prepareReroute()
        .add(new MoveAllocationCommand(INDEX_NAME, 0, oldPrimary, newPrimary))
        .execute()
        .actionGet();
    clusterHealthResponse = client().admin()
        .cluster()
        .prepareHealth()
        .setWaitForEvents(Priority.LANGUID)
        .setWaitForNoRelocatingShards(true)
        .setTimeout(ACCEPTABLE_RELOCATION_TIME)
        .execute()
        .actionGet();
    assertFalse(clusterHealthResponse.isTimedOut());

    IndexShard newPrimaryShard = getIndexShard(newPrimary, INDEX_NAME);
    IndexShard replica = getIndexShard(replicaNode, INDEX_NAME);
    assertBusy(() -> {
        assertEquals(
            newPrimaryShard.getLatestReplicationCheckpoint().getSegmentInfosVersion(),
            replica.getLatestReplicationCheckpoint().getSegmentInfosVersion()
        );
    });

    // The replica is fully caught up, so no lag should be reported.
    assertBusy(() -> {
        ClusterStatsResponse clusterStatsResponse = client().admin().cluster().prepareClusterStats().get();
        ReplicationStats replicationStats = clusterStatsResponse.getIndicesStats().getSegments().getReplicationStats();
        assertEquals(0L, replicationStats.maxBytesBehind);
        assertEquals(0L, replicationStats.maxReplicationLag);
        assertEquals(0L, replicationStats.totalBytesBehind);
    });
    disruption.stopDisrupting();
    disableRepoConsistencyCheck("Remote Store Creates System Repository");
    cleanupRepo();
}
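Assuming the test is added to that class in the server module's internalClusterTest source set, it can be run in isolation with something like:
./gradlew ':server:internalClusterTest' --tests '*SegmentReplicationUsingRemoteStoreDisruptionIT.testUpdateVisibleCheckpointWithLaggingClusterStateUpdates'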
Expected behavior
Lag should not grow unless there is an active replication event in the group.
Plugins
N/A