Description
Describe the bug
During a rolling restart of our OpenSearch cluster, some replica shards fail to reassign to available nodes. The logs show that the destination node rejects the segment-replication checkpoint sent by the primary shard as stale, because the replica's initial checkpoint is already ahead of it. This suggests that the primary's state is changing during the recovery process, leading to a replication failure.
The shard fails to be assigned after 5 retries, at which point the cluster gives up. The log message explicitly states: shard has exceeded the maximum number of retries [5] on failed allocation attempts. The root cause is a ReplicationFailedException due to a stale checkpoint.
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-03T23:50:03.463Z], failed_attempts[5], failed_nodes[[Qm1RnXJQQYqSrlqcBq-X6Q]], delayed=false, details[failed shard on node [Qm1RnXJQQYqSrlqcBq-X6Q]: failed recovery, failure RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true} ([logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-g4nt][10.202.0.19:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-lntz][10.202.0.17:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, 
segmentsGen=171, version=13559, size=32294083233, codec=ZSTD912, timestamp=0}] since initial checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=1066, version=14449, size=32294083233, codec=ZSTD101, timestamp=1756943403278888666}] is ahead of it]; ], allocation_status[no_attempt]]]
This appears to be a bug where the primary and replica shards get out of sync during the recovery process. The primary's state changes while it's trying to send an old copy of the data, which the new replica correctly rejects.
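As a workaround once the retry counter is exhausted, the log itself points at the manual retry endpoint. A minimal dry-run sketch of the recovery commands (the endpoint `http://localhost:9200` is an assumption; substitute your host and any auth flags, and drop the `echo` prefixes to actually execute):

```shell
# OS_HOST is an assumption; point it at your cluster.
OS_HOST="${OS_HOST:-http://localhost:9200}"

# Retry allocations that exceeded the failed-attempt limit [5],
# as suggested by the log message itself.
echo curl -s -X POST "$OS_HOST/_cluster/reroute?retry_failed=true"

# Then watch shard state until the cluster goes green again.
echo curl -s "$OS_HOST/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
```

Note this only retries the allocation; it does not address the underlying stale-checkpoint race described above, so the recovery may fail again the same way.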
Related component
Other
To Reproduce
- Disable shard allocation.
- Restart an OpenSearch node.
- Enable shard allocation.
- The cluster never becomes green, as the shards remain unassigned, preventing subsequent steps in the rolling restart process.
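For reference, the allocation toggle in the steps above looks roughly like this (a dry-run sketch; the endpoint `http://localhost:9200` is an assumption, and the `echo` prefixes print the curl invocations rather than executing them):

```shell
OS_HOST="${OS_HOST:-http://localhost:9200}"

# Step 1: disable replica allocation so shards are not shuffled
# while the node is down ("primaries" keeps primaries assignable).
DISABLE_BODY='{"persistent":{"cluster.routing.allocation.enable":"primaries"}}'
echo curl -s -X PUT "$OS_HOST/_cluster/settings" \
  -H 'Content-Type: application/json' -d "$DISABLE_BODY"

# Step 2: restart the OpenSearch node (out of band).

# Step 3: re-enable allocation once the node has rejoined
# (null resets the setting to its default, "all").
ENABLE_BODY='{"persistent":{"cluster.routing.allocation.enable":null}}'
echo curl -s -X PUT "$OS_HOST/_cluster/settings" \
  -H 'Content-Type: application/json' -d "$ENABLE_BODY"
```

After step 3 the replicas begin recovering, which is where the stale-checkpoint rejection occurs.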
Expected behavior
The shard should successfully reassign to the new node, completing the recovery process, and the cluster should transition back to green.
Additional Details
Environment
- OpenSearch Version: 3.2.0
- JVM Version: OpenJDK Runtime Environment Temurin-24.0.2+12 (build 24.0.2+12)
- OS: Ubuntu 22.04