[BUG] Shard fails to re-assign after a rolling restart #19234

@shamil

Description

Describe the bug

During a rolling restart of our OpenSearch cluster, some replica shards fail to re-assign to available nodes. The logs indicate that the destination node rejects the data because a "stale metadata checkpoint" is received from the primary shard. This suggests that the primary's state is changing during the recovery process, leading to a replication failure.

The shard fails to be assigned after 5 retries, and the cluster gives up. The log message explicitly states: "shard has exceeded the maximum number of retries [5] on failed allocation attempts". The root cause is a ReplicationFailedException due to a stale checkpoint.

shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-03T23:50:03.463Z], failed_attempts[5], failed_nodes[[Qm1RnXJQQYqSrlqcBq-X6Q]], delayed=false, details[failed shard on node [Qm1RnXJQQYqSrlqcBq-X6Q]: failed recovery, failure RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true} ([logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-g4nt][10.202.0.19:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-lntz][10.202.0.17:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=171, version=13559, size=32294083233, codec=ZSTD912, timestamp=0}] since initial checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=1066, version=14449, size=32294083233, codec=ZSTD101, timestamp=1756943403278888666}] is ahead of it]; ], allocation_status[no_attempt]]]
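As a workaround, the failed allocations can be retried manually, as the log message itself suggests. A minimal sketch, assuming the cluster API is reachable on localhost:9200 (adjust host/port and auth for your environment):

```shell
OPENSEARCH_URL="http://localhost:9200"

# Retry allocations that hit the 5-attempt limit
curl -s -X POST "${OPENSEARCH_URL}/_cluster/reroute?retry_failed=true" || true

# Then watch the affected shards, including why they are unassigned
curl -s "${OPENSEARCH_URL}/_cat/shards/logstash-2025.08.22?v&h=index,shard,prirep,state,unassigned.reason" || true
```

In our case the retried recoveries fail again with the same stale-checkpoint rejection, so this only resets the retry counter.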

This appears to be a bug where the primary and replica shards get out of sync during the recovery process. The primary's state changes while it's trying to send an old copy of the data, which the new replica correctly rejects.
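The divergence between the primary's advertised checkpoint and the replica's state can be observed while the recovery is failing. A sketch of the inspection we used, assuming localhost:9200; the `_cat/segment_replication` endpoint is available in recent OpenSearch versions with segment replication enabled:

```shell
OPENSEARCH_URL="http://localhost:9200"

# Per-replica segment replication state and checkpoint lag
curl -s "${OPENSEARCH_URL}/_cat/segment_replication?v" || true

# Segment-level stats for the affected index, to compare generations
curl -s "${OPENSEARCH_URL}/logstash-2025.08.22/_stats/segments?pretty" || true
```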

Related component

Other

To Reproduce

  1. Disable shard allocation.
  2. Restart an OpenSearch node.
  3. Enable shard allocation.
  4. Observe that the cluster never returns to green: the shards remain unassigned, blocking the remaining steps of the rolling restart.
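The steps above can be sketched with the cluster settings API. This is a minimal reproduction outline, assuming localhost:9200 and a systemd-managed node (both assumptions; adapt to your setup):

```shell
OPENSEARCH_URL="http://localhost:9200"

# 1. Disable replica shard allocation before stopping the node
curl -s -X PUT "${OPENSEARCH_URL}/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}' || true

# 2. Restart one OpenSearch node, e.g.:
#    sudo systemctl restart opensearch

# 3. Re-enable allocation once the node rejoins the cluster
curl -s -X PUT "${OPENSEARCH_URL}/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}' || true

# 4. With this bug, the replica recoveries fail and health stays yellow/red.
```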

Expected behavior

The shard should successfully re-assign to the new node, completing the recovery process, and the cluster should transition back to a green status.
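For reference, this is how we wait for the expected green state after re-enabling allocation (again assuming localhost:9200); with this bug the call times out instead of returning green:

```shell
OPENSEARCH_URL="http://localhost:9200"

# Block until the cluster reaches green, or time out after 60s
curl -s "${OPENSEARCH_URL}/_cluster/health?wait_for_status=green&timeout=60s&pretty" || true
```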

Additional Details

Environment

  • OpenSearch Version: 3.2.0
  • JVM Version: OpenJDK Runtime Environment Temurin-24.0.2+12 (build 24.0.2+12)
  • OS: Ubuntu 22.04

Metadata

Assignees: No one assigned

Labels: Cluster Manager; Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep); bug (Something isn't working)

Status: ✅ Done

Milestone: No milestone