[BUG] Remote cluster state compatibility failures #20910

@andrross

Describe the bug

The backward-compatibility (BWC) test for remote cluster state that was added in #20221 is failing intermittently:

https://build.ci.opensearch.org/job/gradle-check/72744/consoleText
https://build.ci.opensearch.org/job/gradle-check/72748/consoleText

  1. Build failure (top-level):

Task :qa:rolling-upgrade:v2.19.6-remote#twoThirdsUpgradedTest FAILED

Execution failed for task ':qa:rolling-upgrade:v2.19.6-remote#twoThirdsUpgradedTest'.

process was found dead while waiting for cluster health yellow, cluster{:qa:rolling-upgrade:v2.19.6-remote}

  2. IndexMetadata XContent deserialization failure (old node reading index metadata blobs written by upgraded cluster-manager):
[2026-03-18T11:33:45,850][ERROR][o.o.g.r.RemoteClusterStateService] [v2.19.6-remote-2] Failed to read cluster state from remote
org.opensearch.gateway.remote.RemoteStateTransferException: Download failed for java_for_range
        at org.opensearch.gateway.remote.RemoteIndexMetadataManager.lambda$getWrappedReadListener$3(RemoteIndexMetadataManager.java:159)
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:87)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955)
        ...
Caused by: java.lang.IllegalStateException: Can't get text on a START_ARRAY at -1:702
        at org.opensearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:99)
        at org.opensearch.core.xcontent.AbstractXContentParser.map(AbstractXContentParser.java:298)
        at org.opensearch.core.xcontent.AbstractXContentParser.mapStrings(AbstractXContentParser.java:282)
        at org.opensearch.cluster.metadata.IndexMetadata$Builder.fromXContent(IndexMetadata.java:2013)
        at org.opensearch.cluster.metadata.IndexMetadata.fromXContent(IndexMetadata.java:1080)
        at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.deserialize(ChecksumBlobStoreFormat.java:144)
        at org.opensearch.gateway.remote.model.RemoteIndexMetadata.deserialize(RemoteIndexMetadata.java:136)
        at org.opensearch.gateway.remote.model.RemoteIndexMetadata.deserialize(RemoteIndexMetadata.java:35)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.read(RemoteWriteableEntityBlobStore.java:77)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:85)

This repeats for every index in the cluster (test_index, test_recovery, index_with_replicas, test_index_old, geo_shape_index_old, test-index-segrep, etc.).

  3. DiscoveryNodes binary deserialization failure (old node reading discovery nodes blob written by upgraded cluster-manager):
[2026-03-18T11:33:45,859][ERROR][o.o.g.r.RemoteClusterStateService] [v2.19.6-remote-2] Failed to read cluster state from remote
org.opensearch.gateway.remote.RemoteStateTransferException: Download failed for nodes
        at org.opensearch.gateway.remote.RemoteClusterStateAttributesManager.lambda$getWrappedReadListener$3(RemoteClusterStateAttributesManager.java:103)
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:87)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955)
        ...
Caused by: java.lang.IllegalStateException: unexpected byte [0x08]
        at org.opensearch.core.common.io.stream.StreamInput.readBoolean(StreamInput.java:596)
        at org.opensearch.core.common.io.stream.StreamInput.readBoolean(StreamInput.java:586)
        at org.opensearch.cluster.node.DiscoveryNode.<init>(DiscoveryNode.java:344)
        at org.opensearch.cluster.node.DiscoveryNodes.readFrom(DiscoveryNodes.java:777)
        at org.opensearch.gateway.remote.model.RemoteDiscoveryNodes.lambda$static$0(RemoteDiscoveryNodes.java:37)
        at org.opensearch.repositories.blobstore.ChecksumWritableBlobStoreFormat.deserialize(ChecksumWritableBlobStoreFormat.java:105)
        at org.opensearch.gateway.remote.model.RemoteDiscoveryNodes.deserialize(RemoteDiscoveryNodes.java:101)
        at org.opensearch.gateway.remote.model.RemoteDiscoveryNodes.deserialize(RemoteDiscoveryNodes.java:32)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.read(RemoteWriteableEntityBlobStore.java:77)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:85)
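
Both deserialization failures fit the same pattern: a reader built for an older layout meets data written in a newer one. The Java sketch below is a self-contained illustration of that pattern. It is not OpenSearch code and not a confirmed root cause. `OldReaderSketch`, its methods, the key `hypothetical_key`, and the extra field are all invented for the example, and the 8-byte extra value is chosen only so that the strict boolean read hits a `0x08` byte like the one in the log.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only; none of this is OpenSearch code.
final class OldReaderSketch {

    // Text case: an "old" reader that accepts only string values in a map.
    static Map<String, String> readStringMap(Map<String, Object> parsed) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : parsed.entrySet()) {
            if (entry.getValue() instanceof List) {
                // Same kind of error as in the log; the real parser reports a position, not a key.
                throw new IllegalStateException("Can't get text on a START_ARRAY (key " + entry.getKey() + ")");
            }
            result.put(entry.getKey(), String.valueOf(entry.getValue()));
        }
        return result;
    }

    // Binary case: strings are written as one length byte followed by UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.write(utf8.length); // single length byte; enough for short demo strings
        out.write(utf8, 0, utf8.length);
    }

    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[in.get()];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Strict boolean read: only 0 and 1 are valid, anything else is rejected.
    static boolean readStrictBoolean(ByteBuffer in) {
        byte value = in.get();
        if (value == 0) return false;
        if (value == 1) return true;
        throw new IllegalStateException(String.format("unexpected byte [0x%02x]", value));
    }

    // Writer: passing a non-null extra simulates a newer layout with one added field.
    static byte[] writeNode(String name, String extraOrNull, boolean flag) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        if (extraOrNull != null) {
            writeString(out, extraOrNull); // hypothetical field unknown to the old reader
        }
        out.write(flag ? 1 : 0);
        return out.toByteArray();
    }

    // "Old" reader: expects name followed directly by the boolean.
    static String readNodeOldLayout(byte[] blob) {
        ByteBuffer in = ByteBuffer.wrap(blob);
        String name = readString(in);
        boolean flag = readStrictBoolean(in);
        return name + ":" + flag;
    }

    public static void main(String[] args) {
        Map<String, Object> newer = new LinkedHashMap<>();
        newer.put("hypothetical_key", List.of("a", "b"));
        try {
            readStringMap(newer);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Can't get text on a START_ARRAY (key hypothetical_key)
        }

        System.out.println(readNodeOldLayout(writeNode("node-1", null, true))); // node-1:true
        try {
            readNodeOldLayout(writeNode("node-1", "abcdefgh", true));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // unexpected byte [0x08]
        }
    }
}
```

In the binary case the old reader takes the length byte of the unknown field for the boolean it expects next, so it fails at once instead of misreading silently. Whether the real blobs differ in this way is an assumption to check against the actual serialization code.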

Related component

Cluster Manager

To Reproduce

Not deterministic. I think it requires a mixed-version cluster in which a new-version node is elected cluster manager and then publishes a new cluster state that the old-version nodes have to read from the remote store.

Expected behavior

The tests should pass every time.

Status: ✅ Done
