[BUG] Cluster crashes after upgrade of one node (rolling upgrade) 2.4.0->3.0.0 #5065

@martin-gaievski

Describe the bug
The cluster returns the following errors:

cluster failed to wait for cluster health yellow after 40 SECONDS
IO error while waiting cluster
503 Service Unavailable

after one of three nodes was upgraded from 2.4.0 to 3.0.0. It's a test cluster; we use the core build-tools to construct it and upgrade the node.

To Reproduce
We're seeing this as part of our BWC test workflow in CI (example). It's also reproducible in a local dev environment.

Steps to reproduce the behavior locally:

  1. Clone k-NN main branch from https://github.com/opensearch-project/k-NN
  2. Execute the command below from the dev guide - https://github.com/opensearch-project/k-NN/blob/main/DEVELOPER_GUIDE.md#backwards-compatibility-testing:
./gradlew :qa:bwcTestSuite -Dbwc.version=2.4.0

or the narrowed version, for the rolling upgrade only:

./gradlew :qa:rolling-upgrade:testRollingUpgrade -Dbwc.version=2.4.0

I've constructed a minimal setup that runs just one test, and that test doesn't call any k-NN-specific API. I've captured log files from each cluster node individually; please check the attached zip. In that test we started a 3-node cluster, then the node with index 0 was upgraded to 3.0.0 while nodes 1 and 2 were still running 2.4.0.

knn-minimal-test.zip

From my understanding, what essentially happens is that nodes on different versions do not discover each other.
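
One way to check this would be to ask each node which cluster members it currently sees. The following is only a minimal sketch: the `localhost` ports are hypothetical (the build tooling assigns the real ones), and the helper names are made up for illustration. It relies on the `_cat/nodes` API with the `name` and `version` columns.

```python
import json
import urllib.request

# ASSUMPTION: hypothetical HTTP ports for the three test nodes; adjust to your setup.
NODE_PORTS = [9200, 9201, 9202]


def summarize_nodes(cat_nodes_json):
    """Map node name -> version from a `_cat/nodes?format=json` response body."""
    return {row["name"]: row["version"] for row in json.loads(cat_nodes_json)}


def visible_nodes(port, timeout=5.0):
    """Ask the node listening on `port` which cluster members it currently sees."""
    url = f"http://localhost:{port}/_cat/nodes?h=name,version&format=json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_nodes(resp.read().decode("utf-8"))


def main():
    """Print each node's view of the cluster, one line per queried port."""
    for port in NODE_PORTS:
        try:
            print(port, visible_nodes(port))
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print(port, "request failed:", exc)
```

Calling `main()` while the mixed-version cluster is up prints one view per node. If the 2.4.0 nodes list only each other and the 3.0.0 node's request fails, that would be consistent with the discovery problem described above.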

I do not see any errors on the old 2.4.0 nodes, but on the upgraded 3.0.0 node there is this error after the node attempted to start:

[2022-11-03T21:12:23,365][WARN ][r.suppressed             ] [knnBwcCluster-rolling-0] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=>=3}
org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:305) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
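
The health request from the log line above can also be issued by hand against the upgraded node. This is a sketch under assumptions: the base URL `http://localhost:9200` is hypothetical, the 40-second wait is taken from the test failure message rather than from the tooling's actual configuration, and the function names are illustrative.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_health_url(base_url, status="yellow", min_nodes=3, timeout_s=40):
    """Build the /_cluster/health URL with the same params as in the log line."""
    params = {
        "wait_for_status": status,
        "wait_for_nodes": f">={min_nodes}",  # ">=" is percent-encoded by urlencode
        "timeout": f"{timeout_s}s",
    }
    return f"{base_url}/_cluster/health?{urllib.parse.urlencode(params)}"


def check_health(base_url, socket_timeout=45.0):
    """Return (HTTP status code, response body) for the health request."""
    try:
        with urllib.request.urlopen(build_health_url(base_url), timeout=socket_timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Non-2xx responses (for example 503) still carry a JSON body worth reading.
        return err.code, err.read().decode("utf-8")
```

For example, `check_health("http://localhost:9200")` returns the status code and body, which can be compared with the `503 Service Unavailable` reported by the test run.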

Expected behavior
The cluster shouldn't crash.

Plugins
It's the min distribution; we're testing the k-NN plugin.
