Description
Describe the bug
The cluster returns:
clusterfailed to wait for cluster health yellow after 40 SECONDS
IO error while waiting cluster
503 Service Unavailable
after one of the three nodes was upgraded from 2.4.0 to 3.0.0. It's a test cluster; we use the core build-tools to construct it and to upgrade the node.
To Reproduce
We're seeing this in our BWC test workflow in CI (example). It's also reproducible in a local dev environment.
Steps to reproduce the behavior locally:
- Clone k-NN main branch from https://github.com/opensearch-project/k-NN
- Execute the command below from the developer guide - https://github.com/opensearch-project/k-NN/blob/main/DEVELOPER_GUIDE.md#backwards-compatibility-testing:
./gradlew :qa:bwcTestSuite -Dbwc.version=2.4.0
or the narrower version that runs only the rolling upgrade:
./gradlew :qa:rolling-upgrade:testRollingUpgrade -Dbwc.version=2.4.0
I've constructed a minimal setup that runs just one test, and that test doesn't call any k-NN-specific API. I've captured log files from each cluster node individually; please check the attached zip. In that test we started a 3-node cluster, then the node with index 0 was upgraded to 3.0.0 while nodes 1 and 2 were still running 2.4.0.
From my understanding, what essentially happens is that nodes on different versions do not discover each other.
I do not see any errors on the old 2.4.0 nodes, but the upgraded 3.0.0 node logs this error after it attempts to start:
[2022-11-03T21:12:23,365][WARN ][r.suppressed ] [knnBwcCluster-rolling-0] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=>=3}
org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:305) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
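For anyone reproducing this by hand, the health check from the warning above can be approximated with a small Python sketch. It builds the same `/_cluster/health` request (`wait_for_status=yellow`, `wait_for_nodes=>=3`) and decides whether the cluster has formed. The base URL `http://localhost:9200`, the `40s` timeout value, and the helper names `health_url`, `cluster_formed`, and `check` are assumptions for illustration; the test cluster created by build-tools may listen on a different port.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def health_url(base_url, status="yellow", min_nodes=3, timeout="40s"):
    """Build the cluster health URL with the same wait parameters as in the log."""
    query = urllib.parse.urlencode({
        "wait_for_status": status,
        "wait_for_nodes": f">={min_nodes}",
        "timeout": timeout,  # assumed to mirror the 40-second wait in the error
    })
    return f"{base_url}/_cluster/health?{query}"


def cluster_formed(health, min_nodes=3):
    """Return True if a health response shows enough nodes and at least yellow status."""
    return (
        not health.get("timed_out", True)
        and health.get("number_of_nodes", 0) >= min_nodes
        and health.get("status") in ("yellow", "green")
    )


def check(base_url="http://localhost:9200"):
    """Query one node; any HTTP error status (such as the 503 above) counts as not formed."""
    try:
        with urllib.request.urlopen(health_url(base_url)) as resp:
            return cluster_formed(json.load(resp))
    except urllib.error.HTTPError:
        return False
```

Calling `check()` against each node's HTTP address in turn should show whether only the upgraded node fails to see a cluster manager, or whether the old nodes are affected as well.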
Expected behavior
The cluster shouldn't crash.
Plugins
It's the min distribution; we're testing the k-NN plugin.