
[BUG] Fix remote shards balancer when filtering throttled nodes #11724

Merged
kotwanikunal merged 4 commits into opensearch-project:main from bugmakerrrrrr:fix_remote_shards_balancer
Jan 25, 2024

Conversation

@bugmakerrrrrr
Contributor

Description

Today, RemoteShardsBalancer uses AllocationDecider#canAllocateAnyShardToNode to filter out throttled or ineligible nodes when allocating unassigned shards. If every eligible node for an unassigned shard has already been filtered out before allocation of that shard is attempted, the shard is marked as ignored with the UnassignedInfo.AllocationStatus.DECIDERS_NO status, even when some of those nodes were only throttled. As a result, the corresponding ShardRestoreStatus is set to Failure (see RestoreService.RestoreInProgressUpdater#unassignedInfoUpdated). This pull request takes throttled nodes into account and ensures that such shards are marked with the appropriate status.
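To make the failure mode concrete, here is a minimal, hypothetical Java model of the node-filtering step. None of the names below (`ThrottleAwareFilterSketch`, `NodeDecision`, `Outcome`, `beforeFix`, `afterFix`, `restoreMarkedFailed`) are OpenSearch classes or methods; they only mirror the logic described above. Two further assumptions: decisions are modeled per shard here (the real balancer filters nodes before per-shard allocation), and the fixed code is assumed to report `DECIDERS_THROTTLED` when the only remaining nodes were throttled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model only; not the actual RemoteShardsBalancer code.
public class ThrottleAwareFilterSketch {

    // Stand-in for the per-node decision types (YES / THROTTLE / NO).
    enum NodeDecision { YES, THROTTLE, NO }

    // Stand-in for the allocation result; the two ignore outcomes are named
    // after the corresponding UnassignedInfo.AllocationStatus values.
    enum Outcome { ALLOCATE, DECIDERS_THROTTLED, DECIDERS_NO }

    // Pre-fix shape: THROTTLE and NO are both just "filtered out", so an
    // empty candidate list is always reported as DECIDERS_NO.
    static Outcome beforeFix(Map<String, NodeDecision> decisions) {
        for (NodeDecision d : decisions.values()) {
            if (d == NodeDecision.YES) {
                return Outcome.ALLOCATE;
            }
        }
        return Outcome.DECIDERS_NO;
    }

    // Post-fix shape (assumed): remember whether any node was only throttled.
    static Outcome afterFix(Map<String, NodeDecision> decisions) {
        boolean anyThrottled = false;
        for (NodeDecision d : decisions.values()) {
            if (d == NodeDecision.YES) {
                return Outcome.ALLOCATE;
            }
            if (d == NodeDecision.THROTTLE) {
                anyThrottled = true;
            }
        }
        return anyThrottled ? Outcome.DECIDERS_THROTTLED : Outcome.DECIDERS_NO;
    }

    // Per the PR description, DECIDERS_NO causes the restore status to be set
    // to Failure; this sketch treats every other outcome as non-failing.
    static boolean restoreMarkedFailed(Outcome outcome) {
        return outcome == Outcome.DECIDERS_NO;
    }

    public static void main(String[] args) {
        Map<String, NodeDecision> decisions = new LinkedHashMap<>();
        decisions.put("search-node-1", NodeDecision.THROTTLE);
        decisions.put("search-node-2", NodeDecision.NO);

        Outcome before = beforeFix(decisions);
        Outcome after = afterFix(decisions);
        System.out.println("before: " + before + ", restore failed: " + restoreMarkedFailed(before));
        System.out.println("after:  " + after + ", restore failed: " + restoreMarkedFailed(after));
    }
}
```

With one throttled node and one ineligible node, the pre-fix shape reports `DECIDERS_NO` and the restore is marked failed, while the post-fix shape reports `DECIDERS_THROTTLED` and the restore is not failed. Keeping THROTTLE distinct from NO is what lets a shard that is merely waiting for capacity avoid being treated as permanently unallocatable.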

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Jan 3, 2024

Compatibility status:

Checks if related components are compatible with change c8d000d

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git]

@github-actions
Contributor

github-actions bot commented Jan 3, 2024

❕ Gradle check result for 4c5408b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.classMethod
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@bugmakerrrrrr
Contributor Author

@kotwanikunal could you take a look at this additional one?

Signed-off-by: panguixin <panguixin@bytedance.com>
@bugmakerrrrrr bugmakerrrrrr force-pushed the fix_remote_shards_balancer branch from 7ab39d9 to ec30c58 Compare January 23, 2024 13:11
@bugmakerrrrrr
Contributor Author

@kotwanikunal Can we merge this?

@github-actions
Contributor

✅ Gradle check result for ec30c58: SUCCESS

@kotwanikunal
Member

@andrross / @linuxpi - Would you mind giving this one another review?

andrross
andrross previously approved these changes Jan 24, 2024
@andrross andrross self-requested a review January 24, 2024 18:05
@andrross andrross dismissed their stale review January 24, 2024 18:06

Question about tests

@andrross
Member

@bugmakerrrrrr Can we add a unit test for the bug being fixed here?

@bugmakerrrrrr
Contributor Author

@bugmakerrrrrr Can we add a unit test for the bug being fixed here?

@andrross I think that RemoteShardsAllocateUnassignedTests#testNoRemoteAllocation can cover this bug fix after modifying RemoteShardsBalancerBaseTestCase.

@github-actions
Contributor

❕ Gradle check result for c7ef861: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross andrross added the backport 2.x Backport to 2.x branch label Jan 25, 2024
Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions
Contributor

❌ Gradle check result for c8d000d:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for c8d000d:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❕ Gradle check result for c8d000d: UNSTABLE

  • TEST FAILURES:
      3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction
      1 org.opensearch.search.SearchWeightedRoutingIT.testShardRoutingWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@kotwanikunal kotwanikunal merged commit 9f649e0 into opensearch-project:main Jan 25, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 25, 2024
* fix remote shards balancer

Signed-off-by: panguixin <panguixin@bytedance.com>

* add change log

Signed-off-by: panguixin <panguixin@bytedance.com>

---------

Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
(cherry picked from commit 9f649e0)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Jan 26, 2024
…) (#12024)

* fix remote shards balancer

* add change log

---------

(cherry picked from commit 9f649e0)

Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Mar 1, 2024
…search-project#11724)

* fix remote shards balancer

Signed-off-by: panguixin <panguixin@bytedance.com>

* add change log

Signed-off-by: panguixin <panguixin@bytedance.com>

---------

Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
…search-project#11724)

* fix remote shards balancer

Signed-off-by: panguixin <panguixin@bytedance.com>

* add change log

Signed-off-by: panguixin <panguixin@bytedance.com>

---------

Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…search-project#11724)

* fix remote shards balancer

Signed-off-by: panguixin <panguixin@bytedance.com>

* add change log

Signed-off-by: panguixin <panguixin@bytedance.com>

---------

Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>

Labels

backport 2.x Backport to 2.x branch

Projects

None yet

Development


4 participants