Adding warn logging for timed out remote state read tasks #18019

Open
gargharsh3134 wants to merge 2 commits into opensearch-project:main from gargharsh3134:remoteLogging
@gargharsh3134
Contributor

@gargharsh3134 gargharsh3134 commented Apr 22, 2025

Description

This change improves logging for remote state read tasks that can time out and lead to node drops. Without such logging, it is hard to identify which task or blob entity causes the timeout while reading the diffed cluster state from remote.
It also introduces a new dynamic cluster setting that defines the threshold above which the details of a slow task are logged.


[2025-02-17T11:31:01,012][ERROR][o.o.g.r.RemoteClusterStateService] [b2c6c40ba58ebb56b0020b8a15c196a4] Failure in downloading diff cluster state.
org.opensearch.gateway.remote.RemoteStateTransferException: Timed out waiting to read cluster state from remote within timeout 40s
        at org.opensearch.gateway.remote.RemoteClusterStateService.readClusterStateInParallel(RemoteClusterStateService.java:1397)
        at org.opensearch.gateway.remote.RemoteClusterStateService.getClusterStateUsingDiff(RemoteClusterStateService.java:1623)
        at org.opensearch.cluster.coordination.PublicationTransportHandler.handleIncomingRemotePublishRequest(PublicationTransportHandler.java:289)
        at org.opensearch.cluster.coordination.PublicationTransportHandler.lambda$new$1(PublicationTransportHandler.java:136)
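
The idea in the description can be sketched in plain Java. This is a minimal illustration, not the code in this PR: the class `SlowReadLoggingSketch`, the field `slowTaskThresholdMillis` (a stand-in for the dynamic cluster setting), and both methods are hypothetical names, and the real change may log at a different point. It runs named read tasks in parallel, waits with a timeout, and reports which tasks were still pending, which is the information missing from the stack trace above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from the OpenSearch code base.
class SlowReadLoggingSketch {
    // Stand-in for a dynamic setting: volatile so a runtime update is visible to all threads.
    static volatile long slowTaskThresholdMillis = 100;

    /** Returns true when the elapsed time has reached the configured threshold. */
    static boolean isSlow(long elapsedMillis) {
        return elapsedMillis >= slowTaskThresholdMillis;
    }

    /**
     * Runs the named tasks in parallel, waits up to timeoutMillis, and returns
     * the names of the tasks that had not finished when the wait ended.
     */
    static List<String> runAndReportUnfinished(Map<String, Runnable> tasks, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        CountDownLatch latch = new CountDownLatch(tasks.size());
        Set<String> finished = ConcurrentHashMap.newKeySet();
        long startNanos = System.nanoTime();
        try {
            for (Map.Entry<String, Runnable> task : tasks.entrySet()) {
                pool.execute(() -> {
                    try {
                        task.getValue().run();
                        finished.add(task.getKey());
                    } finally {
                        latch.countDown();
                    }
                });
            }
            try {
                latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Snapshot the pending tasks before the pool is shut down.
            long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            List<String> unfinished = new ArrayList<>();
            for (String name : tasks.keySet()) {
                if (!finished.contains(name)) {
                    unfinished.add(name);
                    if (isSlow(elapsedMillis)) {
                        System.err.println("WARN read task [" + name + "] still pending after " + elapsedMillis + "ms");
                    }
                }
            }
            return unfinished;
        } finally {
            pool.shutdownNow(); // interrupts tasks that are still running
        }
    }

    public static void main(String[] args) {
        Map<String, Runnable> tasks = new LinkedHashMap<>();
        tasks.put("fast-blob", () -> {});
        tasks.put("slow-blob", () -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Unfinished: " + runAndReportUnfinished(tasks, 200));
    }
}
```

Keeping the threshold separate from the read timeout lets an operator lower it at runtime to surface slow reads before they reach the timeout.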

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❌ Gradle check result for 643f969: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 5b024f7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 8d22e1d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Jul 8, 2025

❌ Gradle check result for 683e211: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@gargharsh3134 gargharsh3134 marked this pull request as ready for review July 30, 2025 12:21
@gargharsh3134 gargharsh3134 requested a review from a team as a code owner July 30, 2025 12:21
@github-actions
Contributor

❌ Gradle check result for c0e71c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 68287d8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 1870449: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

@rajiv-kv rajiv-kv left a comment

Changes LGTM! Please fix the flaky tests.

);
assertEquals("Timed out waiting to read cluster state from remote within timeout " + readTimeOut + "s", exception.getMessage());
// All lists and maps are passed as empty, while other boolean variables are set to true.
// So, for readClusterStateInParallel() total read tasks would be 7

Can we add the reason why there are 7 files to be read?

listener.onResponse(read(entity));
RemoteReadResult<T> result = readWithMetrics(entity, component, componentName);
listener.onResponse(result);
final long executionTimeMS = Math.max(0, TimeValue.nsecToMSec(System.nanoTime() - queueStartTimeNS));

  • Should we log before calling listener#onResponse, to be consistent with the exception block?
  • Can we dedupe the logging code into a private method?
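
Both suggestions can be illustrated with a small standalone sketch. It is an assumption about the shape of the fix, not the code in this PR: `ReadTaskLoggingSketch`, `logIfSlow`, and `readAndNotify` are hypothetical names, and the `events` list stands in for a logger so the ordering is observable. One private helper handles logging for both outcomes, and it is always called before the listener is notified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the OpenSearch code base.
class ReadTaskLoggingSketch {
    // Stand-in for a logger: records events in the order they happen.
    static final List<String> events = new ArrayList<>();

    /** Single logging path shared by the success and failure branches. */
    private static void logIfSlow(String name, long startNanos, long thresholdMillis, Exception failure) {
        long elapsedMillis = Math.max(0, (System.nanoTime() - startNanos) / 1_000_000);
        if (elapsedMillis >= thresholdMillis) {
            events.add("log:" + name + (failure == null ? ":ok" : ":failed"));
        }
    }

    /** Performs the read, logs through the shared helper, and only then notifies the listener. */
    static <T> void readAndNotify(
        String name,
        Supplier<T> read,
        long thresholdMillis,
        Consumer<T> onResponse,
        Consumer<Exception> onFailure
    ) {
        long startNanos = System.nanoTime();
        T result;
        try {
            result = read.get();
        } catch (RuntimeException e) {
            logIfSlow(name, startNanos, thresholdMillis, e); // log first, same as the success path
            onFailure.accept(e);
            return;
        }
        logIfSlow(name, startNanos, thresholdMillis, null);
        // The callback runs outside the try block, so an exception thrown by
        // onResponse is not mistaken for a read failure.
        onResponse.accept(result);
    }

    public static void main(String[] args) {
        ReadTaskLoggingSketch.<String>readAndNotify(
            "blob-a",
            () -> "state",
            0, // threshold of 0 ms so the sketch always logs
            r -> events.add("response:" + r),
            e -> events.add("failure:" + e.getMessage())
        );
        System.out.println(events);
    }
}
```

Logging before the callback means the log line is written even if the listener blocks or throws.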

Harsh Garg added 2 commits August 1, 2025 14:59
Signed-off-by: Harsh Garg <gkharsh@amazon.com>
Signed-off-by: Harsh Garg <gkharsh@amazon.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

✅ Gradle check result for 4e3fac1: SUCCESS

@codecov

codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 56.12245% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.88%. Comparing base (9b22c9b) to head (4e3fac1).
⚠️ Report is 740 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...arch/gateway/remote/RemoteClusterStateService.java | 38.09% | 21 Missing and 5 partials ⚠️ |
| .../common/remote/RemoteWriteableEntityBlobStore.java | 63.41% | 12 Missing and 3 partials ⚠️ |
| ...nsearch/gateway/remote/model/RemoteReadResult.java | 60.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18019      +/-   ##
============================================
+ Coverage     72.77%   72.88%   +0.10%     
- Complexity    68690    68826     +136     
============================================
  Files          5582     5587       +5     
  Lines        315456   315797     +341     
  Branches      45778    45825      +47     
============================================
+ Hits         229568   230154     +586     
+ Misses        67290    66989     -301     
- Partials      18598    18654      +56     

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added and removed the stalled (Issues that have stalled) label Aug 31, 2025
3 participants