Adding warn logging for timed out remote state read tasks#18019
Adding warn logging for timed out remote state read tasks#18019gargharsh3134 wants to merge 2 commits intoopensearch-project:mainfrom
Conversation
|
❌ Gradle check result for 643f969: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
643f969 to
5b024f7
Compare
|
❌ Gradle check result for 5b024f7: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/common/remote/RemoteWriteableEntityBlobStore.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/common/remote/RemoteWriteableEntityBlobStore.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/common/remote/RemoteWriteableEntityBlobStore.java
Show resolved
Hide resolved
server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateServiceTests.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/common/remote/RemoteWriteableEntityBlobStore.java
Show resolved
Hide resolved
|
❌ Gradle check result for 8d22e1d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/common/remote/ReadBlobWithMetrics.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/opensearch/cluster/routing/remote/InternalRemoteRoutingTableService.java
Outdated
Show resolved
Hide resolved
|
❌ Gradle check result for 683e211: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
...r/src/main/java/org/opensearch/cluster/routing/remote/InternalRemoteRoutingTableService.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/gateway/remote/model/RemoteReadResultsVerbose.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/gateway/remote/model/RemoteReadResultsVerbose.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/common/remote/RemoteWriteableEntityBlobStore.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
Outdated
Show resolved
Hide resolved
683e211 to
c0e71c6
Compare
|
❌ Gradle check result for c0e71c6: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 68287d8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
68287d8 to
1870449
Compare
|
❌ Gradle check result for 1870449: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
rajiv-kv
left a comment
There was a problem hiding this comment.
Changes LGTM ! Please fix the flaky tests.
| ); | ||
| assertEquals("Timed out waiting to read cluster state from remote within timeout " + readTimeOut + "s", exception.getMessage()); | ||
| // All lists and maps are passed as empty, while other boolean variables are set to true. | ||
| // So, for readClusterStateInParallel() total read tasks would be 7 |
There was a problem hiding this comment.
Can we add reason as to why there are 7 files to be read ?
| listener.onResponse(read(entity)); | ||
| RemoteReadResult<T> result = readWithMetrics(entity, component, componentName); | ||
| listener.onResponse(result); | ||
| final long executionTimeMS = Math.max(0, TimeValue.nsecToMSec(System.nanoTime() - queueStartTimeNS)); |
There was a problem hiding this comment.
- Should we log before calling the listener#onResponse to be consistent with exception block ?
- Can we dedupe the logging code to a private method ?
Signed-off-by: Harsh Garg <gkharsh@amazon.com>
Signed-off-by: Harsh Garg <gkharsh@amazon.com>
1870449 to
4e3fac1
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #18019 +/- ##
============================================
+ Coverage 72.77% 72.88% +0.10%
- Complexity 68690 68826 +136
============================================
Files 5582 5587 +5
Lines 315456 315797 +341
Branches 45778 45825 +47
============================================
+ Hits 229568 230154 +586
+ Misses 67290 66989 -301
- Partials 18598 18654 +56 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This PR is stalled because it has been open for 30 days with no activity. |
Description
This change focuses on improving logging for remoteState read tasks which might get timed out and lead to node drops. Absence of logging hinders the ability to identify the task/blob entity which is leading to timeout issues while reading diffed cluster state from remote.
This change introduces a new dynamic cluster setting to identify the threshold for logging the details of a slow task.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.