Skip to content

Add metrics for remote cluster state metadata transfer timeouts#20809

Open
zheliu2 wants to merge 2 commits into
opensearch-project:mainfrom
zheliu2:feat/remote-state-timeout-metrics
Open

Add metrics for remote cluster state metadata transfer timeouts#20809
zheliu2 wants to merge 2 commits into
opensearch-project:mainfrom
zheliu2:feat/remote-state-timeout-metrics

Conversation

@zheliu2
Copy link
Copy Markdown
Contributor

@zheliu2 zheliu2 commented Mar 9, 2026

Summary

  • Adds index_metadata_upload_timeout_count and metadata_upload_timeout_count counters to RemoteUploadStats to emit metrics when metadata uploads to the remote store time out
  • When the latch timeout fires in writeMetadataInParallel, the code now inspects which upload tasks did not complete and increments the appropriate counter (index metadata vs global metadata)
  • Resolves [Remote cluster state] Emit metrics for timeout for metadata transfer to remote store #10687

Test plan

  • Existing testTimeoutWhileWritingMetadata test updated with assertions verifying the new metric counters are incremented correctly on timeout
  • Manual verification that metrics appear in node stats API output under remote_upload extended fields

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 9, 2026

PR Reviewer Guide 🔍

(Review updated until commit 8fc1f76)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add timeout metric counters to RemoteUploadStats and RemotePersistenceStats

Relevant files:

  • server/src/main/java/org/opensearch/gateway/remote/RemoteUploadStats.java
  • server/src/main/java/org/opensearch/gateway/remote/RemotePersistenceStats.java

Sub-PR theme: Integrate timeout metrics into writeMetadataInParallel with tests

Relevant files:

  • server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
  • server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateServiceTests.java

⚡ Recommended focus areas for review

Incorrect Classification

The logic classifying timed-out tasks as index metadata vs global metadata relies on matching uploadTask strings against index names via idx.getIndex().getName().equals(uploadTask). If upload task keys are not exactly equal to index names (e.g., they include prefixes, suffixes, or are structured differently), tasks could be misclassified, leading to incorrect metric counts.

for (String uploadTask : uploadTasks) {
    if (results.containsKey(uploadTask) == false) {
        if (indexToUpload.stream().anyMatch(idx -> idx.getIndex().getName().equals(uploadTask))) {
            timedOutIndexCount++;
        } else {
            timedOutGlobalCount++;
        }
    }
Counter Granularity

The counters indexMetadataUploadTimedOut and metadataUploadTimedOut are incremented at most once per timeout event (a boolean check on count > 0), regardless of how many individual tasks timed out. This means if 5 index metadata uploads time out simultaneously, the counter only increments by 1. Consider whether the intent is to count timeout events or individual timed-out tasks, and adjust accordingly.

if (timedOutIndexCount > 0) {
    remoteStateStats.indexMetadataUploadTimedOut();
}
if (timedOutGlobalCount > 0) {
    remoteStateStats.metadataUploadTimedOut();
}
Weak Test Coverage

The test only verifies the case where global metadata times out (count=1) and index metadata does not (count=0). There is no test case verifying that indexMetadataUploadTimeoutCount is incremented when index metadata uploads time out, leaving that code path untested.

assertEquals(0, remoteClusterStateService.getRemoteStateStats().getIndexMetadataUploadTimeoutCount());
assertEquals(1, remoteClusterStateService.getRemoteStateStats().getMetadataUploadTimeoutCount());

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 9, 2026

PR Code Suggestions ✨

Latest suggestions up to 8fc1f76

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Increment timeout counter per timed-out task

The current implementation only increments the timeout counter once regardless of
how many tasks timed out. If multiple index metadata uploads time out,
indexMetadataUploadTimedOut() is still called only once. Consider calling the
increment method once per timed-out task, or passing the count to the stats method,
to accurately reflect the number of timed-out uploads.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java [841-855]

 for (String uploadTask : uploadTasks) {
     if (results.containsKey(uploadTask) == false) {
         if (indexToUpload.stream().anyMatch(idx -> idx.getIndex().getName().equals(uploadTask))) {
             timedOutIndexCount++;
+            remoteStateStats.indexMetadataUploadTimedOut();
         } else {
             timedOutGlobalCount++;
+            remoteStateStats.metadataUploadTimedOut();
         }
     }
 }
-if (timedOutIndexCount > 0) {
-    remoteStateStats.indexMetadataUploadTimedOut();
-}
-if (timedOutGlobalCount > 0) {
-    remoteStateStats.metadataUploadTimedOut();
-}
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that the current implementation only increments the counter once even if multiple tasks time out. Moving the increment inside the loop would more accurately track the number of timed-out uploads. The improved_code is logically consistent with the suggestion and removes the need for the separate count variables.

Low
General
Improve task classification accuracy for timeout metrics

The classification of a timed-out task as "index metadata" vs "global metadata"
relies on matching the upload task name against index names. However, if uploadTasks
contains keys that are not index names (e.g., routing table tasks or other metadata
tasks), they would be incorrectly counted as global metadata timeouts. Ensure that
the task name matching logic correctly identifies index metadata upload tasks,
potentially by using a dedicated set of index metadata task keys rather than
matching against index names.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java [843]

-if (indexToUpload.stream().anyMatch(idx -> idx.getIndex().getName().equals(uploadTask))) {
+if (indexToUpload.stream().anyMatch(idx -> uploadTask.contains(idx.getIndex().getName()))) {
Suggestion importance[1-10]: 3

__

Why: The suggestion changes equals to contains for matching upload task names against index names, but this could introduce false positives if one index name is a substring of another. The improvement is marginal and the improved_code doesn't fully address the concern raised about non-index task keys being misclassified.

Low

Previous suggestions

Suggestions up to commit 8e5d76c
CategorySuggestion                                                                                                                                    Impact
General
Increment metrics per timed-out task, not per event

The current implementation only increments the timeout counter once regardless of
how many tasks timed out. If the intent is to track the number of individual
timed-out uploads, the stats methods should be called once per timed-out task
(inside the loop). If the intent is to count timeout events (one per latch timeout),
the current approach is correct but should be documented. Consider calling
indexMetadataUploadTimedOut() and metadataUploadTimedOut() inside the loop for each
timed-out task, or rename the counters to reflect they count timeout events rather
than individual timed-out uploads.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java [841-855]

 for (String uploadTask : uploadTasks) {
     if (results.containsKey(uploadTask) == false) {
         if (indexToUpload.stream().anyMatch(idx -> idx.getIndex().getName().equals(uploadTask))) {
             timedOutIndexCount++;
+            remoteStateStats.indexMetadataUploadTimedOut();
         } else {
             timedOutGlobalCount++;
+            remoteStateStats.metadataUploadTimedOut();
         }
     }
 }
-if (timedOutIndexCount > 0) {
-    remoteStateStats.indexMetadataUploadTimedOut();
-}
-if (timedOutGlobalCount > 0) {
-    remoteStateStats.metadataUploadTimedOut();
-}
Suggestion importance[1-10]: 5

__

Why: The suggestion raises a valid design question about whether counters should track individual timed-out tasks or timeout events. However, the test at line 860 asserts getMetadataUploadTimeoutCount() == 1, which aligns with the current "per event" approach. Changing to per-task counting would break the existing test, making this suggestion potentially incorrect in the current context.

Low
Avoid repeated linear scans in loop

Using stream().anyMatch() inside a loop over uploadTasks results in O(n*m)
complexity. Consider building a Set of index names from indexToUpload before the
loop to achieve O(1) lookup per task.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java [843]

-if (indexToUpload.stream().anyMatch(idx -> idx.getIndex().getName().equals(uploadTask))) {
+Set<String> indexNames = indexToUpload.stream()
+    .map(idx -> idx.getIndex().getName())
+    .collect(Collectors.toSet());
+for (String uploadTask : uploadTasks) {
+    if (results.containsKey(uploadTask) == false) {
+        if (indexNames.contains(uploadTask)) {
+            timedOutIndexCount++;
+        } else {
+            timedOutGlobalCount++;
+        }
+    }
+}
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies an O(n*m) complexity issue with stream().anyMatch() inside the loop and proposes building a Set<String> for O(1) lookups. This is a valid performance improvement, though the impact is minimal since this code only runs on timeout paths which are exceptional cases.

Low
Suggestions up to commit ce37bb9
CategorySuggestion                                                                                                                                    Impact
General
Increment counter per timed-out task

The current implementation only increments the timeout counter once regardless of
how many tasks timed out. If the intent is to track the number of individual
timed-out uploads, the counter should be incremented once per timed-out task. If the
intent is to track the number of timeout events (one per latch.await call), the
current approach is correct but should be documented. Consider incrementing the
counter for each timed-out task to provide more granular metrics.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java [841-855]

 for (String uploadTask : uploadTasks) {
     if (results.containsKey(uploadTask) == false) {
         if (indexToUpload.stream().anyMatch(idx -> idx.getIndex().getName().equals(uploadTask))) {
             timedOutIndexCount++;
+            remoteStateStats.indexMetadataUploadTimedOut();
         } else {
             timedOutGlobalCount++;
+            remoteStateStats.metadataUploadTimedOut();
         }
     }
 }
-if (timedOutIndexCount > 0) {
-    remoteStateStats.indexMetadataUploadTimedOut();
-}
-if (timedOutGlobalCount > 0) {
-    remoteStateStats.metadataUploadTimedOut();
-}
Suggestion importance[1-10]: 5

__

Why: The suggestion raises a valid design question about granularity of metrics - whether to count per-task timeouts vs per-event timeouts. Moving the counter increments inside the loop would provide more granular metrics, but the current approach (counting once per timeout event) is also a valid design choice. The improved code is accurate and the suggestion has moderate impact on metrics quality.

Low
Add test coverage for index metadata timeout

The test only verifies the case where global metadata times out (count = 1) but does
not test the scenario where index metadata times out. Consider adding a separate
test case that verifies getIndexMetadataUploadTimeoutCount() returns a non-zero
value when index metadata upload times out, to ensure full coverage of the new
metrics.

server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateServiceTests.java [859-860]

 assertEquals(0, remoteClusterStateService.getRemoteStateStats().getIndexMetadataUploadTimeoutCount());
 assertEquals(1, remoteClusterStateService.getRemoteStateStats().getMetadataUploadTimeoutCount());
+// TODO: Add a test case where index metadata upload times out to verify getIndexMetadataUploadTimeoutCount() > 0
Suggestion importance[1-10]: 2

__

Why: The suggestion is valid in pointing out missing test coverage for getIndexMetadataUploadTimeoutCount(), but the improved_code only adds a TODO comment rather than an actual test, making it essentially the same as the existing code. The suggestion doesn't provide actionable code improvement.

Low

…search-project#10687)

Add index_metadata_upload_timeout_count and metadata_upload_timeout_count
counters to RemoteUploadStats to track when metadata uploads to the remote
store time out. These metrics are emitted from writeMetadataInParallel when
the latch await exceeds the configured timeout, distinguishing between
index metadata and global metadata (coordination, settings, templates, etc.)
transfer timeouts.

Signed-off-by: zheliu2 <770120041@qq.com>
@zheliu2 zheliu2 force-pushed the feat/remote-state-timeout-metrics branch from ce37bb9 to 8e5d76c Compare March 9, 2026 16:25
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 9, 2026

Persistent review updated to latest commit 8e5d76c

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 9, 2026

❌ Gradle check result for 8e5d76c: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: zheliu2 <770120041@qq.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 9, 2026

Persistent review updated to latest commit 8fc1f76

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 9, 2026

✅ Gradle check result for 8fc1f76: SUCCESS

@zheliu2 zheliu2 marked this pull request as ready for review March 9, 2026 20:12
@zheliu2 zheliu2 requested a review from a team as a code owner March 9, 2026 20:12
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 64.28571% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.30%. Comparing base (6a910bd) to head (8fc1f76).
⚠️ Report is 153 commits behind head on main.

Files with missing lines Patch % Lines
...arch/gateway/remote/RemoteClusterStateService.java 50.00% 2 Missing and 4 partials ⚠️
...nsearch/gateway/remote/RemotePersistenceStats.java 66.66% 2 Missing ⚠️
...g/opensearch/gateway/remote/RemoteUploadStats.java 80.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20809      +/-   ##
============================================
+ Coverage     73.27%   73.30%   +0.02%     
- Complexity    72125    72200      +75     
============================================
  Files          5794     5794              
  Lines        329826   329854      +28     
  Branches      47596    47600       +4     
============================================
+ Hits         241686   241791     +105     
+ Misses        68723    68707      -16     
+ Partials      19417    19356      -61     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross
Copy link
Copy Markdown
Member

@shwetathareja Can you help find someone to review this change? I'm curious if we want to keep adding data like this to the stats API instead of leveraging the telemetry framework.

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot Bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Remote cluster state] Emit metrics for timeout for metadata transfer to remote store

2 participants