
Add additional WLM search settings #20830

Open
dzane17 wants to merge 3 commits into opensearch-project:main from dzane17:wlm-2

Conversation

@dzane17 (Member) commented Mar 10, 2026

Description

Adds support for additional workload group search settings. Follow-up to #20536.

  1. cancel_after_time_interval
    Ensures that long-running searches are automatically canceled after a fixed interval, preventing runaway queries from consuming cluster resources indefinitely and protecting other tenants from noisy neighbors.

  2. max_concurrent_shard_requests
    Limits how many shard-level requests a single search can execute in parallel, reducing fan-out pressure on the cluster and preventing high-cardinality queries from overwhelming CPU and thread pools.

  3. batched_reduce_size
    Controls how many shard results are reduced at a time during the reduce phase, helping to manage memory usage for large fan-out searches and reducing peak heap pressure in multi-tenant environments.

Currently WLM settings act strictly as defaults — if the user has explicitly set a value on the request, it is always preserved. This behavior is subject to change in subsequent PRs (pending discussion).
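
For illustration, a minimal sketch of this defaults-only pattern for the cancel_after_time_interval case in the listener (a sketch, not the PR's verbatim code: it assumes SearchRequest's existing cancelAfterTimeInterval accessors, which return null when unset, and TimeValue.parseTimeValue for parsing):

case CANCEL_AFTER_TIME_INTERVAL:
    // Apply the WLM interval only when the request did not set one itself;
    // getCancelAfterTimeInterval() returns null when the user left it unset.
    if (searchRequest.getCancelAfterTimeInterval() == null) {
        searchRequest.setCancelAfterTimeInterval(
            TimeValue.parseTimeValue(entry.getValue(), "cancel_after_time_interval")
        );
    }
    break;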

Related Issues

Part of #20555

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Mar 10, 2026

PR Reviewer Guide 🔍

(Review updated until commit 2f07ae6)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

Known Limitation / TODO

The BATCHED_REDUCE_SIZE case explicitly acknowledges in a TODO comment that it cannot distinguish between "user explicitly set 512" and "default 512". If a user sets batched_reduce_size=512 on their request, the WLM setting will silently override it, violating the stated contract that "WLM settings act strictly as defaults — if the user has explicitly set a value on the request, it is always preserved." This is a known behavioral inconsistency that is left unresolved in this PR.

case BATCHED_REDUCE_SIZE:
    // Only apply WLM batched reduce size when the request uses the default value
    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
    int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
        searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
    }
    break;
Missing Test

The test testApplySearchSettings_BatchedReduceSize_RequestAlreadySet sets batchedReduceSize=50 and verifies WLM does not override it. However, there is no test covering the edge case where the user explicitly sets batchedReduceSize=512 (equal to DEFAULT_BATCHED_REDUCE_SIZE), which is the known bug described in the TODO. Adding such a test would document the known limitation and catch any future regression.

public void testApplySearchSettings_BatchedReduceSize_RequestAlreadySet() {
    mockSearchRequest.setBatchedReduceSize(50); // explicitly set by user

    String wgId = "test-wg";
    WorkloadGroup wg = createWorkloadGroup(wgId, Map.of("batched_reduce_size", "100"));
    when(workloadGroupService.getWorkloadGroupById(wgId)).thenReturn(wg);
    testThreadPool.getThreadContext().putHeader(WorkloadGroupTask.WORKLOAD_GROUP_ID_HEADER, wgId);

    sut.onRequestStart(mockSearchRequestContext);

    assertEquals(50, mockSearchRequest.getBatchedReduceSize()); // Request value preserved
}
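
A sketch of the suggested missing test, mirroring the structure above (hypothetical test name; as written it documents the known override behavior rather than the desired behavior):

public void testApplySearchSettings_BatchedReduceSize_ExplicitlySetToDefault() {
    mockSearchRequest.setBatchedReduceSize(SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE); // user explicitly sets 512

    String wgId = "test-wg";
    WorkloadGroup wg = createWorkloadGroup(wgId, Map.of("batched_reduce_size", "100"));
    when(workloadGroupService.getWorkloadGroupById(wgId)).thenReturn(wg);
    testThreadPool.getThreadContext().putHeader(WorkloadGroupTask.WORKLOAD_GROUP_ID_HEADER, wgId);

    sut.onRequestStart(mockSearchRequestContext);

    // Documents the known limitation: 512 is indistinguishable from "not set",
    // so the WLM value currently wins. Flip this assertion once a raw accessor lands.
    assertEquals(100, mockSearchRequest.getBatchedReduceSize());
}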

github-actions bot (Contributor) commented Mar 10, 2026

PR Code Suggestions ✨

Latest suggestions up to 2f07ae6

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix silent override of explicitly-set default value

The BATCHED_REDUCE_SIZE case silently overrides a user-explicitly-set value of 512
(the default) with the WLM value, because there is no way to distinguish between
"not set" and "set to 512". This is a known correctness issue noted in the TODO
comment. Similar to how getMaxConcurrentShardRequestsRaw() was added to use 0 as a
sentinel, consider adding a getBatchedReduceSizeRaw() method that returns a sentinel
(e.g., -1 or 0) when not explicitly set, and update setBatchedReduceSize to track
whether it was explicitly set.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [97-107]

 case BATCHED_REDUCE_SIZE:
-    // Only apply WLM batched reduce size when the request uses the default value
-    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
-    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
-    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
-    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
     int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
-    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
+    // Only apply WLM value when user has not explicitly set batched_reduce_size
+    if (searchRequest.getBatchedReduceSizeRaw() == SearchRequest.UNSET_BATCHED_REDUCE_SIZE) {
         searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
     }
     break;
Suggestion importance[1-10]: 5


Why: This is a valid concern about the BATCHED_REDUCE_SIZE case silently overriding user-set values of 512, which is already acknowledged in the TODO comment. However, the improved_code references getBatchedReduceSizeRaw() and UNSET_BATCHED_REDUCE_SIZE which don't exist yet in the codebase, making this more of a design suggestion than an immediately applicable fix.

Low
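
For concreteness, a rough sketch of what the proposed sentinel accessor could look like inside SearchRequest (none of these members exist yet; the names are the suggestion's hypotheticals, and wire-serialization impact is not considered here):

public static final int UNSET_BATCHED_REDUCE_SIZE = 0;

private int batchedReduceSize = UNSET_BATCHED_REDUCE_SIZE;

// Raw value: stays at the sentinel until setBatchedReduceSize is called.
public int getBatchedReduceSizeRaw() {
    return batchedReduceSize;
}

// Effective value: falls back to the default when the user never set one.
public int getBatchedReduceSize() {
    return batchedReduceSize == UNSET_BATCHED_REDUCE_SIZE
        ? DEFAULT_BATCHED_REDUCE_SIZE
        : batchedReduceSize;
}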
General
Document sentinel value assumption to prevent future regressions

A user could explicitly set max_concurrent_shard_requests to 0 (which is technically
an invalid value but may be passed), and the WLM would incorrectly override it. More
importantly, the validation in validatePositiveInt requires values >= 1, but the
sentinel check uses 0. If setMaxConcurrentShardRequests(0) is ever called
programmatically, the WLM will incorrectly treat it as "not set". The sentinel value
and validation constraint should be clearly documented together to avoid future
regressions.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [119-124]

 case MAX_CONCURRENT_SHARD_REQUESTS:
-    // Raw value 0 means not explicitly set; only apply WLM when not explicitly set
+    // Raw value 0 is the sentinel meaning "not explicitly set by user" (validated values are >= 1)
     if (searchRequest.getMaxConcurrentShardRequestsRaw() == 0) {
-        searchRequest.setMaxConcurrentShardRequests(Integer.parseInt(entry.getValue()));
+        int wlmMaxConcurrent = Integer.parseInt(entry.getValue());
+        searchRequest.setMaxConcurrentShardRequests(wlmMaxConcurrent);
     }
     break;
Suggestion importance[1-10]: 2


Why: The improved_code is functionally nearly identical to the existing_code — it only extracts Integer.parseInt(entry.getValue()) into a local variable and adds a comment. This is a minor style improvement with negligible impact, and the existing_code and improved_code are essentially equivalent in behavior.

Low
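
For context, the 0 sentinel holds at the API boundary because SearchRequest's setter rejects values below 1; an approximate sketch of that upstream setter (the exact check and message are recalled, not quoted):

public void setMaxConcurrentShardRequests(int maxConcurrentShardRequests) {
    // Values < 1 never reach the backing field, so a raw value of 0 can only
    // mean "not explicitly set" unless the field is mutated some other way.
    if (maxConcurrentShardRequests < 1) {
        throw new IllegalArgumentException("maxConcurrentShardRequests must be >= 1");
    }
    this.maxConcurrentShardRequests = maxConcurrentShardRequests;
}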

Previous suggestions

Suggestions up to commit a5d7602
Category | Suggestion | Impact
Possible issue
Fix ambiguous default value detection for batched reduce size

The BATCHED_REDUCE_SIZE case silently overrides a user-explicitly-set value of 512
(the default) with the WLM value, since there is no way to distinguish between "not
set" and "set to 512". Similar to how getMaxConcurrentShardRequestsRaw() was added
to SearchRequest to expose a sentinel value of 0, a getBatchedReduceSizeRaw() method
(returning 0 or -1 when not explicitly set) should be added to SearchRequest and
used here to reliably detect whether the user explicitly set this field. This would
make the behavior consistent with the MAX_CONCURRENT_SHARD_REQUESTS case.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [97-107]

 case BATCHED_REDUCE_SIZE:
-    // Only apply WLM batched reduce size when the request uses the default value
-    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
-    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
-    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
-    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
+    // Only apply WLM batched reduce size when the request has not explicitly set it
+    // Use raw accessor (0 = not set) to avoid overriding user-set value of 512 (the default)
     int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
-    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
+    if (searchRequest.getBatchedReduceSizeRaw() == 0) {
         searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
     }
     break;
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a real limitation: when batchedReduceSize is 512 (the default), WLM cannot distinguish between "user explicitly set 512" and "not set". The TODO comment in the PR already acknowledges this issue. The suggested fix (getBatchedReduceSizeRaw()) would require adding a new method to SearchRequest, which is a non-trivial change. The suggestion is valid but the improved_code references a method (getBatchedReduceSizeRaw()) that doesn't exist yet in the PR.

Low
General
Add upper bound validation for concurrent shard requests

The validatePositiveInt method uses Integer.parseInt, which rejects values that overflow an int with a NumberFormatException (reported as "must be a valid integer"), but it does not guard against values that parse successfully yet are semantically invalid (e.g., very large numbers). More importantly, MAX_CONCURRENT_SHARD_REQUESTS in OpenSearch has a documented maximum of 256; passing an unchecked large value could cause unexpected behavior. Consider adding an upper bound check consistent with the underlying field's constraints.

server/src/main/java/org/opensearch/wlm/WorkloadGroupSearchSettings.java [129-139]

 private static String validatePositiveInt(String value) {
     try {
         int intValue = Integer.parseInt(value);
         if (intValue < 1) {
             return "must be positive";
+        }
+        if (intValue > 256) {
+            return "must be <= 256";
         }
         return null;
     } catch (NumberFormatException e) {
         return "must be a valid integer";
     }
 }
Suggestion importance[1-10]: 4


Why: Adding an upper bound of 256 for max_concurrent_shard_requests could prevent misconfiguration, but the claim that OpenSearch has a "documented maximum of 256" is not clearly established in the PR diff. The validatePositiveInt method is also used generically, so hardcoding 256 may not be appropriate if the method is reused for other settings.

Low
Suggestions up to commit c7bb91d
Category | Suggestion | Impact
Possible issue
Fix silent override of user-set batched reduce size

The BATCHED_REDUCE_SIZE case silently overrides the user's explicit setting when it
equals the default value (512), which is a correctness bug. Similar to how
MAX_CONCURRENT_SHARD_REQUESTS uses a raw accessor returning 0 as a sentinel for "not
set", a raw accessor or a boolean flag should be added to SearchRequest for
batchedReduceSize to distinguish between "not set" and "explicitly set to default".
Until that is resolved, the TODO should at minimum be tracked as a known behavioral
issue, but the current implementation can silently override user intent.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [97-107]

 case BATCHED_REDUCE_SIZE:
-    // Only apply WLM batched reduce size when the request uses the default value
-    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
-    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
-    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
-    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
-    int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
-    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
-        searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
-    }
+    // Only apply WLM batched reduce size when the request has not been explicitly set.
+    // NOTE: batchedReduceSize is a primitive int with no sentinel value; we cannot
+    // distinguish "not set" from "explicitly set to DEFAULT_BATCHED_REDUCE_SIZE (512)".
+    // A raw accessor (returning 0 when unset) similar to getMaxConcurrentShardRequestsRaw()
+    // should be added to SearchRequest to fix this ambiguity before enabling this feature.
+    // For now, skip applying WLM setting to avoid silently overriding user intent.
+    // TODO: Add getBatchedReduceSizeRaw() to SearchRequest and use it here.
     break;
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies a real behavioral issue where a user explicitly setting batched_reduce_size=512 (the default) would have it silently overridden by WLM. However, the proposed fix of simply skipping the setting entirely is overly conservative and would break the intended WLM functionality. The PR already acknowledges this limitation via a TODO comment, and the suggested "improved_code" removes the feature entirely rather than fixing it properly.

Low
General
Improve validation error message clarity

The validatePositiveInt method uses Integer.parseInt, which accepts any value up to Integer.MAX_VALUE; very large values that parse successfully could still cause issues for maxConcurrentShardRequests. For numeric strings that overflow an int, Integer.parseInt throws a NumberFormatException, which is handled, but the resulting error message "must be a valid integer" is misleading for those overflow cases. Consider adding an upper bound check or improving the error message.

server/src/main/java/org/opensearch/wlm/WorkloadGroupSearchSettings.java [129-139]

 private static String validatePositiveInt(String value) {
     try {
         int intValue = Integer.parseInt(value);
         if (intValue < 1) {
-            return "must be positive";
+            return "must be a positive integer (>= 1)";
         }
         return null;
     } catch (NumberFormatException e) {
-        return "must be a valid integer";
+        return "must be a valid positive integer";
     }
 }
Suggestion importance[1-10]: 2


Why: This suggestion only improves error message wording slightly, which is a very minor cosmetic change. The existing messages are already reasonably clear, and the NumberFormatException overflow case is correctly handled (just with a slightly imprecise message).

Low

github-actions bot (Contributor)

Persistent review updated to latest commit a5d7602

github-actions bot (Contributor)

❌ Gradle check result for a5d7602: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot (Contributor)

Persistent review updated to latest commit 2f07ae6

github-actions bot (Contributor)

❌ Gradle check result for 2f07ae6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dzane17 dzane17 marked this pull request as ready for review March 23, 2026 20:04
@dzane17 dzane17 requested a review from a team as a code owner March 23, 2026 20:04
Signed-off-by: David Zane <davizane@amazon.com>
github-actions bot (Contributor)

Failed to generate code suggestions for PR

* @param value the string to validate
* @return null if valid, error message if invalid
*/
private static String validatePositiveInt(String value) {
Member:

Let's not reinvent the wheel, I believe we have intSettings where you can specify min value as 1.

Member Author:

Good point. intSetting is not ideal since I don't need to create a Setting object. I was able to reuse Settings.parseInt(), Settings.parseTimeValue() methods though.
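
Roughly how validation could delegate to those helpers while keeping the null-or-error-message convention (a sketch; it assumes the signatures Setting.parseInt(String, int, String) and TimeValue.parseTimeValue(String, String), which the comment above refers to as Settings.parseInt()/Settings.parseTimeValue(), and the minimum bounds are illustrative):

import org.opensearch.common.settings.Setting;
import org.opensearch.common.unit.TimeValue;

private static String validate(String key, String value) {
    try {
        switch (key) {
            case "max_concurrent_shard_requests":
            case "batched_reduce_size":
                Setting.parseInt(value, 1, key); // enforces a minimum of 1
                break;
            case "cancel_after_time_interval":
            case "timeout":
                TimeValue.parseTimeValue(value, key); // accepts "10s", "200ms", ...
                break;
            default:
                return "unknown search setting";
        }
        return null; // valid
    } catch (IllegalArgumentException e) { // NumberFormatException is a subclass
        return e.getMessage();
    }
}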

github-actions bot (Contributor)

❌ Gradle check result for 2b0758e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot (Contributor)

Failed to generate code suggestions for PR

Signed-off-by: David Zane <davizane@amazon.com>
github-actions bot (Contributor)

Failed to generate code suggestions for PR

@jainankitk (Contributor) left a comment:

Thanks @dzane17 for raising this PR. At a high level, managing these settings per workload management group seems like a bit of a pain to me. Is it possible to update an individual setting value for a workload management group instead of updating the workload management group itself? Also, would it be better to call it just settings and prefix the setting name with search, so that indexing can also be part of the same json object instead of a separate one?

@dzane17 (Member, Author) commented Mar 27, 2026

@jainankitk Right now search_settings are configured the same way as resource_limits in a WLM group. They are unique to each WLM group and can be updated dynamically. The only difference is that search settings are optional fields. Are you recommending the same search settings be shared across multiple WLM groups?

{
  "_id" : "GxwBfp3_SSyEJ-MpfwFZWw",
  "name" : "analytics",
  "resiliency_mode" : "enforced",
  "resource_limits" : {
    "cpu" : 0.2,
    "memory" : 0.3
  },
  "search_settings" : {
    "batched_reduce_size" : "400",
    "cancel_after_time_interval" : "10s",
    "max_concurrent_shard_requests" : "10",
    "timeout" : "200ms"
  },
  "updated_at" : 1774646244691
}

I think calling it settings would introduce ambiguity between general WLM group config and the actual search settings that are enforced on queries in the group. WLM does not track indexing requests, only searches, so there is no need to add a separate index_settings field. Any settings that impact search requests can be clumped into search_settings anyway.

github-actions bot (Contributor)

✅ Gradle check result for d94c8f4: SUCCESS

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.19%. Comparing base (142d483) to head (d94c8f4).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main   #20830   +/-   ##
=========================================
  Coverage     73.19%   73.19%           
- Complexity    72592    72614   +22     
=========================================
  Files          5848     5849    +1     
  Lines        331991   332077   +86     
  Branches      47948    47953    +5     
=========================================
+ Hits         242986   243069   +83     
- Misses        69541    69547    +6     
+ Partials      19464    19461    -3     
