
Add additional WLM search settings #20830

Open
dzane17 wants to merge 3 commits into opensearch-project:main from dzane17:wlm-2

Conversation

@dzane17 (Member) commented Mar 10, 2026

Description

Adds support for additional workload group search settings. Follow-up to #20536.

  1. cancel_after_time_interval
    Ensures that long-running searches are automatically canceled after a fixed interval, preventing runaway queries from consuming cluster resources indefinitely and protecting other tenants from noisy neighbors.

  2. max_concurrent_shard_requests
    Limits how many shard-level requests a single search can execute in parallel, reducing fan-out pressure on the cluster and preventing high-cardinality queries from overwhelming CPU and thread pools.

  3. batched_reduce_size
    Controls how many shard results are reduced at a time during the reduce phase, helping to manage memory usage for large fan-out searches and reducing peak heap pressure in multi-tenant environments.

Currently WLM settings act strictly as defaults — if the user has explicitly set a value on the request, it is always preserved. This behavior is subject to change in subsequent PRs (pending discussion).
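
For illustration, a minimal sketch of this defaults-only pattern for the cancel_after_time_interval case in the listener (a sketch, not the PR's verbatim code: it assumes SearchRequest's existing cancelAfterTimeInterval accessors, which return null when unset, and TimeValue.parseTimeValue for parsing):

case CANCEL_AFTER_TIME_INTERVAL:
    // Apply the WLM interval only when the request did not set one itself;
    // getCancelAfterTimeInterval() returns null when the user left it unset.
    if (searchRequest.getCancelAfterTimeInterval() == null) {
        searchRequest.setCancelAfterTimeInterval(
            TimeValue.parseTimeValue(entry.getValue(), "cancel_after_time_interval")
        );
    }
    break;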

Related Issues

Part of #20555

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Mar 10, 2026

PR Reviewer Guide 🔍

(Review updated until commit 2f07ae6)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

Known Limitation / TODO

The BATCHED_REDUCE_SIZE case explicitly acknowledges in a TODO comment that it cannot distinguish between "user explicitly set 512" and "default 512". If a user sets batched_reduce_size=512 on their request, the WLM setting will silently override it, violating the stated contract that "WLM settings act strictly as defaults — if the user has explicitly set a value on the request, it is always preserved." This is a known behavioral inconsistency that is left unresolved in this PR.

case BATCHED_REDUCE_SIZE:
    // Only apply WLM batched reduce size when the request uses the default value
    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
    int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
        searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
    }
    break;
Missing Test

The test testApplySearchSettings_BatchedReduceSize_RequestAlreadySet sets batchedReduceSize=50 and verifies WLM does not override it. However, there is no test covering the edge case where the user explicitly sets batchedReduceSize=512 (equal to DEFAULT_BATCHED_REDUCE_SIZE), which is the known bug described in the TODO. Adding such a test would document the known limitation and catch any future regression.

public void testApplySearchSettings_BatchedReduceSize_RequestAlreadySet() {
    mockSearchRequest.setBatchedReduceSize(50); // explicitly set by user

    String wgId = "test-wg";
    WorkloadGroup wg = createWorkloadGroup(wgId, Map.of("batched_reduce_size", "100"));
    when(workloadGroupService.getWorkloadGroupById(wgId)).thenReturn(wg);
    testThreadPool.getThreadContext().putHeader(WorkloadGroupTask.WORKLOAD_GROUP_ID_HEADER, wgId);

    sut.onRequestStart(mockSearchRequestContext);

    assertEquals(50, mockSearchRequest.getBatchedReduceSize()); // Request value preserved
}
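
A sketch of the suggested missing test, mirroring the structure above (hypothetical test name; as written it documents the known override behavior rather than the desired behavior):

public void testApplySearchSettings_BatchedReduceSize_ExplicitlySetToDefault() {
    mockSearchRequest.setBatchedReduceSize(SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE); // user explicitly sets 512

    String wgId = "test-wg";
    WorkloadGroup wg = createWorkloadGroup(wgId, Map.of("batched_reduce_size", "100"));
    when(workloadGroupService.getWorkloadGroupById(wgId)).thenReturn(wg);
    testThreadPool.getThreadContext().putHeader(WorkloadGroupTask.WORKLOAD_GROUP_ID_HEADER, wgId);

    sut.onRequestStart(mockSearchRequestContext);

    // Documents the known limitation: 512 is indistinguishable from "not set",
    // so the WLM value currently wins. Flip this assertion once a raw accessor lands.
    assertEquals(100, mockSearchRequest.getBatchedReduceSize());
}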

github-actions bot (Contributor) commented Mar 10, 2026

PR Code Suggestions ✨

Latest suggestions up to 2f07ae6

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix silent override of explicitly-set default value

The BATCHED_REDUCE_SIZE case silently overrides a user-explicitly-set value of 512
(the default) with the WLM value, because there is no way to distinguish between
"not set" and "set to 512". This is a known correctness issue noted in the TODO
comment. Similar to how getMaxConcurrentShardRequestsRaw() was added to use 0 as a
sentinel, consider adding a getBatchedReduceSizeRaw() method that returns a sentinel
(e.g., -1 or 0) when not explicitly set, and update setBatchedReduceSize to track
whether it was explicitly set.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [97-107]

 case BATCHED_REDUCE_SIZE:
-    // Only apply WLM batched reduce size when the request uses the default value
-    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
-    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
-    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
-    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
     int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
-    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
+    // Only apply WLM value when user has not explicitly set batched_reduce_size
+    if (searchRequest.getBatchedReduceSizeRaw() == SearchRequest.UNSET_BATCHED_REDUCE_SIZE) {
         searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
     }
     break;
Suggestion importance[1-10]: 5


Why: This is a valid concern about the BATCHED_REDUCE_SIZE case silently overriding user-set values of 512, which is already acknowledged in the TODO comment. However, the improved_code references getBatchedReduceSizeRaw() and UNSET_BATCHED_REDUCE_SIZE which don't exist yet in the codebase, making this more of a design suggestion than an immediately applicable fix.

Low
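
For concreteness, a rough sketch of what the proposed sentinel accessor could look like inside SearchRequest (none of these members exist yet; the names are the suggestion's hypotheticals, and wire-serialization impact is not considered here):

public static final int UNSET_BATCHED_REDUCE_SIZE = 0;

private int batchedReduceSize = UNSET_BATCHED_REDUCE_SIZE;

// Raw value: stays at the sentinel until setBatchedReduceSize is called.
public int getBatchedReduceSizeRaw() {
    return batchedReduceSize;
}

// Effective value: falls back to the default when the user never set one.
public int getBatchedReduceSize() {
    return batchedReduceSize == UNSET_BATCHED_REDUCE_SIZE
        ? DEFAULT_BATCHED_REDUCE_SIZE
        : batchedReduceSize;
}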
General
Document sentinel value assumption to prevent future regressions

A user could explicitly set max_concurrent_shard_requests to 0 (which is technically
an invalid value but may be passed), and the WLM would incorrectly override it. More
importantly, the validation in validatePositiveInt requires values >= 1, but the
sentinel check uses 0. If setMaxConcurrentShardRequests(0) is ever called
programmatically, the WLM will incorrectly treat it as "not set". The sentinel value
and validation constraint should be clearly documented together to avoid future
regressions.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [119-124]

 case MAX_CONCURRENT_SHARD_REQUESTS:
-    // Raw value 0 means not explicitly set; only apply WLM when not explicitly set
+    // Raw value 0 is the sentinel meaning "not explicitly set by user" (validated values are >= 1)
     if (searchRequest.getMaxConcurrentShardRequestsRaw() == 0) {
-        searchRequest.setMaxConcurrentShardRequests(Integer.parseInt(entry.getValue()));
+        int wlmMaxConcurrent = Integer.parseInt(entry.getValue());
+        searchRequest.setMaxConcurrentShardRequests(wlmMaxConcurrent);
     }
     break;
Suggestion importance[1-10]: 2


Why: The improved_code is functionally nearly identical to the existing_code — it only extracts Integer.parseInt(entry.getValue()) into a local variable and adds a comment. This is a minor style improvement with negligible impact, and the existing_code and improved_code are essentially equivalent in behavior.

Low
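
For context, the 0 sentinel holds at the API boundary because SearchRequest's setter rejects values below 1; an approximate sketch of that upstream setter (the exact check and message are recalled, not quoted):

public void setMaxConcurrentShardRequests(int maxConcurrentShardRequests) {
    // Values < 1 never reach the backing field, so a raw value of 0 can only
    // mean "not explicitly set" unless the field is mutated some other way.
    if (maxConcurrentShardRequests < 1) {
        throw new IllegalArgumentException("maxConcurrentShardRequests must be >= 1");
    }
    this.maxConcurrentShardRequests = maxConcurrentShardRequests;
}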

Previous suggestions

Suggestions up to commit a5d7602
Category | Suggestion | Impact
Possible issue
Fix ambiguous default value detection for batched reduce size

The BATCHED_REDUCE_SIZE case silently overrides a user-explicitly-set value of 512
(the default) with the WLM value, since there is no way to distinguish between "not
set" and "set to 512". Similar to how getMaxConcurrentShardRequestsRaw() was added
to SearchRequest to expose a sentinel value of 0, a getBatchedReduceSizeRaw() method
(returning 0 or -1 when not explicitly set) should be added to SearchRequest and
used here to reliably detect whether the user explicitly set this field. This would
make the behavior consistent with the MAX_CONCURRENT_SHARD_REQUESTS case.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [97-107]

 case BATCHED_REDUCE_SIZE:
-    // Only apply WLM batched reduce size when the request uses the default value
-    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
-    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
-    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
-    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
+    // Only apply WLM batched reduce size when the request has not explicitly set it
+    // Use raw accessor (0 = not set) to avoid overriding user-set value of 512 (the default)
     int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
-    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
+    if (searchRequest.getBatchedReduceSizeRaw() == 0) {
         searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
     }
     break;
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a real limitation: when batchedReduceSize is 512 (the default), WLM cannot distinguish between "user explicitly set 512" and "not set". The TODO comment in the PR already acknowledges this issue. The suggested fix (getBatchedReduceSizeRaw()) would require adding a new method to SearchRequest, which is a non-trivial change. The suggestion is valid but the improved_code references a method (getBatchedReduceSizeRaw()) that doesn't exist yet in the PR.

Low
General
Add upper bound validation for concurrent shard requests

The validatePositiveInt method uses Integer.parseInt, which rejects values that overflow an int with a NumberFormatException (reported as "must be a valid integer"), but it does not guard against values that parse successfully yet are semantically invalid (e.g., very large numbers). More importantly, MAX_CONCURRENT_SHARD_REQUESTS in OpenSearch has a documented maximum of 256; passing an unchecked large value could cause unexpected behavior. Consider adding an upper bound check consistent with the underlying field's constraints.

server/src/main/java/org/opensearch/wlm/WorkloadGroupSearchSettings.java [129-139]

 private static String validatePositiveInt(String value) {
     try {
         int intValue = Integer.parseInt(value);
         if (intValue < 1) {
             return "must be positive";
+        }
+        if (intValue > 256) {
+            return "must be <= 256";
         }
         return null;
     } catch (NumberFormatException e) {
         return "must be a valid integer";
     }
 }
Suggestion importance[1-10]: 4


Why: Adding an upper bound of 256 for max_concurrent_shard_requests could prevent misconfiguration, but the claim that OpenSearch has a "documented maximum of 256" is not clearly established in the PR diff. The validatePositiveInt method is also used generically, so hardcoding 256 may not be appropriate if the method is reused for other settings.

Low
Suggestions up to commit c7bb91d
Category | Suggestion | Impact
Possible issue
Fix silent override of user-set batched reduce size

The BATCHED_REDUCE_SIZE case silently overrides the user's explicit setting when it
equals the default value (512), which is a correctness bug. Similar to how
MAX_CONCURRENT_SHARD_REQUESTS uses a raw accessor returning 0 as a sentinel for "not
set", a raw accessor or a boolean flag should be added to SearchRequest for
batchedReduceSize to distinguish between "not set" and "explicitly set to default".
Until that is resolved, the TODO should at minimum be tracked as a known behavioral
issue, but the current implementation can silently override user intent.

server/src/main/java/org/opensearch/wlm/listeners/WorkloadGroupRequestOperationListener.java [97-107]

 case BATCHED_REDUCE_SIZE:
-    // Only apply WLM batched reduce size when the request uses the default value
-    // TODO: batchedReduceSize is a primitive int with no sentinel value, so we cannot
-    // distinguish between "not set" and "explicitly set to 512 (the default)". If a user
-    // explicitly sets batched_reduce_size=512, WLM will still override it. Consider adding
-    // a raw accessor or tracking boolean similar to maxConcurrentShardRequests.
-    int wlmBatchedReduceSize = Integer.parseInt(entry.getValue());
-    if (searchRequest.getBatchedReduceSize() == SearchRequest.DEFAULT_BATCHED_REDUCE_SIZE) {
-        searchRequest.setBatchedReduceSize(wlmBatchedReduceSize);
-    }
+    // Only apply WLM batched reduce size when the request has not been explicitly set.
+    // NOTE: batchedReduceSize is a primitive int with no sentinel value; we cannot
+    // distinguish "not set" from "explicitly set to DEFAULT_BATCHED_REDUCE_SIZE (512)".
+    // A raw accessor (returning 0 when unset) similar to getMaxConcurrentShardRequestsRaw()
+    // should be added to SearchRequest to fix this ambiguity before enabling this feature.
+    // For now, skip applying WLM setting to avoid silently overriding user intent.
+    // TODO: Add getBatchedReduceSizeRaw() to SearchRequest and use it here.
     break;
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies a real behavioral issue where a user explicitly setting batched_reduce_size=512 (the default) would have it silently overridden by WLM. However, the proposed fix of simply skipping the setting entirely is overly conservative and would break the intended WLM functionality. The PR already acknowledges this limitation via a TODO comment, and the suggested "improved_code" removes the feature entirely rather than fixing it properly.

Low
General
Improve validation error message clarity

The validatePositiveInt method uses Integer.parseInt, which accepts any value up to Integer.MAX_VALUE; very large values that parse successfully could still cause issues for maxConcurrentShardRequests. For numeric strings that overflow an int, Integer.parseInt throws a NumberFormatException, which is handled, but the resulting error message "must be a valid integer" is misleading for those overflow cases. Consider adding an upper bound check or improving the error message.

server/src/main/java/org/opensearch/wlm/WorkloadGroupSearchSettings.java [129-139]

 private static String validatePositiveInt(String value) {
     try {
         int intValue = Integer.parseInt(value);
         if (intValue < 1) {
-            return "must be positive";
+            return "must be a positive integer (>= 1)";
         }
         return null;
     } catch (NumberFormatException e) {
-        return "must be a valid integer";
+        return "must be a valid positive integer";
     }
 }
Suggestion importance[1-10]: 2


Why: This suggestion only improves error message wording slightly, which is a very minor cosmetic change. The existing messages are already reasonably clear, and the NumberFormatException overflow case is correctly handled (just with a slightly imprecise message).

Low

github-actions bot (Contributor)

Persistent review updated to latest commit a5d7602

github-actions bot (Contributor)

❌ Gradle check result for a5d7602: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot (Contributor)

Persistent review updated to latest commit 2f07ae6

github-actions bot (Contributor)

❌ Gradle check result for 2f07ae6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dzane17 dzane17 marked this pull request as ready for review March 23, 2026 20:04
@dzane17 dzane17 requested a review from a team as a code owner March 23, 2026 20:04
Signed-off-by: David Zane <davizane@amazon.com>
github-actions bot (Contributor)

Failed to generate code suggestions for PR

* @param value the string to validate
* @return null if valid, error message if invalid
*/
private static String validatePositiveInt(String value) {
Member:

Let's not reinvent the wheel, I believe we have intSettings where you can specify min value as 1.

Member Author:

Good point. intSetting is not ideal since I don't need to create a Setting object. I was able to reuse Settings.parseInt(), Settings.parseTimeValue() methods though.
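
Roughly how validation could delegate to those helpers while keeping the null-or-error-message convention (a sketch; it assumes the signatures Setting.parseInt(String, int, String) and TimeValue.parseTimeValue(String, String), which the comment above refers to as Settings.parseInt()/Settings.parseTimeValue(), and the minimum bounds are illustrative):

import org.opensearch.common.settings.Setting;
import org.opensearch.common.unit.TimeValue;

private static String validate(String key, String value) {
    try {
        switch (key) {
            case "max_concurrent_shard_requests":
            case "batched_reduce_size":
                Setting.parseInt(value, 1, key); // enforces a minimum of 1
                break;
            case "cancel_after_time_interval":
            case "timeout":
                TimeValue.parseTimeValue(value, key); // accepts "10s", "200ms", ...
                break;
            default:
                return "unknown search setting";
        }
        return null; // valid
    } catch (IllegalArgumentException e) { // NumberFormatException is a subclass
        return e.getMessage();
    }
}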

github-actions bot (Contributor)

❌ Gradle check result for 2b0758e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot (Contributor)

Failed to generate code suggestions for PR

Signed-off-by: David Zane <davizane@amazon.com>
github-actions bot (Contributor)

Failed to generate code suggestions for PR

@jainankitk (Contributor) left a comment:

Thanks @dzane17 for raising this PR. At a high level, managing these settings per workload management group seems like a bit of a pain to me. Is it possible to update an individual setting value for a workload management group instead of updating the workload management group itself? Also, would it be better to call it just settings and prefix the setting name with search, so that indexing can also be part of the same json object instead of a separate one?

@dzane17 (Member, Author) commented Mar 27, 2026

@jainankitk Right now search_settings are configured the same way as resource_limits in a WLM group. They are unique to each WLM group and can be updated dynamically. The only difference is that search settings are optional fields. Are you recommending the same search settings be shared across multiple WLM groups?

{
  "_id" : "GxwBfp3_SSyEJ-MpfwFZWw",
  "name" : "analytics",
  "resiliency_mode" : "enforced",
  "resource_limits" : {
    "cpu" : 0.2,
    "memory" : 0.3
  },
  "search_settings" : {
    "batched_reduce_size" : "400",
    "cancel_after_time_interval" : "10s",
    "max_concurrent_shard_requests" : "10",
    "timeout" : "200ms"
  },
  "updated_at" : 1774646244691
}

I think calling it settings would introduce ambiguity between general WLM group config and the actual search settings that are enforced on queries in the group. WLM does not track indexing requests, only searches, so there is no need to add a separate index_settings field. Any settings that impact search requests can be clumped into search_settings anyway.

github-actions bot (Contributor)

✅ Gradle check result for d94c8f4: SUCCESS

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.19%. Comparing base (142d483) to head (d94c8f4).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main   #20830   +/-   ##
=========================================
  Coverage     73.19%   73.19%           
- Complexity    72592    72614   +22     
=========================================
  Files          5848     5849    +1     
  Lines        331991   332077   +86     
  Branches      47948    47953    +5     
=========================================
+ Hits         242986   243069   +83     
- Misses        69541    69547    +6     
+ Partials      19464    19461    -3     
