Skip to content

Fix SearchPhaseExecutionException to properly initCause#20320

Merged
cwperks merged 3 commits intoopensearch-project:mainfrom
cwperks:fix-shard-failure-exception
Dec 26, 2025
Merged

Fix SearchPhaseExecutionException to properly initCause#20320
cwperks merged 3 commits intoopensearch-project:mainfrom
cwperks:fix-shard-failure-exception

Conversation

@cwperks
Copy link
Copy Markdown
Member

@cwperks cwperks commented Dec 24, 2025

Description

This PR is an alternative to opensearch-project/anomaly-detection#1638 and fixes an issue with the log4j upgrade from 2.21.0 to 2.25.3. This PR ensures that the SearchPhaseExecutionException calls initCause with the proper root cause to ensure that log4j can get cause metadata.

Related Issues

See linked exception on opensearch-project/anomaly-detection#1638 that this PR resolves.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes

    • Improved search error reporting so root causes of shard failures are initialized and surfaced more reliably, giving clearer, more actionable error messages.
    • Serialized error responses now include a top-level caused_by block so downstream tools and UI see the underlying failure details.
  • Tests

    • Updated tests to reflect the new error serialization and whitespace-insensitive JSON comparisons.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@cwperks cwperks requested a review from a team as a code owner December 24, 2025 16:14
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Dec 24, 2025

📝 Walkthrough

Walkthrough

Replaces SearchPhaseExecutionException's previous deduplication logic with a new getEffectiveCause helper used by constructors; an explicit cause is preferred, otherwise the first shard failure cause is used. Tests and serialized expectations were updated to reflect the new caused_by behavior; changelog entry added.

Changes

Cohort / File(s) Summary
Changelog
CHANGELOG.md
Added a Fixed entry: "Fix SearchPhaseExecutionException to properly initCause"
Exception cause resolution
server/src/main/java/org/opensearch/action/search/SearchPhaseExecutionException.java
Removed deduplicateCause(...) and the overridden getCause(); added getEffectiveCause(Throwable, ShardSearchFailure[]); constructors now use getEffectiveCause to set the cause (prefer explicit cause, else first shard failure cause, else null)
Tests & serialization expectations
server/src/test/java/org/opensearch/action/search/SearchPhaseExecutionExceptionTests.java, server/src/test/java/org/opensearch/OpenSearchExceptionTests.java, server/src/test/java/org/opensearch/rest/BytesRestResponseTests.java
Updated expected XContent/JSON in tests to include caused_by where appropriate and to use whitespace-insensitive comparisons via XContentHelper.stripWhitespace; adjusted some exception construction in tests to match new cause wiring

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit finds a tangled trace,
Sniffs causes in a crowded space,
Picks the true one, hops with glee,
"Now stacktraces are clearer," says he. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: fixing SearchPhaseExecutionException to properly initialize the cause.
Description check ✅ Passed The description provides context about the fix, references related issues and PRs, and follows the template structure with Description, Related Issues, and Checklist sections.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4923381 and 14b84e1.

📒 Files selected for processing (3)
  • server/src/test/java/org/opensearch/OpenSearchExceptionTests.java
  • server/src/test/java/org/opensearch/action/search/SearchPhaseExecutionExceptionTests.java
  • server/src/test/java/org/opensearch/rest/BytesRestResponseTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T22:44:14.799Z
Learnt from: prudhvigodithi
Repo: opensearch-project/OpenSearch PR: 20112
File: server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java:73-81
Timestamp: 2025-12-02T22:44:14.799Z
Learning: In OpenSearch integration tests extending OpenSearchIntegTestCase, using `LuceneTestCase.SuppressCodecs("*")` triggers special handling that selects a random production codec from the CODECS array, while `SuppressCodecs("Asserting")` or other specific codec suppressions still allow Lucene's default codec randomization which may include the asserting codec. Use `SuppressCodecs("*")` when you need to completely avoid asserting codecs (e.g., for cross-thread StoredFieldsReader usage) while maintaining production codec test coverage.

Applied to files:

  • server/src/test/java/org/opensearch/OpenSearchExceptionTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: detect-breaking-change
🔇 Additional comments (9)
server/src/test/java/org/opensearch/rest/BytesRestResponseTests.java (2)

43-43: LGTM!

The import is correctly added to support the new XContentHelper.stripWhitespace usage in the test.


183-221: LGTM!

The test update correctly reflects the new behavior where SearchPhaseExecutionException properly wires up the cause. The multi-line JSON string format improves readability, and the caused_by block accurately captures the first shard failure's ParsingException. Using XContentHelper.stripWhitespace ensures whitespace-insensitive comparison while maintaining the readable format in the test source.

server/src/test/java/org/opensearch/action/search/SearchPhaseExecutionExceptionTests.java (2)

82-121: LGTM!

The expected JSON correctly includes the new caused_by block reflecting the first shard failure's ParsingException. The multi-line format with XContentHelper.stripWhitespace improves test readability while maintaining accurate comparison.


158-159: LGTM!

The assertion change from expecting a null cause to expecting a non-null cause correctly reflects the new behavior where SearchPhaseExecutionException wires up the first shard failure's cause. The accompanying comment clearly explains this behavior change.

server/src/test/java/org/opensearch/OpenSearchExceptionTests.java (5)

48-48: LGTM!

The import is correctly added to support the XContentHelper.stripWhitespace usage in test assertions.


255-282: LGTM!

The expected JSON correctly reflects the new caused_by block in the SearchPhaseExecutionException serialization. The multi-line format with XContentHelper.stripWhitespace improves readability.


306-344: LGTM!

The expected JSON for the second deduplicate test case correctly includes the caused_by block derived from the first shard failure's ParsingException, consistent with the new behavior.


367-371: LGTM!

This test case correctly verifies that when an explicit NullPointerException cause is provided to SearchPhaseExecutionException, it takes precedence and appears in the caused_by block with "reason":null as expected for a NullPointerException.


1127-1129: LGTM!

The expected exception now correctly includes a nested OpenSearchException cause wrapping the ParsingException, which aligns with the new behavior where SearchPhaseExecutionException wires up the first shard failure's cause when no explicit cause is provided.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
server/src/main/java/org/opensearch/action/search/SearchPhaseExecutionException.java (2)

76-85: Redundant loop — both branches return cause.

The loop checking for duplicate causes has no effect on the outcome. Whether a duplicate is found or not, cause is always returned when non-null. The comment mentions "not duplicated" but the behavior is identical in both cases.

Consider simplifying:

🔎 Proposed simplification
     // If explicit cause provided and not duplicated in shard failures, use it
     if (cause != null) {
-        for (ShardSearchFailure failure : shardFailures) {
-            if (failure.getCause() == cause) {
-                // Cause is duplicated, but still return it to properly wire the chain
-                return cause;
-            }
-        }
         return cause;
     }

163-175: Stale comment references removed getCause() override.

The comment at lines 163-165 references {@link #getCause()} with "guessing" behavior, but the overridden getCause() method was removed in this PR. The distinction between getCause() and super.getCause() (line 174) is no longer meaningful since both now return the same value.

Consider updating the comment to reflect that cause resolution now happens at construction time via getEffectiveCause.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e047211 and 4923381.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • server/src/main/java/org/opensearch/action/search/SearchPhaseExecutionException.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: Analyze (java)
🔇 Additional comments (2)
CHANGELOG.md (1)

25-25: LGTM!

The changelog entry follows the project format and is correctly placed under the Fixed section.

server/src/main/java/org/opensearch/action/search/SearchPhaseExecutionException.java (1)

62-66: LGTM!

The constructor now properly initializes the cause at construction time, which ensures initCause is called correctly for log4j compatibility.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 4923381: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep lucene labels Dec 24, 2025
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 4923381: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dbwiddis
Copy link
Copy Markdown
Member

dbwiddis commented Dec 24, 2025

@cwperks
Copy link
Copy Markdown
Member Author

cwperks commented Dec 24, 2025

I'll push a fix for the tests soon

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@cwperks
Copy link
Copy Markdown
Member Author

cwperks commented Dec 24, 2025

@dbwiddis I pushed a commit to fix the tests. Note that this slightly switches the serialization of SearchPhaseExecutionException to always include a caused_by if at least a single shard failed. I think this is reasonable for the fix, but lmk what you think.

I also took the opportunity to modernize and use Text Blocks (IMO much more readable).

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 14b84e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 14b84e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Copy Markdown
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one question on logic sequence. OK to merge if you're happy with it.

@github-actions
Copy link
Copy Markdown
Contributor

❕ Gradle check result for 14b84e1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link
Copy Markdown

codecov bot commented Dec 25, 2025

Codecov Report

❌ Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.26%. Comparing base (2216423) to head (14b84e1).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...h/action/search/SearchPhaseExecutionException.java 72.72% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20320      +/-   ##
============================================
- Coverage     73.28%   73.26%   -0.02%     
- Complexity    71723    71761      +38     
============================================
  Files          5785     5785              
  Lines        328137   328136       -1     
  Branches      47270    47269       -1     
============================================
- Hits         240465   240408      -57     
- Misses        68397    68485      +88     
+ Partials      19275    19243      -32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@cwperks cwperks merged commit 696951d into opensearch-project:main Dec 26, 2025
36 of 41 checks passed
@dbwiddis
Copy link
Copy Markdown
Member

@cwperks should we backport this? I think 2.19 will suffer the same fate

@cwperks
Copy link
Copy Markdown
Member Author

cwperks commented Dec 26, 2025

@cwperks should we backport this? I think 2.19 will suffer the same fate

the log4j bump hasn't been merged yet. I will raise a backport for this once the log4j bump is merged

@cwperks
Copy link
Copy Markdown
Member Author

cwperks commented Dec 27, 2025

#20314 is merged now. I will open up a backport on Monday.

@cwperks
Copy link
Copy Markdown
Member Author

cwperks commented Dec 29, 2025

Opened backport for 2.19: #20336

andrross pushed a commit that referenced this pull request Dec 29, 2025
…se (#20320) (#20336)

* [Backport 2.19] Fix SearchPhaseExecutionException to properly initCause

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Add to CHANGELOG

Signed-off-by: Craig Perkins <cwperx@amazon.com>

---------

Signed-off-by: Craig Perkins <cwperx@amazon.com>
mohit10011999 pushed a commit to mohit10011999/OpenSearch that referenced this pull request Dec 30, 2025
…roject#20320)

* Fix SearchPhaseExecutionException to properly initCause

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Add to CHANGELOG

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Fix tests

Signed-off-by: Craig Perkins <cwperx@amazon.com>

---------

Signed-off-by: Craig Perkins <cwperx@amazon.com>
mohit10011999 pushed a commit to mohit10011999/OpenSearch that referenced this pull request Dec 30, 2025
…roject#20320)

* Fix SearchPhaseExecutionException to properly initCause

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Add to CHANGELOG

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Fix tests

Signed-off-by: Craig Perkins <cwperx@amazon.com>

---------

Signed-off-by: Craig Perkins <cwperx@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep lucene

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ThrowableStackTraceRenderer throws NPE when rendering a Throwable with concurrently-mutated suppressions

2 participants