Adding chaos test cases for Context Aware Segments by RS146BIJAY · Pull Request #20468 · opensearch-project/OpenSearch

RS146BIJAY · 2026-01-23T18:24:08Z

Description

Adding chaos testing for Context Aware Segments during indexing, refresh, and update/delete operations for the following scenarios:

For disk full scenarios.
For JRE crash scenarios.
For OutOfFileDescriptors scenarios.

coderabbitai · 2026-01-23T18:24:46Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request introduces test infrastructure enhancements for engine reliability testing. It adds a utility class for suppressing merge scheduler exceptions, three new chaos test classes for simulating disk-full, JRE-crash, and file descriptor exhaustion scenarios, and extends the test framework with context-aware document creation helpers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Test Utility `server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java` | New utility class with `suppressMergePolicyException()` method that suppresses thread-exception output from `ConcurrentMergeScheduler` during test execution. |
| Chaos Test Classes `server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java`, `InternalEngineOnJRECrashTests.java`, `InternalEngineOnOutOfFileDescriptorsTests.java` | Three new test classes for engine failure scenarios: `InternalEngineOnDiskWriterFullTests` verifies exception handling when disk fills during document indexing and refresh; `InternalEngineOnJRECrashTests` simulates JRE interruption with variable crash delays (10ms–10s) and validates data persistence; `InternalEngineOnOutOfFileDescriptorsTests` injects IO errors and verifies engine behavior under file descriptor exhaustion. All classes use mock directory wrappers, custom IndexWriter configuration, and CONTEXT_AWARE_MIGRATION_EXPERIMENTAL_FLAG. |
| Test Framework Enhancement `test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java` | Adds `testContextSpecificDocumentWithTenantField()` helper method for creating context-aware documents with tenant grouping criteria; enhances context-aware index mapping with grouping_criteria field and script-based context-aware grouping configuration. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lucene, Indexing & Search

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description covers the main purpose (chaos testing for Context Aware Segments) and lists three key scenarios, but is missing related issue reference and checklist completion required by the template. | Add the 'Resolves #[Issue number]' section and complete the checklist items to fully align with the repository's pull request description template. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding chaos test cases for Context Aware Segments, which matches the pull request's objective of introducing multiple chaos testing scenarios. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java`:
- Around line 95-99: The test currently calls started.await() after
testThread.start(), which can hang the CI if the worker fails to signal; modify
the wait to use a bounded timeout (e.g., started.await(timeout, timeUnit)) and
fail the test with a clear error if the latch times out. Update the code around
testThread.start()/started.await() in InternalEngineOnJRECrashTests so the start
latch uses a sensible timeout and asserts that the await returned true (or
throws/records failure) before proceeding.

In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java`:
- Around line 44-101: The test reuses hitException for both the initial
low-error phase and the forced-failure phase, which can hide a missing exception
in phase‑2; introduce a separate boolean (e.g., hitForcedFailure) and use it
only for the second try/catch that indexes with
dir.setRandomIOExceptionRateOnOpen(rate) and the subsequent
engine.refresh("testing"), set hitForcedFailure = true in that catch
(EngineException e) and assertTrue(hitForcedFailure) after that block while
keeping the original hitException checks/assertFalse for the first phase (the
variables to change are hitException and the new hitForcedFailure used around
the second indexing/refresh try/catch).

In
`@test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java`:
- Around line 826-836: The override of index(Index index) unconditionally calls
ParseContext.Document.setGroupingCriteria("grouping_criteria") which will
overwrite caller-provided tenant/grouping values (e.g., those set by
testContextSpecificDocumentWithTenantField); change the logic in the index(Index
index) override (inside the config().getIndexSettings().isContextAwareEnabled()
branch) to only call setGroupingCriteria("grouping_criteria") when the document
has no grouping criteria already (e.g., check the document's existing grouping
value via the appropriate getter or null/empty check before calling
setGroupingCriteria), so provided grouping criteria are preserved.

♻️ Duplicate comments (1)

test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (1)

854-863: Same grouping-criteria overwrite concern here.
This override has the same unconditional clobbering behavior as the earlier block.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5041d43 and 231cb15.

📒 Files selected for processing (5)

server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java
test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java

🧰 Additional context used

🧬 Code graph analysis (2)

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (3)

libs/core/src/main/java/org/opensearch/core/common/bytes/BytesArray.java (1)

BytesArray (48-151)

server/src/main/java/org/opensearch/index/engine/InternalEngine.java (1)

InternalEngine (143-3057)

server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)

FeatureFlags (29-341)

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java (2)

server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)

IndexWriterUtil (15-24)

server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)

FeatureFlags (29-341)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)

GitHub Check: gradle-check
GitHub Check: detect-breaking-change
GitHub Check: precommit (21, windows-2025, true)
GitHub Check: precommit (21, macos-15)
GitHub Check: precommit (21, ubuntu-latest)
GitHub Check: precommit (25, macos-15-intel)
GitHub Check: precommit (25, ubuntu-24.04-arm)
GitHub Check: precommit (25, ubuntu-latest)
GitHub Check: precommit (21, macos-15-intel)
GitHub Check: precommit (25, macos-15)
GitHub Check: precommit (21, ubuntu-24.04-arm)
GitHub Check: precommit (21, windows-latest)
GitHub Check: precommit (25, windows-latest)
GitHub Check: Analyze (java)
GitHub Check: assemble (21, ubuntu-latest)
GitHub Check: assemble (21, windows-latest)
GitHub Check: assemble (25, ubuntu-24.04-arm)
GitHub Check: assemble (21, ubuntu-24.04-arm)
GitHub Check: assemble (25, ubuntu-latest)
GitHub Check: assemble (25, windows-latest)
GitHub Check: Mend Security Check

🔇 Additional comments (7)

server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)

15-23: LGTM — utility cleanly suppresses merge-scheduler noise in tests.

test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (2)

398-403: Nice addition for tenant-scoped docs.

1695-1696: LGTM — mapping update aligns with context-aware grouping.

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java (2)

90-133: LGTM — refresh failure path is asserted clearly.

71-83: Improve exception handling to catch the actual failures from disk-full.
The suggested IOException | AlreadyClosedException does not account for RefreshFailedEngineException, which is thrown by engine.refresh() during disk-full conditions (confirmed by the companion test testEngineRefreshOnDiskFull that explicitly expects this). Since the loop calls both engine.index() and engine.refresh(), catch EngineException instead, which covers RefreshFailedEngineException and provides a more precise failure specification than broad Exception.
🧹 Suggested tightening
-            } catch (Exception ex) {
+            } catch (EngineException ex) {
                 hitException = true;
             }
Likely an incorrect or invalid review comment.

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (2)

37-55: LGTM — workload loop is straightforward and tenant-aware.

122-140: LGTM — good coverage across crash delays.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

coderabbitai · 2026-01-23T18:35:47Z

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java

+                // Start the indexing thread.
+                testThread.start();
+                // Wait for the indexing thread to start.
+                started.await();
+


⚠️ Potential issue | 🟡 Minor

Add a timeout to the start latch to prevent hangs.
started.await() without a timeout can hang CI if the worker thread fails to start.

⏱️ Suggested safeguard

-                started.await();
+                assertTrue(started.await(5, TimeUnit.SECONDS));
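The bounded wait suggested above can be illustrated outside the test class. The following is a minimal, self-contained sketch, not the PR's code: the class `BoundedLatchWait` and the helper `startAndAwait` are hypothetical names introduced for illustration, and only `CountDownLatch.await(long, TimeUnit)` from the JDK is relied on. It shows that the timed `await` returns `false` instead of blocking forever when the worker never signals.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of the pull request.
final class BoundedLatchWait {

    /**
     * Starts a worker thread and waits at most timeoutMillis for it to count down the latch.
     * Returns true if the worker signalled in time, false if the wait timed out.
     */
    static boolean startAndAwait(Runnable body, CountDownLatch started, long timeoutMillis)
            throws InterruptedException {
        Thread worker = new Thread(body, "indexing-worker");
        worker.setDaemon(true); // do not keep the JVM alive because of this thread
        worker.start();
        return started.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Healthy worker: signals the latch as soon as it runs.
        CountDownLatch ok = new CountDownLatch(1);
        System.out.println(startAndAwait(ok::countDown, ok, 5_000)); // true

        // Broken worker: exits without signalling, so the bounded wait gives up.
        CountDownLatch stuck = new CountDownLatch(1);
        System.out.println(startAndAwait(() -> {}, stuck, 200)); // false
    }
}
```

In a test, wrapping the returned boolean in `assertTrue(...)` turns a silent hang into an immediate, attributable failure.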



coderabbitai · 2026-01-23T18:35:47Z

...c/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java

+        boolean hitException = false;
+        final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED);
+        final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings(
+            "test",
+            Settings.builder()
+                .put(defaultSettings.getSettings())
+                .put(IndexSettings.INDEX_CONTEXT_AWARE_ENABLED_SETTING.getKey(), true)
+                .build()
+        );
+        try (
+            Store store = createStore(dir);
+            InternalEngine engine = createEngine(
+                indexSettings,
+                store,
+                primaryTranslogDir,
+                newMergePolicy(),
+                indexWriterFactory,
+                null,
+                globalCheckpoint::get
+            )
+        ) {
+            int numDocsFirstSegment = randomIntBetween(50, 100);
+            try {
+                for (int i = 0; i < numDocsFirstSegment; i++) {
+                    String id = Integer.toString(i);
+                    ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
+                    engine.index(indexForDoc(doc));
+                }
+            } catch (IOException ex) {
+                hitException = true;
+            }
+
+            assertFalse(hitException);
+            assertTrue(DirectoryReader.indexExists(dir));
+
+            try {
+                engine.refresh("testing");
+            } catch (EngineException e) {
+                hitException = true;
+            }
+
+            assertTrue(DirectoryReader.indexExists(dir));
+            rate = 1.0;
+            dir.setRandomIOExceptionRateOnOpen(rate);
+            try {
+                for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) {
+                    String id = Integer.toString(i);
+                    ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
+                    engine.index(indexForDoc(doc));
+                }
+
+                engine.refresh("testing");
+            } catch (EngineException e) {
+                hitException = true;
+            }
+
+            assertTrue(hitException);
+            assertTrue(DirectoryReader.indexExists(dir));


⚠️ Potential issue | 🟡 Minor

Separate failure flags for the two phases.
hitException is reused across the low-rate and 100% rate phases, so a phase‑1 failure can mask a missing phase‑2 failure. A dedicated flag for the forced-failure phase makes the assertion precise.

🧪 Suggested adjustment

-            rate = 1.0;
+            boolean phase2Failed = false;
+            rate = 1.0;
             dir.setRandomIOExceptionRateOnOpen(rate);
             try {
                 for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) {
                     String id = Integer.toString(i);
                     ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
                     engine.index(indexForDoc(doc));
                 }

                 engine.refresh("testing");
             } catch (EngineException e) {
-                hitException = true;
+                phase2Failed = true;
             }

-            assertTrue(hitException);
+            assertTrue(phase2Failed);
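To make the masking concern concrete, here is a small stand-alone sketch. It is an illustration under assumptions, not the test from the PR: `PhaseFlagsDemo`, `Step`, `sharedFlagCheck`, and `dedicatedFlagCheck` are hypothetical names, and plain `IOException` stands in for the engine exceptions. It compares a flag shared across two phases with a flag dedicated to the second phase.

```java
import java.io.IOException;

// Hypothetical demo class; not part of the pull request.
final class PhaseFlagsDemo {

    /** Stand-in for an operation that may or may not fail. */
    interface Step {
        void run() throws IOException;
    }

    /** Runs the step and reports whether it threw an IOException. */
    static boolean failed(Step step) {
        try {
            step.run();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    /** Shared flag: a failure in either phase satisfies the final check. */
    static boolean sharedFlagCheck(Step phase1, Step phase2) {
        boolean hitException = false;
        hitException |= failed(phase1);
        hitException |= failed(phase2);
        return hitException;
    }

    /** Dedicated flag: only a failure in phase 2 satisfies the final check. */
    static boolean dedicatedFlagCheck(Step phase1, Step phase2) {
        failed(phase1); // the phase-1 outcome is deliberately not folded into the result
        boolean phase2Failed = failed(phase2);
        return phase2Failed;
    }

    public static void main(String[] args) {
        Step throwing = () -> { throw new IOException("injected"); };
        Step succeeding = () -> {};

        // Phase 1 fails unexpectedly; phase 2, which should fail, does not.
        System.out.println(sharedFlagCheck(throwing, succeeding));    // true: the missing failure is masked
        System.out.println(dedicatedFlagCheck(throwing, succeeding)); // false: the missing failure is visible
    }
}
```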


coderabbitai · 2026-01-23T18:35:47Z

test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java

+
+                @Override
+                public IndexResult index(Index index) throws IOException {
+                    if (config().getIndexSettings().isContextAwareEnabled()) {
+                        for (ParseContext.Document doc : index.docs()) {
+                            doc.setGroupingCriteria("grouping_criteria");
+                        }
+                    }
+
+                    return super.index(index);
+                }


⚠️ Potential issue | 🟡 Minor

Avoid overwriting an explicit grouping criteria.
The unconditional setGroupingCriteria("grouping_criteria") will clobber tenant-specific values set by testContextSpecificDocumentWithTenantField(...). Consider only applying a default when none is provided.

🔧 Suggested guard to preserve caller-provided criteria

-                    if (config().getIndexSettings().isContextAwareEnabled()) {
-                        for (ParseContext.Document doc : index.docs()) {
-                            doc.setGroupingCriteria("grouping_criteria");
-                        }
-                    }
+                    if (config().getIndexSettings().isContextAwareEnabled()) {
+                        for (ParseContext.Document doc : index.docs()) {
+                            if (doc.getGroupingCriteria() == null) {
+                                doc.setGroupingCriteria("grouping_criteria");
+                            }
+                        }
+                    }
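The "apply a default only when none is provided" guard can be shown in isolation. This sketch makes several assumptions: `Doc` is a hypothetical stand-in for `ParseContext.Document` that models only the grouping field, and whether the real class exposes a `getGroupingCriteria()` getter is taken from the review suggestion rather than verified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the pull request.
final class DefaultGroupingDemo {

    /** Stand-in for ParseContext.Document; only the grouping field is modelled. */
    static final class Doc {
        private String groupingCriteria;

        String getGroupingCriteria() {
            return groupingCriteria;
        }

        void setGroupingCriteria(String value) {
            this.groupingCriteria = value;
        }
    }

    /** Applies the default only to documents that carry no grouping criteria yet. */
    static void applyDefault(List<Doc> docs, String defaultCriteria) {
        for (Doc doc : docs) {
            if (doc.getGroupingCriteria() == null) {
                doc.setGroupingCriteria(defaultCriteria);
            }
        }
    }

    public static void main(String[] args) {
        Doc tenantDoc = new Doc();
        tenantDoc.setGroupingCriteria("tenant-a"); // caller-provided value
        Doc plainDoc = new Doc();                  // no value provided

        List<Doc> docs = new ArrayList<>();
        docs.add(tenantDoc);
        docs.add(plainDoc);
        applyDefault(docs, "grouping_criteria");

        System.out.println(tenantDoc.getGroupingCriteria()); // tenant-a
        System.out.println(plainDoc.getGroupingCriteria());  // grouping_criteria
    }
}
```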


github-actions · 2026-01-23T19:34:56Z

❌ Gradle check result for 231cb15: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>

github-actions · 2026-01-23T20:50:21Z

❌ Gradle check result for dbedebd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-01-23T21:33:29Z

❌ Gradle check result for dbedebd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Bukhtawar

MockDirectoryWrapper already provides a framework for chaos testing. Have we evaluated or used it already?

RS146BIJAY · 2026-02-02T09:55:42Z

> MockDirectoryWrapper already provides a framework for chaos testing. Have we evaluated or used it already?

We are already using MockDirectoryWrapper to simulate the disk-full and out-of-file-handles scenarios.
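For readers unfamiliar with this style of fault injection, the idea can be sketched with the JDK alone. The code below is not `MockDirectoryWrapper` and does not reproduce its API. `QuotaOutputStream` and `writeUntilFull` are hypothetical names for a plain-Java analogue that fails writes once a byte budget is used up, roughly how a disk-full condition would surface to a writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical demo class; a plain-JDK analogue, not Lucene's MockDirectoryWrapper.
final class QuotaOutputStreamDemo {

    /** Wrapper that fails writes once a fixed byte budget is exhausted. */
    static final class QuotaOutputStream extends OutputStream {
        private final OutputStream delegate;
        private long remaining;

        QuotaOutputStream(OutputStream delegate, long maxBytes) {
            this.delegate = delegate;
            this.remaining = maxBytes;
        }

        @Override
        public void write(int b) throws IOException {
            if (remaining <= 0) {
                throw new IOException("simulated disk full");
            }
            delegate.write(b);
            remaining--;
        }
    }

    /** Writes documents one by one; returns how many were written completely before the injected failure. */
    static int writeUntilFull(OutputStream out, byte[] doc, int maxDocs) {
        int written = 0;
        try {
            for (int i = 0; i < maxDocs; i++) {
                out.write(doc);
                written++;
            }
        } catch (IOException e) {
            // Expected once the quota is exhausted.
        }
        return written;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream out = new QuotaOutputStream(sink, 25);
        int written = writeUntilFull(out, new byte[10], 5);
        System.out.println(written);     // 2 complete documents
        System.out.println(sink.size()); // 25 bytes, including a partial third document
    }
}
```

The partial third document is the interesting case: a chaos test then has to check that whatever consumes the data tolerates or discards such torn writes.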

opensearch-trigger-bot · 2026-03-04T15:26:01Z

This PR is stalled because it has been open for 30 days with no activity.

RS146BIJAY requested review from a team, Bukhtawar, CEHENKLE, Rishikesh1159, anasalkouz, andrross, ashking94, cwperks, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, owaiskazi19, reta, sachinpkale, saratvemulapalli, shwetathareja and sohami as code owners January 23, 2026 18:24

RS146BIJAY changed the title ~~Adding chaos testing for Context Aware Segments~~ Adding chaos test cases for Context Aware Segments Jan 23, 2026

coderabbitai bot reviewed Jan 23, 2026


Adding chaos testing for Context Aware Segments

dbedebd

Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>

RS146BIJAY force-pushed the chaos-testing branch from 231cb15 to dbedebd Compare January 23, 2026 20:36

Bukhtawar reviewed Feb 1, 2026


opensearch-ci-bot mentioned this pull request Feb 18, 2026

[AUTOCUT] Gradle Check Flaky Test Report for FullClusterRestartIT #15813

Open

opensearch-trigger-bot bot added the stalled Issues that have stalled label Mar 4, 2026

srikanthpadakanti mentioned this pull request Mar 12, 2026

[BUG] Shard routings for closed index are allocated again without opening the index #20200

Open

opensearch-ci-bot mentioned this pull request Mar 27, 2026

[AUTOCUT] Gradle Check Flaky Test Report for CloneSnapshotIT #16115

Open
