Adding chaos test cases for Context Aware Segments#20468
Adding chaos test cases for Context Aware Segments#20468RS146BIJAY wants to merge 1 commit intoopensearch-project:mainfrom
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughThis pull request introduces test infrastructure enhancements for engine reliability testing. It adds a utility class for suppressing merge scheduler exceptions, three new chaos test classes for simulating disk-full, JRE-crash, and file descriptor exhaustion scenarios, and extends the test framework with context-aware document creation helpers. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java`:
- Around line 95-99: The test currently calls started.await() after
testThread.start(), which can hang the CI if the worker fails to signal; modify
the wait to use a bounded timeout (e.g., started.await(timeout, timeUnit)) and
fail the test with a clear error if the latch times out. Update the code around
testThread.start()/started.await() in InternalEngineOnJRECrashTests so the start
latch uses a sensible timeout and asserts that the await returned true (or
throws/records failure) before proceeding.
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java`:
- Around line 44-101: The test reuses hitException for both the initial
low-error phase and the forced-failure phase, which can hide a missing exception
in phase‑2; introduce a separate boolean (e.g., hitForcedFailure) and use it
only for the second try/catch that indexes with
dir.setRandomIOExceptionRateOnOpen(rate) and the subsequent
engine.refresh("testing"), set hitForcedFailure = true in that catch
(EngineException e) and assertTrue(hitForcedFailure) after that block while
keeping the original hitException checks/assertFalse for the first phase (the
variables to change are hitException and the new hitForcedFailure used around
the second indexing/refresh try/catch).
In
`@test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java`:
- Around line 826-836: The override of index(Index index) unconditionally calls
ParseContext.Document.setGroupingCriteria("grouping_criteria") which will
overwrite caller-provided tenant/grouping values (e.g., those set by
testContextSpecificDocumentWithTenantField); change the logic in the index(Index
index) override (inside the config().getIndexSettings().isContextAwareEnabled()
branch) to only call setGroupingCriteria("grouping_criteria") when the document
has no grouping criteria already (e.g., check the document's existing grouping
value via the appropriate getter or null/empty check before calling
setGroupingCriteria), so provided grouping criteria are preserved.
♻️ Duplicate comments (1)
test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (1)
854-863: Same grouping-criteria overwrite concern here.
This override has the same unconditional clobbering behavior as the earlier block.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.javaserver/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.javaserver/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.javaserver/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.javatest/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (3)
libs/core/src/main/java/org/opensearch/core/common/bytes/BytesArray.java (1)
BytesArray(48-151)server/src/main/java/org/opensearch/index/engine/InternalEngine.java (1)
InternalEngine(143-3057)server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)
FeatureFlags(29-341)
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java (2)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)
IndexWriterUtil(15-24)server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)
FeatureFlags(29-341)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: gradle-check
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: Mend Security Check
🔇 Additional comments (7)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)
15-23: LGTM — utility cleanly suppresses merge-scheduler noise in tests.test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (2)
398-403: Nice addition for tenant-scoped docs.
1695-1696: LGTM — mapping update aligns with context-aware grouping.server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java (2)
90-133: LGTM — refresh failure path is asserted clearly.
71-83: Improve exception handling to catch the actual failures from disk-full.
The suggestedIOException | AlreadyClosedExceptiondoes not account forRefreshFailedEngineException, which is thrown byengine.refresh()during disk-full conditions (confirmed by the companion testtestEngineRefreshOnDiskFullthat explicitly expects this). Since the loop calls bothengine.index()andengine.refresh(), catchEngineExceptioninstead, which coversRefreshFailedEngineExceptionand provides a more precise failure specification than broadException.🧹 Suggested tightening
- } catch (Exception ex) { + } catch (EngineException ex) { hitException = true; }Likely an incorrect or invalid review comment.
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (2)
37-55: LGTM — workload loop is straightforward and tenant-aware.
122-140: LGTM — good coverage across crash delays.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| // Start the indexing thread. | ||
| testThread.start(); | ||
| // Wait for the indexing thread to start. | ||
| started.await(); | ||
|
|
There was a problem hiding this comment.
Add a timeout to the start latch to prevent hangs.
started.await() without a timeout can hang CI if the worker thread fails to start.
⏱️ Suggested safeguard
- started.await();
+ assertTrue(started.await(5, TimeUnit.SECONDS));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Start the indexing thread. | |
| testThread.start(); | |
| // Wait for the indexing thread to start. | |
| started.await(); | |
| // Start the indexing thread. | |
| testThread.start(); | |
| // Wait for the indexing thread to start. | |
| assertTrue(started.await(5, TimeUnit.SECONDS)); | |
🤖 Prompt for AI Agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java`
around lines 95 - 99, The test currently calls started.await() after
testThread.start(), which can hang the CI if the worker fails to signal; modify
the wait to use a bounded timeout (e.g., started.await(timeout, timeUnit)) and
fail the test with a clear error if the latch times out. Update the code around
testThread.start()/started.await() in InternalEngineOnJRECrashTests so the start
latch uses a sensible timeout and asserts that the await returned true (or
throws/records failure) before proceeding.
| boolean hitException = false; | ||
| final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED); | ||
| final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings( | ||
| "test", | ||
| Settings.builder() | ||
| .put(defaultSettings.getSettings()) | ||
| .put(IndexSettings.INDEX_CONTEXT_AWARE_ENABLED_SETTING.getKey(), true) | ||
| .build() | ||
| ); | ||
| try ( | ||
| Store store = createStore(dir); | ||
| InternalEngine engine = createEngine( | ||
| indexSettings, | ||
| store, | ||
| primaryTranslogDir, | ||
| newMergePolicy(), | ||
| indexWriterFactory, | ||
| null, | ||
| globalCheckpoint::get | ||
| ) | ||
| ) { | ||
| int numDocsFirstSegment = randomIntBetween(50, 100); | ||
| try { | ||
| for (int i = 0; i < numDocsFirstSegment; i++) { | ||
| String id = Integer.toString(i); | ||
| ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null); | ||
| engine.index(indexForDoc(doc)); | ||
| } | ||
| } catch (IOException ex) { | ||
| hitException = true; | ||
| } | ||
|
|
||
| assertFalse(hitException); | ||
| assertTrue(DirectoryReader.indexExists(dir)); | ||
|
|
||
| try { | ||
| engine.refresh("testing"); | ||
| } catch (EngineException e) { | ||
| hitException = true; | ||
| } | ||
|
|
||
| assertTrue(DirectoryReader.indexExists(dir)); | ||
| rate = 1.0; | ||
| dir.setRandomIOExceptionRateOnOpen(rate); | ||
| try { | ||
| for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) { | ||
| String id = Integer.toString(i); | ||
| ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null); | ||
| engine.index(indexForDoc(doc)); | ||
| } | ||
|
|
||
| engine.refresh("testing"); | ||
| } catch (EngineException e) { | ||
| hitException = true; | ||
| } | ||
|
|
||
| assertTrue(hitException); | ||
| assertTrue(DirectoryReader.indexExists(dir)); |
There was a problem hiding this comment.
Separate failure flags for the two phases.
hitException is reused across the low-rate and 100% rate phases, so a phase‑1 failure can mask a missing phase‑2 failure. A dedicated flag for the forced-failure phase makes the assertion precise.
🧪 Suggested adjustment
- rate = 1.0;
+ boolean phase2Failed = false;
+ rate = 1.0;
dir.setRandomIOExceptionRateOnOpen(rate);
try {
for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) {
String id = Integer.toString(i);
ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
engine.index(indexForDoc(doc));
}
engine.refresh("testing");
} catch (EngineException e) {
- hitException = true;
+ phase2Failed = true;
}
- assertTrue(hitException);
+ assertTrue(phase2Failed);🤖 Prompt for AI Agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java`
around lines 44 - 101, The test reuses hitException for both the initial
low-error phase and the forced-failure phase, which can hide a missing exception
in phase‑2; introduce a separate boolean (e.g., hitForcedFailure) and use it
only for the second try/catch that indexes with
dir.setRandomIOExceptionRateOnOpen(rate) and the subsequent
engine.refresh("testing"), set hitForcedFailure = true in that catch
(EngineException e) and assertTrue(hitForcedFailure) after that block while
keeping the original hitException checks/assertFalse for the first phase (the
variables to change are hitException and the new hitForcedFailure used around
the second indexing/refresh try/catch).
|
|
||
| @Override | ||
| public IndexResult index(Index index) throws IOException { | ||
| if (config().getIndexSettings().isContextAwareEnabled()) { | ||
| for (ParseContext.Document doc : index.docs()) { | ||
| doc.setGroupingCriteria("grouping_criteria"); | ||
| } | ||
| } | ||
|
|
||
| return super.index(index); | ||
| } |
There was a problem hiding this comment.
Avoid overwriting an explicit grouping criteria.
The unconditional setGroupingCriteria("grouping_criteria") will clobber tenant-specific values set by testContextSpecificDocumentWithTenantField(...). Consider only applying a default when none is provided.
🔧 Suggested guard to preserve caller-provided criteria
- if (config().getIndexSettings().isContextAwareEnabled()) {
- for (ParseContext.Document doc : index.docs()) {
- doc.setGroupingCriteria("grouping_criteria");
- }
- }
+ if (config().getIndexSettings().isContextAwareEnabled()) {
+ for (ParseContext.Document doc : index.docs()) {
+ if (doc.getGroupingCriteria() == null) {
+ doc.setGroupingCriteria("grouping_criteria");
+ }
+ }
+ }🤖 Prompt for AI Agents
In `@test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java`
around lines 826 - 836, The override of index(Index index) unconditionally calls
ParseContext.Document.setGroupingCriteria("grouping_criteria") which will
overwrite caller-provided tenant/grouping values (e.g., those set by
testContextSpecificDocumentWithTenantField); change the logic in the index(Index
index) override (inside the config().getIndexSettings().isContextAwareEnabled()
branch) to only call setGroupingCriteria("grouping_criteria") when the document
has no grouping criteria already (e.g., check the document's existing grouping
value via the appropriate getter or null/empty check before calling
setGroupingCriteria), so provided grouping criteria are preserved.
|
❌ Gradle check result for 231cb15: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>
231cb15 to
dbedebd
Compare
|
❌ Gradle check result for dbedebd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for dbedebd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Bukhtawar
left a comment
There was a problem hiding this comment.
MockDirectoryWrapper already provides a framework for chaos testing. Have we evaluated or used it already?
We are using |
|
This PR is stalled because it has been open for 30 days with no activity. |
Description
Adding chaos testing for Context Aware Segments during indexing, refresh and update/delete operations for following scenarios: