Skip to content

CCR: Remove non-primary origin assertions from operation planners#20952

Closed
mishail wants to merge 2 commits into
opensearch-project:mainfrom
mishail:ms/3.6-ccr-fix-oss
Closed

CCR: Remove non-primary origin assertions from operation planners#20952
mishail wants to merge 2 commits into
opensearch-project:mainfrom
mishail:ms/3.6-ccr-fix-oss

Conversation

@mishail
Copy link
Copy Markdown
Contributor

@mishail mishail commented Mar 21, 2026

Disclaimer: portions of this text were reformatted and edited with the assistance of generative AI tools for clarity and readability. The analysis and technical content originate from the author.

Description

OpenSearch commit 59be6ae / #20585 refactored InternalEngine to extract its two large inner static classes into separate top-level files:

  • InternalEngine.IndexingStrategy → top-level IndexingStrategy extends OperationStrategy
  • InternalEngine.DeletionStrategy → top-level DeletionStrategy extends OperationStrategy

It also introduced IndexingStrategyPlanner and DeletionStrategyPlanner — classes that encapsulate the planning logic previously inlined in InternalEngine.

The refactor broke CCR's ReplicationEngine in two ways. See opensearch-project/cross-cluster-replication#1646

  1. Compilation failure

ReplicationEngine.kt references IndexingStrategy and DeletionStrategy. These were previously inner static classes of InternalEngine; they are now top-level classes in org.opensearch.index.engine and require explicit imports. Two lines change on the CCR side.

  1. Runtime assertion failure
    This affects a lot of tests in CCR at the moment.
java.lang.AssertionError: planing as primary but got PRIMARY
    at org.opensearch.index.engine.OperationStrategyPlanner.assertNonPrimaryOrigin(OperationStrategyPlanner.java:41)
    at org.opensearch.index.engine.IndexingStrategyPlanner.planOperationAsNonPrimary(IndexingStrategyPlanner.java:162)
    at org.opensearch.index.engine.InternalEngine.planIndexingAsNonPrimary(InternalEngine.java:1030)
    at org.opensearch.replication.ReplicationEngine.indexingStrategyForOperation(ReplicationEngine.kt:34)
    at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:929)
    ...
    at org.opensearch.replication.action.replay.TransportReplayChangesAction.performOnPrimary(TransportReplayChangesAction.kt:126)

And that happens because of introduction of OperationStrategyPlanners and more specifically because those planners now have their own assertNonPrimaryOrigin, which CCR's ReplicationEngine has no way to override.

The old way:

  • ReplicationEngine.indexingStrategyForOperation always calls planIndexingAsNonPrimary
  • ReplicationEngine.assertNonPrimaryOrigin overrides InternalEngine.assertNonPrimaryOrigin to return true, bypassing the origin check inside InternalEngine.planIndexingAsNonPrimary and planDeletionAsNonPrimary

After the refactor, InternalEngine.planIndexingAsNonPrimary and planDeletionAsNonPrimary delegate to private IndexingStrategyPlanner / DeletionStrategyPlanner instances. Both planners call assertNonPrimaryOrigin(operation) internally, which comes from OperationStrategyPlanner, which is entirely separate from InternalEngine.assertNonPrimaryOrigin. So ReplicationEngine.assertNonPrimaryOrigin is never invoked, and the assertion fires.

Proposed solution:

Remove the assert assertNonPrimaryOrigin(...) calls from IndexingStrategyPlanner.planOperationAsNonPrimary (line 162) and DeletionStrategyPlanner.planOperationAsNonPrimary (line 117).

Rationale: These asserts call assertNonPrimaryOrigin through the OperationStrategyPlanner interface's default method, a path that bypasses InternalEngine.assertNonPrimaryOrigin and any overrides defined in InternalEngine subclasses (such as ReplicationEngine). This breaks InternalEngine's established extensibility contract: subclasses that legitimately override assertNonPrimaryOrigin can no longer influence this check. Removing the asserts from the planners restores that contract. The check that matters — the one that IS overridable — remains in InternalEngine.planIndexingAsNonPrimary and planDeletionAsNonPrimary.

Pros:

  • Minimal change: 2 assert lines removed in OpenSearch, 2 imports added in CCR
  • Restores the pre-existing extensibility contract without adding new API surface
  • Rollback is trivial

Cons:

  • Removes a precondition guard from the planner classes themselves; planners no longer self-enforce that they are called in the correct context
  • Does not provide a clean extension point for future subclasses that may need similar customization of planner behavior

Related Issues

Resolves opensearch-project/cross-cluster-replication#1646

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Mikhail Stepura <mstepura@apple.com>
@github-actions github-actions Bot added Build Libraries & Interfaces enhancement Enhancement or improvement to existing feature or request labels Mar 21, 2026
@mishail mishail changed the title Remove non-primary origin assertions from operation planners CCR: Remove non-primary origin assertions from operation planners Mar 21, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ef96603.

PathLineSeverityDescription
server/src/main/java/org/opensearch/index/engine/DeletionStrategyPlanner.java117mediumRemoval of `assert assertNonPrimaryOrigin(delete)` in planOperationAsNonPrimary weakens the origin validation check that ensures operations processed as non-primary actually originate from non-primary sources. While Java assertions are disabled by default in production, this check also documents and enforces a contract about operation origins during development/testing. Combined with the parallel removal in IndexingStrategyPlanner, the simultaneous removal of both non-primary origin guards across indexing and deletion paths is suspicious and warrants investigation to confirm there is a legitimate reason (e.g., a known edge case where the assertion incorrectly fires) rather than an attempt to silently allow primary-origin operations to be processed through the non-primary path without scrutiny.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@mishail
Copy link
Copy Markdown
Contributor Author

mishail commented Mar 21, 2026

Another option, if we do want to keep those assertions in Planners, would be adding protected factory methods to InternalEngine that create the planner instances, allowing subclasses to substitute custom planner implementations:

// In InternalEngine:
protected IndexingStrategyPlanner createIndexingStrategyPlanner(...) {
    return new IndexingStrategyPlanner(...);
}
protected DeletionStrategyPlanner createDeletionStrategyPlanner(...) {
    return new DeletionStrategyPlanner(...);
}

@mishail
Copy link
Copy Markdown
Contributor Author

mishail commented Mar 21, 2026

cc @shank9918

@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

@mishail Can you please sync with main? There are a few test fixes that this feature branch would need to get for CI checks to pass.

@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

java.lang.AssertionError: planing as primary but got PRIMARY

@mishail Can we please fix this typo while at it? maybe even change the text entirely. At least for me, its hard to tell what the purpose is.

After inspecting the call sites, I agree these calls are redundant.

cwperks
cwperks previously approved these changes Mar 24, 2026
@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

@mishail FYI I updated to include the latest commits on main which includes test fixes. You may need to pull before push. I left one comment about updating an error message. That can either be addressed here or a separate PR, its not a blocker.

@github-actions
Copy link
Copy Markdown
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 71ab4d4.

PathLineSeverityDescription
server/src/main/java/org/opensearch/index/engine/DeletionStrategyPlanner.java117lowRemoval of `assert assertNonPrimaryOrigin(delete)` in planOperationAsNonPrimary weakens origin validation for non-primary deletion operations. Java assertions are disabled by default in production, so this has minimal runtime impact, but it removes a guard that catches incorrect operation routing during testing. The same pattern is mirrored in IndexingStrategyPlanner.java. Could be a legitimate fix if the assertion was too strict, but warrants review to confirm there is equivalent validation elsewhere.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Removed Assertion

The assertNonPrimaryOrigin assertion in planOperationAsNonPrimary was removed. This assertion guards against incorrectly calling the non-primary planning path with a PRIMARY origin operation. Without it, a PRIMARY-origin operation could silently proceed through the non-primary planning path, potentially causing data inconsistency or incorrect versioning. The fix should instead make ReplicationEngine pass a non-primary origin, not remove the guard that catches misuse.

public IndexingStrategy planOperationAsNonPrimary(Engine.Index index) throws IOException {
    // needs to maintain the auto_id timestamp in case this replica becomes primary
    if (canOptimizeAddDocument(index)) {
        mayHaveBeenIndexedBefore(index);
    }
Removed Assertion

Same concern as in IndexingStrategyPlanner: removing assertNonPrimaryOrigin from planOperationAsNonPrimary eliminates a correctness guard. If a PRIMARY-origin delete is routed through this path, it will proceed silently rather than failing fast. The root cause (CCR passing PRIMARY origin) should be fixed at the call site in ReplicationEngine, not by removing the assertion here.

public DeletionStrategy planOperationAsNonPrimary(Engine.Delete delete) throws IOException {
    final DeletionStrategy plan;
    if (hasBeenProcessedBefore.test(delete)) {
        // the operation seq# was processed thus this operation was already put into lucene

@cwperks cwperks dismissed their stale review March 24, 2026 14:12

Looking deeper into CCR plugin

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 71ab4d4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

@mishail how does this fix the issue in CCR repo exactly? I see that they always call planIndexingAsNonPrimary regardless of origin type. Presumably the fix lies in CCR repo?

@mishail
Copy link
Copy Markdown
Contributor Author

mishail commented Mar 24, 2026

@mishail how does this fix the issue in CCR repo exactly? I see that they always call planIndexingAsNonPrimary regardless of origin type. Presumably the fix lies in CCR repo?

@cwperks The problem is that this piece in their ReplicationEngine doesn't work the way it used to work anymore.

    override fun assertNonPrimaryOrigin(operation: Operation): Boolean {
        return true
    }

After the refactor, InternalEngine.planIndexingAsNonPrimary and planDeletionAsNonPrimary delegate to private IndexingStrategyPlanner / DeletionStrategyPlanner instances. Both planners call assertNonPrimaryOrigin(operation) internally, which comes from OperationStrategyPlanner, which is entirely separate from InternalEngine.assertNonPrimaryOrigin. So ReplicationEngine.assertNonPrimaryOrigin is never invoked, and the assertion fires.

@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

@mishail I'm not deep in this area unfortunately, but can the ReplicationEngine in CCR follow same pattern as InternalEngine?

See comment here: https://github.com/opensearch-project/cross-cluster-replication/pull/1647/changes#r2982353370

protected DeletionStrategy deletionStrategyForOperation(final Delete delete) throws IOException {
    if (delete.origin() == Operation.Origin.PRIMARY) {
        return planDeletionAsPrimary(delete);
    } else {
        // non-primary mode (i.e., replica or recovery)
        return planDeletionAsNonPrimary(delete);
    }
}

@mishail
Copy link
Copy Markdown
Contributor Author

mishail commented Mar 24, 2026

@cwperks are we going to use opensearch-project/cross-cluster-replication#1647 to fix the issue?

@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

@cwperks are we going to use opensearch-project/cross-cluster-replication#1647 to fix the issue?

yes and I think that's the appropriate place to put it because this is logic specific to the CCR repo. My understanding from @mohit10011999 is that in CCR all indexing operations to the follower cluster are to be treated as replicas so the nature of that fix is to "spoof" the index operation to always say its origin replica by nature of it being an indexing op on the following cluster.

@cwperks
Copy link
Copy Markdown
Member

cwperks commented Mar 24, 2026

Do let me know if you have other thoughts though.

@mishail mishail closed this Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Build Libraries & Interfaces enhancement Enhancement or improvement to existing feature or request skip-changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Plugin is not compatible with the latest OpenSearch 3.6.0-SNAPSHOT

2 participants