[Segment Replication] Add cancellation support in RemoteStoreReplicationSource#9234
[Segment Replication] Add cancellation support in RemoteStoreReplicationSource#9234dreamer-89 wants to merge 7 commits intoopensearch-project:mainfrom
Conversation
server/src/test/java/org/opensearch/index/shard/SegmentReplicationIndexShardTests.java
Show resolved
Hide resolved
|
Compatibility status: |
Gradle Check (Jenkins) Run Completed with:
|
Issue is not repro'able locally. |
mch2
left a comment
There was a problem hiding this comment.
Can we run
with remote store enabled to ensure resources are cleaned up on close?
server/src/main/java/org/opensearch/indices/replication/RemoteStoreReplicationSource.java
Outdated
Show resolved
Hide resolved
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## main #9234 +/- ##
============================================
+ Coverage 71.02% 71.13% +0.11%
- Complexity 57365 57432 +67
============================================
Files 4776 4776
Lines 270809 270806 -3
Branches 39582 39582
============================================
+ Hits 192331 192647 +316
+ Misses 62271 62026 -245
+ Partials 16207 16133 -74
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Gradle Check (Jenkins) Run Completed with:
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Gradle Check (Jenkins) Run Completed with:
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Gradle Check (Jenkins) Run Completed with:
|
|
Gradle Check (Jenkins) Run Completed with:
|
| checkpointInfoListener.whenComplete(checkpointInfo -> { | ||
| final List<StoreFileMetadata> filesToFetch = getFiles(checkpointInfo); | ||
| state.setStage(SegmentReplicationState.Stage.GET_FILES); | ||
| cancellableThreads.checkForCancel(); |
There was a problem hiding this comment.
nit - if we are moving to a sync cancel here and wrapping calls to the replication source I don't think this class needs the checkForCancel checks anymore before invoking the source methods.
There was a problem hiding this comment.
We can remove checkForCancel call before/after ReplicationSource call. The remaining checkForCancel would help in fail-fast when cancellation happens after source calls e.g. cancellation just after getSegmentFiles method call but before finalizeReplication inside SegmentReplicationTarget.
There was a problem hiding this comment.
I see remaining checkForCancel() calls are right after the ReplicationSource method calls (first statement inside SegmentReplicationTarget.getFiles, SegmentReplicationTarget.finalizeReplication), which means we can remove these calls as well. BUT, it is still possible that cancellation happens outside of replication source method calls, in which case resulting in non-graceful cancellations. I think we should also wrap getFiles and finalizeReplication method calls inside CancellationThreads.execute(). This is required as target contains heavy-weight operations such as finalizeReplication (reads commit from disk) and there are chances that cancellation happens during method execution.
server/src/test/java/org/opensearch/index/shard/RemoteIndexShardTests.java
Outdated
Show resolved
Hide resolved
…ionSource Signed-off-by: Suraj Singh <surajrider@gmail.com> Self review Signed-off-by: Suraj Singh <surajrider@gmail.com> Address review comments Signed-off-by: Suraj Singh <surajrider@gmail.com> Address review comments Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Suraj Singh <surajrider@gmail.com>
|
|
||
| class TestRSReplicationSource extends RemoteStoreReplicationSource { | ||
|
|
||
| private final Thread beforeCheckpoint; |
There was a problem hiding this comment.
Note: Created threads vs Runnable to more closely mimic the actual scenario where cancellations happen in separate threads. Sample trace showing segrep event & cancellation invoke in separate threads (note Thread %tid in logs below).
[2023-08-21T19:42:05,088][INFO ][o.o.i.r.SegmentReplicationTarget] [org.opensearch.index.shard.RemoteIndexShardTests] [test][0] [Thread 42] Starting Replication Target: Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource] with thread opensearch[org.opensearch.index.shard.RemoteIndexShardTests][generic][T#3]
[2023-08-21T19:42:05,088][INFO ][o.o.i.r.SegmentReplicationTarget] [[Thread-5]] [test][0] [Thread 46] Cancelling replication for target Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource]
[2023-08-21T19:42:05,090][ERROR][o.o.i.r.SegmentReplicationTargetService] [org.opensearch.index.shard.RemoteIndexShardTests] [Thread 42] Error during segment replication, Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource]
...
Gradle Check (Jenkins) Run Completed with:
|
|
@dreamer-89 FYI #9499 - Not sure if your changes here will fix this or not but would like to first determine the source of flakiness here. |
|
Discussed separately with @mch2, there is no data point around whether this change is needed for 2.10.0 release provided existing tests are stable and we are close to release. We may need to revisit this change, identify use-cases where it helps. I will keep the PR open in case we identify it is needed. |
|
This PR is stalled because it has been open for 30 days with no activity. |
|
Cancellation support was added as part of #10725, closing this stalled PR. |

Description
Adds cancellation to RemoteStoreReplicationSource and checks for event cancellation in
getCheckpointMetadataandgetSegmentFilesmethod calls. This is useful as it allows:Related Issues
Resolves #8089
Testing
One test
testDropRandomNodeDuringReplicationfailure due to assertion trip which seems unrelated. Ran test againstmainand I am still seeing this failure. So, this is un-related to this change.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.