Fix segment replication bug during primary relocation #18944
ashking94 merged 3 commits into opensearch-project:main from
Conversation
Signed-off-by: Ashish Singh <ssashish@amazon.com>
|
@mch2 @andrross @getsaurabh02 @sachinpkale @Bukhtawar - This one is a small PR for fixing a flaky test before 3.2 release. Can you help with the review? |
|
❌ Gradle check result for 2d76208: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@ashking94 thanks for fixing this. Not for now but with node-node segrep I'm thinking we could also change the replication source to fetch segments from the publisher of the cp vs today where we rely on the cluster state lookup of the active primary. This would allow us to replicate from non primary nodes. |
|
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Sure, Marc. This does make sense. |
Codecov Report ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:
@@             Coverage Diff              @@
##               main    #18944      +/-   ##
============================================
- Coverage     72.89%    72.85%    -0.04%
- Complexity    69318     69340       +22
============================================
  Files          5642      5642
  Lines        318636    318640        +4
  Branches      46107     46108        +1
============================================
- Hits         232254    232138      -116
- Misses        67540     67752      +212
+ Partials      18842     18750       -92
View full report in Codecov by Sentry.
|
|
The backport to 2.19 failed. To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.19 2.19
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.19
# Create a new branch
git switch --create backport/backport-18944-to-2.19
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 251cc3603e73405cc047e192663e8b5dbaa1c61d
# Push it to GitHub
git push --set-upstream origin backport/backport-18944-to-2.19
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.19
Then, create a pull request where the base branch is 2.19 and the compare/head branch is backport/backport-18944-to-2.19. |
* Fix segment replication bug during primary relocation
* Fix applicable for segrep local indexes only
(cherry picked from commit 251cc36)
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ject#18944) * Fix segment replication bug during primary relocation Signed-off-by: Ashish Singh <ssashish@amazon.com> * Fix applicable for segrep local indexes only Signed-off-by: Ashish Singh <ssashish@amazon.com> --------- Signed-off-by: Ashish Singh <ssashish@amazon.com>
|
Raised manual backport for 2.19 - #18958 |
* Fix segment replication bug during primary relocation * Fix applicable for segrep local indexes only --------- (cherry picked from commit 251cc36) Signed-off-by: Ashish Singh <ssashish@amazon.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after the restart, the primary may have rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Fixes opensearch-project#19234
* Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after the restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the #18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Fixes #19234

* Fix incorrect mock chaining in unit tests

The chained mock syntax when(spyIndexShard.routingEntry().initializing()) doesn't work as intended because routingEntry() returns a real ShardRouting object, not a mock. Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false) - should reject stale checkpoint
2. Recovering shard (initializing=true) - should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix ReplicationCheckpoint constructor in tests
* Fix code formatting
* Add CHANGELOG entry

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries. This fix distinguishes between two scenarios: 1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains opensearch-project#18944 fix) 2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart Added unit tests to verify: - Stale checkpoint is rejected during normal replication - Stale checkpoint is accepted during shard recovery Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of AbstractSegmentReplicationTarget.java (which was introduced in later versions). Signed-off-by: Cuong Ha <cuong.ha@optimizely.com> Fixes opensearch-project#19234
#20498) * [2.19] Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after the restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the #18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of AbstractSegmentReplicationTarget.java (which was introduced in later versions).

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Fixes #19234

* Fix javadoc syntax error in SearchPhase
* Fix ReplicationCheckpoint constructor in unit tests
* Fix code formatting
* Add CHANGELOG entry

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
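The two-scenario decision described in the commit messages above can be sketched as follows. This is a hedged illustration with made-up names (RecoveryAwareValidationSketch, acceptCheckpoint, sourceBehindReplica), not the real SegmentReplicationTarget code: an apparently stale source checkpoint is rejected during normal replication, but accepted while the shard is recovering (INITIALIZING or RELOCATING), as happens after a rolling restart.

```java
// Hedged sketch of the follow-up fix's decision logic; illustrative names
// only, not the actual OpenSearch implementation.
public class RecoveryAwareValidationSketch {

    enum ShardState { INITIALIZING, RELOCATING, STARTED }

    static boolean acceptCheckpoint(ShardState state, boolean sourceBehindReplica) {
        if (!sourceBehindReplica) {
            return true; // source is up to date: always safe to proceed
        }
        // Only a recovering shard may legitimately see an older primary state
        // (e.g. after a rolling restart), so the strict rejection from #18944
        // applies to started shards only.
        return state == ShardState.INITIALIZING || state == ShardState.RELOCATING;
    }

    public static void main(String[] args) {
        // Normal replication: a stale source is rejected.
        if (acceptCheckpoint(ShardState.STARTED, true)) throw new AssertionError();
        // Recovery: the primary's older state is accepted.
        if (!acceptCheckpoint(ShardState.INITIALIZING, true)) throw new AssertionError();
        // An up-to-date source is always accepted.
        if (!acceptCheckpoint(ShardState.STARTED, false)) throw new AssertionError();
        System.out.println("recovery-aware validation ok");
    }
}
```

The key design point, per the commit message, is that relaxing the check only for INITIALIZING/RELOCATING shards preserves the original #18944 protection for active shards.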
Description
After the bug that led to an infinite loop of segment replication was fixed in PR #18636, the FullRollingRestartIT test became flaky, as seen in #18490. On deeper analysis, I found that this happens due to a race condition during primary shard relocation. On primary shard relocation, the new primary has a bumped-up segment infos generation and version, which is broadcast to all of its replicas via the checkpoint publisher. This happens around the same time the shard_started action is sent to the active cluster manager to report that the primary handover completed successfully. Under certain conditions, the replica received the latest checkpoint from the new primary before the cluster applier service had applied the updated cluster state, which led the replica to reach out to the old primary for the segment infos. The issue has a small probability of occurring for indexes that receive no ingestion during relocation after the permits have been acquired on the older primary.
With this PR, the replica validates an incoming checkpoint against the latest checkpoint it has already received and rejects replication rounds that would sync from a stale source, so it no longer fetches segment infos from an old primary that is behind its last received checkpoint. The fix applies to segrep local indexes only.
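The staleness check can be illustrated with a minimal sketch. All names here (CheckpointValidationSketch, Checkpoint, isAheadOf, shouldRejectSource) are hypothetical and do not reproduce OpenSearch's real ReplicationCheckpoint API; the sketch only shows the idea that a replica which has already received a newer checkpoint from the new primary must not sync from the stale old primary.

```java
// Hedged sketch, not the actual OpenSearch API.
public class CheckpointValidationSketch {

    // Ordered by primary term first, then segment infos version, mirroring how
    // a relocation bumps the new primary's generation/version.
    record Checkpoint(long primaryTerm, long segmentInfosVersion) {
        boolean isAheadOf(Checkpoint other) {
            return primaryTerm > other.primaryTerm
                || (primaryTerm == other.primaryTerm
                    && segmentInfosVersion > other.segmentInfosVersion);
        }
    }

    // Reject the replication round when the source serves an older state than
    // the replica's last received checkpoint (i.e. we reached the old primary
    // because the cluster state update had not been applied yet).
    static boolean shouldRejectSource(Checkpoint lastReceived, Checkpoint servedBySource) {
        return lastReceived.isAheadOf(servedBySource);
    }

    public static void main(String[] args) {
        Checkpoint fromNewPrimary = new Checkpoint(2, 7); // broadcast after handover
        Checkpoint oldPrimary = new Checkpoint(1, 7);     // stale relocation source
        if (!shouldRejectSource(fromNewPrimary, oldPrimary)) throw new AssertionError();
        if (shouldRejectSource(oldPrimary, fromNewPrimary)) throw new AssertionError();
        System.out.println("stale source rejected");
    }
}
```

Rejecting the round lets the replica retry once the applier service has caught up, instead of silently syncing stale segments.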
Related Issues
Resolves #18490
Check List
[ ] API changes companion pull request created, if applicable.
[ ] Public documentation issue/PR created, if applicable.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.