
[RFC] Simplify the ckp verification for the primary shard relocation #20610

@guojialiang92

Description

Recently, several PRs have addressed the "Rejecting stale metadata checkpoint" problem in segment replication, such as #20422 and #20551. All of them take the approach of skipping the checkpoint (ckp) verification. After thinking about this in more depth, I believe there is a cleaner way to solve the problem.

The ckp verification was added in #18944 to fix a flaky test. For ease of reading, I've pasted the key analysis from that PR here.

On deeper analysis, I found that this happens due to a race condition in primary shard relocation. On primary shard relocation, the new primary has a bumped-up segment infos generation and version, which is broadcast to all of its replicas via the checkpoint publisher. This happens around the same time that the shard_started action is sent to the active cluster manager to report that the primary handover succeeded. Under certain conditions, the replica received the latest checkpoint from the new primary before the cluster applier service had applied the updated cluster state. This led the replica to reach out to the old primary for the segment infos. This issue has a slight probability of occurring for indexes that receive no ingestion during relocation after the permits have been acquired on the old primary.

Adding ckp verification does cover the primary shard relocation scenario above, but it also interferes with some normal processing, such as the situations described in #20422 and #20551. There may be other affected situations that have not been discovered yet.

Solution

If we can specifically identify the scenario of primary shard relocation, we can avoid continuously patching the ckp verification logic.

Fortunately, the handoffInProgress state is set during the hand-off phase of peer recovery. We can use this state to require that a shard receiving a GET_CHECKPOINT_INFO request is a started primary shard that is not in the hand-off process.
I have submitted a PR and run tests to verify this. I hope @ashking94 @mch2 @andrross @cuonghm2809 @atris can take a look and provide feedback.
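To make the proposed condition concrete, here is a minimal Java sketch of the check. It is not the code from the PR and does not use real OpenSearch classes: `ShardView`, `canServeCheckpointInfo`, and `ensureCanServeCheckpointInfo` are hypothetical names chosen for illustration, and only the `handoffInProgress` flag and the GET_CHECKPOINT_INFO request come from the description above.

```java
// Hypothetical, simplified model of the proposed guard.
// None of these types are OpenSearch classes; they only illustrate the condition.
final class CheckpointInfoGuard {

    /** Simplified stand-in for the shard state a source node would inspect. */
    static final class ShardView {
        final boolean primary;           // routing entry says this copy is the primary
        final boolean started;           // shard is in the started state
        final boolean handoffInProgress; // set during the hand-off phase of peer recovery

        ShardView(boolean primary, boolean started, boolean handoffInProgress) {
            this.primary = primary;
            this.started = started;
            this.handoffInProgress = handoffInProgress;
        }
    }

    /** Returns true only for a started primary that is not handing off. */
    static boolean canServeCheckpointInfo(ShardView shard) {
        return shard.primary && shard.started && !shard.handoffInProgress;
    }

    /** Rejects a GET_CHECKPOINT_INFO request that reaches an ineligible shard. */
    static void ensureCanServeCheckpointInfo(ShardView shard) {
        if (!canServeCheckpointInfo(shard)) {
            throw new IllegalStateException(
                "shard must be a started primary that is not in hand-off to serve GET_CHECKPOINT_INFO");
        }
    }

    public static void main(String[] args) {
        ShardView startedPrimary = new ShardView(true, true, false);
        ShardView oldPrimaryInHandoff = new ShardView(true, true, true);

        System.out.println(canServeCheckpointInfo(startedPrimary));
        System.out.println(canServeCheckpointInfo(oldPrimaryInHandoff));
    }
}
```

With a guard of this shape, a replica that still points at the old primary during relocation would have its request rejected at the source, so the handling is tied to the relocation scenario itself instead of to checkpoint comparison.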

Related component

Indexing:Replication

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Labels

Indexing:Replication, enhancement
