Skip to content

Writable warm relocation recovery#14670

Closed
nisgoel-amazon wants to merge 4 commits intoopensearch-project:mainfrom
rayshrey:writable-warm-relocation-recovery
Closed

Writable warm relocation recovery#14670
nisgoel-amazon wants to merge 4 commits intoopensearch-project:mainfrom
rayshrey:writable-warm-relocation-recovery

Conversation

@nisgoel-amazon
Copy link
Copy Markdown

Description

Changes are related to warm index recovery and relocation flow. In this change we are not downloading the complete files in replicas.

Related Issues

Resolves #13647

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Nishant Goel <nisgoel@amazon.com>
Signed-off-by: Nishant Goel <nisgoel@amazon.com>
@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Storage:Remote labels Jul 8, 2024
Signed-off-by: Nishant Goel <nisgoel@amazon.com>
@rayshrey rayshrey added the backport 2.x Backport to 2.x branch label Jul 8, 2024
Signed-off-by: Nishant Goel <nisgoel@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jul 8, 2024

❌ Gradle check result for 41fd1ec: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jul 8, 2024

❌ Gradle check result for 3004994: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jul 8, 2024

❌ Gradle check result for d42263b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@nisgoel-amazon nisgoel-amazon requested a review from VachaShah as a code owner July 8, 2024 12:49
// In that case we still commit into the next local generation.
if (incomingGeneration != this.lastReceivedPrimaryGen) {
flush(false, true);
if (engineConfig.getIndexSettings().isStoreLocalityPartial() == false) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this engine will only flush here and on close. I don't think we have a reason to commit on close as we will sync from store on re-open? If thats the case lets push this logic to the flush method call and cover both cases? Please also add a comment to why we are not committing here.

for (String file : uploadedSegments.keySet()) {
long checksum = Long.parseLong(uploadedSegments.get(file).getChecksum());
if (overrideLocal || localDirectoryContains(storeDirectory, file, checksum) == false) {
if (shouldOverrideLocalFiles || localDirectoryContains(storeDirectory, file, checksum) == false) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding all these checks to this method on locality can we create a separate trimmed down sync method for this case?

for (StoreFileMetadata fileMetadata : filesToFetch) {
String file = fileMetadata.name();
assert directoryFiles.contains(file) == false : "Local store already contains the file " + file;
assert directoryFiles.contains(file) == false || indexShard.indexSettings().isStoreLocalityPartial()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not skip the call to getSegmentFiles entirely? That would greatly reduce the changes to SegmentReplicationTarget.

cancellableThreads.checkForCancel();
state.setStage(SegmentReplicationState.Stage.FILE_DIFF);
final Store.RecoveryDiff diff = Store.segmentReplicationDiff(checkpointInfo.getMetadataMap(), indexShard.getSegmentMetadataMap());
final Store.RecoveryDiff diff = Store.segmentReplicationDiff(checkpointInfo.getMetadataMap(), finalReplicaMd);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to compute a diff here if all files are remote and simply return early if data locality is partial. The Checkpoint info call will include the sis bytes which we load onto the readermanager.

verifyStoreContent();
// skipping verify store content over here as readLastCommittedSegmentsInfo files are not present in latest metadata of remote
// store.
if (!warmIndexSegmentReplicationEnabled()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know similar logic was implemented with remote store but I would prefer we not couple these test classes together and instead write new tests where appropriate for warm cases. Or, we break up the SegmentReplicationIT such that reusable tests move to the base IT.


@OpenSearchIntegTestCase.ClusterScope(scope = OpenSearchIntegTestCase.Scope.TEST, numDataNodes = 0)
@ThreadLeakFilters(filters = CleanerDaemonThreadLeakFilter.class)
public class WarmIndexRemoteStoreSegmentReplicationIT extends SegmentReplicationIT {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are likely tests in the regular segrep suite that aren't valid here. Particularly when asserting against store content.

@rayshrey rayshrey added the v2.16.0 Issues and PRs related to version 2.16.0 label Jul 10, 2024
// skipping verify store content over here as readLastCommittedSegmentsInfo files are not present in latest metadata of remote
// store.
if (!warmIndexSegmentReplicationEnabled()) {
verifyStoreContent();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a backlog issue to fix this?

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@skumawat2025
Copy link
Copy Markdown
Contributor

Closing this PR and starting a new PR for this change: #17390

@github-project-automation github-project-automation bot moved this from Now(This Quarter) to ✅ Done in Storage Project Board Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.x Backport to 2.x branch enhancement Enhancement or improvement to existing feature or request stalled Issues that have stalled Storage:Remote v2.16.0 Issues and PRs related to version 2.16.0

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[Writable Warm] Recovery/Replication flow for Composite Directory

5 participants