Skip to content

[HUDI-3805] Delete existing corrupted requested rollback plan during rollback#5245

Merged
codope merged 1 commit intoapache:masterfrom
yihua:HUDI-3805-fix-empty-requested-rollback
Apr 7, 2022
Merged

[HUDI-3805] Delete existing corrupted requested rollback plan during rollback#5245
codope merged 1 commit intoapache:masterfrom
yihua:HUDI-3805-fix-empty-requested-rollback

Conversation

@yihua
Copy link
Copy Markdown
Contributor

@yihua yihua commented Apr 7, 2022

What is the purpose of the pull request

When the instant_time.rollback.requested file in the timeline is empty or corrupted, it cannot be parsed. When running getPendingRollbackInfos(), it's going to skip that empty/corrupted requested rollback instant and the rollback instant is going to stay on the timeline forever, preventing metadata table archival.

This PR fixes the problem by deleting the requested rollback plan if it is empty or corrupted.

Brief change log

  • Adds logic to delete requested rollback plan if it cannot be parsed in BaseHoodieWriteClient::getPendingRollbackInfos().

Verify this pull request

This change added tests for requested rollbacks, either valid or corrupted, in TestClientRollback. The fix is also verified by running the deltastreamer on a Hudi Table with corrupted requested rollback in the timeline. The corrupted rollback plan is deleted afterwards.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Copy link
Copy Markdown
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yihua yihua added the priority:blocker Production down; release blocker label Apr 7, 2022
@yihua yihua force-pushed the HUDI-3805-fix-empty-requested-rollback branch from 6e1603b to 40fc116 Compare April 7, 2022 03:00
@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Apr 7, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would exactly corrupt the plan? Is this for HDFS where writes are not-atomic

HoodieRollbackPlan rollbackPlan;
try {
rollbackPlan = RollbackUtils.getRollbackPlan(metaClient, rollbackInstant);
} catch (IOException e) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would n't this go ahead and delete even if say the error was a legit IO exception? lets say cloud storage was inaccessible/connection timeout.. We should explicitly handle corruptions cleanly IMO

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. probably we should catch only for the known exception and stacktrace as well.

Caused by: java.io.IOException: Not an Avro data fileat org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)

this is the stacktrace when avro file is corrupt. we will follow up w/ the right fix.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

@yihua yihua Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deletion only happens for the requested rollback plan. Any inflight rollback is not affected. The assumption here is that even if the requested rollback plan is inaccessible and deleted, it can be requested again by a new writer, which is still safe. We don't want the hanging rollback plan to block metadata table compaction.

The corruption is mainly due to writes of a rollback plan not atomic, and the job fails during that time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. Ethan reminded me of the same discussion we had when the patch was put up. we found the existing fix as the safest option compared to other alternatives. Just to add to what Ethan has mentioned above, we do this only incase of rollback.requested and not for rollback.inflight. For inflight, its safe to re-use the plan from the rollback.requested file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

No open projects
Status: No status

Development

Successfully merging this pull request may close these issues.

5 participants