[HUDI-3805] Delete existing corrupted requested rollback plan during rollback#5245
Conversation
6e1603b to
40fc116
Compare
vinothchandar
left a comment
There was a problem hiding this comment.
What would exactly corrupt the plan? Is this for HDFS where writes are not-atomic
| HoodieRollbackPlan rollbackPlan; | ||
| try { | ||
| rollbackPlan = RollbackUtils.getRollbackPlan(metaClient, rollbackInstant); | ||
| } catch (IOException e) { |
There was a problem hiding this comment.
would n't this go ahead and delete even if say the error was a legit IO exception? lets say cloud storage was inaccessible/connection timeout.. We should explicitly handle corruptions cleanly IMO
There was a problem hiding this comment.
yes. probably we should catch only for the known exception and stacktrace as well.
Caused by: java.io.IOException: Not an Avro data fileat org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
this is the stacktrace when avro file is corrupt. we will follow up w/ the right fix.
There was a problem hiding this comment.
There was a problem hiding this comment.
The deletion only happens for the requested rollback plan. Any inflight rollback is not affected. The assumption here is that even if the requested rollback plan is inaccessible and deleted, it can be requested again by a new writer, which is still safe. We don't want the hanging rollback plan to block metadata table compaction.
The corruption is mainly due to writes of a rollback plan not atomic, and the job fails during that time.
There was a problem hiding this comment.
yeah. Ethan reminded me of the same discussion we had when the patch was put up. we found the existing fix as the safest option compared to other alternatives. Just to add to what Ethan has mentioned above, we do this only incase of rollback.requested and not for rollback.inflight. For inflight, its safe to re-use the plan from the rollback.requested file.
What is the purpose of the pull request
When the
instant_time.rollback.requestedfile in the timeline is empty or corrupted, it cannot be parsed. When runninggetPendingRollbackInfos(), it's going to skip that empty/corrupted requested rollback instant and the rollback instant is going to stay on the timeline forever, preventing metadata table archival.This PR fixes the problem by deleting the requested rollback plan if it is empty or corrupted.
Brief change log
BaseHoodieWriteClient::getPendingRollbackInfos().Verify this pull request
This change added tests for requested rollbacks, either valid or corrupted, in
TestClientRollback. The fix is also verified by running the deltastreamer on a Hudi Table with corrupted requested rollback in the timeline. The corrupted rollback plan is deleted afterwards.Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.