Skip to content

Conversation

@manuzhang
Copy link
Member

@manuzhang manuzhang commented Jan 3, 2024

Even the rewrite task failure without partial progress is logged as a WARN. It shouldn't be an ERROR with partial progress.

@github-actions github-actions bot added the spark label Jan 3, 2024
@manuzhang
Copy link
Member Author

@ajantha-bhat please help review.

.onFailure(
(fileGroup, exception) -> {
LOG.error("Failure during rewrite group {}", fileGroup.info(), exception);
LOG.warn("Failure during rewrite group {}", fileGroup.info(), exception);
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I think I get the rationale that during partial progress it could be too noisy if we error logged every failure. But the issue is let's say most of the commit tasks are failures, and we only see warn logs which are easy to get masked. The only case we error log is when all tasks fail which seems insufficient logging in case of failure.

I actually think error is fine here. Are you seeing a lot of noisiness in this in any systems you may be running?

I also noticed that the non-partial progress case is also warn, but tbh that also doesn't seem right. What do you think @manuzhang also cc @szehon-ho @RussellSpitzer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea is that we log an error if the commit results size is less than some threshold rather than completely empty. This would eliminate the task level noise from errors but at the same time indicate to logging systems that the compaction isn't super effective.

    if (commitResults.size() == 0) {
      LOG.error(
          "{} is true but no rewrite commits succeeded. Check the logs to determine why the individual "
              + "commits failed. If this is persistent it may help to increase {} which will break the rewrite operation "
              + "into smaller commits.",
          PARTIAL_PROGRESS_ENABLED,
          PARTIAL_PROGRESS_MAX_COMMITS);
    }

to

    if (commitResults.size()/commitAttempts <= SOME_THRESHOLD) {
      LOG.error(
          "{} is true but less than SOME_THRESHOLD rewrite commits succeeded. Check the logs to determine why the individual "
              + "commits failed. If this is persistent it may help to increase {} which will break the rewrite operation "
              + "into smaller commits.",
          PARTIAL_PROGRESS_ENABLED,
          PARTIAL_PROGRESS_MAX_COMMITS);
    }
```

but maybe determining commitAttempts is overcomplicated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree ERROR is better for non-partial case. For partial progress, since failure is allowed, WARN sends the signal that failure is not fatal to the whole procedure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants