Skip to content

[HUDI-2048] HoodieRealtimeInputFormatUtils#groupLogsByBaseFile throws…#3122

Closed
danny0405 wants to merge 1 commit intoapache:masterfrom
danny0405:HUDI-2048
Closed

[HUDI-2048] HoodieRealtimeInputFormatUtils#groupLogsByBaseFile throws…#3122
danny0405 wants to merge 1 commit intoapache:masterfrom
danny0405:HUDI-2048

Conversation

@danny0405
Copy link
Copy Markdown
Contributor

… NPE for file group that has only logs

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Jun 21, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : Should we modify the exception message so that when throw an exception, we can distinguish it from the exception in the getRealtimeSplits method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception already throws out, what kind of exception do you suggest it to be ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean to modify this content,for example 'Error creating hoodie real time split for group logs by BaseFile',because the exception content of groupLogsByBaseFile and getRealtimeSplits method are the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use hive query the hudi table, the same NPE exception may also occur here , should we add the same judgment logic?

Oops, i guess we should

@zhangjun0x01
Copy link
Copy Markdown
Contributor

When we use hive query the hudi table, the same NPE exception may also occur here , should we add the same judgment logic?

@danny0405
Copy link
Copy Markdown
Contributor Author

Hi @vinothchandar , can you also take a look, i guess this is also a blocker because the global index causes this problem (file group that has no parquet triggers the NPE).

Copy link
Copy Markdown
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is filtering out file slices that may not have a base file? If so, those logs can contain committed data and thus it may not be the intended behavior?

@xiarixiaoyao
Copy link
Copy Markdown
Contributor

HI @vinothchandar @danny0405 i think we should not filter those logs directly, those logs contains the data which we needed。 we encounter this problem and solved it in our products,this problem is sub_problem3 of HUDI-2086 and the correlation pr will come next few days。

@danny0405
Copy link
Copy Markdown
Contributor Author

HI @vinothchandar @danny0405 i think we should not filter those logs directly, those logs contains the data which we needed。 we encounter this problem and solved it in our products,this problem is sub_problem3 of HUDI-2086 and the correlation pr will come next few days。

That's cool, so you mean you would fix that in HUDI-2086 though ? Can we close this one instead ?

@xiarixiaoyao
Copy link
Copy Markdown
Contributor

HI @danny0405 yes, we will fix this problem by HUDI-2086, it will be ok to close this one

@danny0405
Copy link
Copy Markdown
Contributor Author

@vinothchandar I have modified to support pure logs file group reading, can you take a look again, thanks ~

@vinothchandar vinothchandar added the priority:blocker Production down; release blocker label Jul 5, 2021
@vinothchandar
Copy link
Copy Markdown
Member

Does #3203 solve this?

Copy link
Copy Markdown
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not 100% if this is the approach we should take. Let me also review the PR for HUDI-2086 and get back?

colNamesWithTypesForExternal.size(),
true);
}
} else if (split instanceof RealtimeSplit) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this at the sub class? Not sure if we want to make this class aware of RealtimeSplit

/**
* Dummy record reader that outputs nothing.
*/
public static class DummyRecordReader implements RecordReader<NullWritable, ArrayWritable> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename: NoopRecordReader

} else {
// the file group has only logs (say the index is global).
try {
rtSplits.add(new HoodieRealtimeFileSplit(DummyInputSplit.INSTANCE, metaClient.getBasePath(), logFilePaths, maxCommitTime));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems to be intending to solely avoid reading this split, if it only has logs? Can't we just skip this fileId /fileslice, instead of adding a dummy split and record reader. They do add quite bit of complexity here. Wondering if we can fix it localized manner if all we want to do is avoid NPE

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally i want to fix the NPE at first, read the file groups with parquets is better than a thrown exception. You you think reading the logs only file group adds quite bit of complexity, we can avoid that.

@danny0405
Copy link
Copy Markdown
Contributor Author

Does #3203 solve this?

I think it should, would just close this one.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 36.95652% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.19%. Comparing base (202887b) to head (84bfc2d).
⚠️ Report is 5572 commits behind head on master.

Files with missing lines Patch % Lines
...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java 23.52% 23 Missing and 3 partials ⚠️
...g/apache/hudi/hadoop/HoodieParquetInputFormat.java 75.00% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #3122   +/-   ##
=========================================
  Coverage     46.19%   46.19%           
- Complexity     5385     5386    +1     
=========================================
  Files           921      921           
  Lines         40040    40062   +22     
  Branches       4294     4297    +3     
=========================================
+ Hits          18495    18506   +11     
- Misses        19661    19669    +8     
- Partials       1884     1887    +3     
Flag Coverage Δ
hudicli 39.95% <ø> (ø)
hudiclient 30.45% <ø> (ø)
hudicommon 47.56% <ø> (-0.02%) ⬇️
hudiflink 59.91% <ø> (ø)
hudihadoopmr 51.33% <36.95%> (+0.03%) ⬆️
hudisparkdatasource 67.06% <ø> (ø)
hudisync 54.05% <ø> (ø)
huditimelineservice 64.36% <ø> (ø)
hudiutilities 58.44% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/hudi/hadoop/HoodieParquetInputFormat.java 44.66% <75.00%> (+4.00%) ⬆️
...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java 34.40% <23.52%> (-0.39%) ⬇️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants