[HUDI-2048] HoodieRealtimeInputFormatUtils#groupLogsByBaseFile throws…#3122
[HUDI-2048] HoodieRealtimeInputFormatUtils#groupLogsByBaseFile throws…#3122danny0405 wants to merge 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
nit : Should we modify the exception message so that when throw an exception, we can distinguish it from the exception in the getRealtimeSplits method?
There was a problem hiding this comment.
The exception already throws out, what kind of exception do you suggest it to be ?
There was a problem hiding this comment.
I mean to modify this content,for example 'Error creating hoodie real time split for group logs by BaseFile',because the exception content of groupLogsByBaseFile and getRealtimeSplits method are the same.
There was a problem hiding this comment.
When we use hive query the hudi table, the same NPE exception may also occur here , should we add the same judgment logic?
Oops, i guess we should
|
When we use hive query the hudi table, the same NPE exception may also occur here , should we add the same judgment logic? |
|
Hi @vinothchandar , can you also take a look, i guess this is also a blocker because the global index causes this problem (file group that has no parquet triggers the NPE). |
vinothchandar
left a comment
There was a problem hiding this comment.
IIUC this is filtering out file slices that may not have a base file? If so, those logs can contain committed data and thus it may not be the intended behavior?
|
HI @vinothchandar @danny0405 i think we should not filter those logs directly, those logs contains the data which we needed。 we encounter this problem and solved it in our products,this problem is sub_problem3 of HUDI-2086 and the correlation pr will come next few days。 |
That's cool, so you mean you would fix that in HUDI-2086 though ? Can we close this one instead ? |
|
HI @danny0405 yes, we will fix this problem by HUDI-2086, it will be ok to close this one |
… NPE for file group that has only logs
|
@vinothchandar I have modified to support pure logs file group reading, can you take a look again, thanks ~ |
|
Does #3203 solve this? |
vinothchandar
left a comment
There was a problem hiding this comment.
I am still not 100% if this is the approach we should take. Let me also review the PR for HUDI-2086 and get back?
| colNamesWithTypesForExternal.size(), | ||
| true); | ||
| } | ||
| } else if (split instanceof RealtimeSplit) { |
There was a problem hiding this comment.
can we do this at the sub class? Not sure if we want to make this class aware of RealtimeSplit
| /** | ||
| * Dummy record reader that outputs nothing. | ||
| */ | ||
| public static class DummyRecordReader implements RecordReader<NullWritable, ArrayWritable> { |
| } else { | ||
| // the file group has only logs (say the index is global). | ||
| try { | ||
| rtSplits.add(new HoodieRealtimeFileSplit(DummyInputSplit.INSTANCE, metaClient.getBasePath(), logFilePaths, maxCommitTime)); |
There was a problem hiding this comment.
This code seems to be intending to solely avoid reading this split, if it only has logs? Can't we just skip this fileId /fileslice, instead of adding a dummy split and record reader. They do add quite bit of complexity here. Wondering if we can fix it localized manner if all we want to do is avoid NPE
There was a problem hiding this comment.
Personally i want to fix the NPE at first, read the file groups with parquets is better than a thrown exception. You you think reading the logs only file group adds quite bit of complexity, we can avoid that.
I think it should, would just close this one. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #3122 +/- ##
=========================================
Coverage 46.19% 46.19%
- Complexity 5385 5386 +1
=========================================
Files 921 921
Lines 40040 40062 +22
Branches 4294 4297 +3
=========================================
+ Hits 18495 18506 +11
- Misses 19661 19669 +8
- Partials 1884 1887 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
… NPE for file group that has only logs
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.