Updating master #3

Merged
aditiwari01 merged 875 commits into aditiwari01:master from apache:master
Dec 29, 2021
Conversation

@aditiwari01
Owner

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

danny0405 and others added 30 commits November 6, 2021 12:23
…with Tez(#3630)

Co-authored-by: dylonyu <dylonyu@tencent.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
…es from data table can trigger table services in metadata table (#3900)
…#3820)

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
…3833)

* [HUDI-1877] Support records staying in same fileId after clustering

Add plan strategy

* Ensure same filegroup id and refactor based on comments
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
…no commit data (#3928)

Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
…es. (#3873)

* [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Following improvements are implemented:
1. Memory overhead reduction:
  - Existing code caches FileStatus for each file in memory.
  - Created a new class DirectoryInfo which is used to cache a directory's file list with only parts of the FileStatus (filename and file length). This reduces the memory requirements.

2. Improved parallelism:
  - Existing code collects all the listing to the Driver and then creates HoodieRecord on the Driver.
  - This takes a long time for large tables (11 million HoodieRecords to be created).
  - Created a new function in SparkRDDWriteClient specifically for bootstrap commit. In it, the HoodieRecord creation is parallelized across executors so it completes fast.

3. Fixed setting to limit the number of parallel listings:
  - Existing code had a bug wherein 1500 executors were hardcoded to perform listing. This leads to an exception because of the limit on Spark's result memory.
  - Corrected the use of the config.

Result:
Dataset has 1299 partitions and 12 million files.
file listing time=1.5mins
HoodieRecord creation time=13seconds
deltacommit duration=2.6mins
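
The memory reduction in item 1 can be illustrated with a minimal sketch. This is not the actual DirectoryInfo class from the commit; the class name `DirectoryListingSketch`, its fields, and its methods are hypothetical and only show the idea of keeping the filename and file length per entry instead of a full FileStatus object.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a compact per-directory listing.
 * Assumption: only the filename and the file length are needed later,
 * so nothing else from the file status is retained.
 */
final class DirectoryListingSketch {
    private final String directoryPath;
    // filename -> length in bytes; far smaller than one full status object per file
    private final Map<String, Long> fileNameToLength = new HashMap<>();

    DirectoryListingSketch(String directoryPath) {
        this.directoryPath = directoryPath;
    }

    /** Records one file of this directory, keeping only its name and length. */
    void addFile(String fileName, long lengthBytes) {
        fileNameToLength.put(fileName, lengthBytes);
    }

    String directoryPath() {
        return directoryPath;
    }

    int fileCount() {
        return fileNameToLength.size();
    }

    /** Sums the lengths of all recorded files. */
    long totalBytes() {
        long total = 0L;
        for (long length : fileNameToLength.values()) {
            total += length;
        }
        return total;
    }
}
```

Holding one small map per directory, rather than one status object per file, is what would keep the footprint manageable at the scale quoted above.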

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
…ithmeticException (#3955)

- ExternalSpillableMap estimates the payload/value size on the first put to
  determine when to spill over to the disk map. The payload size is also
  re-estimated after a minimum threshold of puts. This re-estimation uses the
  current in-memory map size to calculate the average payload size, and it
  attempts a divide-by-zero operation when the in-memory map is empty. The
  ArithmeticException during the payload size re-estimate is now avoided by
  checking the map size upfront.
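
A minimal sketch of the guard described above, assuming the average is computed by integer division of the in-memory byte total by the entry count. The class `PayloadSizeEstimatorSketch` and its method names are hypothetical and are not taken from ExternalSpillableMap.

```java
import java.util.Map;

/**
 * Hypothetical sketch of re-estimating an average payload size.
 * Assumption: the estimate is total in-memory bytes divided by the entry count.
 */
final class PayloadSizeEstimatorSketch {
    private long estimatedPayloadSize;

    PayloadSizeEstimatorSketch(long initialEstimate) {
        this.estimatedPayloadSize = initialEstimate;
    }

    /**
     * Returns the updated average payload size. When the in-memory map is
     * empty, the previous estimate is kept, because dividing by a zero entry
     * count would throw ArithmeticException.
     */
    long reestimate(long currentInMemoryBytes, Map<?, ?> inMemoryMap) {
        int entries = inMemoryMap.size();
        if (entries == 0) {
            // Guard: check the map size upfront and skip the division.
            return estimatedPayloadSize;
        }
        estimatedPayloadSize = currentInMemoryBytes / entries;
        return estimatedPayloadSize;
    }
}
```

Keeping the previous estimate for an empty map is one possible choice; the commit message only states that the map size is checked upfront.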
…nReadRollbackActionExecutor (#3978)

- With rollback after first commit support added to metadata table, these test cases are safe to have metadata table turned on.
Co-authored-by: 闫杜峰 <yandufeng@sinochem.com>
xushiyan and others added 29 commits December 18, 2021 17:00
…ix and TestHoodieClientMultiWriter test fixes (#4384)

 - Made FileSystemBasedLockProviderTestClass thread safe and fixed the
   tryLock retry logic.

 - Made TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter
   deterministic in verifying the HoodieWriteConflictException.

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
[HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
…with empty checkpoint (#4334)

* Adding ability to read entire data with HoodieIncrSource with empty checkpoint

* Addressing comments
@aditiwari01 aditiwari01 merged commit abdc7d8 into aditiwari01:master Dec 29, 2021