[SUPPORT] Hive sync loses some partitions when multiple commits are submitted at the same time #7570

@perfectcw

Describe the problem you faced

Issue:
Some partitions are lost during Hive sync.

Background:
We have a data ingestion pipeline that ingests about 500 partitions per day. The pipeline submits multiple commits at the same time, each inserting into a different partition. Hive sync is enabled for every commit.

After all of the commits succeeded, we found that some partitions were missing from the Hive table.

The following is an analysis of the logs and the .hoodie files.
The .hoodie file listing below shows six of the commits. Only two of them, 20221227042858342 and 20221227042906103, were synced to Hive; the partitions written by the other commits did not appear in the Hive table.

I think the root cause lies in the Hive sync mechanism. When Hudi syncs to Hive after a commit succeeds, it first looks up the latest synced commit and uses that commit's timestamp as a baseline. It then checks only the commits after that timestamp for new columns and partitions, and syncs whatever it finds to Hive.
So if commit A is submitted before the latest synced commit B but succeeds after commit B, commit A is never synced to Hive: its timestamp is earlier than commit B's timestamp, so it falls outside the range that is checked.
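
The suspected mechanism can be sketched in a few lines of Python. This is a simplified model, not Hudi's actual code: the function name `commits_considered_by_sync` and the two instants are hypothetical, and the only behavior assumed is the one described above, namely that a sync run inspects only commits whose timestamp is later than the last synced one.

```python
def commits_considered_by_sync(completed_commits, last_synced):
    """Return the completed commits a sync run would inspect.

    Simplified model (assumption): only commits whose instant time is
    strictly greater than the last synced instant are scanned for new
    partitions. Instant times are fixed-width digit strings, so plain
    string comparison matches chronological order.
    """
    return sorted(c for c in completed_commits if c > last_synced)


# Hypothetical instants: A is submitted first but completes last.
commit_a = "20230101000000001"
commit_b = "20230101000000002"

# B completes and syncs first; nothing has been synced before it.
after_b = commits_considered_by_sync({commit_b}, last_synced="")

# A completes afterwards; its sync run starts from B and never looks at A.
after_a = commits_considered_by_sync({commit_a, commit_b}, last_synced=commit_b)

print(after_b)  # ['20230101000000002']
print(after_a)  # []
```

The second call returns an empty list even though commit A has just completed, which corresponds to the "Found 0" partition scan in the log of the affected commit.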

Here is the log of commit 20221227042859357. It shows that the latest synced commit it retrieved is 20221227042906103, which was submitted after 20221227042859357 itself. As a result, the partition inserted by commit 20221227042859357 was not detected, and the number of partitions to sync was 0.

log of commit 20221227042859357:
2022-12-27 04:30:16,449 INFO hive.metastore: Opened a connection to metastore, current connections: 1
2022-12-27 04:30:16,465 INFO hive.metastore: Connected to metastore.
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Syncing target hoodie table with hive table(forecast_agg_hoover_multi_publish). Hive metastore URL :jdbc:hive2://hs2.presto.stg.aws.fwmrm.net:10000/;auth=noSasl, basePath :s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Trying to sync hoodie table forecast_agg_hoover_multi_publish with base path s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish of type COPY_ON_WRITE
2022-12-27 04:30:16,815 INFO table.TableSchemaResolver: Reading schema from s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish/20221227/0/20230108/9820ce59-03a8-4efa-8978-3c3cf61298d8-0_1-11-3890_20221227042906103.parquet
2022-12-27 04:30:16,904 INFO s3a.S3AInputStream: Switching to Random IO seek policy
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: No Schema difference for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: Schema sync complete. Syncing partitions for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,525 INFO hive.HiveSyncTool: Last commit time synced was found to be 20221227042906103
2022-12-27 04:30:17,525 INFO common.AbstractSyncHoodieClient: Last commit time synced is 20221227042906103, Getting commits since then
2022-12-27 04:30:17,527 INFO hive.HiveSyncTool: Storage partitions scan complete. Found 0
2022-12-27 04:30:17,697 INFO hive.HiveSyncTool: Sync complete for forecast_agg_hoover_multi_publish

    
.hoodie files (ordered by last modified time):
 name                                  type        last modified time           partition              exists in hive
 20221227042855832.commit.requested    requested   2022-12-27 pm12:28:59 CST    20221227/0/20230101    no
 20221227042858342.commit.requested    requested   2022-12-27 pm12:29:00 CST    20221227/0/20230106    yes
 20221227042858801.commit.requested    requested   2022-12-27 pm12:29:01 CST    20221227/0/20230107    no
 20221227042859357.commit.requested    requested   2022-12-27 pm12:29:01 CST    20221227/0/20221229    no
 20221227042901993.commit.requested    requested   2022-12-27 pm12:29:04 CST    20221227/0/20230103    no
 20221227042906103.commit.requested    requested   2022-12-27 pm12:29:08 CST    20221227/0/20230108    yes
 ...
 20221227042855832.inflight            inflight    2022-12-27 pm12:29:16 CST
 20221227042858342.inflight            inflight    2022-12-27 pm12:29:16 CST
 20221227042858801.inflight            inflight    2022-12-27 pm12:29:17 CST
 20221227042859357.inflight            inflight    2022-12-27 pm12:29:19 CST
 20221227042906103.inflight            inflight    2022-12-27 pm12:29:19 CST
 20221227042901993.inflight            inflight    2022-12-27 pm12:29:20 CST
 ...
 20221227042858342.commit              commit      2022-12-27 pm12:29:46 CST    20221227/0/20230106
 20221227042906103.commit              commit      2022-12-27 pm12:29:54 CST    20221227/0/20230108
 20221227042858801.commit              commit      2022-12-27 pm12:30:04 CST    20221227/0/20230107
 20221227042859357.commit              commit      2022-12-27 pm12:30:14 CST
 20221227042855832.commit              commit      2022-12-27 pm12:30:23 CST
 20221227042901993.commit              commit      2022-12-27 pm12:30:33 CST
 ...
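
As a cross-check, the listing can be replayed under the same simplified model. This is a sketch, not Hudi's implementation. It assumes that each sync runs right after its own commit completes and before the next one does, that the last synced instant then advances to the latest completed instant, and that nothing had been synced beforehand. The names `COMMITS` and `replay` are hypothetical.

```python
# (instant, partition, completion time) taken from the listing above.
# Partitions come from the matching .commit.requested rows.
COMMITS = [
    ("20221227042858342", "20221227/0/20230106", "12:29:46"),
    ("20221227042906103", "20221227/0/20230108", "12:29:54"),
    ("20221227042858801", "20221227/0/20230107", "12:30:04"),
    ("20221227042859357", "20221227/0/20221229", "12:30:14"),
    ("20221227042855832", "20221227/0/20230101", "12:30:23"),
    ("20221227042901993", "20221227/0/20230103", "12:30:33"),
]


def replay(commits):
    """Replay one sync per commit, in completion order."""
    hive_partitions = set()
    completed = {}      # instant -> partition, for completed commits
    last_synced = ""    # assumption: nothing synced before this batch
    for instant, partition, _ in sorted(commits, key=lambda c: c[2]):
        completed[instant] = partition
        # Only commits later than the last synced instant are scanned.
        newer = [i for i in completed if i > last_synced]
        hive_partitions.update(completed[i] for i in newer)
        # Assumption: the marker advances to the latest completed instant.
        last_synced = max(completed)
    return hive_partitions


synced = replay(COMMITS)
missing = sorted({partition for _, partition, _ in COMMITS} - synced)
print(sorted(synced))
print(missing)
```

Under these assumptions only 20221227/0/20230106 and 20221227/0/20230108 reach Hive, which matches the "exists in hive" column of the listing.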

To Reproduce

Steps to reproduce the behavior:

1. Submit multiple commits at the same time, each inserting into a different partition, with Hive sync enabled for every commit.
2. Make sure the commits succeed in a different order from the one in which they were submitted.
3. Check whether the Hive table has a partition for every insert.

Expected behavior

Each commit should be able to specify the partitions it has just inserted as the partitions to sync.
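
One possible reading of this expectation, sketched in Python: the sync step receives the partitions written by the commit itself and registers whichever of them Hive does not have yet, without comparing timestamps. The helper `partitions_to_add` is hypothetical and does not correspond to an existing Hudi API or option.

```python
def partitions_to_add(written_by_commit, hive_partitions):
    """Partitions this commit wrote that Hive does not know about yet.

    Hypothetical helper: the result depends only on the commit's own
    partitions and the current Hive state, not on commit timestamps.
    """
    return sorted(set(written_by_commit) - set(hive_partitions))


# Hive state after the two commits that did sync in this report.
hive = {"20221227/0/20230106", "20221227/0/20230108"}

# Commit 20221227042859357 wrote partition 20221227/0/20221229.
print(partitions_to_add(["20221227/0/20221229"], hive))

# A partition that is already registered needs no further action.
print(partitions_to_add(["20221227/0/20230106"], hive))
```

With this approach the order in which concurrent commits complete no longer affects which partitions are synced.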

Environment Description

  • Hudi version :0.11.1

  • Spark version :3.2.1

  • Hive version :XXX

  • Hadoop version :3.3.2

  • Storage (HDFS/S3/GCS..) :S3

  • Running on Docker? (yes/no) :no
