Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
Issue:
Some partitions are missing from the Hive table after Hive sync.
Background:
We have a data ingestion pipeline that ingests about 500 partitions per day. The pipeline submits multiple commits at the same time, each inserting into a different partition, and Hive sync is enabled for every commit.
After all of the commits succeeded, we found that some partitions were missing from the Hive table.
The following is an analysis of the log and the .hoodie files:
The .hoodie file listing below shows six of the commits. Only two of them, 20221227042858342 and 20221227042906103, were synced to Hive; the partitions written by the other four did not appear in the Hive table.
I think the root cause is the Hive sync mechanism. When Hudi syncs to Hive after a commit succeeds, it first gets the last synced commit and then uses that commit's timestamp as a baseline: it checks only the commits after it for new columns and partitions, and syncs any it finds to Hive.
So if commit A is submitted before the last synced commit B but succeeds after commit B, it will not be synced to Hive: commit A's timestamp < commit B's timestamp, so A is never detected.
Here is the log of commit 20221227042859357. We can see that the last synced commit it gets is 20221227042906103, which was submitted after 20221227042859357 itself. So the partition inserted by commit 20221227042859357 is not detected, and the number of partitions to sync is 0.
log of commit 20221227042859357:
2022-12-27 04:30:16,449 INFO hive.metastore: Opened a connection to metastore, current connections: 1
2022-12-27 04:30:16,465 INFO hive.metastore: Connected to metastore.
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Syncing target hoodie table with hive table(forecast_agg_hoover_multi_publish). Hive metastore URL :jdbc:hive2://hs2.presto.stg.aws.fwmrm.net:10000/;auth=noSasl, basePath :s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Trying to sync hoodie table forecast_agg_hoover_multi_publish with base path s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish of type COPY_ON_WRITE
2022-12-27 04:30:16,815 INFO table.TableSchemaResolver: Reading schema from s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish/20221227/0/20230108/9820ce59-03a8-4efa-8978-3c3cf61298d8-0_1-11-3890_20221227042906103.parquet
2022-12-27 04:30:16,904 INFO s3a.S3AInputStream: Switching to Random IO seek policy
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: No Schema difference for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: Schema sync complete. Syncing partitions for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,525 INFO hive.HiveSyncTool: Last commit time synced was found to be 20221227042906103
2022-12-27 04:30:17,525 INFO common.AbstractSyncHoodieClient: Last commit time synced is 20221227042906103, Getting commits since then
2022-12-27 04:30:17,527 INFO hive.HiveSyncTool: Storage partitions scan complete. Found 0
2022-12-27 04:30:17,697 INFO hive.HiveSyncTool: Sync complete for forecast_agg_hoover_multi_publish
.hoodie files (ordered by time):
| name | type | last modified time | partition | exists in Hive |
| --- | --- | --- | --- | --- |
| 20221227042855832.commit.requested | requested | 2022-12-27 12:28:59 PM CST | 20221227/0/20230101 | no |
| 20221227042858342.commit.requested | requested | 2022-12-27 12:29:00 PM CST | 20221227/0/20230106 | yes |
| 20221227042858801.commit.requested | requested | 2022-12-27 12:29:01 PM CST | 20221227/0/20230107 | no |
| 20221227042859357.commit.requested | requested | 2022-12-27 12:29:01 PM CST | 20221227/0/20221229 | no |
| 20221227042901993.commit.requested | requested | 2022-12-27 12:29:04 PM CST | 20221227/0/20230103 | no |
| 20221227042906103.commit.requested | requested | 2022-12-27 12:29:08 PM CST | 20221227/0/20230108 | yes |
| ... | | | | |
| 20221227042855832.inflight | inflight | 2022-12-27 12:29:16 PM CST | | |
| 20221227042858342.inflight | inflight | 2022-12-27 12:29:16 PM CST | | |
| 20221227042858801.inflight | inflight | 2022-12-27 12:29:17 PM CST | | |
| 20221227042859357.inflight | inflight | 2022-12-27 12:29:19 PM CST | | |
| 20221227042906103.inflight | inflight | 2022-12-27 12:29:19 PM CST | | |
| 20221227042901993.inflight | inflight | 2022-12-27 12:29:20 PM CST | | |
| ... | | | | |
| 20221227042858342.commit | commit | 2022-12-27 12:29:46 PM CST | 20221227/0/20230106 | |
| 20221227042906103.commit | commit | 2022-12-27 12:29:54 PM CST | 20221227/0/20230108 | |
| 20221227042858801.commit | commit | 2022-12-27 12:30:04 PM CST | 20221227/0/20230107 | |
| 20221227042859357.commit | commit | 2022-12-27 12:30:14 PM CST | | |
| 20221227042855832.commit | commit | 2022-12-27 12:30:23 PM CST | | |
| 20221227042901993.commit | commit | 2022-12-27 12:30:33 PM CST | | |
| ... | | | | |
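The suspected race can be reproduced on paper with a small simulation. This is only a simplified model of the behavior described in this issue, not Hudi's actual code: the `Commit` class and `simulate_sync` function are hypothetical names, and it assumes that each sync runs right after its commit completes, that syncs run one at a time in completion order, and that each sync considers only completed commits whose instant time is greater than the last synced instant. The instants, completion times, and partitions are taken from the listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    instant: str       # instant time, assigned when the commit is submitted
    completed_at: str  # time the .commit file was written (from the listing)
    partition: str     # partition written by this commit


COMMITS = [
    Commit("20221227042855832", "12:30:23", "20221227/0/20230101"),
    Commit("20221227042858342", "12:29:46", "20221227/0/20230106"),
    Commit("20221227042858801", "12:30:04", "20221227/0/20230107"),
    Commit("20221227042859357", "12:30:14", "20221227/0/20221229"),
    Commit("20221227042901993", "12:30:33", "20221227/0/20230103"),
    Commit("20221227042906103", "12:29:54", "20221227/0/20230108"),
]


def simulate_sync(commits, last_synced=""):
    """Model of a timestamp-based sync (assumption, not Hudi's implementation).

    Returns the set of partitions that end up registered in Hive.
    """
    hive_partitions = set()
    completed = []
    # Syncs are triggered in the order the commits complete.
    for commit in sorted(commits, key=lambda c: c.completed_at):
        completed.append(commit)
        # Only completed commits newer than the last synced instant are scanned.
        newer = [c for c in completed if c.instant > last_synced]
        hive_partitions.update(c.partition for c in newer)
        # The table then records the latest completed instant as synced.
        last_synced = max(last_synced, max(c.instant for c in completed))
    return hive_partitions


if __name__ == "__main__":
    synced = simulate_sync(COMMITS)
    missing = sorted({c.partition for c in COMMITS} - synced)
    print("synced :", sorted(synced))
    print("missing:", missing)
```

Under these assumptions the model syncs only the partitions of 20221227042858342 and 20221227042906103 and skips the other four, which matches the "exists in Hive" column in the listing.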
To Reproduce
Steps to reproduce the behavior:
1. Submit multiple commits at the same time, each inserting into a different partition, with Hive sync enabled for every commit.
2. Make sure the order in which the commits succeed differs from the order in which they were submitted.
3. Check whether the Hive table has a partition for every insert.
Expected behavior
Each commit should be able to specify the partitions it has just inserted as the partitions to sync.
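As a rough illustration of this expectation, the sketch below has each sync register the partitions written by its own commit, regardless of instant-time ordering. It is a hypothetical model rather than an existing Hudi API: `partitions_to_add` and `sync_all` are made-up names, and the commit list simply replays the six commits from this report in the order they completed.

```python
def partitions_to_add(written_by_commit, already_in_hive):
    """Partitions this commit wrote that Hive does not know about yet."""
    return sorted(set(written_by_commit) - set(already_in_hive))


# (instant, partitions written), listed in the order the commits completed.
COMPLETED = [
    ("20221227042858342", ["20221227/0/20230106"]),
    ("20221227042906103", ["20221227/0/20230108"]),
    ("20221227042858801", ["20221227/0/20230107"]),
    ("20221227042859357", ["20221227/0/20221229"]),
    ("20221227042855832", ["20221227/0/20230101"]),
    ("20221227042901993", ["20221227/0/20230103"]),
]


def sync_all(completed):
    """Each sync adds its own commit's partitions; instant order is ignored."""
    hive = set()
    for _instant, written in completed:
        hive.update(partitions_to_add(written, hive))
    return hive


if __name__ == "__main__":
    print(sorted(sync_all(COMPLETED)))
```

Because the set of partitions to add depends only on what the commit wrote and what Hive already has, a commit that finishes late is still synced, and re-running a sync adds nothing twice.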
Environment Description
Hudi version: 0.11.1
Spark version: 3.2.1
Hive version: XXX
Hadoop version: 3.3.2
Storage (HDFS/S3/GCS..): S3
Running on Docker? (yes/no): no
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.