Describe the problem you faced
After running an insert to overwrite a Hudi table in place using insert_overwrite_table, partitions that no longer exist in the new input data are not removed by the Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are manually removed.
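As a hedged sketch of the manual cleanup mentioned above: since the Hive metastore here is AWS Glue, the stale partition entries can be dropped from the Glue catalog directly. The database, table, and partition values below are illustrative placeholders, not taken from the report.

```shell
# Hypothetical workaround: remove the stale partition from the Glue Data
# Catalog so Athena queries stop failing. Replace my_db / test_table and
# the partition value with your own names.
aws glue batch-delete-partition \
  --database-name my_db \
  --table-name test_table \
  --partitions-to-delete '[{"Values": ["1"]}]'
```

Equivalently, `ALTER TABLE test_table DROP PARTITION (partition_col=1)` run from Athena should achieve the same result.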
This is on Hudi 0.12.1, but I'm fairly sure the issue still exists on 0.13.0: #6662 fixes this behaviour for delete_partition operations, but doesn't add any handling for insert_overwrite_table.
I'd be happy to be proven wrong if this is fixed in 0.13.0 - I don't have an environment to easily test it without working out how to upgrade Hudi on EMR ahead of a release.
To Reproduce
Steps to reproduce the behavior:
- Create a new Hudi table from input data with two partitions, e.g. partition_col=1 and partition_col=2
- Write to the table with hoodie.datasource.write.operation=insert_overwrite_table, using input data containing only half of the original partitions, e.g. only partition_col=2
- Run HiveSyncTool or similar (partition removal doesn't happen with either the Spark writer's sync or the standalone HiveSyncTool)
- Check the Hive partitions: both partitions still exist
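The steps above can be sketched in PySpark. This is a hypothetical minimal repro, not the reporter's actual job: the table name, paths, schema, and record-key/precombine fields are illustrative assumptions; the Hudi option keys themselves are standard datasource write options.

```python
# Hypothetical PySpark sketch of the reproduction steps above.
# Names (test_table, partition_col, id, ts, the S3 path) are illustrative.

# Common write options, including Hive sync (Glue catalog on EMR).
base_options = {
    "hoodie.table.name": "test_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "partition_col",
}

# Options for the second write: overwrite the whole table in place.
overwrite_options = dict(
    base_options,
    **{"hoodie.datasource.write.operation": "insert_overwrite_table"},
)


def main():
    # Imported inside main() so the option dicts above can be inspected
    # without a Spark/Hudi installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "s3://bucket/path/test_table"

    # Step 1: create the table with two partitions (partition_col=1 and 2).
    df1 = spark.createDataFrame(
        [(1, 1, 0), (2, 2, 0)], ["id", "partition_col", "ts"]
    )
    df1.write.format("hudi").options(**base_options).mode("overwrite").save(path)

    # Step 2: insert_overwrite_table with only partition_col=2.
    df2 = spark.createDataFrame([(3, 2, 1)], ["id", "partition_col", "ts"])
    df2.write.format("hudi").options(**overwrite_options).mode("append").save(path)

    # Steps 3-4: after Hive sync, SHOW PARTITIONS on the synced table still
    # lists partition_col=1, even though it no longer exists in the data.


if __name__ == "__main__":
    main()
```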
Expected behavior
I'd expect the partition that was not re-inserted to be removed from Hive, i.e. only partition_col=2 exists and partition_col=1 is deleted.
Environment Description
Hudi version : 0.12.1
Spark version : 3.3.1
Hive version : AWS Glue
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Running on EMR 6.9.0