[SUPPORT] Hudi partitions not dropped by Hive sync after insert_overwrite_table operation #8114

@Limess

Description

Describe the problem you faced

After overwriting a Hudi table in place using insert_overwrite_table, partitions which no longer exist in the new input data are not removed by Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are manually removed.

This is on Hudi 0.12.1, but I'm fairly sure the issue still exists on 0.13.0: #6662 fixes this behaviour for delete_partition operations but doesn't add any handling for insert_overwrite_table.

I'd be happy to be proven wrong if this is fixed in 0.13.0 - I don't have an environment in which to test it easily without working out how to upgrade Hudi on EMR ahead of a release.

To Reproduce

Steps to reproduce the behavior (a PySpark sketch follows the list):

  1. Create a new Hudi table using input data with two partitions, e.g. partition_col=1, partition_col=2
  2. Insert into the table using the operation hoodie.datasource.write.operation=insert_overwrite_table with input data containing 1/2 of the original partitions, e.g. only partition_col=2
  3. Run Hive sync. The stale partition is not dropped by either the Spark writer's built-in sync or the standalone HiveSyncTool
  4. Check the Hive partitions: both partitions still exist
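
For reference, a minimal PySpark sketch of the steps above, assuming Hudi 0.12.1 on Spark 3.3.1. The table name, record key/precombine fields, S3 path, and sync settings are hypothetical placeholders, not the exact job we run:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-overwrite-repro")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical table/field names; hive_sync.mode "hms" assumes the metastore
# endpoint (Glue on EMR in our case) is reachable from the job.
base_opts = {
    "hoodie.table.name": "hudi_overwrite_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "hudi_overwrite_test",
    "hoodie.datasource.hive_sync.partition_fields": "partition_col",
    "hoodie.datasource.hive_sync.mode": "hms",
}

path = "s3://my-bucket/hudi_overwrite_test"  # hypothetical path

# Step 1: create the table with two partitions (partition_col=1, partition_col=2).
df1 = spark.createDataFrame([(1, 1, 1), (2, 2, 1)], ["id", "partition_col", "ts"])
(df1.write.format("hudi")
    .options(**base_opts)
    .option("hoodie.datasource.write.operation", "insert")
    .mode("overwrite")
    .save(path))

# Step 2: overwrite the whole table with data covering only partition_col=2.
df2 = spark.createDataFrame([(3, 2, 2)], ["id", "partition_col", "ts"])
(df2.write.format("hudi")
    .options(**base_opts)
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .mode("append")
    .save(path))

# Steps 3-4: after sync, partition_col=1 is still listed in the catalog even
# though it holds no live data (assumes the session can see the synced table).
spark.sql("SHOW PARTITIONS default.hudi_overwrite_test").show()
```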

Expected behavior

I'd expect the partition which received no new data to be removed, e.g. only partition_col=2 exists and partition_col=1 is deleted.
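
As noted in the description, the stale partitions currently have to be removed by hand before engines like Athena work again. A hedged sketch of that manual workaround via Spark SQL (database/table/partition values match the hypothetical reproduction sketch above):

```python
# Manually drop the stale partition from the Hive/Glue metastore.
# ALTER TABLE ... DROP PARTITION is standard Hive DDL; names are hypothetical.
spark.sql("""
    ALTER TABLE default.hudi_overwrite_test
    DROP IF EXISTS PARTITION (partition_col=1)
""")

# Only the surviving partition should now be listed.
spark.sql("SHOW PARTITIONS default.hudi_overwrite_test").show()
```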

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 3.3.1

  • Hive version : AWS Glue

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Running on EMR 6.9.0
