
[SUPPORT] Z-order clustering on a moderate-size dataset taking a large amount of time #4135

@vinothchandar

Description


Describe the problem you faced

I am trying out z-ordering on a 50GB+ dataset locally to understand the behavior end to end. I noticed a large number of stages, and it's pretty slow as a result. I want to make sure this is expected.

[Spark UI screenshots showing the large number of stages]

To Reproduce

Steps to reproduce the behavior:

  1. Any 50GB+ dataset. I am using the amazon reviews dataset here https://s3.amazonaws.com/amazon-reviews-pds/readme.html
  2. Run a bulk_insert with inline clustering enabled:
```scala
import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.hudi.DataSourceWriteOptions._

val df = spark.read.parquet(inputPath)
val commonOpts = Map("hoodie.bulk_insert.shuffle.parallelism" -> "10",
                     "hoodie.clustering.inline" -> "true",
                     "hoodie.clustering.inline.max.commits" -> "1",
                     "hoodie.layout.optimize.enable" -> "true",
                     "hoodie.clustering.plan.strategy.sort.columns" -> "product_id,customer_id,review_date")
df.write.format("hudi").
  option(PRECOMBINE_FIELD.key(), "review_id").
  option(RECORDKEY_FIELD.key(), "review_id").
  option("hoodie.table.name", "amazon_reviews_hudi").
  option(OPERATION.key(), "bulk_insert").
  option(BULK_INSERT_SORT_MODE.key(), "NONE").
  options(commonOpts).
  mode(Overwrite).
  save(outputPath)
```
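For context on why this sort is heavier than a plain single-column sort: z-ordering derives a composite sort key by interleaving the bits of the configured sort columns, then orders records by that key. The sketch below is a hypothetical, minimal illustration of a Morton (z-value) computation; Hudi's actual implementation differs (it maps column values to ranks/ranges first), so this only shows the idea behind `hoodie.clustering.plan.strategy.sort.columns`. The object and method names are made up for illustration.

```scala
// Hypothetical sketch: build a Z-value (Morton code) by interleaving
// the low bits of each dimension. Sorting records by this value keeps
// rows that are close on ALL sort columns physically close on disk.
object ZOrderSketch {
  // Interleave the low `bits` bits of each dimension into one Long.
  // Bit b of dimension i lands at position b * dims.length + i.
  def zValue(dims: Seq[Long], bits: Int = 21): Long = {
    var z = 0L
    for (b <- 0 until bits; (d, i) <- dims.zipWithIndex) {
      val bit = (d >> b) & 1L
      z |= bit << (b * dims.length + i)
    }
    z
  }
}
```

Sorting 50GB+ by such a derived key requires sampling/range-partitioning plus a full shuffle on top of the bulk_insert write itself, which is consistent with seeing extra stages in the Spark UI.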

Expected behavior

Clustering a ~50GB dataset with a z-order sort should complete in a reasonable time, without this many Spark stages.

Environment Description

  • Hudi version : 0.10-SNAPSHOT

  • Spark version : Apache Spark 3.0

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : Local filesystem

  • Running on Docker? (yes/no) :

Additional context
