Describe the problem you faced
I am trying to experiment with z-ordering on a 50 GB+ dataset locally to understand how it works. I noticed a large number of stages, and the job is pretty slow as a result. I want to confirm this is expected.


To Reproduce
Steps to reproduce the behavior:
- Use any 50 GB+ dataset. I am using the Amazon Customer Reviews dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
- Run a bulk insert with inline clustering and z-order layout optimization:
```scala
val df = spark.read.parquet(inputPath)

val commonOpts = Map(
  "hoodie.bulk_insert.shuffle.parallelism" -> "10",
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.inline.max.commits" -> "1",
  "hoodie.layout.optimize.enable" -> "true",
  "hoodie.clustering.plan.strategy.sort.columns" -> "product_id,customer_id,review_date")

df.write.format("hudi").
  option(PRECOMBINE_FIELD.key(), "review_id").
  option(RECORDKEY_FIELD.key(), "review_id").
  option("hoodie.table.name", "amazon_reviews_hudi").
  option(OPERATION.key(), "bulk_insert").
  option(BULK_INSERT_SORT_MODE.key(), "NONE").
  options(commonOpts).
  mode(Overwrite).
  save(outputPath)
```
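For intuition on what the sort columns above are doing: z-ordering maps each row's sort-column values to a single key by interleaving their bits, so rows that are close in the multi-dimensional key space end up near each other in the sorted layout, and a filter on any one of the columns can prune file groups. A minimal illustrative sketch (plain Python, not Hudi's actual implementation, and assuming small non-negative integer keys):

```python
def z_value(coords, bits=8):
    """Interleave the bits of each coordinate into one z-order key.

    coords: tuple of non-negative ints, one per sort column.
    bits:   number of bits considered per coordinate.
    """
    z = 0
    for bit in range(bits):                 # least- to most-significant bit
        for dim, c in enumerate(coords):
            # place bit `bit` of dimension `dim` at its interleaved position
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z

# Sorting rows by their z-value keeps rows with similar values in *any*
# of the columns close together, instead of favoring only the first
# column the way a plain lexicographic sort does.
rows = [(3, 5), (3, 6), (9, 1), (2, 5)]
rows.sort(key=z_value)
```

This is only the ordering idea; the actual Hudi clustering job also samples the data to build balanced ranges and rewrites file groups, which accounts for additional Spark stages beyond a plain sort.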
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
- Hudi version : 0.10-SNAPSHOT
- Spark version : Apache Spark 3.0
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) : Local filesystem
- Running on Docker? (yes/no) :
Additional context