
[SUPPORT] Z-order clustering on a moderate-size dataset taking a large amount of time #4135

@vinothchandar

Description


Describe the problem you faced

I am trying out z-ordering on a 50GB+ dataset locally to understand the behavior end to end. I noticed a large number of stages, and it's pretty slow as a result. I want to make sure this is expected.

[Spark UI screenshots showing the large number of stages]

To Reproduce

Steps to reproduce the behavior:

  1. Any 50GB+ dataset. I am using the amazon reviews dataset here https://s3.amazonaws.com/amazon-reviews-pds/readme.html
  2. Run a bulk_insert with inline clustering enabled:
```scala
import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.hudi.DataSourceWriteOptions._

val df = spark.read.parquet(inputPath)
val commonOpts = Map("hoodie.bulk_insert.shuffle.parallelism" -> "10",
                     "hoodie.clustering.inline" -> "true",
                     "hoodie.clustering.inline.max.commits" -> "1",
                     "hoodie.layout.optimize.enable" -> "true",
                     "hoodie.clustering.plan.strategy.sort.columns" -> "product_id,customer_id,review_date")
df.write.format("hudi").
  option(PRECOMBINE_FIELD.key(), "review_id").
  option(RECORDKEY_FIELD.key(), "review_id").
  option("hoodie.table.name", "amazon_reviews_hudi").
  option(OPERATION.key(), "bulk_insert").
  option(BULK_INSERT_SORT_MODE.key(), "NONE").
  options(commonOpts).
  mode(Overwrite).
  save(outputPath)
```
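For context on why this sort is heavier than a plain single-column sort: z-ordering derives a composite sort key by interleaving the bits of the configured sort columns, then orders records by that key. The sketch below is a hypothetical, minimal illustration of a Morton (z-value) computation; Hudi's actual implementation differs (it maps column values to ranks/ranges first), so this only shows the idea behind `hoodie.clustering.plan.strategy.sort.columns`. The object and method names are made up for illustration.

```scala
// Hypothetical sketch: build a Z-value (Morton code) by interleaving
// the low bits of each dimension. Sorting records by this value keeps
// rows that are close on ALL sort columns physically close on disk.
object ZOrderSketch {
  // Interleave the low `bits` bits of each dimension into one Long.
  // Bit b of dimension i lands at position b * dims.length + i.
  def zValue(dims: Seq[Long], bits: Int = 21): Long = {
    var z = 0L
    for (b <- 0 until bits; (d, i) <- dims.zipWithIndex) {
      val bit = (d >> b) & 1L
      z |= bit << (b * dims.length + i)
    }
    z
  }
}
```

Sorting 50GB+ by such a derived key requires sampling/range-partitioning plus a full shuffle on top of the bulk_insert write itself, which is consistent with seeing extra stages in the Spark UI.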

Expected behavior

Clustering a ~50GB dataset with a z-order sort should complete in a reasonable time, without this many Spark stages.

Environment Description

  • Hudi version : 0.10-SNAPSHOT

  • Spark version : Apache Spark 3.0

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : Local filesystem

  • Running on Docker? (yes/no) :

Additional context
