[SUPPORT] Duplicate data in MOR table Hudi #8236

@xiagupqin

Describe the problem you faced
We run a Spark Structured Streaming application that reads a Kafka stream, processes the data, and stores it in Hudi.
We have started seeing duplicate records in our Hudi dataset.
Below are our Hudi configs:

DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "id",
DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt",
DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "ts",
HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
We only use upsert in our code:

dataframe.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, hudiTableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(s3Location)
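
To narrow down a report like this, it helps to measure the duplicates first. The following is a minimal sketch, not part of the original report, that reads a snapshot of the table and lists record keys that appear more than once. It assumes Spark 3.3 with the Hudi Spark bundle on the classpath, and it relies on the `_hoodie_record_key` and `_hoodie_partition_path` metadata columns that Hudi adds to each row. The object name `HudiDuplicateCheck` and passing the table path as the first argument are hypothetical choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, countDistinct}

// Hypothetical helper: lists record keys that occur more than once in the table.
object HudiDuplicateCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-duplicate-check")
      .getOrCreate()

    // Assumption: the table base path (s3Location in the issue) is passed as args(0).
    val s3Location = args(0)

    // Default (snapshot) query: for a MOR table this merges base files with log files.
    val snapshot = spark.read.format("hudi").load(s3Location)

    // For each record key, count its copies and the number of partitions it appears in.
    val duplicates = snapshot
      .groupBy(col("_hoodie_record_key"))
      .agg(
        count("*").as("copies"),
        countDistinct(col("_hoodie_partition_path")).as("partitions")
      )
      .filter(col("copies") > 1)

    duplicates.show(20, truncate = false)
    spark.stop()
  }
}
```

If `partitions` is greater than 1 for the affected keys, the same `id` was written under more than one `dt` value. That would be worth ruling out, since Hudi's non-global index types only deduplicate within a partition. This is one possibility to check, not a confirmed cause of the behaviour reported here.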
Environment Description

  • Hudi version: 0.12.0

  • Spark version: 3.3.0

  • Hive version:

  • Hadoop version:

  • Storage (HDFS/S3/GCS..): S3

  • Running on Docker? (yes/no): no

    Labels

    issue:data-consistency: Data consistency issues (duplicates/phantoms)
    priority:critical: Production degraded; pipelines stalled

    Projects

    Status

    ✅ Done
