feat: Support reading Parquet ENUM logical type as String #7805

Open
leochen4891 wants to merge 4 commits into deephaven:main from leochen4891:fix/parquet-enum-logical-type
@leochen4891 leochen4891 commented Mar 19, 2026

Closes #7723

Background

Parquet's ENUM logical type is physically identical to STRING: both annotate a BINARY column with UTF-8 encoded bytes. The only difference is the label. External tools such as Spark and PyArrow use ENUM to indicate a column holds a finite set of string values, but the wire format is the same.

Deephaven's read pipeline dispatches on the logical type in three stages. None of them previously handled EnumLogicalTypeAnnotation, so ENUM-annotated columns in externally produced files failed on read.

Changes

Stage 1 — Schema to Java type (ParquetSchemaReader)

Before: visit(EnumLogicalTypeAnnotation) set an error string and returned Optional.empty(), so the column was unresolvable.

After: Returns Optional.of(String.class), the same result as visit(StringLogicalTypeAnnotation).
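
As a rough illustration of the Stage 1 change, the sketch below models the visitor dispatch with small stand-in types. Everything here (LogicalAnnotation, AnnotationVisitor, SchemaTypeSketch, and the visit method names) is hypothetical and only mirrors the shape of the real Parquet visitor; it is not the code in ParquetSchemaReader.

```java
import java.util.Optional;

// Hypothetical stand-ins for Parquet's logical type annotations and visitor.
// The real classes come from the Parquet Java library and are not reproduced here.
interface AnnotationVisitor<T> {
    // Any annotation without an override is treated as unresolved.
    default Optional<T> visitString() { return Optional.empty(); }
    default Optional<T> visitEnum() { return Optional.empty(); }
    default Optional<T> visitOther() { return Optional.empty(); }
}

interface LogicalAnnotation {
    <T> Optional<T> accept(AnnotationVisitor<T> visitor);
}

final class StringAnnotation implements LogicalAnnotation {
    public <T> Optional<T> accept(AnnotationVisitor<T> v) { return v.visitString(); }
}

final class EnumAnnotation implements LogicalAnnotation {
    public <T> Optional<T> accept(AnnotationVisitor<T> v) { return v.visitEnum(); }
}

final class OtherAnnotation implements LogicalAnnotation {
    public <T> Optional<T> accept(AnnotationVisitor<T> v) { return v.visitOther(); }
}

final class SchemaTypeSketch {
    // Resolve the Java type for a column: ENUM now gets the same answer as STRING.
    static Optional<Class<?>> resolve(LogicalAnnotation annotation) {
        return annotation.accept(new AnnotationVisitor<Class<?>>() {
            @Override
            public Optional<Class<?>> visitString() {
                return Optional.of(String.class);
            }

            @Override
            public Optional<Class<?>> visitEnum() {
                return Optional.of(String.class);
            }
        });
    }

    public static void main(String[] args) {
        System.out.println("STRING -> " + resolve(new StringAnnotation()));
        System.out.println("ENUM   -> " + resolve(new EnumAnnotation()));
        System.out.println("OTHER  -> " + resolve(new OtherAnnotation()));
    }
}
```

In this sketch an annotation with no override still resolves to an empty Optional, which corresponds to the "Before" behavior described for ENUM.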

Stage 2 — Column data to chunk (ParquetColumnLocation)

Before: No visit(EnumLogicalTypeAnnotation) override existed, so the visitor returned Optional.empty() and the read failed at runtime.

After: A new override delegates to ToStringPage.create(...), the same decoder used for STRING columns.
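
The idea behind Stage 2 can be sketched as follows: because ENUM and STRING values are both UTF-8 bytes in a BINARY column, one decoder can serve both. The names below (PageDecoderSketch, Logical, decoderFor) are assumptions made for this illustration; the actual change delegates to ToStringPage.create(...), whose signature is not shown in this PR description.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

final class PageDecoderSketch {
    // Hypothetical, simplified stand-in for the logical type annotation.
    enum Logical { STRING, ENUM, OTHER }

    // One shared decoder: BINARY bytes interpreted as UTF-8 text.
    static final Function<byte[], String> UTF8_DECODER =
            bytes -> new String(bytes, StandardCharsets.UTF_8);

    // Choose a decoder per annotation; an empty result models the old
    // behavior for ENUM, where no decoder was found and the read failed.
    static Optional<Function<byte[], String>> decoderFor(Logical logical) {
        switch (logical) {
            case STRING:
            case ENUM:
                return Optional.of(UTF8_DECODER);
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] raw = "RED".getBytes(StandardCharsets.UTF_8);
        String decoded = decoderFor(Logical.ENUM)
                .map(decoder -> decoder.apply(raw))
                .orElseThrow(() -> new IllegalStateException("no decoder for column"));
        System.out.println(decoded);
    }
}
```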

Stage 3 — Pushdown statistics (MinMaxFromStatistics)

Before: getMinMaxForStrings accepted only StringLogicalTypeAnnotation, so it returned false for ENUM columns and every filter on them fell back to a full scan.

After: The condition adds || instanceof EnumLogicalTypeAnnotation, enabling min/max pushdown for ENUM columns.
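
To show why accepting ENUM matters for pushdown, here is a minimal sketch of an equality filter that skips a row group when the target value lies outside the recorded min/max. The class and method names (EnumMinMaxSketch, hasStringMinMax, canSkipRowGroup) are hypothetical, and the comparison uses String.compareTo, which matches byte-wise ordering only for ASCII values; the real logic lives in MinMaxFromStatistics.

```java
import java.nio.charset.StandardCharsets;

final class EnumMinMaxSketch {
    // Hypothetical, simplified stand-in for the logical type annotation.
    enum Logical { STRING, ENUM, OTHER }

    // The widened condition: ENUM statistics are usable wherever STRING ones are.
    static boolean hasStringMinMax(Logical logical) {
        return logical == Logical.STRING || logical == Logical.ENUM;
    }

    // For a filter "column == target", a row group can be skipped when the
    // target falls outside [min, max]. Without usable statistics we must scan.
    static boolean canSkipRowGroup(Logical logical, byte[] minBytes, byte[] maxBytes, String target) {
        if (!hasStringMinMax(logical)) {
            return false;
        }
        // Assumption: values are ASCII, so String ordering matches byte ordering.
        String min = new String(minBytes, StandardCharsets.UTF_8);
        String max = new String(maxBytes, StandardCharsets.UTF_8);
        return target.compareTo(min) < 0 || target.compareTo(max) > 0;
    }

    public static void main(String[] args) {
        byte[] min = "BLUE".getBytes(StandardCharsets.UTF_8);
        byte[] max = "RED".getBytes(StandardCharsets.UTF_8);
        System.out.println(canSkipRowGroup(Logical.ENUM, min, max, "YELLOW"));
        System.out.println(canSkipRowGroup(Logical.ENUM, min, max, "GREEN"));
        System.out.println(canSkipRowGroup(Logical.OTHER, min, max, "YELLOW"));
    }
}
```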

Tests

  • MinMaxFromStatisticsTest.enumLogicalStatisticsAreMaterialised — unit test verifying ENUM statistics are extracted as strings.
  • ParquetTableReadWriteTest.testReadEnumLogicalTypeAsString — end-to-end test that writes a Parquet file with a BINARY+ENUM column and reads it back, verifying the column materializes as String with correct values.

leochen4891 and others added 3 commits March 11, 2026 16:53
…deephaven#7723)

ENUM-annotated BINARY columns in parquet files produced by external
tools are currently rejected. This test captures the existing behavior
where enum statistics are not materialized as strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…7723)

Parquet ENUM is physically identical to STRING (UTF-8 encoded BINARY).
Files produced by external tools (e.g. Spark) may use this annotation.

- ParquetSchemaReader: map EnumLogicalTypeAnnotation to String.class
- ParquetColumnLocation: add visitor delegating to ToStringPage
- MinMaxFromStatistics: accept EnumLogicalTypeAnnotation for min/max
- Update test to verify enum statistics are now materialized

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ring

Writes a Parquet file with a BINARY+ENUM column using Deephaven's own
ParquetFileWriter, then reads it back and verifies the column is
materialized as String with correct values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions bot commented Mar 19, 2026

No docs changes detected for 84ea207

@leochen4891 leochen4891 changed the title from "Support reading Parquet ENUM logical type as String" to "feat: Support reading Parquet ENUM logical type as String" on Mar 19, 2026

Copilot AI left a comment

Pull request overview

Adds support for the Parquet ENUM logical type by treating it like STRING across Deephaven's Parquet read pipeline, enabling successful reads and statistics pushdown for externally produced Parquet files.

Changes:

  • Map EnumLogicalTypeAnnotation to String during schema-to-Java type resolution.
  • Decode ENUM-annotated BINARY columns using the existing string page decoder.
  • Enable min/max statistics extraction (pushdown) for ENUM logical type and add unit + end-to-end tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java | Maps Parquet ENUM logical type to String in schema interpretation. |
| extensions/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetColumnLocation.java | Adds ENUM logical type visitor to decode values via ToStringPage. |
| extensions/parquet/table/src/main/java/io/deephaven/parquet/table/location/MinMaxFromStatistics.java | Allows string min/max extraction for ENUM logical type to support pushdown. |
| extensions/parquet/table/src/test/java/io/deephaven/parquet/table/location/MinMaxFromStatisticsTest.java | Adds unit test asserting min/max materialization for ENUM-annotated stats. |
| extensions/parquet/table/src/test/java/io/deephaven/parquet/table/ParquetTableReadWriteTest.java | Adds end-to-end test writing a BINARY+ENUM column and reading it back as String. |

- Extract visitStringLike() helper in ParquetSchemaReader so STRING and
  ENUM visitors share identical SpecialType resolution logic. This ensures
  any future Deephaven SpecialType metadata on ENUM columns is handled
  consistently with STRING columns.

- Wrap ParquetFileWriter in try-with-resources in the test so the file is
  always finalised even if an exception is thrown mid-write.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Support for Enum logical type in parquet file