rfc: Implement an API for all Data File Formats #2384
Kurtiscwright wants to merge 1 commit into apache:main
Conversation
> 4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it.
> 5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC.

Encryption work is currently in-flight, so I'd love to understand how we could incorporate that into this plan.
```rust
// Ask the registry for a format-specific write builder, then layer the
// encryption settings on top if an encryption manager is configured.
let mut wb = registry.write_builder(format, output)?;
if let Some(em) = &self.encryption_manager {
    let (dek, aad) = em.create_file_key().await?;
    wb.with_file_encryption_key(&dek);
    wb.with_aad_prefix(&aad);
}
let writer = wb.build().await?;
```
I need to think about whether something like this might make sense.
Thank you for the review. Regarding integrating the File Format and Encryption interfaces, the first question that jumps out at me is: does encryption need different traits or impls depending on the file format?
Essentially, will you need an OrcEncryption module that is separate from a ParquetEncryption module? If yes, then the format-specific encryption trait can be called from the format-specific traits living in the `formats/` folder, which the doc lists later. Otherwise, we can integrate the encryption libraries directly into the WriterBuilder and ReaderBuilder for the encrypt and decrypt calls, respectively.
ORC does have a concept similar to PME (Parquet Modular Encryption), and the spec references this, but it's not implemented in Java. In the case of Parquet, the actual encryption/decryption is handled by arrow-rs in the writer, so it will differ depending on the format. I would hope we could get away with just passing the key and AAD prefix down to the writer, but I'm the kind of person who has to see code to work out how this might look.
Also this could be an interesting read around the spec: https://iceberg.apache.org/docs/nightly/encryption/#appendix-internals-overview
@ggershinsky described the current Java state like this:
> Currently we can encrypt Parquet and Avro data files. Parquet files are encrypted via the native PME mechanism, while Avro files are encrypted via Iceberg's AES GCM Streams (https://iceberg.apache.org/gcm-stream-spec/). But both kinds of encryption require the same parameters: key and aadPrefix. The best doc, I guess, is https://iceberg.apache.org/docs/nightly/encryption/#appendix-internals-overview.

CTTY left a comment:
Hi @Kurtiscwright, thanks for the RFC! I just took a quick look and left some comments.
My biggest concern is limiting the format API to the Arrow ecosystem. It seems OK for now, but it basically forces users to think about Arrow when implementing their own formats.
Curious to hear thoughts from you and others as well.
> 2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly.
> 3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type.

I think this is more of an implementation detail.
> 3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type.
> 4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges.

Is this really necessary for the initial implementation? We only support Parquet for now.
The TCK is not even complete on the Java side, AFAIK.
> Each implementation registers one instance per `DataFileFormat` variant it supports. The `format` method returns that variant. `read_builder` and `write_builder` are the entry points for reading and writing a file. Both return trait objects so that the registry can hand them back from a `DataFileFormat`-keyed lookup.
> The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to `iceberg::spec::Schema`, and the physical schema type to `arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today that would fill the roles Java uses generic parameters for. Arguments for keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are in "Design rationale" below.
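
(For orientation, here is a minimal sketch of the trait shape the quoted paragraph implies, with the fixed types spelled out. The method names follow the quoted RFC text; the exact signatures, and the use of `InputFile`/`OutputFile`, are illustrative assumptions rather than the final API.)

```rust
use iceberg::io::{InputFile, OutputFile};
use iceberg::spec::DataFileFormat;
use iceberg::Result;

/// Illustrative only: one registered instance per supported format, with the
/// data type fixed to Arrow RecordBatch at the read/write boundary.
pub trait FormatModel: Send + Sync {
    /// The DataFileFormat variant this implementation serves.
    fn format(&self) -> DataFileFormat;

    /// Entry point for reading one file; boxed so the registry can hand it
    /// back from a DataFileFormat-keyed lookup.
    fn read_builder(&self, input: InputFile) -> Result<Box<dyn FormatReadBuilder>>;

    /// Entry point for writing one file.
    fn write_builder(&self, output: OutputFile) -> Result<Box<dyn FormatWriteBuilder>>;
}
```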
We don't even have to hardcode it to Arrow's `RecordBatch`. We can use a generic type for the in-memory representation, with Arrow as the default.
I wonder if we need to make that abstraction at this stage? Java engines do have different representations that they operate on, but to my knowledge Arrow is pretty ubiquitous in Rust engines and applications.
I would prefer to keep the data type open. My reasoning:
- Hive: We built our own integration earlier, when there was no straightforward way to convert to Hive’s internal format. As a result, Hive still uses its own readers and writers and effectively has to implement each feature from scratch. They might eventually move to the API if we introduce a major reader feature they don’t want to reimplement.
- Impala: similar situation. Impala also relies on its own readers and writers. They do more custom logic there, so they might not adopt the File Format API directly, but the trade‑off is that they end up lagging behind on new integrations.
- Arrow dependency: for engines that don’t natively operate on Arrow, converting FileFormat → Arrow → engine format is possible but can be quite expensive performance wise. If we only expose Arrow, we may lose those engines because of the overhead.
- Rust ecosystem: I’m aware of Rust‑based engines that don’t use Arrow internally. It would be great to keep the door open for them as well.
How flexible are the Rust API contracts in this regard? If we limited the scope now, how hard would it be to support additional (non‑Arrow) data types later? I understand the desire to keep things simple for now, but I suspect we’ll eventually need a path that allows more than just Arrow.
This is compelling to me! Thanks for the detailed explanation!
+1 to what Peter says above. For a low-level API like the file format, we want to keep iceberg-rust open to other ecosystems and enable it as a kernel implementation.
> Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to the enum. That is a non-breaking addition.
> The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call.
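
(A rough sketch of that shape, using the names from the quoted text; the registration details are illustrative assumptions.)

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};

use iceberg::spec::DataFileFormat;

/// Owned registry: tests construct their own, applications construct one at startup.
pub struct FormatRegistry {
    models: HashMap<DataFileFormat, Arc<dyn FormatModel>>,
}

impl FormatRegistry {
    pub fn new() -> Self {
        Self { models: HashMap::new() }
    }

    /// Register one instance per DataFileFormat variant the model supports.
    pub fn register(&mut self, model: Arc<dyn FormatModel>) {
        self.models.insert(model.format(), model);
    }
}

/// Process-wide default, initialized through OnceLock on first call.
pub fn default_format_registry() -> &'static FormatRegistry {
    static REGISTRY: OnceLock<FormatRegistry> = OnceLock::new();
    REGISTRY.get_or_init(|| {
        let mut registry = FormatRegistry::new();
        // Built-in formats (e.g. the Parquet model) would be registered here.
        registry
    })
}
```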
Would love to see more details on scan planner's API changes. Also what about writing?
Or could we inject the format registry into Catalog instances directly?
> `read_builder` and `write_builder` return `Err(Error { kind: ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error message distinguishes two cases: the format is implemented but its feature flag is disabled in this build, or the format has no implementation in this crate.
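
(Sketched below; `is_known_but_disabled` is a hypothetical helper standing in for however the two cases get distinguished.)

```rust
use std::sync::Arc;

use iceberg::spec::DataFileFormat;
use iceberg::{Error, ErrorKind, Result};

/// Hypothetical: reports whether the format is implemented in this crate but
/// compiled out by a disabled feature flag.
fn is_known_but_disabled(_format: DataFileFormat) -> bool {
    false // placeholder
}

impl FormatRegistry {
    fn model(&self, format: DataFileFormat) -> Result<Arc<dyn FormatModel>> {
        self.models.get(&format).cloned().ok_or_else(|| {
            let msg = if is_known_but_disabled(format) {
                format!("{format:?} is implemented, but its feature flag is disabled in this build")
            } else {
                format!("{format:?} has no FormatModel implementation in this crate")
            };
            Error::new(ErrorKind::FeatureUnsupported, msg)
        })
    }
}
```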
> ### Feature flags

This could be a non-goal; we only plan to support Parquet for now, and Parquet is essential.
> ```text
> crates/iceberg/src/formats/
> ├── mod.rs       # FormatModel, FormatReadBuilder, FormatWriteBuilder, FormatFileWriter
> ├── registry.rs  # FormatRegistry, default_format_registry
> └── parquet.rs   # ParquetFormatModel, wrapping existing ParquetWriter and ArrowReader
> ```

nit: This should be `formats/parquet/mod.rs`?
> Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one row type: Arrow `RecordBatch`. Every writer accepts it, and every reader returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, Lance) produces anything other than `RecordBatch` at the Iceberg-facing boundary. No engine integration in iceberg-rust today brings an engine-native row type the way Spark and Flink do in Java.
> Adding a `D` parameter today means writing `<RecordBatch>` everywhere the trait is used, for no present caller benefit. If a future format cannot bridge to Arrow, the trait can gain an associated type with a default, which is a semver-compatible addition.

> the trait can gain an associated type with a default

I think we can do it in the initial cut, but I have not investigated the required effort for this
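
(One wrinkle: associated type defaults are still unstable in Rust, so on the stable channel the equivalent move is probably a defaulted generic type parameter, sketched below; the names are illustrative assumptions.)

```rust
use arrow_array::RecordBatch;
use iceberg::spec::DataFileFormat;

/// Illustrative: a defaulted type parameter keeps `dyn FormatModel` meaning
/// the Arrow case, while a non-Arrow format could later implement
/// `FormatModel<MyBatch>` without breaking existing callers.
pub trait FormatModel<D = RecordBatch>: Send + Sync {
    fn format(&self) -> DataFileFormat;
    // read_builder / write_builder would be parameterized over D as well.
}
```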
> Feature flags work across `cargo check`, `cargo miri`, cross-compilation, and static linking. `datafusion`, `opendal`, and `reqwest` use the same pattern.

> #### RecordBatch as the canonical data type

Is this the same as the first point?
> 5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline.
> ## Non-Goals

Writing V2 tables (position delete files) is also out of scope. Am I right?
> 3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`.
> 4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it.

How are equality delete writers implemented in Rust?
In Java we added an extra abstraction layer on top of FileWriterBuilder so the registry-provided writer can serve both DataFileWriters and EqualityDeleteFileWriters. This works because, aside from configuration differences, both writers ultimately do the same thing: write records.
> `FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. `FormatModel` produces them, and the caller consumes them with `build`.
>
> ```rust
> pub trait FormatReadBuilder: Send {

There are a few methods that seem to be missing compared to the Java API:
- set / setAll – Do we want to allow pushing configuration down to the readers (e.g., prefetching, parallel column reads, or page index usage)? If so, having a set-style method to propagate such configs to the reader could be useful.
- reuseContainers – Does Rust support container reuse? In Java, reusing the same objects across reads (instead of allocating new ones each time) can bring significant performance benefits.
- idToConstants – In Java, the reader returns full rows, but some values are not stored in the data file itself and instead come from file metadata. IdToConstants supplies those values so they can be populated in the returned rows.
- nameMapping – In Java, this is used when reading data files that were not written by Iceberg, where column IDs are missing and only column names are available.
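
(For discussion, roughly how those could be spelled on the Rust side; the method names and the use of the spec's `Literal` and `NameMapping` types are assumptions, not RFC text.)

```rust
use std::collections::HashMap;

use iceberg::spec::{Literal, NameMapping};

/// Possible additions to FormatReadBuilder mirroring the Java methods above
/// (illustrative signatures only).
pub trait FormatReadBuilderExt: Send {
    /// set / set_all: push format-specific options (prefetching, parallel
    /// column reads, page index usage, ...) down to the reader.
    fn set(&mut self, key: &str, value: &str) -> &mut Self;
    fn set_all(&mut self, props: HashMap<String, String>) -> &mut Self;

    /// id_to_constants: values not stored in the data file itself, supplied
    /// from file metadata and keyed by field id.
    fn id_to_constants(&mut self, constants: HashMap<i32, Literal>) -> &mut Self;

    /// name_mapping: resolve columns by name for files written without
    /// Iceberg field ids.
    fn name_mapping(&mut self, mapping: NameMapping) -> &mut Self;
}
```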
The engineSchema might also be necessary for Variant handling, if the engine wants to control how a Variant is read (e.g., specify the shredding layout). This would avoid leaking the file’s internal structure into the reader and would also eliminate the need to reassemble shredded variants if the file stores them shredded and the engine prefers to consume them in that form.
> ```rust
>     fn build(self: Box<Self>) -> BoxFuture<'static, Result<ArrowRecordBatchStream>>;
> }
>
> pub trait FormatWriteBuilder: Send {
> ```

There are a few methods missing compared to the Java API:
- engineSchema – The engine could specify how a Variant is currently shredded, or indicate that values are stored, for example, as int in the records. The writer could then use this information to read or materialize them differently (e.g., as long) when writing.
- meta – File-level metadata is often type-specific. In the Java implementation we record some information mainly for debuggability, such as the file type or the equality field IDs for an equality-delete file.
- metricsConfig – In Java, the writer is responsible for collecting file metrics, so the configuration must be passed down to ensure the correct metrics are produced.
- overwrite – In Java, users can request that existing output files be overwritten.
- Encryption settings – For example fileEncryptionKey and aadPrefix, which are required when encryption is enabled.
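
(Sketched the same way; the names and the placeholder `MetricsConfig` type are assumptions, not RFC text.)

```rust
use arrow_schema::SchemaRef as ArrowSchemaRef;

/// Placeholder for a per-column metrics configuration type.
pub struct MetricsConfig;

/// Possible additions to FormatWriteBuilder mirroring the Java methods above
/// (illustrative signatures only).
pub trait FormatWriteBuilderExt: Send {
    /// engine_schema: how the engine represents the written data, e.g. a
    /// shredded Variant layout or int values to be materialized as long.
    fn engine_schema(&mut self, schema: ArrowSchemaRef) -> &mut Self;

    /// meta: file-level key/value metadata, mainly for debuggability
    /// (file type, equality field ids, ...).
    fn meta(&mut self, key: &str, value: &str) -> &mut Self;

    /// metrics_config: which column metrics the writer should collect.
    fn metrics_config(&mut self, config: MetricsConfig) -> &mut Self;

    /// overwrite: replace an existing output file instead of failing.
    fn overwrite(&mut self, overwrite: bool) -> &mut Self;

    /// Encryption settings, required when table encryption is enabled.
    fn file_encryption_key(&mut self, key: &[u8]) -> &mut Self;
    fn aad_prefix(&mut self, aad: &[u8]) -> &mut Self;
}
```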
> #### No generic over the data type
> Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one row type: Arrow `RecordBatch`. Every writer accepts it, and every reader returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, Lance) produces anything other than `RecordBatch` at the Iceberg-facing boundary. No engine integration in iceberg-rust today brings an engine-native row type the way Spark and Flink do in Java.

I think this limitation will prevent integrations in the long run. How hard would it be to change this if we face new use-cases?
> Java's `S` parameter serves Spark's `StructType`, Flink's `RowType`, and the other engine-native schema types. iceberg-rust has `iceberg::spec::Schema` for the logical schema and `arrow::datatypes::SchemaRef` for the physical schema, with conversion in `arrow/schema.rs`. There is no third schema type a generic parameter would serve.
> Variant shredding is the concrete use case that Java's `engineProjection(S)` method addresses. `FormatReadBuilder` can later gain an `engine_projection(&mut self, schema: ArrowSchemaRef) -> &mut Self` method with a default no-op implementation. Format implementations that support shredding override it. The parameter is `ArrowSchemaRef`, not a generic `S`, because Arrow is the only engine schema in iceberg-rust.

`engineProjection(S)` could also be used for requesting a `long` when the Iceberg schema defines an `int`.
> #### RecordBatch as the canonical data type
> Arrow `RecordBatch` is the interchange format for every Rust data ecosystem iceberg-rust plausibly integrates with. DataFusion consumes it directly. Polars bridges to Arrow with zero-copy conversion for primitive columns. Lance uses its own columnar layout internally and exposes Arrow at read and write boundaries. Every integration and near-term format proposal either produces Arrow or bridges to it.

I’m skeptical about this.
Is the Rust ecosystem really that centered around Arrow, or are we just seeing survival bias? It feels possible that we mainly hear from Arrow users simply because anyone who wasn’t comfortable with Arrow has already abandoned the Iceberg Rust connector.
> ## Conformance tests (TCK)
> A new `FormatModel` implementation should pass a shared conformance test suite before it merges. Java has one (`BaseFormatModelTests<T>`) but it is intentionally minimal: a single struct-of-primitives data generator, no V3 types, no schema evolution, no metadata columns. Issue [#15415](https://github.com/apache/iceberg/issues/15415) tracks the gaps.

The BaseFormatModelTests were intentionally kept minimal to limit the scope of that large PR. We’re actively adding more tests there now. It would also be useful to align on the test cases and potentially create/run cross‑implementation tests as well.
Which issue does this PR close?
What changes are included in this PR?
This RFC proposes a File Format API for the iceberg-rust crate that decouples Iceberg's read and write paths from any single file format. Today, iceberg-rust can only read and write Parquet data files: the format is hard-wired into `ArrowReader`, `ParquetWriter`, and every layer that touches them. The Java project shipped an analogous abstraction (`FormatModel`) in February 2026 via "Core, Data: File Format API interfaces" (apache/iceberg#12774), and PyIceberg has an open proposal, "File Format API for PyIceberg" (apache/iceberg-python#3100), for the same concept.
Are these changes tested?