rfc: Implement an API for all Data File Formats #2384
Kurtiscwright wants to merge 1 commit into apache:main
Conversation
> 4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it.
> 5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC.

Encryption work is currently in-flight, so I'd love to understand how we could incorporate that into this plan.
```rust
// Ask the registry for a format-specific write builder, then layer the
// encryption settings on top if an encryption manager is configured.
let mut wb = registry.write_builder(format, output)?;
if let Some(em) = &self.encryption_manager {
    let (dek, aad) = em.create_file_key().await?;
    wb.with_file_encryption_key(&dek);
    wb.with_aad_prefix(&aad);
}
let writer = wb.build().await?;
```
I need to think about whether something like this might make sense.
Thank you for the review. Regarding integrating the File Format and Encryption interfaces, the first question that jumps out at me is: does encryption need different traits or impls depending on the file format?
Essentially, will you need an OrcEncryption module that is separate from a ParquetEncryption module? If yes, then the format-specific encryption trait can be called from the format-specific traits living in the `formats/` folder, which the doc lists later. Otherwise, we can integrate the encryption libraries directly into the WriterBuilder and ReaderBuilder for the encrypt and decrypt calls, respectively.
ORC does have a concept similar to PME (Parquet Modular Encryption), and the spec references this, but it's not implemented in Java. In the case of Parquet, the actual encryption/decryption is handled by arrow-rs in the writer, so it will differ depending on the format. I would hope we could get away with just passing the key and AAD prefix down to the writer, but I'm the kind of person who has to see code to work out how this might look.
Also this could be an interesting read around the spec: https://iceberg.apache.org/docs/nightly/encryption/#appendix-internals-overview
@ggershinsky described the current Java state like this:
> Currently we can encrypt Parquet and Avro data files. Parquet files are encrypted via the native PME mechanism, while Avro files are encrypted via Iceberg's AES GCM Streams (https://iceberg.apache.org/gcm-stream-spec/). But both kinds of encryption require the same parameters: key and aadPrefix. The best doc, I guess, is https://iceberg.apache.org/docs/nightly/encryption/#appendix-internals-overview.

CTTY left a comment:
Hi @Kurtiscwright, thanks for the RFC! I just took a quick look and left some comments.
My biggest concern is limiting the format API to the Arrow ecosystem. It seems OK for now, but it basically forces users to think about Arrow when implementing their own formats.
Curious to hear thoughts from you and others as well.
> 2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly.
> 3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type.

I think this is more of an implementation detail.
> 3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type.
> 4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges.

Is this really necessary for the initial implementation? We only support Parquet for now.
The TCK is not even complete on the Java side, AFAIK.
> Each implementation registers one instance per `DataFileFormat` variant it supports. The `format` method returns that variant. `read_builder` and `write_builder` are the entry points for reading and writing a file. Both return trait objects so that the registry can hand them back from a `DataFileFormat`-keyed lookup.
> The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to `iceberg::spec::Schema`, and the physical schema type to `arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today that would fill the roles Java uses generic parameters for. Arguments for keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are in "Design rationale" below.
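
(For orientation, here is a minimal sketch of the trait shape the quoted paragraph implies, with the fixed types spelled out. The method names follow the quoted RFC text; the exact signatures, and the use of `InputFile`/`OutputFile`, are illustrative assumptions rather than the final API.)

```rust
use iceberg::io::{InputFile, OutputFile};
use iceberg::spec::DataFileFormat;
use iceberg::Result;

/// Illustrative only: one registered instance per supported format, with the
/// data type fixed to Arrow RecordBatch at the read/write boundary.
pub trait FormatModel: Send + Sync {
    /// The DataFileFormat variant this implementation serves.
    fn format(&self) -> DataFileFormat;

    /// Entry point for reading one file; boxed so the registry can hand it
    /// back from a DataFileFormat-keyed lookup.
    fn read_builder(&self, input: InputFile) -> Result<Box<dyn FormatReadBuilder>>;

    /// Entry point for writing one file.
    fn write_builder(&self, output: OutputFile) -> Result<Box<dyn FormatWriteBuilder>>;
}
```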
We don't even have to hardcode it to Arrow's `RecordBatch`. We can use a generic type for the in-memory representation, with Arrow as the default.
I wonder if we need to make that abstraction at this stage? Java engines do have different representations that they operate on, but to my knowledge Arrow is pretty ubiquitous in Rust engines and applications.
I would prefer to keep the data type open. My reasoning:
- Hive: We built our own integration earlier, when there was no straightforward way to convert to Hive’s internal format. As a result, Hive still uses its own readers and writers and effectively has to implement each feature from scratch. They might eventually move to the API if we introduce a major reader feature they don’t want to reimplement.
- Impala: similar situation. Impala also relies on its own readers and writers. They do more custom logic there, so they might not adopt the File Format API directly, but the trade‑off is that they end up lagging behind on new integrations.
- Arrow dependency: for engines that don’t natively operate on Arrow, converting FileFormat → Arrow → engine format is possible but can be quite expensive performance wise. If we only expose Arrow, we may lose those engines because of the overhead.
- Rust ecosystem: I’m aware of Rust‑based engines that don’t use Arrow internally. It would be great to keep the door open for them as well.
How flexible are the Rust API contracts in this regard? If we limited the scope now, how hard would it be to support additional (non‑Arrow) data types later? I understand the desire to keep things simple for now, but I suspect we’ll eventually need a path that allows more than just Arrow.
This is compelling to me! Thanks for the detailed explanation!
+1 to what Peter says above. For a low-level API like the file format, we want to keep iceberg-rust open to other ecosystems and enable it as a kernel implementation.
> Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to the enum. That is a non-breaking addition.
> The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call.
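
(A rough sketch of that shape, using the names from the quoted text; the registration details are illustrative assumptions.)

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};

use iceberg::spec::DataFileFormat;

/// Owned registry: tests construct their own, applications construct one at startup.
pub struct FormatRegistry {
    models: HashMap<DataFileFormat, Arc<dyn FormatModel>>,
}

impl FormatRegistry {
    pub fn new() -> Self {
        Self { models: HashMap::new() }
    }

    /// Register one instance per DataFileFormat variant the model supports.
    pub fn register(&mut self, model: Arc<dyn FormatModel>) {
        self.models.insert(model.format(), model);
    }
}

/// Process-wide default, initialized through OnceLock on first call.
pub fn default_format_registry() -> &'static FormatRegistry {
    static REGISTRY: OnceLock<FormatRegistry> = OnceLock::new();
    REGISTRY.get_or_init(|| {
        let mut registry = FormatRegistry::new();
        // Built-in formats (e.g. the Parquet model) would be registered here.
        registry
    })
}
```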
Would love to see more details on scan planner's API changes. Also what about writing?
Or could we inject the format registry into Catalog instances directly?
> `read_builder` and `write_builder` return `Err(Error { kind: ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error message distinguishes two cases: the format is implemented but its feature flag is disabled in this build, or the format has no implementation in this crate.
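
(Sketched below; `is_known_but_disabled` is a hypothetical helper standing in for however the two cases get distinguished.)

```rust
use std::sync::Arc;

use iceberg::spec::DataFileFormat;
use iceberg::{Error, ErrorKind, Result};

/// Hypothetical: reports whether the format is implemented in this crate but
/// compiled out by a disabled feature flag.
fn is_known_but_disabled(_format: DataFileFormat) -> bool {
    false // placeholder
}

impl FormatRegistry {
    fn model(&self, format: DataFileFormat) -> Result<Arc<dyn FormatModel>> {
        self.models.get(&format).cloned().ok_or_else(|| {
            let msg = if is_known_but_disabled(format) {
                format!("{format:?} is implemented, but its feature flag is disabled in this build")
            } else {
                format!("{format:?} has no FormatModel implementation in this crate")
            };
            Error::new(ErrorKind::FeatureUnsupported, msg)
        })
    }
}
```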
> ### Feature flags

This could be a non-goal; we only plan to support Parquet for now, and Parquet is essential.
> ```text
> crates/iceberg/src/formats/
> ├── mod.rs       # FormatModel, FormatReadBuilder, FormatWriteBuilder, FormatFileWriter
> ├── registry.rs  # FormatRegistry, default_format_registry
> └── parquet.rs   # ParquetFormatModel, wrapping existing ParquetWriter and ArrowReader
> ```

nit: This should be `formats/parquet/mod.rs`?
> Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one row type: Arrow `RecordBatch`. Every writer accepts it, and every reader returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, Lance) produces anything other than `RecordBatch` at the Iceberg-facing boundary. No engine integration in iceberg-rust today brings an engine-native row type the way Spark and Flink do in Java.
> Adding a `D` parameter today means writing `<RecordBatch>` everywhere the trait is used, for no present caller benefit. If a future format cannot bridge to Arrow, the trait can gain an associated type with a default, which is a semver-compatible addition.

> the trait can gain an associated type with a default

I think we can do it in the initial cut, but I have not investigated the required effort for this
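
(One wrinkle: associated type defaults are still unstable in Rust, so on the stable channel the equivalent move is probably a defaulted generic type parameter, sketched below; the names are illustrative assumptions.)

```rust
use arrow_array::RecordBatch;
use iceberg::spec::DataFileFormat;

/// Illustrative: a defaulted type parameter keeps `dyn FormatModel` meaning
/// the Arrow case, while a non-Arrow format could later implement
/// `FormatModel<MyBatch>` without breaking existing callers.
pub trait FormatModel<D = RecordBatch>: Send + Sync {
    fn format(&self) -> DataFileFormat;
    // read_builder / write_builder would be parameterized over D as well.
}
```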
> Feature flags work across `cargo check`, `cargo miri`, cross-compilation, and static linking. `datafusion`, `opendal`, and `reqwest` use the same pattern.

> #### RecordBatch as the canonical data type

Is this the same as the first point?
> 5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline.
> ## Non-Goals

Writing V2 tables (position delete files) is also out of scope. Am I right?
> 3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`.
> 4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it.

How are equality delete writers implemented in Rust?
In Java we added an extra abstraction layer on top of FileWriterBuilder so the registry-provided writer can serve both DataFileWriters and EqualityDeleteFileWriters. This works because, aside from configuration differences, both writers ultimately do the same thing: write records.
> `FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. `FormatModel` produces them, and the caller consumes them with `build`.
>
> ```rust
> pub trait FormatReadBuilder: Send {

There are a few methods that seem to be missing compared to the Java API:
- set / setAll – Do we want to allow pushing configuration down to the readers (e.g., prefetching, parallel column reads, or page index usage)? If so, having a set-style method to propagate such configs to the reader could be useful.
- reuseContainers – Does Rust support container reuse? In Java, reusing the same objects across reads (instead of allocating new ones each time) can bring significant performance benefits.
- idToConstants – In Java, the reader returns full rows, but some values are not stored in the data file itself and instead come from file metadata. IdToConstants supplies those values so they can be populated in the returned rows.
- nameMapping – In Java, this is used when reading data files that were not written by Iceberg, where column IDs are missing and only column names are available.
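
(For discussion, roughly how those could be spelled on the Rust side; the method names and the use of the spec's `Literal` and `NameMapping` types are assumptions, not RFC text.)

```rust
use std::collections::HashMap;

use iceberg::spec::{Literal, NameMapping};

/// Possible additions to FormatReadBuilder mirroring the Java methods above
/// (illustrative signatures only).
pub trait FormatReadBuilderExt: Send {
    /// set / set_all: push format-specific options (prefetching, parallel
    /// column reads, page index usage, ...) down to the reader.
    fn set(&mut self, key: &str, value: &str) -> &mut Self;
    fn set_all(&mut self, props: HashMap<String, String>) -> &mut Self;

    /// id_to_constants: values not stored in the data file itself, supplied
    /// from file metadata and keyed by field id.
    fn id_to_constants(&mut self, constants: HashMap<i32, Literal>) -> &mut Self;

    /// name_mapping: resolve columns by name for files written without
    /// Iceberg field ids.
    fn name_mapping(&mut self, mapping: NameMapping) -> &mut Self;
}
```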
The engineSchema might also be necessary for Variant handling, if the engine wants to control how a Variant is read (e.g., specify the shredding layout). This would avoid leaking the file’s internal structure into the reader and would also eliminate the need to reassemble shredded variants if the file stores them shredded and the engine prefers to consume them in that form.
> ```rust
>     fn build(self: Box<Self>) -> BoxFuture<'static, Result<ArrowRecordBatchStream>>;
> }
>
> pub trait FormatWriteBuilder: Send {
> ```

There are a few methods missing compared to the Java API:
- engineSchema – The engine could specify how a Variant is currently shredded, or indicate that values are stored, for example, as int in the records. The writer could then use this information to read or materialize them differently (e.g., as long) when writing.
- meta – File-level metadata is often type-specific. In the Java implementation we record some information mainly for debuggability, such as the file type or the equality field IDs for an equality-delete file.
- metricsConfig – In Java, the writer is responsible for collecting file metrics, so the configuration must be passed down to ensure the correct metrics are produced.
- overwrite – In Java, users can request that existing output files be overwritten.
- Encryption settings – For example fileEncryptionKey and aadPrefix, which are required when encryption is enabled.
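
(Sketched the same way; the names and the placeholder `MetricsConfig` type are assumptions, not RFC text.)

```rust
use arrow_schema::SchemaRef as ArrowSchemaRef;

/// Placeholder for a per-column metrics configuration type.
pub struct MetricsConfig;

/// Possible additions to FormatWriteBuilder mirroring the Java methods above
/// (illustrative signatures only).
pub trait FormatWriteBuilderExt: Send {
    /// engine_schema: how the engine represents the written data, e.g. a
    /// shredded Variant layout or int values to be materialized as long.
    fn engine_schema(&mut self, schema: ArrowSchemaRef) -> &mut Self;

    /// meta: file-level key/value metadata, mainly for debuggability
    /// (file type, equality field ids, ...).
    fn meta(&mut self, key: &str, value: &str) -> &mut Self;

    /// metrics_config: which column metrics the writer should collect.
    fn metrics_config(&mut self, config: MetricsConfig) -> &mut Self;

    /// overwrite: replace an existing output file instead of failing.
    fn overwrite(&mut self, overwrite: bool) -> &mut Self;

    /// Encryption settings, required when table encryption is enabled.
    fn file_encryption_key(&mut self, key: &[u8]) -> &mut Self;
    fn aad_prefix(&mut self, aad: &[u8]) -> &mut Self;
}
```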
> #### No generic over the data type
> Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one row type: Arrow `RecordBatch`. Every writer accepts it, and every reader returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, Lance) produces anything other than `RecordBatch` at the Iceberg-facing boundary. No engine integration in iceberg-rust today brings an engine-native row type the way Spark and Flink do in Java.

I think this limitation will prevent integrations in the long run. How hard would it be to change this if we face new use-cases?
> Java's `S` parameter serves Spark's `StructType`, Flink's `RowType`, and the other engine-native schema types. iceberg-rust has `iceberg::spec::Schema` for the logical schema and `arrow::datatypes::SchemaRef` for the physical schema, with conversion in `arrow/schema.rs`. There is no third schema type a generic parameter would serve.
> Variant shredding is the concrete use case that Java's `engineProjection(S)` method addresses. `FormatReadBuilder` can later gain an `engine_projection(&mut self, schema: ArrowSchemaRef) -> &mut Self` method with a default no-op implementation. Format implementations that support shredding override it. The parameter is `ArrowSchemaRef`, not a generic `S`, because Arrow is the only engine schema in iceberg-rust.

`engineProjection(S)` could also be used for requesting a `long` when the Iceberg schema defines an `int`.
> #### RecordBatch as the canonical data type
> Arrow `RecordBatch` is the interchange format for every Rust data ecosystem iceberg-rust plausibly integrates with. DataFusion consumes it directly. Polars bridges to Arrow with zero-copy conversion for primitive columns. Lance uses its own columnar layout internally and exposes Arrow at read and write boundaries. Every integration and near-term format proposal either produces Arrow or bridges to it.

I’m skeptical about this.
Is the Rust ecosystem really that centered around Arrow, or are we just seeing survival bias? It feels possible that we mainly hear from Arrow users simply because anyone who wasn’t comfortable with Arrow has already abandoned the Iceberg Rust connector.
> ## Conformance tests (TCK)
> A new `FormatModel` implementation should pass a shared conformance test suite before it merges. Java has one (`BaseFormatModelTests<T>`) but it is intentionally minimal: a single struct-of-primitives data generator, no V3 types, no schema evolution, no metadata columns. Issue [#15415](https://github.com/apache/iceberg/issues/15415) tracks the gaps.

The BaseFormatModelTests were intentionally kept minimal to limit the scope of that large PR. We’re actively adding more tests there now. It would also be useful to align on the test cases and potentially create/run cross‑implementation tests as well.
Which issue does this PR close?
What changes are included in this PR?
This RFC proposes a File Format API for the iceberg-rust crate that decouples Iceberg's read and write paths from any single file format. Today, iceberg-rust can only read and write Parquet data files: the format is hard-wired into `ArrowReader`, `ParquetWriter`, and every layer that touches them. The Java project shipped an analogous abstraction (`FormatModel`) in February 2026 via "Core, Data: File Format API interfaces" (apache/iceberg#12774), and PyIceberg has an open proposal, "File Format API for PyIceberg" (apache/iceberg-python#3100), for the same concept.
Are these changes tested?