Spec v3: Add deletion vectors to the table spec #11240
Conversation
format/puffin-spec.md (Outdated)

> - `ndv`: estimate of number of distinct values, derived from the sketch.
>
> #### `delete-vector-v1` blob type
These changes are here for reference, but are being discussed separately in #11238.
format/spec.md (Outdated)

> To test whether a certain position is set, its most significant 4 bytes (the key) are used to find a 32-bit bitmap and the least significant 4 bytes are tested for inclusion in the bitmap. If a bitmap is not found for the key, then it is not set.
>
> Delete manifests track deletion vectors individually by the containing file location, starting offset of the DV magic bytes, and total length of the deletion vector blob. Multiple deletion vectors can be stored in the same file. There are no restrictions on the data files that can be referenced by deletion vectors in the same file.
Suggested change:

> Delete manifests track deletion vectors individually by the containing file location, starting offset of the DV magic bytes, and total length of the deletion vector blob. Multiple deletion vectors can be stored in the same file. There are no restrictions on which data files can be referenced by deletion vectors in the same puffin file.

I think this should use "that" instead of "which" because "no restrictions on which" is unclear. The phrase "on which" typically refers to the preceding noun, as in the phrase "restrictions on which atomicity depends".

I'll update it to call out that these are "Puffin" files.
format/spec.md (Outdated)

> | _optional_ | _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
> | | _optional_ | _optional_ | **`143 referenced_data_file`** | `string` | Fully qualified location (URI with FS scheme) of a data file that all deletes reference [4] |
> | | | _optional_ | **`144 blob_offset`** | `long` | The offset in the file where the content starts [5] |
> | | | _optional_ | **`145 blob_size_in_bytes`** | `long` | The length of a referenced blob stored in the file [5] |
These last 3 lines are the only changes. Otherwise the rows are the same but with the v2 requirements copied to the new v3 column.
Is there a reason `blob_size_in_bytes` is needed? Isn't `blob_offset` sufficient? If this is an optimization to bypass reading the footer, then it seems that compression would also need to be recorded here, or explicitly disallowed for the bitmaps. If parsing the Puffin footer is expected, it seems sufficient to store the index of the blob.
Also, line 464 is a change.
We need `blob_size_in_bytes` so that we can compare the total Puffin file size to the size of the individual DV blob without opening the Puffin file. It will be useful for maintenance and for generating accurate snapshot stats.
I also think the answer to my question above is that blobs referenced here are always assumed to be uncompressed.
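To make the point about uncompressed, individually-tracked blobs concrete, here is a minimal sketch of what `blob_size_in_bytes` enables for maintenance; the helper name and entry layout are hypothetical, not part of the spec or the library.

```python
def dv_live_fraction(puffin_file_size: int, live_blob_sizes: list[int]) -> float:
    # blob_size_in_bytes values come straight from delete manifest entries;
    # the remainder of the file is footer overhead plus unreferenced blobs.
    return sum(live_blob_sizes) / puffin_file_size

# Example: only 20% of this 10 MB Puffin file is live DV data, so a
# maintenance job might rewrite it without ever opening the file.
assert dv_live_fraction(10_000_000, [1_500_000, 500_000]) == 0.2
```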
> * A deletion vector must be applied to a data file when all of the following are true:
>     - The data file's `file_path` is equal to the deletion vector's `referenced_data_file`
>     - The data file's data sequence number is _less than or equal to_ the deletion vector's data sequence number
>     - The data file's partition (both spec and partition values) is equal [4] to the deletion vector's partition
Not sure why we need this requirement; how would it be possible for the first 2 conditions to be true but not this one?
Yeah, I debated whether to keep this or not and ended up deciding that it is helpful. The second requirement should also be impossible if the first requirement is true.
I kept both of these because I want people reading the spec to understand that these can be relied on in the scan planning algorithm. If this only said that the file_path must match, then implementers may think that they need to consider all deletion vectors without first filtering by partition and data sequence number.
Using the correct spec and partition for deletion vectors is important so that we can skip them using partition predicates, just like we skip position delete files today.
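As a sketch of how the three quoted rules combine during scan planning, the check below is an illustrative stand-in; the dataclass names are not actual Iceberg library types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    file_path: str
    data_sequence_number: int
    spec_id: int
    partition: tuple

@dataclass(frozen=True)
class DeletionVector:
    referenced_data_file: str
    data_sequence_number: int
    spec_id: int
    partition: tuple

def dv_applies(dv: DeletionVector, data: DataFile) -> bool:
    # All three rules must hold. As noted above, the sequence number and
    # partition checks let planners prune candidate DVs cheaply (e.g. with
    # partition predicates) before comparing file paths.
    return (
        data.file_path == dv.referenced_data_file
        and data.data_sequence_number <= dv.data_sequence_number
        and (data.spec_id, data.partition) == (dv.spec_id, dv.partition)
    )
```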
> Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table.
>
> These vectors are stored using the `delete-vector-v1` blob definition from the [Puffin spec][puffin-spec].
>
> Deletion vectors support positive 64-bit positions, but are optimized for cases where most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a 32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using the least significant 4 bytes. For each key in the set of positions, a 32-bit Roaring bitmap is maintained to store a set of 32-bit sub-positions for that key.
nit: I think this paragraph and the next one are somewhat redundant with the Puffin spec (or, if they do provide additional information, I think consolidating it in Puffin makes sense, since the operations would likely change for a v2 bitmap).
Isn't the context important here? I think it makes sense to have it in both places because they describe the same thing.
I'd say it is helpful to have it here.
I don't feel strongly, but I think all of the details on optimization and lookup probably belong as an implementation detail in Puffin. As stated, there isn't enough information to actually implement reading the file, and if the underlying structure changes these sections would become irrelevant. I think the high-level logic is that deletion vectors represent row IDs that should be deleted: the presence of a row number in the vector indicates the row should not be read.
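For readers who want the lookup spelled out, here is a minimal sketch of the key/sub-position test described in the quoted paragraphs, using plain Python sets as stand-ins for 32-bit Roaring bitmaps.

```python
def is_deleted(dv: dict[int, set[int]], position: int) -> bool:
    # Split the 64-bit position: most significant 4 bytes are the key,
    # least significant 4 bytes are the sub-position.
    key = position >> 32
    sub_position = position & 0xFFFFFFFF
    bitmap = dv.get(key)
    # If no bitmap exists for the key, the position is not set.
    return bitmap is not None and sub_position in bitmap

dv = {0: {7, 9}, 1: {0}}            # key -> set of 32-bit sub-positions
assert is_deleted(dv, 9)            # key 0, sub-position 9
assert is_deleted(dv, 1 << 32)      # key 1, sub-position 0
assert not is_deleted(dv, 2 << 32)  # no bitmap for key 2
```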
format/spec.md (Outdated)

> | _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
> | v1 | v2 | v3 | Field id, name | Type | Description |
> | ---------- | ---------- | ---------- |-----------------------------------|------------------------------|-------------|
> | | _required_ | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
I don't have a strong opinion, but for completeness did you consider adding 3: Deletion Vector? I think it potentially makes it clearer that the data contained is not tabular in nature but this might not have any real advantage.
I will note that `equality_ids` takes the approach of requiring content type 2 below, so it might be more consistent to have a new enum.
I think we should continue to use 1 for the new position deletes files because they actually contain position deletes. The file format will be puffin, which will make it clear.
@aokolnychyi, after our threads on the Puffin spec, I think the fundamental question here is actually where, and how redundantly, to store the content type. There are three possible places:
- In the footer of Puffin (this is required anyway).
- As a magic number in the content of the deletion vector (currently implemented this way, as it is inherited from Delta Lake).
- Here, as an additional content type (deletion-vector-v1).

After thinking about it, I think it might make some sense to store it here versus again in the content as a magic number (or maybe in addition to the content type). This is for two (mostly minor) reasons:
- It can save 3 bytes.
- It allows feature usage to be introspected from manifests, rather than requiring reading the deletion vector (e.g. if we ever come up with a v2 of the deletion vector, people can look at relative distribution without parsing the Puffin files at all).
Left another proposal on the comment thread about file format naming. Thoughts, @aokolnychyi?
I think I just left a similar thought in the design doc: https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?disco=AAABVb7Ww5k. Is there any complication in adding a new type for clarity? I slightly agree with @emkornfield.
I think we should continue to set this as 1: POSITION DELETES because that's what the content represents. Both file/pos deletes in Parquet/Avro/ORC and deletion vectors are applied to data files the same way during planning (and this spec update helps by introducing referenced_data_file to speed up v2 planning). I think it makes the most sense to track how the deletes are serialized in the format field.
That raises the question about what to use in the format field. I think that puffin fits with the other format names because it is the format of the entire data file. If you wanted to go inspect that file, you'd instantiate a Puffin reader.
Both of these choices fit with how v2 works today. Where this gets a bit tricky is that we are also changing the granularity. Instead of having a single entry per Puffin file, we are tracking individual deletion vectors (using the now renamed content_size_in_bytes and content_offset) to cut down on false positives.
Now that we are pointing to a slice of a Puffin file of position deletes, I think there's an argument that we could/should store the blob type somewhere: that way we have both file type and the type of the content pointed to by the offset. If anything I'd introduce a content_type field for this. But given that the type is always going to be the same and there's an open question about whether we would just detect the type from magic bytes, I would opt to simply omit it for now.
TL;DR: What we're really looking for is a content type for the blob, but we don't need one right now so let's leave our options open and omit it.
format/spec.md (Outdated)

> | ---------- | ---------- | ---------- |-----------------------------------|------------------------------|-------------|
> | | _required_ | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
> | _required_ | _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme |
> | _required_ | _required_ | _required_ | **`101 file_format`** | `string` | String file format name, `avro`, `orc`, `parquet`, or `puffin` |
nit: we should note that `puffin` is only used for storing deletion vectors (this might not be obvious to first-time readers, since deletion vectors are only discussed below this point).
Another question here: if the intent is really to be compatible with Delta Lake, maybe this should just be called "blob", since storage isn't tied to Puffin and users would have the ability to store this in other locations?
After thinking about it some more, I think I'd be more in favor of keeping the content type above and calling this delete-vector-v1-blob. Implementations still must write this as a Puffin file, but this leaves flexibility for readers without violating the spec.
I replied to this in the thread above.
I think this field should continue to be the overall file format (puffin here) and that if anything we want a type for the blob. However, we only have one case right now so let's add that when we need it.
OK, I think we should add a note that `puffin` is only used for storing DVs (maybe a link to the DV section?).
> - The data file's `file_path` is equal to the delete file's `referenced_data_file` if it is non-null
> - The data file's data sequence number is _less than or equal to_ the delete file's data sequence number
> - The data file's partition (both spec and partition values) is equal [4] to the delete file's partition
> - There is no deletion vector that must be applied to the data file (when added, such a vector must contain all deletes from existing position delete files)
Correct, this is how the design doc proposes it.
Does this condition apply to equality delete application as well?
@szehon-ho, this does not apply to equality deletes.
This is a fantastic question to consider. It would be great if we could add a rule like this one for equality deletes because writers that produce positional deletes will almost always apply equality deletes and encode them in the new DVs.
However, there's at least one big issue: concurrent commits would require re-scanning data. Imagine a writer is nearly done with a MERGE that uses DVs and has created a DV with all of the positions previously deleted by both positional and equality delete files. While the metadata update and commit is happening, a new equality delete comes in and commits first. If we were to require that the DV of the MERGE commit replaces the new equality deletes, the MERGE job would potentially need to go re-scan data files in order to find deleted row positions. That could be a very expensive operation because the equality delete could apply to an entire partition of data.
There is also a similar cost for maintaining position deletes, but in that case we have the metadata to load and union DVs quickly, as opposed to scanning potentially hundreds of files for newly deleted records at commit time. Because position deletes are much more targeted, there are fewer false positives (one goal of this update!) and the work to merge them is lower.
This is indeed an interesting option to consider. If we add a way to annotate a DV with the data sequence number of the scan, we can ignore all equality deletes prior to that without requiring re-scanning data during retries.
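As a rough illustration of the commit-time merge discussed above (loading and unioning DVs is cheap relative to re-scanning data), here is a sketch with plain sets standing in for Roaring bitmaps:

```python
def union_dvs(*dvs: dict[int, set[int]]) -> dict[int, set[int]]:
    # OR together the per-key bitmaps; with Roaring bitmaps this is a fast
    # bitwise union rather than a scan over data files.
    merged: dict[int, set[int]] = {}
    for dv in dvs:
        for key, bitmap in dv.items():
            merged.setdefault(key, set()).update(bitmap)
    return merged

existing = {0: {3, 5}}          # positions already deleted for a data file
new_deletes = {0: {8}, 1: {1}}  # positions deleted by the current commit
assert union_dvs(existing, new_deletes) == {0: {3, 5, 8}, 1: {1}}
```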
> Position-based delete files identify deleted rows by file and position in one or more data files, and may optionally contain the deleted row.
>
> _Note: Position delete files are **deprecated** in v3. Existing position deletes must be written to delete vectors when updating the position deletes for a data file._
Is there a technical reason to force deprecation here?
Yes! Deletion vectors are much better for creating a stream of changes from tables. They are also just faster in general so we want to switch over to using them.
There are many optimizations enabled by knowing there is a DV for a data file; this was one of the points in the design doc.
On the flip side, a counterargument is that requiring DVs means more work for writers that want to support some portions of the v3 specification but not upgrade to DVs (despite them being better). It seems that in other places Iceberg has taken the stance that optimizations are not required for writers (e.g. sequence number inheritance).
> optimizations are not required for writers (e.g. sequence number inheritance)

Could you elaborate? Not sure I got it.
My understanding is that snapshot/sequence number inheritance is only turned on via a configuration flag for Iceberg tables for writers. It is not required that writers "must" make the snapshot id null for new files added. Similarly, I'm asking whether it really needs to be made a requirement that DVs must be written, even though they are better.
Actually, Iceberg is fairly strict about writing correct and well-maintained metadata, but we are careful to be backward-compatible. Sequence number and snapshot ID inheritance are required in v2. I think what you're referring to here is that we made snapshot ID inheritance available in v1 with a flag to enable it (and break compatibility with some readers).
I guess what I'm referring to in this case might have been considered a bug that was addressed. But the spec has never mandated that inheritance "must" be used.

I think the question still stands on whether, from a spec perspective, this should be mandated ("must") or recommended ("should"). I guess I'm OK with "must" given the clarification below about when existing delete files are present, but it still feels like we might be overly strict.
The snapshot and sequence number inheritance was always required in V2 tables from the spec perspective; it just took us a bit of time to get rid of the legacy V1 logic everywhere. The flag was there only to enable that behavior in V1 tables, which could break older readers. That's why it was safe to merge the referenced PR, as it simply aligned the implementation with the spec.

We know the new representation is better. I think it is OK to be strict in this case. We were very flexible with the design of equality deletes. For instance, we allowed adding any number of global equality delete files. We thought implementations would be smart and would not misuse this concept. Yet we have seen so many cases where tables were unreadable after engines produced a ton of equality deletes without ever compacting them. I think the new layout forces implementations to provide the best possible behavior for position deletes, as well as enabling various optimizations like CDC generation mentioned by Ryan.
format/spec.md (Outdated)

> | _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
> | | _optional_ | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
> | _optional_ | _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
> | | _optional_ | _optional_ | **`143 referenced_data_file`** | `string` | Fully qualified location (URI with FS scheme) of a data file that all deletes reference [4] |
Suggested change:

> | | | _optional_ | **`143 referenced_data_file`** | `string` | Fully qualified location (URI with FS scheme) of a data file that all deletes reference [4] |

Do we want to support this in v2?
@wgtmac, yes we do want to support it in v2. This is a good optimization for file-scoped deletes that we already support. If you have a position delete file that references only one data file, we can speed up planning by using this.
format/spec.md (Outdated)

> To test whether a certain position is set, its most significant 4 bytes (the key) are used to find a 32-bit bitmap and the least significant 4 bytes (the sub-position) are tested for inclusion in the bitmap. If a bitmap is not found for the key, then it is not set.
>
> Delete manifests track deletion vectors individually by the containing file location (`file_path`), starting offset of the DV magic bytes (`blob_offset`), and total length of the deletion vector blob (`blob_size_in_bytes`). Multiple deletion vectors can be stored in the same file. There are no restrictions on the data files that can be referenced by deletion vectors in the same Puffin file.
What do you mean by "no restrictions"? I thought one DV blob refers to one data file, from the previous section? And obviously you have to make sure the data file has a compatible partition spec and sequence number.
Because each DV blob is tracked by the partition of the data file it references, you can combine DVs across partitions into Puffin files. That allows us to have a lot fewer files but maintain the benefits of having partition-scoped deletes.
Got it. I think the language can be clearer: "no restrictions on the number of data files referenced by deletion vectors in the same Puffin file"?
While that's true, it is more specific. There is no restriction on the data files at all, including the number of data files.
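As an illustration of the point above, the manifest entries for one Puffin file might reference data files in two different partitions; the paths, offsets, and sizes below are made up.

```python
# Two DV manifest entries pointing into the same Puffin file. Each carries the
# partition of the data file it references, so partition pruning still works
# even though the Puffin file itself spans partitions.
dv_entries = [
    {
        "file_path": "s3://bucket/deletes/dvs-001.puffin",
        "referenced_data_file": "s3://bucket/data/date=2024-10-01/f1.parquet",
        "partition": {"date": "2024-10-01"},
        "blob_offset": 4,            # offset of the DV magic bytes
        "blob_size_in_bytes": 2048,
    },
    {
        "file_path": "s3://bucket/deletes/dvs-001.puffin",
        "referenced_data_file": "s3://bucket/data/date=2024-10-02/f2.parquet",
        "partition": {"date": "2024-10-02"},
        "blob_offset": 2052,
        "blob_size_in_bytes": 1024,
    },
]
```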
format/spec.md (Outdated)

> Delete manifests track deletion vectors individually by the containing file location (`file_path`), starting offset of the DV magic bytes (`blob_offset`), and total length of the deletion vector blob (`blob_size_in_bytes`). Multiple deletion vectors can be stored in the same file. There are no restrictions on the data files that can be referenced by deletion vectors in the same Puffin file.
>
> At most one deletion vector is allowed per data file in a table. If a DV is written for a data file, it must replace all previously written position delete files so that when a DV is present, readers can safely ignore matching position delete files.
> At most one deletion vector is allowed per data file in a table.

I guess this is pretty critical, and answers my question in https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?disco=AAABWolDRwg. I just had a question: how is this implemented? In Spark, do we cluster delete writers by data files? Not to block the spec, just curious here.
Yes, one of the v2 optimizations that Anton built is to use file-scoped positional deletes rather than partition-scoped positional deletes. I think we've been recommending this for a while because it makes planning much faster (you can use a map keyed by `referenced_data_file`) and dramatically reduces false positives, making actual runtime faster too.
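A sketch of that planning shortcut: with file-scoped deletes (and DVs), delete entries can be indexed by `referenced_data_file`, so matching becomes a dict lookup rather than a scan over all delete entries. The structures here are illustrative, not the library's actual types.

```python
from collections import defaultdict

def index_deletes(delete_entries: list[dict]) -> dict[str, list[dict]]:
    by_data_file: dict[str, list[dict]] = defaultdict(list)
    for entry in delete_entries:
        by_data_file[entry["referenced_data_file"]].append(entry)
    return by_data_file

entries = [
    {"referenced_data_file": "s3://bucket/data/f1.parquet", "blob_offset": 4},
    {"referenced_data_file": "s3://bucket/data/f2.parquet", "blob_offset": 2052},
]
index = index_deletes(entries)
# O(1) lookup per data file, and no false-positive delete files to open.
assert len(index["s3://bucket/data/f1.parquet"]) == 1
```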
Should we explicitly state that this is for a single snapshot? Each data file may have different DVs in different snapshots.
@wgtmac, good catch. I've updated this to say "snapshot" to be more clear.
> | _optional_ | _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
> | | | _optional_ | **`142 first_row_id`** | `long` | The `_row_id` for the first row in the data file. See [First Row ID Inheritance](#first-row-id-inheritance) |
> | | _optional_ | _optional_ | **`143 referenced_data_file`** | `string` | Fully qualified location (URI with FS scheme) of a data file that all deletes reference [4] |
Do we want to allow V2 writers and readers to populate the referenced data file? I generally don't mind, but we will have to continue to persist the bounds anyway so as not to silently break existing integrations that rely on reconstructing the referenced path from the bounds. Writing the field in addition to the bounds will only make things worse and require additional work to implement.
I think we should allow it. We know that it is better to use referenced_data_file than to use the bounds, so it makes sense to allow it. People may not choose to use it if it is duplication.
I agree. The problem, however, is that we won't switch to referenced_data_file anywhere in the library. Persisting both values won't make sense as it will increase the disk footprint and memory pressure. We can't stop writing the bounds too, as it may break external connectors like Trino.
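For context, the bounds-based reconstruction mentioned here works because a file-scoped position delete file has identical lower and upper bounds for its `file_path` column; if I recall the spec correctly, that column has reserved field id 2147483546. A sketch under that assumption:

```python
FILE_PATH_FIELD_ID = 2147483546  # reserved id of file_path in position delete files

def referenced_file_from_bounds(lower_bounds: dict[int, str],
                                upper_bounds: dict[int, str]) -> str | None:
    # When the bounds collapse to a single value, every delete row in the
    # file references that one data file.
    lo = lower_bounds.get(FILE_PATH_FIELD_ID)
    hi = upper_bounds.get(FILE_PATH_FIELD_ID)
    return lo if lo is not None and lo == hi else None

bounds = {FILE_PATH_FIELD_ID: "s3://bucket/data/f1.parquet"}
assert referenced_file_from_bounds(bounds, bounds) == "s3://bucket/data/f1.parquet"
```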
> class PositionDeleteFile(ContentFile):
>     content: Literal['position-deletes']
>     content_offset: Optional[int] = Field(
What about referenced_data_file?
I left out referenced_data_file because it is unnecessary. Position delete files are sent back through server-side planning as a list. The file scan tasks in the response include a list of indexes into that list to associate files with the deletes that must be applied to them. Adding referenced_data_file would duplicate that information and would be confusing, so I think it is better to omit it.
Note that if we were to add fine-grained commits to the update API, we would need to be able to pass it to the service. We can add that later if/when needed.
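For reference, the truncated model quoted above might be completed along these lines. This is a hedged sketch, not the verbatim OpenAPI-generated code: the base-class fields and descriptions are paraphrased, and `content_size_in_bytes` is taken from the renamed field mentioned earlier in this thread.

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class ContentFile(BaseModel):
    # Simplified: the real REST spec base model carries more fields.
    file_path: str
    file_size_in_bytes: int

class PositionDeleteFile(ContentFile):
    content: Literal['position-deletes']
    content_offset: Optional[int] = Field(
        None, description="Offset within the delete file where the content starts"
    )
    content_size_in_bytes: Optional[int] = Field(
        None, description="Length, in bytes, of the referenced content"
    )
```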
Thanks @rdblue for the great writeup! Thanks to everyone who reviewed and helped shape the doc!

Linking this PR to #11122 for tracking.
Co-authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Co-authored-by: emkornfield <emkornfield@gmail.com>
This is based on the changes to the Puffin spec in #11238.