Conversation

@rdblue (Contributor) commented Sep 30, 2024

This adds a blob type to the Puffin spec that can store a Roaring bitmap delete vector. This is in support of the row-level delete improvements proposed for Iceberg v3.

@github-actions bot added the Specification label (Issues that may introduce spec changes) on Sep 30, 2024
tested for inclusion in the bitmap. If a bitmap is not found for the key, then
it is not set.

The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377
Member

Is there any significance to the sequence here? Is this for compatibility purposes? I don't have a problem with that, just wondering

Contributor

Yeah, it's for compatibility purposes: the Delta deletion vector format uses the same magic number, see https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-format

Contributor Author

I also need to add a length field and a CRC checksum for compatibility. The CRC seems useful and Delta readers assume that the length of the vector + magic bytes is written. I think for compatibility, it's worth it to just add those fields.

Member

If we implement the Delta format, then we should document that fact; it's significant.
A Puffin spec reader/implementor shouldn't need to compare documents bit by bit to come to this conclusion.

The blob metadata must include the following properties:

* `referenced-data-file`: location of the data file the delete vector applies to
* `cardinality`: number of deleted rows (set positions) in the delete vector
Member

May also make sense to store the sequence number here as well.

Contributor Author

We don't know the sequence number when writing the file because it hasn't yet been assigned. That must come from the metadata that tracks these blobs.

Contributor

I think the BlobMetadata section requires refactoring as it currently calls out both snapshot and sequence number as required fields.

Contributor

The plan was to set sequence number and snapshot ID to -1 as they will be assigned during commits. We should update this PR to include this.

Contributor

It's minor, but it seems better to change the fact that it is required in BlobMetadata than to keep a useless field (this also seems generally more consistent with how missing snapshot data is handled elsewhere).

Contributor

I am +1 for exploring what it would take to make those fields optional. My opinion would depend on the amount of work needed.

@RussellSpitzer (Member) left a comment

I still have some comments on grammar but I think this is a sensible way to go forwards.


The blob metadata must include the following properties:

* `referenced-data-file`: location of the data file the delete vector applies to
Member

I wasn't quite aware that we are going to put the deletion vectors in the Puffin files. So now they are going to be mandatory for readers, right?
How does this compare to the alternatives considered? (I believe that, following the Delta example, we could have deletion vectors inline in the manifest list.)

Member

Technically a reader just needs to be able to read the vector, since the addressing is still all within the delete file manifest.

My TL;DR on the reasoning from Anton was: we make fewer files, but we still get the ability to individually address and attribute blobs. I would check the proposal @singhpk234 linked to.

Member

Thanks @singhpk234 @RussellSpitzer. So these Puffin files won't be registered at the table level, like the table stats files are, right?

Contributor Author

@findepi, Puffin is used to make the files self-describing, but it is not necessary to read the Puffin footer in order to apply the delete vector. We are basically just using it as a container for DVs.

Member

So the idea is that the manifest will link to portions of the file (by path + offset + length)?
But the referenced-data-file stays an important property.

I am concerned that, if the file is self-describing but portions of the information aren't used at execution time, this leaves space for bugs to creep in. The self-describing but unused portion may end up containing incorrect information without anyone noticing (either for files produced by Iceberg itself, or by other implementors/query engines who use the Iceberg project for compatibility testing purposes).

@aokolnychyi (Contributor) commented Oct 14, 2024

PR #11302 has a sample implementation. The magic number and CRC in the blob ensure we read a correct byte sequence. The Puffin footer will be helpful to debug potential issues or during table maintenance. It enables us to store the referenced data file next to the blob in addition to storing it in the metadata.

for the key, then it is not set.

The serialized blob contains:
* The length of the vector and magic bytes stored as 4 bytes, little-endian
Member

and CRC?

Contributor Author

No 😞

I also found out that the CRC and this length are big endian. I'm going to look into this more and see what we can do.

Member

I love compatibility and all, but that sounds wrong and since we are also including this in the manifests I think we should change it ...

Contributor Author

I talked with the Delta folks, and all implementations rely on these values to check the validity of the deletion vector. Removing this would mean those checks can't be run and would also break compatibility with all older Delta readers. The length is also intended to make the format self-describing: a Delta deletion vector file is made up of these blobs concatenated together.

The CRC is definitely useful and I don't think there's a reason to make minor changes that break compatibility like swapping the endianness. I'd use the same logic for storing the length. While it isn't necessary when embedding these in Puffin, it isn't such a problem that we should break compatibility with older readers, which would limit options later.

Plus, if we keep compatibility, then the deletion vector files in both Delta and Iceberg are the same format, with the addition of a Puffin footer and magic bytes. At least it would be an option to use Puffin as a container format later on.

Member

So we will be specifically storing a "delta length" which is the real length - 4 bytes? But in manifests we will keep the full blob length including the crc bytes?

Member

Older Delta readers will never have to read Iceberg Puffin files. Newer software can be adapted so that core DV logic is reused even if the container formats aren't the same.

Contributor Author

So we will be specifically storing a "delta length" which is the real length - 4 bytes? But in manifests we will keep the full blob length including the crc bytes?

I think it would be the blob length - 8 bytes (4 for the CRC and 4 for the length).

I wouldn't be worried about older Delta readers at all.

  • compatibility with Delta readers isn't a project goal, is it?
  • compatibility with some old software that we have no control over, shouldn't be a project goal either?

These are good questions to consider and we will need to come to consensus as a community.

I think there is value in supporting compatibility with older Delta readers, but I acknowledge that this may be my perspective because my employer has a lot of Delta customers that we are going to support now and in the future.

The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code. We see a similar issue in Hive to Iceberg migrations, too, where unknown older readers prevent migration because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Doing that requires temporarily producing Delta metadata, which this makes possible.

The second reason for maintaining compatibility is that we want the formats to become more similar. My hope is that we can work across both communities and come up with a common metadata format in a future version. Maintaining compatibility in cases like this keeps our options open: if we have compatible data layers, then it's easier to build a compatible metadata layer.

Other people may not share those goals, but I think the decision comes down to how much is being compromised for compatibility. I don't think it is too much (see the sketch after this list):

  • A 4-byte length field that Iceberg doesn't need
  • A 4-byte CRC that is useful to add
  • Using big endian for this blob (mixed endian, if you consider that the RoaringBitmap serialization is LE)
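For concreteness, here is a minimal sketch (in Java; illustrative only, not the reference implementation, see PR #11302 for that) of how a writer might frame a deletion-vector blob with these compatibility fields. The `portableBitmap` argument is a placeholder for the already portable-serialized 64-bit Roaring bitmap bytes, and the sketch assumes the CRC-32 covers the magic bytes plus the vector, as discussed in this thread.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class DeletionVectorBlobSketch {
  // Magic sequence shared with the Delta deletion vector format.
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // Layout: [length of magic + vector, 4 bytes BE][magic][vector][CRC-32 of magic + vector, 4 bytes BE]
  static byte[] frame(byte[] portableBitmap) {
    int combinedLength = MAGIC.length + portableBitmap.length;

    CRC32 crc = new CRC32();
    crc.update(MAGIC);
    crc.update(portableBitmap);

    // ByteBuffer writes big-endian by default, matching the layout described above.
    ByteBuffer blob = ByteBuffer.allocate(4 + combinedLength + 4);
    blob.putInt(combinedLength);
    blob.put(MAGIC);
    blob.put(portableBitmap);
    blob.putInt((int) crc.getValue());
    return blob.array();
  }
}
```

Note that the total blob size is the combined length plus 8 bytes (4 for the length prefix and 4 for the CRC), which matches the "blob length - 8 bytes" arithmetic above.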

Contributor

The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code. We see a similar issue in Hive to Iceberg migrations, too, where unknown older readers prevent migration because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Doing that requires temporarily producing Delta metadata, which this makes possible.

This will already be the case for an upgrade to Iceberg v3, where, as currently proposed, all position delete files will be migrated to DVs? Is it technically harder to do a similar migration of Delta Lake to a new version if users want to interop with Iceberg?

Contributor Author

This will already be the case for an upgrade to Iceberg v3, where, as currently proposed, all position delete files will be migrated to DVs? Is it technically harder to do a similar migration of Delta Lake to a new version if users want to interop with Iceberg?

TL;DR: Iceberg v3 is an update that breaks forward compatibility, but we have the option to not break compatibility with Delta. I think we should keep these fields to be compatible so that it is not a similar situation.

I don't think that the upgrade to Iceberg v3 is a similar situation to a Delta migration. The v3 upgrade is a breaking change (that is, forward-compatibility breaking). This must be a forward-compatibility breaking change because we are making additions that must be supported by readers and writers to ensure correctness, including:

  • New types (timestamp nanos, variant, etc.)
  • Default values
  • Deletion vectors (and synchronous maintenance)
  • Row-level lineage

We have the option of making a migration to Iceberg a forward-incompatible change or not by maintaining this compatibility or not. If we make changes to the format that Delta uses, then this would not allow existing readers to operate.

On the other hand, if we do keep compatibility you could continue to produce Delta metadata temporarily and update readers over time. It's often a problem in migrations like this to know about those readers and be able to update them. The fear that a conversion breaks something important often prevents people from upgrading. We've seen that quite a bit with Hive to Iceberg migrations.

In addition, there's a cost to doing the similar migration on the Delta side. In order to have compatibility with Iceberg, Delta would need to introduce a new deletion vector format that drops the length field. That's a forward-incompatible change that would be needed to read files in Iceberg tables correctly. I want to try to work more across both communities, so it seems better to me to compromise and choose the compatible option.

Member

This is an interesting perspective. If we can indeed generate Delta metadata on top of these newly written Puffin files and feed that to old Delta readers without any modifications to those old readers, there's certainly some value in it.

@rdblue this discussion is interesting and the compatibility capabilities are potentially useful.
Why isn't the word "Delta" in the spec, and why is there only one (unclear) occurrence of "compatibility"? Wouldn't it be useful for a better understanding of a) why it's the way it is and b) what a user can do thanks to this?
Without this the spec is complete (context isn't needed for completeness), but confusing. There are some redundant fields (e.g. length), and a spec implementor may spend some time wondering, e.g., whether the length needs to be validated or ignored.

@rdblue force-pushed the spec-v3-dv-blob-type branch from 7e6bf9a to 58d83fc on October 10, 2024 21:07

The serialized blob contains:
* The length of the vector and magic bytes stored as 4 bytes, big-endian
* A 4-byte magic sequence, `D1 D3 39 64`
Contributor

why is a magic number here if this is already versioned by the blob type?

Contributor

I see this was discussed in #11238 (comment)

IMO, trying to converge the specs by taking slightly questionable designs from Delta Lake and propagating them to Iceberg doesn't seem like a great choice. It makes more sense to me to call the existing Delta format legacy and converge onto a representation that both can share. It seems the core payload is easy enough to transfer, and the CRC easy enough to recompute, as new data is written.

@aokolnychyi (Contributor) commented Oct 11, 2024

If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will directly jump to an offset without reading the Puffin footer. Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. The CRC helps catch bit flips if the underlying storage becomes faulty, but I wish we used little endian, as we do for the bitmap and the magic number.

It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that.

@emkornfield, what are your thoughts?

Contributor

Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer.

I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is probably unlikely to get valid structures if the offset is incorrect. The CRC32 provides further sanity checks if the structures do appear to be valid. It's 4 bytes, so at the end of the day there are likely bigger fish to fry.

It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that.

My (non-binding) vote would be -0.1 on simply copying and pasting the Delta Lake spec. Given the relatively small differences, on the surface it seems that the Delta Lake spec could be revised to converge with the Iceberg spec without replicating a technical decision. The likelihood that these differences actually have any impact, as long as the spec is clear enough, is probably minimal. The CRC version probably has the most impact (I don't have any stats, but I assume this would be small).

Contributor

Is the choice of little endian to match most processors' architecture and avoid byte swapping on read?

Agree that it would be nice to use consistent byte ordering (little endian). Parquet and ORC seem to use little endian to encode integers too.

I don't fully understand the compatibility part with Delta Lake. It is not like Delta Lake would use the Puffin file directly, right? I assume some conversion is still needed. Those meta fields are cheap to convert to a different byte order if needed, as long as we keep the big bitmap vector compatible with the same portable serialization.

Contributor

I believe the intent here is to be byte-for-byte compatible with Delta Lake deletion vectors, so both can be read given just a tuple of (file URI, offset, length) and no rewrites would be necessary.

Contributor

I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is probably unlikely to get valid structures if the offset is incorrect. The CRC32 provides further sanity checks if the structures do appear to be valid.

The CRC is computed on the bytes so it can only catch random bit flips. I do think checking the magic number is a reasonable sanity check which is fairly common in the industry. I'd be curious to hear alternatives.

Contributor

The CRC is computed on the bytes so it can only catch random bit flips.

I'd flip this on its head: I'd like to understand what use cases we expect this sanity check to catch. The CRC doesn't just detect bit flips, it also detects accidental truncation/enlargement of the range (e.g. if you leave off the last byte, it is very unlikely the CRC check would pass). If the argument is that the CRC will not be checked on most reads, then the consistency of the vectors provides another sanity check (I think it would be quite hard to get randomly sampled bytes that would fully parse, and implementations should be careful in parsing here).

I do think checking the magic number is a reasonable sanity check which is fairly common in the industry.

I'm not saying it is not useful in general, and even in this case it might be, but magic numbers are typically a file-level concept so that readers can determine whether the file type is something they can in fact read (and disambiguate between multiple types of binary-formatted data). In this case the information is redundant with the information in the footer. For direct reads based on manifest data, I think a reasonable design option would be to introduce a new content type value that reflects 'deletion-vector-v1', which would allow readers to fail earlier if the type is not recognized.

As another example, theta sketches in puffin don't include a magic number (and as far as I can tell the library by itself doesn't add one for serialization).

Ultimately it is just 4 bytes, so it might not pay to argue too much about it, especially if the community agrees that the compatibility is worth it.

Contributor

While I don’t think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible. Having the magic number allows us to cross check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we will be able to successfully deserialize the rest of the content if the offset is invalid, but still. If we end up in that situation, it would mean there was an ugly bug and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to magic numbers in zstd and gzip.

We once had to debug issues with bit flips while reading manifests. There was no easy way to prove that we didn't corrupt the files and that it was a faulty disk. The CRC check would catch those and save a ton of time.

I’d propose to keep the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout. The length and byte orders are controversial. There is no merit in those beyond compatibility with Delta Lake.
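To illustrate the kind of sanity checking being discussed, here is a minimal sketch (in Java, not actual reader code) of what a reader that jumps straight to an (offset, length) range might do: verify the magic sequence first as a cheap check that the offset is sane, then verify the CRC-32 to catch bit flips or truncation. It assumes the CRC covers the magic bytes plus the serialized vector, and that `blob` is exactly the byte range referenced from the manifest.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

public class DeletionVectorChecks {
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // Returns the serialized vector bytes after validating the magic sequence and CRC.
  static byte[] validate(byte[] blob) {
    ByteBuffer buf = ByteBuffer.wrap(blob); // big-endian by default
    int combinedLength = buf.getInt();      // length of magic + vector

    byte[] magic = new byte[4];
    buf.get(magic);
    if (!Arrays.equals(magic, MAGIC)) {
      throw new IllegalStateException("Bad magic sequence: likely an incorrect offset");
    }

    byte[] vector = new byte[combinedLength - 4];
    buf.get(vector);

    long expected = Integer.toUnsignedLong(buf.getInt());
    CRC32 crc = new CRC32();
    crc.update(magic);
    crc.update(vector);
    if (crc.getValue() != expected) {
      throw new IllegalStateException("CRC mismatch: possible corruption (e.g. bit flips or truncation)");
    }

    return vector;
  }
}
```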

@emkornfield (Contributor) commented Oct 14, 2024

While I don’t think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible.

There is obviously a balance here (for instance we aren't also adding a magic footer, nor are we using error correcting codes). Like I said, probably not worth debating too much as it is 4 bytes.

I’d propose to keep the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout.

I agree a digest is important; I think the main question is which one. We should probably limit that discussion to the other thread. I think it was probably a miss not to include this at the Puffin format level, or at least directly on the theta sketch.

@aokolnychyi (Contributor)

I am going to share a PR with some basic implementation that follows this spec. We can use it as an example that will hopefully clarify some questions. Thanks for putting this together, @rdblue!

#### `delete-vector-v1` blob type

A serialized delete vector (bitmap) that represents the positions of rows in a
file that are deleted. A set bit at position P indicates that the row at
Contributor

nit: a file -> a data file

Member

Getting a jump on allowing delete vectors for manifests .... :)

Contributor

True, we may keep this generic for referencing manifests in the future.


@RussellSpitzer (Member) left a comment

I want to remove my approval just for a sec here because of the recent updates to the proposal; I want to make sure that, as a community, we establish some principles for what we are doing going forward. To start, I think that compatibility with other table formats is a great goal, but I do want to stress that I value our ability to read other formats much more highly than our ability to write other formats.

For example, our spec doesn’t allow for 96 bit timestamps and we will not write them, but we still have an implementation that allows reading them. We’ve done this a bunch of times thus far in the spec to maintain compatibility with migrating data from Hive into Iceberg. I think these are generally good moves but do occasionally cause a problem for implementers (see column mapping, partition strings in directory names, etc ...) and in general we do not require them.

In this case we are introducing a few minor things into the spec that an Iceberg reader should never use but an Iceberg writer would be required to write. Specifically, the magic bytes, length, and CRC associated with the blob are all essentially just requirements for the current implementation of Delta readers. We already have to store our length information in our delete manifests (which will be the true length and not just magic + blob), and we could also store the version and AAD there as well (or in the footer). I don't think we really have a technical reason to add any of these to the blob except compatibility for external readers. As noted, even if we did add these to the spec, we likely wouldn't have added them in the same way they were implemented in the Delta spec.

I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

I want to emphasize that I feel like everyone participating in this PR is acting in good faith, but I think our policy on compatibility issues is something we should hash out.

@emkornfield (Contributor)

I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

This is my main concern. I don't think the technical differences here really present blockers; they just add some warts. I also think compatibility between table formats is a good goal, but I worry that, due to governance differences between Iceberg and Delta, things will naturally go slower in Iceberg, so in most cases we would likely be ceding design to another project. I'm happy to take a wait-and-see approach on the more philosophical issue here and move forward on this (ultimately I think the people doing the work should have more of a say on approach).

@aokolnychyi (Contributor)

PR #11302 contains a sample implementation of this spec.

@rdblue (Contributor Author) commented Oct 15, 2024

I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

This is my main concern. I don't think the technical differences here really present blockers; they just add some warts. I also think compatibility between table formats is a good goal, but I worry that, due to governance differences between Iceberg and Delta, things will naturally go slower in Iceberg, so in most cases we would likely be ceding design to another project. I'm happy to take a wait-and-see approach on the more philosophical issue here and move forward on this (ultimately I think the people doing the work should have more of a say on approach).

I agree that we don't want to cede design to another project or set a precedent. This should be an independent choice about whether we want to maintain compatibility in this case, based on weighing the benefits against the costs. This is definitely an Iceberg community decision.

To me, reducing fragmentation across formats is worth the cost of a few warts.

I think that compatibility with other table formats is a great goal but I do want to stress that I value our ability to read other formats much higher than our ability to write other formats.

This is true, but I'm not sure that we've had a case before where we know that we want to build basically the same thing. And in this case, if we want compatibility with existing code, then we would need to make sure we write the fields to keep the other readers functioning.

I also think that if the community roles were reversed in a similar situation, we would want the Delta community to consider compatibility when building a very similar feature, too.

@advancedxy (Contributor) left a comment

It seems some instances of "delete vector" have not been renamed to "deletion vector".

Also, the PR title should be updated to use deletion-vector-v1 as well.

Otherwise LGTM, since it looks like the community has reached consensus.

Comment on lines +166 to +167
* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
Contributor

Suggested change
* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
* Include `referenced-data-file`, the location of the data file the deletion
vector applies to; must be equal to the data file's `location` in table

Contributor Author

These are synonyms and I don't think that this is confusing. Is there a concern that you have about using both terms?


#### `deletion-vector-v1` blob type

A serialized delete vector (bitmap) that represents the positions of rows in a
Contributor

Suggested change
A serialized delete vector (bitmap) that represents the positions of rows in a
A serialized deletion vector (bitmap) that represents the positions of rows in a

@aokolnychyi (Contributor) left a comment

Looks good to me except for some pending renames of delete vector to deletion vector.

* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
metadata
* Include `cardinality`, the number of deleted rows (set positions) in the
Contributor

We should document that this should be equal to the number of rows listed in the manifest.

Contributor Author

This spec is independent of manifests, so I don't want to add that here. I also think that it's clear in the table spec that record_count for a delete vector is its cardinality.

@aokolnychyi merged commit b9ebc71 into apache:main on Nov 2, 2024
@aokolnychyi (Contributor)

The vote has passed, so I merged this PR. Thanks @rdblue! Thanks everyone who reviewed!

@aokolnychyi (Contributor)

Linking this PR to #11122 for tracking.
