Conversation

@rdblue (Contributor) commented Sep 30, 2024

This adds a blob type to the Puffin spec that can store a Roaring bitmap delete vector. This is in support of the row-level delete improvements proposed for Iceberg v3.

@github-actions bot added the Specification label (Issues that may introduce spec changes) on Sep 30, 2024
tested for inclusion in the bitmap. If a bitmap is not found for the key, then
it is not set.

The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377
Member

Is there any significance to the sequence here? Is this for compatibility purposes? I don't have a problem with that, just wondering

Contributor

Yeah, it's for compatibility purposes: the Delta deletion vector format uses the same magic number, see https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-format

Contributor Author

I also need to add a length field and a CRC checksum for compatibility. The CRC seems useful and Delta readers assume that the length of the vector + magic bytes is written. I think for compatibility, it's worth it to just add those fields.

Member

If we implement the Delta format, then we should document that fact; it's significant.
A Puffin spec reader/implementor shouldn't need to compare documents bit by bit to come to this conclusion.

The blob metadata must include the following properties:

* `referenced-data-file`: location of the data file the delete vector applies to
* `cardinality`: number of deleted rows (set positions) in the delete vector
Member

May also make sense to store the sequence number here as well.

Contributor Author

We don't know the sequence number when writing the file because it hasn't yet been assigned. That must come from the metadata that tracks these blobs.

Contributor

I think the BlobMetadata section requires refactoring as it currently calls out both snapshot and sequence number as required fields.

Contributor

The plan was to set sequence number and snapshot ID to -1 as they will be assigned during commits. We should update this PR to include this.

Contributor

It's minor, but it seems better to change the fact that it is required in BlobMetadata than to keep a useless field (this also seems generally more consistent with how missing snapshot data is handled elsewhere).

Contributor

I am +1 for exploring what it would take to make those fields optional. My opinion would depend on the amount of work needed.

@RussellSpitzer (Member) left a comment

I still have some comments on grammar but I think this is a sensible way to go forwards.


The blob metadata must include the following properties:

* `referenced-data-file`: location of the data file the delete vector applies to
Member

I wasn't quite aware that we are going to put the deletion vectors in the Puffin files. So now they are going to be mandatory for readers, right?
How does this compare to the alternatives considered? (I believe that, following the Delta example, we could have deletion vectors inline in the manifest list.)

Member

Technically a reader just needs to be able to read the vector, since the addressing is still all within the delete file manifest.

My TL;DR on the reasoning from Anton was: we make fewer files, but we still get the ability to individually address and attribute blobs. I would check the proposal @singhpk234 linked to.

Member

Thanks @singhpk234 @RussellSpitzer. So these Puffin files won't be registered at the table level, like the table stats files are, right?

Contributor Author

@findepi, Puffin is used to make the files self-describing, but it is not necessary to read the Puffin footer in order to apply the delete vector. We are basically just using it as a container for DVs.

Member

So the idea is that the manifest will link to portions of the file (by path + offset + length)?
But the referenced-data-file stays an important property.

I am concerned that, if the file is self-describing but portions of the information aren't used at execution time, this leaves space for bugs to creep in. The self-describing but unused portion may end up containing incorrect information without anyone noticing (either for files produced by Iceberg itself, or by other implementors/query engines who use the Iceberg project for compatibility testing purposes).

@aokolnychyi (Contributor) commented Oct 14, 2024

PR #11302 has a sample implementation. The magic number and CRC in the blob ensure we read a correct byte sequence. The Puffin footer will be helpful to debug potential issues or during table maintenance. It enables us to store the referenced data file next to the blob in addition to storing it in the metadata.

for the key, then it is not set.

The serialized blob contains:
* The length of the vector and magic bytes stored as 4 bytes, little-endian
Member

and CRC?

Contributor Author

No 😞

I also found out that the CRC and this length are big endian. I'm going to look into this more and see what we can do.

Member

I love compatibility and all, but that sounds wrong and since we are also including this in the manifests I think we should change it ...

Contributor Author

I talked with the Delta folks, and all implementations rely on these values to check the validity of the deletion vector. Removing this would mean those checks can't be run and would also break compatibility with all older Delta readers. The length is also intended to make the format self-describing: a Delta deletion vector file is made up of these blobs concatenated together.

The CRC is definitely useful and I don't think there's a reason to make minor changes that break compatibility like swapping the endianness. I'd use the same logic for storing the length. While it isn't necessary when embedding these in Puffin, it isn't such a problem that we should break compatibility with older readers, which would limit options later.

Plus, if we keep compatibility, then the deletion vector files in both Delta and Iceberg are the same format, with the addition of a Puffin footer and magic bytes. At least it would be an option to use Puffin as a container format later on.

Member

So we will be specifically storing a "delta length" which is the real length - 4 bytes? But in manifests we will keep the full blob length including the crc bytes?

Member

Older Delta readers will never have to read Iceberg Puffin files. Newer software can be adapted so that core DV logic is reused even if the container formats aren't the same.

Contributor Author

So we will be specifically storing a "delta length" which is the real length - 4 bytes? But in manifests we will keep the full blob length including the crc bytes?

I think it would be the blob length - 8 bytes (4 for the CRC and 4 for the length).

I wouldn't be worried about older Delta readers at all.

  • compatibility with Delta readers isn't a project goal, is it?
  • compatibility with some old software that we have no control over, shouldn't be a project goal either?

These are good questions to consider and we will need to come to consensus as a community.

I think there is value in supporting compatibility with older Delta readers, but I acknowledge that this may be my perspective because my employer has a lot of Delta customers that we are going to support now and in the future.

The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code. We see a similar issue in Hive to Iceberg migrations, too, where unknown older readers prevent migration because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Doing that requires temporarily producing Delta metadata, which this makes possible.

The second reason for maintaining compatibility is that we want the formats to become more similar. My hope is that we can work across both communities and come up with a common metadata format in a future version. Maintaining compatibility in cases like this keeps our options open: if we have compatible data layers, then it's easier to build a compatible metadata layer.

Other people may not share those goals, but I think the decision comes down to how much is being compromised for compatibility. I don't think it is too much (see the sketch after this list):

  • A 4-byte length field that Iceberg doesn't need
  • A 4-byte CRC that is useful to add
  • Using big endian for this blob (mixed endian, if you consider that the RoaringBitmap serialization is LE)
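For concreteness, here is a minimal sketch (in Java; illustrative only, not the reference implementation, see PR #11302 for that) of how a writer might frame a deletion-vector blob with these compatibility fields. The `portableBitmap` argument is a placeholder for the already portable-serialized 64-bit Roaring bitmap bytes, and the sketch assumes the CRC-32 covers the magic bytes plus the vector, as discussed in this thread.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class DeletionVectorBlobSketch {
  // Magic sequence shared with the Delta deletion vector format.
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // Layout: [length of magic + vector, 4 bytes BE][magic][vector][CRC-32 of magic + vector, 4 bytes BE]
  static byte[] frame(byte[] portableBitmap) {
    int combinedLength = MAGIC.length + portableBitmap.length;

    CRC32 crc = new CRC32();
    crc.update(MAGIC);
    crc.update(portableBitmap);

    // ByteBuffer writes big-endian by default, matching the layout described above.
    ByteBuffer blob = ByteBuffer.allocate(4 + combinedLength + 4);
    blob.putInt(combinedLength);
    blob.put(MAGIC);
    blob.put(portableBitmap);
    blob.putInt((int) crc.getValue());
    return blob.array();
  }
}
```

Note that the total blob size is the combined length plus 8 bytes (4 for the length prefix and 4 for the CRC), which matches the "blob length - 8 bytes" arithmetic above.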

Contributor

The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code. We see a similar issue in Hive to Iceberg migrations, too, where unknown older readers prevent migration because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Doing that requires temporarily producing Delta metadata, which this makes possible.

This will already be the case for an upgrade to Iceberg v3, where, as currently proposed, all position delete files will be migrated to DVs? Is it technically harder to do a similar migration of Delta Lake to a new version if users want to interop with Iceberg?

Contributor Author

This will already be the case for an upgrade to Iceberg v3, where, as currently proposed, all position delete files will be migrated to DVs? Is it technically harder to do a similar migration of Delta Lake to a new version if users want to interop with Iceberg?

TL;DR: Iceberg v3 is an update that breaks forward compatibility, but we have the option to not break compatibility with Delta. I think we should keep these fields to be compatible so that it is not a similar situation.

I don't think that the upgrade to Iceberg v3 is a similar situation to a Delta migration. The v3 upgrade is a breaking change (that is, forward-compatibility breaking). This must be a forward-compatibility breaking change because we are making additions that must be supported by readers and writers to ensure correctness, including:

  • New types (timestamp nanos, variant, etc.)
  • Default values
  • Deletion vectors (and synchronous maintenance)
  • Row-level lineage

We have the option of making a migration to Iceberg a forward-incompatible change or not by maintaining this compatibility or not. If we make changes to the format that Delta uses, then this would not allow existing readers to operate.

On the other hand, if we do keep compatibility you could continue to produce Delta metadata temporarily and update readers over time. It's often a problem in migrations like this to know about those readers and be able to update them. The fear that a conversion breaks something important often prevents people from upgrading. We've seen that quite a bit with Hive to Iceberg migrations.

In addition, there's a cost to doing the similar migration on the Delta side. In order to have compatibility with Iceberg, Delta would need to introduce a new deletion vector format that drops the length field. That's a forward-incompatible change that would be needed to read files in Iceberg tables correctly. I want to try to work more across both communities, so it seems better to me to compromise and choose the compatible option.

Member

This is an interesting perspective. If we can indeed generate Delta metadata on top of these newly written Puffin files and feed that to old Delta readers without any modifications to those old readers, there's certainly some value in it.

@rdblue this discussion is interesting and the compatibility capabilities are potentially useful.
Why isn't the word "Delta" in the spec, and why is there only one (unclear) occurrence of "compatibility"? Wouldn't it be useful for a better understanding of a) why it's the way it is and b) what a user can do thanks to this?
Without this the spec is complete (context isn't needed for completeness), but confusing. There are some redundant fields (e.g. length), and a spec implementor may spend some time wondering, e.g., whether the length needs to be validated or ignored.

@rdblue force-pushed the spec-v3-dv-blob-type branch from 7e6bf9a to 58d83fc on October 10, 2024 21:07

The serialized blob contains:
* The length of the vector and magic bytes stored as 4 bytes, big-endian
* A 4-byte magic sequence, `D1 D3 39 64`
Contributor

why is a magic number here if this is already versioned by the blob type?

Contributor

I see this was discussed in #11238 (comment)

IMO, trying to converge the specs by taking slightly questionable designs from Delta Lake and propagating them to Iceberg doesn't seem like a great choice. It makes more sense to me to call the existing Delta format legacy and converge onto a representation that both can share. It seems the core payload is easy enough to transfer, and the CRC easy enough to recompute, as new data is written.

@aokolnychyi (Contributor) commented Oct 11, 2024

If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will directly jump to an offset without reading the Puffin footer. Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. The CRC helps catch bit flips if the underlying storage becomes faulty, but I wish we used little endian, as we do for the bitmap and the magic number.

It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that.

@emkornfield, what are your thoughts?

Contributor

Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer.

I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is probably unlikely to get valid structures if the offset is incorrect. The CRC32 provides further sanity checks if the structures do appear to be valid. It's 4 bytes, so at the end of the day there are likely bigger fish to fry.

It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that.

My (non-binding) vote would be -0.1 on simply copying and pasting the Delta Lake spec. Given the relatively small differences, on the surface it seems that the Delta Lake spec could be revised to converge with the Iceberg spec without replicating a technical decision. The likelihood that these differences actually have any impact, as long as the spec is clear enough, is probably minimal. The CRC version probably has the most impact (I don't have any stats, but I assume this would be small).

Contributor

Is the choice of little endian to match most processors' architecture and avoid byte swapping on read?

Agree that it would be nice to use consistent byte ordering (little endian). Parquet and ORC seem to use little endian to encode integers too.

I don't fully understand the compatibility part with Delta Lake. It is not like Delta Lake would use the Puffin file directly, right? I assume some conversion is still needed. Those meta fields are cheap to convert to a different byte order if needed, as long as we keep the big bitmap vector compatible with the same portable serialization.

Contributor

I believe the intent here is to be byte-for-byte compatible with Delta Lake deletion vectors, so both can be read given just a tuple of (file URI, offset, length) and no rewrites would be necessary.

Contributor

I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is probably unlikely to get valid structures if the offset is incorrect. The CRC32 provides further sanity checks if the structures do appear to be valid.

The CRC is computed on the bytes so it can only catch random bit flips. I do think checking the magic number is a reasonable sanity check which is fairly common in the industry. I'd be curious to hear alternatives.

Contributor

The CRC is computed on the bytes so it can only catch random bit flips.

I'd flip this on its head: I'd like to understand what use cases we expect this sanity check to catch. The CRC doesn't just detect bit flips, it also detects accidental truncation/enlargement of the range (e.g. if you leave off the last byte, it is very unlikely the CRC check would pass). If the argument is that the CRC will not be checked on most reads, then the consistency of the vectors provides another sanity check (I think it would be quite hard to get randomly sampled bytes that would fully parse, and implementations should be careful in parsing here).

I do think checking the magic number is a reasonable sanity check which is fairly common in the industry.

I'm not saying it is not useful in general, and even in this case it might be, but magic numbers are typically a file-level concept so that readers can determine whether the file type is something they can in fact read (and disambiguate between multiple types of binary-formatted data). In this case the information is redundant with the information in the footer. For direct reads based on manifest data, I think a reasonable design option would be to introduce a new content type value that reflects 'deletion-vector-v1', which would allow readers to fail earlier if the type is not recognized.

As another example, theta sketches in puffin don't include a magic number (and as far as I can tell the library by itself doesn't add one for serialization).

Ultimately it is just 4 bytes, so it might not pay to argue too much about it, especially if the community agrees that the compatibility is worth it.

Contributor

While I don’t think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible. Having the magic number allows us to cross check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we will be able to successfully deserialize the rest of the content if the offset is invalid, but still. If we end up in that situation, it would mean there was an ugly bug and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to magic numbers in zstd and gzip.

We once had to debug issues with bit flips while reading manifests. There was no easy way to prove that we didn't corrupt the files and that it was a faulty disk. The CRC check would catch those and save a ton of time.

I’d propose to keep the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout. The length and byte orders are controversial. There is no merit in those beyond compatibility with Delta Lake.
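To illustrate the kind of sanity checking being discussed, here is a minimal sketch (in Java, not actual reader code) of what a reader that jumps straight to an (offset, length) range might do: verify the magic sequence first as a cheap check that the offset is sane, then verify the CRC-32 to catch bit flips or truncation. It assumes the CRC covers the magic bytes plus the serialized vector, and that `blob` is exactly the byte range referenced from the manifest.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

public class DeletionVectorChecks {
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // Returns the serialized vector bytes after validating the magic sequence and CRC.
  static byte[] validate(byte[] blob) {
    ByteBuffer buf = ByteBuffer.wrap(blob); // big-endian by default
    int combinedLength = buf.getInt();      // length of magic + vector

    byte[] magic = new byte[4];
    buf.get(magic);
    if (!Arrays.equals(magic, MAGIC)) {
      throw new IllegalStateException("Bad magic sequence: likely an incorrect offset");
    }

    byte[] vector = new byte[combinedLength - 4];
    buf.get(vector);

    long expected = Integer.toUnsignedLong(buf.getInt());
    CRC32 crc = new CRC32();
    crc.update(magic);
    crc.update(vector);
    if (crc.getValue() != expected) {
      throw new IllegalStateException("CRC mismatch: possible corruption (e.g. bit flips or truncation)");
    }

    return vector;
  }
}
```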

@emkornfield (Contributor) commented Oct 14, 2024

While I don’t think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible.

There is obviously a balance here (for instance we aren't also adding a magic footer, nor are we using error correcting codes). Like I said, probably not worth debating too much as it is 4 bytes.

I’d propose to keep the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout.

I agree a digest is important; I think the main question is which one. We should probably limit that discussion to the other thread. I think it was probably a miss not to include this at the Puffin format level, or at least directly on the theta sketch.

@aokolnychyi (Contributor)

I am going to share a PR with some basic implementation that follows this spec. We can use it as an example that will hopefully clarify some questions. Thanks for putting this together, @rdblue!

#### `delete-vector-v1` blob type

A serialized delete vector (bitmap) that represents the positions of rows in a
file that are deleted. A set bit at position P indicates that the row at
Contributor

nit: a file -> a data file

Member

Getting a jump on allowing delete vectors for manifests .... :)

Contributor

True, we may keep this generic for referencing manifests in the future.


@RussellSpitzer (Member) left a comment

I want to remove my approval just for a sec here because of the recent updates to the proposal; I want to make sure that, as a community, we establish some principles for what we are doing going forward. To start, I think that compatibility with other table formats is a great goal, but I do want to stress that I value our ability to read other formats much more highly than our ability to write other formats.

For example, our spec doesn’t allow for 96 bit timestamps and we will not write them, but we still have an implementation that allows reading them. We’ve done this a bunch of times thus far in the spec to maintain compatibility with migrating data from Hive into Iceberg. I think these are generally good moves but do occasionally cause a problem for implementers (see column mapping, partition strings in directory names, etc ...) and in general we do not require them.

In this case we are introducing a few minor things into the spec that an Iceberg reader should never use but an Iceberg writer would be required to write. Specifically, the magic bytes, length, and CRC associated with the blob are all essentially just requirements for the current implementation of Delta readers. We already have to store our length information in our delete manifests (which will be the true length and not just magic + blob), and we could also store the version and AAD there as well (or in the footer). I don't think we really have a technical reason to add any of these to the blob except compatibility for external readers. As noted, even if we did add these to the spec, we likely wouldn't have added them in the same way they were implemented in the Delta spec.

I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

I want to emphasize that I feel like everyone participating in this PR is acting in good faith, but I think our policy on compatibility issues is something we should hash out.

@emkornfield (Contributor)

I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

This is my main concern. I don't think the technical differences here really present blockers; they just add some warts. I also think compatibility between table formats is a good goal, but I worry that, due to governance differences between Iceberg and Delta, things will naturally go slower in Iceberg, so in most cases we would likely be ceding design to another project. I'm happy to take a wait-and-see approach on the more philosophical issue here and move forward on this (ultimately I think the people doing the work should have more of a say on approach).

@aokolnychyi (Contributor)

PR #11302 contains a sample implementation of this spec.

@rdblue (Contributor Author) commented Oct 15, 2024

I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

This is my main concern. I don't think the technical differences here really present blockers; they just add some warts. I also think compatibility between table formats is a good goal, but I worry that, due to governance differences between Iceberg and Delta, things will naturally go slower in Iceberg, so in most cases we would likely be ceding design to another project. I'm happy to take a wait-and-see approach on the more philosophical issue here and move forward on this (ultimately I think the people doing the work should have more of a say on approach).

I agree that we don't want to cede design to another project or set a precedent. This should be an independent choice about whether we want to maintain compatibility in this case, based on weighing the benefits against the costs. This is definitely an Iceberg community decision.

To me, reducing fragmentation across formats is worth the cost of a few warts.

I think that compatibility with other table formats is a great goal but I do want to stress that I value our ability to read other formats much higher than our ability to write other formats.

This is true, but I'm not sure that we've had a case before where we know that we want to build basically the same thing. And in this case, if we want compatibility with existing code, then we would need to make sure we write the fields to keep the other readers functioning.

I also think that if the community roles were reversed in a similar situation, we would want the Delta community to consider compatibility when building a very similar feature, too.

@advancedxy (Contributor) left a comment

It seems some instances of "delete vector" have not been renamed to "deletion vector".

Also, the PR title should be updated to use deletion-vector-v1 as well.

Otherwise LGTM, since it looks like the community has reached consensus.

Comment on lines +166 to +167
* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
Contributor

Suggested change
* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
* Include `referenced-data-file`, the location of the data file the deletion
vector applies to; must be equal to the data file's `location` in table

Contributor Author

These are synonyms and I don't think that this is confusing. Is there a concern that you have about using both terms?


#### `deletion-vector-v1` blob type

A serialized delete vector (bitmap) that represents the positions of rows in a
Contributor

Suggested change
A serialized delete vector (bitmap) that represents the positions of rows in a
A serialized deletion vector (bitmap) that represents the positions of rows in a

@aokolnychyi (Contributor) left a comment

Looks good to me except for some pending renames of delete vector to deletion vector.

* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
metadata
* Include `cardinality`, the number of deleted rows (set positions) in the
Contributor

We should document that this should be equal to the number of rows listed in the manifest.

Contributor Author

This spec is independent of manifests, so I don't want to add that here. I also think that it's clear in the table spec that record_count for a delete vector is its cardinality.

@aokolnychyi merged commit b9ebc71 into apache:main on Nov 2, 2024
@aokolnychyi (Contributor)

The vote has passed, so I merged this PR. Thanks @rdblue! Thanks everyone who reviewed!

@aokolnychyi (Contributor)

Linking this PR to #11122 for tracking.
