Docs: RewritePositionDeleteFiles procedure #7589
Conversation
docs/spark-procedures.md
Outdated
* Minor Compaction: Compact small position delete files into larger ones. This reduces the size of metadata stored in manifest files and the overhead of opening small delete files.
* Remove Dangling Deletes: Filter out position delete records that refer to data files that are no longer live. After rewrite_data_files, position delete records pointing to the rewritten data files are not immediately marked for removal and remain tracked by the table's live snapshot metadata. This is known as the 'dangling delete' problem: a single position delete file can apply to more than one data file, and not all applicable data files are removed during a rewrite.

Iceberg can rewrite position delete files in parallel using Spark with the `rewritePositionDeletes` action.
We may not need this line since this is a doc for the procedure instead of the action. Or we can mention the procedure like this:
This procedure rewrites position delete files in parallel.
Also, if we want to emphasize "parallel", can we provide a bit more detail about how it works, like parallel at the partition level or the file level?
Or if we want to mention the rewritePositionDeletes action, we may clarify that this procedure is based upon the action.
Sure, I'll remove this. I copied it from the rewrite data files procedure.
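For context, the simplest invocation of the procedure being documented appears later in the diff; a minimal sketch, with `db.sample` as a placeholder table name:

```sql
-- Minimal sketch: compact small position delete files and drop dangling
-- deletes for db.sample using the procedure's default thresholds.
CALL catalog_name.system.rewrite_position_delete_files('db.sample')
```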
flyrain left a comment
Thanks @szehon-ho for working on this. LGTM overall. Left some minor comments and questions.
docs/spark-procedures.md
Outdated
|--------------------------------|------|------------------------------------------------------------------------|
| `rewritten_delete_files_count` | int  | Number of delete files which were removed by this command               |
| `rewritten_bytes_count`        | long | Count of bytes across delete files which were removed by this command   |
| `added_delete_files_count`     | int  | Number of delete files which were added by this command                 |
nit: extra space in "added by"
Thanks
docs/spark-procedures.md
Outdated
See the [`SizeBasedFileRewriter` Javadoc](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/actions/SizeBasedFileRewriter.html#field.summary)
for a list of all the supported options for this action.

All rewritten position delete files are filtered to remove dangling deletes.
Minor: maybe it's just me, but this feels a bit hard to read at a glance. Would either of these be easier to grasp?
All the position delete files that are rewritten go through a filtering process, ensuring the removal of any dangling deletes.
or
Dangling deletes are all filtered out during the rewriting.
I like the second one, changed
docs/spark-procedures.md
Outdated
#### Examples

Rewrite position delete files in table `db.sample` with default options. This rewrites position delete files that conform to the default count-per-partition and size thresholds, and rewrites them using the default target sizes. Dangling deletes are removed from rewritten delete files.
Is count-per-partition a table property or a default value of the procedure?
The default target size is the table property write.delete.target-file-size-bytes, correct? Should we mention the property name here?
Yeah, the option name is target-file-size-bytes. I added it to the description. We should probably document these options properly instead of soft-linking to the Javadoc, but I left it as it is, in line with rewrite_data_files. We can do this in a later PR.
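For illustration, here is a sketch of how that option could be passed through the procedure, assuming it accepts an `options` map the way `rewrite_data_files` does; the table name and size value are placeholders:

```sql
CALL catalog_name.system.rewrite_position_delete_files(
  table => 'db.sample',
  -- assumption: options map syntax mirrors rewrite_data_files;
  -- target-file-size-bytes is a SizeBasedFileRewriter option (128 MB here is just an example)
  options => map('target-file-size-bytes', '134217728')
)
```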
docs/spark-procedures.md
Outdated
```sql
CALL catalog_name.system.rewrite_position_delete_files('db.sample')
```

Rewrite all position delete files in table `db.sample`. Dangling deletes are removed from rewritten delete files.
What is the difference between the first example and the second example? The first example doesn't rewrite all pos delete files, I assume. Is there a default value for min-input-files that prevents rewriting on some partitions?
Yes. As mentioned in the other comment, we should probably add docs on all the options at some point, instead of relying on a soft link to the Javadoc.
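To make the distinction concrete, this is a sketch of how the "rewrite all" example could be expressed with the `rewrite-all` option exposed by `SizeBasedFileRewriter`, again assuming the procedure takes an `options` map like `rewrite_data_files`; the table name is a placeholder:

```sql
-- Sketch: force rewriting of every position delete file, bypassing the
-- min-input-files and size thresholds that the default invocation applies.
CALL catalog_name.system.rewrite_position_delete_files(
  table => 'db.sample',
  options => map('rewrite-all', 'true')
)
```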
Force-pushed from 1d1f2d5 to 0e156ed
Thanks @flyrain for the detailed review! We can improve these docs in later PRs.
Follow-up for #7572 (still not merged)