[WIP] Export merge tree partition to object storage #939
There is one thing I am not doing yet, but should: handling failures to schedule an export task.
This needs to be addressed ASAP.
…. very hackish for now, need to improve part selection and blocking
src/Storages/StorageMergeTree.cpp
Outdated
{
    if (areBackgroundMovesNeeded())
        background_moves_assignee.start();
    // if (areBackgroundMovesNeeded())
I need to think about this
One idea is to hold references to the data parts as soon as the request comes in, instead of locking the parts against merges/mutations. This way parts can still be mutated or merged, which is not a problem as long as they remain on disk. Holding references gives us that guarantee. We just need to make sure we grab those references again on restart.
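A minimal sketch of the reference-holding idea, to make the intended guarantee concrete. Everything here (`Part`, `Table`, `ExportRequest`) is a hypothetical stand-in, not a class from this PR or from ClickHouse, and it assumes that a part's data stays available for as long as at least one `std::shared_ptr` to it exists.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a data part; not the real ClickHouse class.
struct Part
{
    std::string name;
    std::string partition_id;
};

using PartPtr = std::shared_ptr<const Part>;

// Hypothetical table that owns the set of active parts.
struct Table
{
    std::map<std::string, PartPtr> active_parts;

    void addPart(const std::string & name, const std::string & partition_id)
    {
        active_parts[name] = std::make_shared<const Part>(Part{name, partition_id});
    }

    // A merge replaces the source parts with one new part.
    // The sources leave the active set but are not destroyed while still referenced.
    void mergeParts(const std::vector<std::string> & sources, const std::string & result_name)
    {
        assert(!sources.empty());
        const std::string partition_id = active_parts.at(sources.front())->partition_id;
        for (const auto & source : sources)
            active_parts.erase(source);
        addPart(result_name, partition_id);
    }
};

// Hypothetical export request: pins the parts of one partition when the request arrives.
struct ExportRequest
{
    std::vector<PartPtr> pinned_parts;

    ExportRequest(const Table & table, const std::string & partition_id)
    {
        for (const auto & [name, part] : table.active_parts)
            if (part->partition_id == partition_id)
                pinned_parts.push_back(part);
    }
};
```

In this sketch a merge that runs after the request was created removes the source parts from `active_parts`, yet `pinned_parts` still keeps them alive, so no merge/mutation lock is needed. Re-acquiring the references after a restart is not covered here.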
…disk. first attempt
List of pending things off the top of my head:
@@ -0,0 +1,61 @@
#pragma once
This is a workaround / refactor needed for two things:
- Override storage engine filenames (it was used when we wanted to preserve part names)
- Compute filenames separately (as opposed to computing them inside the sink) so we are able to build a commit file
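A rough sketch of the second point, assuming the hive-style layout and the `commit_<partition_id>_<transaction_id>` name from the PR description. The function names, the `CommitFile` struct, and the one-path-per-line commit format are assumptions made for illustration; the PR only states that the commit file contains the list of uploaded files.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Ordered partition key/value pairs, e.g. {{"year", "2024"}, {"country", "US"}}.
using PartitionKeyValues = std::vector<std::pair<std::string, std::string>>;

// Hypothetical container for the commit file name and its contents.
struct CommitFile
{
    std::string name;
    std::string contents;
};

// Compute all destination paths up front, outside the sink.
// No escaping of keys or values is attempted in this sketch.
std::vector<std::string> planExportedFilePaths(
    const std::string & table_root,
    const PartitionKeyValues & partition,
    const std::vector<std::string> & file_ids)
{
    assert(!table_root.empty());
    std::string directory = table_root;
    for (const auto & [key, value] : partition)
        directory += "/" + key + "=" + value;

    std::vector<std::string> paths;
    paths.reserve(file_ids.size());
    for (const auto & file_id : file_ids)
        paths.push_back(directory + "/" + file_id + ".parquet");
    return paths;
}

// Build the commit file from the precomputed paths.
// Assumed format: one uploaded path per line.
CommitFile buildCommitFile(
    const std::string & partition_id,
    const std::string & transaction_id,
    const std::vector<std::string> & paths)
{
    CommitFile commit;
    commit.name = "commit_" + partition_id + "_" + transaction_id;
    for (const auto & path : paths)
        commit.contents += path + "\n";
    return commit;
}
```

Because the paths are known before any data is written, the same list can be handed to the sink and to the commit step, which is the reason for moving filename computation out of the sink.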
Parking it for now in favor of a simpler version.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement exporting partitions from MergeTree tables to object storage in a different format (e.g., Parquet). The files are converted to the destination format in memory.
Syntax:
ALTER TABLE merge_tree_table EXPORT PARTITION ID 'ABC' TO TABLE 's3_hive_table'.
Related settings:
allow_experimental_export_merge_tree_partition.
<table_root>/pkey1=pvalue1/.../pkeyn=pvaluen/<snowflakeid>.parquet). Most likely in the future we'll not be using snowflakeids for the filenames.
commit_<partition_id>_<transaction_id>. It shall contain the list of files that were uploaded in that transaction.
disk moves. I still need to decide if I'll create yet another queue or re-use one of the existing ones.
anyDisk.
system.exports and system.part_log
Documentation entry for user-facing changes
...
Exclude tests: