
Conversation

@arthurpassos
Collaborator

@arthurpassos arthurpassos commented Jul 28, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implement exporting partitions from MergeTree tables to object storage in a different format (e.g., Parquet). The files are converted to the destination format in memory.

Syntax: ALTER TABLE merge_tree_table EXPORT PARTITION ID 'ABC' TO TABLE 's3_hive_table'.

Related settings: allow_experimental_export_merge_tree_partition.
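A minimal usage sketch (the statement shape and setting name are taken from this PR; both table names are hypothetical):

```sql
-- Enable the experimental feature first (session-level).
SET allow_experimental_export_merge_tree_partition = 1;

-- Export partition 'ABC' of a local MergeTree table to an object storage
-- table (e.g. S3 with hive-style partitioning). The quoted destination
-- name follows the syntax as stated in this PR description.
ALTER TABLE merge_tree_table EXPORT PARTITION ID 'ABC' TO TABLE 's3_hive_table';
```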

  1. The destination file names and paths are, for now, decided by the destination engine (I am only testing and thinking about S3 with Hive-style partitioning, so <table_root>/pkey1=pvalue1/.../pkeyn=pvaluen/<snowflakeid>.parquet; see the layout sketch after this list). Most likely we will not use Snowflake IDs for the filenames in the future.
  2. A commit file is uploaded at the end of the execution to signal the completion of the transaction; its filename is commit_<partition_id>_<transaction_id>. It contains the list of files that were uploaded in that transaction.
  3. A partition cannot be exported twice. The limitation comes from the fact that, upon re-export, we have no reliable way of telling which parts should be exported (we cannot duplicate data): parts might have been merged with not-yet-exported parts, and so on.
  4. The parts selected for an export are not locked at all. We just keep references so they are not deleted from disk; it is totally fine to mutate or merge them in the meantime.
  5. Exports should be able to recover from hard failures/disasters (hard restart or crash). This is controlled using export manifests that are written on disk.
  6. Exports should be able to recover from soft failures (i.e., a given part failed to export but the server did not crash).
  7. Upon restart, exports are scheduled based on when they were created.
  8. For now, exports are scheduled in the same queue as disk moves. I still need to decide whether I'll create yet another queue or reuse one of the existing ones.
  9. Export manifests are being written on anyDisk.
  10. There is some half-baked observability in system.exports and system.part_log.
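To make items 1 and 2 concrete, here is a hypothetical destination layout after exporting one partition. The bucket, partition keys, and IDs are invented; only the Snowflake-ID file naming and the commit_<partition_id>_<transaction_id> pattern come from the notes above, and the exact location of the commit file is an assumption:

```text
s3://bucket/s3_hive_table/year=2025/month=7/7318742918232350720.parquet
s3://bucket/s3_hive_table/year=2025/month=7/7318742918232350721.parquet
s3://bucket/s3_hive_table/commit_202507_42   <- lists the two data files above
```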

Documentation entry for user-facing changes

...

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

@github-actions

github-actions bot commented Jul 28, 2025

Workflow [PR], commit [bad3bc0]

@svb-alt added the enhancement (New feature or request) and tiered storage (Antalya Roadmap: Tiered Storage) labels on Jul 30, 2025
@svb-alt linked an issue on Aug 8, 2025 that may be closed by this pull request
@arthurpassos
Collaborator Author

There is one thing I am not doing yet, but should: somehow handle failures to schedule an export task.

@arthurpassos
Collaborator Author

> There is one thing I am not doing yet, but should: somehow handle failures to schedule an export task.

This needs to be addressed asap

{
    if (areBackgroundMovesNeeded())
        background_moves_assignee.start();
    // if (areBackgroundMovesNeeded())
Collaborator Author


I need to think about this

@arthurpassos
Collaborator Author

One idea is to hold references to the data parts as soon as the request comes in, instead of locking the parts against merges/mutations.

This way we allow parts to be mutated or merged; that's not a problem as long as they remain on disk. Holding references will give us that guarantee.

We just need to make sure we grab those references upon restart as well.
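A self-contained sketch of the reference-holding idea (generic C++, not the PR's actual code; in ClickHouse the parts would be shared_ptrs to IMergeTreeDataPart, and "removed from disk" stands in for the part's cleanup):

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Stand-in for a merge tree data part; its on-disk data is removed
// only when the last shared_ptr to it goes away.
struct Part
{
    explicit Part(std::string name_) : name(std::move(name_)) {}
    ~Part() { std::cout << "part " << name << " removed from disk\n"; }
    std::string name;
};

using PartPtr = std::shared_ptr<const Part>;

int main()
{
    // Parts currently owned by the table.
    std::vector<PartPtr> active{std::make_shared<Part>("202507_1_1_0")};

    // The export grabs references as soon as the request comes in
    // (and would grab them again on restart, from the manifest).
    std::vector<PartPtr> export_refs = active;

    // A merge replaces the active part with a new one...
    active.clear();
    active.push_back(std::make_shared<Part>("202507_1_2_1"));

    // ...but the old part stays on disk, because the export still holds
    // a reference. It is only freed once the export is done with it.
    export_refs.clear();  // only now is 202507_1_1_0 "removed from disk"
}
```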

@arthurpassos
Collaborator Author

arthurpassos commented Sep 7, 2025

List of pending things off the top of my head:

  1. Documentation
  2. No need to capture exceptions in StorageObjectStorageMergeTreePartImporterSink anymore. It is fine for a pipeline to throw an exception; we'll catch it in the task.
  3. Make MergeTreeExportManifest a bit safer by checking JSON field existence before extracting (see the sketch after this list). Consider checksums.
  4. Add fsync support for MergeTreeExportManifest.
  5. Make max_retries configurable.
  6. Persist the attempt count.
  7. Exports throttler.
  8. Disable parallel formatting.
  9. Cancel mechanism?
  10. Correctly set up apply_deleted_mask, read_with_direct_io, and prefetch when reading from merge tree parts.
  11. Validations around https://github.com/Altinity/ClickHouse/pull/939/files#diff-a3d77682f605bf66aacbda72a660aaa789ddc37f064494e9bbe9dc934d59282eR581
  12. Determine the state of parts we are interested in.
  13. Fix commit file paths that contain an extra '/'.
  14. Tests
  15. QA
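For item 3, a sketch of the check-before-extract idea, using nlohmann::json purely for illustration (the manifest's real field names and JSON library are not shown in this PR, so everything below is an assumption):

```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

// Hypothetical manifest shape; the real MergeTreeExportManifest
// fields are not shown in this PR.
struct ManifestData
{
    std::string partition_id;
    std::string transaction_id;
};

ManifestData parseManifest(const std::string & raw)
{
    const auto json = nlohmann::json::parse(raw);

    // Check field existence up front so a truncated or corrupted manifest
    // produces a clear error instead of an exception deep in extraction.
    // A checksum over `raw` could be verified here as well.
    for (const auto * field : {"partition_id", "transaction_id"})
        if (!json.contains(field))
            throw std::runtime_error(std::string("export manifest is missing field: ") + field);

    return ManifestData{json["partition_id"].get<std::string>(),
                        json["transaction_id"].get<std::string>()};
}
```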

@@ -0,0 +1,61 @@
#pragma once
Collaborator Author


This is a workaround / refactor needed for two things:

  1. Override storage engine filenames (this was used when we wanted to preserve part names)
  2. Compute filenames separately (as opposed to computing them inside the sink) so we are able to build a commit file (sketched below)
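A sketch of the second point: compute destination paths up front so the same list can feed both the sinks and the commit file. All names here are hypothetical, not the PR's actual interfaces:

```cpp
#include <string>
#include <vector>

// Compute destination object paths before any sink runs, instead of
// letting each sink generate its own name internally.
std::vector<std::string> computeDestinationPaths(
    const std::vector<std::string> & file_ids,  // e.g. snowflake ids
    const std::string & table_root)             // e.g. hive-style prefix
{
    std::vector<std::string> paths;
    paths.reserve(file_ids.size());
    for (const auto & id : file_ids)
        paths.push_back(table_root + "/" + id + ".parquet");
    return paths;
}

// Each sink then receives one precomputed path; after all sinks finish,
// the same list is serialized into commit_<partition_id>_<transaction_id>.
```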

@arthurpassos
Collaborator Author

Parking this for now in favor of a simpler version.


Labels

antalya, antalya-25.8, enhancement (New feature or request), tiered storage (Antalya Roadmap: Tiered Storage)


Development

Successfully merging this pull request may close these issues.

ALTER TABLE EXPORT to external table
