
Iceberg Metadata Cache for faster query planning#13338

Closed
abhinav-yadav-git wants to merge 2 commits into trinodb:master from abhinav-yadav-git:Abhinav/IcebergMetadataCache

Conversation

@abhinav-yadav-git

@abhinav-yadav-git abhinav-yadav-git commented Jul 25, 2022

Description

Similar to the HiveMetastore cache, we need an in-memory cache for the Iceberg connector as well, so that metadata operations (getTableProperties, getTableStats, etc.) can be served from cache, reducing query planning time. The Iceberg connector needs it even more: because metadata files are persisted on a remote file system, each Iceberg table scan triggers remote reads of the metadata JSON and manifest files, and parsing these files in the Iceberg layer is a costly operation.

The current implementation is similar to the Hive one, but it is strongly consistent. It relies on the fact that in Iceberg the metadata JSON is immutable: every write produces a new metadata JSON file. On each getTableHandle call, the cache checks whether the metadata location of the cached table matches the current metadata JSON location; on a mismatch, it invalidates and flushes all the individual caches (such as statsCache and propertiesCache). This makes it practical to cache metadata for long durations without compromising correctness.
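To illustrate the invalidation mechanism described above, here is a minimal sketch, not the actual patch: class, record, and method names are hypothetical, and the real implementation would wrap per-operation caches rather than a single map. The key idea is that a cached entry remembers the metadata JSON location it was built from, and a changed location signals a table write.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: because Iceberg metadata files are immutable, a changed
// metadata location means the table was written to, so cached entries built
// from the old location must be discarded and reloaded.
class IcebergMetadataCacheSketch
{
    // Pairs cached data with the metadata JSON location it was loaded from.
    record CachedTable(String metadataLocation, Map<String, String> properties) {}

    private final Map<String, CachedTable> cache = new ConcurrentHashMap<>();

    public Map<String, String> getTableProperties(String table, String currentMetadataLocation)
    {
        CachedTable cached = cache.get(table);
        if (cached != null && cached.metadataLocation().equals(currentMetadataLocation)) {
            return cached.properties(); // cache hit: metadata unchanged since load
        }
        // Miss or stale entry: (re)load from the remote file system and repopulate.
        Map<String, String> loaded = loadPropertiesFromMetadataJson(currentMetadataLocation);
        cache.put(table, new CachedTable(currentMetadataLocation, loaded));
        return loaded;
    }

    // Stand-in for the expensive remote read and parse of the metadata JSON.
    private Map<String, String> loadPropertiesFromMetadataJson(String location)
    {
        return Map.of("metadata_location", location);
    }
}
```

As long as the current metadata location is cheap to obtain from the catalog (e.g. Glue), this check keeps the cache consistent with no TTL tuning: staleness is detected on the next read rather than bounded by an expiry interval.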

With this patch, the cache is only available for Iceberg tables using the Glue catalog, not the Hive catalog. Getting it working with the Hive catalog would require some refactoring to bypass the CachingHiveMetastore layer, since Hive has its own metastore cache and we want to avoid redundant caches in the code.

Improvements: we have tested this patch on the TPC-DS suite in our lower environments, and with this change we see a 90% reduction in planning time when the underlying table metadata has not changed.

Is this change a fix, improvement, new feature, refactoring, or other?
New Feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
Trino Iceberg Connector

How would you describe this change to a non-technical end user or system administrator?
Iceberg Metadata Cache for caching metadata operations.

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot

cla-bot bot commented Jul 25, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: abhinav-yadav-git.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it with: git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

@abhinav-yadav-git abhinav-yadav-git force-pushed the Abhinav/IcebergMetadataCache branch from 23d72f7 to 810aee1 Compare July 25, 2022 13:29
@electrum
Member

We actually don't allow usage of CachingHiveMetastore in Iceberg, so we don't need to worry about redundant caching.

@findepi
Member

findepi commented Jul 27, 2022

Can you please make sure you have filled out the CLA? For more information, see https://github.com/trinodb/cla.

@abhinav-yadav-git abhinav-yadav-git force-pushed the Abhinav/IcebergMetadataCache branch from 810aee1 to 2619685 Compare August 1, 2022 05:57
@cla-bot cla-bot bot added the cla-signed label Aug 1, 2022
@abhinav-yadav-git
Author

@findepi I have signed the CLA and refreshed the PR.

@alexjo2144
Member

alexjo2144 commented Aug 1, 2022

Between Trino and the Iceberg library there are a few conversations and approaches floating around for reducing planning time.

#11708
apache/iceberg#4518

There's also some caching in the Iceberg library that I don't think we leverage very well right now: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseSnapshot.java#L54

Maybe we can consolidate the conversation somewhere.

Improvements: we have tested this patch on the TPC-DS suite in our lower environments, and with this change we see a 90% reduction in planning time when the underlying table metadata has not changed.

Also, which version did you test against? @findepi merged this change into 391; I think that was a non-trivial improvement.

@abhinav-yadav-git
Author

@alexjo2144 The Iceberg PR you shared is something we are looking at as well. Caching manifest content should help reduce both planning and scheduling time, especially for complex queries where the same table appears in multiple subqueries or nested queries, causing repetitive manifest scanning.
Irrespective of optimizations at the Iceberg layer, caching the metadata operations at the Trino layer should eliminate most of the planning time, assuming most of the planning latency comes from costly table scans. I also agree we could create a parent issue for all planning improvements and link the other issues to it for clarity.

Also, I tested my patch on Trino versions 374, 379, and 388 and saw similar improvements. We are not on 391 yet; the stats cache PR should also bring a good improvement, but I believe it caches within a single query transaction, not across transactions.

@abhinav-yadav-git abhinav-yadav-git force-pushed the Abhinav/IcebergMetadataCache branch from 2619685 to 355e17a Compare August 4, 2022 15:22
@abhinav-yadav-git
Author

abhinav-yadav-git commented Sep 5, 2022

@findepi @electrum @erichwang @alexjo2144 Hi folks, can you please review this PR?

@ElapsedSoul

mark for record

@mosabua
Member

mosabua commented Jan 12, 2024

👋 @abhinav-yadav-git - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@github-actions

github-actions bot commented Sep 5, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 5, 2024
@raunaqmorarka
Member

This should be tackled by #22739
