Conversation
This one I didn't apply: #12196 (comment)
Review isn't required here.
Before the change, the planner cached table stats within an `IterativeOptimizer` run (as part of `Memo`). After the change, there is an additional cache that spans the whole optimization process.
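To make the distinction concrete, here is a minimal sketch of a query-scoped stats cache that outlives any single optimizer run. All names are simplified stand-ins for illustration (strings for `TableHandle`, doubles for `TableStatistics`), not Trino's actual classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch: a stats cache created once per query and threaded
// through every optimizer run, so repeated optimizers reuse stats that a
// per-run Memo cache would have recomputed.
public class QueryScopedStatsCache {
    // stand-in for TableHandle -> TableStatistics
    private final Map<String, Double> cache = new HashMap<>();
    private final Function<String, Double> connectorStats; // the expensive call
    private int misses;

    public QueryScopedStatsCache(Function<String, Double> connectorStats) {
        this.connectorStats = connectorStats;
    }

    public double getTableStatistics(String tableHandle) {
        return cache.computeIfAbsent(tableHandle, handle -> {
            misses++;
            return connectorStats.apply(handle);
        });
    }

    public int misses() {
        return misses;
    }

    public static void main(String[] args) {
        QueryScopedStatsCache stats = new QueryScopedStatsCache(handle -> (double) handle.length());
        // Two optimizer runs asking about the same table hit the connector once.
        stats.getTableStatistics("orders");
        stats.getTableStatistics("orders");
        System.out.println(stats.misses()); // prints "1"
    }
}
```

With a `Memo`-scoped cache, each of the many `IterativeOptimizer` instances would start from an empty map; sharing one instance across them is the whole point of the change.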
The overloads were introduced only temporarily.
Force-pushed cbe6f26 to 1b94f09.
Per my comment in the original PR: dc8228a#r911207504
  Session session,
- TypeProvider types);
+ TypeProvider types,
+ TableStatsProvider tableStatsProvider);
Why does TableStatsProvider need to be a first-class citizen in StatsCalculator? It seems it should be local to TableScanStatsRule.
TableScanStatsRule is effectively a singleton, while CachingTableStatsProvider's lifecycle is "during plan optimization", so it's query-scoped. We need to pass the TableStatsProvider instance created in the optimizer back to the TableScanStatsRule. The other option would be to make the stats rules query-scoped and create a new StatsCalculator for each query, but IMO that's generally not a good idea.
> other option would be to have stats rules query-scoped and create new StatsCalculator for each query, but this is IMO generally not a good idea.

Why? Maybe we can just preserve cached stats across planner/optimizer rules?

The root cause here is really that we have a lot of IterativeOptimizer instances. If we just had one, there would be no caching problem.

Anyway, I'm more in favor of preserving cached stats (maybe with weak references) across optimizer executions rather than adding artificial components to a clean interface. This is a similar issue as with caching expression optimizer/analyzer results.
> why? Maybe we can just preserve cached stats across planner/optimizer rules?

That's what this PR is doing.

> The root cause here is really that we have a lot of iterative optimizer instances. If we just had one, then there would be no caching problem.

I agree. Memo would then do all of the work, and we can remove this new concept once we get to that stage.

> Anyways, I'm more in favor of preserving cached stats (maybe with weak references) across optimizer executions rather than adding artificial components to a clean interface. This is similar issue as with caching expression optimizer/analyzer results.

I am not following. @sopel39, what are you suggesting?
> I am not following. @sopel39 what are you suggesting?

I suggest introducing some kind of context for doCalculate rather than adding another explicit parameter. I think @gaurav8297 has a similar proposal for expression optimizer/analyzer results.
Yes, I've implemented it as part of this PR: #12016

Basically, just as we have io.trino.sql.planner.iterative.Rule.Context for iterative rules, we could have the same thing for StatsRule.
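The Rule.Context analogy can be sketched as follows. This is a hypothetical illustration of the proposal, not Trino's real API; `TableStatsProvider`, `StatsCalculatorContext`, and `calculateTableScanStats` are all invented names for the example:

```java
// Hypothetical sketch of the Rule.Context idea applied to stats rules:
// per-query services travel in one context object, so adding a new service
// (like a table-stats cache) doesn't grow every doCalculate signature.
public class StatsRuleContextDemo {
    interface TableStatsProvider {
        double statsFor(String tableHandle);
    }

    // The context bundles whatever per-query state stats rules need.
    record StatsCalculatorContext(TableStatsProvider tableStatsProvider) { }

    // A rule receives the context instead of an ever-growing parameter list.
    static double calculateTableScanStats(String tableHandle, StatsCalculatorContext context) {
        return context.tableStatsProvider().statsFor(tableHandle);
    }

    public static void main(String[] args) {
        StatsCalculatorContext context = new StatsCalculatorContext(handle -> 42.0);
        System.out.println(calculateTableScanStats("orders", context)); // prints "42.0"
    }
}
```

The design tradeoff discussed above: an explicit parameter is more visible at call sites, while a context keeps stats rules stateless singletons without signature churn when new per-query services appear.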
> I suggest introducing some kind of context to doCalculate rather than adding another explicit parameter.

We still need to pass a parameter. I agree we could introduce a more generic context as well, but I don't see a need for it here yet; it can be a follow-up.
  @Override
- protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types)
+ protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types, TableStatsProvider tableStatsProvider)
Why can't StatsProvider handle caching? It does already, right? Something doesn't add up here.
That was my initial idea as well; see #12196 (comment).
  private final Metadata metadata;
  private final Session session;
+ private final Map<TableHandle, TableStatistics> cache = new WeakHashMap<>();
Nit: why do you need a WeakHashMap if CachingTableStatsProvider is created per query? Do you expect TableHandle objects to be GCed often while the query is running? It is possible for sure, but I am not sure it happens often in practice.
During optimization we can create a large number of TableHandle objects (e.g. via subsequent pushdowns), and old TableHandle objects may no longer be reachable from the plan. I don't think a "large number" would be a common situation (there isn't an unlimited number of pushdown opportunities for a given query), but I still don't think we should be using strong references here.
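The weak-key behavior being relied on here can be shown in isolation. This is illustrative only (a stand-in `Handle` type, not Trino code): with WeakHashMap keys, an entry whose handle is no longer referenced anywhere (e.g. the pre-pushdown handle after a pushdown replaced it in the plan) becomes eligible for removal instead of staying pinned for the rest of the query.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative sketch of why weak keys matter for a per-query
// TableHandle -> TableStatistics cache.
public class WeakStatsCacheDemo {
    static final class Handle { } // stand-in for TableHandle

    public static void main(String[] args) {
        Map<Handle, String> cache = new WeakHashMap<>();
        Handle live = new Handle();             // still referenced by the plan
        cache.put(live, "current stats");
        cache.put(new Handle(), "stale stats"); // key is unreachable immediately
        System.gc(); // the stale entry is now eligible for removal (timing is up to the GC)
        // The live entry survives because 'live' is strongly referenced.
        System.out.println(cache.containsKey(live)); // prints "true"
    }
}
```

Note that removal of weakly-referenced entries is at the discretion of the garbage collector; a WeakHashMap only guarantees the cache never *prevents* collection of an otherwise-unreachable key.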
losipiuk left a comment:

LGTM. Why not merge the commits together?
The commits were separated per my request in #12196.
#12196 with most comments applied
already reviewed at #12196
cc @lxynov @alexjo2144 @sopel39 @losipiuk