fix: avoid overflow in Scan::derive_stats for full-range UInt64 stats#19591
fix: avoid overflow in Scan::derive_stats for full-range UInt64 stats#19591sundy-li wants to merge 6 commits intodatabendlabs:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35498aaba3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Blocking issue: |
|
Skipping histogram synthesis here avoids the immediate overflow and the unsafe histogram path, but filtered scans on the same full-range integer stats are still misestimated. Once |
|
Blocking issue in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs:399-405: once src/query/sql/src/planner/plans/scan.rs:323-326 skips histogram synthesis for large integer bounds, filtered scans still fall back to compute_ndv_comparison(), which converts min, max, and the literal to f64. For UInt64(0..=u64::MAX), u64::MAX - 2 and u64::MAX collapse to the same f64, so a > u64::MAX - 2 returns zero selectivity even though u64::MAX matches. I reproduced this on commit ea36693 with cargo test -p databend-common-sql test_comparison_with_full_range_uint64_keeps_positive_selectivity -- --nocapture, which fails with: expected positive selectivity, got estimated_rows=0. Please make the non-histogram integer comparison path precision-safe too, or disable NDV/range selectivity when bounds exceed exact f64 precision. |
|
Blocking issue: skipping synthesized histograms for out-of-range integers is only a partial fix here. After The new tests only cover the no-predicate planner path and the histogram gate, so this remaining path stays untested. Please either keep the NDV comparison math in integer space for unsafe integer bounds or bail out of the NDV-based comparison when the bounds/literal are outside exact |
sundy-li
left a comment
There was a problem hiding this comment.
Blocking: skipping synthesized histograms in scan.rs avoids the overflow, but filtered scans on the same full-range UInt64 stats still fall back to SelectivityEstimator::compute_ndv_comparison() in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs, which converts min, max, and the literal through Datum::as_double(). For UInt64(0..=u64::MAX), float(u64::MAX - 2) == float(u64::MAX), so a predicate like d > u64::MAX - 2 is still estimated as zero rows even though u64::MAX matches. Please make that fallback integer path precision-safe too, or bail out when the bounds/literal are outside exact f64 integer precision.
|
Blocking issue in
Concrete example: for three distinct values Please tighten the guard to exclude that boundary (or move the inclusive-range math out of float space) and add a regression around the edge value. |
|
Blocking: skipping synthesized histograms in I verified this on the current PR head with a focused local reproduction over This needs either a precision-safe integer comparison path in |
|
Blocking issues:
|
|
Blocking: the new histogram gate in I reproduced this on the current PR head Command used: Failure: This still leaves incorrect selectivity for the filtered full-range |
|
Blocking issues from review:
|
sundy-li
left a comment
There was a problem hiding this comment.
Two blocking issues remain on the current head ea366936fc:
-
Scan::derive_stats()now skips histogram synthesis for wide integer bounds, but filtered scans on those same stats still fall back toSelectivityEstimator::compute_ndv_comparison(), which convertsmin,max, and the predicate literal throughDatum::as_double(). ForUInt64(0..=u64::MAX), tail values collapse to the samef64, so a predicate likea > u64::MAX - 2is still estimated as zero rows even thoughu64::MAXcan match. I reproduced this locally against the current PR head with a focused unit test: the estimate was0rows instead of a positive value. -
has_exact_integer_histogram_bounds()is still off by one at the IEEE-754 boundary. It allows1 << 53(and the signed+/-boundary), but the histogram path still models inclusive ranges in float space. For the three-value range[2^53 - 2, 2^53], the synthesized histogram makesa < 2^53 - 1estimate2rows instead of1. I reproduced that locally on the current head as well.
Please make the non-histogram integer comparison path precision-safe too (or bail out when the bounds/literal exceed exact f64 precision), and tighten the histogram guard to exclude the 2^53 boundary or move the inclusive-range math out of float space.
|
Review result: fail. I verified two blocking issues on the current head
The inline comments on |
|
Pushed Addressed the two remaining blockers from review:
Added focused regressions for both cases and re-ran the local SQL crate verification listed in the branch summary. |
|
Blocking issue: the new signed exactness guard is still incomplete. Ranges like |
|
Blocking issue: The new exact-range guard is too strict at the That turns exact ranges into Please make the boundary check inclusive ( |
|
Pushed Addressed the remaining exact-boundary review feedback:
Local |
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
Scan::derive_statsso full-rangeUInt64stats do not overflow during NDV reductioni64::MAXsaturation edge caseCollectStatisticsOptimizerwithmin = 0,max = u64::MAX, andndv = 2Tests
Verification:
cargo fmt --all --checkcargo test -p databend-common-sql --test it planner::fixture::tests::cargo clippy -p databend-common-sql --test it -- -D warningsType of change
Fixes #19555
This change is