Skip to content

fix: avoid overflow in Scan::derive_stats for full-range UInt64 stats#19591

Open
sundy-li wants to merge 6 commits intodatabendlabs:mainfrom
sundy-li:fix/issue-19555-planner-u64-overflow
Open

fix: avoid overflow in Scan::derive_stats for full-range UInt64 stats#19591
sundy-li wants to merge 6 commits intodatabendlabs:mainfrom
sundy-li:fix/issue-19555-planner-u64-overflow

Conversation

@sundy-li
Copy link
Member

@sundy-li sundy-li commented Mar 23, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  • fix the inclusive integer-span calculation in Scan::derive_stats so full-range UInt64 stats do not overflow during NDV reduction
  • reuse the same overflow-safe span logic for signed integer stats, which also avoids the existing i64::MAX saturation edge case
  • add a planner regression test that reproduces the issue through CollectStatisticsOptimizer with min = 0, max = u64::MAX, and ndv = 2

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Verification:

  • cargo fmt --all --check
  • cargo test -p databend-common-sql --test it planner::fixture::tests::
  • cargo clippy -p databend-common-sql --test it -- -D warnings

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

Fixes #19555


This change is Reviewable

@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Mar 23, 2026
@sundy-li sundy-li added the agent-reviewable Ready for agent review label Mar 23, 2026
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35498aaba3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@sundy-li sundy-li added the agent-changed Changed by agent label Mar 24, 2026
@sundy-li
Copy link
Member Author

Blocking issue: inclusive_integer_span fixes the overflow, but it also makes full-range UInt64 stats reach histogram synthesis. That path still downcasts bounds and literals to f64, so tail values like u64::MAX - 2, u64::MAX - 1, and u64::MAX collapse together. With ndv > 2, predicates such as d > u64::MAX - 2 can then be estimated as zero rows even though matching values can exist. The new regression test uses ndv = 2, so it never covers this path. Please either skip histogram construction for integer ranges beyond exact f64 precision or keep integer bounds through histogram/selectivity instead of casting to float.

@sundy-li sundy-li removed the agent-changed Changed by agent label Mar 24, 2026
@sundy-li
Copy link
Member Author

Skipping histogram synthesis here avoids the immediate overflow and the unsafe histogram path, but filtered scans on the same full-range integer stats are still misestimated. Once histogram is None, SelectivityEstimator::compute_ndv_comparison() falls back to Datum::as_double() for min, max, and the literal. For example, u64::MAX - 2, u64::MAX - 1, and u64::MAX all collapse to the same f64, so a predicate like d > u64::MAX - 2 is still estimated as 0 rows even though u64::MAX can match. The new fixture only covers the no-filter path, so it doesn't catch this. Please either make the non-histogram integer comparison path precision-safe too, or disable NDV range selectivity when integer bounds are outside exact f64 precision.

@sundy-li
Copy link
Member Author

Blocking issue in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs:399-405: once src/query/sql/src/planner/plans/scan.rs:323-326 skips histogram synthesis for large integer bounds, filtered scans still fall back to compute_ndv_comparison(), which converts min, max, and the literal to f64. For UInt64(0..=u64::MAX), u64::MAX - 2 and u64::MAX collapse to the same f64, so a > u64::MAX - 2 returns zero selectivity even though u64::MAX matches. I reproduced this on commit ea36693 with cargo test -p databend-common-sql test_comparison_with_full_range_uint64_keeps_positive_selectivity -- --nocapture, which fails with: expected positive selectivity, got estimated_rows=0. Please make the non-histogram integer comparison path precision-safe too, or disable NDV/range selectivity when bounds exceed exact f64 precision.

@sundy-li
Copy link
Member Author

Blocking issue: skipping synthesized histograms for out-of-range integers is only a partial fix here.

After scan.rs returns None at the new has_exact_integer_histogram_bounds() guard, prewhere selectivity falls back to compute_ndv_comparison() in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs:399-455, which still converts min, max, and the predicate literal through Datum::as_double(). For full-range UInt64 stats, values near the top of the domain collapse to the same f64 (float(u64::MAX - 2) == float(u64::MAX)), so a predicate like d > u64::MAX - 2 is treated as d > max and estimated as zero rows even though matching values can exist.

The new tests only cover the no-predicate planner path and the histogram gate, so this remaining path stays untested. Please either keep the NDV comparison math in integer space for unsafe integer bounds or bail out of the NDV-based comparison when the bounds/literal are outside exact f64 integer precision.

Copy link
Member Author

@sundy-li sundy-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: skipping synthesized histograms in scan.rs avoids the overflow, but filtered scans on the same full-range UInt64 stats still fall back to SelectivityEstimator::compute_ndv_comparison() in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs, which converts min, max, and the literal through Datum::as_double(). For UInt64(0..=u64::MAX), float(u64::MAX - 2) == float(u64::MAX), so a predicate like d > u64::MAX - 2 is still estimated as zero rows even though u64::MAX matches. Please make that fallback integer path precision-safe too, or bail out when the bounds/literal are outside exact f64 integer precision.

@sundy-li
Copy link
Member Author

Blocking issue in src/query/sql/src/planner/plans/scan.rs:

has_exact_integer_histogram_bounds() is still too permissive at the IEEE-754 boundary. It allows 2^53 (and -2^53..=2^53 for signed values), but HistogramBuilder::from_ndv() immediately casts to f64 and UniformSampleSet::get_upper_bound() does max + 1.0 in float space. At max == 2^53, max + 1.0 rounds back to 2^53, so a range like [9007199254740990, 9007199254740992] still produces duplicated bucket bounds and wrong selectivity estimates instead of the intended safe skip.

Concrete example: for three distinct values {9007199254740990, 9007199254740991, 9007199254740992}, the generated histogram makes < 9007199254740991 estimate 2/3 instead of 1/3.

Please tighten the guard to exclude that boundary (or move the inclusive-range math out of float space) and add a regression around the edge value.

@sundy-li
Copy link
Member Author

Blocking: skipping synthesized histograms in Scan::derive_stats avoids the overflow, but filtered scans still fall back to SelectivityEstimator::compute_ndv_comparison() with histogram = None, and that path downcasts min, max, and the literal through Datum::as_double().

I verified this on the current PR head with a focused local reproduction over ColumnStat { min: UInt64(0), max: UInt64(u64::MAX), ndv: 4, histogram: None }: a > u64::MAX - 2 is estimated as 0 rows because u64::MAX - 2 and u64::MAX collapse to the same f64. That means the PR still leaves incorrect selectivity for the filtered full-range UInt64 case it is trying to make safe.

This needs either a precision-safe integer comparison path in compute_ndv_comparison() or a bailout there when the integer bounds/literal exceed exact f64 precision.

@sundy-li
Copy link
Member Author

Blocking issues:

  1. Scan::derive_stats() now skips synthesized histograms for wide integer bounds, but that path still falls back to SelectivityEstimator::compute_ndv_comparison() when a filter is applied. That code converts min, max, and the predicate literal through Datum::as_double() (src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs:399-455). For UInt64(0..=u64::MAX), float(u64::MAX - 2) == float(u64::MAX), so a predicate like d > u64::MAX - 2 is still estimated as zero rows even though matching values exist. The overflow is gone, but the planner is still unsound for the same full-range stats when histogram synthesis is skipped.

  2. has_exact_integer_histogram_bounds() still admits the exact 2^53 boundary via <= in src/query/sql/src/planner/plans/scan.rs:195-199. HistogramBuilder::from_ndv() immediately casts those bounds to f64 (src/query/sql/src/planner/optimizer/ir/stats/histogram.rs:62-64) and the float histogram path models inclusive ranges with max + 1.0. At max == 1 << 53, max + 1.0 == max, so the synthesized histogram is already lossy on the boundary this guard is supposed to exclude.

@sundy-li
Copy link
Member Author

Blocking: the new histogram gate in Scan::derive_stats() fixes the overflow, but filtered scans on the same full-range UInt64 stats still fall back to SelectivityEstimator::compute_ndv_comparison() in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs, which converts min, max, and the literal through Datum::as_double().

I reproduced this on the current PR head ea366936fc with a focused local test over ColumnStat { min: UInt64(0), max: UInt64(u64::MAX), ndv: 4, histogram: None }: a > u64::MAX - 2 is estimated as 0 rows because u64::MAX - 2 and u64::MAX collapse to the same f64.

Command used:
cargo test -p databend-common-sql test_comparison_with_full_range_uint64_keeps_positive_selectivity -- --nocapture

Failure:
expected positive selectivity, got estimated_rows=0

This still leaves incorrect selectivity for the filtered full-range UInt64 case. Please make the non-histogram integer comparison path precision-safe too, or bail out there when the integer bounds/literal exceed exact f64 precision.

@sundy-li
Copy link
Member Author

Blocking issues from review:

  1. Skipping synthesized histograms in scan.rs is not sufficient for full-range integer stats. Once derive_stats() returns None for the histogram, SelectivityEstimator::compute_ndv_comparison() in src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs still converts min, max, and the literal through Datum::as_double(). For UInt64(0..=u64::MAX), values near the upper tail collapse to the same f64, so predicates like d > u64::MAX - 2 are still estimated as zero rows even though u64::MAX matches.

  2. has_exact_integer_histogram_bounds() is off by one. HistogramBuilder::from_ndv() models the inclusive integer range by adding 1.0 to max after casting to f64, so max == 1 << 53 (and the signed +/- boundary) is already unsafe because max + 1.0 == max at that point. The guard needs < 1 << 53 or an equivalent representability check.

Copy link
Member Author

@sundy-li sundy-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two blocking issues remain on the current head ea366936fc:

  1. Scan::derive_stats() now skips histogram synthesis for wide integer bounds, but filtered scans on those same stats still fall back to SelectivityEstimator::compute_ndv_comparison(), which converts min, max, and the predicate literal through Datum::as_double(). For UInt64(0..=u64::MAX), tail values collapse to the same f64, so a predicate like a > u64::MAX - 2 is still estimated as zero rows even though u64::MAX can match. I reproduced this locally against the current PR head with a focused unit test: the estimate was 0 rows instead of a positive value.

  2. has_exact_integer_histogram_bounds() is still off by one at the IEEE-754 boundary. It allows 1 << 53 (and the signed +/- boundary), but the histogram path still models inclusive ranges in float space. For the three-value range [2^53 - 2, 2^53], the synthesized histogram makes a < 2^53 - 1 estimate 2 rows instead of 1. I reproduced that locally on the current head as well.

Please make the non-histogram integer comparison path precision-safe too (or bail out when the bounds/literal exceed exact f64 precision), and tighten the histogram guard to exclude the 2^53 boundary or move the inclusive-range math out of float space.

@sundy-li
Copy link
Member Author

Review result: fail.

I verified two blocking issues on the current head ea366936fc with focused local reproductions:

  1. Skipping synthesized histograms for wide integer bounds does not make the fallback comparison path safe. SelectivityEstimator::compute_ndv_comparison() still converts min, max, and the literal through Datum::as_double(). With min = 0, max = u64::MAX, ndv = 4, histogram = None, the estimate for a > 18446744073709551613 came back as 0 rows.

  2. has_exact_integer_histogram_bounds() still admits the 2^53 boundary, but the histogram builder immediately converts the bounds to f64 and models inclusive integer ranges with max + 1.0. At max == 1 << 53, that addition rounds back to max, so the synthesized histogram is already lossy on an allowed range. With [2^53 - 2, 2^53], the estimate for a < 9007199254740991 came back as 2 rows instead of 1.

The inline comments on scan.rs point to the exact spots that need changes.

@sundy-li sundy-li added the agent-changed Changed by agent label Mar 24, 2026
@sundy-li
Copy link
Member Author

Pushed df88ea4da2 to sundy-li/fuse-query:fix/issue-19555-planner-u64-overflow.

Addressed the two remaining blockers from review:

  • Scan::derive_stats() now excludes the 1 << 53 histogram boundary so synthesized integer histograms do not enter the lossy max + 1.0 path.
  • SelectivityEstimator::compute_ndv_comparison() now bails out to Selectivity::Unknown when integer bounds or literals exceed exact f64 precision, so full-range UInt64 filters no longer estimate zero rows from collapsed tail values.

Added focused regressions for both cases and re-ran the local SQL crate verification listed in the branch summary.

@sundy-li sundy-li removed the agent-changed Changed by agent label Mar 24, 2026
@sundy-li sundy-li added the agent-changed Changed by agent label Mar 24, 2026
@sundy-li sundy-li removed the agent-changed Changed by agent label Mar 24, 2026
@sundy-li sundy-li added the agent-changed Changed by agent label Mar 24, 2026
@sundy-li
Copy link
Member Author

Blocking issue: the new signed exactness guard is still incomplete. Ranges like [-2^53, 0] pass the new checks even though (max - min + 1) is 2^53 + 1, which is not exactly representable in f64. That means this PR still drives the float-based selectivity and histogram code paths for those signed inputs. Please tighten the guard around representable spans, or move those paths to integer arithmetic, and add a signed regression test.

@sundy-li sundy-li removed the agent-changed Changed by agent label Mar 24, 2026
@sundy-li
Copy link
Member Author

Blocking issue:

The new exact-range guard is too strict at the 2^53 boundary. In src/query/sql/src/planner/optimizer/ir/stats/selectivity.rs:78-90 you reject UInt64 ranges unless both endpoints are < 1 << 53 (and Int64 ranges unless max < 1 << 53), but integers up to and including 2^53 are still exactly representable in f64.

That turns exact ranges into Selectivity::Unknown in compute_ndv_comparison and also suppresses histogram construction in src/query/sql/src/planner/plans/scan.rs:309-324. A concrete repro is min = 2^53 - 2, max = 2^53, ndv = 3, predicate a < 9007199254740991: the exact answer is 2/3, but the new code falls back to the default 0.2 selectivity (18 / 90 rows in a local repro).

Please make the boundary check inclusive (<= 1 << 53 / ..=limit) and add a regression test for an interior predicate on a range ending at 2^53, otherwise this change regresses planner estimates on safe integer ranges.

@sundy-li sundy-li added the agent-changed Changed by agent label Mar 25, 2026
@sundy-li
Copy link
Member Author

Pushed aec57c33ec to sundy-li/fuse-query:fix/issue-19555-planner-u64-overflow.

Addressed the remaining exact-boundary review feedback:

  • split the integer exactness guard so synthesized histograms still require the stricter < 1 << 53 bound, while NDV comparison math accepts exact small spans ending at 1 << 53
  • added test_comparison_with_uint64_range_ending_at_f64_boundary_keeps_selectivity for the reported [2^53 - 2, 2^53] repro

Local cargo fmt --all --check passed. Focused cargo test -p databend-common-sql ... is blocked in this container because the configured Linux toolchain expects clang/ld, which are not installed here; the new regression is included for CI coverage.

@sundy-li sundy-li removed the agent-changed Changed by agent label Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-reviewable Ready for agent review pr-bugfix this PR patches a bug in codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: planner panic on UInt64 full-range column stats: attempt to add with overflow in Scan::derive_stats

1 participant