[NA] [BE] Fix high cost metrics calculation #5965

Merged
thiagohora merged 7 commits into main from thiaghora/NA-fix-project-metrics-spans-filter
Apr 1, 2026

Contributor

@thiagohora thiagohora commented Mar 30, 2026

Details

Root cause

Span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, and GET_TOKEN_USAGE_WITH_BREAKDOWN were filtering spans by span.id (the 5th column in the spans ORDER BY). ClickHouse's primary key index cannot effectively prune granules when the filter targets a trailing key column, so each query performed a near full-table scan (~25,429 granules read). This also caused a correctness issue: spans whose IDs fall outside the trace time window were silently excluded. In production the old filter missed about 209K of 1.8M spans (~11.4% of the total, or ~12.8% relative to the rows it did return).
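
The pruning argument above can be illustrated with a toy model. The sketch below is not ClickHouse's actual index algorithm: it assumes a simplified two-column sort key (traceId, spanId), a hypothetical granule size of 100 rows, and per-granule min/max ranges standing in for the sparse primary index. It counts how many granules a range filter has to read on the leading column versus the trailing one.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model of granule pruning on rows sorted by (traceId, spanId). */
final class GranulePruningSketch {

    static final int GRANULE_ROWS = 100; // hypothetical granule size, chosen for readability

    /** Builds rows {traceId, spanId}, sorted by traceId then spanId: 1,000 traces with 4 spans each. */
    static long[][] buildSortedRows() {
        long[][] rows = new long[4000][];
        for (int k = 0; k < rows.length; k++) {
            long traceId = k / 4;
            long spanId = (k * 7919L) % 4000; // scrambled so span IDs do not follow trace order
            rows[k] = new long[] {traceId, spanId};
        }
        Arrays.sort(rows, Comparator.<long[]>comparingLong(r -> r[0]).thenComparingLong(r -> r[1]));
        return rows;
    }

    /** Counts granules whose [min, max] range for the given column overlaps [lo, hi]. */
    static int granulesToRead(long[][] rows, int column, long lo, long hi) {
        int count = 0;
        for (int start = 0; start < rows.length; start += GRANULE_ROWS) {
            int end = Math.min(start + GRANULE_ROWS, rows.length);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, rows[i][column]);
                max = Math.max(max, rows[i][column]);
            }
            if (min <= hi && max >= lo) {
                count++; // the granule may hold matching rows, so it cannot be skipped
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[][] rows = buildSortedRows();
        System.out.println("leading column filter:  " + granulesToRead(rows, 0, 100, 199));
        System.out.println("trailing column filter: " + granulesToRead(rows, 1, 400, 799));
    }
}
```

With this synthetic data the leading-column filter touches 4 of 40 granules, while the trailing-column filter cannot rule out any of the 40. The magnitudes are artificial; only the direction of the effect is meant to match the span.id versus trace_id comparison.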

Changes

ProjectMetricsDAO.java

  • Span subquery scoping fix (GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, GET_TOKEN_USAGE_WITH_BREAKDOWN): Added AND trace_id IN (SELECT id FROM traces_filtered) to each span subquery. This allows ClickHouse to use trace_id (3rd column in ORDER BY) for granule pruning, reducing granules read from 25,429 to 4,959 (~5×) and cutting query latency by up to ~2× in production (see the benchmark below).

  • Materialized duration column: Replaced inline if(end_time IS NOT NULL ... dateDiff('microsecond', ...) / 1000.0 ...) AS duration expressions with the existing duration MATERIALIZED column in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and GET_AVERAGE_DURATION. The MATERIALIZED column stores the pre-computed value and avoids recomputing it at query time across millions of rows.

  • Remove FINAL from feedback score reads: Removed FINAL from feedback_scores and authored_feedback_scores in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and THREAD_FILTERED_PREFIX. Deduplication is already handled downstream by the ROW_NUMBER() window function, so applying FINAL here forced redundant merge-time deduplication.

  • THREAD_FILTERED_PREFIX scoping: Moved traces_final after trace_threads_final and scoped it with AND thread_id IN (SELECT thread_id FROM trace_threads_final). Previously traces_final loaded all traces in the project with a non-empty thread_id, ignoring the time window filter entirely.
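
To see why FINAL is redundant when a ROW_NUMBER() window already keeps one row per key, here is a minimal in-memory sketch of that deduplication. The row shape, the (entityId, name) partition key, and the lastUpdatedAt ordering column are assumptions for illustration; the description above does not spell out the actual window definition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory analogue of "keep the row where ROW_NUMBER() = 1" per key. */
final class LatestRowDedupSketch {

    /** One stored version of a feedback score (hypothetical shape). */
    static final class ScoreRow {
        final String entityId;
        final String name;
        final long lastUpdatedAt;
        final double value;

        ScoreRow(String entityId, String name, long lastUpdatedAt, double value) {
            this.entityId = entityId;
            this.name = name;
            this.lastUpdatedAt = lastUpdatedAt;
            this.value = value;
        }
    }

    /** Keeps only the newest row for each (entityId, name) pair (assumed partition key). */
    static List<ScoreRow> latestPerKey(List<ScoreRow> rows) {
        Map<String, ScoreRow> latest = new LinkedHashMap<>();
        for (ScoreRow row : rows) {
            String key = row.entityId + "|" + row.name;
            // Replace the stored row only when the incoming one is strictly newer.
            latest.merge(key, row, (old, incoming) ->
                    incoming.lastUpdatedAt > old.lastUpdatedAt ? incoming : old);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<ScoreRow> rows = new ArrayList<>();
        rows.add(new ScoreRow("trace-1", "quality", 1L, 0.2));
        rows.add(new ScoreRow("trace-1", "quality", 2L, 0.9)); // newer version of the same score
        rows.add(new ScoreRow("trace-2", "quality", 1L, 0.5));
        for (ScoreRow row : latestPerKey(rows)) {
            System.out.println(row.entityId + " " + row.name + " " + row.value);
        }
    }
}
```

Because this step already collapses duplicate versions, deduplicating the same rows a second time at read time adds cost without changing the result, under the assumption that both mechanisms pick the same "newest" row.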

Migration 000076

  • Adds a minmax skip index on authored_feedback_scores.created_at to enable efficient time-bounded range filtering on that table.

Production benchmark (same workspace/project, 7-day window, 522K traces / 1.9M spans)

Query                     Before   After    Speedup
GET_COST                  3.28s    1.70s    1.9×
GET_COST_WITH_BREAKDOWN   3.14s    2.59s    1.2×
GET_TOKEN_USAGE           2.28s    1.15s    2.0×
GET_SPAN_DURATION         2.20s    1.21s    1.8×
GET_TRACE_COUNT           0.61s    0.68s    ~same
GET_SPAN_COUNT            2.43s    2.50s    ~same
GET_THREAD_COUNT          0.28s    0.28s    ~same
GET_AVERAGE_DURATION      0.34s    0.33s    ~same

EXPLAIN indexes confirmed 25,429 → 4,959 granules read on the spans table for the cost/token queries.
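
The Speedup column and the granule ratio can be re-derived from the reported numbers as before / after, rounded to one decimal place. A small sketch (the class and method names are illustrative, not part of the PR):

```java
/** Recomputes the benchmark ratios from the figures reported above. */
final class BenchmarkRatios {

    /** Returns before / after, rounded to one decimal place. */
    static double ratio(double before, double after) {
        return Math.round(before / after * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println("GET_COST                " + ratio(3.28, 1.70));
        System.out.println("GET_COST_WITH_BREAKDOWN " + ratio(3.14, 2.59));
        System.out.println("GET_TOKEN_USAGE         " + ratio(2.28, 1.15));
        System.out.println("GET_SPAN_DURATION       " + ratio(2.20, 1.21));
        System.out.println("granules read           " + ratio(25429, 4959));
    }
}
```

The granule ratio comes out at about 5.1, consistent with the "~5×" quoted for the spans table.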

Change checklist

  • User facing
  • Documentation update

Issues

  • NA

Testing

  • Verified query correctness on production: old span.id filter returned 1,629,299 rows; new trace_id filter returns 1,838,544 rows (209K previously missing spans recovered).
  • EXPLAIN indexes confirmed granule pruning improvement (~5×).
  • Full latency benchmark run on production ClickHouse (5 runs each, same workspace/project).

Documentation

N/A

Span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE,
and GET_TOKEN_USAGE_WITH_BREAKDOWN were not scoping spans to the traces
returned by traces_filtered. Adding AND trace_id IN (SELECT id FROM
traces_filtered) ensures spans are only aggregated for traces that pass
all applied filters (time range, name, metadata, feedback scores, etc.).

Benchmarked on production (1.9M spans, 7-day window):
- Granules read: 25,429 → 4,959 (5x reduction)
- GET_TOKEN_USAGE latency: ~2.0s → ~0.6s median (3.5x faster)
- GET_COST latency: ~1.6s → ~0.9s median (1.7x faster)
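
As a rough sketch of the query shape this commit describes, the helper below assembles a span aggregation scoped to traces_filtered. Only the trace_id IN (SELECT id FROM traces_filtered) clause and the spans table come from the description above; the helper name, the workspace_id and project_id predicates, the named bind parameters, and the example aggregate are assumptions, and the traces_filtered CTE is presumed to be defined earlier in the same statement.

```java
/** Illustrative builder for a span subquery scoped to the filtered traces. */
final class ScopedSpanQuerySketch {

    /** Returns a span aggregation that only reads spans belonging to traces_filtered. */
    static String scopedSpanAggregate(String aggregateExpression) {
        return String.join("\n",
                "SELECT " + aggregateExpression,
                "FROM spans",
                "WHERE workspace_id = :workspace_id",   // hypothetical predicate and bind name
                "  AND project_id = :project_id",       // hypothetical predicate and bind name
                "  AND trace_id IN (SELECT id FROM traces_filtered)"); // clause added by the fix
    }

    public static void main(String[] args) {
        // total_estimated_cost is a hypothetical column name used only for this example.
        System.out.println(scopedSpanAggregate("sum(total_estimated_cost) AS cost"));
    }
}
```

Keeping the scoping clause in one place like this is only a presentation choice for the sketch; the PR itself adds the clause to each of the four query templates.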
@thiagohora thiagohora requested a review from a team as a code owner March 30, 2026 14:24
@github-actions github-actions bot added the java (Pull requests that update Java code) and Backend labels Mar 30, 2026
Contributor

github-actions bot commented Mar 30, 2026

Backend Tests - Integration Group 6

273 tests   273 ✅  2m 32s ⏱️
 25 suites    0 💤
 25 files      0 ❌

Results for commit 0437a93.

♻️ This comment has been updated with latest results.

…ed_at index on authored_feedback_scores

- Add AND trace_id IN (SELECT id FROM traces_filtered) to span subqueries in
  GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, GET_TOKEN_USAGE_WITH_BREAKDOWN.
  Previously filtering by span.id (5th ORDER BY column) caused full-table scans;
  the fix reduces granules read from 25,429 to 4,959 (~5x) and query latency by ~2x.
- Replace inline dateDiff duration expressions with the MATERIALIZED duration column
  in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and GET_AVERAGE_DURATION.
- Remove FINAL from feedback_scores and authored_feedback_scores reads in
  TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and THREAD_FILTERED_PREFIX;
  deduplication relies on the ROW_NUMBER() window function, which is already applied.
- Scope traces_final in THREAD_FILTERED_PREFIX to only traces whose thread_id is
  in the selected time window (was previously loading all threads in the project).
- Add minmax skip index on authored_feedback_scores.created_at (migration 000076).
@thiagohora thiagohora changed the title from "[NA] [BE] fix: filter project metrics span subqueries by trace_id" to "[NA] [BE] Fix high cost metrics calculation" Mar 30, 2026
Contributor

@ldaugusto ldaugusto left a comment

Two things to confirm:

Member

@andrescrz andrescrz left a comment

LGTM.

Contributor

@ldaugusto ldaugusto left a comment

Since we default to use_skip_indexes_if_final=1, this is good to go.

@thiagohora thiagohora merged commit 574e867 into main Apr 1, 2026
78 checks passed
@thiagohora thiagohora deleted the thiaghora/NA-fix-project-metrics-spans-filter branch April 1, 2026 14:20

3 participants