[NA] [BE] Fix high cost metrics calculation by thiagohora · Pull Request #5965 · comet-ml/opik

thiagohora · 2026-03-30T14:24:48Z

Details

Root cause

Span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, and GET_TOKEN_USAGE_WITH_BREAKDOWN were filtering spans by span.id (the 5th column in the spans ORDER BY). ClickHouse's primary key index cannot efficiently prune granules on non-leading columns, resulting in near full-table scans (~25,429 granules read per query). This also caused correctness issues: spans whose IDs fall outside the trace time window were silently excluded, so the old filter missed 209K of the 1.8M spans in production (~11.4% of the corrected total; equivalently, the corrected result is ~12.8% larger than the old one).
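The pruning behaviour described in the root cause can be illustrated with a toy model of a sparse primary index. This is a simplified sketch, not ClickHouse's actual algorithm: the two-column sort key (trace_id, span_id), the granule size, the range predicate on span_id, and all function names are assumptions made for illustration (the real spans ORDER BY has more columns, and the PR does not show the exact form of the old span.id filter).

```python
GRANULE_SIZE = 100  # rows per granule in this toy (ClickHouse's default is 8192)


def build_rows(n_traces=1000, spans_per_trace=4):
    """Rows sorted by (trace_id, span_id), mimicking a table's sort order."""
    total = n_traces * spans_per_trace
    rows = []
    for trace_id in range(n_traces):
        for k in range(spans_per_trace):
            i = trace_id * spans_per_trace + k
            span_id = (i * 7919) % total  # scattered ids, uncorrelated with position
            rows.append((trace_id, span_id))
    rows.sort()
    return rows


def build_marks(rows):
    """Sparse index: only the key of the first row of each granule is stored."""
    return [rows[i] for i in range(0, len(rows), GRANULE_SIZE)]


def granules_for_trace_ids(marks, wanted):
    """Leading column: granule g can only hold trace ids between two adjacent marks."""
    selected = 0
    for g, (lo, _) in enumerate(marks):
        hi = marks[g + 1][0] if g + 1 < len(marks) else float("inf")
        if any(lo <= t <= hi for t in wanted):
            selected += 1
    return selected


def granules_for_span_range(marks, span_lo, span_hi):
    """Non-leading column: marks bound span_id only when the leading column
    is identical in both adjacent marks; otherwise the granule must be read."""
    selected = 0
    for g, (trace_lo, s_lo) in enumerate(marks):
        if g + 1 < len(marks) and marks[g + 1][0] == trace_lo:
            s_hi = marks[g + 1][1]
            if s_hi < span_lo or s_lo > span_hi:
                continue  # provably outside the requested range
        selected += 1
    return selected


rows = build_rows()
marks = build_marks(rows)
by_trace = granules_for_trace_ids(marks, range(10))
by_span = granules_for_span_range(marks, 0, 500)
print(f"granules total={len(marks)} trace_id filter={by_trace} span_id filter={by_span}")
```

In this toy the leading-column filter touches a small fraction of the granules, while the filter on the later column cannot exclude any, which is the same shape as the 25,429 vs 4,959 granule counts reported in this PR.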

Changes

ProjectMetricsDAO.java

Span subquery scoping fix (GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, GET_TOKEN_USAGE_WITH_BREAKDOWN): Added AND trace_id IN (SELECT id FROM traces_filtered) to each span subquery. This allows ClickHouse to use trace_id (3rd column in ORDER BY) for granule pruning, reducing granules read from 25,429 to 4,959 (~5×) and cutting query latency by ~2× in production.
Materialized duration column: Replaced inline if(end_time IS NOT NULL ... dateDiff('microsecond', ...) / 1000.0 ...) AS duration expressions with the existing duration MATERIALIZED column in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and GET_AVERAGE_DURATION. The MATERIALIZED column stores the pre-computed value and avoids recomputing it at query time across millions of rows.
Remove FINAL from feedback score reads: Removed FINAL from feedback_scores and authored_feedback_scores in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and THREAD_FILTERED_PREFIX. Deduplication is already handled downstream by the ROW_NUMBER() window function, so applying FINAL here forced redundant query-time deduplication.
THREAD_FILTERED_PREFIX scoping: Moved traces_final after trace_threads_final and scoped it with AND thread_id IN (SELECT thread_id FROM trace_threads_final). Previously traces_final loaded all traces in the project with a non-empty thread_id, ignoring the time window filter entirely.
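The effect of the added AND trace_id IN (SELECT id FROM traces_filtered) clause on which spans get aggregated can be sketched with SQLite standing in for ClickHouse. Everything except traces_filtered and the added clause is hypothetical: the table layouts, the cost column, and the name filter are invented for illustration, and the sketch shows only the row-scoping semantics, not granule pruning or the old span.id filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, minimal stand-ins for the real tables.
    CREATE TABLE traces (id TEXT PRIMARY KEY, project_id TEXT, name TEXT);
    CREATE TABLE spans (id TEXT PRIMARY KEY, trace_id TEXT, project_id TEXT, cost REAL);
    INSERT INTO traces VALUES ('t1', 'p1', 'chat'), ('t2', 'p1', 'batch');
    INSERT INTO spans VALUES
        ('s1', 't1', 'p1', 1.0),
        ('s2', 't1', 'p1', 2.0),
        ('s3', 't2', 'p1', 10.0);
    """
)

# Span aggregation that ignores which traces passed the trace-level filters.
UNSCOPED_SQL = """
WITH traces_filtered AS (
    SELECT id FROM traces WHERE project_id = :project_id AND name = :name
)
SELECT COALESCE(SUM(cost), 0.0) FROM spans
WHERE project_id = :project_id
"""

# Same query with the clause added in this PR.
SCOPED_SQL = UNSCOPED_SQL + "AND trace_id IN (SELECT id FROM traces_filtered)"

params = {"project_id": "p1", "name": "chat"}
unscoped_cost = conn.execute(UNSCOPED_SQL, params).fetchone()[0]
scoped_cost = conn.execute(SCOPED_SQL, params).fetchone()[0]
print(f"unscoped={unscoped_cost} scoped={scoped_cost}")
```

With the clause in place, only spans belonging to traces that pass the trace-level filter (here, name = 'chat') contribute to the sum.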

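Migration 000076 (renamed to 000078_add_minmax_index_authored_feedback_scores_created_at.sql in a later commit on this PR)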

Adds a minmax skip index on authored_feedback_scores.created_at to enable efficient time-bounded range filtering on that table.
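How a minmax skip index helps a time-bounded filter can be shown with a small model: each block of rows stores only the minimum and maximum created_at, and blocks whose range cannot overlap the requested window are skipped. This is a conceptual sketch; the block size, the sample data, and the function names are assumptions, and the actual migration SQL (index name, granularity) is not shown in this PR.

```python
from datetime import datetime, timedelta

BLOCK_ROWS = 4  # rows summarized per index entry (hypothetical)


def build_minmax_index(created_at_values, block_rows=BLOCK_ROWS):
    """Store (min, max) of created_at for each consecutive block of rows."""
    index = []
    for start in range(0, len(created_at_values), block_rows):
        block = created_at_values[start:start + block_rows]
        index.append((min(block), max(block)))
    return index


def blocks_to_read(index, range_start, range_end):
    """Keep only blocks whose [min, max] overlaps the requested time range."""
    return [
        i for i, (lo, hi) in enumerate(index)
        if hi >= range_start and lo <= range_end
    ]


# Ten days of rows, one every six hours, stored roughly in insertion order.
base = datetime(2026, 3, 1)
created_at = [base + timedelta(hours=6 * i) for i in range(40)]

index = build_minmax_index(created_at)
wanted = blocks_to_read(index, datetime(2026, 3, 8), datetime(2026, 3, 10, 23, 59, 59))
print(f"{len(wanted)} of {len(index)} blocks need to be read: {wanted}")
```

The index only pays off when created_at correlates with the physical row order; if timestamps were scattered across blocks, every block's range would overlap the window and nothing would be skipped.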

Production benchmark (same workspace/project, 7-day window, 522K traces / 1.9M spans)

Query	Before	After	Speedup
GET_COST	3.28s	1.70s	1.9×
GET_COST_WITH_BREAKDOWN	3.14s	2.59s	1.2×
GET_TOKEN_USAGE	2.28s	1.15s	2.0×
GET_SPAN_DURATION	2.20s	1.21s	1.8×
GET_TRACE_COUNT	0.61s	0.68s	~same
GET_SPAN_COUNT	2.43s	2.50s	~same
GET_THREAD_COUNT	0.28s	0.28s	~same
GET_AVERAGE_DURATION	0.34s	0.33s	~same

EXPLAIN indexes confirmed 25,429 → 4,959 granules read on the spans table for the cost/token queries.
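As a quick sanity check, the speedup column and the granule reduction can be recomputed from the figures reported in the benchmark table. The snippet below only re-derives ratios from those numbers; the variable names are illustrative.

```python
# (before_seconds, after_seconds) as reported in the benchmark table.
benchmark = {
    "GET_COST": (3.28, 1.70),
    "GET_COST_WITH_BREAKDOWN": (3.14, 2.59),
    "GET_TOKEN_USAGE": (2.28, 1.15),
    "GET_SPAN_DURATION": (2.20, 1.21),
}

# Speedup = before / after, rounded to one decimal as in the table.
speedups = {
    query: round(before / after, 1)
    for query, (before, after) in benchmark.items()
}

# Granules read on the spans table, before vs after.
granule_ratio = 25_429 / 4_959

for query, factor in speedups.items():
    print(f"{query}: {factor}x")
print(f"granule reduction: {granule_ratio:.2f}x")
```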

Change checklist

User facing
Documentation update

Issues

NA

Testing

Verified query correctness on production: old span.id filter returned 1,629,299 rows; new trace_id filter returns 1,838,544 rows (209K previously missing spans recovered).
EXPLAIN indexes confirmed granule pruning improvement (~5×).
Full latency benchmark run on production ClickHouse (5 runs each, same workspace/project).
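The recovered-span figure follows directly from the two row counts above. A short check, using only those two numbers (variable names are illustrative):

```python
old_rows = 1_629_299  # rows returned with the old span.id filter
new_rows = 1_838_544  # rows returned with the new trace_id filter

recovered = new_rows - old_rows

# Share of the corrected total that the old filter missed.
share_of_corrected = recovered / new_rows
# Growth of the result relative to the old row count.
growth_over_old = recovered / old_rows

print(f"recovered spans: {recovered:,}")
print(f"missed share of corrected total: {share_of_corrected:.1%}")
print(f"growth over old result: {growth_over_old:.1%}")
```

The two percentages differ only in the denominator, which is worth keeping in mind when quoting the "missing spans" figure.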

Documentation

N/A

Span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, and GET_TOKEN_USAGE_WITH_BREAKDOWN were not scoping spans to the traces returned by traces_filtered. Adding AND trace_id IN (SELECT id FROM traces_filtered) ensures spans are only aggregated for traces that pass all applied filters (time range, name, metadata, feedback scores, etc.).

Benchmarked on production (1.9M spans, 7-day window):
- Granules read: 25,429 → 4,959 (5x reduction)
- GET_TOKEN_USAGE latency: ~2.0s → ~0.6s median (3.5x faster)
- GET_COST latency: ~1.6s → ~0.9s median (1.7x faster)

github-actions · 2026-03-30T14:28:30Z

Backend Tests - Integration Group 6

273 tests: 273 ✅ passed, 0 💤 skipped, 0 ❌ failed
25 suites, 25 files
Duration: 2m 32s ⏱️

Results for commit 0437a93.

♻️ This comment has been updated with latest results.

apps/opik-backend/src/main/java/com/comet/opik/domain/ProjectMetricsDAO.java

…ed_at index on authored_feedback_scores
- Add AND trace_id IN (SELECT id FROM traces_filtered) to span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, GET_TOKEN_USAGE_WITH_BREAKDOWN. Previously filtering by span.id (5th ORDER BY column) caused full-table scans; the fix reduces granules read from 25,429 to 4,959 (~5x) and query latency by ~2x.
- Replace inline dateDiff duration expressions with the MATERIALIZED duration column in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and GET_AVERAGE_DURATION.
- Remove FINAL from feedback_scores and authored_feedback_scores reads in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and THREAD_FILTERED_PREFIX, relying on the ROW_NUMBER() window function, which is already applied, for deduplication.
- Scope traces_final in THREAD_FILTERED_PREFIX to only traces whose thread_id is in the selected time window (was previously loading all threads in the project).
- Add minmax skip index on authored_feedback_scores.created_at (migration 000076).

.../db-app-analytics/migrations/000078_add_minmax_index_authored_feedback_scores_created_at.sql

…eated_at.sql to 000078_add_minmax_index_authored_feedback_scores_created_at.sql

ldaugusto

Two things to confirm:

apps/opik-backend/src/main/java/com/comet/opik/domain/ProjectMetricsDAO.java

.../db-app-analytics/migrations/000078_add_minmax_index_authored_feedback_scores_created_at.sql

andrescrz

LGTM.

ldaugusto

Since we are going with use_skip_indexes_if_final=1 by default, it's good to go.

thiagohora requested a review from a team as a code owner March 30, 2026 14:24

github-actions bot assigned thiagohora Mar 30, 2026

github-actions bot added the java (Pull requests that update Java code) and Backend labels Mar 30, 2026

baz-reviewer bot reviewed Mar 30, 2026

apps/opik-backend/src/main/java/com/comet/opik/domain/ProjectMetricsDAO.java (resolved)

thiagohora changed the title from "[NA] [BE] fix: filter project metrics span subqueries by trace_id" to "[NA] [BE] Fix high cost metrics calculation" Mar 30, 2026

baz-reviewer bot reviewed Mar 30, 2026

.../db-app-analytics/migrations/000078_add_minmax_index_authored_feedback_scores_created_at.sql (resolved)

baz-reviewer bot approved these changes Mar 30, 2026


thiagohora added 5 commits March 31, 2026 09:06

Merge branch 'main' into thiaghora/NA-fix-project-metrics-spans-filter

3009b35

Merge branch 'main' into thiaghora/NA-fix-project-metrics-spans-filter

829d7f1

Merge branch 'main' into thiaghora/NA-fix-project-metrics-spans-filter

0437a93

Merge branch 'main' into thiaghora/NA-fix-project-metrics-spans-filter

0711a37

Update and rename 000076_add_minmax_index_authored_feedback_scores_cr…

e2ee3f0

…eated_at.sql to 000078_add_minmax_index_authored_feedback_scores_created_at.sql

ldaugusto reviewed Apr 1, 2026

apps/opik-backend/src/main/java/com/comet/opik/domain/ProjectMetricsDAO.java (resolved)

.../db-app-analytics/migrations/000078_add_minmax_index_authored_feedback_scores_created_at.sql (resolved)

andrescrz approved these changes Apr 1, 2026


ldaugusto approved these changes Apr 1, 2026


thiagohora merged commit 574e867 into main Apr 1, 2026
78 checks passed

thiagohora deleted the thiaghora/NA-fix-project-metrics-spans-filter branch April 1, 2026 14:20

CometActions mentioned this pull request Apr 1, 2026

[NA] [SDK] [DOCS] Update automatically OpenAPI spec and Fern code #6036

Closed

2 tasks

thiagohora merged 7 commits into main from thiaghora/NA-fix-project-metrics-spans-filter

3 participants
