Merged
ldaugusto added a commit that referenced this pull request on Mar 26, 2026
- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded instead of collapsing to min(cursor) across all workspaces (#1)
- Add @nonnull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of a switchIfEmpty chain, to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of a UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of the per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO
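The Flux.concat fix above can be illustrated with a minimal, non-reactive analogy. All names here are hypothetical, not the actual Opik code; the point is only the scheduling difference: a switchIfEmpty-style chain runs later tiers only when earlier tiers are empty, while a concat-style pass gives every tier a slice of each cycle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Simplified sketch (hypothetical names). Each Supplier stands in for one
// catch-up tier producing the work items it wants to run this cycle.
public class TierScheduling {

    // switchIfEmpty-style: only the first tier with work runs; as long as
    // small rules keep appearing, medium/large tiers starve.
    static List<String> runFirstNonEmpty(List<Supplier<List<String>>> tiers) {
        for (Supplier<List<String>> tier : tiers) {
            List<String> work = tier.get();
            if (!work.isEmpty()) {
                return work; // later tiers never run this cycle
            }
        }
        return List.of();
    }

    // Flux.concat-style: every tier contributes to every cycle.
    static List<String> runAllTiers(List<Supplier<List<String>>> tiers) {
        List<String> all = new ArrayList<>();
        for (Supplier<List<String>> tier : tiers) {
            all.addAll(tier.get());
        }
        return all;
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> tiers = List.of(
                () -> List.of("small-1"),
                () -> List.of("medium-1"),
                () -> List.of("large-1"));
        System.out.println(runFirstNonEmpty(tiers)); // [small-1]
        System.out.println(runAllTiers(tiers));      // [small-1, medium-1, large-1]
    }
}
```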
ldaugusto added a commit that referenced this pull request on Mar 26, 2026
* [OPIK-4891] [BE] Catch-up job for apply-to-past retention rules

Progressive historical data deletion for rules with applyToPast=true. Estimates workspace span velocity at rule creation to triage workspaces into small/medium/large tiers with appropriate chunk sizes.

Schema:
- Add catch_up_velocity, catch_up_cursor, catch_up_done columns
- Add idx_catch_up_pending composite index for catch-up queries

Velocity estimation:
- ClickHouse query: uniq(id) / weeks_active for spans below the cutoff
- Handles TOO_MANY_ROWS (code 158) by defaulting to 1M/week
- Handles empty tables gracefully

Catch-up tiers (configurable thresholds):
- Small (<10K/week): batch up to 200, one-shot delete of the entire range
- Medium (10K-100K/week): 10 most outdated workspaces, 7-day chunks each
- Large (>100K/week): 1 most outdated workspace, 2-day chunks

Execution:
- Runs after the regular sliding-window pass in RetentionPolicyJob
- Priority: small first (quick wins), then medium, then large
- Cursor advances oldest→newest; marks done when reaching the sliding window

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix large workspace chunk size: 2 days to 1 day

Large workspaces (>100K spans/week) now process one day per catch-up cycle, so each execution handles a manageable amount of data.
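The triage described above can be sketched in plain Java. This is an illustrative reading of the commit message, not the actual Opik classes; the thresholds are the defaults quoted above (configurable in the real job), and the large-tier chunk uses the final 1-day value from the follow-up fix.

```java
import java.time.Duration;

// Hedged sketch of velocity triage (hypothetical names, thresholds from the
// commit message: small <10K/week, medium 10K-100K/week, large >100K/week).
public class CatchUpTriage {
    enum Tier { SMALL, MEDIUM, LARGE }

    static final long SMALL_MAX = 10_000;    // spans/week
    static final long MEDIUM_MAX = 100_000;  // spans/week

    static Tier triage(long spansPerWeek) {
        if (spansPerWeek < SMALL_MAX) return Tier.SMALL;
        if (spansPerWeek <= MEDIUM_MAX) return Tier.MEDIUM;
        return Tier.LARGE;
    }

    // Historical range deleted per catch-up cycle: small is one-shot (whole
    // range at once), medium 7-day chunks, large 1-day chunks (per the later
    // "2 days to 1 day" fix).
    static Duration chunkSize(Tier tier) {
        return switch (tier) {
            case SMALL -> Duration.ZERO; // sentinel: delete entire range
            case MEDIUM -> Duration.ofDays(7);
            case LARGE -> Duration.ofDays(1);
        };
    }
}
```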
* Address PR review: fix type cast, null safety, error handling

- Fix Float64→Long ClassCastException: wrap velocity query with toUInt64()
- Fix null cursor NPE in deleteSmallBatch: filter nulls before min()
- Fix catch-up marking done on delete failure: remove onErrorResume and propagate errors, so cursor/done only advance on success
- Make markDone/updateCursor non-blocking: wrap in Mono.fromRunnable on boundedElastic to avoid blocking Reactor threads

* Add config comments for catch-up settings

* Return oldest span time from velocity estimation, add scouting

- The velocity query now returns both spans_per_week and oldest_span_time
- The cursor starts at the actual oldest data, not the service start date
- For huge workspaces (TOO_MANY_ROWS), scout month by month on the traces table to find the first day with data, avoiding months of no-op deletes
- If a monthly scout also hits the row limit, use that month's start as the cursor

* Replace SQL string concatenation with @BindList in markCatchUpDoneBatch

Avoids a fragile raw SQL construction pattern. Uses JDBI's @BindList for a parameterized IN clause, consistent with other DAOs in the codebase.
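The velocity estimate and its empty-table handling amount to simple arithmetic: distinct spans below the cutoff divided by the number of active weeks, with zero spans mapping to velocity 0 and no cursor. A minimal sketch of that arithmetic (hypothetical names; the real job computes this inside a ClickHouse query, not in Java):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Sketch only: mirrors "uniq(id) / weeks_active" and the oldest-span-time
// return value described in the commits; not the actual Opik service code.
public class VelocityEstimate {
    record Estimate(long spansPerWeek, Optional<Instant> oldestSpanTime) {}

    static Estimate estimate(long distinctSpans, Instant oldestSpan, Instant cutoff) {
        if (distinctSpans == 0) {
            // Empty table: velocity 0 and a null cursor mark catch-up done
            // immediately, so no empty delete cycles are scheduled.
            return new Estimate(0, Optional.empty());
        }
        // At least one active week, so very young workspaces don't divide by 0.
        long weeks = Math.max(1, Duration.between(oldestSpan, cutoff).toDays() / 7);
        return new Estimate(distinctSpans / weeks, Optional.of(oldestSpan));
    }
}
```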
* Address review: rename vars for clarity, hide internal fields, add safety comment

- Rename upperBound/lowerBound to cutoffId/fromId in deleteSmallBatch for consistency with deleteOneChunk and the DAO signatures
- Hide catchUpVelocity and catchUpCursor from the API response (internal); only catchUpDone remains public as a user-facing progress indicator
- Add a comment explaining NULL cursor safety in the catch-up DAO queries

* Guard cursor >= upperBound, isolate catch-up errors, expose cursor in API

- Skip the delete and mark done if the cursor is already past the sliding-window boundary
- Wrap the catch-up cycle in onErrorResume so failures don't kill regular retention
- Re-expose catchUpCursor in the API (useful for users to see cleanup progress); catchUpVelocity remains hidden (internal implementation detail)

* Revert scouting to a simple blocking loop, improve schema comments

- Revert scoutFirstDataCursor from Flux back to a blocking while-loop. Rule creation is a rare admin operation; the reactive complexity was not justified.
- Improve the catch_up_cursor and catch_up_done column comments to document cursor semantics (data before the cursor has been deleted).

* Add unit tests for the TOO_MANY_ROWS velocity estimation fallback

- RetentionRuleServiceVelocityTest: 6 tests covering the code 158 exception path with mocked SpanDAO/TraceDAO. Tests month-by-month scouting, the dense-month fallback, the service-start-date fallback, and the non-158 exception rethrow.
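The blocking scouting loop mentioned above can be sketched as follows. This is a simplified stand-in (hypothetical names; the real probe is a ClickHouse query against the traces table): when the full-range estimate hits TOO_MANY_ROWS, the job walks month by month from a start date to the retention cutoff and stops at the first month that contains data, so the cursor starts where data actually begins.

```java
import java.time.LocalDate;
import java.util.Optional;
import java.util.function.Function;

// Sketch of a month-by-month scout as a plain blocking loop; rule creation
// is a rare admin operation, so no reactive machinery is needed here.
public class CursorScout {
    static Optional<LocalDate> scoutFirstData(
            LocalDate start, LocalDate cutoff,
            Function<LocalDate, Optional<LocalDate>> probeMonth) {
        LocalDate month = start.withDayOfMonth(1);
        while (month.isBefore(cutoff)) {
            Optional<LocalDate> hit = probeMonth.apply(month);
            if (hit.isPresent()) {
                return hit; // first day with data -> initial cursor
            }
            month = month.plusMonths(1);
        }
        // No data found in any month: caller marks catch-up done immediately,
        // avoiding hundreds of empty chunk DELETE cycles.
        return Optional.empty();
    }
}
```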
- Remove the large workspace integration test (the max_rows_to_read profile setting also blocks normal inserts/deletes, making it impossible to trigger TOO_MANY_ROWS only on the estimation query)
- Keep the small workspace catch-up integration test and the applyToPast=false test in RetentionPolicyServiceTest
- Make estimateVelocity/scoutFirstDataCursor package-visible for testing

* Mark catch-up done when scouting finds no historical data

When the velocity estimation hits TOO_MANY_ROWS and scouting scans every month without finding data, return velocity=0 with a null cursor so the rule is created with catchUpDone=true. This prevents hundreds of empty 1-day chunk DELETE cycles.

* Bump migration to 000061, simplify index, split rollback

- Rename the migration from 000060 to 000061 (main advanced past 000060)
- Simplify the index to (catch_up_done, catch_up_velocity), since catch_up_done=false already implies enabled=true and apply_to_past=true
- Split the rollback into individual DROP COLUMN statements

* Address review comments from thiagohora and baz

- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded instead of collapsing to min(cursor) across all workspaces (#1)
- Add @nonnull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of a switchIfEmpty chain, to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of a UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of the per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO

* Remove scripts/.gitignore, lower disabled log to DEBUG

- Remove the unnecessary .gitignore in scripts/ (test CSVs are local only)
- Lower the catch-up disabled log from INFO to DEBUG to avoid 48 noisy log lines per day when catch-up is off

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
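The "cursor >= upperBound" guard from the review commits reduces to one comparison per cycle. A minimal sketch, assuming a per-workspace cursor Instant and a precomputed sliding-window start (hypothetical names, not the actual Opik code):

```java
import java.time.Instant;

// Sketch of the per-cycle guard: once the catch-up cursor reaches the
// sliding-window boundary, there is no historical range left to delete,
// so the cycle skips the delete and marks the workspace done.
public class CatchUpGuard {
    record Step(boolean shouldDelete, boolean done) {}

    static Step step(Instant cursor, Instant slidingWindowStart) {
        if (!cursor.isBefore(slidingWindowStart)) {
            return new Step(false, true);  // caught up: mark done, no delete
        }
        return new Step(true, false);      // delete next chunk from the cursor
    }
}
```

Note the guard fires on equality too (cursor >= boundary), so a cursor that lands exactly on the window edge is marked done rather than issuing an empty delete.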