Merged
ldaugusto added a commit that referenced this pull request on Mar 26, 2026
- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded instead of collapsing to min(cursor) across all workspaces (#1)
- Add @nonnull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of a switchIfEmpty chain, to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of a UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of the per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO
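The Flux.concat fix above can be illustrated with a minimal, non-reactive analogy. All names here are hypothetical, not the actual Opik code; the point is only the scheduling difference: a switchIfEmpty-style chain runs later tiers only when earlier tiers are empty, while a concat-style pass gives every tier a slice of each cycle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Simplified sketch (hypothetical names). Each Supplier stands in for one
// catch-up tier producing the work items it wants to run this cycle.
public class TierScheduling {

    // switchIfEmpty-style: only the first tier with work runs; as long as
    // small rules keep appearing, medium/large tiers starve.
    static List<String> runFirstNonEmpty(List<Supplier<List<String>>> tiers) {
        for (Supplier<List<String>> tier : tiers) {
            List<String> work = tier.get();
            if (!work.isEmpty()) {
                return work; // later tiers never run this cycle
            }
        }
        return List.of();
    }

    // Flux.concat-style: every tier contributes to every cycle.
    static List<String> runAllTiers(List<Supplier<List<String>>> tiers) {
        List<String> all = new ArrayList<>();
        for (Supplier<List<String>> tier : tiers) {
            all.addAll(tier.get());
        }
        return all;
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> tiers = List.of(
                () -> List.of("small-1"),
                () -> List.of("medium-1"),
                () -> List.of("large-1"));
        System.out.println(runFirstNonEmpty(tiers)); // [small-1]
        System.out.println(runAllTiers(tiers));      // [small-1, medium-1, large-1]
    }
}
```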
ldaugusto added a commit that referenced this pull request on Mar 26, 2026
* [OPIK-4891] [BE] Catch-up job for apply-to-past retention rules

Progressive historical data deletion for rules with applyToPast=true. Estimates workspace span velocity at rule creation to triage workspaces into small/medium/large tiers with appropriate chunk sizes.

Schema:
- Add catch_up_velocity, catch_up_cursor, catch_up_done columns
- Add idx_catch_up_pending composite index for catch-up queries

Velocity estimation:
- ClickHouse query: uniq(id) / weeks_active for spans below the cutoff
- Handles TOO_MANY_ROWS (code 158) by defaulting to 1M/week
- Handles empty tables gracefully

Catch-up tiers (configurable thresholds):
- Small (<10K/week): batch up to 200, one-shot delete of the entire range
- Medium (10K-100K/week): 10 most outdated workspaces, 7-day chunks each
- Large (>100K/week): 1 most outdated workspace, 2-day chunks

Execution:
- Runs after the regular sliding-window pass in RetentionPolicyJob
- Priority: small first (quick wins), then medium, then large
- Cursor advances oldest→newest; marks done when reaching the sliding window

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix large workspace chunk size: 2 days to 1 day

Large workspaces (>100K spans/week) now process one day per catch-up cycle, so each execution handles a manageable amount of data.
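The triage described above can be sketched in plain Java. This is an illustrative reading of the commit message, not the actual Opik classes; the thresholds are the defaults quoted above (configurable in the real job), and the large-tier chunk uses the final 1-day value from the follow-up fix.

```java
import java.time.Duration;

// Hedged sketch of velocity triage (hypothetical names, thresholds from the
// commit message: small <10K/week, medium 10K-100K/week, large >100K/week).
public class CatchUpTriage {
    enum Tier { SMALL, MEDIUM, LARGE }

    static final long SMALL_MAX = 10_000;    // spans/week
    static final long MEDIUM_MAX = 100_000;  // spans/week

    static Tier triage(long spansPerWeek) {
        if (spansPerWeek < SMALL_MAX) return Tier.SMALL;
        if (spansPerWeek <= MEDIUM_MAX) return Tier.MEDIUM;
        return Tier.LARGE;
    }

    // Historical range deleted per catch-up cycle: small is one-shot (whole
    // range at once), medium 7-day chunks, large 1-day chunks (per the later
    // "2 days to 1 day" fix).
    static Duration chunkSize(Tier tier) {
        return switch (tier) {
            case SMALL -> Duration.ZERO; // sentinel: delete entire range
            case MEDIUM -> Duration.ofDays(7);
            case LARGE -> Duration.ofDays(1);
        };
    }
}
```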
* Address PR review: fix type cast, null safety, error handling

- Fix Float64→Long ClassCastException: wrap velocity query with toUInt64()
- Fix null cursor NPE in deleteSmallBatch: filter nulls before min()
- Fix catch-up marking done on delete failure: remove onErrorResume and propagate errors, so cursor/done only advance on success
- Make markDone/updateCursor non-blocking: wrap in Mono.fromRunnable on boundedElastic to avoid blocking Reactor threads

* Add config comments for catch-up settings

* Return oldest span time from velocity estimation, add scouting

- The velocity query now returns both spans_per_week and oldest_span_time
- The cursor starts at the actual oldest data, not the service start date
- For huge workspaces (TOO_MANY_ROWS), scout month by month on the traces table to find the first day with data, avoiding months of no-op deletes
- If a monthly scout also hits the row limit, use that month's start as the cursor

* Replace SQL string concatenation with @BindList in markCatchUpDoneBatch

Avoids a fragile raw SQL construction pattern. Uses JDBI's @BindList for a parameterized IN clause, consistent with other DAOs in the codebase.
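The velocity estimate and its empty-table handling amount to simple arithmetic: distinct spans below the cutoff divided by the number of active weeks, with zero spans mapping to velocity 0 and no cursor. A minimal sketch of that arithmetic (hypothetical names; the real job computes this inside a ClickHouse query, not in Java):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Sketch only: mirrors "uniq(id) / weeks_active" and the oldest-span-time
// return value described in the commits; not the actual Opik service code.
public class VelocityEstimate {
    record Estimate(long spansPerWeek, Optional<Instant> oldestSpanTime) {}

    static Estimate estimate(long distinctSpans, Instant oldestSpan, Instant cutoff) {
        if (distinctSpans == 0) {
            // Empty table: velocity 0 and a null cursor mark catch-up done
            // immediately, so no empty delete cycles are scheduled.
            return new Estimate(0, Optional.empty());
        }
        // At least one active week, so very young workspaces don't divide by 0.
        long weeks = Math.max(1, Duration.between(oldestSpan, cutoff).toDays() / 7);
        return new Estimate(distinctSpans / weeks, Optional.of(oldestSpan));
    }
}
```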
* Address review: rename vars for clarity, hide internal fields, add safety comment

- Rename upperBound/lowerBound to cutoffId/fromId in deleteSmallBatch for consistency with deleteOneChunk and the DAO signatures
- Hide catchUpVelocity and catchUpCursor from the API response (internal); only catchUpDone remains public as a user-facing progress indicator
- Add a comment explaining NULL cursor safety in the catch-up DAO queries

* Guard cursor >= upperBound, isolate catch-up errors, expose cursor in API

- Skip the delete and mark done if the cursor is already past the sliding-window boundary
- Wrap the catch-up cycle in onErrorResume so failures don't kill regular retention
- Re-expose catchUpCursor in the API (useful for users to see cleanup progress); catchUpVelocity remains hidden (internal implementation detail)

* Revert scouting to a simple blocking loop, improve schema comments

- Revert scoutFirstDataCursor from Flux back to a blocking while-loop. Rule creation is a rare admin operation; the reactive complexity was not justified.
- Improve the catch_up_cursor and catch_up_done column comments to document cursor semantics (data before the cursor has been deleted).

* Add unit tests for the TOO_MANY_ROWS velocity estimation fallback

- RetentionRuleServiceVelocityTest: 6 tests covering the code 158 exception path with mocked SpanDAO/TraceDAO. Tests month-by-month scouting, the dense-month fallback, the service-start-date fallback, and the non-158 exception rethrow.
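The blocking scouting loop mentioned above can be sketched as follows. This is a simplified stand-in (hypothetical names; the real probe is a ClickHouse query against the traces table): when the full-range estimate hits TOO_MANY_ROWS, the job walks month by month from a start date to the retention cutoff and stops at the first month that contains data, so the cursor starts where data actually begins.

```java
import java.time.LocalDate;
import java.util.Optional;
import java.util.function.Function;

// Sketch of a month-by-month scout as a plain blocking loop; rule creation
// is a rare admin operation, so no reactive machinery is needed here.
public class CursorScout {
    static Optional<LocalDate> scoutFirstData(
            LocalDate start, LocalDate cutoff,
            Function<LocalDate, Optional<LocalDate>> probeMonth) {
        LocalDate month = start.withDayOfMonth(1);
        while (month.isBefore(cutoff)) {
            Optional<LocalDate> hit = probeMonth.apply(month);
            if (hit.isPresent()) {
                return hit; // first day with data -> initial cursor
            }
            month = month.plusMonths(1);
        }
        // No data found in any month: caller marks catch-up done immediately,
        // avoiding hundreds of empty chunk DELETE cycles.
        return Optional.empty();
    }
}
```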
- Remove the large workspace integration test (the max_rows_to_read profile setting also blocks normal inserts/deletes, making it impossible to trigger TOO_MANY_ROWS only on the estimation query)
- Keep the small workspace catch-up integration test and the applyToPast=false test in RetentionPolicyServiceTest
- Make estimateVelocity/scoutFirstDataCursor package-visible for testing

* Mark catch-up done when scouting finds no historical data

When the velocity estimation hits TOO_MANY_ROWS and scouting scans every month without finding data, return velocity=0 with a null cursor so the rule is created with catchUpDone=true. This prevents hundreds of empty 1-day chunk DELETE cycles.

* Bump migration to 000061, simplify index, split rollback

- Rename the migration from 000060 to 000061 (main advanced past 000060)
- Simplify the index to (catch_up_done, catch_up_velocity), since catch_up_done=false already implies enabled=true and apply_to_past=true
- Split the rollback into individual DROP COLUMN statements

* Address review comments from thiagohora and baz

- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded instead of collapsing to min(cursor) across all workspaces (#1)
- Add @nonnull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of a switchIfEmpty chain, to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of a UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of the per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO

* Remove scripts/.gitignore, lower disabled log to DEBUG

- Remove the unnecessary .gitignore in scripts/ (test CSVs are local only)
- Lower the catch-up disabled log from INFO to DEBUG to avoid 48 noisy log lines per day when catch-up is off

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
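The "cursor >= upperBound" guard from the review commits reduces to one comparison per cycle. A minimal sketch, assuming a per-workspace cursor Instant and a precomputed sliding-window start (hypothetical names, not the actual Opik code):

```java
import java.time.Instant;

// Sketch of the per-cycle guard: once the catch-up cursor reaches the
// sliding-window boundary, there is no historical range left to delete,
// so the cycle skips the delete and marks the workspace done.
public class CatchUpGuard {
    record Step(boolean shouldDelete, boolean done) {}

    static Step step(Instant cursor, Instant slidingWindowStart) {
        if (!cursor.isBefore(slidingWindowStart)) {
            return new Step(false, true);  // caught up: mark done, no delete
        }
        return new Step(true, false);      // delete next chunk from the cursor
    }
}
```

Note the guard fires on equality too (cursor >= boundary), so a cursor that lands exactly on the window edge is marked done rather than issuing an empty delete.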