
duration -> chain_duration #11

Merged
alexkuzmik merged 1 commit into main from update-asset-format
May 30, 2023

Conversation

@alexkuzmik
Collaborator

No description provided.

@alexkuzmik alexkuzmik self-assigned this May 30, 2023
@alexkuzmik alexkuzmik merged commit d3a6cea into main May 30, 2023
@andrescrz andrescrz deleted the update-asset-format branch September 2, 2024 11:35
ldaugusto added a commit that referenced this pull request Mar 26, 2026
ldaugusto added a commit that referenced this pull request Mar 26, 2026
* [OPIK-4891] [BE] Catch-up job for apply-to-past retention rules

Progressive historical data deletion for rules with applyToPast=true.
Estimates workspace span velocity at rule creation to triage into
small/medium/large tiers with appropriate chunk sizes.

Schema:
- Add catch_up_velocity, catch_up_cursor, catch_up_done columns
- Add idx_catch_up_pending composite index for catch-up queries

Velocity estimation:
- ClickHouse query: uniq(id) / weeks_active for spans below cutoff
- Handles TOO_MANY_ROWS (code 158) by defaulting to 1M/week
- Handles empty tables gracefully
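
The TOO_MANY_ROWS fallback above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the exception type is a stand-in for the ClickHouse client's error, while error code 158 and the 1M/week default come from the commit message.

```java
// Hypothetical sketch of the velocity-estimation fallback described above.
public class VelocityEstimateSketch {

    static final int TOO_MANY_ROWS = 158;          // ClickHouse error code
    static final long DEFAULT_VELOCITY = 1_000_000L; // fallback: assume 1M spans/week

    /** Stand-in for the ClickHouse velocity query. */
    interface VelocityQuery {
        long spansPerWeek();
    }

    /** Stand-in for the driver exception carrying a ClickHouse error code. */
    static class ClickHouseLikeException extends RuntimeException {
        final int code;
        ClickHouseLikeException(int code) { this.code = code; }
    }

    /** Runs the query, falling back to 1M/week only on error code 158. */
    public static long estimate(VelocityQuery query) {
        try {
            return query.spansPerWeek();
        } catch (ClickHouseLikeException e) {
            if (e.code == TOO_MANY_ROWS) {
                return DEFAULT_VELOCITY; // workspace too big to scan: treat as huge
            }
            throw e; // any other error propagates unchanged
        }
    }
}
```

Rethrowing non-158 errors matters: a transient connection failure should fail rule creation loudly rather than silently classify the workspace as huge.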

Catch-up tiers (configurable thresholds):
- Small (<10K/week): batch up to 200, one-shot delete entire range
- Medium (10K-100K/week): 10 most outdated, 7-day chunks each
- Large (>100K/week): 1 most outdated, 2-day chunks
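
The tier triage above reduces to a pure threshold function. A minimal sketch, with hypothetical class and method names; the 10K/100K thresholds are from the commit message (and are configurable in the real job):

```java
// Sketch of the catch-up tier triage described above; names are hypothetical,
// thresholds (10K and 100K spans/week) are from the commit message.
public class CatchUpTierSketch {

    /** Classifies a workspace by its estimated weekly span velocity. */
    public static String classify(long spansPerWeek) {
        if (spansPerWeek < 10_000) {
            return "SMALL";   // batched, one-shot delete of the entire range
        } else if (spansPerWeek <= 100_000) {
            return "MEDIUM";  // 10 most outdated, 7-day chunks each
        } else {
            return "LARGE";   // 1 most outdated, small chunks (2 days here,
                              // later reduced to 1 day in this PR)
        }
    }
}
```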

Execution:
- Runs after regular sliding-window pass in RetentionPolicyJob
- Priority: small first (quick wins), then medium, then large
- Cursor advances oldest→newest, marks done when reaching sliding window
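
The cursor semantics above (advance oldest to newest, done on reaching the sliding window) can be sketched with plain `java.time` arithmetic. Names are hypothetical; only the advance-and-terminate logic is taken from the description:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the catch-up cursor advancement described above.
public class CursorAdvanceSketch {

    /**
     * Advances the cursor by one chunk. Returns the next cursor, or null
     * once the cursor reaches the sliding-window start (catch-up done:
     * everything newer is handled by the regular retention pass).
     */
    public static Instant advance(Instant cursor, Duration chunk, Instant slidingWindowStart) {
        Instant next = cursor.plus(chunk);
        if (!next.isBefore(slidingWindowStart)) {
            return null; // reached live retention range: mark catch-up done
        }
        return next;
    }
}
```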

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix large workspace chunk size: 2 days to 1 day

Large workspaces (>100K spans/week) process one day per catch-up
cycle, so each execution handles a manageable amount of data.

* Address PR review: fix type cast, null safety, error handling

- Fix Float64→Long ClassCastException: wrap velocity query with toUInt64()
- Fix null cursor NPE in deleteSmallBatch: filter nulls before min()
- Fix catch-up marking done on delete failure: remove onErrorResume,
  propagate errors so cursor/done only advances on success
- Make markDone/updateCursor non-blocking: wrap in Mono.fromRunnable
  on boundedElastic to avoid blocking Reactor threads

* Add config comments for catch-up settings

* Return oldest span time from velocity estimation, add scouting

- Velocity query now returns both spans_per_week and oldest_span_time
- Cursor starts at the actual oldest data, not service start date
- For huge workspaces (TOO_MANY_ROWS), scout month by month on traces
  table to find first day with data, avoiding months of no-op deletes
- If a monthly scout also hits row limit, use that month start as cursor
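
The month-by-month scouting loop can be sketched as below. This is an illustration with hypothetical names; the data probe is abstracted to a predicate standing in for the per-month traces-table query:

```java
import java.time.LocalDate;
import java.util.function.Predicate;

// Sketch of the monthly scouting described above: walk forward from the
// service start month and stop at the first month that contains data.
public class ScoutSketch {

    /** Returns the first month with data, or null if no month has any. */
    public static LocalDate scoutFirstMonth(LocalDate serviceStart, LocalDate now,
                                            Predicate<LocalDate> monthHasData) {
        LocalDate month = serviceStart.withDayOfMonth(1);
        while (!month.isAfter(now)) {
            if (monthHasData.test(month)) {
                return month; // cursor starts here, skipping empty months
            }
            month = month.plusMonths(1);
        }
        return null; // no historical data found: catch-up can be marked done
    }
}
```

Starting the cursor at the first month with data is what avoids the "months of no-op deletes" the commit message mentions.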

* Replace SQL string concatenation with @BindList in markCatchUpDoneBatch

Avoids fragile raw SQL construction pattern. Uses JDBI's @BindList
for parameterized IN clause, consistent with other DAOs in the codebase.

* Address review: rename vars for clarity, hide internal fields, add safety comment

- Rename upperBound/lowerBound to cutoffId/fromId in deleteSmallBatch
  for consistency with deleteOneChunk and DAO signatures
- Hide catchUpVelocity and catchUpCursor from API response (internal);
  only catchUpDone remains public as user-facing progress indicator
- Add comment explaining NULL cursor safety in catch-up DAO queries

* Guard cursor >= upperBound, isolate catch-up errors, expose cursor in API

- Skip delete and mark done if cursor already past sliding window boundary
- Wrap catch-up cycle in onErrorResume so failures don't kill regular retention
- Re-expose catchUpCursor in API (useful for users to see cleanup progress);
  catchUpVelocity remains hidden (internal implementation detail)
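
The boundary guard above is a one-line predicate; a minimal sketch with hypothetical names:

```java
import java.time.Instant;

// Sketch of the guard described above: if the cursor is already at or past
// the sliding-window boundary there is nothing left to catch up on, so the
// cycle skips the delete and simply marks the rule done.
public class CatchUpGuardSketch {

    /** True if a delete chunk should run; false means mark done and skip. */
    public static boolean shouldDelete(Instant cursor, Instant upperBound) {
        return cursor.isBefore(upperBound);
    }
}
```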

* Revert scouting to simple blocking loop, improve schema comments

- Revert scoutFirstDataCursor from Flux back to blocking while-loop.
  Rule creation is a rare admin op; reactive complexity not justified.
- Improve catch_up_cursor and catch_up_done column comments to
  document cursor semantics (data before cursor has been deleted).

* Add unit tests for TOO_MANY_ROWS velocity estimation fallback

- RetentionRuleServiceVelocityTest: 6 tests covering the code 158
  exception path with mocked SpanDAO/TraceDAO. Tests scouting
  month-by-month, dense month fallback, service start date fallback,
  and non-158 exception rethrow.
- Remove large workspace integration test (max_rows_to_read profile
  setting also blocks normal inserts/deletes, making it impossible
  to trigger TOO_MANY_ROWS only on the estimation query)
- Keep small workspace catch-up integration test and applyToPast=false
  test in RetentionPolicyServiceTest
- Make estimateVelocity/scoutFirstDataCursor package-visible for testing

* Mark catch-up done when scouting finds no historical data

When the velocity estimation hits TOO_MANY_ROWS and scouting scans
every month without finding data, return velocity=0 with null cursor
so the rule is created with catchUpDone=true. Prevents hundreds of
empty 1-day chunk DELETE cycles.

* Bump migration to 000061, simplify index, split rollback

- Rename migration from 000060 to 000061 (main advanced past 000060)
- Simplify index to (catch_up_done, catch_up_velocity) since
  catch_up_done=false already implies enabled=true and apply_to_past=true
- Split rollback into individual DROP COLUMN statements

* Address review comments from thiagohora and baz

- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded
  instead of collapsing to min(cursor) across all workspaces (#1)
- Add @NonNull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of
  switchIfEmpty chain to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO

* Remove scripts/.gitignore, lower disabled log to DEBUG

- Remove unnecessary .gitignore in scripts/ (test CSVs are local only)
- Lower catch-up disabled log from INFO to DEBUG to avoid 48 noisy
  log lines per day when catch-up is off

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
