
[CM-7071] llm sdk allow users to logs their chain execution #7

Merged
alexkuzmik merged 54 commits into feature/chains from
CM-7071-llm-sdk-allow-users-to-logs-their-chain-execution
Jun 1, 2023

Conversation

@alexkuzmik
Collaborator

No description provided.

alexkuzmik self-assigned this May 24, 2023
thiagohora added a commit that referenced this pull request Nov 12, 2025
…trics (#3969)

* [NA] [BE] Upgrade MySQL container from Testcontainers

* Fix imports order

* [OPIK-2856] [BE] Implement UUIDv7 time-based filtering for traces

- Add InstantToUUIDMapper to convert Instant timestamps to UUIDv7 bounds
- Add InstantParamConverter to parse ISO-8601 and epoch millisecond time parameters
- Update TracesResource to accept from_time and to_time parameters on /traces and /traces/stats endpoints
- Update TraceDAO to apply UUID-based time filtering using BETWEEN clause on id column
- Update TraceSearchCriteria to include uuidFromTime and uuidToTime fields
- Add comprehensive integration tests for time filtering with boundary conditions
- All tests passing: 10/10 time filtering tests + validation tests
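The BETWEEN-on-id filtering described above can be sketched as follows. This is a hypothetical illustration of the mapper's idea (the class and method names here are invented, not the repository's code), based on the standard UUIDv7 bit layout: the unix epoch-millis timestamp lives in the top 48 bits, so a time window maps to an id range.

```java
import java.time.Instant;
import java.util.UUID;

// Sketch: turn an Instant window into UUIDv7 bounds usable as
// `id BETWEEN lower AND upper`. Lower bound zeroes all random bits,
// upper bound fills them, so every UUIDv7 generated inside the window
// sorts between the two.
final class UuidTimeBounds {

    static UUID lowerBound(Instant from) {
        long msb = (from.toEpochMilli() << 16) | 0x7000L; // 48-bit ts | version 7
        return new UUID(msb, 0x8000000000000000L);        // variant "10", rand_b = 0
    }

    static UUID upperBound(Instant to) {
        long msb = (to.toEpochMilli() << 16) | 0x7000L | 0x0FFFL;         // rand_a = max
        return new UUID(msb, 0x8000000000000000L | 0x3FFFFFFFFFFFFFFFL); // rand_b = max
    }
}
```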

* [OPIK-2856] Address PR review comments: improve exception handling and type safety

- Fix InstantParamConverter to catch specific DateTimeParseException instead of generic Exception
- Add debug logging when falling back to epoch milliseconds parsing
- Refactor anonymous ParamConverter class to named InstantConverter inner class for clarity
- Suppress unchecked cast with @SuppressWarnings annotation
- Fix MySQLContainerUtils return type to use MySQLContainer<?> for type safety
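The exception-handling fix above amounts to catching only the specific parse failure before falling back. A minimal sketch of that pattern (not the repository's actual converter class):

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

// Sketch: parse ISO-8601 first; only on DateTimeParseException (not a
// blanket Exception) fall back to interpreting the value as epoch millis.
final class InstantParsing {
    static Instant parse(String value) {
        try {
            return Instant.parse(value); // ISO-8601, e.g. "2023-06-01T00:00:00Z"
        } catch (DateTimeParseException e) {
            // The real converter logs this fallback at debug level.
            return Instant.ofEpochMilli(Long.parseLong(value));
        }
    }
}
```

Catching the narrow exception type means genuinely unexpected failures (a null value, say) still surface instead of being silently routed through the epoch-millis path.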

* [OPIK-2856] Fix InstantToUUIDMapperTest to match implementation

- Update tests to reflect that toUpperBound uses next millisecond (+1ms) for inclusive BETWEEN queries
- Remove outdated assertions expecting same timestamp in both bounds
- Verify upper bound is lexicographically greater than lower bound
- All 13 tests now passing

* Remove setup duplicated code

* Revision 2: Address PR review comments - LOW priority fixes

- #7: Add INSTANCE singleton pattern for InstantConverter
- #8: Use StringUtils.isEmpty() for null-safe empty check
- Note: #2 and #3 already addressed in previous commit

* Revision 3: Use IdGenerator.getTimeOrderedEpoch() for UUID bounds

- Simplified InstantToUUIDMapper to use IdGenerator.getTimeOrderedEpoch() instead of convertOtelIdToUUIDv7
- Per UUIDv7 RFC, sub-millisecond 12 bits are optional with millisecond granularity
- Start/end interval semantics with ±1ms ensures correct BETWEEN query results
- This approach has been battle-tested for months without issues per reviewer recommendation
- Converted InstantToUUIDMapper to @Singleton service for proper DI integration
- Updated TracesResource to inject InstantToUUIDMapper dependency
- Updated tests to properly mock IdGenerator dependency

* Fix tests

* [OPIK-2856] [BE] Split Get spans Tests

* Fix format

* Revision 2: Extract duplicated span creation logic into createSpanWithTimestamp helper method

* [OPIK-2856] [BE] Implement UUIDv7 time-based filtering for spans

* Revision 3: Extract workspace setup duplication into setupTestWorkspace helper method and fix transformTestParams call

* [OPIK-2856] Refactor TracesResourceTest to use TraceResourceClient instead of direct URL_TEMPLATE calls

- Replace all direct client.target(URL_TEMPLATE) calls with TraceResourceClient methods
- Add callFeedbackScoresWithCookie method to TraceResourceClient for session token authentication
- Add callRetrieveThreadResponseWithCookie method to TraceResourceClient for session token authentication
- Fix feedback batch endpoint by using callFeedbackScores and callFeedbackScoresWithCookie
- Add null checks for query parameters to prevent NPE errors
- Fix API key vs session token usage in authentication tests
- Rename get__whenApiKeyIsPresent__thenReturnTraceThread to get__whenSessionTokenIsPresent__thenReturnTraceThread in SessionTokenCookie class
- Add mockGetWorkspaceIdByName() calls for proper workspace mocking
- Preserve original test assertions and behavior
- All tests properly refactored to use resource client methods instead of direct HTTP calls

* [OPIK-2856] Remove duplicate methods from TraceResourceClient

- Remove callGetTraces() - duplicate of callGetTracesWithQueryParams()
- Remove callSearchTraces() - duplicate of callSearchTracesStream()
- Reduced code duplication and maintenance burden

* Fix tests

* Revision 2: Address Copilot review comments - remove redundant wrapper method and add clarifying comment

* Revision 3: Extract duplicated path splitting logic into helper method addPathSegments()

* Revision 4: Make getWebTarget() private and add callGetTraceThreadsWithSorting() public method

* Revision 7: Move addPathSegments() and addQueryParameters() helper methods to BaseCommentResourceClient

* [OPIK-2856] [BE] Add UUIDv7 time-based filtering for trace threads

* Revision 2: Address GitHub Copilot PR review comments

- Extract conditional UUID generation into generateThreadModelId() method for better readability
- Rename minTraceTimestamp to earliestTraceTimestamp for clarity
- Add explanatory comment about UUIDv7 lexicographic ordering in compareTo()

* Fix

* Revision 3: Add UUID time filter to SELECT_TRACES_STATS query

* Revision 4: Fix generateUUIDForTimestamp to manually construct UUIDv7
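"Manually construct UUIDv7" can look like the sketch below; the role matches what generateUUIDForTimestamp plays in the tests, but this code is an assumption, not the repository's implementation: epoch-millis in the top 48 bits, version and variant bits set, everything else random.

```java
import java.time.Instant;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: build a UUIDv7 whose embedded timestamp is a chosen Instant,
// so test fixtures sort into the expected time buckets.
final class UuidV7 {
    static UUID forTimestamp(Instant ts) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long msb = (ts.toEpochMilli() << 16)           // unix_ts_ms, 48 bits
                | 0x7000L                              // version 7 nibble
                | (rnd.nextLong() & 0x0FFFL);          // rand_a, 12 bits
        long lsb = 0x8000000000000000L                 // RFC 4122 variant "10"
                | (rnd.nextLong() & 0x3FFFFFFFFFFFFFFFL); // rand_b, 62 bits
        return new UUID(msb, lsb);
    }
}
```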

* [OPIK-2856] [BE] Implement UUIDv7-based time filtering for project metrics

- Add uuidFromTime and uuidToTime fields to ProjectMetricRequest
- Update ProjectMetricsService to enrich requests with UUID bounds using InstantToUUIDMapper
- Refactor ProjectMetricsDAO SQL queries to use UUID-based filtering (id BETWEEN uuid_from_time AND uuid_to_time)
- Extract timestamps from UUIDs using UUIDv7ToDateTime for bucketing and WITH FILL clauses
- Update TraceService to generate UUIDs based on trace startTime when ID is not provided
- Fix ProjectMetricsResourceTest to generate UUIDs with correct timestamps using TimeBasedEpochGenerator
- Remove explicit openTraceThread calls in tests to allow traces to create thread metadata with correct timestamps

All 206 ProjectMetricsResourceTest tests now passing (1 skipped).
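The bucketing step above needs the reverse mapping: recovering the embedded epoch-millis from a UUIDv7's top 48 bits, which is what a helper like UUIDv7ToDateTime would do (the exact API here is assumed for illustration):

```java
import java.time.Instant;
import java.util.UUID;

// Sketch: extract the 48-bit unix_ts_ms field from a UUIDv7 id,
// discarding the version nibble and rand_a bits below it.
final class UuidV7Timestamps {
    static Instant timestampOf(UUID id) {
        return Instant.ofEpochMilli(id.getMostSignificantBits() >>> 16);
    }
}
```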

* [OPIK-2856] Fix flaky MultiValueFeedbackScoresE2ETest by ensuring UUID bounds are min/max for timestamp

* [OPIK-2856] Update InstantToUUIDMapper tests to match new min/max UUID implementation

* [OPIK-2856] Address Copilot PR review comments: clarify 62-bit constant and update validateProject comment

* [OPIK-2856] [BE] Extract UUID utility for test reuse

- Create UUIDTestUtils with generateUUIDForTimestamp method
- Replace local implementations in ProjectMetricsResourceTest
- Replace local implementations in FindSpansResourceTest
- Remove duplicate method definitions and unused imports
- Centralize UUID generation logic for time-based testing

Tests verified: ✅ ProjectMetricsResourceTest (206 tests passed, 1 skipped)

* Revert id changes
thiagohora added a commit that referenced this pull request Nov 12, 2025
…bs (#3977)

* [OPIK-2856] [FE] Add datetime picker to traces, spans, and threads tabs

* Revision 2: Synchronize date range across all tabs using shared 'range' key

* Revision 3: Add refetchOnMount to ensure data refreshes when switching tabs

* Revision 4: Fix TypeScript error - change refetchOnMount from 'stale' to 'always'

* Fix date range

* Update SpanService.java

* Update TraceService.java

* Update ProjectMetricsResourceTest.java
awkoy pushed a commit that referenced this pull request Nov 12, 2025
awkoy pushed a commit that referenced this pull request Nov 12, 2025
awkoy pushed a commit that referenced this pull request Nov 12, 2025
awkoy pushed a commit that referenced this pull request Nov 12, 2025
…d of direct URL_TEMPLATE calls (#3947)

awkoy pushed a commit that referenced this pull request Nov 12, 2025
…3953)

thiagohora added a commit that referenced this pull request Nov 12, 2025
* [NA] [BE] Upgrade MySQL container from Testcontainers

* Fix imports order

* [OPIK-2856] [BE] Implement UUIDv7 time-based filtering for traces

- Add InstantToUUIDMapper to convert Instant timestamps to UUIDv7 bounds
- Add InstantParamConverter to parse ISO-8601 and epoch millisecond time parameters
- Update TracesResource to accept from_time and to_time parameters on /traces and /traces/stats endpoints
- Update TraceDAO to apply UUID-based time filtering using BETWEEN clause on id column
- Update TraceSearchCriteria to include uuidFromTime and uuidToTime fields
- Add comprehensive integration tests for time filtering with boundary conditions
- All tests passing: 10/10 time filtering tests + validation tests

* [OPIK-2856] Address PR review comments: improve exception handling and type safety

- Fix InstantParamConverter to catch specific DateTimeParseException instead of generic Exception
- Add debug logging when falling back to epoch milliseconds parsing
- Refactor anonymous ParamConverter class to named InstantConverter inner class for clarity
- Suppress unchecked cast with @SuppressWarnings annotation
- Fix MySQLContainerUtils return type to use MySQLContainer<?> for type safety
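
The dual-format parsing described above (ISO-8601 first, epoch milliseconds as a fallback) can be sketched in Python; `parse_instant` is a hypothetical stand-in for the Java `InstantParamConverter`, not the actual implementation:

```python
from datetime import datetime, timezone

def parse_instant(value: str) -> datetime:
    """Try ISO-8601 first; on failure, fall back to epoch milliseconds.

    Hypothetical sketch of the InstantParamConverter behavior described
    above -- the real converter is Java and catches DateTimeParseException.
    """
    try:
        # fromisoformat() on Python < 3.11 rejects a trailing "Z", so map it
        # to an explicit UTC offset first.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        # Not ISO-8601: interpret the value as epoch milliseconds.
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
```

Catching only the specific parse failure (here `ValueError`, there `DateTimeParseException`) keeps unrelated errors from being silently swallowed by the fallback path.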

* [OPIK-2856] Fix InstantToUUIDMapperTest to match implementation

- Update tests to reflect that toUpperBound uses next millisecond (+1ms) for inclusive BETWEEN queries
- Remove outdated assertions expecting same timestamp in both bounds
- Verify upper bound is lexicographically greater than lower bound
- All 13 tests now passing

* Remove setup duplicated code

* Revision 2: Address PR review comments - LOW priority fixes

- #7: Add INSTANCE singleton pattern for InstantConverter
- #8: Use StringUtils.isEmpty() for null-safe empty check
- Note: #2 and #3 already addressed in previous commit

* Revision 3: Use IdGenerator.getTimeOrderedEpoch() for UUID bounds

- Simplified InstantToUUIDMapper to use IdGenerator.getTimeOrderedEpoch() instead of convertOtelIdToUUIDv7
- Per the UUIDv7 RFC, the 12 sub-millisecond bits are optional; millisecond granularity is sufficient
- Start/end interval semantics with ±1ms ensure correct BETWEEN query results
- This approach has been battle-tested for months without issues, per the reviewer's recommendation
- Converted InstantToUUIDMapper to a @Singleton service for proper DI integration
- Updated TracesResource to inject InstantToUUIDMapper dependency
- Updated tests to properly mock IdGenerator dependency

* Fix tests

* [OPIK-2856] [BE] Split Get spans Tests

* Fix format

* Revision 2: Extract duplicated span creation logic into createSpanWithTimestamp helper method

* [OPIK-2856] [BE] Implement UUIDv7 time-based filtering for spans

* Revision 3: Extract workspace setup duplication into setupTestWorkspace helper method and fix transformTestParams call

* [OPIK-2856] Refactor TracesResourceTest to use TraceResourceClient instead of direct URL_TEMPLATE calls

- Replace all direct client.target(URL_TEMPLATE) calls with TraceResourceClient methods
- Add callFeedbackScoresWithCookie method to TraceResourceClient for session token authentication
- Add callRetrieveThreadResponseWithCookie method to TraceResourceClient for session token authentication
- Fix feedback batch endpoint by using callFeedbackScores and callFeedbackScoresWithCookie
- Add null checks for query parameters to prevent NPE errors
- Fix API key vs session token usage in authentication tests
- Rename get__whenApiKeyIsPresent__thenReturnTraceThread to get__whenSessionTokenIsPresent__thenReturnTraceThread in SessionTokenCookie class
- Add mockGetWorkspaceIdByName() calls for proper workspace mocking
- Preserve original test assertions and behavior
- All tests properly refactored to use resource client methods instead of direct HTTP calls

* [OPIK-2856] Remove duplicate methods from TraceResourceClient

- Remove callGetTraces() - duplicate of callGetTracesWithQueryParams()
- Remove callSearchTraces() - duplicate of callSearchTracesStream()
- Reduced code duplication and maintenance burden

* Fix tests

* Revision 2: Address Copilot review comments - remove redundant wrapper method and add clarifying comment

* Revision 3: Extract duplicated path splitting logic into helper method addPathSegments()

* Revision 4: Make getWebTarget() private and add callGetTraceThreadsWithSorting() public method

* Revision 7: Move addPathSegments() and addQueryParameters() helper methods to BaseCommentResourceClient

* [OPIK-2856] [BE] Add UUIDv7 time-based filtering for trace threads

* Revision 2: Address GitHub Copilot PR review comments

- Extract conditional UUID generation into generateThreadModelId() method for better readability
- Rename minTraceTimestamp to earliestTraceTimestamp for clarity
- Add explanatory comment about UUIDv7 lexicographic ordering in compareTo()

* Fix

* Revision 3: Add UUID time filter to SELECT_TRACES_STATS query

* Revision 4: Fix generateUUIDForTimestamp to manually construct UUIDv7

* [OPIK-2856] [BE] Implement UUIDv7-based time filtering for project metrics

- Add uuidFromTime and uuidToTime fields to ProjectMetricRequest
- Update ProjectMetricsService to enrich requests with UUID bounds using InstantToUUIDMapper
- Refactor ProjectMetricsDAO SQL queries to use UUID-based filtering (id BETWEEN uuid_from_time AND uuid_to_time)
- Extract timestamps from UUIDs using UUIDv7ToDateTime for bucketing and WITH FILL clauses
- Update TraceService to generate UUIDs based on trace startTime when ID is not provided
- Fix ProjectMetricsResourceTest to generate UUIDs with correct timestamps using TimeBasedEpochGenerator
- Remove explicit openTraceThread calls in tests to allow traces to create thread metadata with correct timestamps

All 206 ProjectMetricsResourceTest tests now passing (1 skipped).
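
Extracting timestamps back out of the IDs (the `UUIDv7ToDateTime` step mentioned above) is the inverse of the bound construction: the top 48 bits of a UUIDv7 are the unix epoch in milliseconds. A hypothetical Python analogue:

```python
import uuid
from datetime import datetime, timezone

def uuid7_to_datetime(u: uuid.UUID) -> datetime:
    """Recover the creation time embedded in a UUIDv7.

    Illustrative sketch of the UUIDv7ToDateTime helper mentioned above:
    the top 48 bits hold the unix timestamp in milliseconds.
    """
    ms = u.int >> 80
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

This is what lets bucketing and WITH FILL clauses operate on real timestamps while the filtering itself stays on the `id` column.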

* [OPIK-2856] Fix flaky MultiValueFeedbackScoresE2ETest by ensuring UUID bounds are min/max for timestamp

* [OPIK-2856] Update InstantToUUIDMapper tests to match new min/max UUID implementation

* [OPIK-2856] Address Copilot PR review comments: clarify 62-bit constant and update validateProject comment

* [OPIK-2856] [BE] Extract UUID utility for test reuse

- Create UUIDTestUtils with generateUUIDForTimestamp method
- Replace local implementations in ProjectMetricsResourceTest
- Replace local implementations in FindSpansResourceTest
- Remove duplicate method definitions and unused imports
- Centralize UUID generation logic for time-based testing

Tests verified: ✅ ProjectMetricsResourceTest (206 tests passed, 1 skipped)

* Revert id changes

* [OPIK-2856] [BE] Use batch calls to reduce test duration
JetoPistola added a commit that referenced this pull request Dec 1, 2025
Python SDK improvements:
- Import module instead of name for ExperimentScore (comment #2)
- Allow ExperimentScoreFunction to return single or List of ScoreResults (comment #3)
- Move experiment score verification to verify_experiment utility (comment #4)

Backend code quality:
- Simplify TypeReference diamond operator in ExperimentScore.java (comment #5)
- Remove overloaded constructor in FeedbackScoreNames.ScoreName (comment #6)
- Reuse ScoreName instead of ScoreNameWithType in DAO (comment #7)
- Add TODO for full primary key in ORDER BY (comment #8)
- Revert flakiness fix in TemplateUtilsTest.java (comment #9)
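
The single-or-list return contract from comment #3 amounts to a small normalization step before aggregation. A hedged Python sketch (names are illustrative, not the SDK's actual API):

```python
def normalize_score_results(result):
    """Accept either a single score result or a list of them (comment #3)."""
    return result if isinstance(result, list) else [result]

def run_scoring_functions(functions, test_results):
    """Apply each scoring function and flatten the normalized results.

    Hypothetical sketch: the real ExperimentScoreFunction protocol lives in
    the Python SDK; here scores are plain dicts for illustration.
    """
    scores = []
    for fn in functions:
        scores.extend(normalize_score_results(fn(test_results)))
    return scores
```

Normalizing at the call boundary keeps every downstream consumer working with a flat list, regardless of which shape the user's function returned.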
JetoPistola added a commit that referenced this pull request Dec 2, 2025
…nctions (#3989)

* Hide experiment_scores columns in the single experiment table

* Add SDK support for experiment_scores

* Add SDK support for experiment_scores

* Add BE functionality

* Typescript autogenerated code

* Documentation and FE update

* Address PR comments

* Address PR comments

* Fix PR comments

* Address PR comments

* Fix merge conflicts

* Fix tests

* Fix failing tests

* Fix failing tests

* Fix UI colors and column names

* Refactor: Extract common score averaging logic to eliminate duplication

* Harmonize experiment scores sorting to use map access from CTE

- Add experiment_scores_agg LEFT JOIN to non-grouped queries
- Simplify SortingQueryBuilder to use coalesce(map[key]) instead of complex JSON extraction
- Remove special case handling for experiment_scores in null direction logic
- Addresses PR review comments about query harmonization
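
The `coalesce(map[key])` sorting described above has a direct analogue in any language: look the score up in the map and fall back to a sentinel when the key is absent, so rows without that score sort last. A hypothetical Python sketch:

```python
rows = [
    {"id": 1, "experiment_scores": {"accuracy": 0.9}},
    {"id": 2, "experiment_scores": {}},            # no score recorded
    {"id": 3, "experiment_scores": {"accuracy": 0.4}},
]

def score_key(row, name="accuracy", missing=float("-inf")):
    # coalesce(map[key], default): absent keys fall back to the sentinel,
    # so they land at the end of a descending sort.
    return row["experiment_scores"].get(name, missing)

ranked = sorted(rows, key=score_key, reverse=True)
```

Map access with a coalesced default replaces the previous per-score JSON extraction with one uniform expression, which is what the harmonization above is after.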

* Remove early return for empty test results in experiment scores

- Allow experiment score functions to handle empty test results
- Some functions may want to return baseline/default scores with no data
- Addresses PR review comment about preventing score function execution

* Add E2E test for experiment scores functionality

- Test verifies experiment scoring functions work end-to-end
- Validates experiment scores appear in evaluation result
- Validates experiment scores are retrievable via SDK API
- Uses compute_max_score function to test score aggregation
- Addresses PR review comment about E2E test coverage

* Enhance experiment score computation to handle empty test results gracefully

- Update condition to return empty list if either scoring functions or test results are absent
- Ensures robustness in score computation logic

* Add Python SDK E2E test for experiment scores

- Tests experiment_scoring_functions parameter in evaluate()
- Verifies experiment scores are computed and returned in result
- Validates scores are persisted to backend API
- Tests aggregate metrics (max, min, avg) computation
- Addresses PR review comment about SDK test coverage

* Revert "Add E2E test for experiment scores functionality"

This reverts commit 50f9f8d.

* Apply DRY principle to score type mapping in ExperimentFeedbackScoresTab

- Extract addScoresToMap helper function to avoid duplication
- Works for both feedback_scores and experiment_scores
- Reduces code duplication and improves maintainability
- Fix parameter ordering (required before optional)

* [FE] Apply DRY principle to feedback/experiment scores handling

- useExperimentsTableConfig: Extract getScoreByName helper, eliminate duplicate accessorFn logic
- useCompareExperimentsChartsData: Extract createScoresMap helper for both score types
- CompareExperimentsDetails: Extract markScores helper to avoid duplicate map calls
- ExperimentsPage: Extract createScoresMap and getScoreNames helpers
- EvaluationSection: Use shared transformExperimentScores utility
- experimentScoreUtils: Refactor with formatScores helper to eliminate duplication

All changes maintain type safety and pass linting/typecheck

* Revision 7: Add missing experiment_scores_agg CTE to FIND query

* Revision 8: Fix experiment_scores sorting to use correct CTE alias 'es'

* Revision 9: Address all 9 PR review comments

Python SDK improvements:
- Import module instead of name for ExperimentScore (comment #2)
- Allow ExperimentScoreFunction to return single or List of ScoreResults (comment #3)
- Move experiment score verification to verify_experiment utility (comment #4)

Backend code quality:
- Simplify TypeReference diamond operator in ExperimentScore.java (comment #5)
- Remove overloaded constructor in FeedbackScoreNames.ScoreName (comment #6)
- Reuse ScoreName instead of ScoreNameWithType in DAO (comment #7)
- Add TODO for full primary key in ORDER BY (comment #8)
- Revert flakiness fix in TemplateUtilsTest.java (comment #9)

* Update return type of get_experiment_data method to use rest_api_types for consistency

* Revision 10: Add full primary key to ORDER BY clause

* Refactor test for standard deviation calculation in experiment scoring functions

Replaced hardcoded expected standard deviation value with a dynamic calculation using the statistics.stdev function for improved accuracy and maintainability.
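
Computing the expectation with `statistics.stdev` keeps the test correct if the fixture data ever changes; for example (illustrative fixture values, not the real test's):

```python
import statistics

scores = [0.25, 0.5, 0.75, 1.0]

# Dynamic expectation: sample standard deviation (n - 1 denominator),
# instead of a hardcoded literal that silently drifts if inputs change.
expected_std = statistics.stdev(scores)

# Equivalent manual computation, to show what stdev() does:
mean = statistics.fmean(scores)
manual = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
assert abs(expected_std - manual) < 1e-12
```

Note that `statistics.stdev` is the *sample* standard deviation; a hardcoded value computed with the population formula (`pstdev`) would disagree, which is one way such literals go stale.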

* Add experiment_scores column to experiments table in migration 000048

This migration introduces a new column, experiment_scores, to the experiments table to store precomputed metrics. The column is added with a default value of an empty string. A rollback statement is also included to drop the column if necessary.

* Update import statement for Prompt in evaluator.py to reflect new module structure

* Refactor whitespace in verifiers.py for improved readability

This commit removes unnecessary blank lines in the verify_experiment and _verify_experiment_scores functions, enhancing the overall clarity of the code without altering functionality.

* Enhance type hinting in dataset and experiment modules

This commit adds future annotations to the dataset REST operations and introduces TYPE_CHECKING for conditional imports in the experiment module, improving type hinting and code clarity without affecting functionality.

* Update documentation to replace `experiment_scores` with `experiment_scoring_functions` for consistency across evaluation methods

* Refactor score type handling in experiment feedback components

This commit replaces string literals for score types with constants, enhancing type safety and code clarity across various components, including ExperimentFeedbackScoresTab, ExperimentItemsTab, and related utility functions. The changes ensure consistent usage of SCORE_TYPE_FEEDBACK and SCORE_TYPE_EXPERIMENT throughout the codebase.

* Refactor column mapping for sorting functionality

This commit consolidates the logic for converting underscore-prefixed column IDs to dot notation into a single array of sortable prefixes. The `mapComplexColumn` function is updated to iterate over this array, improving code clarity and maintainability while ensuring consistent handling of various column types.

* Implement ExperimentScoreListCell and refactor score handling in data tables

This commit introduces the new ExperimentScoreListCell component for displaying experiment scores and updates the relevant data tables to utilize this component. Additionally, it refactors the handling of score types across various components, replacing string literals with constants for improved type safety and consistency. The changes affect the ExperimentsPage, ProjectsPage, and other related components, ensuring a unified approach to score type management.

* Refactor FeedbackScoresChartsWrapper and FeedbackScoreHoverCard for consistency

This commit updates the FeedbackScoresChartsWrapper component to rename the `isAggregationScores` prop to `areAggregatedScores` for improved clarity. Additionally, it modifies the subtitle text in the FeedbackScoreHoverCard component to use "Aggregated experiment scores" and "Average feedback scores" for consistency in terminology across the application.

* Add experiment scores tab to CompareExperimentsPage and update score handling

This commit introduces a new tab for displaying experiment scores in the CompareExperimentsPage. It updates the ExperimentFeedbackScoresTab component to handle both feedback and experiment scores based on the selected tab. The score retrieval logic is modified to filter scores according to their type, enhancing clarity and usability in the comparison of experiments.

* run fern generate

* Refactor score handling in various components to unify feedback and experiment score logic. Removed experiment score references and updated feedback score components to handle aggregated scores. Adjusted column definitions and metadata across multiple pages for consistency.

* Add migration to include experiment_scores column in experiments table

---------

Co-authored-by: Daniel Dimenshtein <danield@comet.com>
Co-authored-by: Ido Berkovich <ido@comet.com>
Co-authored-by: Boris Feld <boris@comet.com>
Co-authored-by: YarivHashaiComet <yarivh@comet.com>
YarivHashaiComet added a commit that referenced this pull request Jan 20, 2026
- Fix #1: Use trace provider when model not found (provider fallback)
- Fix #3: Add role mapping for external roles (tool, function, human, etc.)
- Fix #4: Support 'type' property for LangChain/LangGraph messages
- Fix #6: Add empty array check in canOpenInPlayground
- Fix #7: Check span input before using, fallback to trace input
- Fix #9/#10: Handle { messages: [] } case properly
thiagohora added a commit that referenced this pull request Feb 26, 2026
- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta
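
Replacing a recursive `flatMap` with `Mono.expand()` keeps batch processing iterative rather than growing a recursive operator chain. The same idea in a hedged Python sketch (cursor-based batching with illustrative names):

```python
def iterate_batches(fetch_batch, batch_size=1000):
    """Yield items batch by batch using a loop instead of recursion,
    analogous to Mono.expand() replacing a recursive flatMap: each step
    feeds the next cursor forward, and depth stays constant."""
    cursor = None
    while True:
        items, cursor = fetch_batch(cursor, batch_size)
        yield from items
        if cursor is None:
            break
```

A usage sketch: `fetch_batch(cursor, limit)` returns `(items, next_cursor)` and signals completion with a `None` cursor, mirroring how an `expand` step terminates by emitting empty.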
thiagohora added a commit that referenced this pull request Mar 2, 2026
…5338)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
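
The retry policy described above (5 attempts, exponential backoff from 250 ms capped at 2 s, 0.5 jitter) can be sketched language-agnostically; this Python version is illustrative, not the Java `RetryUtils` code:

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for MySQLTransactionRollbackException."""

def retry_on_deadlock(fn, attempts=5, base=0.25, cap=2.0, jitter=0.5,
                      sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except DeadlockError:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the deadlock
            # Exponential backoff: 0.25s, 0.5s, 1s, 2s (capped).
            delay = min(base * (2 ** attempt), cap)
            # Jitter spreads retries out to avoid a thundering herd of
            # threads re-acquiring the same locks in lockstep.
            sleep(delay * (1 + random.uniform(-jitter, jitter)))
```

Deadlocks are transient by nature (MySQL aborts one victim transaction), so retrying the whole operation with backoff is usually enough; the jitter matters precisely because the colliding threads started in lockstep.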

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into a single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting
thiagohora added a commit that referenced this pull request Mar 6, 2026
* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into a single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder.  Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.
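
The semantic-null behaviour can be sketched as follows (an illustrative version of `convertToBigDecimal`, not the project's exact code):

```java
import java.math.BigDecimal;

// Illustrative version of the null-percentile fix: absent p50/p90/p99 values
// map to null rather than BigDecimal.ZERO, so callers can tell "no data"
// apart from a genuine zero and apply COALESCE/fallback logic.
final class PercentileConversion {
    static BigDecimal convertToBigDecimal(Object value) {
        if (value == null) {
            return null; // absent entry: preserve semantic-null
        }
        if (value instanceof BigDecimal bd) {
            return bd;
        }
        if (value instanceof Number n) {
            return BigDecimal.valueOf(n.doubleValue());
        }
        return null; // unsupported input type
    }
}
```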

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4382] [BE] Address PR review comments

- Fix convertToBigDecimal to return BigDecimal.ZERO for unknown Number types
- Use import instead of inline FQN for java.time.Instant in ExperimentEntityData
- Change EMPTY_ARRAY_STR visibility from public to private in ExperimentAggregatesDAO
- Extract resolveLatestVersionId() to deduplicate version-id resolution in ExperimentAggregatesService
- Extract shared test setup (AggregatesTestContext + setupAggregatesTestData) in ExperimentAggregatesIntegrationTest

* [OPIK-4382] [BE] Address PR review: workspace scoping and thread-safe test collections

- Wrap countTotal() in Mono.deferContextual for consistent workspace context handling
- Use Collections.synchronizedList for shared lists mutated in parallel forEach

* [OPIK-4382] [BE] Replace div.* with explicit column list in dataset_item_versions subqueries

Avoids fetching unnecessary heavy columns (e.g. data, metadata) from
dataset_item_versions when they are not needed by the outer query.
thiagohora added a commit that referenced this pull request Mar 9, 2026
…regates recomputation (#5371)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
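
The retry shape described above can be sketched in plain Java. Names like `withRetry` and the message-based deadlock check are illustrative — the real `RetryUtils.handleOnDeadLocks()` inspects `MySQLTransactionRollbackException` types rather than messages:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Illustrative retry helper: up to 5 attempts, exponential backoff from 250ms
// capped at 2s, +/-50% jitter, and a recursive cause-chain walk to decide
// whether the failure is a deadlock. The message-based check stands in for
// the real MySQLTransactionRollbackException type check.
final class DeadlockRetrySketch {
    private static final int MAX_ATTEMPTS = 5;
    private static final long BASE_DELAY_MS = 250;
    private static final long MAX_DELAY_MS = 2_000;

    // Recursive detection, mirroring isDatabaseDeadlock().
    static boolean isDeadlock(Throwable t) {
        if (t == null) {
            return false;
        }
        if (t.getMessage() != null && t.getMessage().contains("Deadlock")) {
            return true;
        }
        return isDeadlock(t.getCause());
    }

    static <T> T withRetry(Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (!isDeadlock(e)) {
                    throw e; // only deadlocks are retried
                }
                last = e;
                long delay = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << attempt);
                // 0.5 jitter: spread retries to avoid a thundering herd
                delay += (long) ((ThreadLocalRandom.current().nextDouble() - 0.5) * delay);
                try {
                    Thread.sleep(Math.max(0, delay));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
        throw last;
    }
}
```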

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths
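
The debounce effect the subscriber's settings aim for can be illustrated with a toy in-process analogue (this is not the Redis-stream implementation; all names here are invented for the sketch):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Toy debouncer: repeated messages for the same experiment id within the
// debounce window collapse into a single recomputation.
final class DebounceSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable); // daemon so the sketch never blocks JVM exit
                thread.setDaemon(true);
                return thread;
            });
    private final Map<UUID, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
    private final long debounceMs;
    private final Consumer<UUID> recompute;

    DebounceSketch(long debounceMs, Consumer<UUID> recompute) {
        this.debounceMs = debounceMs;
        this.recompute = recompute;
    }

    void onMessage(UUID experimentId) {
        pending.compute(experimentId, (id, previous) -> {
            if (previous != null) {
                previous.cancel(false); // supersede the earlier pending run
            }
            return scheduler.schedule(() -> {
                pending.remove(id);
                recompute.accept(id);
            }, debounceMs, TimeUnit.MILLISECONDS);
        });
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Five rapid messages for one experiment id trigger one recomputation once the window elapses.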

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta
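
Reactor's `Mono.expand()` turns the recursive "process a batch, then call yourself for the next one" shape into iteration with the same empty-batch termination. Since Reactor is not stdlib, here is a plain-Java analogue of that shape (names are illustrative):

```java
import java.util.List;
import java.util.function.IntFunction;

// Iterative batch draining: fetch a batch starting at an offset, process it,
// and stop when a batch comes back empty -- the termination shape that
// Mono.expand() gives, without building a deep recursive operator chain.
final class BatchDrainSketch {
    static <T> int drainInBatches(IntFunction<List<T>> fetchBatch) {
        int processed = 0;
        int offset = 0;
        while (true) {
            List<T> batch = fetchBatch.apply(offset);
            if (batch.isEmpty()) {
                break; // empty batch terminates the expansion
            }
            processed += batch.size();
            offset += batch.size();
        }
        return processed;
    }
}
```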

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder.  Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths
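
The dedicated-count-plus-short-circuit pattern can be sketched like this (record and method names are illustrative, not the project's API):

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Run a cheap count query first, and only run the (heavier) items query when
// there is something to page; a zero count returns an empty page directly.
final class CountShortCircuit {
    record Page<T>(List<T> items, long total) {
        static <T> Page<T> empty() {
            return new Page<>(List.of(), 0L);
        }
    }

    static <T> Page<T> findPage(LongSupplier countQuery, Supplier<List<T>> itemsQuery) {
        long total = countQuery.getAsLong();
        if (total == 0) {
            return Page.empty(); // skip the items query entirely
        }
        return new Page<>(itemsQuery.get(), total);
    }
}
```

Unlike `count() OVER ()` in the paged query, the separate count reflects the full result set rather than the current page.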

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Fix project_deleted filter and comments_dedup scope in ExperimentAggregatesDAO

- Fix project_deleted filter: use zero UUID sentinel instead of empty string
  for FixedString(36) column comparison in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Fix comments_dedup CTE: scope trace_id subquery by dataset_id to avoid
  scanning the entire workspace's comments table
- Add missing streamMaxLen and streamTrimLimit fields to
  ExperimentDenormalizationConfig (implements StreamConfiguration interface)

* [OPIK-4383] [BE] Address PR review comments: extract ZERO_UUID constant and fix config comment

- Promote zero UUID sentinel to shared constant in ExperimentGroupMappers
- Use parameterized :zero_uuid binding in SQL templates instead of hardcoded string
- Fix config.yml comment from "Default: 120s" to "Default: 1m"
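
The sentinel itself is just the all-zero UUID, whose canonical string form is exactly 36 characters and therefore comparable against a `FixedString(36)` column where an empty string would not match; a minimal sketch:

```java
import java.util.UUID;

// The zero-UUID sentinel: a constant 36-character value suitable for
// FixedString(36) comparisons in place of an empty string.
final class ZeroUuidSketch {
    static final UUID ZERO_UUID = new UUID(0L, 0L);
}
```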

* [OPIK-4383] [BE] Add streamMaxLen and streamTrimLimit to experimentDenormalization config
itamargolan pushed a commit that referenced this pull request Mar 9, 2026
…regates recomputation (#5371)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @nonnull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.
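
Lombok's `@NonNull` compiles down to a boundary check roughly like the one below, so a null criteria fails immediately with an explicit message instead of an NPE deep inside template building. The class and return value here are placeholders, not the real DAO:

```java
import java.util.Objects;

class ExperimentAggregatesCounterSketch {
    // Roughly what a @NonNull parameter annotation generates: fail fast at
    // the method boundary with a descriptive NullPointerException.
    long countTotal(Object criteria) {
        Objects.requireNonNull(criteria, "criteria is marked non-null but is null");
        return 0L; // placeholder for the real count query
    }
}
```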

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic
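
A sketch of the guarded UUID extraction, assuming the "ClickHouse placeholder" is the all-zeros sentinel a `FixedString(36)` column defaults to (the method name mirrors the commit, but the signature is illustrative):

```java
import java.util.Optional;
import java.util.UUID;

final class GroupValuesSketch {
    private static final UUID ZERO_UUID = new UUID(0, 0);

    // Null, blank, unparsable, or placeholder values yield empty instead of
    // throwing, so malformed group values cannot break enrichment.
    static Optional<UUID> extractUuid(String value) {
        if (value == null || value.isBlank()) return Optional.empty();
        try {
            UUID id = UUID.fromString(value);
            return ZERO_UUID.equals(id) ? Optional.empty() : Optional.of(id);
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```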

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.
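
The fixed conversion might look like the sketch below (the real method likely handles more input types; this captures the null-propagation contract only):

```java
import java.math.BigDecimal;

final class PercentilesSketch {
    // Unknown or absent values map to null — a semantic "no data" — never to
    // BigDecimal.ZERO, so callers can apply COALESCE/fallback logic.
    static BigDecimal convertToBigDecimal(Object value) {
        if (value instanceof BigDecimal bd) return bd;
        if (value instanceof Number n) return BigDecimal.valueOf(n.doubleValue());
        return null; // null or unsupported type
    }
}
```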

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths
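
The count-then-fetch shape described above can be sketched generically — a dedicated count query (instead of `count() OVER ()` on the paged query, which reports only the page-scoped count) plus a short-circuit that skips the items query entirely when the count is zero. Names here are illustrative, not the real DAO types:

```java
import java.util.List;
import java.util.function.Supplier;

record PageSketch<T>(List<T> items, long total) {
    static <T> PageSketch<T> empty() { return new PageSketch<>(List.of(), 0L); }
}

final class PaginationSketch {
    // Issue the count first; only run the (more expensive) items query when
    // there is something to page through.
    static <T> PageSketch<T> find(Supplier<Long> countQuery, Supplier<List<T>> itemsQuery) {
        long total = countQuery.get();
        return total == 0 ? PageSketch.empty() : new PageSketch<>(itemsQuery.get(), total);
    }
}
```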

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Fix project_deleted filter and comments_dedup scope in ExperimentAggregatesDAO

- Fix project_deleted filter: use zero UUID sentinel instead of empty string
  for FixedString(36) column comparison in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Fix comments_dedup CTE: scope trace_id subquery by dataset_id to avoid
  scanning the entire workspace's comments table
- Add missing streamMaxLen and streamTrimLimit fields to
  ExperimentDenormalizationConfig (implements StreamConfiguration interface)

* [OPIK-4383] [BE] Address PR review comments: extract ZERO_UUID constant and fix config comment

- Promote zero UUID sentinel to shared constant in ExperimentGroupMappers
- Use parameterized :zero_uuid binding in SQL templates instead of hardcoded string
- Fix config.yml comment from "Default: 120s" to "Default: 1m"

* [OPIK-4383] [BE] Add streamMaxLen and streamTrimLimit to experimentDenormalization config
thiagohora added a commit that referenced this pull request Mar 11, 2026
…blisher (#5510)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
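
One plausible reading of that schedule — exponential backoff doubling from 250ms, capped at 2s, with the 0.5 jitter factor letting each delay vary by up to ±50% — can be sketched as a delay calculator (Reactor's `Retry.backoff(...).jitter(0.5)` computes this internally; exact jitter semantics may differ):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

final class DeadlockBackoffSketch {
    // Delay for a 0-based retry attempt: 250ms doubled per attempt, capped at
    // 2s, then randomized by up to ±50% to avoid a thundering herd.
    static Duration delayFor(int attempt) {
        long baseMillis = Math.min(250L << attempt, 2_000L);
        double jitter = ThreadLocalRandom.current().nextDouble(-0.5, 0.5);
        return Duration.ofMillis(Math.round(baseMillis * (1 + jitter)));
    }
}
```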

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths
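
A toy, in-process stand-in for that per-experiment lock — the real subscriber uses a workspace-scoped Redis-backed distributed lock, but the exclusivity contract is the same: only one aggregation runs per experiment key at a time, and concurrent duplicates skip rather than block:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PerKeyLockSketch {
    private final Set<String> held = ConcurrentHashMap.newKeySet();

    // Returns true if the task ran; false if another holder owned the key.
    boolean runExclusive(String experimentKey, Runnable aggregation) {
        if (!held.add(experimentKey)) return false; // already locked: skip
        try {
            aggregation.run();
            return true;
        } finally {
            held.remove(experimentKey); // always release
        }
    }
}
```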

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* [OPIK-4383] [BE] Rename segment from delete_by_trace_id to delete_by_span_ids

Fix stale segment name that was not updated when deleteByTraceIds was
refactored into deleteByIds (which now deletes by span IDs).

* [OPIK-4383] [BE] Add Preconditions guard for experimentIds in event classes

Enforce non-empty experimentIds in ExperimentItemsCreated and
ExperimentItemsDeleted constructors to make the contract explicit.
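
A record with a compact constructor captures that contract; the real events use Guava's `Preconditions.checkArgument`, which behaves the same way (this sketch uses a plain stdlib check and a hypothetical type name):

```java
import java.util.Set;

// Sketch of the ExperimentItemsCreated/Deleted guard: construction fails
// loudly when experimentIds is missing or empty, making the contract explicit.
record ExperimentItemsEventSketch(Set<String> experimentIds) {
    ExperimentItemsEventSketch {
        if (experimentIds == null || experimentIds.isEmpty()) {
            throw new IllegalArgumentException("experimentIds must not be empty");
        }
    }
}
```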

* [OPIK-4383] [BE] Remove unreachable empty-experimentIds tests

Producers already guarantee non-empty experimentIds before constructing
ExperimentItemsCreated/Deleted events, so these tests were exercising
an impossible scenario that now correctly fails the Preconditions guard.

* [OPIK-4383] [BE] Fix @NotNull to @NonNull in FeedbackScoreService

* [OPIK-4383] [BE] Remove redundant traceId null checks in SpanService

SpanUpdate.traceId is @NotNull (Jakarta validation), so the null guard
is dead code — the event always fires after validation.

* [OPIK-4383] [BE] Remove orphaned deleteAllThreadScores from FeedbackScoreService

The DAO method was removed during merge from main but the service
interface and implementation were not cleaned up, causing a compilation error.
thiagohora added a commit that referenced this pull request Mar 12, 2026
…alizationJob and tests (#5511)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta
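For context, the batch iteration that Mono.expand() replaces the recursive flatMap with can be sketched as a plain loop (an illustrative stand-in only — the real populateExperimentItemsInBatches is reactive, and the fetch function and element types below are invented):

```java
import java.util.List;
import java.util.function.Function;

// Plain-loop equivalent of paging with Mono.expand(): fetch a batch, process
// it, advance the cursor to the last seen element, and stop when a short
// batch signals the final page.
public final class BatchPager {
    static int processAll(Function<Integer, List<Integer>> fetchAfter, int batchSize) {
        int processed = 0;
        int cursor = Integer.MIN_VALUE; // "no cursor yet"
        while (true) {
            List<Integer> batch = fetchAfter.apply(cursor);
            if (batch.isEmpty()) {
                break;
            }
            processed += batch.size();
            cursor = batch.get(batch.size() - 1);
            if (batch.size() < batchSize) {
                break; // short batch: last page
            }
        }
        return processed;
    }
}
```

Mono.expand() expresses the same "fetch next page from the previous result" shape declaratively, without recursion growing the operator chain.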

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.
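The semantic-null behaviour described above amounts to roughly this (a sketch; the method name matches the commit, the surrounding class is assumed):

```java
import java.math.BigDecimal;

// Null-preserving conversion: absent or unsupported inputs map to null rather
// than BigDecimal.ZERO, so callers can still apply COALESCE/fallback logic.
public final class Percentiles {
    static BigDecimal convertToBigDecimal(Object value) {
        if (value instanceof BigDecimal bd) {
            return bd; // already the target type
        }
        if (value instanceof Number n) {
            return BigDecimal.valueOf(n.doubleValue());
        }
        return null; // null or unsupported type: semantic null, not ZERO
    }
}
```

Returning ZERO here would be indistinguishable from a genuine zero percentile, which is exactly the ambiguity the fix removes.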

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths
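The count-then-fetch short-circuit can be illustrated like this (hypothetical names; the real count and items queries run against ClickHouse):

```java
import java.util.List;
import java.util.function.Supplier;

// Run the cheap count query first and skip the items query entirely when
// nothing matches, returning an empty page.
public final class PagedQuery {
    record Page(List<String> items, long total) {
        static Page empty() {
            return new Page(List.of(), 0);
        }
    }

    static Page find(Supplier<Long> countQuery, Supplier<List<String>> itemsQuery) {
        long total = countQuery.get();
        if (total == 0) {
            return Page.empty(); // short-circuit: items query never executes
        }
        return new Page(itemsQuery.get(), total);
    }
}
```

Unlike count() OVER (), the dedicated count query sees the full result set, not just the current page.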

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoid double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.
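The debounce semantics of the ZSET index can be modelled in memory like this (illustration only — the real publisher uses Redisson against Redis, the userName hash is omitted, and these class/method names are invented):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of the Redis ZSET debounce: each publish (re)scores the
// compound member at now + debounceDelay, so rapid re-triggers of the same
// experiment collapse into a single pending entry.
public final class DebounceIndex {
    private final Map<String, Long> scoreByMember = new HashMap<>(); // models the ZSET
    private final long debounceDelayMs;

    public DebounceIndex(long debounceDelayMs) {
        this.debounceDelayMs = debounceDelayMs;
    }

    // Equivalent of ZADD "workspaceId:experimentId" with score now + delay.
    public void publish(String workspaceId, String experimentId, long nowMs) {
        scoreByMember.put(workspaceId + ":" + experimentId, nowMs + debounceDelayMs);
    }

    // Equivalent of the job reading members with score <= now and removing them.
    public List<String> drainDue(long nowMs) {
        List<String> due = new ArrayList<>();
        scoreByMember.entrySet().removeIf(e -> {
            if (e.getValue() <= nowMs) {
                due.add(e.getKey());
                return true;
            }
            return false;
        });
        return due;
    }

    public int size() {
        return scoreByMember.size();
    }
}
```

The key property: re-publishing before expiry pushes the score forward instead of adding a second entry, which is the deduplication the tests verify.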

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @nonnull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @nonnull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize(class.getSimpleName())
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
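The key derivation amounts to lowercasing the first character of the simple class name, roughly (a stand-alone sketch mirroring the behaviour, not the WordUtils implementation itself):

```java
// Mirrors how dropwizard-jobs derives the jobs-map key from the job class:
// uncapitalize the simple class name, so ExperimentDenormalizationJob looks
// up the key "experimentDenormalizationJob".
public final class JobKey {
    static String uncapitalize(String s) {
        if (s == null || s.isEmpty() || Character.isLowerCase(s.charAt(0))) {
            return s;
        }
        return Character.toLowerCase(s.charAt(0)) + s.substring(1);
    }
}
```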

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention
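The iteration cap works roughly like this (hypothetical sketch; drainOneBatch stands in for one expand() round over the ZSET):

```java
import java.util.function.IntSupplier;

// Safety valve: cap the number of drain rounds so a ZSET member that fails
// to be removed cannot keep the pagination loop spinning forever.
public final class SafetyValve {
    static int drainWithCap(IntSupplier drainOneBatch, int maxIterations) {
        int total = 0;
        for (int i = 0; i < maxIterations; i++) {
            int n = drainOneBatch.getAsInt();
            if (n == 0) {
                break; // nothing left to drain
            }
            total += n;
        }
        return total;
    }
}
```

If entries are never removed, the loop still terminates after maxIterations rounds instead of looping indefinitely.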
thiagohora added a commit that referenced this pull request Mar 13, 2026
…ndpoints (#5577)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
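The backoff schedule described above (250 ms base doubling to a 2 s cap, ±50% jitter, 5 attempts) can be sketched as follows — an illustration only, not the actual RetryUtils code; the deadlock detection is reduced here to catching RuntimeException:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Minimal sketch of retry-on-deadlock with exponential backoff and jitter.
public final class DeadlockRetry {
    static final int MAX_ATTEMPTS = 5;
    static final long BASE_DELAY_MS = 250;
    static final long MAX_DELAY_MS = 2_000;
    static final double JITTER = 0.5;

    // Exponential backoff: 250 ms, 500 ms, 1 s, then capped at 2 s.
    static long baseDelayMillis(int attempt) {
        return Math.min(BASE_DELAY_MS << attempt, MAX_DELAY_MS);
    }

    // Apply +/-50% jitter so competing transactions don't retry in lockstep
    // (the "thundering herd" the commit message mentions).
    static long jitteredDelayMillis(int attempt) {
        double factor = 1.0 + (ThreadLocalRandom.current().nextDouble() * 2 - 1) * JITTER;
        return (long) (baseDelayMillis(attempt) * factor);
    }

    // Retry the action while it keeps failing; rethrow after the last attempt.
    static <T> T withRetry(Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(jitteredDelayMillis(attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        throw last;
    }
}
```

The real implementation additionally walks the cause chain (recursive isDatabaseDeadlock()) so only MySQLTransactionRollbackException triggers a retry; unrelated failures propagate immediately.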

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @nonnull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @nonnull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder.  Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoid double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.
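The debounce mechanism above can be modelled in a few lines (an in-memory stand-in for the Redis ZSET, with hypothetical names — the real code uses Redisson against Redis): the member is `workspaceId:experimentId`, the score is `now + debounceDelay`, and re-publishing before expiry simply overwrites the score, collapsing bursts into a single aggregation run.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of the Redis ZSET debounce index (illustrative only).
public class DebounceIndex {

    private final Map<String, Long> zset = new LinkedHashMap<>(); // member -> score
    private final long debounceDelayMs;

    public DebounceIndex(long debounceDelayMs) {
        this.debounceDelayMs = debounceDelayMs;
    }

    // Equivalent of ZADD: re-publishing overwrites the score, pushing expiry out.
    public void publish(String workspaceId, String experimentId, long nowMs) {
        zset.put(workspaceId + ":" + experimentId, nowMs + debounceDelayMs);
    }

    // Equivalent of the job reading members with score <= now, then removing them.
    public List<String> drainDue(long nowMs) {
        List<String> due = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = zset.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= nowMs) {
                due.add(e.getKey());
                it.remove();
            }
        }
        return due;
    }
}
```

Two publishes inside the window produce a single due member whose expiry is the later of the two, which is exactly the deduplication the integration tests verify.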

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize(class.getSimpleName())
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
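The key-casing rule can be sketched as follows (a minimal reimplementation of what `WordUtils.uncapitalize(class.getSimpleName())` produces; the nested class is a stand-in so the example is self-contained):

```java
public class JobKeys {

    // The jobs-map key is the job class's simple name with the
    // first letter lower-cased.
    public static String jobConfigKey(Class<?> jobClass) {
        String name = jobClass.getSimpleName();
        return Character.toLowerCase(name.charAt(0)) + name.substring(1);
    }

    // Stand-in class so the example is self-contained.
    public static final class ExperimentDenormalizationJob {}

    public static void main(String[] args) {
        System.out.println(jobConfigKey(ExperimentDenormalizationJob.class));
        // experimentDenormalizationJob
    }
}
```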

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support
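The has_aggregated/has_raw branch optimization above boils down to a small decision: run a cheap pre-query counting which experiments have pre-computed rows, then render only the needed UNION ALL branches. A hedged sketch (hypothetical names; the real templates are StringTemplate-driven):

```java
public class AggregationBranches {

    // Hypothetical pre-query result: how many requested experiments have
    // rows in experiment_aggregates vs. need the live (raw) JOIN path.
    public record BranchCounts(long aggregatedCount, long rawCount) {
        public boolean hasAggregated() { return aggregatedCount > 0; }
        public boolean hasRaw() { return rawCount > 0; }
    }

    // Mirrors the <if(has_aggregated)>/<if(has_raw)> template flags:
    // emit a UNION ALL branch only when its population is non-empty.
    public static String renderBranches(BranchCounts counts) {
        StringBuilder sql = new StringBuilder();
        if (counts.hasAggregated()) {
            sql.append("SELECT ... FROM experiments_from_aggregates");
        }
        if (counts.hasAggregated() && counts.hasRaw()) {
            sql.append(" UNION ALL ");
        }
        if (counts.hasRaw()) {
            sql.append("SELECT ... FROM experiments_raw");
        }
        return sql.toString();
    }
}
```

When all experiments are aggregated (or all raw), the dead branch is skipped entirely; the mixed case emits both branches joined by UNION ALL.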

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.

* Fix get by id

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.
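The double LIMIT/OFFSET bug described above is easy to reproduce in miniature: paging once inside the CTE and again in the outer query skips the offset twice, so page 2 onwards comes back empty. A simplified stand-in using lists instead of SQL:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DoubleLimitBug {

    // One application of LIMIT/OFFSET, modelled on a list of rows.
    public static List<Integer> page(List<Integer> rows, int limit, int offset) {
        return rows.stream().skip(offset).limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> rows = IntStream.range(0, 10).boxed().collect(Collectors.toList());
        int limit = 3, offset = 3; // page 2

        List<Integer> correct = page(rows, limit, offset);
        // Applying the same LIMIT/OFFSET again to the already-paged CTE output:
        List<Integer> doublePaged = page(correct, limit, offset);

        System.out.println(correct);     // [3, 4, 5]
        System.out.println(doublePaged); // []
    }
}
```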

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* [OPIK-4384] [BE] Unify aggregation branch counts with shared ExperimentAggregationSql

Extract SELECT_AGGREGATED_EXPERIMENT_IDS SQL and AggregatedExperimentCounts
into shared ExperimentAggregationSql utility class. Introduce
AggregationBranchCountsCriteria DTO to unify getAggregationBranchCounts
overloads across ExperimentDAO, ExperimentItemDAO, and DatasetItemVersionDAO.

* [OPIK-4384] [BE] Move getAggregationBranchCounts to ExperimentAggregatesDAO

Consolidate aggregation branch counting logic into ExperimentAggregatesDAO
instead of a separate utility class. Extract DTOs into their own files
in the experiments.aggregations package.

* [OPIK-4384] [BE] Deduplicate experiment_aggregates subquery with SELECT DISTINCT

Prevent inflated counts from ReplacingMergeTree pre-merge duplicates
in the aggregation branch counting query.
thiagohora added a commit that referenced this pull request Mar 16, 2026
…ment by ID (#5579)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
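The retry schedule above (exponential backoff from 250ms to a 2s cap, with jitter to spread retries) can be sketched as a pure delay computation; the `random01` argument is injected so the formula stays deterministic and testable — the real `handleOnDeadLocks()` wiring in RetryUtils is more involved.

```java
public class DeadlockBackoff {

    // Illustrative backoff formula: exponential growth from baseMs, capped
    // at capMs, then multiplicative jitter in [1 - j, 1 + j).
    // random01 is a uniform sample in [0, 1).
    public static long backoffDelayMs(int attempt, long baseMs, long capMs,
                                      double jitterFactor, double random01) {
        long exp = Math.min(capMs, baseMs * (1L << attempt)); // 250, 500, 1000, 2000 (capped)
        // jitter spreads concurrent retries apart to avoid a thundering herd
        double jitter = 1.0 + jitterFactor * (2.0 * random01 - 1.0);
        return Math.round(exp * jitter);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println(backoffDelayMs(attempt, 250, 2000, 0.5, 0.5));
        }
    }
}
```

With `random01 = 0.5` the jitter term is neutral, so the schedule is 250, 500, 1000, 2000, 2000ms across the five attempts.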

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder.  Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoid double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId, which threw an NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize(class.getSimpleName())
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
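The key derivation is just the simple class name with its first letter lowercased; a plain-JDK equivalent of the WordUtils.uncapitalize lookup (the helper name here is illustrative, not the framework's actual code):

```java
// Derives the dropwizard-jobs config key from a job's simple class name,
// mirroring WordUtils.uncapitalize with plain JDK calls.
final class JobKeys {
    static String configKey(String simpleClassName) {
        if (simpleClassName.isEmpty()) return simpleClassName;
        return Character.toLowerCase(simpleClassName.charAt(0)) + simpleClassName.substring(1);
    }
}
```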

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.

* [OPIK-4386] [BE] Trigger lazy aggregation via publisher on GET experiment by ID

When fetching an experiment by ID, if the experiment is in COMPLETED or
CANCELLED state and is not yet present in the experiment_aggregates table,
enqueue it for aggregation using ExperimentAggregationPublisher instead of
computing aggregations synchronously. The check and publish are performed
off the critical path via doOnEach, so the caller receives the experiment
immediately without waiting for the side effect to complete.

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4386] [BE] fix: demote lazy aggregation check log to DEBUG

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.

* Fix format

* Fix get by id

* Fix mapping

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4386] [BE] Increase debounceDelay in test config to prevent race condition

The denormalization job was processing finished experiments during test
execution with incomplete ClickHouse data, causing stale aggregated
values to be returned instead of fresh raw computations.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches
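Why the tiebreaker matters for pagination: when two rows share the same primary sort value, their relative order is otherwise unspecified and can differ between queries, so page boundaries can overlap or skip rows. A sketch of the comparator semantics, with illustrative record and field names:

```java
import java.util.Comparator;

// Rows with equal primary sort values get a deterministic order via the
// unique id as a secondary key (id DESC), so repeated paged queries never
// overlap or skip rows at page boundaries.
record Row(long id, double score) {
    static final Comparator<Row> PAGED_ORDER =
            Comparator.comparingDouble(Row::score).reversed()
                    .thenComparing(Comparator.comparingLong(Row::id).reversed());
}
```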

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.
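The order-insensitive assertion boils down to comparing the two collections as multisets; a plain-JDK equivalent of what ignoringCollectionOrder achieves for flat string lists (the helper name is illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Two lists hold the same comments regardless of groupUniqArray's ordering
// iff their element -> count maps are equal (multiset equality).
final class UnorderedAssert {
    static boolean sameElements(List<String> a, List<String> b) {
        Function<List<String>, Map<String, Long>> counts =
                xs -> xs.stream().collect(Collectors.groupingBy(x -> x, Collectors.counting()));
        return counts.apply(a).equals(counts.apply(b));
    }
}
```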

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* Update config-test.yml

* Update ExperimentService.java

* Remove old unused query

* [OPIK-4386] [BE] Address PR review comments: demote log to debug, add try-catch for context safety, add getById lazy aggregation tests

* [OPIK-4386] [BE] Add workspaceId to lazy aggregation log messages
JetoPistola pushed a commit that referenced this pull request Mar 16, 2026
…ment by ID (#5579)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
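The retry mechanism described above can be sketched as follows. This is a hedged sketch assuming the parameters from the commit message (5 attempts, 250ms to 2s exponential backoff, 0.5 jitter); the class and method names are illustrative stand-ins for RetryUtils, and deadlock detection is approximated by matching the exception class name and MySQL's deadlock message instead of the real MySQLTransactionRollbackException type.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class DeadlockRetry {
    static final int MAX_ATTEMPTS = 5;

    // Walks the cause chain, mirroring the recursive isDatabaseDeadlock check.
    static boolean isDeadlock(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c.getClass().getSimpleName().contains("TransactionRollback")
                    || String.valueOf(c.getMessage()).contains("Deadlock")) {
                return true;
            }
        }
        return false;
    }

    static <T> T withRetry(Callable<T> op) throws Exception {
        long backoffMs = 250; // first delay; doubles up to the 2s cap
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!isDeadlock(e) || attempt == MAX_ATTEMPTS) throw e;
                // 0.5 jitter: actual sleep lands in [0.5, 1.5] x backoff,
                // spreading out competing threads (thundering herd).
                double jitter = 0.5 + ThreadLocalRandom.current().nextDouble();
                Thread.sleep((long) (backoffMs * jitter));
                backoffMs = Math.min(backoffMs * 2, 2_000);
            }
        }
    }
}
```

Non-deadlock exceptions and the final failed attempt propagate unchanged, so only transient lock contention is retried.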

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.
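The null-propagation fix can be read as: convert the types you recognize, and return null (never BigDecimal.ZERO) for everything else. A hedged sketch; the real DAO helper likely recognizes more numeric types than shown here.

```java
import java.math.BigDecimal;

// Null and unsupported inputs propagate as null (semantic "no data"),
// never BigDecimal.ZERO, so callers can apply COALESCE-style fallbacks.
final class PercentileMapper {
    static BigDecimal convertToBigDecimal(Object value) {
        if (value instanceof BigDecimal bd) return bd;
        if (value instanceof Double d) return BigDecimal.valueOf(d);
        if (value instanceof Long l) return BigDecimal.valueOf(l);
        return null;
    }
}
```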

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoid double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @nonnull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize on the job class's simple name
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
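
The key-casing rule can be illustrated with a minimal stand-in for the uncapitalize step (the real lookup lives inside dropwizard-jobs; this helper only mirrors the behaviour for illustration):

```java
// Minimal stand-in for WordUtils.uncapitalize as applied by dropwizard-jobs
// to derive the jobs-map key from a job class's simple name.
class JobKeySketch {
    static String uncapitalize(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        // Lowercase only the first character; the rest keeps its camelCase.
        return Character.toLowerCase(s.charAt(0)) + s.substring(1);
    }

    static String jobKey(Class<?> jobClass) {
        return uncapitalize(jobClass.getSimpleName());
    }
}
```

So a class named `ExperimentDenormalizationJob` is looked up under `experimentDenormalizationJob`, which is why the capitalized key silently failed to match.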

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.

* [OPIK-4386] [BE] Trigger lazy aggregation via publisher on GET experiment by ID

When fetching an experiment by ID, if the experiment is in COMPLETED or
CANCELLED state and is not yet present in the experiment_aggregates table,
enqueue it for aggregation using ExperimentAggregationPublisher instead of
computing aggregations synchronously. The check and publish are performed
off the critical path via doOnEach, so the caller receives the experiment
immediately without waiting for the side effect to complete.
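
The off-the-critical-path pattern can be sketched without Reactor: return the value to the caller immediately and fire the enqueue as a detached asynchronous side effect. CompletableFuture here stands in for the doOnEach + subscribe combination; the names are illustrative, not the production API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Illustrative fire-and-forget side effect: the caller gets the value back
// immediately, the aggregation trigger runs asynchronously, and any failure
// in the trigger is contained instead of failing the read path.
class LazyTriggerSketch {
    static <T> T returnAndTrigger(T value, Consumer<T> sideEffect) {
        CompletableFuture.runAsync(() -> sideEffect.accept(value))
                .exceptionally(ex -> null); // swallow: side effect must not break the read
        return value;
    }
}
```

The important property is that the returned value never waits on, and never observes errors from, the side effect.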

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4386] [BE] fix: demote lazy aggregation check log to DEBUG

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention
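
The safety valve can be sketched independently of Reactor's expand(): a paged drain that stops after a bounded number of iterations even if entries are never removed from the underlying index. This is an illustrative model, not the production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative bounded drain: keep fetching pages until the source is empty,
// but never loop more than maxIterations times, so a page that fails to be
// removed from the underlying index cannot spin the job forever.
class BoundedDrainSketch {
    static <T> List<T> drain(Supplier<List<T>> fetchPage, int maxIterations) {
        List<T> all = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            List<T> page = fetchPage.get();
            if (page.isEmpty()) {
                break; // source exhausted: normal termination
            }
            all.addAll(page);
        }
        return all; // capped even if fetchPage keeps returning the same page
    }
}
```

Without the cap, a ZSET member that survives its own removal (e.g. due to a transient Redis error) would be re-fetched on every expansion step indefinitely.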

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.
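
The pass-rate column is a simple ratio of the two counters. A minimal sketch (names are illustrative), returning null when there are no evaluated items to mirror the NULL returned for non-evaluation-suite experiments:

```java
// Illustrative pass-rate computation matching the columns described above:
// pass_rate = passed_count / total_count, undefined (null) when total is zero.
class PassRateSketch {
    static Double passRate(long passedCount, long totalCount) {
        if (totalCount == 0) {
            return null; // no evaluated items: no meaningful rate
        }
        return (double) passedCount / totalCount;
    }
}
```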

* Fix format

* Fix get by id

* Fix mapping

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4386] [BE] Increase debounceDelay in test config to prevent race condition

The denormalization job was processing finished experiments during test
execution with incomplete ClickHouse data, causing stale aggregated
values to be returned instead of fresh raw computations.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* Update config-test.yml

* Update ExperimentService.java

* Remove old unused query

* [OPIK-4386] [BE] Address PR review comments: demote log to debug, add try-catch for context safety, add getById lazy aggregation tests

* [OPIK-4386] [BE] Add workspaceId to lazy aggregation log messages
thiagohora added a commit that referenced this pull request Mar 16, 2026
…nts endpoint (#5583)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
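
The retry schedule described above (exponential backoff from 250 ms capped at 2 s, with 0.5 jitter) can be sketched as a pure delay calculation. The real mechanism lives in RetryUtils; these names and the exact doubling/cap formula are illustrative assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative delay schedule: exponential backoff doubling from a base delay,
// capped at a maximum, with +/- jitterFactor randomisation to spread retries
// from concurrent writers (the "thundering herd" mentioned above).
class BackoffSketch {
    static long baseDelayMs(long initialMs, long maxMs, int attempt) {
        long delay = initialMs << Math.min(attempt, 30); // initial * 2^attempt
        return Math.min(delay, maxMs);
    }

    static long jitteredDelayMs(long initialMs, long maxMs, int attempt, double jitterFactor) {
        long base = baseDelayMs(initialMs, maxMs, attempt);
        // Scale the base delay by a random factor in [1 - jitter, 1 + jitter).
        double factor = 1.0 + (ThreadLocalRandom.current().nextDouble() * 2 - 1) * jitterFactor;
        return Math.max(0, Math.round(base * factor));
    }
}
```

With jitter, two threads that deadlock against each other retry at different moments instead of colliding again on the same schedule.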

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoid double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize on the job class's simple name
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
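
The lookup convention described above can be sketched in a few lines (illustrative Python mirror of `WordUtils.uncapitalize(class.getSimpleName())`, not the framework's actual code):

```python
def job_config_key(cls: type) -> str:
    """Derive the jobs-map key the way dropwizard-jobs does:
    uncapitalize the simple class name (lowercase only the first letter)."""
    name = cls.__name__
    return name[0].lower() + name[1:] if name else name


class ExperimentDenormalizationJob:
    pass


# The config key must therefore start with a lowercase letter.
print(job_config_key(ExperimentDenormalizationJob))  # experimentDenormalizationJob
```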

* Replace @every annotation with programmatic Quartz scheduling

Remove @every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.

* [OPIK-4386] [BE] Trigger lazy aggregation via publisher on GET experiment by ID

When fetching an experiment by ID, if the experiment is in COMPLETED or
CANCELLED state and is not yet present in the experiment_aggregates table,
enqueue it for aggregation using ExperimentAggregationPublisher instead of
computing aggregations synchronously. The check and publish are performed
off the critical path via doOnEach, so the caller receives the experiment
immediately without waiting for the side effect to complete.

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4386] [BE] fix: demote lazy aggregation check log to DEBUG

* [OPIK-4387] [BE] feat: wire aggregation publisher into finishExperiments endpoint

Chain experimentAggregationPublisher.publish() after AlertEvent in
finishExperiments() so experiments finished via POST /v1/private/experiments/finish
are published to Redis for aggregation computation.

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* [OPIK-4387] [BE] feat: add stream trimming to experiment denormalization XADD

Add streamMaxLen and streamTrimLimit configuration to bound Redis stream
growth on the experiment denormalization producer (ExperimentDenormalizationJob).
Uses Redisson's trimNonStrict().maxLen().limit() API for approximate trimming.

* [OPIK-4387] [BE] fix: make aggregation publish best-effort in finishExperiments

Swallow and log Redis/publish errors so finishExperiments returns 204
even when Redis is down. Aggregation will be retried by the lazy trigger
or next job cycle.

* [OPIK-4387] [BE] refactor: centralize Redis stream XADD trimming in RedisStreamUtils

Extract duplicate StreamAddArgs.entry().trimNonStrict().maxLen().limit()
into RedisStreamUtils.buildAddArgs() so stream trimming settings live in
one place. Updates all 5 producers.
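
The trimming behaviour can be modelled in plain Python (a toy list-backed stream; the real code uses Redisson's `trimNonStrict().maxLen().limit()`, and redis-py exposes the same knobs via `xadd(..., maxlen=..., approximate=True, limit=...)`):

```python
def xadd_with_trim(stream: list, entry: dict, max_len: int, trim_limit: int) -> None:
    """Append an entry, then evict oldest entries so the stream stays near
    max_len. trim_limit caps how many entries one call may evict, which is
    what makes the trimming non-strict (approximate) rather than exact."""
    stream.append(entry)
    evictable = min(max(len(stream) - max_len, 0), trim_limit)
    del stream[:evictable]


stream: list = []
for i in range(1500):
    xadd_with_trim(stream, {"seq": i}, max_len=1000, trim_limit=100)
print(len(stream))  # 1000
```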

* [OPIK-4387] [BE] fix: defer aggregation publish and update test for best-effort behavior

Wrap aggregation publisher in Mono.defer() so it subscribes only after
upstream completes, and update unit test to expect completion instead of
error propagation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention
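
The safety valve can be sketched as a capped drain loop (hypothetical helper names; the real implementation uses Reactor's `expand()`):

```python
def drain_zset(fetch_page, remove, batch_size: int, max_batches: int = 1000):
    """Page through a sorted set, removing each processed page. The iteration
    cap is a safety valve: if removal silently fails, the same entries would
    be re-fetched forever, so we bail out after max_batches pages."""
    processed = []
    for _ in range(max_batches):
        page = fetch_page(batch_size)
        if not page:
            break
        processed.extend(page)
        remove(page)
    return processed


backing = list(range(25))
out = drain_zset(lambda n: backing[:n],
                 lambda page: backing.__delitem__(slice(0, len(page))),
                 batch_size=10)
print(len(out))  # 25
```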

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.

* Fix format

* Fix get by id

* Fix mapping

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4386] [BE] Increase debounceDelay in test config to prevent race condition

The denormalization job was processing finished experiments during test
execution with incomplete ClickHouse data, causing stale aggregated
values to be returned instead of fresh raw computations.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches
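
The binding pattern can be illustrated with a small sketch (the function, placeholder syntax, and column names here are hypothetical, not the actual `SortingQueryBuilder` code):

```python
def build_sort_expression(sort_fields: list) -> tuple:
    """Build an ORDER BY fragment with named bind parameters instead of
    splicing user-supplied keys into the SQL string. Each dynamic key gets
    a placeholder (:sort_key_N) and the driver binds its value, so a
    malicious key can never change the query shape. A deterministic
    tiebreaker (id DESC) is always appended for stable pagination."""
    parts, params = [], {}
    for i, (key, direction) in enumerate(sort_fields):
        bind = f"sort_key_{i}"
        parts.append(f"scores[:{bind}] {direction}")
        params[bind] = key
    parts.append("id DESC")
    return "ORDER BY " + ", ".join(parts), params


sql, params = build_sort_expression([("accuracy", "DESC")])
print(sql)     # ORDER BY scores[:sort_key_0] DESC, id DESC
print(params)  # {'sort_key_0': 'accuracy'}
```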

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* Fix issues

* [OPIK-4387] [BE] Fix missing closing brace in ExperimentServiceTest
YarivHashaiComet pushed a commit that referenced this pull request Mar 17, 2026
* design doc

* FE communication and ERD additions

* ui reporting events in flow

* changes

* add reason to TrialItemRun

* [NA] [SDK] feat: add greenfield optimization framework package

Implements a new optimization framework (`apps/opik-optimizer`) that
decouples optimizer algorithms from experiment execution, persistence,
and UI concerns. Integrates via the existing optimization studio pipeline
(Redis queue → Python backend → subprocess).

Key components:
- Orchestrator: central lifecycle controller with sampler, validator,
  materializer, result aggregator, and event emitter
- StupidOptimizer: 2-step test optimizer (3 candidates → best → 2 more)
- EvaluationAdapter: wraps SDK evaluate_optimization_suite_trial()
- Backend integration: new Redis queue, framework_optimizer job processor,
  framework_runner subprocess entry point

Also adds evaluate_optimization_suite_trial() to the Python SDK, combining
optimization trial linkage with evaluation suite behavior (evaluators and
execution policy from the dataset).

53 unit + integration tests passing. Verified end-to-end against Comet cloud
with real LLM calls, UI progress chart, prompt display, and score tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adjustments for UI and framework review

* fix: address PR review comments - dict access bug and theme color

- Fix AttributeError in framework_runner.py: dataset.get_items() returns
  dicts, use item["id"] instead of item.id
- Fix hard-coded hex color in TrialPassedCell.tsx: use text-success CSS
  class instead of text-[#12B76A] for proper dark theme support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address remaining PR review comments

- Add opik:optimizer-framework to default RQ queue names so framework
  jobs actually get consumed by workers
- Add dataset size guard in orchestrator before sample_split to provide
  a clear error message for datasets with fewer than 2 items
- Extract shared optimizer_job_helper.py to deduplicate identical logic
  between optimizer.py and framework_optimizer.py
- Extract checkIsEvaluationSuite helper in optimizations.ts to
  deduplicate predicate shared between CompareTrialsPage and
  useCompareOptimizationsData
- Fix hardcoded "pass_rate" in experiment_executor.py to use the actual
  metric_type parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: separate experiment scores from feedback scores and handle single-item datasets

Splits the combined feedback/experiment scores into distinct fields in the
Optimization API and DAO so the frontend can fall back to experiment_scores
when feedback_scores lack the objective. Allows single-item datasets by
returning a train-only split instead of raising. Extracts shared runner
environment setup into runner_common.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: extract shared getBestOptimizationScore helper to deduplicate logic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: evaluate baseline on full dataset instead of validation split only

The baseline was evaluated on split.validation_item_ids, which with an
80/20 split ratio meant only 1 out of 5 items was used. This gave an
unrepresentative baseline score. Now uses the full dataset_item_ids list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: enrich GEPA experiment metadata for optimization visualization

Add rich metadata to each experiment so the UI can aggregate and
visualize the optimization trajectory. Key changes:

- step_index increments only when candidate changes (not per eval)
- candidate_id is stable across re-evaluations of the same prompt
- parent_candidate_ids always set correctly for derived candidates
- New metadata fields: batch_index, num_items, capture_traces, eval_purpose
- Refactor optimizer package: protocol + factory pattern for registration
- Add GEPA adapter bridging GEPA callbacks to framework metadata
- Fix BE tests for experimentScores null and queue routing
- Add docs: ADDING_AN_OPTIMIZER.md and GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review comments and simplify optimizer factory

- Remove register_optimizer public API and OptimizerFactory class;
  replace with a simple dict in _load_registry()
- framework_runner: avoid holding full dataset items in memory
- Update docs and tests to match simplified factory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: lineage-based step_index and parent_candidate_ids for GEPA experiments

- Replace sequential step_index counter with parent-lineage derivation
  (max parent step + 1), so all re-evaluations of the same candidate
  share the same step_index
- Ensure every non-baseline experiment carries parent_candidate_ids,
  enabling the UI to draw lineage graphs
- Pass batch_index, num_items, capture_traces, and eval_purpose through
  to experiment metadata for richer visualization
- Revert runner scripts to direct invocation (remove runner_common.py)
- Update unit tests to match new metadata contract

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
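
The lineage rule (max parent step + 1) can be sketched as follows (illustrative names, not the SDK's actual API):

```python
def derive_step_index(parent_ids, step_by_candidate: dict) -> int:
    """Lineage-based step index: a candidate's step is one past its deepest
    parent, so re-evaluations of the same candidate share one step index
    instead of bumping a global counter on every evaluation."""
    if not parent_ids:
        return 0  # baseline has no parents
    return max(step_by_candidate[p] for p in parent_ids) + 1


steps = {"baseline": 0}
steps["cand-a"] = derive_step_index(["baseline"], steps)
steps["cand-b"] = derive_step_index(["baseline"], steps)
steps["cand-c"] = derive_step_index(["cand-a", "cand-b"], steps)
print(steps)  # {'baseline': 0, 'cand-a': 1, 'cand-b': 1, 'cand-c': 2}
```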

* refactor: remove unused config_hash and merge event emitters

- Remove canonical_config_hash from Candidate and TrialResult types,
  candidate_materializer, experiment_executor, and all tests
- Delete util/hashing.py module (unused — GEPA does minibatching so
  config-hash dedup would block valid re-evaluations)
- Merge SdkEventEmitter and LoggingEventEmitter into a single
  EventEmitter class with optional optimization_id
- Update GEPA_IMPLEMENTATION.md to reflect parent_ids tracking fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: make CandidateConfig a plain dict and pass baseline_config through context

- Replace CandidateConfig dataclass with dict[str, Any] type alias
- Add baseline_config field to OptimizationContext (caller-provided, opaque)
- Orchestrator passes baseline_config through without knowing its structure
- Optimizers copy baseline_config and override prompt_messages only
- Remove result_aggregator module (inlined into evaluation_adapter)
- Move gepa imports to runtime (lazy) for optional dependency
- Fix protocol.py training_set/validation_set types to list[dict]
- Update ADDING_AN_OPTIMIZER.md to reflect all changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
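
The pass-through contract can be sketched like this (field and key names are illustrative; the point is that the optimizer copies the opaque config and overrides only the keys it optimizes):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OptimizationContext:
    # Caller-provided and opaque to the orchestrator.
    baseline_config: dict[str, Any] = field(default_factory=dict)


def make_candidate(context: OptimizationContext, new_prompt_messages: list) -> dict:
    """Copy everything the caller provided; override only the optimized key."""
    candidate = dict(context.baseline_config)
    candidate["prompt_messages"] = new_prompt_messages
    return candidate


ctx = OptimizationContext(baseline_config={
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "prompt_messages": [{"role": "system", "content": "v0"}],
})
cand = make_candidate(ctx, [{"role": "system", "content": "v1"}])
```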

* refactor: move gepa tests to library_integration to avoid unit suite dependency on gepa

The gepa tests patch gepa.core.adapter.EvaluationBatch and gepa.optimize,
requiring the optional gepa package at import time. Moving them to
tests/library_integration/gepa/ with pytest.importorskip("gepa") keeps
the unit suite fast and dependency-free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove event_emitter from optimizer interface, auto-emit step progress

Optimizers no longer receive or call event_emitter directly. The
EvaluationAdapter now auto-detects step_index changes during evaluate()
and emits on_step_started internally. GEPAProgressCallback simplified
to only forward GEPA events to the adapter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: assert on actual log messages in event emitter tests

Use caplog to verify logger.info output includes optimization ID and
event details, instead of just checking calls don't crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set evaluation_method on optimizer trial experiments for correct UI detection

evaluate_optimization_suite_trial was creating experiments without
evaluation_method="evaluation_suite", causing the backend to default
to "dataset". The frontend checkIsEvaluationSuite now uses the explicit
evaluation_method field instead of heuristic score detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: validate dataset is evaluation suite before running suite evaluation

Adds a guard to evaluate_suite and evaluate_optimization_suite_trial that
checks dataset.dataset_type == "evaluation_suite" before proceeding. This
prevents silently running an ineffective suite trial on a plain dataset
with no scoring rules.

- Add dataset_type param to Dataset constructor, populated at all call sites
- Add dataset_type property to Dataset
- Add _validate_dataset_is_evaluation_suite in evaluator.py
- Update tests and add rejection test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: extract _run_suite_evaluation to deduplicate suite evaluation flow

evaluate_suite and evaluate_optimization_suite_trial had their entire body
duplicated. Extract shared logic into _run_suite_evaluation, parameterized
by optimization_id and dataset filters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE][BE] feat: optimization studio UI improvements

Comprehensive face-lift for optimizer screens including new KPI cards,
metric comparison cells, configuration diff views, progress charts,
trial status indicators, and backend dataset_item_count support.
Also adds backward compatibility for SDK-based optimizations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] feat: optimizer screens face-lift

- Dataset name column: hover icon instead of clickable link
- Split Accuracy into Pass rate + Accuracy columns with compact metric display
- Conditionally hide Accuracy column when no old-type optimizations exist
- Remove Logs/Configuration tabs from single optimization page
- Fall back to studio_config for configuration display on old optimizations
- Chart tooltip: remove pass rate percentage background color
- Fix dataset hover icon vertical centering
- Restore feature toggle for optimization studio

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: center trend arrow icons and rename tooltip label

- Fix arrow icon vertical centering in compact metric Tag
- Rename "Avg. runtime cost" to "Runtime cost" in chart tooltip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: polish optimizer screens UI consistency

- Fix chart tooltip background (use --background instead of --popover)
- Align column types with correct icons (cost, duration, numberDictionary)
- Align KPI card icons to match table column type icons
- Lowercase labels: Evaluation results, Best configuration, Runtime cost, Opt. cost, Optimization cost
- Darken success green color for better readability
- Remove Traces KPI card from trial view

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4687] [SDK] feat: GEPA v2 optimizer with reflection-based prompt evolution (#5547)

* [OPIK-4687] [SDK] feat: integrate GEPA v2 optimizer into framework

Add GepaV2Optimizer that delegates to the external gepa library (v0.1.0+)
for genetic-Pareto prompt optimization. Includes adapter bridging GEPA's
evaluate/reflect interface to the framework's EvaluationAdapter, lifecycle
event tracking via callbacks, result caching, and a reflection prompt that
encourages generalizable instructions while preserving template variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): improve reflection feedback with structured assertions and dynamic inputs

- Extract template variables from prompt messages for dynamic input field mapping
- Store per-assertion structure (name, value, reason) instead of flat reason strings
- Show only failed assertions in reflection feedback for focused improvement signals

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): adapter reflection control, FE chart filtering, experiment typing

- Move reflection to adapter's propose_new_texts with custom prompt template
- Use msg["name"] as candidate key when provided, fallback to {role}_{index}
- Strip echoed parameter prefix from reflection LLM output
- Disable GEPA evaluation cache so validations produce full-dataset experiments
- Tag exploration evals as mini-batch, only baseline/init/validation as trial
- FE: filter mini-batch experiments from optimization progress chart
- FE: show individual assertion score columns alongside "passed" for eval suites
- Update E2E script: no dataset split, max_candidates=10, reflection log capture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): improve reflection quality with structured feedback and template filtering

- Show FAILED and PASSED assertions separately in reflection feedback
- Keep worst run per item (most failed assertions) for reflection
- Sort reflective dataset records by failure count (most failures first)
- Exclude template-only messages (e.g. {question}) from GEPA seed candidate
- Rewrite reflection prompt: focus on failures, preserve what works, 500-word limit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): classify experiment type by batch size, not eval purpose

The purpose-based classification was unreliable: GEPA calls evaluate()
with capture_traces=False for both full validations and minibatch
evaluations of new candidates, making them indistinguishable by purpose.

Now records the full dataset size on the first evaluate call (initialization)
and classifies any call with fewer items as mini-batch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
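
The size-based classification reduces to a small stateful check (a sketch with hypothetical names, not the adapter's actual code):

```python
class EvalClassifier:
    """Classify evaluate() calls by batch size: the first call (the
    initialization pass) fixes the full dataset size; any later call with
    fewer items is a mini-batch, anything at full size is a trial."""

    def __init__(self):
        self.full_size = None

    def classify(self, num_items: int) -> str:
        if self.full_size is None:
            self.full_size = num_items
            return "trial"
        return "mini-batch" if num_items < self.full_size else "trial"


c = EvalClassifier()
print(c.classify(20))  # trial (initialization fixes the full size)
print(c.classify(6))   # mini-batch
print(c.classify(20))  # trial (full validation)
```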

* feat(gepa-v2): improve scoring, stopping, and reflection quality

- Use mean instead of min for per-item assertion scores, giving GEPA
  granular signal instead of binary 0/1
- Track total_runs/passed_runs per item so reflection prompt shows
  whether failures are consistent or intermittent
- Stop on trial.score (framework experiment score) instead of GEPA's
  internal mean, so pass_threshold semantics are respected
- Rewrite reflection template with 4-step structure: diagnose, keep
  what works, write assertion-matched rules, generalize
- Increase max_metric_calls multiplier to 5x for deeper exploration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
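
The mean-vs-min change is easy to see on a single item (illustrative sketch):

```python
def item_score_min(assertion_scores):
    """Old behaviour: one failed assertion zeroes the item, a binary signal."""
    return min(assertion_scores)


def item_score_mean(assertion_scores):
    """New behaviour: partial credit, a gradient GEPA can actually climb."""
    return sum(assertion_scores) / len(assertion_scores)


# One failed assertion out of four:
scores = [1.0, 1.0, 1.0, 0.0]
print(item_score_min(scores))   # 0.0
print(item_score_mean(scores))  # 0.75
```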

* fix(gepa-v2): hide mini-batch trials from table, use domain-neutral examples

- Filter mini-batch experiments from the trials table rows so only
  full evaluation trials are shown
- Replace customer-support-specific examples in the reflection
  template with domain-neutral ones

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): show all runs per item in reflection feedback, log rendered prompt

Previously kept only the worst run per item for reflection. Now all runs
are preserved and shown separately (Run 1/3, Run 2/3, etc.) so the
reflection LLM can see what varies across attempts. Also captures the
fully rendered reflection prompt in the reflection log for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): consolidate runs per input in reflection dataset, label assertion/reason

Consolidate multiple runs for the same input into a single record with
a Runs field and per-item Summary (pass count + consistent failures).
This eliminates input duplication (~40% token savings) and makes cross-run
comparison trivial. Also separates Assertion/Reason onto labeled lines
for clearer parsing by the reflection LLM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
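
The consolidation step can be sketched as a group-by over runs (record field names here are illustrative approximations of the reflective-dataset format):

```python
from collections import defaultdict


def consolidate_runs(runs):
    """Group runs by input so each record carries the input text once, plus
    a Runs list and a per-item Summary, instead of duplicating the input
    for every run."""
    by_input = defaultdict(list)
    for run in runs:
        by_input[run["input"]].append(run)
    records = []
    for text, grouped in by_input.items():
        passed = sum(r["passed"] for r in grouped)
        records.append({
            "Input": text,
            "Runs": [f"Run {i + 1}/{len(grouped)}: {r['feedback']}"
                     for i, r in enumerate(grouped)],
            "Summary": f"{passed}/{len(grouped)} runs passed",
        })
    return records


runs = [
    {"input": "q1", "passed": True, "feedback": "ok"},
    {"input": "q1", "passed": False, "feedback": "missed assertion on tone"},
]
recs = consolidate_runs(runs)
```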

* test(gepa-v2): add feedback format coverage for reflection dataset

Add tests for single-run flat keys, multi-run Assertion/Reason labels
in Runs field, and failed assertions with empty reason.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): tell reflection LLM that examples are sorted by priority

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): use flat config instead of prompt_messages

GEPA adapter now works with flat dict[str, str] candidates instead of
knowing about message roles. baseline_config is the single source of
truth with system_prompt and user_message keys. Added LLMChatTask that
constructs LLM messages from flat config keys, replacing the
prompt_messages reconstruction path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): use TrialResult.config instead of prompt_messages, fix validator and UI prompt display

Replace TrialResult.prompt_messages with TrialResult.config so config is
the single source of truth. Update candidate_validator to accept flat
message keys (system_prompt, user_message) in addition to prompt_messages.
Populate experiment metadata "prompt" from flat keys so the UI displays
prompts correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): add optimizable_keys to OptimizationContext

Replace hardcoded PROMPT_KEYS in GepaV2Optimizer with
context.optimizable_keys so the caller explicitly controls which
baseline config keys get optimized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): add failure-aware batch sampler for minibatch item selection

Replace GEPA's default uniform sampler with FailureAwareBatchSampler that
guarantees failed items from the last full eval appear in subsequent
minibatches, giving the reflection LLM actionable signal instead of wasting
iterations on easy items.

Parameters: min_failed_per_batch, min_unseen_per_batch, failure_threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
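
The guarantee can be sketched as follows (a simplified, hypothetical stand-in for FailureAwareBatchSampler, worst-first over last-eval failure counts):

```python
import random


def sample_minibatch(failure_counts: dict, batch_size: int,
                     min_failed_per_batch: int, rng=random):
    """Guarantee the worst-failing items from the last full evaluation appear
    in the minibatch; fill the remaining slots uniformly from the rest."""
    worst_first = sorted(failure_counts, key=failure_counts.get, reverse=True)
    failed = [i for i in worst_first
              if failure_counts[i] > 0][:min_failed_per_batch]
    rest = [i for i in failure_counts if i not in failed]
    fill = rng.sample(rest, min(max(batch_size - len(failed), 0), len(rest)))
    return failed + fill


counts = {"a": 3, "b": 0, "c": 1, "d": 0, "e": 2}
batch = sample_minibatch(counts, batch_size=4, min_failed_per_batch=2)
# "a" and "e" (the two worst failures) are always included.
```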

* refactor(gepa-v2): strict types in sampler, worst-first failed selection, update implementation doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(experiment): surface optimizable keys in experiment configuration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): failure streak tracking, history-aware reflection prompt, optimizable keys in config

- Track per-item failure streaks and failing assertion names in sampler
- Annotate reflective dataset records with "Failure History" for stuck items
- Rewrite reflection prompt: failure history step, structured output, topic headers
- Surface optimizable_keys in experiment config and baseline evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): balanced reflection prompt, lower failure history threshold

- Rewrite reflection prompt to balance conservative and aggressive approaches:
  preserve working rules while encouraging grouped topic headers (## Empathy,
  ## Resolution, etc.) instead of flat numbered lists
- Lower failure history threshold from streak >= 2 to >= 1 so the reflection
  LLM sees failure context from the first repeated failure
- Guard failure history annotation with `if stuck` to avoid empty annotations
- Relax "3 unreturned callbacks" assertion to "multiple unreturned callbacks"
  (the exact-count version was too brittle for gpt-4o-mini to satisfy reliably)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): balanced 50/50 minibatch sampling, 20-item e2e suite

Balanced sampling: split minibatches ~50/50 between failed (worst-first)
and passed (random) items. Previously batches were almost entirely failed
items, causing the reflection LLM to over-correct and regress passing
behaviors (catastrophic 0.0 scores). Passed items now act as behavioral
anchors.

- Remove unseen item tracking (mark_seen, min_unseen_per_batch)
- Default min_failed_per_batch=1 (was batch_size-1)
- Minimum reflection_minibatch_size=4 (ensures 2+2 split)
- Redesign e2e suite: 20 items (5 easy, 7 medium, 8 hard)
- Fix contradicting assertions (hedging language vs no promises)
- Remove impossible assertions (specific loyalty benefits)
- Add problematic items summary to reflection log
- Save reflection log from orchestrator finally block
- Update GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
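The balanced 50/50 split described in this commit can be sketched roughly as follows (hypothetical names and signature; the actual sampler in the SDK differs):

```python
import random

def sample_minibatch(failed, passed, batch_size, rng=random):
    """Split a minibatch ~50/50 between failed and passed items.

    Failed items give the reflection LLM failure signal; passed items
    act as behavioral anchors so working rules are not regressed.
    """
    n_failed = min(len(failed), max(1, batch_size // 2))
    n_passed = min(len(passed), batch_size - n_failed)
    # Top up with extra failed items if there are not enough passed ones.
    n_failed = min(len(failed), batch_size - n_passed)
    batch = rng.sample(failed, n_failed) + rng.sample(passed, n_passed)
    rng.shuffle(batch)
    return batch
```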

* refactor(gepa-v2): extract collaborators from adapter with DI

Split FrameworkGEPAAdapter into three injectable collaborators:
- CandidateTracker: candidate identity, parent lineage, GEPA index mapping
- ReflectiveDatasetBuilder: feedback dataset construction for reflection LLM
- ReflectionProposer: reflection LLM interaction and logging

The adapter is now a thin facade (~300 lines, down from 664) that
orchestrates evaluation and delegates to collaborators. Compatibility
properties ensure all existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): move reflection template to ReflectionProposer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): task-agnostic reflection template with prompt descriptions and sibling awareness

Rewrite the reflection template to be domain-neutral, add optional
prompt_descriptions to OptimizationContext so the reflection LLM
understands what each parameter does, and include sibling parameter
context so the LLM knows what other params exist without modifying them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs(gepa-v2): update implementation doc and reflection prompt algorithm

Update GEPA_IMPLEMENTATION.md with prompt descriptions, sibling awareness,
and task-agnostic template details. Rewrite REFLECTION_PROMPT_EXAMPLE.md
to document the full reflection prompt assembly algorithm with a rendered
example showing the new header format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): robust header stripping, markdown formatting in reflection template

The LLM sometimes echoes header metadata (Parameter:, Description:,
param name) in reformulated form. Replace exact-prefix matching with
line-by-line stripping of metadata patterns. Add IMPORTANT instruction
to not include metadata in output. Request markdown ## headers in STEP 4.

Add 11 unit tests for ReflectionProposer: header stripping edge cases,
build_header with/without descriptions, template content assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
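The line-by-line stripping this commit introduces could look roughly like the sketch below (the metadata patterns here are illustrative; the real ReflectionProposer may match more):

```python
import re

# Hypothetical metadata patterns the LLM tends to echo from the header.
_METADATA_RE = re.compile(r"^\s*(Parameter|Description)\s*:", re.IGNORECASE)

def strip_header_metadata(text: str, param_name: str) -> str:
    """Drop leading lines that merely echo header metadata."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if not line or _METADATA_RE.match(line) or line == param_name:
            i += 1
        else:
            break
    return "\n".join(lines[i:])
```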

* revert(fe): remove debug FE changes (mini-batch filtering, column reorder)

These were temporary UI tweaks for debugging the GEPA v2 optimizer.
They'll be re-implemented properly in a separate FE PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): preserve template variables in reflection prompt

Instruct the reflection LLM to keep all template variables (e.g.
{var}, {{var}}, <var>, {% var %}) intact during prompt rewriting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
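A minimal validation pass for this, assuming regex-based detection of the four variable styles named above (the actual SDK check may differ):

```python
import re

# Matches {{var}}, {% var %}, {var}, and <var> style template variables.
_VAR_RE = re.compile(r"\{\{\s*\w+\s*\}\}|\{%\s*\w+\s*%\}|\{\w+\}|<\w+>")

def template_variables(prompt: str) -> set:
    return set(_VAR_RE.findall(prompt))

def proposal_keeps_variables(parent: str, proposal: str) -> bool:
    """Reject a rewrite that drops any variable present in the parent."""
    return template_variables(parent) <= template_variables(proposal)
```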

* refactor: rename gepa_v2 to gepa, clean up OptimizationContext

- Rename gepa_v2/ -> gepa/ (now the primary optimizer)
- Rename gepa/ -> gepa_old/ (legacy optimizer)
- Rename GepaV2Optimizer -> GepaOptimizer
- Rename GepaOptimizer -> GepaLegacyOptimizer
- Remove unused fields from OptimizationContext: prompt_messages,
  metric_parameters, model_parameters
- Rename prompt_descriptions -> config_descriptions
- Delete SimpleOptimizer and its tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add configurable split_strategy to OptimizationContext

Add split_strategy field ("80_20" default, "no_split" for GEPA) so the
orchestrator handles dataset splitting instead of individual optimizers.
Remove internal train+val dedup logic from GepaOptimizer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa): clean up adapter API and fix code quality issues

- Make adapter facade properties public (remove underscore prefix)
- Add standalone reflection_log fallback to prevent silent data loss
- Rename consume_pending_capture_traces → get_pending_capture_traces
- Remove dead guard in _build_evaluation_batch
- Move SYSTEM_PROMPT_KEY constant to test file
- Fix update_scores type annotation in failure_aware_sampler

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove prompt_messages logic, validate optimizable_keys

- Adapt LLMTask to use config dict, remove LLMChatTask duplicate
- Simplify candidate_validator to check optimizable_keys from adapter
- Remove prompt_messages fallback from experiment_executor metadata
- Update all tests, fixtures, scripts, and docs to flat key format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(scripts): remove stale prompt_messages and API references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa): remove optimizable_keys from config dict to fix caching

optimizable_keys was being injected into CandidateConfig by both
_make_config_builder and the orchestrator, causing cache key mismatches
between baseline and initialization evaluations. Pass it as an explicit
parameter through the evaluation chain instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(scripts): rename gepa_v2 scripts to gepa, delete run_optimization_e2e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: use optimizable_keys generically in ADDING_AN_OPTIMIZER guide

Remove hardcoded system_prompt references from code examples.
Optimizers should iterate over context.optimizable_keys instead
of assuming specific key names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix(fe): address baz review comments

- Use Tag component variants instead of hard-coded color spans for
  theme-aware diff badges (Added/Removed/Changed)
- Clamp formatAsPercentage input to [0, 1] range to prevent >100% or
  negative percentage display
- Read baseline score from experiment_scores as fallback when
  feedback_scores lacks the objective (evaluation-suite support)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(fe): extract getObjectiveScoreValue shared helper

Move the feedback_scores -> experiment_scores fallback into a reusable
getObjectiveScoreValue helper in feedback-scores.tsx. Replace all 4 call
sites (CompareTrialsPage, TrialKPICards, useOptimizationScores,
useCompareOptimizationsData) with the shared helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(be): resolve CI failures - migration conflict and test ignored fields

- Rename migration 000063 → 000064 to avoid prefix conflict with main
- Add datasetItemCount to EXPERIMENT_IGNORED_FIELDS and test builder
- Add datasetName to OPTIMIZATION_IGNORED_FIELDS (transient field)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(fe): extract shared aggregateExperimentMetrics helper

Deduplicate weighted score/cost/latency accumulation logic that was
duplicated between TrialKPICards and useCompareOptimizationsData.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add optimization_id index on experiments and remove dead code

- Add minmax index on experiments.optimization_id to speed up
  optimization queries that join experiments by optimization_id
- Remove unused OptimizationDiffView component (dead code from iteration)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update Helm documentation

* [OPIK-4383] [BE] Redis stream subscriber for debounced experiment aggregates recomputation (#5371)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
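The retry shape this commit describes (5 attempts, 250ms→2s exponential backoff, 0.5 jitter) looks roughly like the Python sketch below; the real handleOnDeadLocks() lives in Java's RetryUtils, and the names here are illustrative:

```python
import random
import time

def retry_on_deadlock(op, is_deadlock, attempts=5,
                      base_delay=0.25, max_delay=2.0, jitter=0.5,
                      sleep=time.sleep, rng=random.random):
    """Retry op() with exponential backoff and jitter on deadlocks.

    The delay doubles from base_delay up to max_delay; jitter randomizes
    each delay by up to +/-50% to reduce the thundering-herd effect.
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            if attempt == attempts - 1 or not is_deadlock(exc):
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= 1 + jitter * (2 * rng() - 1)
            sleep(delay)
```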

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed an identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Fix project_deleted filter and comments_dedup scope in ExperimentAggregatesDAO

- Fix project_deleted filter: use zero UUID sentinel instead of empty string
  for FixedString(36) column comparison in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Fix comments_dedup CTE: scope trace_id subquery by dataset_id to avoid
  scanning the entire workspace's comments table
- Add missing streamMaxLen and streamTrimLimit fields to
  ExperimentDenormalizationConfig (implements StreamConfiguration interface)

* [OPIK-4383] [BE] Address PR review comments: extract ZERO_UUID constant and fix config comment

- Promote zero UUID sentinel to shared constant in ExperimentGroupMappers
- Use parameterized :zero_uuid binding in SQL templates instead of hardcoded string
- Fix config.yml comment from "Default: 120s" to "Default: 1m"

* [OPIK-4383] [BE] Add streamMaxLen and streamTrimLimit to experimentDenormalization config

* [OPIK-4727] fix: remove old GEPA code, fix aggregates test, add migration rollback docs

- Remove gepa_old/ optimizer source and tests, clean factory registry
- Add datasetItemCount to EXPERIMENT_AGGREGATED_FIELDS_TO_IGNORE (not stored in aggregates table)
- Add rollback documentation to mutation experiment type migration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] refactor: deduplicate KPI cards, metric cells, and cleanup

- Extract shared KPICard/MetricKPICard to pages-shared/experiments/KPICard
- Extract calcPercentageVsBaseline helper and TrialMetricCellContent to
  deduplicate percentage calculation across 3 trial metric cells
- Remove unused OptimizationUpdate interface from types
- Fix inconsistent color token (text-light-slate → text-muted-slate)
- Move IIFE out of JSX in MetricComparisonCell compact mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] refactor: rename Compare* to Optimization/Trial, simplify URL structure

- Rename CompareOptimizations* → Optimization* and CompareTrials* → Trial*
- Simplify optimization URL from /$datasetId/$optimizationId to /$optimizationId
- Change trial route from /compare to /trials
- Add OptimizationCompareRedirect for legacy URL backwards compatibility
- Update all navigation references across pages (OptimizationsPage, HomePage, BestPrompt, ResourceLink, etc.)
- Fix breadcrumbs: show raw optimization ID, "Trial #N" for trials
- Split optimization detail into Report & Trials tabs with underline style
- Replace ToggleGroup with underline Tabs on trial page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] feat: rename tabs, add diff vs parent, fix word-level diffs

- Rename "Report" tab to "Overview" on optimization page
- Rename "Best configuration" to "Best trial configuration"
- Change "Diff" button to "Diff vs. baseline" in configuration sections
- Add "Diff vs. parent" option in trial configuration tab
- Fix prompt diff to use word-level mode for inline change highlights
- Fix TextDiff word-mode layout to flow inline instead of dropping lines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] refactor: extract shared config flattening, add redirect tests

- Extract flattenConfig, EXCLUDED_CONFIG_KEYS, shouldSkipRedundantKey
  into configuration-renderer.ts (shared by TrialConfigurationSection
  and ConfigurationDiffContent)
- Convert ConfigViewMode string union to CONFIG_VIEW_MODE const object
- Add missing `replace` prop on fallback Navigate in
  OptimizationCompareRedirect
- Restore isArray guard in ConfigurationDiffContent collectPrompts
- Add unit tests for configuration-renderer (21 tests)
- Add unit tests for OptimizationCompareRedirect (4 tests)
- Add E2E Playwright test for legacy /compare URL redirect (2 tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: PR review fixes - generic flattenConfig, NamedPrompts diff, parent fallback

- Make flattenConfig accept generic skipKey callback instead of hardcoded
  filtering (addresses Baz review comment)
- Fix NamedPrompts format not recognized as "prompt" type in
  detectConfigValueType, causing JSON-level diff instead of word-level
- Add parent experiment fallback for old optimizations using chronological
  ordering (enables "Diff vs. parent" for non-GEPA v2)
- Fix PromptDiff fallback paths to use mode="words" for word-level diffs
- Add tests for NamedPrompts detection and generic skipKey behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: PR review - shared makeSkipKey, parent fallback, label tweak

- Extract makeSkipKey helper into configuration-renderer.ts to eliminate
  duplicate skipKey predicates in TrialConfigurationSection and
  ConfigurationDiffContent
- Fix parentCandidateIds lookup: fall through to chronological fallback
  when GEPA v2 metadata exists but no matching parent is found
- Use shared skipKey in collectPrompts instead of inline checks
- Remove period from "Diff vs." labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: rename Config toggle label to Configuration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4838] [SDK] feat: GEPA convergence improvements (#5570)

* [OPIK-4838] [SDK] feat: GEPA convergence improvements

- Add GepaConfig dataclass for centralized algorithm parameters
- Cache parent scores during minibatch gate with configurable tolerance
  to absorb LLM judge noise (gate_tolerance=0.1)
- Rewrite FailureAwareBatchSampler to use assertion-based pass/fail
  instead of score threshold, prioritize by failing assertion count
- Switch default candidate selection to current_best
- Add docs on scoring pipeline and candidate selection strategies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore(scripts): use GepaConfig defaults in e2e scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove local-only docs from PR

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address baz review comments, tighten e2e assertions

- Remove unused _global_assertion_failures Counter from sampler
- Update sampler docstring to match implementation (assertion count, not frequency)
- Bound _cached_full_eval_scores to max_candidates entries with FIFO eviction
- Tighten e2e assertions: add context-awareness checks to EASY tier, sharpen MEDIUM tier

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove gate_tolerance, require strict minibatch improvement

Cached parent scores are still used for deterministic comparison,
but mutations must now strictly beat the parent on the minibatch
without any tolerance cushion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: shuffle failed items randomly instead of sorting by priority

Simplifies minibatch sampling — all failed items are shuffled equally
rather than sorted by assertion failure count then shuffled within tiers.
This gives better variety across iterations since most items share the
same tier anyway.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: top-tier sampling, template var validation, improved reflection prompt

- Sampler splits failed items into top/rest by assertion failure count,
  draws randomly from top tier first (configurable top_failed_fraction)
- Reject reflection proposals that drop template variables (e.g. {question})
- Reflection prompt encourages surgical edits over full rewrites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
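The top-tier draw described here could be sketched as follows (hypothetical names; top_failed_fraction is the configurable split point mentioned above):

```python
import random

def sample_failed(failed_counts, k, top_failed_fraction=0.5, rng=random):
    """Draw k failed items, favoring those with the most failing assertions.

    failed_counts: dict mapping item id -> number of failing assertions.
    Items are ranked by failure count, the top fraction is sampled first,
    and the remainder is topped up randomly from the rest.
    """
    ranked = sorted(failed_counts, key=failed_counts.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_failed_fraction))
    top, rest = ranked[:n_top], ranked[n_top:]
    picked = rng.sample(top, min(k, len(top)))
    if len(picked) < k:
        picked += rng.sample(rest, min(k - len(picked), len(rest)))
    return picked
```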

* fix: only update best_trial from full evaluations, not minibatches

A minibatch scoring 1.0 on 4 items was being reported as best_trial
even when the full evaluation only reached 0.9 on 20 items. Now
best_trial is only updated when experiment_type is None (full eval).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove accidentally committed reflection logs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA docs for top-tier sampling, 5-step reflection, template var validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: increase default reflection_minibatch_size from 4 to 6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: thread evaluator_model through pipeline, increase max_candidates to 25

The default LLMJudge model (gpt-5-nano) was too lenient, causing
pass_rate to always report 1.0. Thread evaluator_model from
OptimizationContext through EvaluationAdapter and experiment_executor
so callers can specify a more capable judge model.

Also increase GEPA max_candidates default from 5 to 25 to allow
longer optimization runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: configurable blended scoring with assertion-level tiebreaker

Add ScoringConfig with strategy ("blended" | "pass_rate"), configurable
weights, and auto-computed epsilon (1/(num_items+1)) that guarantees
pass_rate always dominates while giving the algorithm gradient signal
from individual assertion progress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
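
The epsilon guarantee above can be sketched in a few lines. This is an illustrative formulation of the "blended" strategy, assuming the tiebreaker is added on top of pass_rate; the real ScoringConfig wiring may differ.

```python
def blended_score(pass_rate, assertion_score, num_items):
    """Blend pass_rate with an assertion-level tiebreaker. epsilon =
    1/(num_items+1) is strictly smaller than the minimum pass_rate step
    of 1/num_items, so the tiebreaker can never flip an ordering that
    pass_rate alone has already decided."""
    epsilon = 1.0 / (num_items + 1)
    return pass_rate + epsilon * assertion_score
```

For 20 items, a candidate at pass_rate 0.90 always beats one at 0.85 regardless of assertion progress, while two candidates tied on pass_rate are separated by their assertion scores.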

* fix: log raw pass_rate to UI, use blended score only for algorithm

_extract_score now returns (optimization_score, display_score) tuple.
The blended score drives the algorithm's acceptance gate, while the
raw pass_rate is logged as the experiment score for the UI chart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align GEPA per-item scoring with pass_rate, expose blended score as internal

- Replace GEPA's mean-of-assertions scoring with a pass_rate-aligned formula:
  passing items score 1.0, failing items score ε × assertion_frac
  (where ε = 1/(num_items+1)), preserving gradient for the subsample gate
- Use build_suite_result as source of truth for item pass/fail
- Store pass_rate in trial.score (user-facing), blended score in
  internal_optimization_score (algorithm-only)
- Use pass_rate for stop condition threshold comparison
- Add type hints and ScoreResult type to _extract_per_item_feedback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
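
The per-item formula can be sketched as below. The sketch also folds in the later edge-case fix for failed items with empty assertions; `item_score` is a hypothetical name, not the SDK's actual function.

```python
def item_score(passed, assertions_passed, assertions_total, num_items):
    """Per-item score aligned with pass_rate: a passing item contributes
    1.0; a failing item contributes only a sub-epsilon fraction, so the
    sum of all failing-item contributions can never outweigh a single
    passing item."""
    epsilon = 1.0 / (num_items + 1)
    if passed:
        return 1.0
    if assertions_total == 0:
        return 0.0  # a failed item with no assertions is not a pass
    return epsilon * (assertions_passed / assertions_total)
```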

* fix: only record full evaluations as visible trials

Skip appending subsample/minibatch evals and cache hits to state.trials
so the UI only shows meaningful trial progression. Internal evals are
still returned to the optimizer for its scoring logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA docs for scoring contract, trial visibility, per-item scoring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make reflection prompt less conservative to reduce stagnation

Relax the "surgical edits / MINIMAL EDIT" constraints that prevented the
reflection LLM from making meaningful structural changes when persistent
failures are detected. Key changes: graduated edit aggressiveness based
on Failure History, concrete escalation strategies (restructure, step-by-step
procedures, conditional logic, section rewrite), and removal of the
single-rule-per-failure cap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: cumulative assertion failure tracking with persistent failure threshold

- Track per-assertion total failures and total evaluations across the
  entire optimization run in FailureAwareBatchSampler
- Show Failure History only when an assertion has failed >=10 times
  (persistent failure threshold) AND failed again in the current eval
- Include failures/evals ratio so the reflection LLM can gauge severity
- Remove streak-based logic in favor of cumulative counts
- Simplify reflection prompt: 5 steps → 4, merge write+apply steps,
  remove duplicated escalation, cleaner multiline formatting
- Remove duplicate cumulative info from Summary's Blocking assertions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
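
The gating rule (cumulative threshold AND failing in the current eval, with a failures/evals ratio) can be sketched like this. Names and the rendered line format are illustrative; the default threshold of 10 was later made configurable (default 7).

```python
def failure_history_lines(totals, evals, current_failures, threshold=10):
    """Render a Failure History entry only for assertions that are both
    persistently failing (cumulative failures >= threshold) and failing
    again in the current evaluation. `totals` and `evals` map assertion
    name -> cumulative failure count / evaluation count."""
    lines = []
    for name in sorted(current_failures):
        failed, seen = totals.get(name, 0), evals.get(name, 0)
        if failed >= threshold and seen:
            lines.append(f"{name}: failed {failed}/{seen} evaluations")
    return lines
```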

* feat: show only worst run for multi-run items, reorder Failure History before run output

Reduces feedback verbosity by showing only the worst run instead of all runs
for multi-run items. Places Failure History right after Inputs in the record
so the reflection LLM sees persistent failure context before the run details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
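
Worst-run selection reduces to a single `max` over failed-assertion counts. A minimal sketch, assuming each run carries its assertion results as dicts with a `passed` flag:

```python
def worst_run(runs):
    """Pick the run with the most failed assertions to represent a
    multi-run item in the reflection feedback."""
    return max(
        runs,
        key=lambda run: sum(1 for a in run["assertions"] if not a["passed"]),
    )
```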

* feat: guide reflection LLM toward general rules, allow specificity for persistent failures

Step 3 now instructs to abstract specific examples into general categories.
Step 2 allows more specific rules for persistently failing assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA reflection prompt docs for current algorithm

Updates template (4 steps), feedback format (worst-run-only, Failure History
with cumulative counts and Z=10 threshold), and generalization guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: reduce prompt overfitting — prefer updating existing rules, group by behavior pattern

Step 3: check whether an existing rule covers the failing behavior before
adding a new one. Step 4: group by behavior pattern, not scenario type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make persistent failure threshold configurable, default to 7

Add persistent_failure_threshold to GepaConfig (default=7), thread through
GepaOptimizer → FrameworkGEPAAdapter → ReflectiveDatasetBuilder.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: strengthen anti-overfitting in reflection prompt

Step 3: NEVER copy specific names/details/scenarios from feedback — they
are samples that change at runtime. Step 2: persistent failure specificity
still avoids non-generalizable details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA docs for current algorithm state

Update both implementation guide and reflection prompt docs to match
current code: 4-step template, worst-run-only feedback, cumulative
failure threshold (configurable, default 7), anti-overfitting guidance,
and corrected config defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle edge cases in scoring and trial recording

- _item_score: return 0.0 (not 1.0) for failed items with empty
  assertions, so they don't get scored as passes
- evaluation_adapter: guard trial is not None before appending to
  state.trials and accessing trial.optimization_score

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(BE): exclude minibatch/mutation experiments from pass_rate computation

The FIND query's best_objective_score was computed as a weighted average
across ALL experiments including minibatch and mutation, causing the UI
to show incorrect pass_rate during optimization. Filter experiment_candidates
to only include full-eval experiments (regular/trial types).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): replace ambiguous XYZ headphones e2e item with clear bluetooth speaker scenario

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): use assertions= shorthand and typed ExecutionPolicy in e2e scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(optimizer): add early stopping when pass_rate plateaus

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
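
A plateau check over the full-eval pass_rate history might look like the following. `patience` and `min_delta` are hypothetical parameter names; the actual stopping criterion in the optimizer may be shaped differently.

```python
def should_stop_early(history, patience=3, min_delta=0.0):
    """Stop when the best pass_rate has not improved by more than
    min_delta over the last `patience` full evaluations, compared to
    the best score seen before that window."""
    if len(history) <= patience:
        return False  # not enough evaluations to judge a plateau
    best_before = max(history[:-patience])
    recent_best = max(history[-patience:])
    return recent_best <= best_before + min_delta
```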

* fix(test): update reflective dataset tests to use Worst Run instead of Runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(optimizer): update assertion failure counters on minibatch evals too

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [SDK] refactor: convert reflection template to triple-quoted string

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: extract apps/opik-optimizer to comet-ml/opik-optimizer repo

Remove apps/opik-optimizer/ directory (moved to separate repo).
Clean up python-backend framework optimizer references:
- Remove framework_optimizer.py and framework_runner.py
- Remove OPTIMIZER_FRAMEWORK queue from Java Queue enum
- Remove opik-optimizer additional_contexts from docker-compose and CI
- Simplify resolveQueue to always use OPTIMIZER_CLOUD

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: revert python-backend DRY refactor that was only needed for framework optimizer

Restore optimizer.py and rq_worker_manager.py to main state, delete
optimizer_job_helper.py which only existed to share code with the
now-removed framework_optimizer.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: trials table sorting, metric trend precision, and word diff readability

- Implement client-side sorting for optimization trials table (all columns)
- Default sort by Trial # ascending
- Show 0% trend when formatted values are identical (below display resolution)
- Fall back to block diff when word changes exceed 60% of content

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: improve diff readability and optimization progress status

- Use hybrid line+word diff: line-level diff first to find changed regions,
  then word-level refinement within paired lines for precise highlights
- Use diffTrimmedLines to ignore trailing whitespace differences
- Fall back to separate removed/added blocks when line pairs are too different
- Add "Running initial calculations..." status for early GEPA phases
- Unify Changed/Added/Removed tag styling in prompt diff view

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: address PR review — sortCandidates tests, diffLines, formatter comments

- Add unit tests for sortCandidates covering all sort branches
- Switch TextDiff from diffTrimmedLines to diffLines for whitespace detection
- Add comments explaining intentional formatter comparison in percentage calc
- Export sortCandidates and CANDIDATE_SORT_FIELD_MAP for testability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: address PR feedback — shared percentage helper, migration renumber, test timezone fixes, dataset button layout

- Extract calcFormatterAwarePercentage into shared lib/percentage.ts (review comment)
- Renumber migrations 000064→000065, 000065→000066 to avoid prefix conflict with main
- Fix timezone-sensitive test failures in MetricDateRangeSelect/utils.test.ts
- Fix dataset NavigationTag rendering as block in OptimizationHeader
- Fix mypy return type for _run_suite_evaluation in evaluator.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] feat: dynamic chart legend, clickable ghost dot, dataset button width fix

- Chart legend now shows only statuses present in the data
- Ghost (in-progress) dot is clickable to select the trial
- Dataset NavigationTag constrained to content width with w-fit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] refactor: simplify trial statuses, ghost dot color, best candidate breathing animation

- Remove "evaluating" status — candidates are now passed (scored > parent) or pruned (scored <= parent)
- Simplify computeCandidateStatuses to compare against parent score
- Ghost dot uses running status color (yellow) instead of hardcoded blue
- Best candidate dot breathes (opacity pulse) when optimization is active but no ghost dot is shown
- Clean up unused isOptimizationFinished/inProgressStepIndex props from columns and cells

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: optimization cost duration from created_at, live timer, absolute time in trials table

- Duration now starts from optimization created_at instead of first experiment
- Live ticking timer while optimization is in progress
- Trials table "Created" column shows absolute date/time instead of relative

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [SDK] fix: correct return type of evaluate_optimization_suite_trial

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove orphaned test for evaluate_optimization_suite_trial

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert unused dataset_type additions in Python SDK

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: improve non-eval-suite optimization display

- All chart dots blue for non-eval-suite optimizations (no pruned status)
- Chart legend shows metric name instead of status labels
- Table shows "Baseline" for step 0, "Passed" for scored candidates
- Column header shows metric name (e.g. "Accuracy (geval)")
- Add reason tooltip (speech bubble) to trial items score columns
- Remove jailbreak password demo template
- Revert temporary feature flag override

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: renumber migration prefixes 000065→000067, 000066→000068 to avoid conflicts with main

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4928] [BE] fix: use correct column to lookup execution policies for experiment items

The execution policy lookup query was filtering by dataset_item_versions.dataset_item_id
instead of dataset_item_versions.id. With dataset versioning enabled, experiment items
reference dataset_item_versions.id as their datasetItemId, causing the lookup to miss
and fall back to the default policy {runs_per_item:1, pass_threshold:1}.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing dataset_item_count to aggregated experiment CTEs

The experiments_from_aggregates_final CTEs in both FIND and
FIND_GROUPS_AGGREGATIONS queries were missing dataset_item_count,
causing ClickHouse UNKNOWN_IDENTIFIER errors when the aggregated
branch was used. Maps ea.experiment_items_count to dataset_item_count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove duplicate SELECT in FIND_GROUPS_AGGREGATIONS and deduplicate chart empty states

- The FIND_GROUPS_AGGREGATIONS query had two outer SELECTs after the
  subquery, causing a ClickHouse syntax error. Merged dataset_item_count
  into the single outer SELECT and removed the duplicate.
- Collapsed identical spinner/NoData branches in OptimizationProgressChartContainer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [OPIK-4928] [BE] fix: remove feature flag gate preventing execution policy resolution

Root cause: ExperimentItemService.fetchItemPolicies() was gated behind
isDatasetVersioningEnabled(). When disabled, item-level execution
policies were never fetched from dataset_item_versions, so all
experiment items fell back to ExecutionPolicy.DEFAULT {1, 1}.

Also reverts the incorrect DatasetItemVersionDAO column change from
576791acc — experiment_items.dataset_item_id stores the logical
dataset_items.id which matches dataset_item_versions.dataset_item_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review comments and fix execution_po…
andriidudar pushed a commit that referenced this pull request Mar 18, 2026
* design doc

* FE communication and ERD additions

* ui reporting events in flow

* changes

* add reason to TrialItemRun

* [NA] [SDK] feat: add greenfield optimization framework package

Implements a new optimization framework (`apps/opik-optimizer`) that
decouples optimizer algorithms from experiment execution, persistence,
and UI concerns. Integrates via the existing optimization studio pipeline
(Redis queue → Python backend → subprocess).

Key components:
- Orchestrator: central lifecycle controller with sampler, validator,
  materializer, result aggregator, and event emitter
- StupidOptimizer: 2-step test optimizer (3 candidates → best → 2 more)
- EvaluationAdapter: wraps SDK evaluate_optimization_suite_trial()
- Backend integration: new Redis queue, framework_optimizer job processor,
  framework_runner subprocess entry point

Also adds evaluate_optimization_suite_trial() to the Python SDK, combining
optimization trial linkage with evaluation suite behavior (evaluators and
execution policy from the dataset).

53 unit + integration tests passing. Verified end-to-end against Comet cloud
with real LLM calls, UI progress chart, prompt display, and score tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adjustments for UI and framework review

* fix: address PR review comments - dict access bug and theme color

- Fix AttributeError in framework_runner.py: dataset.get_items() returns
  dicts, use item["id"] instead of item.id
- Fix hard-coded hex color in TrialPassedCell.tsx: use text-success CSS
  class instead of text-[#12B76A] for proper dark theme support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address remaining PR review comments

- Add opik:optimizer-framework to default RQ queue names so framework
  jobs actually get consumed by workers
- Add dataset size guard in orchestrator before sample_split to provide
  a clear error message for datasets with fewer than 2 items
- Extract shared optimizer_job_helper.py to deduplicate identical logic
  between optimizer.py and framework_optimizer.py
- Extract checkIsEvaluationSuite helper in optimizations.ts to
  deduplicate predicate shared between CompareTrialsPage and
  useCompareOptimizationsData
- Fix hardcoded "pass_rate" in experiment_executor.py to use the actual
  metric_type parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: separate experiment scores from feedback scores and handle single-item datasets

Splits the combined feedback/experiment scores into distinct fields in the
Optimization API and DAO so the frontend can fall back to experiment_scores
when feedback_scores lack the objective. Allows single-item datasets by
returning a train-only split instead of raising. Extracts shared runner
environment setup into runner_common.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: extract shared getBestOptimizationScore helper to deduplicate logic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: evaluate baseline on full dataset instead of validation split only

The baseline was evaluated on split.validation_item_ids, which with an
80/20 split ratio meant only 1 out of 5 items was used. This gave an
unrepresentative baseline score. Now uses the full dataset_item_ids list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: enrich GEPA experiment metadata for optimization visualization

Add rich metadata to each experiment so the UI can aggregate and
visualize the optimization trajectory. Key changes:

- step_index increments only when candidate changes (not per eval)
- candidate_id is stable across re-evaluations of the same prompt
- parent_candidate_ids always set correctly for derived candidates
- New metadata fields: batch_index, num_items, capture_traces, eval_purpose
- Refactor optimizer package: protocol + factory pattern for registration
- Add GEPA adapter bridging GEPA callbacks to framework metadata
- Fix BE tests for experimentScores null and queue routing
- Add docs: ADDING_AN_OPTIMIZER.md and GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review comments and simplify optimizer factory

- Remove register_optimizer public API and OptimizerFactory class;
  replace with a simple dict in _load_registry()
- framework_runner: avoid holding full dataset items in memory
- Update docs and tests to match simplified factory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: lineage-based step_index and parent_candidate_ids for GEPA experiments

- Replace sequential step_index counter with parent-lineage derivation
  (max parent step + 1), so all re-evaluations of the same candidate
  share the same step_index
- Ensure every non-baseline experiment carries parent_candidate_ids,
  enabling the UI to draw lineage graphs
- Pass batch_index, num_items, capture_traces, and eval_purpose through
  to experiment metadata for richer visualization
- Revert runner scripts to direct invocation (remove runner_common.py)
- Update unit tests to match new metadata contract

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
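
The lineage derivation (max parent step + 1, with re-evaluations reusing the stored index) can be sketched as:

```python
def derive_step_index(candidate_id, parent_ids, step_by_candidate):
    """Derive step_index from parent lineage: baseline candidates get 0,
    derived candidates get max(parent step) + 1. Re-evaluations of the
    same candidate reuse the stored index, so all experiments for one
    candidate share a step_index."""
    if candidate_id in step_by_candidate:
        return step_by_candidate[candidate_id]
    step = 0 if not parent_ids else max(step_by_candidate[p] for p in parent_ids) + 1
    step_by_candidate[candidate_id] = step
    return step
```

This is an illustrative reduction of the metadata contract, not the adapter's exact code.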

* refactor: remove unused config_hash and merge event emitters

- Remove canonical_config_hash from Candidate and TrialResult types,
  candidate_materializer, experiment_executor, and all tests
- Delete util/hashing.py module (unused — GEPA does minibatching so
  config-hash dedup would block valid re-evaluations)
- Merge SdkEventEmitter and LoggingEventEmitter into a single
  EventEmitter class with optional optimization_id
- Update GEPA_IMPLEMENTATION.md to reflect parent_ids tracking fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: make CandidateConfig a plain dict and pass baseline_config through context

- Replace CandidateConfig dataclass with dict[str, Any] type alias
- Add baseline_config field to OptimizationContext (caller-provided, opaque)
- Orchestrator passes baseline_config through without knowing its structure
- Optimizers copy baseline_config and override prompt_messages only
- Remove result_aggregator module (inlined into evaluation_adapter)
- Move gepa imports to runtime (lazy) for optional dependency
- Fix protocol.py training_set/validation_set types to list[dict]
- Update ADDING_AN_OPTIMIZER.md to reflect all changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: move gepa tests to library_integration to avoid unit suite dependency on gepa

The gepa tests patch gepa.core.adapter.EvaluationBatch and gepa.optimize,
requiring the optional gepa package at import time. Moving them to
tests/library_integration/gepa/ with pytest.importorskip("gepa") keeps
the unit suite fast and dependency-free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove event_emitter from optimizer interface, auto-emit step progress

Optimizers no longer receive or call event_emitter directly. The
EvaluationAdapter now auto-detects step_index changes during evaluate()
and emits on_step_started internally. GEPAProgressCallback simplified
to only forward GEPA events to the adapter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: assert on actual log messages in event emitter tests

Use caplog to verify logger.info output includes optimization ID and
event details, instead of just checking calls don't crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set evaluation_method on optimizer trial experiments for correct UI detection

evaluate_optimization_suite_trial was creating experiments without
evaluation_method="evaluation_suite", causing the backend to default
to "dataset". The frontend checkIsEvaluationSuite now uses the explicit
evaluation_method field instead of heuristic score detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: validate dataset is evaluation suite before running suite evaluation

Adds a guard to evaluate_suite and evaluate_optimization_suite_trial that
checks dataset.dataset_type == "evaluation_suite" before proceeding. This
prevents silently running an ineffective suite trial on a plain dataset
with no scoring rules.

- Add dataset_type param to Dataset constructor, populated at all call sites
- Add dataset_type property to Dataset
- Add _validate_dataset_is_evaluation_suite in evaluator.py
- Update tests and add rejection test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: extract _run_suite_evaluation to deduplicate suite evaluation flow

evaluate_suite and evaluate_optimization_suite_trial had their entire body
duplicated. Extract shared logic into _run_suite_evaluation, parameterized
by optimization_id and dataset filters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE][BE] feat: optimization studio UI improvements

Comprehensive face-lift for optimizer screens including new KPI cards,
metric comparison cells, configuration diff views, progress charts,
trial status indicators, and backend dataset_item_count support.
Also adds backward compatibility for SDK-based optimizations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] feat: optimizer screens face-lift

- Dataset name column: hover icon instead of clickable link
- Split Accuracy into Pass rate + Accuracy columns with compact metric display
- Conditionally hide Accuracy column when no old-type optimizations exist
- Remove Logs/Configuration tabs from single optimization page
- Fall back to studio_config for configuration display on old optimizations
- Chart tooltip: remove pass rate percentage background color
- Fix dataset hover icon vertical centering
- Restore feature toggle for optimization studio

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: center trend arrow icons and rename tooltip label

- Fix arrow icon vertical centering in compact metric Tag
- Rename "Avg. runtime cost" to "Runtime cost" in chart tooltip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: polish optimizer screens UI consistency

- Fix chart tooltip background (use --background instead of --popover)
- Align column types with correct icons (cost, duration, numberDictionary)
- Align KPI card icons to match table column type icons
- Lowercase labels: Evaluation results, Best configuration, Runtime cost, Opt. cost, Optimization cost
- Darken success green color for better readability
- Remove Traces KPI card from trial view

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4687] [SDK] feat: GEPA v2 optimizer with reflection-based prompt evolution (#5547)

* [OPIK-4687] [SDK] feat: integrate GEPA v2 optimizer into framework

Add GepaV2Optimizer that delegates to the external gepa library (v0.1.0+)
for genetic-Pareto prompt optimization. Includes adapter bridging GEPA's
evaluate/reflect interface to the framework's EvaluationAdapter, lifecycle
event tracking via callbacks, result caching, and a reflection prompt that
encourages generalizable instructions while preserving template variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): improve reflection feedback with structured assertions and dynamic inputs

- Extract template variables from prompt messages for dynamic input field mapping
- Store per-assertion structure (name, value, reason) instead of flat reason strings
- Show only failed assertions in reflection feedback for focused improvement signals

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): adapter reflection control, FE chart filtering, experiment typing

- Move reflection to adapter's propose_new_texts with custom prompt template
- Use msg["name"] as candidate key when provided, fallback to {role}_{index}
- Strip echoed parameter prefix from reflection LLM output
- Disable GEPA evaluation cache so validations produce full-dataset experiments
- Tag exploration evals as mini-batch, only baseline/init/validation as trial
- FE: filter mini-batch experiments from optimization progress chart
- FE: show individual assertion score columns alongside "passed" for eval suites
- Update E2E script: no dataset split, max_candidates=10, reflection log capture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): improve reflection quality with structured feedback and template filtering

- Show FAILED and PASSED assertions separately in reflection feedback
- Keep worst run per item (most failed assertions) for reflection
- Sort reflective dataset records by failure count (most failures first)
- Exclude template-only messages (e.g. {question}) from GEPA seed candidate
- Rewrite reflection prompt: focus on failures, preserve what works, 500-word limit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): classify experiment type by batch size, not eval purpose

The purpose-based classification was unreliable: GEPA calls evaluate()
with capture_traces=False for both full validations and minibatch
evaluations of new candidates, making them indistinguishable by purpose.

Now records the full dataset size on the first evaluate call (initialization)
and classifies any call with fewer items as mini-batch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
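
The size-based classification reduces to remembering the first batch size seen. A minimal sketch with hypothetical names:

```python
class ExperimentTypeClassifier:
    """Record the full dataset size on the first evaluate() call
    (initialization), then classify any smaller batch as a mini-batch
    evaluation and any full-size batch as a trial."""

    def __init__(self):
        self.full_size = None

    def classify(self, batch_size):
        if self.full_size is None:
            self.full_size = batch_size  # first call establishes the full size
            return "trial"
        return "mini-batch" if batch_size < self.full_size else "trial"
```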

* feat(gepa-v2): improve scoring, stopping, and reflection quality

- Use mean instead of min for per-item assertion scores, giving GEPA
  granular signal instead of binary 0/1
- Track total_runs/passed_runs per item so reflection prompt shows
  whether failures are consistent or intermittent
- Stop on trial.score (framework experiment score) instead of GEPA's
  internal mean, so pass_threshold semantics are respected
- Rewrite reflection template with 4-step structure: diagnose, keep
  what works, write assertion-matched rules, generalize
- Increase max_metric_calls multiplier to 5x for deeper exploration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): hide mini-batch trials from table, use domain-neutral examples

- Filter mini-batch experiments from the trials table rows so only
  full evaluation trials are shown
- Replace customer-support-specific examples in the reflection
  template with domain-neutral ones

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): show all runs per item in reflection feedback, log rendered prompt

Previously kept only the worst run per item for reflection. Now all runs
are preserved and shown separately (Run 1/3, Run 2/3, etc.) so the
reflection LLM can see what varies across attempts. Also captures the
fully rendered reflection prompt in the reflection log for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): consolidate runs per input in reflection dataset, label assertion/reason

Consolidate multiple runs for the same input into a single record with
a Runs field and per-item Summary (pass count + consistent failures).
This eliminates input duplication (~40% token savings) and makes cross-run
comparison trivial. Also separates Assertion/Reason onto labeled lines
for clearer parsing by the reflection LLM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(gepa-v2): add feedback format coverage for reflection dataset

Add tests for single-run flat keys, multi-run Assertion/Reason labels
in Runs field, and failed assertions with empty reason.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): tell reflection LLM that examples are sorted by priority

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): use flat config instead of prompt_messages

GEPA adapter now works with flat dict[str, str] candidates instead of
knowing about message roles. baseline_config is the single source of
truth with system_prompt and user_message keys. Added LLMChatTask that
constructs LLM messages from flat config keys, replacing the
prompt_messages reconstruction path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): use TrialResult.config instead of prompt_messages, fix validator and UI prompt display

Replace TrialResult.prompt_messages with TrialResult.config so config is
the single source of truth. Update candidate_validator to accept flat
message keys (system_prompt, user_message) in addition to prompt_messages.
Populate experiment metadata "prompt" from flat keys so the UI displays
prompts correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): add optimizable_keys to OptimizationContext

Replace hardcoded PROMPT_KEYS in GepaV2Optimizer with
context.optimizable_keys so the caller explicitly controls which
baseline config keys get optimized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): add failure-aware batch sampler for minibatch item selection

Replace GEPA's default uniform sampler with FailureAwareBatchSampler that
guarantees failed items from the last full eval appear in subsequent
minibatches, giving the reflection LLM actionable signal instead of wasting
iterations on easy items.

Parameters: min_failed_per_batch, min_unseen_per_batch, failure_threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): strict types in sampler, worst-first failed selection, update implementation doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(experiment): surface optimizable keys in experiment configuration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): failure streak tracking, history-aware reflection prompt, optimizable keys in config

- Track per-item failure streaks and failing assertion names in sampler
- Annotate reflective dataset records with "Failure History" for stuck items
- Rewrite reflection prompt: failure history step, structured output, topic headers
- Surface optimizable_keys in experiment config and baseline evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): balanced reflection prompt, lower failure history threshold

- Rewrite reflection prompt to balance conservative and aggressive approaches:
  preserve working rules while encouraging grouped topic headers (## Empathy,
  ## Resolution, etc.) instead of flat numbered lists
- Lower failure history threshold from streak >= 2 to >= 1 so the reflection
  LLM sees failure context from the first repeated failure
- Guard failure history annotation with `if stuck` to avoid empty annotations
- Relax "3 unreturned callbacks" assertion to "multiple unreturned callbacks"
  (the exact-count version was too brittle for gpt-4o-mini to satisfy reliably)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): balanced 50/50 minibatch sampling, 20-item e2e suite

Balanced sampling: split minibatches ~50/50 between failed (worst-first)
and passed (random) items. Previously batches were almost entirely failed
items, causing the reflection LLM to over-correct and regress passing
behaviors (catastrophic 0.0 scores). Passed items now act as behavioral
anchors.

- Remove unseen item tracking (mark_seen, min_unseen_per_batch)
- Default min_failed_per_batch=1 (was batch_size-1)
- Minimum reflection_minibatch_size=4 (ensures 2+2 split)
- Redesign e2e suite: 20 items (5 easy, 7 medium, 8 hard)
- Fix contradicting assertions (hedging language vs no promises)
- Remove impossible assertions (specific loyalty benefits)
- Add problematic items summary to reflection log
- Save reflection log from orchestrator finally block
- Update GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
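The balanced sampling described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual `FailureAwareBatchSampler` API; names like `failed_ids`, `passed_ids`, and `batch_size` are assumptions:

```python
import random

def sample_balanced_minibatch(failed_ids, passed_ids, batch_size, rng=random):
    """Split a minibatch ~50/50 between failed and passed items.

    Failed items (assumed ordered worst-first) carry the signal for the
    reflection LLM; passed items are random "behavioral anchors" that
    discourage over-correction and regressions on passing behaviors.
    """
    n_failed = min(len(failed_ids), max(1, batch_size // 2))
    n_passed = min(len(passed_ids), batch_size - n_failed)
    batch = list(failed_ids[:n_failed])          # worst-first selection
    batch += rng.sample(list(passed_ids), n_passed)  # random anchors
    # Top up from the remaining failed items if the passed pool ran short.
    remaining = [i for i in failed_ids[n_failed:] if i not in batch]
    batch += remaining[: batch_size - len(batch)]
    rng.shuffle(batch)
    return batch
```

With `batch_size=4` this yields the 2+2 split the commit mentions, which is why the minimum `reflection_minibatch_size` is 4.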

* refactor(gepa-v2): extract collaborators from adapter with DI

Split FrameworkGEPAAdapter into three injectable collaborators:
- CandidateTracker: candidate identity, parent lineage, GEPA index mapping
- ReflectiveDatasetBuilder: feedback dataset construction for reflection LLM
- ReflectionProposer: reflection LLM interaction and logging

The adapter is now a thin facade (~300 lines, down from 664) that
orchestrates evaluation and delegates to collaborators. Compatibility
properties ensure all existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa-v2): move reflection template to ReflectionProposer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(gepa-v2): task-agnostic reflection template with prompt descriptions and sibling awareness

Rewrite the reflection template to be domain-neutral, add optional
prompt_descriptions to OptimizationContext so the reflection LLM
understands what each parameter does, and include sibling parameter
context so the LLM knows what other params exist without modifying them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs(gepa-v2): update implementation doc and reflection prompt algorithm

Update GEPA_IMPLEMENTATION.md with prompt descriptions, sibling awareness,
and task-agnostic template details. Rewrite REFLECTION_PROMPT_EXAMPLE.md
to document the full reflection prompt assembly algorithm with a rendered
example showing the new header format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): robust header stripping, markdown formatting in reflection template

The LLM sometimes echoes header metadata (Parameter:, Description:,
param name) in reformulated form. Replace exact-prefix matching with
line-by-line stripping of metadata patterns. Add IMPORTANT instruction
to not include metadata in output. Request markdown ## headers in STEP 4.

Add 11 unit tests for ReflectionProposer: header stripping edge cases,
build_header with/without descriptions, template content assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert(fe): remove debug FE changes (mini-batch filtering, column reorder)

These were temporary UI tweaks for debugging the GEPA v2 optimizer.
They'll be re-implemented properly in a separate FE PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa-v2): preserve template variables in reflection prompt

Instruct the reflection LLM to keep all template variables (e.g.
{var}, {{var}}, <var>, {% var %}) intact during prompt rewriting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
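A check like the one this commit describes can be sketched with a regex scan over the placeholder formats it lists. The patterns and function names below are assumptions for illustration, not the SDK's implementation:

```python
import re

# Matches {var}, {{var}}, <var>, and {% var %} style placeholders.
# Order matters: the double-brace and {% %} forms must be tried before
# the single-brace form so they are not matched partially.
_VAR_PATTERN = re.compile(
    r"\{\{\s*\w+\s*\}\}|\{%\s*\w+\s*%\}|\{\s*\w+\s*\}|<\s*\w+\s*>"
)

def template_variables(text: str) -> set[str]:
    """Return the set of raw placeholder tokens found in a prompt."""
    return set(_VAR_PATTERN.findall(text))

def proposal_preserves_variables(original: str, proposal: str) -> bool:
    """Reject proposals that drop any placeholder the original prompt used."""
    return template_variables(original) <= template_variables(proposal)
```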

* refactor: rename gepa_v2 to gepa, clean up OptimizationContext

- Rename gepa_v2/ -> gepa/ (now the primary optimizer)
- Rename gepa/ -> gepa_old/ (legacy optimizer)
- Rename GepaV2Optimizer -> GepaOptimizer
- Rename GepaOptimizer -> GepaLegacyOptimizer
- Remove unused fields from OptimizationContext: prompt_messages,
  metric_parameters, model_parameters
- Rename prompt_descriptions -> config_descriptions
- Delete SimpleOptimizer and its tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add configurable split_strategy to OptimizationContext

Add split_strategy field ("80_20" default, "no_split" for GEPA) so the
orchestrator handles dataset splitting instead of individual optimizers.
Remove internal train+val dedup logic from GepaOptimizer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(gepa): clean up adapter API and fix code quality issues

- Make adapter facade properties public (remove underscore prefix)
- Add standalone reflection_log fallback to prevent silent data loss
- Rename consume_pending_capture_traces → get_pending_capture_traces
- Remove dead guard in _build_evaluation_batch
- Move SYSTEM_PROMPT_KEY constant to test file
- Fix update_scores type annotation in failure_aware_sampler

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove prompt_messages logic, validate optimizable_keys

- Adapt LLMTask to use config dict, remove LLMChatTask duplicate
- Simplify candidate_validator to check optimizable_keys from adapter
- Remove prompt_messages fallback from experiment_executor metadata
- Update all tests, fixtures, scripts, and docs to flat key format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(scripts): remove stale prompt_messages and API references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(gepa): remove optimizable_keys from config dict to fix caching

optimizable_keys was being injected into CandidateConfig by both
_make_config_builder and the orchestrator, causing cache key mismatches
between baseline and initialization evaluations. Pass it as an explicit
parameter through the evaluation chain instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
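The caching bug fixed here is an instance of a general pattern: when bookkeeping metadata is injected into the config dict by only one of two code paths, dict-derived cache keys diverge for what is logically the same candidate. A minimal illustration with hypothetical names (not the optimizer's real cache):

```python
import json

def cache_key(config: dict) -> str:
    # Key derived from the full config dict: any extra key changes it.
    return json.dumps(config, sort_keys=True)

base = {"system_prompt": "You are helpful."}

# Path A (baseline evaluation) hashes the config as-is...
key_a = cache_key(base)
# ...while path B (initialization) injects metadata first.
key_b = cache_key({**base, "optimizable_keys": ["system_prompt"]})
assert key_a != key_b  # same logical candidate, guaranteed cache miss

# Fix: keep metadata out of the config and pass it explicitly, so the
# key depends only on the candidate itself.
def evaluate(config: dict, optimizable_keys: list[str]) -> str:
    return cache_key(config)

assert evaluate(base, ["system_prompt"]) == key_a
```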

* refactor(scripts): rename gepa_v2 scripts to gepa, delete run_optimization_e2e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: use optimizable_keys generically in ADDING_AN_OPTIMIZER guide

Remove hardcoded system_prompt references from code examples.
Optimizers should iterate over context.optimizable_keys instead
of assuming specific key names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix(fe): address baz review comments

- Use Tag component variants instead of hard-coded color spans for
  theme-aware diff badges (Added/Removed/Changed)
- Clamp formatAsPercentage input to [0, 1] range to prevent >100% or
  negative percentage display
- Read baseline score from experiment_scores as fallback when
  feedback_scores lacks the objective (evaluation-suite support)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(fe): extract getObjectiveScoreValue shared helper

Move the feedback_scores -> experiment_scores fallback into a reusable
getObjectiveScoreValue helper in feedback-scores.tsx. Replace all 4 call
sites (CompareTrialsPage, TrialKPICards, useOptimizationScores,
useCompareOptimizationsData) with the shared helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(be): resolve CI failures - migration conflict and test ignored fields

- Rename migration 000063 → 000064 to avoid prefix conflict with main
- Add datasetItemCount to EXPERIMENT_IGNORED_FIELDS and test builder
- Add datasetName to OPTIMIZATION_IGNORED_FIELDS (transient field)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(fe): extract shared aggregateExperimentMetrics helper

Deduplicate weighted score/cost/latency accumulation logic that was
duplicated between TrialKPICards and useCompareOptimizationsData.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add optimization_id index on experiments and remove dead code

- Add minmax index on experiments.optimization_id to speed up
  optimization queries that join experiments by optimization_id
- Remove unused OptimizationDiffView component (dead code from iteration)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update Helm documentation

* [OPIK-4383] [BE] Redis stream subscriber for debounced experiment aggregates recomputation (#5371)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
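The retry policy above (5 attempts, exponential backoff from 250 ms capped at 2 s, with jitter) can be illustrated language-neutrally. This Python sketch mirrors the described parameters but not the actual Java `RetryUtils.handleOnDeadLocks()` code; `DeadlockError` stands in for `MySQLTransactionRollbackException`:

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for MySQLTransactionRollbackException."""

def with_deadlock_retry(op, attempts=5, base_delay=0.25, max_delay=2.0,
                        jitter=0.5, sleep=time.sleep):
    """Retry op() on deadlock with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except DeadlockError:
            if attempt == attempts - 1:
                raise  # retries exhausted, surface the deadlock
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Jitter spreads retries out so contending threads don't all
            # re-acquire the same lock at once (thundering herd).
            delay *= 1 + random.uniform(-jitter, jitter)
            sleep(delay)
```

The jitter factor matters precisely because the contention came from multiple threads inserting the "latest" tag concurrently: without it, all losers would retry on the same schedule and deadlock again.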

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed the same pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Fix project_deleted filter and comments_dedup scope in ExperimentAggregatesDAO

- Fix project_deleted filter: use zero UUID sentinel instead of empty string
  for FixedString(36) column comparison in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Fix comments_dedup CTE: scope trace_id subquery by dataset_id to avoid
  scanning the entire workspace's comments table
- Add missing streamMaxLen and streamTrimLimit fields to
  ExperimentDenormalizationConfig (implements StreamConfiguration interface)

* [OPIK-4383] [BE] Address PR review comments: extract ZERO_UUID constant and fix config comment

- Promote zero UUID sentinel to shared constant in ExperimentGroupMappers
- Use parameterized :zero_uuid binding in SQL templates instead of hardcoded string
- Fix config.yml comment from "Default: 120s" to "Default: 1m"

* [OPIK-4383] [BE] Add streamMaxLen and streamTrimLimit to experimentDenormalization config

* [OPIK-4727] fix: remove old GEPA code, fix aggregates test, add migration rollback docs

- Remove gepa_old/ optimizer source and tests, clean factory registry
- Add datasetItemCount to EXPERIMENT_AGGREGATED_FIELDS_TO_IGNORE (not stored in aggregates table)
- Add rollback documentation to mutation experiment type migration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] refactor: deduplicate KPI cards, metric cells, and cleanup

- Extract shared KPICard/MetricKPICard to pages-shared/experiments/KPICard
- Extract calcPercentageVsBaseline helper and TrialMetricCellContent to
  deduplicate percentage calculation across 3 trial metric cells
- Remove unused OptimizationUpdate interface from types
- Fix inconsistent color token (text-light-slate → text-muted-slate)
- Move IIFE out of JSX in MetricComparisonCell compact mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] refactor: rename Compare* to Optimization/Trial, simplify URL structure

- Rename CompareOptimizations* → Optimization* and CompareTrials* → Trial*
- Simplify optimization URL from /$datasetId/$optimizationId to /$optimizationId
- Change trial route from /compare to /trials
- Add OptimizationCompareRedirect for legacy URL backwards compatibility
- Update all navigation references across pages (OptimizationsPage, HomePage, BestPrompt, ResourceLink, etc.)
- Fix breadcrumbs: show raw optimization ID, "Trial #N" for trials
- Split optimization detail into Report & Trials tabs with underline style
- Replace ToggleGroup with underline Tabs on trial page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] feat: rename tabs, add diff vs parent, fix word-level diffs

- Rename "Report" tab to "Overview" on optimization page
- Rename "Best configuration" to "Best trial configuration"
- Change "Diff" button to "Diff vs. baseline" in configuration sections
- Add "Diff vs. parent" option in trial configuration tab
- Fix prompt diff to use word-level mode for inline change highlights
- Fix TextDiff word-mode layout to flow inline instead of dropping lines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] refactor: extract shared config flattening, add redirect tests

- Extract flattenConfig, EXCLUDED_CONFIG_KEYS, shouldSkipRedundantKey
  into configuration-renderer.ts (shared by TrialConfigurationSection
  and ConfigurationDiffContent)
- Convert ConfigViewMode string union to CONFIG_VIEW_MODE const object
- Add missing `replace` prop on fallback Navigate in
  OptimizationCompareRedirect
- Restore isArray guard in ConfigurationDiffContent collectPrompts
- Add unit tests for configuration-renderer (21 tests)
- Add unit tests for OptimizationCompareRedirect (4 tests)
- Add E2E Playwright test for legacy /compare URL redirect (2 tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: PR review fixes - generic flattenConfig, NamedPrompts diff, parent fallback

- Make flattenConfig accept generic skipKey callback instead of hardcoded
  filtering (addresses Baz review comment)
- Fix NamedPrompts format not recognized as "prompt" type in
  detectConfigValueType, causing JSON-level diff instead of word-level
- Add parent experiment fallback for old optimizations using chronological
  ordering (enables "Diff vs. parent" for non-GEPA v2)
- Fix PromptDiff fallback paths to use mode="words" for word-level diffs
- Add tests for NamedPrompts detection and generic skipKey behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: PR review - shared makeSkipKey, parent fallback, label tweak

- Extract makeSkipKey helper into configuration-renderer.ts to eliminate
  duplicate skipKey predicates in TrialConfigurationSection and
  ConfigurationDiffContent
- Fix parentCandidateIds lookup: fall through to chronological fallback
  when GEPA v2 metadata exists but no matching parent is found
- Use shared skipKey in collectPrompts instead of inline checks
- Remove period from "Diff vs." labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: rename Config toggle label to Configuration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4838] [SDK] feat: GEPA convergence improvements (#5570)

* [OPIK-4838] [SDK] feat: GEPA convergence improvements

- Add GepaConfig dataclass for centralized algorithm parameters
- Cache parent scores during minibatch gate with configurable tolerance
  to absorb LLM judge noise (gate_tolerance=0.1)
- Rewrite FailureAwareBatchSampler to use assertion-based pass/fail
  instead of score threshold, prioritize by failing assertion count
- Switch default candidate selection to current_best
- Add docs on scoring pipeline and candidate selection strategies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore(scripts): use GepaConfig defaults in e2e scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove local-only docs from PR

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address baz review comments, tighten e2e assertions

- Remove unused _global_assertion_failures Counter from sampler
- Update sampler docstring to match implementation (assertion count, not frequency)
- Bound _cached_full_eval_scores to max_candidates entries with FIFO eviction
- Tighten e2e assertions: add context-awareness checks to EASY tier, sharpen MEDIUM tier

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove gate_tolerance, require strict minibatch improvement

Cached parent scores are still used for deterministic comparison,
but mutations must now strictly beat the parent on the minibatch
without any tolerance cushion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: shuffle failed items randomly instead of sorting by priority

Simplifies minibatch sampling — all failed items are shuffled equally
rather than sorted by assertion failure count then shuffled within tiers.
This gives better variety across iterations since most items share the
same tier anyway.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: top-tier sampling, template var validation, improved reflection prompt

- Sampler splits failed items into top/rest by assertion failure count,
  draws randomly from top tier first (configurable top_failed_fraction)
- Reject reflection proposals that drop template variables (e.g. {question})
- Reflection prompt encourages surgical edits over full rewrites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
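The template-variable check described above can be sketched like this (an illustrative sketch assuming `{name}`-style placeholders; function names are hypothetical, not the SDK's actual API):

```python
import re

def template_vars(prompt):
    # extract {name}-style template variables from a prompt
    return set(re.findall(r"\{(\w+)\}", prompt))

def keeps_template_vars(parent, proposal):
    # reject reflection proposals that drop any variable present in the parent
    return template_vars(parent) <= template_vars(proposal)
```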

* fix: only update best_trial from full evaluations, not minibatches

A minibatch scoring 1.0 on 4 items was being reported as best_trial
even when the full evaluation only reached 0.9 on 20 items. Now
best_trial is only updated when experiment_type is None (full eval).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
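The gating logic can be sketched as below — a hedged illustration using dicts in place of the real trial objects; `maybe_update_best` is a hypothetical name:

```python
def maybe_update_best(best_trial, trial):
    # only full evaluations (experiment_type is None) may become best_trial;
    # a minibatch score over a few items is not comparable to a full-eval score
    if trial.get("experiment_type") is not None:
        return best_trial
    if best_trial is None or trial["score"] > best_trial["score"]:
        return trial
    return best_trial
```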

* chore: remove accidentally committed reflection logs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA docs for top-tier sampling, 5-step reflection, template var validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: increase default reflection_minibatch_size from 4 to 6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: thread evaluator_model through pipeline, increase max_candidates to 25

The default LLMJudge model (gpt-5-nano) was too lenient, causing
pass_rate to always report 1.0. Thread evaluator_model from
OptimizationContext through EvaluationAdapter and experiment_executor
so callers can specify a more capable judge model.

Also increase GEPA max_candidates default from 5 to 25 to allow
longer optimization runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: configurable blended scoring with assertion-level tiebreaker

Add ScoringConfig with strategy ("blended" | "pass_rate"), configurable
weights, and auto-computed epsilon (1/(num_items+1)) that guarantees
pass_rate always dominates while giving the algorithm gradient signal
from individual assertion progress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: log raw pass_rate to UI, use blended score only for algorithm

_extract_score now returns (optimization_score, display_score) tuple.
The blended score drives the algorithm's acceptance gate, while the
raw pass_rate is logged as the experiment score for the UI chart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align GEPA per-item scoring with pass_rate, expose blended score as internal

- Replace GEPA's mean-of-assertions scoring with pass_rate-aligned formula:
  passing items score 1.0, failing items score ε × assertion_frac
  (where ε = 1/(num_items+1)), preserving gradient for subsample gate
- Use build_suite_result as source of truth for item pass/fail
- Store pass_rate in trial.score (user-facing), blended score in
  internal_optimization_score (algorithm-only)
- Use pass_rate for stop condition threshold comparison
- Add type hints and ScoreResult type to _extract_per_item_feedback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
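The pass_rate-aligned formula above can be sketched as follows (illustrative names; the real implementation lives in the optimizer's scoring module). The key property is that ε = 1/(n+1) caps the total contribution of all failing items below the value of one passing item, so more passes always wins:

```python
def blended_score(items):
    """items: list of (passed, assertion_frac) pairs for one candidate.
    Passing items score 1.0; failing items score epsilon * assertion_frac,
    with epsilon = 1/(n+1) so pass_rate always dominates the blend."""
    n = len(items)
    epsilon = 1.0 / (n + 1)
    total = sum(1.0 if passed else epsilon * frac for passed, frac in items)
    return total / n

def pass_rate(items):
    # user-facing score: fraction of items that passed
    return sum(1 for passed, _ in items if passed) / len(items)
```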

* fix: only record full evaluations as visible trials

Skip appending subsample/minibatch evals and cache hits to state.trials
so the UI only shows meaningful trial progression. Internal evals are
still returned to the optimizer for its scoring logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA docs for scoring contract, trial visibility, per-item scoring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make reflection prompt less conservative to reduce stagnation

Relax the "surgical edits / MINIMAL EDIT" constraints that prevented the
reflection LLM from making meaningful structural changes when persistent
failures are detected. Key changes: graduated edit aggressiveness based
on Failure History, concrete escalation strategies (restructure, step-by-step
procedures, conditional logic, section rewrite), and removal of the
single-rule-per-failure cap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: cumulative assertion failure tracking with persistent failure threshold

- Track per-assertion total failures and total evaluations across the
  entire optimization run in FailureAwareBatchSampler
- Show Failure History only when an assertion has failed >=10 times
  (persistent failure threshold) AND failed again in current eval
- Include failures/evals ratio so the reflection LLM can gauge severity
- Remove streak-based logic in favor of cumulative counts
- Simplify reflection prompt: 5 steps → 4, merge write+apply steps,
  remove duplicated escalation, cleaner multiline formatting
- Remove duplicate cumulative info from Summary's Blocking assertions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
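The cumulative tracking scheme can be sketched as below (a hedged sketch; the class and method names are illustrative stand-ins for the sampler's internal counters):

```python
from collections import Counter

class AssertionFailureTracker:
    """Cumulative per-assertion failure counts across the whole run."""

    def __init__(self, persistent_failure_threshold=10):
        self.threshold = persistent_failure_threshold
        self.failures = Counter()
        self.evals = Counter()

    def record(self, assertion, passed):
        self.evals[assertion] += 1
        if not passed:
            self.failures[assertion] += 1

    def show_failure_history(self, assertion, failed_now):
        # surface Failure History only when the assertion has failed at least
        # `threshold` times cumulatively AND failed again in the current eval
        return failed_now and self.failures[assertion] >= self.threshold

    def ratio(self, assertion):
        # failures/evals ratio so the reflection LLM can gauge severity
        return self.failures[assertion] / max(self.evals[assertion], 1)
```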

* feat: show only worst run for multi-run items, reorder Failure History before run output

Reduces feedback verbosity by showing only the worst run instead of all runs
for multi-run items. Places Failure History right after Inputs in the record
so the reflection LLM sees persistent failure context before the run details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
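The worst-run selection amounts to picking the run with the most failing assertions — sketched here with dict-shaped runs as an assumption, not the actual record types:

```python
def worst_run(runs):
    # for a multi-run item, keep only the run with the most failing assertions
    def failing_count(run):
        return sum(1 for a in run["assertions"] if not a["passed"])
    return max(runs, key=failing_count)
```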

* feat: guide reflection LLM toward general rules, allow specificity for persistent failures

Step 3 now instructs to abstract specific examples into general categories.
Step 2 allows more specific rules for persistently failing assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA reflection prompt docs for current algorithm

Updates template (4 steps), feedback format (worst-run-only, Failure History
with cumulative counts and >=10 threshold), and generalization guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: reduce prompt overfitting — prefer updating existing rules, group by behavior pattern

Step 3: check whether an existing rule covers the failing behavior before
adding a new one. Step 4: group by behavior pattern, not scenario type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make persistent failure threshold configurable, default to 7

Add persistent_failure_threshold to GepaConfig (default=7), thread through
GepaOptimizer → FrameworkGEPAAdapter → ReflectiveDatasetBuilder.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: strengthen anti-overfitting in reflection prompt

Step 3: NEVER copy specific names/details/scenarios from feedback — they
are samples that change at runtime. Step 2: persistent failure specificity
still avoids non-generalizable details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update GEPA docs for current algorithm state

Update both implementation guide and reflection prompt docs to match
current code: 4-step template, worst-run-only feedback, cumulative
failure threshold (configurable, default 7), anti-overfitting guidance,
and corrected config defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle edge cases in scoring and trial recording

- _item_score: return 0.0 (not 1.0) for failed items with empty
  assertions, so they don't get scored as passes
- evaluation_adapter: guard trial is not None before appending to
  state.trials and accessing trial.optimization_score

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(BE): exclude minibatch/mutation experiments from pass_rate computation

The FIND query's best_objective_score was computed as a weighted average
across ALL experiments including minibatch and mutation, causing the UI
to show incorrect pass_rate during optimization. Filter experiment_candidates
to only include full-eval experiments (regular/trial types).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): replace ambiguous XYZ headphones e2e item with clear bluetooth speaker scenario

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): use assertions= shorthand and typed ExecutionPolicy in e2e scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(optimizer): add early stopping when pass_rate plateaus

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
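A plateau detector of this shape could look like the following — a minimal sketch under the assumption that the optimizer keeps a list of full-eval pass_rates; the patience value and function name are illustrative:

```python
def plateaued(pass_rates, patience=3, min_delta=1e-9):
    # stop early when the last `patience` full evals failed to improve on the
    # best pass_rate seen before them
    if len(pass_rates) <= patience:
        return False
    best_before = max(pass_rates[:-patience])
    return max(pass_rates[-patience:]) <= best_before + min_delta
```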

* fix(test): update reflective dataset tests to use Worst Run instead of Runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(optimizer): update assertion failure counters on minibatch evals too

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [SDK] refactor: convert reflection template to triple-quoted string

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: extract apps/opik-optimizer to comet-ml/opik-optimizer repo

Remove apps/opik-optimizer/ directory (moved to separate repo).
Clean up python-backend framework optimizer references:
- Remove framework_optimizer.py and framework_runner.py
- Remove OPTIMIZER_FRAMEWORK queue from Java Queue enum
- Remove opik-optimizer additional_contexts from docker-compose and CI
- Simplify resolveQueue to always use OPTIMIZER_CLOUD

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: revert python-backend DRY refactor that was only needed for framework optimizer

Restore optimizer.py and rq_worker_manager.py to main state, delete
optimizer_job_helper.py which only existed to share code with the
now-removed framework_optimizer.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: trials table sorting, metric trend precision, and word diff readability

- Implement client-side sorting for optimization trials table (all columns)
- Default sort by Trial # ascending
- Show 0% trend when formatted values are identical (below display resolution)
- Fall back to block diff when word changes exceed 60% of content

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
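The "0% trend when formatted values are identical" behaviour can be sketched in Python (the real code is TypeScript in the FE; the formatter and names here are assumptions for illustration):

```python
def trend_percentage(prev, curr, fmt=lambda v: f"{v:.0%}"):
    # if both values render identically at display resolution, show a 0% trend
    # instead of a misleading tiny delta
    if fmt(prev) == fmt(curr):
        return 0.0
    if prev == 0:
        return None  # no meaningful baseline
    return (curr - prev) / prev * 100.0
```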

* [NA] [FE] fix: improve diff readability and optimization progress status

- Use hybrid line+word diff: line-level diff first to find changed regions,
  then word-level refinement within paired lines for precise highlights
- Use diffTrimmedLines to ignore trailing whitespace differences
- Fall back to separate removed/added blocks when line pairs are too different
- Add "Running initial calculations..." status for early GEPA phases
- Unify Changed/Added/Removed tag styling in prompt diff view

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: address PR review — sortCandidates tests, diffLines, formatter comments

- Add unit tests for sortCandidates covering all sort branches
- Switch TextDiff from diffTrimmedLines to diffLines for whitespace detection
- Add comments explaining intentional formatter comparison in percentage calc
- Export sortCandidates and CANDIDATE_SORT_FIELD_MAP for testability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: address PR feedback — shared percentage helper, migration renumber, test timezone fixes, dataset button layout

- Extract calcFormatterAwarePercentage into shared lib/percentage.ts (review comment)
- Renumber migrations 000064→000065, 000065→000066 to avoid prefix conflict with main
- Fix timezone-sensitive test failures in MetricDateRangeSelect/utils.test.ts
- Fix dataset NavigationTag rendering as block in OptimizationHeader
- Fix mypy return type for _run_suite_evaluation in evaluator.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] feat: dynamic chart legend, clickable ghost dot, dataset button width fix

- Chart legend now shows only statuses present in the data
- Ghost (in-progress) dot is clickable to select the trial
- Dataset NavigationTag constrained to content width with w-fit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] refactor: simplify trial statuses, ghost dot color, best candidate breathing animation

- Remove "evaluating" status — candidates are now passed (scored > parent) or pruned (scored <= parent)
- Simplify computeCandidateStatuses to compare against parent score
- Ghost dot uses running status color (yellow) instead of hardcoded blue
- Best candidate dot breathes (opacity pulse) when optimization is active but no ghost dot is shown
- Clean up unused isOptimizationFinished/inProgressStepIndex props from columns and cells

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [FE] fix: optimization cost duration from created_at, live timer, absolute time in trials table

- Duration now starts from optimization created_at instead of first experiment
- Live ticking timer while optimization is in progress
- Trials table "Created" column shows absolute date/time instead of relative

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [SDK] fix: correct return type of evaluate_optimization_suite_trial

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove orphaned test for evaluate_optimization_suite_trial

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert unused dataset_type additions in Python SDK

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4727] [FE] fix: improve non-eval-suite optimization display

- All chart dots blue for non-eval-suite optimizations (no pruned status)
- Chart legend shows metric name instead of status labels
- Table shows "Baseline" for step 0, "Passed" for scored candidates
- Column header shows metric name (e.g. "Accuracy (geval)")
- Add reason tooltip (speech bubble) to trial items score columns
- Remove jailbreak password demo template
- Revert temporary feature flag override

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: renumber migration prefixes 000065→000067, 000066→000068 to avoid conflicts with main

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4928] [BE] fix: use correct column to lookup execution policies for experiment items

The execution policy lookup query was filtering by dataset_item_versions.dataset_item_id
instead of dataset_item_versions.id. With dataset versioning enabled, experiment items
reference dataset_item_versions.id as their datasetItemId, causing the lookup to miss
and fall back to the default policy {runs_per_item:1, pass_threshold:1}.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing dataset_item_count to aggregated experiment CTEs

The experiments_from_aggregates_final CTEs in both FIND and
FIND_GROUPS_AGGREGATIONS queries were missing dataset_item_count,
causing ClickHouse UNKNOWN_IDENTIFIER errors when the aggregated
branch was used. Maps ea.experiment_items_count to dataset_item_count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove duplicate SELECT in FIND_GROUPS_AGGREGATIONS and deduplicate chart empty states

- The FIND_GROUPS_AGGREGATIONS query had two outer SELECTs after the
  subquery, causing a ClickHouse syntax error. Merged dataset_item_count
  into the single outer SELECT and removed the duplicate.
- Collapsed identical spinner/NoData branches in OptimizationProgressChartContainer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [OPIK-4928] [BE] fix: remove feature flag gate preventing execution policy resolution

Root cause: ExperimentItemService.fetchItemPolicies() was gated behind
isDatasetVersioningEnabled(). When disabled, item-level execution
policies were never fetched from dataset_item_versions, so all
experiment items fell back to ExecutionPolicy.DEFAULT {1, 1}.

Also reverts the incorrect DatasetItemVersionDAO column change from
576791acc — experiment_items.dataset_item_id stores the logical
dataset_items.id which matches dataset_item_versions.dataset_item_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review comments and fix execution_po…
miguelgrc pushed a commit that referenced this pull request Mar 19, 2026
…ndpoints (#5577)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
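The retry scheme (5 attempts, 250ms→2s exponential backoff, 0.5 jitter) can be sketched in Python — a hedged illustration of the pattern, not the Java `RetryUtils.handleOnDeadLocks()` code:

```python
import random
import time

def retry_on_deadlock(fn, attempts=5, base_delay=0.25, max_delay=2.0,
                      jitter=0.5,
                      is_deadlock=lambda e: "deadlock" in str(e).lower(),
                      sleep=time.sleep):
    # exponential backoff (base_delay doubling, capped at max_delay) with
    # jitter to spread out retries from concurrently colliding transactions
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1 or not is_deadlock(exc):
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            # jitter=0.5 scales the delay by a random factor in (0.5, 1.0]
            sleep(delay * (1 - jitter * random.random()))
```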

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into a single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder. Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoided double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.
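The debounce index described above can be sketched with plain `java.util` maps standing in for the Redis ZSET and hash. All class and method names here are illustrative, not the actual Opik code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the debounce index: a map from
// "workspaceId:experimentId" to expiry timestamp stands in for the Redis ZSET,
// and a second map stands in for the userName hash.
public class DebounceIndex {
    private final long debounceDelayMs;
    private final Map<String, Long> expiryByMember = new HashMap<>();     // ZSET stand-in
    private final Map<String, String> userNameByMember = new HashMap<>(); // hash stand-in

    public DebounceIndex(long debounceDelayMs) {
        this.debounceDelayMs = debounceDelayMs;
    }

    // Each trigger (re)schedules the member with score = now + debounceDelay.
    // Repeated triggers within the window just push the expiry forward (dedup).
    public void trigger(String workspaceId, String experimentId, String userName, long nowMs) {
        String member = workspaceId + ":" + experimentId;
        expiryByMember.put(member, nowMs + debounceDelayMs);
        userNameByMember.put(member, userName);
    }

    // The job drains members whose score is <= now, like ZRANGEBYSCORE -inf..now,
    // removing both the ZSET entry and the associated hash entry.
    public List<String> drainDue(long nowMs) {
        List<String> due = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = expiryByMember.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= nowMs) {
                due.add(e.getKey());
                userNameByMember.remove(e.getKey());
                it.remove();
            }
        }
        return due;
    }

    public int size() {
        return expiryByMember.size();
    }
}
```

A re-trigger before the window elapses only moves the expiry forward, which is what makes bursts of trace/span updates collapse into a single aggregation run.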

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize(class.getSimpleName())
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
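The key derivation described above (lowercase first letter of the simple class name) can be reproduced with a tiny stdlib helper; this mirrors WordUtils.uncapitalize semantics for this case and is illustrative only, not the dropwizard-jobs source:

```java
// Sketch of how the jobs-map config key is derived from a job class name.
public class JobKeys {
    public static String jobConfigKey(Class<?> jobClass) {
        String name = jobClass.getSimpleName();
        if (name.isEmpty()) {
            return name;
        }
        // WordUtils.uncapitalize-style: only the first character is lowercased.
        return Character.toLowerCase(name.charAt(0)) + name.substring(1);
    }

    // Stand-in for the real job class, used only to demonstrate the derived key.
    static class ExperimentDenormalizationJob {}

    public static void main(String[] args) {
        System.out.println(jobConfigKey(ExperimentDenormalizationJob.class));
    }
}
```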

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.
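The hybrid read path described above can be sketched with plain maps: each requested experiment either comes from the pre-computed table or falls back to live computation, and the two branches are unioned. Names are illustrative, not the actual DAO code:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative UNION ALL hybrid: experiments present in the pre-computed
// aggregates map use those values; all others fall back to live computation.
public class HybridReader {
    public static Map<String, Long> read(Collection<String> experimentIds,
                                         Map<String, Long> precomputed,
                                         Function<String, Long> liveCompute) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String id : experimentIds) {
            Long agg = precomputed.get(id);                          // "aggregated" branch
            out.put(id, agg != null ? agg : liveCompute.apply(id));  // "raw" fallback branch
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> pre = Map.of("e1", 5L);
        System.out.println(read(List.of("e1", "e2"), pre, id -> 7L));
    }
}
```

In the real SQL this split is done with two SELECT branches joined by UNION ALL, but the per-experiment routing decision is the same.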

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.

* Fix get by id

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.
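The double LIMIT/OFFSET bug reverted above can be reproduced with plain list arithmetic: if a page-sized LIMIT is applied inside the CTE and then again (with OFFSET) outside, page 2 paginates over an already-truncated result set. A minimal stdlib sketch, not the actual query code:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Demonstrates why applying LIMIT both inside the CTE and in the outer
// query breaks pagination: the outer OFFSET skips past the truncated rows.
public class DoubleLimitBug {
    public static List<Integer> limitOffset(List<Integer> rows, int limit, int offset) {
        return rows.stream().skip(offset).limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> all = IntStream.range(0, 25).boxed().collect(Collectors.toList());
        int pageSize = 10;

        // Correct: one LIMIT/OFFSET for page 2 -> rows 10..19.
        List<Integer> page2Correct = limitOffset(all, pageSize, pageSize);

        // Buggy: the CTE already applied LIMIT 10, then the outer query applies
        // LIMIT 10 OFFSET 10 on those 10 rows -> nothing left.
        List<Integer> cte = limitOffset(all, pageSize, 0);
        List<Integer> page2Buggy = limitOffset(cte, pageSize, pageSize);

        System.out.println(page2Correct.size() + " vs " + page2Buggy.size());
    }
}
```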

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* [OPIK-4384] [BE] Unify aggregation branch counts with shared ExperimentAggregationSql

Extract SELECT_AGGREGATED_EXPERIMENT_IDS SQL and AggregatedExperimentCounts
into shared ExperimentAggregationSql utility class. Introduce
AggregationBranchCountsCriteria DTO to unify getAggregationBranchCounts
overloads across ExperimentDAO, ExperimentItemDAO, and DatasetItemVersionDAO.

* [OPIK-4384] [BE] Move getAggregationBranchCounts to ExperimentAggregatesDAO

Consolidate aggregation branch counting logic into ExperimentAggregatesDAO
instead of a separate utility class. Extract DTOs into their own files
in the experiments.aggregations package.

* [OPIK-4384] [BE] Deduplicate experiment_aggregates subquery with SELECT DISTINCT

Prevent inflated counts from ReplacingMergeTree pre-merge duplicates
in the aggregation branch counting query.
miguelgrc pushed a commit that referenced this pull request Mar 19, 2026
…ment by ID (#5579)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
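The retry shape described above (5 attempts, exponential backoff from 250ms capped at 2s, 0.5 jitter) can be sketched in plain Java. This is a blocking stdlib sketch, not the project's actual RetryUtils, and the names are illustrative:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative retry helper: exponential backoff with jitter.
// Expects maxAttempts >= 1; the real implementation would also check that the
// caught exception is actually a (possibly nested) deadlock before retrying.
public class DeadlockRetry {
    public static <T> T withRetry(Callable<T> action, int maxAttempts,
                                  long baseDelayMs, long maxDelayMs,
                                  double jitter) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(backoffDelayMs(attempt, baseDelayMs, maxDelayMs, jitter));
            }
        }
        throw last;
    }

    // Exponential backoff: base * 2^(attempt-1), capped at maxDelayMs,
    // then scaled by a random factor in [1-jitter, 1+jitter) to avoid
    // a thundering herd of simultaneous retries.
    public static long backoffDelayMs(int attempt, long baseDelayMs, long maxDelayMs, double jitter) {
        long exp = Math.min(maxDelayMs, baseDelayMs * (1L << (attempt - 1)));
        double factor = jitter <= 0
                ? 1.0
                : 1.0 + ThreadLocalRandom.current().nextDouble(-jitter, jitter);
        return Math.max(0, (long) (exp * factor));
    }
}
```

With base 250ms and cap 2s, attempts 1..5 target roughly 250, 500, 1000, 2000, 2000 ms before jitter is applied.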

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.
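The shared pagination shape can be sketched generically: one helper that applies an optional cursor, a limit, and a row mapper, instead of three near-identical copies. This stdlib sketch over an in-memory list is illustrative only; the real helper binds these parameters into a streaming R2DBC query:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Illustrative generic pagination helper: optional cursor + limit + row mapper.
public class PaginationHelper {
    public static <R, T> List<T> pageAfter(List<R> rows, Comparator<R> order,
                                           R cursor, int limit, Function<R, T> mapper) {
        List<R> sorted = new ArrayList<>(rows);
        sorted.sort(order);
        List<T> out = new ArrayList<>();
        for (R row : sorted) {
            if (cursor != null && order.compare(row, cursor) <= 0) {
                continue; // optional cursor binding: skip everything at or before the cursor
            }
            if (out.size() == limit) {
                break;    // limit binding
            }
            out.add(mapper.apply(row)); // result mapping
        }
        return out;
    }
}
```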

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.
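The fix above boils down to applying the same target-project restriction in the count as in the main query, with an empty or null target list meaning "no restriction". A minimal in-memory sketch (names and record illustrative, not the actual DAO types):

```java
import java.util.List;

// Illustrative: countTotal must honor target_project_ids in its main filter,
// not only inside the project_deleted subquery.
public class ExperimentCounter {
    public record Exp(String projectId, boolean deleted) {}

    public static long countTotal(List<Exp> experiments, List<String> targetProjectIds) {
        boolean hasTargets = targetProjectIds != null && !targetProjectIds.isEmpty();
        return experiments.stream()
                .filter(e -> !e.deleted())
                .filter(e -> !hasTargets || targetProjectIds.contains(e.projectId()))
                .count();
    }
}
```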

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic
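The guard described for extractUuidsFromGroupValues (null, blank, and non-UUID placeholder values) can be sketched as follows; the class name and exact placeholder handling are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative: defensively extract UUIDs from group values, skipping
// null/blank entries and non-UUID placeholders instead of throwing.
public class GroupValueUuids {
    public static List<UUID> extractUuids(List<String> groupValues) {
        List<UUID> uuids = new ArrayList<>();
        if (groupValues == null) {
            return uuids;
        }
        for (String value : groupValues) {
            if (value == null || value.isBlank()) {
                continue; // null/blank guard
            }
            try {
                uuids.add(UUID.fromString(value.trim()));
            } catch (IllegalArgumentException ignored) {
                // e.g. a ClickHouse default placeholder rather than a real id
            }
        }
        return uuids;
    }
}
```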

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder.  Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.
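The semantic-null fix can be sketched: a percentile entry converts to BigDecimal only when a numeric value is present, and absent or unsupported inputs become null rather than BigDecimal.ZERO so callers can apply their own fallback. Illustrative, not the actual mapper:

```java
import java.math.BigDecimal;
import java.util.Map;

// Illustrative: absent or unsupported percentile entries become null
// (semantic "no value"), never BigDecimal.ZERO.
public class PercentileMapper {
    public static BigDecimal convertToBigDecimal(Object value) {
        if (value instanceof BigDecimal) {
            return (BigDecimal) value;
        }
        if (value instanceof Number) {
            return BigDecimal.valueOf(((Number) value).doubleValue());
        }
        return null; // null or unsupported input -> semantic null, not ZERO
    }

    public static BigDecimal percentile(Map<String, ?> duration, String key) {
        return convertToBigDecimal(duration == null ? null : duration.get(key));
    }
}
```

Returning null here matters: ZERO would be indistinguishable from a real measured 0, which breaks downstream COALESCE-style fallbacks.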

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoided double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize(class.getSimpleName())
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.
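The casing requirement follows directly from what uncapitalize does — it lowercases only the first character of the class's simple name. A stdlib re-implementation of the key derivation (the real framework uses commons-text's WordUtils.uncapitalize):

```java
// Sketch of how dropwizard-jobs derives the config lookup key from the
// job class name: only the first character is lowercased, so the YAML key
// must be 'experimentDenormalizationJob', not 'ExperimentDenormalizationJob'.
public class JobKey {
    static String uncapitalize(String s) {
        if (s == null || s.isEmpty()) return s;
        return Character.toLowerCase(s.charAt(0)) + s.substring(1);
    }
}
```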

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.

* [OPIK-4386] [BE] Trigger lazy aggregation via publisher on GET experiment by ID

When fetching an experiment by ID, if the experiment is in COMPLETED or
CANCELLED state and is not yet present in the experiment_aggregates table,
enqueue it for aggregation using ExperimentAggregationPublisher instead of
computing aggregations synchronously. The check and publish are performed
off the critical path via doOnEach, so the caller receives the experiment
immediately without waiting for the side effect to complete.

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4386] [BE] fix: demote lazy aggregation check log to DEBUG

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention
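The safety valve can be sketched generically: page through a source until it reports empty, but bound the number of iterations so a page that never shrinks (entries that fail to be removed) cannot spin forever. Names like `fetchPage` and `maxIterations` are illustrative; the real code applies this cap inside Reactor's `expand()` over ZSET pages:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Bounded drain loop: terminates either when the source is empty (normal
// case) or when the iteration cap is hit (the safety valve against a
// source that keeps returning the same undeletable entries).
public class BoundedDrain {
    static <T> List<T> drain(Supplier<List<T>> fetchPage, int maxIterations) {
        List<T> all = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            List<T> page = fetchPage.get();
            if (page.isEmpty()) break;   // source drained: stop normally
            all.addAll(page);
        }
        return all;                      // cap reached: give up rather than spin
    }
}
```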

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support
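The pre-query decision reduces to comparing how many of the requested experiments already have aggregate rows against the total requested. A minimal sketch of that flag computation (names are illustrative; the real pre-query counts rows in experiment_aggregates):

```java
import java.util.Set;

// Sketch of the branch pre-query: if every requested experiment is already
// aggregated, the raw UNION ALL branch can be skipped entirely, and vice
// versa; only the mixed case needs both branches rendered.
public class BranchCounts {
    final boolean hasAggregated;
    final boolean hasRaw;

    BranchCounts(Set<String> requested, Set<String> aggregated) {
        long agg = requested.stream().filter(aggregated::contains).count();
        this.hasAggregated = agg > 0;
        this.hasRaw = agg < requested.size();
    }
}
```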

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.

* Fix format

* Fix get by id

* Fix mapping

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4386] [BE] Increase debounceDelay in test config to prevent race condition

The denormalization job was processing finished experiments during test
execution with incomplete ClickHouse data, causing stale aggregated
values to be returned instead of fresh raw computations.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* Update config-test.yml

* Update ExperimentService.java

* Remove old unused query

* [OPIK-4386] [BE] Address PR review comments: demote log to debug, add try-catch for context safety, add getById lazy aggregation tests

* [OPIK-4386] [BE] Add workspaceId to lazy aggregation log messages
miguelgrc pushed a commit that referenced this pull request Mar 19, 2026
…nts endpoint (#5583)

* [OPIK-4380] [BE] Add experiment aggregates for denormalized metrics

- Add experiment_aggregates and experiment_item_aggregates tables
- Implement ExperimentAggregatesDAO with population and query methods
- Add ExperimentAggregatesService for aggregation management
- Refactor DTOs into organized model classes:
  - ExperimentAggregatesModel: aggregation results
  - ExperimentEntityData: entity models
  - ExperimentSourceData: raw source data
  - ExperimentAggregatesUtils: utilities
- Add FEEDBACK_SCORES_AGGREGATED filter strategy for map-based filtering
- Add comprehensive integration tests (10/10 passing)
- Configure batch size and parallelism settings

* [OPIK-4380] [BE] Add MySQL deadlock retry mechanism for concurrent dataset operations

Problem:
- MySQL deadlock on dataset_version_tags composite PRIMARY KEY (workspace_id, dataset_id, tag)
- Occurred during parallel dataset creation in same workspace
- Multiple threads inserting "latest" tag for different datasets caused lock contention
- Experiments with parallel execution were failing with MySQLTransactionRollbackException

Solution:
- Add handleOnDeadLocks() method in RetryUtils with:
  - 5 retry attempts with exponential backoff (250ms to 2s)
  - 0.5 jitter to reduce thundering herd effect
  - Recursive isDatabaseDeadlock() detection for MySQLTransactionRollbackException
- Apply retry logic in DatasetItemService.setDatasetItemVersion()
- Enables concurrent dataset creation for same workspace

Impact:
- Supports parallel experiment execution with proper deadlock handling
- Test success rate improved from 0/10 to 10/10 in ExperimentAggregatesIntegrationTest
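The retry schedule described above can be sketched as a pure delay computation: exponential backoff from 250ms doubling up to a 2s cap, with 0.5 jitter (each delay drawn uniformly from [0.5·d, 1.5·d)). Constants mirror the commit message; the actual retry loop lives in RetryUtils.handleOnDeadLocks():

```java
import java.util.Random;

// Sketch of the deadlock retry backoff: 250, 500, 1000, 2000, 2000, ... ms,
// each then scaled by a random factor in [0.5, 1.5) to avoid thundering herd.
public class DeadlockBackoff {
    static final long BASE_MS = 250;
    static final long CAP_MS = 2_000;

    static long baseDelay(int attempt) {            // attempt is 0-based
        long d = BASE_MS << Math.min(attempt, 30);  // doubles per attempt
        return Math.min(d, CAP_MS);                 // capped at 2s
    }

    static long jittered(int attempt, Random rnd) {
        double factor = 0.5 + rnd.nextDouble();     // uniform in [0.5, 1.5)
        return (long) (baseDelay(attempt) * factor);
    }
}
```

Because MySQL picks a deadlock victim and rolls it back immediately, the losing transaction can simply be retried after one of these delays; five attempts at this schedule spans roughly 0.6 to 8 seconds of cumulative wait.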

* Fix visibility

* [OPIK-4380] [BE] Address PR review comments for experiment aggregates

Fixed 11 automated review comments from baz-reviewer:

CRITICAL fixes:
- Prevent NPE on null span aggregations by adding coalesce() in SQL
- Handle multi-project experiments with LIMIT 1 in GET_PROJECT_ID
- Handle zero-item experiments with empty aggregation helpers
- Bind feedback_scores_percentiles map instead of empty CAST

HIGH priority fixes:
- Use toDecimal128(12) instead of toDecimal64(9) for cost percentiles
- Add null-safe tags handling with Optional.ofNullable()
- Include exception objects in retry logging for stack traces

MEDIUM priority fixes:
- Add missing log_comment to SELECT_EXPERIMENT_BY_ID query
- Add missing log_comment to GET_PROJECT_ID query

LOW priority fixes:
- Remove duplicate "id" binding in bindItemsParameters
- Enhance batchSize config documentation with details

All 11 integration tests passing.

* [OPIK-4382] [BE] Refactor experiment aggregates with import cleanup and Optional patterns

- Add missing imports for IntStream, ProjectStats, and other dependencies
- Replace fully-qualified class names with proper imports across DAO and Service classes
- Fix IS_NOT_EMPTY filter handling for FEEDBACK_SCORES_AGGREGATED strategies
- Refactor null checks to use Optional in mapping methods:
  - mapFeedbackScoreAggregations, mapExperimentFromAggregates
  - mapFeedbackScoreData, mapExperimentGroupAggregationItem
  - Batch insert preparation with Optional chains
- Improve code readability and maintainability with functional patterns

* [OPIK-4380] [BE] Fix table definition

* [OPIK-4380] [BE] Address PR comments and consolidate DatasetItemService methods

- Fix tags NPE in ExperimentAggregatesDAO with defaultIfNull
- Remove unnecessary FINAL clause from GET_EXPERIMENT_DATA query
- Fix test naming in ExperimentAggregatesIntegrationTest
- Consolidate 7 duplicate createVersionFromDelta methods into single canonical implementation
- Remove debug logger from config-test.yml

* [OPIK-4380] [BE] Fix missing log_comment and centralize search criteria binding

- Fix SELECT_EXPERIMENT_BY_ID to properly render log_comment metadata
  - Use getSTWithLogComment pattern in getExperimentFromAggregates
  - Ensures ClickHouse query logging populates workspace/experiment IDs

- Centralize ExperimentSearchCriteria binding logic
  - Create ExperimentSearchCriteriaBinder utility class
  - Parameterize filter strategies to support both DAO variants
  - Eliminate 29-line duplication between ExperimentDAO and ExperimentAggregatesDAO
  - Single source of truth prevents DAOs from getting out of sync

* [OPIK-4380] [BE] Fix createVersionFromDelta consolidation after rebase

- Update canonical method signature to include new parameters:
  - List<EvaluatorItem> evaluators
  - ExecutionPolicy executionPolicy
  - boolean clearExecutionPolicy

- Update all 5 caller sites to pass new parameters:
  - Use changes.evaluators(), changes.executionPolicy(), changes.clearExecutionPolicy() when available
  - Pass null/false for auto-generated versions that inherit from base

- Add imports for EvaluatorItem and ExecutionPolicy

Fixes compilation errors introduced by rebase with upstream changes to DatasetVersionService

* [OPIK-4380] [BE] Address PR review comments - fix type mismatch, extract constants, remove DAO logging

- Fixed BigDecimal[] to Double[] conversion for experiment_scores (matches ClickHouse Float64)
- Extracted FilterStrategy lists to static final constants to avoid repeated allocations
- Added @NonNull validation to populateExperimentAggregate parameter
- Removed DAO layer logging, keeping service-level logging only

* [OPIK-4380] [BE] Extract shared helper for experiment data pagination

Extract streamWithExperimentPagination() helper method to eliminate
duplication in getTracesData(), getSpansData(), and getFeedbackScoresData().

All three methods followed identical pattern:
- asyncTemplate.stream with connection
- getSTWithLogComment with cursor flag
- Bind workspace_id, experiment_id, project_id, limit
- Optional cursor binding
- Result mapping

Benefits:
- Single source of truth for pagination binding logic
- Prevents divergence when tweaking cursor/limit bindings
- Reduces code from ~20 lines to ~10 lines per method
- Type-safe generic implementation

Note: CTE redundancy (3x experiment_items scan) is intentional to avoid
passing large trace ID lists as parameters, which would cause performance
issues with 10K+ traces.

* [OPIK-4383] [BE] Add Redis stream subscriber for debounced experiment aggregates recomputation

- Add ExperimentDenormalizationConfig implementing StreamConfiguration with debounce, job lock, and per-experiment aggregation lock settings
- Add ExperimentAggregationMessage as stream message record
- Add ExperimentAggregatesSubscriber consuming from the denormalization stream; acquires a workspace-scoped distributed lock per experiment before calling populateAggregations()
- Add experimentDenormalizationEnabled feature flag to ServiceTogglesConfig and FeatureFlags
- Wire ExperimentDenormalizationConfig into OpikConfiguration
- Update config.yml and config-test.yml with full experimentDenormalization block (enabled for tests)
- Add ExperimentAggregatesSubscriberTest covering lifecycle gating and processEvent success/error paths

* Revision 2: Address PR comments - add config defaults, remove toggle, rename tests

- ExperimentDenormalizationConfig: add sensible defaults to all fields so
  Dropwizard validation doesn't fail when the config block is absent from
  old deployments (config.isEnabled()=false still gates the subscriber)
- Remove experimentDenormalizationEnabled service toggle from
  ServiceTogglesConfig, FeatureFlags, config.yml and config-test.yml -
  the infrastructure gate (config.isEnabled()) is the single control point
- Rename lifecycle test methods to camelCase per project conventions:
  startSkipsStartupWhenDisabled / stopSkipsShutdownWhenDisabled

* Revision 3: Add @Max(500) to consumerBatchSize and @NotNull to jobLockWaitTime

* [OPIK-4380] [BE] Address PR review comments - fix TYPE_REFERENCE visibility, redundant IN subquery, hardcoded context keys, Instant.now in loop, and inline defaultIfNull

* Revision 4: Address remaining JetoPistola review comments (#7, #8, #10)

- #7: Remove "Used for testing and verification" from getExperimentFromAggregates javadoc
- #8: Replace recursive flatMap with Mono.expand() in populateExperimentItemsInBatches
- #10: Remove unrelated subscribeOn addition from DatasetItemService.createVersionFromDelta

* Revision 3: Add switchIfEmpty fallback for deleted traces in populateExperimentAggregate

* Fix tests

* Revision 6: Move countTotal log from DAO to service layer

Operational logs belong in the service layer, not the DAO.

* Revision 7: Apply Spotless formatting

* Revision 8: Make populateAggregations(UUID, int) private

Removes the uncapped public batch size entry point. All callers now go
through the public no-arg overload which reads batchSize safely from config.

* [OPIK-4380] [BE] Add evaluation_method support to experiment_aggregates pipeline

- Add ClickHouse migration (000062) to add evaluation_method column to experiment_aggregates table
- Add evaluationMethod field to ExperimentData record
- Update GET_EXPERIMENT_DATA query to read evaluation_method from experiments
- Update INSERT_EXPERIMENT_AGGREGATE to write evaluation_method to experiment_aggregates
- Update SELECT_EXPERIMENT_BY_ID to read evaluation_method from experiment_aggregates
- Fix Experiment record constructor call: insert EvaluationMethod at correct position (10)

* [OPIK-4380] [BE] Extract shared helper for experiment aggregation queries

Reduce copy-paste in getTraceAggregations, getSpanAggregations, and
getFeedbackScoreAggregations by extracting queryExperimentAggregation,
which centralises the context-aware execution, workspace/experiment/project
parameter binding, and singleOrEmpty pattern shared by all three methods.

* [OPIK-4380] [BE] Enforce non-null contract on countTotal criteria parameter

Add @NonNull to ExperimentSearchCriteria in the interface and implementation
so that a null argument fails fast with an explicit NullPointerException at
the DAO boundary instead of crashing deep inside buildCountTemplate.

* [OPIK-4380] [BE] Fix countTotal ignoring target project IDs in normal path

target_project_ids was only applied inside the project_deleted LEFT JOIN
subquery; the main WHERE had no project restriction, so counts were
workspace-wide. Reuse has_target_projects in the main WHERE so
project_id IN :target_project_ids always takes effect. Also replace
manual null/empty checks with CollectionUtils.isNotEmpty.

* [OPIK-4380] [BE] Apply Spotless formatting

* [OPIK-4382] [BE] Address PR review comments on experiment aggregates

- Fix :versionId → :version_id parameter naming in SQL templates and bindings
- Fix last_updated_at binding to use item.lastUpdatedAt() instead of Instant.now()
- Fix FEEDBACK_SCORES_AGGREGATED_IS_EMPTY filter: embed generated SQL into templates
  instead of hard-coded static condition, and add missing bind calls
- Fix RetryUtils log duplication (remove getMessage() + pass exception directly)
- Add batchSize = 1000 default in ExperimentAggregatesConfig
- Extract resolveVersionIdForCriteria helper to deduplicate version-id resolution
- Add null/blank/ClickHouse placeholder guards in extractUuidsFromGroupValues
- Extract loadEntityMap helper to deduplicate getEnrichInfoHolder enrichment logic

* Revision 3: Address PR comments E, F, G, H

- Fix E: Extract shared template/bind helpers in ExperimentAggregatesDAO
- Fix F: Bind experiment_ids as UUID[] instead of String[]
- Fix G+H: Extract getEnrichInfoHolder logic into ExperimentGroupEnricher,
  eliminating duplication between ExperimentService and ExperimentAggregatesService
  without introducing a direct dependency between them

* Revision 4: Fix ExperimentServiceTest to include ExperimentGroupEnricher mock

* [OPIK-4382] [BE] Extract shared Row→ExperimentGroup mappers into ExperimentGroupMappers

Pull the duplicated Row→ExperimentGroupItem and Row→ExperimentGroupAggregationItem
conversion logic from ExperimentDAO and ExperimentAggregatesDAO into a shared
ExperimentGroupMappers utility class. Both DAOs now delegate to the same
toExperimentGroupItem / toExperimentGroupAggregationItem helpers, eliminating
the need to mirror DTO mapping changes in two places.

* [OPIK-4382] [BE] Deduplicate bindGroupCriteria into ExperimentGroupMappers

Moves the shared group-criteria binding logic out of ExperimentDAO and
ExperimentAggregatesDAO into ExperimentGroupMappers.bindGroupCriteria(),
following the same pattern as ExperimentSearchCriteriaBinder.  Adding or
fixing a criteria binding now only requires a change in one place.

* [OPIK-4382] [BE] Extract streamGroupQuery helper and fix null percentiles

- Deduplicate findGroups/findGroupsAggregations into a single private
  streamGroupQuery(queryTemplate, criteria, rowMapper) that differs only
  by the query constant and BiFunction row mapper.

- Fix convertToBigDecimal to return null for null/unsupported inputs so
  absent p50/p90/p99 entries in getDuration propagate as null rather than
  BigDecimal.ZERO, preserving the semantic-null that lets callers apply
  COALESCE/fallback logic correctly.

* [OPIK-4382] [BE] Consolidate cost/duration helpers into ExperimentGroupMappers

Promote getCostValue and getDuration to public static in
ExperimentGroupMappers and delete the private copies in ExperimentDAO.
ExperimentDAO.mapToDto now delegates to the shared helpers, so any
future change to cost filtering, duration percentile extraction, or the
BigDecimal conversion only needs to be made in one place.

Side-effect: ExperimentDAO.mapToDto also picks up the null-percentile
fix (convertToBigDecimal returns null for absent/unsupported inputs)
that was previously applied only to ExperimentGroupMappers.
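The null-percentile semantics can be sketched in isolation: absent or unsupported inputs map to null rather than BigDecimal.ZERO, so "no data" stays distinguishable from a genuine zero. The supported input types below are illustrative, not necessarily the exact set the shared helper handles:

```java
import java.math.BigDecimal;

// Sketch of the null-preserving conversion: a missing p50/p90/p99 entry
// propagates as null so callers can apply COALESCE/fallback logic, instead
// of being silently coerced to ZERO.
public class PercentileConversion {
    static BigDecimal convertToBigDecimal(Object value) {
        if (value instanceof BigDecimal bd) return bd;
        if (value instanceof Double d) return BigDecimal.valueOf(d);
        if (value instanceof Long l) return BigDecimal.valueOf(l);
        return null; // null or unsupported type: semantic null, not ZERO
    }
}
```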

* [OPIK-4382] [BE] Fix pagination count and add criteria filter tests

- Remove count() OVER () window function from paged query (returned
  page-scoped count instead of full result-set count)
- Replace with dedicated count query + short-circuit: skip items query
  when count == 0, use DatasetItemPage.empty() for that case
- Extract DatasetItemResultMapper.buildItemFromRow as public static
  helper reused by ExperimentAggregatesDAO
- Add parameterized integration tests for ExperimentGroupCriteria
  filters (name, types, projectId, combined, empty-result) covering
  both findGroups and findGroupsAggregations aggregate paths

* [OPIK-4380] [BE] Extract shared filter helpers into FilterQueryBuilder

Add FilterStrategyParam record, applyFiltersToTemplate and bindFilters
static helpers to FilterQueryBuilder, then replace duplicated per-strategy
loops in DatasetItemVersionDAO and ExperimentAggregatesDAO with single
delegating calls backed by per-DAO strategy constants.

* [OPIK-4382] [BE] Consolidate filter helpers in getExperimentItemsStatsFromAggregates

Add EXPERIMENT_ITEMS_STATS_FILTER_STRATEGY_PARAMS and
EXPERIMENT_ITEMS_STATS_BIND_STRATEGIES constants and replace the
per-strategy toAnalyticsDbFilters/bind blocks in
getExperimentItemsStatsFromAggregates with single delegating calls to
FilterQueryBuilder.applyFiltersToTemplate and bindFilters.

* Revision 9: Extract shared helpers to eliminate duplication across DAOs

- Create DatasetItemSearchCriteriaMapper: centralizes filters + search flag
  wiring for DatasetItemSearchCriteria, shared by DatasetItemVersionDAO and
  ExperimentAggregatesDAO
- Add ExperimentGroupMappers.applyGroupCriteriaToTemplate: centralizes
  ExperimentGroupCriteria → ST template wiring, now shared by ExperimentDAO
  and ExperimentAggregatesDAO
- Update DatasetItemVersionDAO.addFiltersToTemplate and bindSearchAndFilters
  to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentAggregatesDAO.applyDatasetItemFiltersToTemplate and
  bindDatasetItemSearchParams to delegate to DatasetItemSearchCriteriaMapper
- Update ExperimentDAO.newGroupTemplate and ExperimentAggregatesDAO.newGroupTemplate
  to delegate to ExperimentGroupMappers.applyGroupCriteriaToTemplate

* [OPIK-4383] [BE] Add experiment aggregate event listener and no-op publisher

* Revision 2: Fix missing import for ExperimentAggregationPublisher

* [OPIK-4383] [BE] Add ExperimentAggregationPublisher, ExperimentDenormalizationJob and tests

- ExperimentAggregationPublisher: debounces experiment aggregation triggers
  by writing compound workspaceId:experimentId members to a Redis ZSET scored
  by expiry timestamp (now + debounceDelay), plus a hash storing the userName
  with TTL=2×debounceDelay to handle stale entries.
- ExperimentDenormalizationJob: @Every("5s") job that reads ZSET members with
  score <= now, publishes ExperimentAggregationMessage to the Redis stream,
  then cleans up the ZSET entry and hash bucket. Handles stale entries
  (expired hash) by removing the orphaned ZSET member without publishing.
- Fix processExperiment reactive chain: avoid double index.remove by
  returning Mono<Boolean> from flatMap branches so switchIfEmpty is only
  triggered when the bucket is truly empty.
- ExperimentAggregationPublisherTest: integration tests with real Redis
  container verifying ZSET membership, score, userName storage, TTL,
  workspace isolation, and debounce deduplication.
- ExperimentDenormalizationJobTest: unit tests with Mockito covering disabled
  config, lock not acquired, empty ZSET, happy path, stale entry, and batch.

* Fix tests setup

* [OPIK-4383] [BE] Address PR review: move DAO logs to service layer

* [OPIK-4383] [BE] Address PR review: extract shared DAO helper and fix log placement

* [OPIK-4383] [BE] Short-circuit deleteByTraceIds when no spans found

Skip delete, cascading operations, and SpansDeleted event when
getSpanIdsForTraces returns an empty set, preserving the original
no-op behaviour and avoiding the Preconditions.checkArgument failure
in SpanDAO.deleteByIds.

* [OPIK-4383] [BE] Fix cascade deletion failures after trace delete

Two bugs prevented spans and attachments from being deleted when a trace
was deleted via the event-driven cascade:

1. FeedbackScoreService.deleteByTraceIds/deleteBySpanIds had @NonNull on
   projectId which threw NPE when TracesDeleted.projectId() was null.
   EventInterceptor swallowed the NPE, stopping the entire cascade chain.
   Fix: remove @NonNull since the DAO already handles null safely via
   Optional.ofNullable(projectId).

2. SpanDAO.DELETE_BY_IDS had the wrong column (trace_id) and parameter
   name (span_ids) — the ClickHouse R2DBC driver could not resolve :span_ids
   as a named parameter in the DELETE statement. Fixed by using id IN :ids
   to match the working pattern in TraceDAO.DELETE_BY_ID.

* [OPIK-4383] [BE] Address PR review comments on ExperimentDenormalizationJob

- Centralize Redis constants (EXPERIMENT_KEY_PREFIX, USER_NAME_FIELD,
  MEMBER_SEPARATOR) in ExperimentDenormalizationConfig
- Change ExperimentAggregationPublisher.publish() to return Mono<Void>
  instead of void, so errors propagate to callers
- Make job interval configurable via jobs map in config.yml
- Fix onErrorContinue logging: remove getMessage() duplication
- Demote per-experiment logs from INFO to DEBUG
- Add ZSET pagination using expand() to avoid materializing entire range
- Update tests for all changes

* Fix @Every job interval config key casing and add jobs section to test config

The dropwizard-jobs framework uses WordUtils.uncapitalize(class.getSimpleName())
to look up the interval in the jobs map, so the key must be
'experimentDenormalizationJob' (lowercase first letter). Also adds the missing
jobs section and jobBatchSize to config-test.yml.

* Replace @Every annotation with programmatic Quartz scheduling

Remove @Every from ExperimentDenormalizationJob and schedule it
programmatically in OpikGuiceyLifecycleEventListener, following the
same pattern as TraceThreadsClosingJob. Add jobInterval config field
to ExperimentDenormalizationConfig. Remove the jobs YAML section that
caused deserialization errors with JobConfiguration's immutable map.

* Add experiment context to error log and extract publishIfNotEmpty helper

- Include experimentId and workspaceId in onExperimentUpdated error log
- Extract publishIfNotEmpty helper to deduplicate filter+publish logic
  across triggerByExperimentIds, triggerByTraceIds, triggerBySpanIds

* Fix NPE in ExperimentAggregateEventListenerTest mock setup

Stub publisher.publish() to return Mono.empty() in setUp so
.subscribe() calls in production code don't NPE on null.

* [OPIK-4385] [BE] Use pre-computed aggregation tables for experiment endpoints

Apply UNION ALL hybrid pattern to ExperimentDAO (FIND, FIND_GROUPS,
FIND_GROUPS_AGGREGATIONS) and ExperimentItemDAO (STREAM,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS,
SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS) so that
experiments present in experiment_aggregates / experiment_item_aggregates
use pre-computed values, while others fall back to live JOIN computation.

Add ExperimentAggregatesIntegrationTest covering all 7 affected queries
with parameterized filter, pagination, and consistency scenarios.

* [OPIK-4386] [BE] Trigger lazy aggregation via publisher on GET experiment by ID

When fetching an experiment by ID, if the experiment is in COMPLETED or
CANCELLED state and is not yet present in the experiment_aggregates table,
enqueue it for aggregation using ExperimentAggregationPublisher instead of
computing aggregations synchronously. The check and publish are performed
off the critical path via doOnEach, so the caller receives the experiment
immediately without waiting for the side effect to complete.

* [OPIK-4384] [BE] Fix missing zero_uuid binding and experiment_scores sort alias

- Bind zero_uuid parameter in getById, getByIds, and get(ExperimentStreamRequest)
  methods that use the FIND query; the UNION ALL refactor introduced an
  experiments_from_aggregates CTE that requires this parameter but only the
  main find() method was binding it, causing 500 errors on those paths
- Fix SortingQueryBuilder to reference the outer column alias experiment_scores_agg
  instead of es.experiment_scores; the ORDER BY sits outside the UNION ALL so the
  inner es alias is out of scope, while experiment_scores_agg is the consistent
  output alias exposed by both branches

* [OPIK-4384] [BE] Fix null row injection from LEFT JOIN miss in feedback_scores and comments aggregation

Pre-aggregate feedback_scores_final and comments_final into subqueries
(GROUP BY entity_id) before LEFT JOIN in DatasetItemVersionDAO.STREAM.
When a LEFT JOIN has no match against a pre-aggregated subquery the
joined columns are NULL, so any(NULL) returns NULL instead of a
default-valued row with epoch timestamps that caused Instant.parse()
failures.

Also adds a regression test covering the no-scores path in
ExperimentAggregatesIntegrationTest.
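The shape of the fix can be illustrated outside SQL: aggregate per key first, then look up, so a miss stays absent instead of producing a default-valued row. A hedged Java analogue (record and method names are invented for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PreAggregateJoin {
    record Score(String entityId, double value) {}

    // Analogue of GROUP BY entity_id before the LEFT JOIN: aggregate first,
    // then look up per entity. A missing entity yields an absent entry
    // rather than a default-valued row with epoch timestamps.
    static Map<String, Double> aggregateByEntity(List<Score> scores) {
        return scores.stream().collect(Collectors.groupingBy(
                Score::entityId, Collectors.summingDouble(Score::value)));
    }

    public static void main(String[] args) {
        Map<String, Double> agg = aggregateByEntity(
                List.of(new Score("a", 1.0), new Score("a", 2.0)));
        System.out.println(agg.get("a"));       // 3.0
        System.out.println(agg.get("missing")); // null -> caller can skip it
    }
}
```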

* [OPIK-4383] [BE] Remove DAO-level log.info from ExperimentAggregatesDAO methods

Move operational logging responsibility to the service layer, consistent
with earlier fixes for ExperimentItemDAO and SpanDAO in this PR.

* Remove accidentally committed doc files

These files were introduced during merge resolution but should
not be part of the branch.

* [OPIK-4383] [BE] refactor: extract triggerAggregation helper to centralize guard+publish flow

* [OPIK-4386] [BE] fix: demote lazy aggregation check log to DEBUG

* [OPIK-4387] [BE] feat: wire aggregation publisher into finishExperiments endpoint

Chain experimentAggregationPublisher.publish() after AlertEvent in
finishExperiments() so experiments finished via POST /v1/private/experiments/finish
are published to Redis for aggregation computation.

* [OPIK-4383] [BE] fix: restore TagOperations.tagUpdateFragment in SpanDAO BULK_UPDATE

Restores proper tag handling in SpanDAO.BULK_UPDATE query that was
regressed to a simple arrayConcat. Now uses TagOperations.tagUpdateFragment()
which provides arrayDistinct(), tag limit enforcement (max 50), and
tags_to_add/tags_to_remove support. Also adds the required
short_circuit_function_evaluation SETTINGS for throwIf evaluation.

* [OPIK-4387] [BE] feat: add stream trimming to experiment denormalization XADD

Add streamMaxLen and streamTrimLimit configuration to bound Redis stream
growth on the experiment denormalization producer (ExperimentDenormalizationJob).
Uses Redisson's trimNonStrict().maxLen().limit() API for approximate trimming.

* [OPIK-4387] [BE] fix: make aggregation publish best-effort in finishExperiments

Swallow and log Redis/publish errors so finishExperiments returns 204
even when Redis is down. Aggregation will be retried by the lazy trigger
or next job cycle.

* [OPIK-4387] [BE] refactor: centralize Redis stream XADD trimming in RedisStreamUtils

Extract duplicate StreamAddArgs.entry().trimNonStrict().maxLen().limit()
into RedisStreamUtils.buildAddArgs() so stream trimming settings live in
one place. Updates all 5 producers.

* [OPIK-4387] [BE] fix: defer aggregation publish and update test for best-effort behavior

Wrap aggregation publisher in Mono.defer() so it subscribes only after
upstream completes, and update unit test to expect completion instead of
error propagation.
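The deferral semantics are easy to show with a plain Supplier as a stand-in for Mono.defer(); this is an analogy with no Reactor dependency, not the production code:

```java
import java.util.function.Supplier;

public class DeferSketch {
    static int publishes = 0;

    // Analogy for Mono.defer(): wrapping the work in a Supplier means
    // nothing runs at assembly time, only when the result is requested
    // (in Reactor terms: at subscription, after upstream completes).
    static Supplier<Integer> deferredPublish() {
        return () -> { publishes++; return 204; };
    }

    public static void main(String[] args) {
        Supplier<Integer> pipeline = deferredPublish();
        System.out.println(publishes);      // 0 -> assembling ran nothing
        System.out.println(pipeline.get()); // 204
        System.out.println(publishes);      // 1
    }
}
```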

* Adding InterruptableJob

* [OPIK-4383] [BE] Address PR review: expand safety valve, env var prefix

- Add batchSize-capped iteration counter to expand() to prevent
  infinite loops when ZSET entries fail to be removed
- Rename EXPERIMENT_DENORM_JOB_INTERVAL to OPIK_EXPERIMENT_DENORM_JOB_INTERVAL
  to follow the OPIK_ prefix convention
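The safety valve is the standard bounded-loop pattern: even if the underlying ZSET range never shrinks, the loop terminates after a fixed number of iterations. An illustrative sketch with invented names:

```java
import java.util.function.IntSupplier;

public class BoundedExpand {
    // Safety valve: cap iterations at maxIterations so a range that never
    // shrinks (e.g. removals silently failing) cannot loop forever.
    static int drain(IntSupplier remainingEntries, int maxIterations) {
        int iterations = 0;
        while (remainingEntries.getAsInt() > 0 && iterations < maxIterations) {
            iterations++; // in the real job: fetch a page, process, remove
        }
        return iterations;
    }

    public static void main(String[] args) {
        // A "stuck" range reports 5 entries forever; the cap still wins.
        System.out.println(drain(() -> 5, 10)); // 10
        // An empty range exits immediately.
        System.out.println(drain(() -> 0, 10)); // 0
    }
}
```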

* [OPIK-4384] [BE] Add branch optimization and CTE split to experiment queries

Use pre-computed experiment_aggregates table to optimize query execution:
- Add has_aggregated/has_raw flags to skip unnecessary UNION ALL branches in FIND/FIND_COUNT
- Add getAggregationBranchCounts pre-query to determine which branches are needed
- Apply CTE split pattern to FIND_GROUPS and FIND_GROUPS_AGGREGATIONS
- Update getById to leverage branch optimization via single-ID branch count query
- Add <if(id)> filter to SELECT_AGGREGATED_EXPERIMENT_IDS for getById support

* [OPIK-4384] [BE] Add missing 7-arg overload for getDatasetItemsWithExperimentItems

Fix test compilation error from merge: the remote branch added callers
with (UUID, List, null, null, List<SortingField>, String, String) signature
which needs a bridge overload to the 9-arg method.

* [OPIK-4384] [BE] Add conditional LIMIT push-up, missing CTE, and fix test precision

- Add conditional LIMIT push-up in STREAM query: push LIMIT into CTE
  when only one branch (raw or aggregated) is active for performance
- Add missing experiment_item_aggr_trace_scope CTE for aggregated branch
- Add AggregatedExperimentCounts record for experiment-level branching
- Fix MultiValueFeedbackScoresE2ETest precision assertion: use isEqualTo
  instead of isEqualByComparingTo to respect custom BigDecimal comparator

* [OPIK-4384] [BE] Push OFFSET into top_dataset_items CTE and fix BigDecimal comparator in DatasetsResourceTest

* [OPIK-4384] [BE] Add pass rate aggregation to experiment aggregates

Add pass_rate, passed_count, and total_count columns to experiment_aggregates
table and compute them during aggregation. Update ExperimentDAO queries to
select these columns from both raw and aggregated paths, returning NULL for
non-evaluation-suite experiments.

* Fix format

* Fix get by id

* Fix mapping

* Fix mapping

* [OPIK-4384] [BE] Use pre-aggregated comments from aggregate tables with ISO 8601 date formatting

Update retrieval queries in ExperimentDAO, DatasetItemVersionDAO, and ExperimentAggregatesDAO
to read comments_array_agg as JSON String from aggregate tables instead of live-querying the
comments table. Ensure UNION ALL type compatibility by wrapping raw paths with toJSONString()
and formatting dates as ISO 8601 for proper Jackson deserialization.

* [OPIK-4386] [BE] Increase debounceDelay in test config to prevent race condition

The denormalization job was processing finished experiments during test
execution with incomplete ClickHouse data, causing stale aggregated
values to be returned instead of fresh raw computations.

* [OPIK-4384] [BE] Use parameterized binding for dynamic sort keys and add deterministic tiebreaker

- Replace literal string interpolation in getTopSortExpression with
  parameterized bind variables (sf.bindKey()) to prevent SQL injection
- Remove fieldMapping filter from bindDynamicKeys so all dynamic keys
  are bound, including those used in the top_sorting SELECT expression
- Add deterministic tiebreaker (id DESC / dataset_item_id DESC) to both
  the push-top-limit CTE and the main ORDER BY for consistent pagination
- Fix experiment_items deduplication: use FINAL where DISTINCT was used
  and vice versa for consistency across query branches

* [OPIK-4384] [BE] Add mixed-state aggregation test for UNION ALL hybrid

Test creates 3 experiments, aggregates only 1, and queries all 3 to
exercise the UNION ALL hybrid path where has_aggregated and has_raw
are both true simultaneously.

* [OPIK-4384] [BE] Add isNotEmpty assertions to parameterized filter tests

Ensure filter scenarios actually match data by asserting content()
is not empty before and after aggregation in all parameterized filter
tests (find, findGroups, findGroupsAggregations).

* [OPIK-4384] [BE] refactor: extract assertion helpers to remove duplication in ExperimentAggregatesIntegrationTest

* [OPIK-4384] [BE] refactor: rename parseFlexibleInstant to parseInstant in FeedbackScoreMapper

* [OPIK-4384] [BE] Make LIMIT unconditional in FIND query

The LIMIT clause was gated on filter/sort flags, so plain paged requests
(only limit/offset) at the outer query level would not emit LIMIT.
Simplify to always emit LIMIT when the limit parameter is provided.

* [OPIK-4384] [BE] Fix comment ordering assertion in tests

ClickHouse groupUniqArray does not guarantee ordering, so comment
assertions must use ignoringCollectionOrder to avoid flaky failures.

* [OPIK-4384] [BE] Add branch conditionals to FIND_GROUPS/FIND_GROUPS_AGGREGATIONS and revert unconditional LIMIT

- Wrap SELECT branches in FIND_GROUPS and FIND_GROUPS_AGGREGATIONS with
  <if(has_aggregated)>/<if(has_raw)> conditionals to skip unnecessary
  branches when all experiments are aggregated or all are raw
- Add no-args getAggregationBranchCounts() overload for workspace-only
  pre-query (used by group/aggregation queries that lack experiment IDs)
- Update executeQueryWithTargetProjects to run both pre-queries in
  parallel via Mono.zip
- Revert commit 215a3f9 (unconditional LIMIT) which caused double
  LIMIT/OFFSET bug: CTE-level LIMIT + outer LIMIT made page 2+ return
  0 results. The complex conditional is correct — outer LIMIT is only
  needed when post-CTE processing may alter the result set.

* [OPIK-4384] [BE] Add branch conditionals to SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT

Wrap the UNION ALL in the count query with <if(has_aggregated)>/<if(has_raw)>
conditionals to skip unnecessary branches. Pass branch flags through
getCountWithExperimentFilters from the existing pre-query results.

* [OPIK-4384] [BE] Fix ClickHouse column resolution in COUNT query

Alias dataset_item_id as di_id in the COUNT subquery branches
to avoid column name ambiguity when ClickHouse 25.3's query
analyzer resolves COUNT(DISTINCT dataset_item_id) through a
LEFT JOIN with dataset_items_resolved which also has that column.

* [OPIK-4384] [BE] Use pre-computed comments in STREAM query and fix UNION ALL type mismatch

Aggregated branch now reads comments_array_agg directly from experiment_item_aggregates
instead of doing an expensive JOIN to the comments table. Raw branch converts comments
to JSON String via toJSONString(CAST(...)) so both branches output compatible types.

* [OPIK-4384] [BE] Fix target_project_ids bind error in FIND_GROUPS aggregated branch

* Fix issues

* [OPIK-4387] [BE] Fix missing closing brace in ExperimentServiceTest
ldaugusto added a commit that referenced this pull request Mar 26, 2026
* [OPIK-4891] [BE] Catch-up job for apply-to-past retention rules

Progressive historical data deletion for rules with applyToPast=true.
Estimates workspace span velocity at rule creation to triage into
small/medium/large tiers with appropriate chunk sizes.

Schema:
- Add catch_up_velocity, catch_up_cursor, catch_up_done columns
- Add idx_catch_up_pending composite index for catch-up queries

Velocity estimation:
- ClickHouse query: uniq(id) / weeks_active for spans below cutoff
- Handles TOO_MANY_ROWS (code 158) by defaulting to 1M/week
- Handles empty tables gracefully

Catch-up tiers (configurable thresholds):
- Small (<10K/week): batch up to 200, one-shot delete entire range
- Medium (10K-100K/week): 10 most outdated, 7-day chunks each
- Large (>100K/week): 1 most outdated, 2-day chunks
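The triage above reduces to a simple threshold function. The thresholds below are the ones named in this commit (configurable in the real job); note that a follow-up commit later reduced the large-tier chunk from 2 days to 1:

```java
public class CatchUpTier {
    // Thresholds from the commit message; configurable in the real job.
    static String tier(long spansPerWeek) {
        if (spansPerWeek < 10_000) return "small";    // one-shot full-range delete
        if (spansPerWeek <= 100_000) return "medium"; // 7-day chunks
        return "large";                               // day-sized chunks
    }

    public static void main(String[] args) {
        System.out.println(tier(5_000));   // small
        System.out.println(tier(50_000));  // medium
        System.out.println(tier(500_000)); // large
    }
}
```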

Execution:
- Runs after regular sliding-window pass in RetentionPolicyJob
- Priority: small first (quick wins), then medium, then large
- Cursor advances oldest→newest, marks done when reaching sliding window

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix large workspace chunk size: 2 days to 1 day

Large workspaces (>100K spans/week) process one day per catch-up
cycle, so each execution handles a manageable amount of data.

* Address PR review: fix type cast, null safety, error handling

- Fix Float64→Long ClassCastException: wrap velocity query with toUInt64()
- Fix null cursor NPE in deleteSmallBatch: filter nulls before min()
- Fix catch-up marking done on delete failure: remove onErrorResume,
  propagate errors so cursor/done only advances on success
- Make markDone/updateCursor non-blocking: wrap in Mono.fromRunnable
  on boundedElastic to avoid blocking Reactor threads

* Add config comments for catch-up settings

* Return oldest span time from velocity estimation, add scouting

- Velocity query now returns both spans_per_week and oldest_span_time
- Cursor starts at the actual oldest data, not service start date
- For huge workspaces (TOO_MANY_ROWS), scout month by month on traces
  table to find first day with data, avoiding months of no-op deletes
- If a monthly scout also hits row limit, use that month start as cursor

* Replace SQL string concatenation with @BindList in markCatchUpDoneBatch

Avoids fragile raw SQL construction pattern. Uses JDBI's @BindList
for parameterized IN clause, consistent with other DAOs in the codebase.

* Address review: rename vars for clarity, hide internal fields, add safety comment

- Rename upperBound/lowerBound to cutoffId/fromId in deleteSmallBatch
  for consistency with deleteOneChunk and DAO signatures
- Hide catchUpVelocity and catchUpCursor from API response (internal);
  only catchUpDone remains public as user-facing progress indicator
- Add comment explaining NULL cursor safety in catch-up DAO queries

* Guard cursor >= upperBound, isolate catch-up errors, expose cursor in API

- Skip delete and mark done if cursor already past sliding window boundary
- Wrap catch-up cycle in onErrorResume so failures don't kill regular retention
- Re-expose catchUpCursor in API (useful for users to see cleanup progress);
  catchUpVelocity remains hidden (internal implementation detail)

* Revert scouting to simple blocking loop, improve schema comments

- Revert scoutFirstDataCursor from Flux back to blocking while-loop.
  Rule creation is a rare admin op; reactive complexity not justified.
- Improve catch_up_cursor and catch_up_done column comments to
  document cursor semantics (data before cursor has been deleted).

* Add unit tests for TOO_MANY_ROWS velocity estimation fallback

- RetentionRuleServiceVelocityTest: 6 tests covering the code 158
  exception path with mocked SpanDAO/TraceDAO. Tests scouting
  month-by-month, dense month fallback, service start date fallback,
  and non-158 exception rethrow.
- Remove large workspace integration test (max_rows_to_read profile
  setting also blocks normal inserts/deletes, making it impossible
  to trigger TOO_MANY_ROWS only on the estimation query)
- Keep small workspace catch-up integration test and applyToPast=false
  test in RetentionPolicyServiceTest
- Make estimateVelocity/scoutFirstDataCursor package-visible for testing

* Mark catch-up done when scouting finds no historical data

When the velocity estimation hits TOO_MANY_ROWS and scouting scans
every month without finding data, return velocity=0 with null cursor
so the rule is created with catchUpDone=true. Prevents hundreds of
empty 1-day chunk DELETE cycles.
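The scouting loop can be sketched with java.time: walk month by month, return the first month with data, or null when nothing is found (which maps to catchUpDone=true with no cursor). Names and shape are illustrative, not the production code:

```java
import java.time.YearMonth;
import java.util.function.Predicate;

public class ScoutSketch {
    // Walk forward month by month; null means no historical data at all,
    // so the rule is created with catchUpDone=true and no cursor.
    static YearMonth scoutFirstMonthWithData(
            YearMonth start, YearMonth end, Predicate<YearMonth> hasData) {
        for (YearMonth m = start; !m.isAfter(end); m = m.plusMonths(1)) {
            if (hasData.test(m)) return m;
        }
        return null;
    }

    public static void main(String[] args) {
        YearMonth found = scoutFirstMonthWithData(
                YearMonth.of(2024, 1), YearMonth.of(2024, 12),
                m -> !m.isBefore(YearMonth.of(2024, 6)));
        System.out.println(found); // 2024-06
        System.out.println(scoutFirstMonthWithData(
                YearMonth.of(2024, 1), YearMonth.of(2024, 3), m -> false)); // null
    }
}
```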

* Bump migration to 000061, simplify index, split rollback

- Rename migration from 000060 to 000061 (main advanced past 000060)
- Simplify index to (catch_up_done, catch_up_velocity) since
  catch_up_done=false already implies enabled=true and apply_to_past=true
- Split rollback into individual DROP COLUMN statements

* Address review comments from thiagohora and baz

- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded
  instead of collapsing to min(cursor) across all workspaces (#1)
- Add @NonNull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of
  switchIfEmpty chain to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO

* Remove scripts/.gitignore, lower disabled log to DEBUG

- Remove unnecessary .gitignore in scripts/ (test CSVs are local only)
- Lower catch-up disabled log from INFO to DEBUG to avoid 48 noisy
  log lines per day when catch-up is off

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>