Merged
Conversation
andrescrz
added a commit
that referenced
this pull request
Sep 30, 2025
Fix recovery command argument preservation
- Change ORIGINAL_COMMAND from '$0 $*' to '$0 "$@"'
- Ensures proper argument boundary preservation when arguments contain spaces
- Fixes recovery command functionality for complex invocations
Addresses PR comment #4
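The difference the commit describes can be reproduced with a minimal sketch; the `count_args` helper is illustrative and not part of dev-runner.sh:

```shell
# Demo: unquoted $* loses argument boundaries to word splitting; "$@" keeps them.
count_args() { echo "$#"; }

set -- "two words" --migrate   # simulate an invocation with a spaced argument
count_args $*                  # unquoted $* word-splits: prints 3
count_args "$@"                # quoted "$@" preserves boundaries: prints 2
```

This is why the recovery command broke for complex invocations: rebuilding the command line from `$*` merges and re-splits spaced arguments.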
andrescrz
added a commit
that referenced
this pull request
Sep 30, 2025
Improve recovery command formatting
- Change from '$0 "$@"' to '$0 $@' to avoid extra quotes in output
- Preserves arguments without adding unnecessary quotation marks
- Recovery command now shows: ./scripts/dev-runner.sh --migrate
- Instead of: ./scripts/dev-runner.sh "--migrate"
Addresses PR comment #4 - improved formatting
andrescrz
added a commit
that referenced
this pull request
Oct 1, 2025
* [OPIK-2175] Add database migrations support to dev-runner script
- Add run_db_migrations() function that executes both MySQL and ClickHouse migrations
- Integrate migrations into start_backend() to run automatically before backend starts
- Add --migrate command line option for independent migration execution
- Update help documentation and restart step descriptions
- Ensure migrations run for all backend start paths (start, restart, default)
- Reuse existing JAR file discovery logic and error handling patterns
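The wiring described above can be sketched roughly as follows. The function bodies here are stand-in stubs; only the call order (infrastructure first, then both migrations, then the backend) reflects the commit messages:

```shell
# Stub bodies for illustration; the real script runs the MySQL and
# ClickHouse Liquibase migrations from the discovered JAR here.
start_infrastructure()      { echo "infrastructure up"; }
run_mysql_migrations()      { echo "mysql migrated"; }
run_clickhouse_migrations() { echo "clickhouse migrated"; }

run_db_migrations() {
  start_infrastructure      # Revision 2: containers must be running first
  run_mysql_migrations
  run_clickhouse_migrations
}

start_backend() {
  run_db_migrations         # migrations run automatically before backend start
  echo "backend started"
}
```

A `--migrate` option can then call `run_db_migrations` directly, which is why the infrastructure check had to live inside the function rather than in `start_backend`.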
* Revision 2: Ensure infrastructure starts before running migrations
- Add start_infrastructure call to run_db_migrations function
- Prevents migration failures when --migrate is run independently
- Ensures MySQL and ClickHouse containers are running before migration attempts
- Fixes issue where migrations would fail without infrastructure running
* Revision 3: Fix ClickHouse migrations by setting ANALYTICS_DB_DATABASE_NAME
- Export ANALYTICS_DB_DATABASE_NAME=opik before running ClickHouse migrations
- Fixes Liquibase parameter substitution in migration files like 000014_add_thread_id_to_traces.sql
- Ensures ${ANALYTICS_DB_DATABASE_NAME} placeholder is properly replaced with 'opik'
- Resolves ClickHouse migration failures due to missing environment variable
* Revision 4: Fix MySQL migrations by setting STATE_DB_DATABASE_NAME
- Export STATE_DB_DATABASE_NAME=opik before running MySQL migrations
- Fixes Liquibase parameter substitution in MySQL migration files
- Resolves "Access denied for user 'opik'@'%' to database '${STATE_DB_DATABASE_NAME}'" error
- Ensures ${STATE_DB_DATABASE_NAME} placeholder is properly replaced with 'opik'
- Both MySQL and ClickHouse migrations now have proper environment variables set
* Revision 5: Fix ClickHouse database connection for migrations
- Export ANALYTICS_DB_MIGRATIONS_URL=jdbc:clickhouse://localhost:8123
- Resolves issue where migrations were trying to access wrong database
- Both database name parameter and connection URL are now properly configured
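Taken together, Revisions 3 to 5 amount to exporting the Liquibase substitution variables and the ClickHouse JDBC URL before invoking the migrations. A sketch, with the actual migration invocations left as comments since the JAR path varies:

```shell
# Variable names come from the commit messages above; values match the
# default local 'opik' database and the ClickHouse HTTP port.
export STATE_DB_DATABASE_NAME=opik                             # MySQL placeholder
export ANALYTICS_DB_DATABASE_NAME=opik                         # ClickHouse placeholder
export ANALYTICS_DB_MIGRATIONS_URL="jdbc:clickhouse://localhost:8123"

# With these set, the ${STATE_DB_DATABASE_NAME} / ${ANALYTICS_DB_DATABASE_NAME}
# placeholders resolve in the Liquibase changelogs. Illustrative invocations:
#   java -jar "$JAR_FILE" db migrate config.yml
#   java -jar "$JAR_FILE" dbAnalytics migrate config.yml
```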
* Revision 6: Add robust error handling for database migrations
- Provide clear recovery instructions if all attempts fail
- Warn users about data loss when suggesting Docker volume cleanup
* Revision 7: Improve migration recovery message with current command context
- Capture original command in ORIGINAL_COMMAND variable at script start
- Update print_migrations_recovery_message() to show the exact command to retry
- Provide more contextual recovery instructions for users
- Makes it easier to retry the exact same operation after volume cleanup
Example output:
3. Continue your current flow: ./scripts/dev-runner.sh --migrate
3. Continue your current flow: ./scripts/dev-runner.sh --restart --debug
This helps users understand exactly what command to run after recovery.
* Revision 8: Fix Docker volume cleanup instructions for effective recovery
- Replace ineffective 'docker volume prune -f' with comprehensive cleanup
* Revision 9: Make recovery instructions dynamic and improve Docker cleanup
- Use $0 instead of hardcoded script path for stop command
- Makes recovery instructions work regardless of how script is invoked
- Use 'docker volume prune -a -f' for more aggressive volume cleanup
- The '-a' flag removes all unused volumes, not just anonymous ones
Benefits:
- Works when script is called from different directories
- Works when script is symlinked or renamed
- More effective volume cleanup with '-a' flag
- Consistent dynamic path resolution throughout recovery message
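Revisions 7 to 9 combined suggest a recovery message along these lines. The function body and the `--stop` option name are illustrative; only the `$0` indirection, the `ORIGINAL_COMMAND` capture, and `docker volume prune -a -f` come from the commits:

```shell
# Captured at script start; the final revision uses plain $@ so the echoed
# command carries no extra quotes.
ORIGINAL_COMMAND="$0 $@"

print_migrations_recovery_message() {
  cat <<EOF
1. Stop the stack:       $0 --stop
2. Remove stale volumes: docker volume prune -a -f   # WARNING: deletes all unused volumes
3. Continue your current flow: ${ORIGINAL_COMMAND}
EOF
}
```

Because every path goes through `$0`, the instructions stay correct when the script is symlinked, renamed, or run from another directory.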
* Revision 10: Optimize Maven build for faster backend compilation
- Add parallel build with -T 1C (1 thread per CPU core)
- Skip JavaDoc generation with -Dmaven.javadoc.skip=true
- Skip source JAR generation with -Dmaven.source.skip=true
- Keep clean install and main JAR generation for proper builds
- Maintain test skipping for development speed
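The resulting build invocation, assembled from the flags listed above (the exact goals and module arguments in dev-runner.sh may differ):

```shell
# Flags from Revision 10; -DskipTests reflects the maintained test skipping.
MAVEN_BUILD_FLAGS="-T 1C -DskipTests -Dmaven.javadoc.skip=true -Dmaven.source.skip=true"

# Illustrative invocation (not run here):
#   mvn $MAVEN_BUILD_FLAGS clean install
echo "$MAVEN_BUILD_FLAGS"
```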
* Manual revision
* Update scripts/dev-runner.sh
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update scripts/dev-runner.sh
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix recovery command argument preservation
- Change ORIGINAL_COMMAND from '$0 $*' to '$0 "$@"'
- Ensures proper argument boundary preservation when arguments contain spaces
- Fixes recovery command functionality for complex invocations
Addresses PR comment #4
* Improve recovery command formatting
- Change from '$0 "$@"' to '$0 $@' to avoid extra quotes in output
- Preserves arguments without adding unnecessary quotation marks
- Recovery command now shows: ./scripts/dev-runner.sh --migrate
- Instead of: ./scripts/dev-runner.sh "--migrate"
Addresses PR comment #4 - improved formatting
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
JetoPistola
added a commit
that referenced
this pull request
Dec 1, 2025
Python SDK improvements:
- Import module instead of name for ExperimentScore (comment #2)
- Allow ExperimentScoreFunction to return single or List of ScoreResults (comment #3)
- Move experiment score verification to verify_experiment utility (comment #4)
Backend code quality:
- Simplify TypeReference diamond operator in ExperimentScore.java (comment #5)
- Remove overloaded constructor in FeedbackScoreNames.ScoreName (comment #6)
- Reuse ScoreName instead of ScoreNameWithType in DAO (comment #7)
- Add TODO for full primary key in ORDER BY (comment #8)
- Revert flakiness fix in TemplateUtilsTest.java (comment #9)
JetoPistola
added a commit
that referenced
this pull request
Dec 2, 2025
…nctions (#3989)
* Hide experiment_scores columns in the single experiment table
* Add SDK support for experiment_scores
* Add SDK support for experiment_scores
* Add BE functionality
* Typescript autogenerated code
* Documentation and FE update
* Address PR comments
* Address PR comments
* Fix PR comments
* Address PR comments
* Fix merge conflicts
* Fix tests
* Fix failing tests
* Fix failing tests
* Fix UI colors and column names
* Refactor: Extract common score averaging logic to eliminate duplication
* Harmonize experiment scores sorting to use map access from CTE
- Add experiment_scores_agg LEFT JOIN to non-grouped queries
- Simplify SortingQueryBuilder to use coalesce(map[key]) instead of complex JSON extraction
- Remove special case handling for experiment_scores in null direction logic
- Addresses PR review comments about query harmonization
* Remove early return for empty test results in experiment scores
- Allow experiment score functions to handle empty test results
- Some functions may want to return baseline/default scores with no data
- Addresses PR review comment about preventing score function execution
* Add E2E test for experiment scores functionality
- Test verifies experiment scoring functions work end-to-end
- Validates experiment scores appear in evaluation result
- Validates experiment scores are retrievable via SDK API
- Uses compute_max_score function to test score aggregation
- Addresses PR review comment about E2E test coverage
* Enhance experiment score computation to handle empty test results gracefully
- Update condition to return empty list if either scoring functions or test results are absent
- Ensures robustness in score computation logic
* Add Python SDK E2E test for experiment scores
- Tests experiment_scoring_functions parameter in evaluate()
- Verifies experiment scores are computed and returned in result
- Validates scores are persisted to backend API
- Tests aggregate metrics (max, min, avg) computation
- Addresses PR review comment about SDK test coverage
* Revert "Add E2E test for experiment scores functionality"
This reverts commit 50f9f8d.
* Apply DRY principle to score type mapping in ExperimentFeedbackScoresTab
- Extract addScoresToMap helper function to avoid duplication
- Works for both feedback_scores and experiment_scores
- Reduces code duplication and improves maintainability
- Fix parameter ordering (required before optional)
* [FE] Apply DRY principle to feedback/experiment scores handling
- useExperimentsTableConfig: Extract getScoreByName helper, eliminate duplicate accessorFn logic
- useCompareExperimentsChartsData: Extract createScoresMap helper for both score types
- CompareExperimentsDetails: Extract markScores helper to avoid duplicate map calls
- ExperimentsPage: Extract createScoresMap and getScoreNames helpers
- EvaluationSection: Use shared transformExperimentScores utility
- experimentScoreUtils: Refactor with formatScores helper to eliminate duplication
All changes maintain type safety and pass linting/typecheck
* Revision 7: Add missing experiment_scores_agg CTE to FIND query
* Revision 8: Fix experiment_scores sorting to use correct CTE alias 'es'
* Revision 9: Address all 9 PR review comments
Python SDK improvements:
- Import module instead of name for ExperimentScore (comment #2)
- Allow ExperimentScoreFunction to return single or List of ScoreResults (comment #3)
- Move experiment score verification to verify_experiment utility (comment #4)
Backend code quality:
- Simplify TypeReference diamond operator in ExperimentScore.java (comment #5)
- Remove overloaded constructor in FeedbackScoreNames.ScoreName (comment #6)
- Reuse ScoreName instead of ScoreNameWithType in DAO (comment #7)
- Add TODO for full primary key in ORDER BY (comment #8)
- Revert flakiness fix in TemplateUtilsTest.java (comment #9)
* Update return type of get_experiment_data method to use rest_api_types for consistency
* Revision 10: Add full primary key to ORDER BY clause
* Refactor test for standard deviation calculation in experiment scoring functions
Replaced hardcoded expected standard deviation value with a dynamic calculation using the statistics.stdev function for improved accuracy and maintainability.
* Add experiment_scores column to experiments table in migration 000048
This migration introduces a new column, experiment_scores, to the experiments table to store precomputed metrics. The column is added with a default value of an empty string. A rollback statement is also included to drop the column if necessary.
* Update import statement for Prompt in evaluator.py to reflect new module structure
* Refactor whitespace in verifiers.py for improved readability
This commit removes unnecessary blank lines in the verify_experiment and _verify_experiment_scores functions, enhancing the overall clarity of the code without altering functionality.
* Enhance type hinting in dataset and experiment modules
This commit adds future annotations to the dataset REST operations and introduces TYPE_CHECKING for conditional imports in the experiment module, improving type hinting and code clarity without affecting functionality.
* Update documentation to replace `experiment_scores` with `experiment_scoring_functions` for consistency across evaluation methods
* Refactor score type handling in experiment feedback components
This commit replaces string literals for score types with constants, enhancing type safety and code clarity across various components, including ExperimentFeedbackScoresTab, ExperimentItemsTab, and related utility functions. The changes ensure consistent usage of SCORE_TYPE_FEEDBACK and SCORE_TYPE_EXPERIMENT throughout the codebase.
* Refactor column mapping for sorting functionality
This commit consolidates the logic for converting underscore-prefixed column IDs to dot notation into a single array of sortable prefixes. The `mapComplexColumn` function is updated to iterate over this array, improving code clarity and maintainability while ensuring consistent handling of various column types.
* Implement ExperimentScoreListCell and refactor score handling in data tables
This commit introduces the new ExperimentScoreListCell component for displaying experiment scores and updates the relevant data tables to utilize this component. Additionally, it refactors the handling of score types across various components, replacing string literals with constants for improved type safety and consistency. The changes affect the ExperimentsPage, ProjectsPage, and other related components, ensuring a unified approach to score type management.
* Refactor FeedbackScoresChartsWrapper and FeedbackScoreHoverCard for consistency
This commit updates the FeedbackScoresChartsWrapper component to rename the `isAggregationScores` prop to `areAggregatedScores` for improved clarity. Additionally, it modifies the subtitle text in the FeedbackScoreHoverCard component to use "Aggregated experiment scores" and "Average feedback scores" for consistency in terminology across the application.
* Add experiment scores tab to CompareExperimentsPage and update score handling
This commit introduces a new tab for displaying experiment scores in the CompareExperimentsPage. It updates the ExperimentFeedbackScoresTab component to handle both feedback and experiment scores based on the selected tab. The score retrieval logic is modified to filter scores according to their type, enhancing clarity and usability in the comparison of experiments.
* run fern generate
* Refactor score handling in various components to unify feedback and experiment score logic
Removed experiment score references and updated feedback score components to handle aggregated scores. Adjusted column definitions and metadata across multiple pages for consistency.
* Add migration to include experiment_scores column in experiments table
---------
Co-authored-by: Daniel Dimenshtein <danield@comet.com>
Co-authored-by: Ido Berkovich <ido@comet.com>
Co-authored-by: Boris Feld <boris@comet.com>
Co-authored-by: YarivHashaiComet <yarivh@comet.com>
Lothiraldan
added a commit
that referenced
this pull request
Dec 11, 2025
* [OPIK-3397] [SDK] Add token usage tracking to DSPy integration
- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking
* Revision 2: Address PR review comments
- Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
- Fix #2: Only add cache_hit to metadata when value is not None
- Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher
* Revision 3: Fix cache_hit handling for DSPy versions where cache_hit is None
Some DSPy versions set response.cache_hit=None instead of False for non-cached responses. Normalize None to False for consistent behavior across versions.
* Revision 3: Fix cache_hit detection for Python 3.9 compatibility
- Infer cache_hit from empty usage when response.cache_hit is None or missing
- Handle different DSPy versions that may not set cache_hit attribute
- Simplify logic with explicit conditionals for readability
* Revision 4: Move uuid import to module level
Address PR review comment - move import uuid from inside test functions to the top of the file following Python conventions.
* Revision 5: Replace custom matchers with ANY_DICT.containing()
Address PR review feedback from @alexkuzmik:
- Replace _get_usage_dict_matcher() with ANY_DICT.containing()
- Replace _get_metadata_with_created_from_matcher() with ANY_DICT.containing()
This is simpler and uses the existing testlib utilities.
juanferrub
pushed a commit
that referenced
this pull request
Dec 18, 2025
* [OPIK-3397] [SDK] Add token usage tracking to DSPy integration
- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking
* Revision 2: Address PR review comments
- Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
- Fix #2: Only add cache_hit to metadata when value is not None
- Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher
* Revision 3: Fix cache_hit handling for DSPy versions where cache_hit is None
Some DSPy versions set response.cache_hit=None instead of False for non-cached responses. Normalize None to False for consistent behavior across versions.
* Revision 3: Fix cache_hit detection for Python 3.9 compatibility
- Infer cache_hit from empty usage when response.cache_hit is None or missing
- Handle different DSPy versions that may not set cache_hit attribute
- Simplify logic with explicit conditionals for readability
* Revision 4: Move uuid import to module level
Address PR review comment - move import uuid from inside test functions to the top of the file following Python conventions.
* Revision 5: Replace custom matchers with ANY_DICT.containing()
Address PR review feedback from @alexkuzmik:
- Replace _get_usage_dict_matcher() with ANY_DICT.containing()
- Replace _get_metadata_with_created_from_matcher() with ANY_DICT.containing()
This is simpler and uses the existing testlib utilities.
YarivHashaiComet
added a commit
that referenced
this pull request
Jan 20, 2026
- Fix #1: Use trace provider when model not found (provider fallback)
- Fix #3: Add role mapping for external roles (tool, function, human, etc.)
- Fix #4: Support 'type' property for LangChain/LangGraph messages
- Fix #6: Add empty array check in canOpenInPlayground
- Fix #7: Check span input before using, fallback to trace input
- Fix #9/#10: Handle { messages: [] } case properly
JetoPistola
added a commit
that referenced
this pull request
Mar 19, 2026
…onUtils, SQL cleanup
- Use JsonUtils.readValue() instead of getMapper().readValue() (comment #1)
- Replace explicit CAST with tuple() in SQL for type flexibility (comments #2, #3)
- Change passed column from UInt8 to Enum8('passed'=0,'failed'=1) (comment #4)
- Add AssertionStatus enum used end-to-end from DB to API response
- Update all SQL queries using toFloat64(passed) to toFloat64(passed = 'passed')
- Add project_id filter to assertion_results query in DatasetItemVersionDAO (comment #6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andrescrz
added a commit
that referenced
this pull request
Mar 19, 2026
* [OPIK-4942] [BE] POC: Separate assertion_results table (Option E)
Demonstrates the architecture for storing assertion results in a dedicated
ClickHouse table instead of piggybacking on feedback_scores with
category_name='suite_assertion'.
Changes:
- New assertion_results ClickHouse table (migration 000070)
- AssertionResultDAO for writing assertion data to the new table
- FeedbackScoreDAO splits writes: assertions -> assertion_results, regular -> feedback_scores
- ExperimentItemDAO STREAM query adds assertion_results_per_trace CTE
- ExperimentItemMapper passes assertions_array to enrichWithAssertions
- AssertionResultMapper reads from dedicated column instead of partitioning feedback scores
Not included in this POC (would be needed for production):
- DatasetItemDAO/DatasetItemVersionDAO assertion CTE changes
- ExperimentAggregatesDAO pass rate aggregation from new table
- REST endpoint exclude_category_names cleanup
- Data migration for existing installations
- SDK changes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation: inline suite_assertion constant
The SUITE_ASSERTION_CATEGORY constant was removed from AssertionResultMapper
in the Option E refactor, but the test still referenced it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertion_results_per_trace CTE to compare endpoint
- Add assertion_results_per_trace CTE to DatasetItemVersionDAO (both
has_aggregated and has_raw branches) — compare endpoint was using
DatasetItemVersionDAO, not DatasetItemDAO which had the CTE
- Add arp.assertions_array at tuple index 19 in both branches
- Remove group.size() <= 1 guard in AssertionResultMapper.computeRunSummaries()
so run summaries are emitted when a dataset item has 1 run per experiment
- Add assertion_scores_avg Map column to experiment_aggregates (migration 000071)
- Add AssertionScoreAverage API record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Revert package-lock.json — unintentional change from lint hook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation errors
- AssertionResultMapperTest: update enrichWithAssertions calls to use
(item, jsonString) signature; rewrite tests for assertion_results
table approach (no longer reads from feedbackScores); update
computeRunSummaries_singleRun test to reflect removed group.size()<=1 guard
- ExperimentsResourceTest: remove extra null arg from getFeedbackScoreNames
calls (leftover from older branch version of the method)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Bump migration numbers to 071 and 072
000070 conflicts with 000070_add_project_id_to_experiments.sql from main.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Use JsonUtils import in AssertionResultMapper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Update test: runSummaries emitted for single-run suite experiments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Remove misleading comment from runSummaries test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Narrow catch to JsonProcessingException in AssertionResultMapper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertionScores to EXPERIMENT_IGNORED_FIELDS in test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertionScores assertions to PassRate tests + fix score routing
- Add .categoryName("suite_assertion") to PassRate.score() helper so
scores route to assertion_results table (required for pass rate SQL)
- Fix itemThreshold test: set per-item executionPolicy in createDatasetItems
instead of applyDatasetItemChanges to avoid version-2 row-ID mismatch
- Add assertionScores assertions to 4 tests: thenReturnPassRate (2/3),
multipleAssertions (scoreName1=1.0, scoreName2=0.5), passThresholdNotMet
(1/3), and itemThreshold (4/6)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Address PR review comments: AssertionStatus enum, JsonUtils, SQL cleanup
- Use JsonUtils.readValue() instead of getMapper().readValue() (comment #1)
- Replace explicit CAST with tuple() in SQL for type flexibility (comments #2, #3)
- Change passed column from UInt8 to Enum8('passed'=0,'failed'=1) (comment #4)
- Add AssertionStatus enum used end-to-end from DB to API response
- Update all SQL queries using toFloat64(passed) to toFloat64(passed = 'passed')
- Add project_id filter to assertion_results query in DatasetItemVersionDAO (comment #6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation: use AssertionStatus enum instead of boolean assertions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add AssertionResultService + fix toJSONString tuple serialization
Move assertion score routing from FeedbackScoreDAO to service layer via
dedicated AssertionResultService. Fix assertion_results query where
toJSONString(tuple(...)) produced arrays instead of objects — use CAST
with named Tuple type so toJSONString emits JSON objects matching
AssertionResultRow record.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Remove suite_assertion exclusion tests from Traces and Projects
suite_assertion scores now go to the separate assertion_results table,
so exclude_category_names filtering is no longer needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Return boolean passed field in AssertionResult API response
Map AssertionStatus enum to boolean in AssertionResultMapper so
SDK/FE consumers receive passed: true/false instead of passed/failed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
itamargolan
pushed a commit
that referenced
this pull request
Mar 25, 2026
* [OPIK-4942] [BE] POC: Separate assertion_results table (Option E)
Demonstrates the architecture for storing assertion results in a dedicated
ClickHouse table instead of piggybacking on feedback_scores with
category_name='suite_assertion'.
Changes:
- New assertion_results ClickHouse table (migration 000070)
- AssertionResultDAO for writing assertion data to the new table
- FeedbackScoreDAO splits writes: assertions -> assertion_results, regular -> feedback_scores
- ExperimentItemDAO STREAM query adds assertion_results_per_trace CTE
- ExperimentItemMapper passes assertions_array to enrichWithAssertions
- AssertionResultMapper reads from dedicated column instead of partitioning feedback scores
Not included in this POC (would be needed for production):
- DatasetItemDAO/DatasetItemVersionDAO assertion CTE changes
- ExperimentAggregatesDAO pass rate aggregation from new table
- REST endpoint exclude_category_names cleanup
- Data migration for existing installations
- SDK changes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation: inline suite_assertion constant
The SUITE_ASSERTION_CATEGORY constant was removed from AssertionResultMapper
in the Option E refactor, but the test still referenced it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertion_results_per_trace CTE to compare endpoint
- Add assertion_results_per_trace CTE to DatasetItemVersionDAO (both
has_aggregated and has_raw branches) — compare endpoint was using
DatasetItemVersionDAO, not DatasetItemDAO which had the CTE
- Add arp.assertions_array at tuple index 19 in both branches
- Remove group.size() <= 1 guard in AssertionResultMapper.computeRunSummaries()
so run summaries are emitted when a dataset item has 1 run per experiment
- Add assertion_scores_avg Map column to experiment_aggregates (migration 000071)
- Add AssertionScoreAverage API record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Revert package-lock.json — unintentional change from lint hook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation errors
- AssertionResultMapperTest: update enrichWithAssertions calls to use
(item, jsonString) signature; rewrite tests for assertion_results
table approach (no longer reads from feedbackScores); update
computeRunSummaries_singleRun test to reflect removed group.size()<=1 guard
- ExperimentsResourceTest: remove extra null arg from getFeedbackScoreNames
calls (leftover from older branch version of the method)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Bump migration numbers to 071 and 072
000070 conflicts with 000070_add_project_id_to_experiments.sql from main.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Use JsonUtils import in AssertionResultMapper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Update test: runSummaries emitted for single-run suite experiments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Remove misleading comment from runSummaries test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Narrow catch to JsonProcessingException in AssertionResultMapper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertionScores to EXPERIMENT_IGNORED_FIELDS in test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertionScores assertions to PassRate tests + fix score routing
- Add .categoryName("suite_assertion") to PassRate.score() helper so
scores route to assertion_results table (required for pass rate SQL)
- Fix itemThreshold test: set per-item executionPolicy in createDatasetItems
instead of applyDatasetItemChanges to avoid version-2 row-ID mismatch
- Add assertionScores assertions to 4 tests: thenReturnPassRate (2/3),
multipleAssertions (scoreName1=1.0, scoreName2=0.5), passThresholdNotMet
(1/3), and itemThreshold (4/6)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Address PR review comments: AssertionStatus enum, JsonUtils, SQL cleanup
- Use JsonUtils.readValue() instead of getMapper().readValue() (comment #1)
- Replace explicit CAST with tuple() in SQL for type flexibility (comments #2, #3)
- Change passed column from UInt8 to Enum8('passed'=0,'failed'=1) (comment #4)
- Add AssertionStatus enum used end-to-end from DB to API response
- Update all SQL queries using toFloat64(passed) to toFloat64(passed = 'passed')
- Add project_id filter to assertion_results query in DatasetItemVersionDAO (comment #6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation: use AssertionStatus enum instead of boolean assertions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add AssertionResultService + fix toJSONString tuple serialization
Move assertion score routing from FeedbackScoreDAO to service layer via
dedicated AssertionResultService. Fix assertion_results query where
toJSONString(tuple(...)) produced arrays instead of objects — use CAST
with named Tuple type so toJSONString emits JSON objects matching
AssertionResultRow record.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
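The array-vs-object distinction can be illustrated in Python (field names are hypothetical): serializing an unnamed tuple produces a JSON array, much like ClickHouse's toJSONString(tuple(...)), while attaching field names, as the CAST to a named Tuple does, produces a JSON object that can be deserialized into a record such as AssertionResultRow:

```python
import json

# An unnamed tuple serializes to a JSON array; a name-to-value mapping
# serializes to a JSON object (illustrative field names).
row = ("my_assertion", "passed", 1.0)

as_array = json.dumps(row)  # analogous to toJSONString(tuple(...))
as_object = json.dumps(dict(zip(("name", "status", "value"), row)))  # named Tuple
```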
* [OPIK-4942] [BE] Remove suite_assertion exclusion tests from Traces and Projects
suite_assertion scores now go to the separate assertion_results table,
so exclude_category_names filtering is no longer needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Return boolean passed field in AssertionResult API response
Map AssertionStatus enum to boolean in AssertionResultMapper so
SDK/FE consumers receive passed: true/false instead of passed/failed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
andriidudar
added a commit
that referenced
this pull request
Mar 26, 2026
…eedback note, and active task details
andriidudar
added a commit
that referenced
this pull request
Mar 26, 2026
…ion `GitHubStarButton` in `TopBar` layout
refactor: update `SideBar` button spacing and enhance `GitHubStarButton` with `TooltipWrapper` for improved UX
refactor: adjust `GitHubStarButton` styles for compact layout (smaller button, icon, and text)
refactor: remove plugin Logo, deduplicate to shared, clean up dead code and unused components
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce header height, adjust logo dimensions, refine layout styles, and introduce `UpgradeButton` with necessary integration.
docs: add OPIK-4617 status update #4 with milestone progress, T3 UX feedback note, and active task details
andriidudar
added a commit
that referenced
this pull request
Mar 26, 2026
…ion `GitHubStarButton` in `TopBar` layout (#5874)
refactor: update `SideBar` button spacing and enhance `GitHubStarButton` with `TooltipWrapper` for improved UX
refactor: adjust `GitHubStarButton` styles for compact layout (smaller button, icon, and text)
refactor: remove plugin Logo, deduplicate to shared, clean up dead code and unused components
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce header height, adjust logo dimensions, refine layout styles, and introduce `UpgradeButton` with necessary integration.
docs: add OPIK-4617 status update #4 with milestone progress, T3 UX feedback note, and active task details
ldaugusto
added a commit
that referenced
this pull request
Mar 26, 2026
- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded instead of collapsing to min(cursor) across all workspaces (#1)
- Add @NonNull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of switchIfEmpty chain to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO
ldaugusto
added a commit
that referenced
this pull request
Mar 26, 2026
* [OPIK-4891] [BE] Catch-up job for apply-to-past retention rules
Progressive historical data deletion for rules with applyToPast=true. Estimates workspace span velocity at rule creation to triage into small/medium/large tiers with appropriate chunk sizes.
Schema:
- Add catch_up_velocity, catch_up_cursor, catch_up_done columns
- Add idx_catch_up_pending composite index for catch-up queries
Velocity estimation:
- ClickHouse query: uniq(id) / weeks_active for spans below cutoff
- Handles TOO_MANY_ROWS (code 158) by defaulting to 1M/week
- Handles empty tables gracefully
Catch-up tiers (configurable thresholds):
- Small (<10K/week): batch up to 200, one-shot delete entire range
- Medium (10K-100K/week): 10 most outdated, 7-day chunks each
- Large (>100K/week): 1 most outdated, 2-day chunks
Execution:
- Runs after regular sliding-window pass in RetentionPolicyJob
- Priority: small first (quick wins), then medium, then large
- Cursor advances oldest→newest, marks done when reaching sliding window
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix large workspace chunk size: 2 days to 1 day
Large workspaces (>100K spans/week) process one day per catch-up cycle, so each execution handles a manageable amount of data.
* Address PR review: fix type cast, null safety, error handling
- Fix Float64→Long ClassCastException: wrap velocity query with toUInt64()
- Fix null cursor NPE in deleteSmallBatch: filter nulls before min()
- Fix catch-up marking done on delete failure: remove onErrorResume, propagate errors so cursor/done only advances on success
- Make markDone/updateCursor non-blocking: wrap in Mono.fromRunnable on boundedElastic to avoid blocking Reactor threads
* Add config comments for catch-up settings
* Return oldest span time from velocity estimation, add scouting
- Velocity query now returns both spans_per_week and oldest_span_time
- Cursor starts at the actual oldest data, not service start date
- For huge workspaces (TOO_MANY_ROWS), scout month by month on traces table to find first day with data, avoiding months of no-op deletes
- If a monthly scout also hits row limit, use that month start as cursor
* Replace SQL string concatenation with @BindList in markCatchUpDoneBatch
Avoids fragile raw SQL construction pattern. Uses JDBI's @BindList for parameterized IN clause, consistent with other DAOs in the codebase.
* Address review: rename vars for clarity, hide internal fields, add safety comment
- Rename upperBound/lowerBound to cutoffId/fromId in deleteSmallBatch for consistency with deleteOneChunk and DAO signatures
- Hide catchUpVelocity and catchUpCursor from API response (internal); only catchUpDone remains public as user-facing progress indicator
- Add comment explaining NULL cursor safety in catch-up DAO queries
* Guard cursor >= upperBound, isolate catch-up errors, expose cursor in API
- Skip delete and mark done if cursor already past sliding window boundary
- Wrap catch-up cycle in onErrorResume so failures don't kill regular retention
- Re-expose catchUpCursor in API (useful for users to see cleanup progress); catchUpVelocity remains hidden (internal implementation detail)
* Revert scouting to simple blocking loop, improve schema comments
- Revert scoutFirstDataCursor from Flux back to blocking while-loop. Rule creation is a rare admin op; reactive complexity not justified.
- Improve catch_up_cursor and catch_up_done column comments to document cursor semantics (data before cursor has been deleted).
* Add unit tests for TOO_MANY_ROWS velocity estimation fallback
- RetentionRuleServiceVelocityTest: 6 tests covering the code 158 exception path with mocked SpanDAO/TraceDAO. Tests scouting month-by-month, dense month fallback, service start date fallback, and non-158 exception rethrow.
- Remove large workspace integration test (max_rows_to_read profile setting also blocks normal inserts/deletes, making it impossible to trigger TOO_MANY_ROWS only on the estimation query)
- Keep small workspace catch-up integration test and applyToPast=false test in RetentionPolicyServiceTest
- Make estimateVelocity/scoutFirstDataCursor package-visible for testing
* Mark catch-up done when scouting finds no historical data
When the velocity estimation hits TOO_MANY_ROWS and scouting scans every month without finding data, return velocity=0 with null cursor so the rule is created with catchUpDone=true. Prevents hundreds of empty 1-day chunk DELETE cycles.
* Bump migration to 000061, simplify index, split rollback
- Rename migration from 000060 to 000061 (main advanced past 000060)
- Simplify index to (catch_up_done, catch_up_velocity) since catch_up_done=false already implies enabled=true and apply_to_past=true
- Split rollback into individual DROP COLUMN statements
* Address review comments from thiagohora and baz
- Use per-workspace cursors in deleteSmallBatch via deleteForRetentionBounded instead of collapsing to min(cursor) across all workspaces (#1)
- Add @NonNull on executeCatchUpCycle(now) parameter (#3)
- Log when catch-up is disabled (#3)
- Run all three tiers independently per cycle via Flux.concat instead of switchIfEmpty chain to prevent medium/large starvation (#4)
- Return null cursor when velocity=0, marking catch-up done immediately (#5)
- Preserve Instant directly instead of UUID round-trip in deleteOneChunk (#7)
- Hoist computeSlidingWindowStart out of per-rule loop (#8)
- Centralize extractInstant/compareUUID into RetentionUtils (#9)
- Remove unnecessary @UseStringTemplateEngine from catch-up queries (#10)
- Add explicit IS NOT NULL guard on catch_up_velocity queries (#11)
- Drop unused cnt column from scout query (#14)
- Fix Javadoc: 'oldest span ID' → 'oldest span time' in SpanDAO
* Remove scripts/.gitignore, lower disabled log to DEBUG
- Remove unnecessary .gitignore in scripts/ (test CSVs are local only)
- Lower catch-up disabled log from INFO to DEBUG to avoid 48 noisy log lines per day when catch-up is off
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
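The tier triage described in the catch-up commit can be sketched as follows (thresholds and chunk sizes are taken from the commit message, including the later 2-day to 1-day fix for large workspaces; names are illustrative, not the real Java implementation):

```python
from datetime import timedelta
from typing import Optional, Tuple

def triage(spans_per_week: float) -> Tuple[str, Optional[timedelta]]:
    """Map an estimated workspace span velocity to a catch-up tier."""
    if spans_per_week < 10_000:
        return "small", None                # one-shot delete of the entire range
    if spans_per_week <= 100_000:
        return "medium", timedelta(days=7)  # 7-day chunks, 10 most outdated
    return "large", timedelta(days=1)       # 1-day chunks (reduced from 2 days)
```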