[OPIK-5655] [Evaluation Suite] Evaluation suite task should support returning just the agent output by yaricom · Pull Request #6073 · comet-ml/opik

yaricom · 2026-04-03T14:25:48Z

Details

Current behavior:

When using evaluation suites, the task must return a dictionary with both an "input" and an "output" key so that the assertion can be scored:

def evaluation_task(data):
    response = call_llm(data)
    return {"output": response, "input": data}


eval_suite.run(
    task=evaluation_task,
)

Expected behavior

I expected to be able to return just the output of my agent from the evaluation task and have that scored against the assertion. We could simplify the experience by passing the evaluation suite item's data field as the input to the LLM-as-a-Judge metric and the task's return value as the output.

The SDK method would become:

def evaluation_task(data):
    return call_llm(data)


eval_suite.run(
    task=evaluation_task,
)

Summary

Relaxes the return-type contract for EvaluationSuite task functions: instead of requiring an explicit {"input": ..., "output": ...} dict, any non-dict return value is now automatically wrapped. Separately, a minimum anthropic version is added to the CrewAI integration test requirements to fix a dependency error.

Changes by Component

Python SDK

validate_task_result now accepts any return type: non-dict values are auto-wrapped as {"output": result}, and when input_data is also passed the wrapper becomes {"input": input_data, "output": result}.
Dict results are still validated to require both "input" and "output" keys, raising ValueError if either is missing.
_validated_task in __internal_api__run_optimization_suite__ now passes the item data dict as input_data so auto-wrapped results carry the task input automatically.
Updated run() and __internal_api__run_optimization_suite__ docstrings and examples to document both the simplified and explicit return styles.
Replaced the stale test_non_dict__raises_type_error test with correct auto-wrapping assertions; added 6 new unit tests covering all wrapping scenarios (string, int, None, list, with/without input_data).
Added anthropic>=0.88.0 to tests/library_integration/crewai/requirements_v1.txt to fix AttributeError: 'OpenAICompletion' object has no attribute 'client' in CrewAI integration tests.
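
The wrapping rules listed above can be illustrated with a minimal sketch. This is not the SDK's actual implementation: the function name validate_task_result_sketch, its signature, the error message, and the choice to treat input_data=None as "not provided" are all assumptions made for illustration; only the resulting dict shapes and the ValueError for incomplete dicts come from the description above.

```python
from typing import Any, Dict, Optional


def validate_task_result_sketch(
    result: Any, input_data: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Hypothetical stand-in for the SDK's validate_task_result."""
    if isinstance(result, dict):
        # Explicit style: the dict must carry both keys.
        missing = [key for key in ("input", "output") if key not in result]
        if missing:
            # Error text is illustrative, not the SDK's message.
            raise ValueError(f"Task result is missing required keys: {missing}")
        return result
    # Simplified style: wrap any non-dict value (str, int, None, list, ...).
    if input_data is not None:
        return {"input": input_data, "output": result}
    return {"output": result}


item = {"question": "What is 2 + 2?"}  # hypothetical suite item data
wrapped = validate_task_result_sketch("4", input_data=item)
bare = validate_task_result_sketch("4")
```

With these assumptions, a task that returns a plain string still yields a dict with both keys whenever the caller supplies the item data, which is what lets the assertion be scored without the task echoing its own input.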

Files Changed

 sdks/python/src/opik/api_objects/dataset/evaluation_suite/evaluation_suite.py | 75 ++++++++++++++++------
 sdks/python/tests/library_integration/crewai/requirements_v1.txt              |  1 +
 sdks/python/tests/unit/api_objects/dataset/evaluation_suite/test_evaluation_suite.py | 53 ++++++++++++---
 3 files changed, 98 insertions(+), 31 deletions(-)

Change checklist

User facing
Documentation update

Issues

Resolves #
OPIK-5655

AI-WATERMARK

AI-WATERMARK: yes

If yes:
- Tools: Claude Code v2.1.81
- Model(s): Sonnet 4.6
- Scope: Tests
- Human verification: Done

Testing

Added related unit tests

Documentation

Updated docstring

…on-dict results and improve test coverage

- Refactored `validate_task_result` to handle non-dict results by wrapping them into a standardized dictionary format.
- Updated method signature to include optional `input_data` for contextual wrapping.
- Extended test cases to verify behavior for various input types (e.g., strings, integers, None, lists).
- Improved docstrings and inline documentation for clarity and consistency with SDK patterns.

yaricom requested a review from a team as a code owner April 3, 2026 14:25

github-actions bot assigned yaricom Apr 3, 2026

github-actions bot added the python, tests, and Python SDK labels Apr 3, 2026

[OPIK-5655] Update requirements_v1.txt to include `anthropic>=0.88.…

c33c02a

…0` for resolving missing `client` attribute in `OpenAICompletion`.

baz-reviewer bot approved these changes Apr 3, 2026

alexkuzmik approved these changes Apr 7, 2026

yaricom merged commit 6ddc043 into main Apr 7, 2026
137 of 138 checks passed

yaricom deleted the yaricom/OPIK-5655-evaluation-suite-task-return branch April 7, 2026 11:05

yaricom merged 2 commits into main from yaricom/OPIK-5655-evaluation-suite-task-return
