[OPIK-5655] [Evaluation Suite] Evaluation suite task should support returning just the agent output #6073

Merged
yaricom merged 2 commits into main from yaricom/OPIK-5655-evaluation-suite-task-return on Apr 7, 2026

@yaricom yaricom commented Apr 3, 2026

Details

Current behavior:

When using evaluation suites, the task must return a dictionary with both "input" and "output" keys so that the assertion can be scored:

def evaluation_task(data):
    response = call_llm(data)
    return {"output": response, "input": data}


eval_suite.run(
    task=evaluation_task,
)

Expected behavior

I expected to be able to return just my agent's output from the evaluation task and have it scored against the assertion. We could simplify the experience by passing the evaluation suite item's data field as the input to the LLM-as-a-Judge metric and the task's return value as the output.

The SDK method would become:

def evaluation_task(data):
    return call_llm(data)


eval_suite.run(
    task=evaluation_task,
)
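
To make the proposed pairing concrete, here is a minimal sketch of what the suite could do on the caller's behalf: take the item's data as the input and the task's bare return value as the output. The helper `run_task_on_item`, the stand-in `call_llm`, and the sample item are hypothetical and exist only for illustration; they are not part of the Opik SDK.

```python
from typing import Any, Callable, Dict


def call_llm(data: Dict[str, Any]) -> str:
    # Stand-in for a real model call (assumption): echoes the question back.
    return f"echo: {data['question']}"


def evaluation_task(data: Dict[str, Any]) -> str:
    # Simplified style: return only the agent output.
    return call_llm(data)


def run_task_on_item(
    task: Callable[[Dict[str, Any]], Any], item_data: Dict[str, Any]
) -> Dict[str, Any]:
    # Hypothetical helper: pair the item data (input) with whatever the
    # task returned (output) so a judge metric can see both.
    return {"input": item_data, "output": task(item_data)}


record = run_task_on_item(evaluation_task, {"question": "What is 2 + 2?"})
print(record)
```

With this shape, the task author never builds the input/output dictionary by hand; the judge still receives both fields.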

Summary

Relaxes the return-type contract for EvaluationSuite task functions: an explicit {"input": ..., "output": ...} dict is no longer required, and any non-dict return value is now wrapped automatically. Also adds a minimum anthropic version to the CrewAI integration test requirements to fix a dependency error.

Changes by Component

Python SDK

  • validate_task_result now accepts any return type: non-dict values are auto-wrapped as {"output": result}, or as {"input": input_data, "output": result} when input_data is passed.
  • Dict results must still contain both "input" and "output" keys; a ValueError is raised if either is missing.
  • _validated_task in __internal_api__run_optimization_suite__ now passes the item data dict as input_data so auto-wrapped results carry the task input automatically.
  • Updated run() and __internal_api__run_optimization_suite__ docstrings and examples to document both the simplified and explicit return styles.
  • Replaced the stale test_non_dict__raises_type_error test with correct auto-wrapping assertions; added 6 new unit tests covering all wrapping scenarios (string, int, None, list, with/without input_data).
  • Added anthropic>=0.88.0 to tests/library_integration/crewai/requirements_v1.txt to fix AttributeError: 'OpenAICompletion' object has no attribute 'client' in CrewAI integration tests.
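
The wrapping and validation rules listed above can be sketched as follows. This is an illustrative reconstruction from the bullet points, not the code merged in this PR; the function name `validate_task_result_sketch`, its signature, and the error message are assumptions.

```python
from typing import Any, Dict, Optional

REQUIRED_KEYS = ("input", "output")


def validate_task_result_sketch(
    result: Any, input_data: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    # Explicit style: a dict must carry both required keys.
    if isinstance(result, dict):
        missing = [key for key in REQUIRED_KEYS if key not in result]
        if missing:
            raise ValueError(f"Task result is missing required keys: {missing}")
        return result
    # Simplified style: wrap any non-dict value (str, int, None, list, ...).
    if input_data is not None:
        return {"input": input_data, "output": result}
    return {"output": result}


print(validate_task_result_sketch("hello"))              # {'output': 'hello'}
print(validate_task_result_sketch(None, {"q": "hi"}))    # {'input': {'q': 'hi'}, 'output': None}
```

Note that in this sketch `None` is treated as a legitimate task output and is wrapped like any other value, which matches the None scenario named in the unit-test bullet.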

Files Changed

 sdks/python/src/opik/api_objects/dataset/evaluation_suite/evaluation_suite.py | 75 ++++++++++++++++------
 sdks/python/tests/library_integration/crewai/requirements_v1.txt              |  1 +
 sdks/python/tests/unit/api_objects/dataset/evaluation_suite/test_evaluation_suite.py | 53 ++++++++++++---
 3 files changed, 98 insertions(+), 31 deletions(-)

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-5655

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code v2.1.81
    • Model(s): Sonnet 4.6
    • Scope: Tests
    • Human verification: Done

Testing

Added related unit tests

Documentation

Updated docstring

…on-dict results and improve test coverage

- Refactored `validate_task_result` to handle non-dict results by wrapping them into a standardized dictionary format.
- Updated method signature to include optional `input_data` for contextual wrapping.
- Extended test cases to verify behavior for various input types (e.g., strings, integers, None, lists).
- Improved docstrings and inline documentation for clarity and consistency with SDK patterns.
@yaricom yaricom requested a review from a team as a code owner April 3, 2026 14:25
@github-actions github-actions bot added the python, tests, and Python SDK labels Apr 3, 2026
…0` for resolving missing `client` attribute in `OpenAICompletion`.
@yaricom yaricom merged commit 6ddc043 into main Apr 7, 2026
137 of 138 checks passed
@yaricom yaricom deleted the yaricom/OPIK-5655-evaluation-suite-task-return branch April 7, 2026 11:05