feat: criteria evals [SDK-349] #5629
base: main
Conversation
```diff
 if not self._hooks_normalised:
     if self.pre_hooks:
-        self.pre_hooks = normalize_hooks(self.pre_hooks)  # type: ignore
+        self.pre_hooks = normalize_hooks(self.pre_hooks, hook_mode="pre")  # type: ignore
```
Let's rather make a `normalize_pre_hooks` and a `normalize_post_hooks`?
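Something like this, for illustration; it assumes the existing `normalize_hooks(hooks, hook_mode=...)` helper stays around as the shared implementation:

```python
# Hypothetical wrappers over the existing normalize_hooks helper.
def normalize_pre_hooks(hooks):
    """Normalize hooks that run before the agent/team executes."""
    return normalize_hooks(hooks, hook_mode="pre")


def normalize_post_hooks(hooks):
    """Normalize hooks that run after the agent/team executes."""
    return normalize_hooks(hooks, hook_mode="post")
```

Call sites then read `self.pre_hooks = normalize_pre_hooks(self.pre_hooks)` without a mode flag.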
```python
class EvalType(str, Enum):
    ACCURACY = "accuracy"
    CRITERIA = "criteria"
```
I know why `criteria` seems like a nice name, but I think we have to be very explicit here and call it AgentAsJudge. I feel like the evals are too obfuscated as it is, and you can't really tell what they do just from the name.
Maybe one to ask the team's opinion on
```python
class CriteriaJudgeResponse(BaseModel):
    """Response schema for the LLM judge."""

    score: int = Field(..., description="Score between 1 and 10 based on the evaluation criteria.")
```
So I read that a PASS/FAIL score is much more effective. Maybe we should have both?
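One hedged option for "both", keeping the Pydantic response model; the `passed` and `reasoning` field names are illustrative, not part of the PR:

```python
from pydantic import BaseModel, Field


class CriteriaJudgeResponse(BaseModel):
    """Judge response carrying both a binary verdict and a numeric score."""

    passed: bool = Field(..., description="PASS/FAIL verdict against the evaluation criteria.")
    score: int = Field(..., description="Score between 1 and 10 based on the evaluation criteria.")
    reasoning: str = Field(..., description="Brief justification for the verdict.")
```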
```python
# Core evaluation fields
criteria: str = ""
threshold: int = 7
on_fail: Optional[Callable[["CriteriaEvaluation"], None]] = None
```
This is cool
```python
# Evaluation metadata
name: Optional[str] = None
eval_id: str = field(default_factory=lambda: str(uuid4()))
num_iterations: int = 1
```
I don't think we need this, which might simplify things a lot. For an agent-as-judge, I don't think it makes sense to run multiple times over the same input.
```python
eval_id: str = field(default_factory=lambda: str(uuid4()))
num_iterations: int = 1
run_id: Optional[str] = None
result: Optional[CriteriaResult] = None
```
Storing a result here makes the evaluation stateful; it should just be returned. Same with `run_id`.
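Roughly this shape; the `_evaluate` helper and the `CriteriaResult` constructor arguments are assumptions for the sketch:

```python
# Sketch: run() returns a fresh result instead of writing self.result / self.run_id.
def run(self, *, input: str, output: str) -> CriteriaResult:
    run_id = str(uuid4())  # generated per call, never stored on the eval object
    evaluation = self._evaluate(input=input, output=output)  # hypothetical internal helper
    return CriteriaResult(run_id=run_id, evaluations=[evaluation])
```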
```python
print_summary: bool = False
print_results: bool = False
```
This is fine for now to keep the same structure, but in the future we could rather have a print() function instead of options like these.
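For example, something in this direction; the method name and the result field names (`eval_id`, `avg_score`, `passed`) are assumed for the sketch:

```python
# Hypothetical print helper instead of print_summary/print_results constructor flags.
def print(self, result: CriteriaResult, summary_only: bool = False) -> None:
    """Render a result on demand rather than as a side effect of run()."""
    if summary_only:
        print(f"eval {result.eval_id}: avg score {result.avg_score}")
    else:
        for e in result.evaluations:
            print(f"score={e.score} passed={e.passed}")
```

Callers would then do `evaluation.print(result, summary_only=True)` only when they actually want output.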
```python
try:
    from agno.models.openai import OpenAIChat

    model = OpenAIChat(id="gpt-4o-mini")
```
Let's start defaulting to `gpt-5-mini`?
```python
return Agent(
    model=model,
    description=f"""\
```
Rather pass a description and then the below as instructions
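A sketch of that split; the description and instruction strings are placeholders, not the PR's actual prompt:

```python
# Sketch: keep the description short and move the judging steps into instructions.
return Agent(
    model=model,
    description="You are an impartial judge that evaluates an agent's output against explicit criteria.",
    instructions=[
        "Read the input, the output, and the evaluation criteria.",
        "Score the output from 1 to 10 against the criteria.",
        "Be objective and thorough in your evaluation.",
    ],
    output_schema=CriteriaJudgeResponse,
)
```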
```python
additional_guidelines = f"\n## Additional Guidelines\n{guidelines_text}\n"

# Format additional context
additional_context = (
```
Not sure this is needed?
```python
Be objective and thorough in your evaluation.
""",
output_schema=CriteriaJudgeResponse,
structured_outputs=True,
```
Not needed
```python
# Trigger on_fail callback if evaluation failed
if not passed and self.on_fail:
    try:
        self.on_fail(evaluation)
```
on_fail could be an async function. We should support both
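A hedged sketch of handling both in the sync path; it assumes no event loop is already running here (the async `arun()` path would just `await` the callback instead):

```python
import asyncio
import inspect
import logging

# Sketch: accept both sync and async on_fail callbacks in the sync run() path.
if not passed and self.on_fail:
    try:
        maybe_awaitable = self.on_fail(evaluation)
        if inspect.isawaitable(maybe_awaitable):
            # Drive the coroutine to completion; assumes no running event loop.
            asyncio.run(maybe_awaitable)
    except Exception as exc:
        # Don't let a bad callback break the evaluation run.
        logging.getLogger(__name__).warning("on_fail callback raised: %s", exc)
```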
```python
)

result: Optional[CriteriaResult] = evaluation.run(
    input="What is the capital of France?",
```
In the examples, let's try to use inputs with no clear "correct answer"; inputs that do have one would ideally be evaluated with the AccuracyEval?
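For instance, an open-ended example along these lines keeps the judgement subjective rather than fact-checkable; the `agent` argument and the criteria text are illustrative, not the PR's API:

```python
# Illustrative open-ended example: "good" is a matter of judgement, not a single fact.
evaluation = CriteriaEvaluation(
    agent=support_agent,  # hypothetical agent being judged
    criteria="The reply is empathetic, concise, and offers a concrete next step.",
)
result: Optional[CriteriaResult] = evaluation.run(
    input="A customer writes: 'I've been charged twice this month and I'm really frustrated.'",
)
```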
```python
from agno.run.team import TeamRunInput, TeamRunOutput


class BaseEval(ABC):
```
Nice. Let's make the other evals extend this too?
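i.e. something in this direction for the existing evals; the run signatures are placeholders, not the current interfaces:

```python
# Sketch: existing evals inherit the shared base instead of duplicating plumbing.
class AccuracyEval(BaseEval):
    def run(self, *, input: str, expected_output: str) -> Optional[AccuracyResult]: ...


class ReliabilityEval(BaseEval):
    def run(self, *, agent_response: RunOutput) -> Optional[ReliabilityResult]: ...
```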
```python
db: Optional[Union[BaseDb, AsyncBaseDb]] = None
telemetry: bool = True

def get_evaluator_agent(self) -> Agent:
```
We should add an evaluator_agent field for users to pass custom agents to use as evaluators. In that case, we simply use theirs and skip our prompts, etc.
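Roughly; the field name and the `_default_evaluator_agent` helper are suggestions, not existing code:

```python
# Sketch: optional user-supplied judge.
evaluator_agent: Optional[Agent] = None

def get_evaluator_agent(self) -> Agent:
    if self.evaluator_agent is not None:
        # Use the caller's agent as-is and skip the built-in prompt construction.
        return self.evaluator_agent
    return self._default_evaluator_agent()  # hypothetical helper wrapping the current behaviour
```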
```python
def pre_check(self, run_input: Union[RunInput, TeamRunInput]) -> None:
    """Perform sync pre-check to validate input before agent runs."""
    input_str = run_input.input_content_string() if run_input else ""
    output_str = "(Input validation - no output yet)"
```
Should we just not pass output here?
In this kind of run we only want to present the input and judge that; we may be better off with a different run method (instead of self.run()) that uses an input-specific prompt.
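e.g. a dedicated method with an input-only prompt rather than reusing self.run() with a placeholder output; the method name, prompt text, and `response.content` handling are assumptions for the sketch:

```python
# Sketch: judge the input on its own, with a prompt that never mentions an output.
def check_input(self, run_input: Union[RunInput, TeamRunInput]) -> CriteriaJudgeResponse:
    input_str = run_input.input_content_string() if run_input else ""
    judge = self.get_evaluator_agent()
    response = judge.run(
        f"Evaluate whether this input meets the criteria:\n\n{input_str}"
    )
    return response.content  # parsed CriteriaJudgeResponse via the agent's output schema
```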
Summary
(If applicable, issue number: #____)
Type of change
Checklist
- Formatting and validation pass (`./scripts/format.sh` and `./scripts/validate.sh`)
Additional Notes
Add any important context (deployment instructions, screenshots, security considerations, etc.)