
Conversation

@harshsinha03 (Contributor)

Summary

  • Custom evaluation criteria with scoring thresholds (1-10 scale)
  • Pre-hook support: Validate input quality before agent execution
  • Post-hook support: Evaluate output quality after agent execution
  • Multiple evaluation iterations per run
  • Additional guidelines and context passed to the judge
  • Database persistence and Agent OS integration
  • Background task support

(If applicable, issue number: #____)

Type of change

  • Bug fix
  • New feature
  • Breaking change
  • Improvement
  • Model update
  • Other:

Checklist

  • Code complies with style guidelines
  • Ran format/validation scripts (./scripts/format.sh and ./scripts/validate.sh)
  • Self-review completed
  • Documentation updated (comments, docstrings)
  • Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
  • Tested in clean environment
  • Tests added/updated (if applicable)

Additional Notes

Add any important context (deployment instructions, screenshots, security considerations, etc.)

@harshsinha03 requested a review from a team as a code owner on December 5, 2025, 13:03
@harshsinha03 changed the title from "feat: criteria evals" to "feat: criteria evals [SDK-349]" on Dec 5, 2025
linear bot commented Dec 5, 2025

```diff
 if not self._hooks_normalised:
     if self.pre_hooks:
-        self.pre_hooks = normalize_hooks(self.pre_hooks)  # type: ignore
+        self.pre_hooks = normalize_hooks(self.pre_hooks, hook_mode="pre")  # type: ignore
```
Contributor: Let's rather make a normalize_pre_hooks and normalize_post_hooks?
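A minimal sketch of that split, assuming the normalize_hooks(hooks, hook_mode=...) signature from the diff above stays as the shared implementation; the wrapper names, HookType alias, and return types are illustrative:

```python
from typing import Callable, List, Union

HookType = Union[Callable, List[Callable]]


def normalize_pre_hooks(hooks: HookType) -> List[Callable]:
    # Thin wrapper over the existing normalize_hooks helper shown in the diff above,
    # assumed to be importable from wherever it currently lives.
    return normalize_hooks(hooks, hook_mode="pre")


def normalize_post_hooks(hooks: HookType) -> List[Callable]:
    return normalize_hooks(hooks, hook_mode="post")
```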


```python
class EvalType(str, Enum):
    ACCURACY = "accuracy"
    CRITERIA = "criteria"
```
Contributor: I know why criteria seems like a nice name, but I think we have to be very explicit here and call it AgentAsJudge. I feel like the evals are too obfuscated as it is, and you can't really know what they do just from the name.

Contributor: Maybe one to ask the team's opinion on.

```python
class CriteriaJudgeResponse(BaseModel):
    """Response schema for the LLM judge."""

    score: int = Field(..., description="Score between 1 and 10 based on the evaluation criteria.")
```
Contributor: So I read that a PASS/FAIL score is much more effective. Maybe we should have both?
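If both were kept, the schema could look something like this sketch; the passed field and its description are illustrative additions, not part of the PR:

```python
from pydantic import BaseModel, Field


class CriteriaJudgeResponse(BaseModel):
    """Response schema for the LLM judge, sketched with both a numeric score and a verdict."""

    score: int = Field(..., description="Score between 1 and 10 based on the evaluation criteria.")
    passed: bool = Field(..., description="PASS/FAIL verdict against the evaluation criteria.")
```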

```python
# Core evaluation fields
criteria: str = ""
threshold: int = 7
on_fail: Optional[Callable[["CriteriaEvaluation"], None]] = None
```
Contributor: This is cool.
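For context, a hedged usage sketch of these fields; the constructor call is inferred from the field names above and is not code from the PR:

```python
def alert_on_failure(evaluation: "CriteriaEvaluation") -> None:
    # Illustrative callback: fired when the judge's score falls below the threshold.
    print(f"Criteria evaluation failed (threshold={evaluation.threshold}): {evaluation.criteria}")


evaluation = CriteriaEvaluation(
    criteria="The response is polite, concise, and makes no unsupported claims.",
    threshold=7,  # scores below 7 on the 1-10 scale count as a failure
    on_fail=alert_on_failure,
)
```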

```python
# Evaluation metadata
name: Optional[str] = None
eval_id: str = field(default_factory=lambda: str(uuid4()))
num_iterations: int = 1
```
Contributor: I don't think we need this, which might simplify things a lot. For an agent-as-judge it doesn't make sense to run multiple times over the same input, I think.

```python
eval_id: str = field(default_factory=lambda: str(uuid4()))
num_iterations: int = 1
run_id: Optional[str] = None
result: Optional[CriteriaResult] = None
```
Contributor: Storing the result here makes the evaluation stateful; it should just be returned. Same with run_id.

Comment on lines +154 to +155:

```python
print_summary: bool = False
print_results: bool = False
```
Contributor: This is fine for now to keep the same structure, but in future we can rather have a print() function instead of options like these.
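A sketch of the print()-style alternative; the method name and the fields assumed on CriteriaResult (score, passed) are illustrative, not the PR's API:

```python
def print_result(self, result: "CriteriaResult") -> None:
    # Hypothetical convenience method replacing the print_summary/print_results flags.
    status = "PASSED" if result.passed else "FAILED"  # CriteriaResult field names are assumed
    print(f"[{status}] score={result.score}/10 (threshold={self.threshold})")
```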

```python
try:
    from agno.models.openai import OpenAIChat

    model = OpenAIChat(id="gpt-4o-mini")
```
Contributor: Let's start defaulting to gpt-5-mini?


```python
return Agent(
    model=model,
    description=f"""\
```
Contributor: Rather pass a description, and then the text below as instructions.
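A sketch of the suggested split, keeping the judge's role in description and moving the grading rules into instructions; the wording of both is illustrative, and only model, output_schema, and the 1-10 scale come from the PR:

```python
return Agent(
    model=model,
    description="You are an impartial judge that scores an agent's output against explicit criteria.",
    instructions=[
        "Score the output from 1 to 10 against the provided evaluation criteria.",
        "Be objective and thorough in your evaluation.",
    ],
    output_schema=CriteriaJudgeResponse,
)
```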

```python
additional_guidelines = f"\n## Additional Guidelines\n{guidelines_text}\n"

# Format additional context
additional_context = (
```
Contributor: Not sure this is needed?

```python
Be objective and thorough in your evaluation.
""",
    output_schema=CriteriaJudgeResponse,
    structured_outputs=True,
```
Contributor: Not needed.

```python
# Trigger on_fail callback if evaluation failed
if not passed and self.on_fail:
    try:
        self.on_fail(evaluation)
```
Contributor: on_fail could be an async function. We should support both.
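One standard-library way to support both, as a sketch; the helper name is hypothetical, and only on_fail and the evaluation argument come from the PR:

```python
import inspect
from typing import Any, Awaitable, Callable, Optional, Union

OnFailCallback = Callable[[Any], Union[None, Awaitable[None]]]


async def _invoke_on_fail(on_fail: Optional[OnFailCallback], evaluation: Any) -> None:
    """Call the user-supplied on_fail callback, awaiting it if it returns an awaitable."""
    if on_fail is None:
        return
    maybe_awaitable = on_fail(evaluation)
    if inspect.isawaitable(maybe_awaitable):
        await maybe_awaitable
```

A synchronous code path could do the equivalent check with inspect.iscoroutinefunction and hand async callbacks to asyncio.run.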

```python
)

result: Optional[CriteriaResult] = evaluation.run(
    input="What is the capital of France?",
```
Contributor: In the examples, let's try to use inputs with no clear "correct answer"; inputs that do have one would ideally be evaluated with the AccuracyEval?
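For instance, the example could use a subjective prompt, something like this sketch (only the input string differs from the PR's example call):

```python
result: Optional[CriteriaResult] = evaluation.run(
    input="Write a short, friendly product description for a reusable water bottle.",
)
```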

```python
from agno.run.team import TeamRunInput, TeamRunOutput


class BaseEval(ABC):
```
Contributor: Nice. Let's make the other evals extend this too?
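Roughly what that extension could look like; the abstract method and the AccuracyEval signature shown here are illustrative, not the library's actual API:

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseEval(ABC):
    """Shared base class for evaluations (shape assumed from the PR's BaseEval)."""

    @abstractmethod
    def run(self, **kwargs: Any) -> Any:
        """Execute the evaluation and return its result."""


class AccuracyEval(BaseEval):  # hypothetical refactor of the existing eval
    def run(self, **kwargs: Any) -> Any:
        ...  # existing accuracy-evaluation logic would live here
```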

```python
db: Optional[Union[BaseDb, AsyncBaseDb]] = None
telemetry: bool = True

def get_evaluator_agent(self) -> Agent:
```
Contributor: We should add an evaluator_agent field for users to pass custom agents to use as evaluators. In that case, we simply use theirs and skip our prompts, etc.
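A sketch of that override; the field name follows the suggestion above, and the fallback branch stands in for the PR's existing judge construction:

```python
from dataclasses import dataclass
from typing import Optional

from agno.agent import Agent


@dataclass
class CriteriaEvaluation:  # sketch: only the new field and the override are shown
    evaluator_agent: Optional[Agent] = None  # user-supplied judge agent, hypothetical field

    def get_evaluator_agent(self) -> Agent:
        # If the user passed their own agent, use it as-is and skip the built-in prompts.
        if self.evaluator_agent is not None:
            return self.evaluator_agent
        raise NotImplementedError("default judge construction elided in this sketch")
```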

```python
def pre_check(self, run_input: Union[RunInput, TeamRunInput]) -> None:
    """Perform sync pre-check to validate input before agent runs."""
    input_str = run_input.input_content_string() if run_input else ""
    output_str = "(Input validation - no output yet)"
```
Contributor: Should we just not pass output?

In this kind of run we just want to present the input and judge that; we may be better off with a different run method for it (instead of self.run()) that uses a specific prompt.
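A sketch of the input-only path being suggested; the prompt text and the _run_judge helper are hypothetical, while pre_check, RunInput, and input_content_string come from the PR:

```python
def pre_check(self, run_input: Union[RunInput, TeamRunInput]) -> None:
    """Sketch: judge the input on its own, with a prompt that never references an output."""
    input_str = run_input.input_content_string() if run_input else ""
    prompt = (
        "Evaluate whether the following input satisfies the criteria. "
        "There is no agent output yet; judge the input only.\n\n"
        f"## Criteria\n{self.criteria}\n\n## Input\n{input_str}"
    )
    self._run_judge(prompt)  # hypothetical helper that invokes the evaluator agent on this prompt
```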
